Oswald Lanz (FBK)
Jul 25, 2019 – 11:00 AM
DIISM, Artificial Intelligence laboratory (room 201), Siena SI
In 2015, the first artificial system was reported to beat human performance on ImageNet visual recognition, with a Top-5 error rate below 5%. This has not yet happened with video: for example, the best-ranked entry in the EPIC-Kitchens Action Recognition leaderboard achieves 45.95% Top-5 recognition accuracy. This gap can be attributed to the increased difficulty of learning the more complex spatiotemporal patterns in videos from weak supervision with limited data. In this talk, I will focus on deep architectures for video representation learning in this context. I will present the key ideas behind LSTA and HF-Nets, which realize spatiotemporal aggregation of video frame sequences from complementary perspectives. LSTA extends LSTM with built-in attention and a novel output gating; it learns a smooth tracking of discriminative features for late aggregation of frame-level features, providing a +22% accuracy gain over an LSTM baseline on the GTEA-61 dataset. By contrast, HF-Nets perform deep hierarchical aggregation to develop spatiotemporal features early, thereby boosting the recognition accuracy of the popular TSN from 17% to 41% on the 20BN-something dataset while adding almost no overhead. We participated with variants of these models in the CVPR 2019 EPIC-Kitchens Challenge, and I will conclude the talk with an overview of our submission.
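To make the aggregation contrast concrete, here is a minimal NumPy sketch of two ways to pool per-frame features into a clip descriptor: plain averaging (TSN-style late fusion) versus an attention-gated recurrent memory in the spirit of LSTA. All names, the random features, and the sigmoid gate are illustrative assumptions, not the actual architectures discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16                           # frames per clip, feature dimension
frames = rng.standard_normal((T, D))   # stand-in for per-frame CNN features

# 1) TSN-style late fusion: a plain average over frames.
avg_pool = frames.mean(axis=0)

# 2) Attention-gated recurrent aggregation (toy): a running memory that
#    is updated more strongly for frames scoring high against an
#    attention vector w (random here; learned in a real model).
w = rng.standard_normal(D)
memory = np.zeros(D)
for x in frames:
    a = 1.0 / (1.0 + np.exp(-(w @ x)))   # sigmoid attention score in (0, 1)
    memory = (1.0 - a) * memory + a * x  # gated update, LSTM-like

# Both produce a single D-dimensional clip descriptor.
print(avg_pool.shape, memory.shape)
```

The gated variant lets frames the attention deems discriminative dominate the final descriptor, whereas average pooling weights all frames equally.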