WACV 2019 Tutorial on

Video CNNs for activity recognition


Tutorial Date

Jan 7th, 2019 (Location: TBD)


Michael Ryoo

(Google Brain / Indiana Univ.)

Chen Sun

(Google)

In recent years, the field of human activity recognition has grown dramatically, reflecting its importance in many high-impact societal applications, including robot perception, online video search and retrieval, and smart homes/offices/cities. With the initial success of convolutional neural network (CNN) models in learning video representations for classification, the field is gradually moving toward detecting and forecasting more complex human activities in continuous videos across various realistic scenarios.

This tutorial will review space-time CNN models designed for video understanding. These include not only standard two-stream CNNs and 3-D spatio-temporal XYT CNNs for segmented, fixed-size videos, but also recent state-of-the-art models that capture longer-term temporal information in continuous videos. The use of spatio-temporal pooling, recurrent networks, temporal attention, and various convolutional layers and models will be covered. In addition, we will discuss recent progress in space-time neural architecture ‘evolution’ for video understanding.

(tentative) SCHEDULE

0900        Introduction

0920        Representing video segments - 3D XYT conv layers, Chen Sun (Google)

0950        Representing longer videos - temporal pooling/attention/conv, Michael Ryoo (Google / IU)

1020        Coffee Break

1040        Spatio-temporal activity localization, Chen Sun (Google)

1110        Neural architecture search/evolution for videos, Michael Ryoo (Google / IU)

1140        Closing - Beyond classification/detection