WACV 2019 Tutorial on

Video CNNs for activity recognition

DATE

Jan 7th, 2019 (Location: TBD)

TUTORIAL ORGANIZERS

Michael Ryoo

(Google Brain / Indiana Univ.)

Chen Sun

(Google Research)

INTRODUCTION

In recent years, the field of human activity recognition has grown dramatically, reflecting its importance in many high-impact societal applications including robot perception, online video search and retrieval, and smart homes/offices/cities. With the initial success of convolutional neural network (CNN) models in learning video representations for classification, the field is gradually moving towards detecting and forecasting more complex human activities in continuous videos for various realistic scenarios.

This tutorial will review space-time CNN models designed for video understanding. These include not only standard two-stream CNNs and 3D spatio-temporal XYT CNNs for segmented, fixed-size videos, but also recent state-of-the-art models that capture longer temporal information in continuous videos. The use of spatio-temporal pooling, recurrent networks, temporal attention, and various convolutional layers and models will be covered. In addition, we will discuss recent progress in space-time neural architecture ‘evolution’ for video understanding.
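To make the 3D XYT convolution idea concrete ahead of the tutorial, below is a minimal sketch of a clip-level classifier, assuming PyTorch; TinyXYTNet and all layer sizes are illustrative placeholders, not models presented in the tutorial itself.

    # A minimal sketch of a 3D spatio-temporal (XYT) CNN for video clip
    # classification. Assumes PyTorch; names and sizes are illustrative.
    import torch
    import torch.nn as nn

    class TinyXYTNet(nn.Module):
        """Stacks 3D convolutions over (time, height, width), then pools and classifies."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=(3, 3, 3), padding=1),  # convolve jointly over T, H, W
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),                 # pool spatially, keep temporal resolution
                nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),                             # global spatio-temporal average pooling
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):
            # x: (batch, channels, frames, height, width), e.g. a 16-frame RGB clip
            h = self.features(x).flatten(1)
            return self.classifier(h)

    clip = torch.randn(2, 3, 16, 112, 112)    # two 16-frame 112x112 RGB clips
    logits = TinyXYTNet(num_classes=10)(clip)
    print(logits.shape)                       # torch.Size([2, 10])

The key design choice, unlike a 2D image CNN applied per frame, is that each kernel spans the temporal axis as well, so motion patterns are learned directly; the global pooling at the end is one simple way to summarize a fixed-size segment, which the tutorial contrasts with pooling, attention, and recurrent alternatives for longer videos.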

SCHEDULE (tentative)

0900        Introduction

0920        Representing video segments - 3D XYT conv layers, Chen Sun (Google)

0950        Representing longer videos - temporal pooling/attention/conv, Michael Ryoo (Google / IU)

1020        Coffee Break

1040        Spatio-temporal activity localization, Chen Sun (Google)

1110        Neural architecture search/evolution for videos, Michael Ryoo (Google / IU)

1140        Closing - Beyond classification/detection