JPL First-Person Interaction dataset

Introduction

JPL First-Person Interaction dataset (JPL-Interaction dataset) is composed of human activity videos taken from a first-person viewpoint. The dataset particularly aims to provide first-person videos of interaction-level activities, recording how things visually look from the perspective (i.e., viewpoint) of a person/robot participating in such physical interactions. This dataset was first introduced in the CVPR 2013 paper, "First-Person Activity Recognition: What Are They Doing to Me?" [1].

Dataset

This first-person dataset contains videos of interactions between humans and the observer. We attached a GoPro2 camera to the head of our humanoid model and asked human participants to interact with the humanoid by performing activities. In order to emulate the mobility of a real robot, we also placed wheels below the humanoid and had an operator move it by pushing it from behind.

There are 7 different types of activities in the dataset, including 4 positive (i.e., friendly) interactions with the observer, 1 neutral interaction, and 2 negative (i.e., hostile) interactions. 'Shaking hands with the observer', 'hugging the observer', 'petting the observer', and 'waving a hand to the observer' are the four friendly interactions. The neutral interaction is the situation where two persons have a conversation about the observer while occasionally pointing at it. 'Punching the observer' and 'throwing objects at the observer' are the two negative interactions. Videos were recorded continuously during the human activities, and each video sequence contains 0 to 3 activities. The videos have a resolution of 320x240 and were recorded at 30 fps.
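
For reference, the following is a minimal Python sketch of how one might enumerate the seven activity classes and inspect a downloaded video with OpenCV. The numeric class IDs and the file name used here are assumptions for illustration only; please consult the ground truth labels below for the actual class-ID mapping.

# Minimal sketch: the class names follow the description above, but the
# numeric IDs and the file name are assumptions, not part of the dataset.
import cv2

ACTIVITIES = {
    1: "shake hands",     # positive
    2: "hug",             # positive
    3: "pet",             # positive
    4: "wave",            # positive
    5: "point-converse",  # neutral
    6: "punch",           # negative
    7: "throw object",    # negative
}

cap = cv2.VideoCapture("interaction_video_01.avi")  # hypothetical file name
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
print("%dx%d @ %.0f fps" % (width, height, fps))  # expected: 320x240 @ 30 fps
cap.release()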

Download

Continuous videos: videos_part1 videos_part2
This zip file contains a set of continuous activity videos used in [1].

Segmented videos: segmented
The segmented version of the dataset, where each video is temporally segmented to contain a single activity, is also provided here. These segmented videos were used for the 'classification' experiments in [1], and were also used as positive training examples for the 'detection' experiments.

Ground truth labels: labels
Labels describing the time intervals of the occurring activities and their class IDs. For more details about the experimental setting and results, please refer to [1].
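
As an illustration of how these interval labels can be used together with the continuous videos, the sketch below cuts a continuous video into per-activity clips. The in-memory tuple format (start_frame, end_frame, class_id) and the file names are assumptions for illustration; parse the released label files into this form according to their actual encoding.

# Minimal sketch: cutting per-activity clips from a continuous video using
# (start_frame, end_frame, class_id) interval labels.  The tuple format and
# file names are assumptions made for illustration only.
import cv2

def extract_clips(video_path, intervals, out_prefix="clip"):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"XVID")
    for i, (start, end, class_id) in enumerate(intervals):
        cap.set(cv2.CAP_PROP_POS_FRAMES, start)
        out = cv2.VideoWriter("%s_%d_class%d.avi" % (out_prefix, i, class_id),
                              fourcc, fps, (w, h))
        for _ in range(start, end + 1):
            ok, frame = cap.read()
            if not ok:
                break
            out.write(frame)
        out.release()
    cap.release()

# Example with illustrative values only: one activity spanning frames 120-210.
extract_clips("interaction_video_01.avi", [(120, 210, 6)])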

Citation

If you make use of the JPL-Interaction dataset in any form, please do cite the following paper:

[1] M. S. Ryoo and L. Matthies, "First-Person Activity Recognition: What Are They Doing to Me?", CVPR 2013.

@inproceedings{ryoo2013first,
      title={First-Person Activity Recognition: What Are They Doing to Me?},
      author={M. S. Ryoo and L. Matthies},
      booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2013},
      month={June},
      address={Portland, OR},
}



JPL First-Person Interaction extended dataset

The videos in the JPL-Interaction dataset described above have only 1~2 activities per video on average, and are not appropriate for evaluating the 'early recognition (i.e., activity prediction)' capabilities of systems in real-world scenarios where multiple correlated interactions occur in a sequence with context. Activity prediction is the problem of identifying ongoing/future activities from streaming videos before the activities are fully executed. In order to support research on activity prediction from first-person videos [2], we introduced this extended version, where each video contains more natural executions of 2~6 activities in a sequence.
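
To make the early-recognition setting concrete, the sketch below shows the generic partial-observation evaluation protocol: a classifier is queried after observing only the first fraction of each activity, and accuracy is reported per observation ratio. The extract_features function and classifier object are hypothetical placeholders; this is not the method of [2], only the evaluation setup it addresses.

# Minimal sketch of partial-observation evaluation for activity prediction.
# `extract_features` and `classifier` are hypothetical placeholders.
def prediction_accuracy(videos, labels, classifier, extract_features,
                        ratios=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """For each observation ratio, classify each activity using only the
    frames observed so far, and report the accuracy at that ratio."""
    results = {}
    for r in ratios:
        correct = 0
        for frames, y in zip(videos, labels):
            n_obs = max(1, int(len(frames) * r))      # frames seen so far
            feats = extract_features(frames[:n_obs])  # features of the partial video
            correct += int(classifier.predict([feats])[0] == y)
        results[r] = correct / len(videos)
    return results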

Download

JPL-Interaction-e videos: videos_part1 videos_part2
This zip file contains a set of continuous activity videos used in [2].

Pose data: hoj3d_pose
This zip file contains HOJ3D features computed from Kinect pose estimates.
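
For readers unfamiliar with the descriptor, HOJ3D-style features bin the direction of each 3D joint, relative to a reference joint such as the hip center, into a spherical azimuth/altitude histogram. The sketch below is a simplified conceptual illustration only; the exact feature format contained in the zip file may differ.

# Simplified HOJ3D-style descriptor: a spherical azimuth/altitude histogram of
# 3D joint directions relative to a reference joint.  Conceptual sketch only;
# the exact format of the released features may differ.
import numpy as np

def hoj3d_like(joints, ref_index=0, n_azimuth=12, n_altitude=6):
    """joints: (J, 3) array of 3D joint positions from the Kinect skeleton."""
    rel = joints - joints[ref_index]          # center on the reference joint
    rel = np.delete(rel, ref_index, axis=0)   # drop the reference itself
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])                            # [-pi, pi]
    altitude = np.arctan2(rel[:, 2], np.linalg.norm(rel[:, :2], axis=1))  # [-pi/2, pi/2]
    hist, _, _ = np.histogram2d(
        azimuth, altitude,
        bins=[n_azimuth, n_altitude],
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]],
    )
    return hist.flatten() / max(len(rel), 1)  # normalized histogram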

Ground truth labels: labels
Labels describing the time intervals of the occurring activities and their class IDs. For more details about the experimental setting and results, please refer to [2].

Citation

If you make use of the JPL-Interaction extended dataset in any form, please do cite the following paper:

[2] M. S. Ryoo, T. J. Fuchs, L. Xia, J. K. Aggarwal, and L. Matthies, "Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me?", ACM/IEEE International Conference on Human-Robot Interaction (HRI), March 2015 (full paper).

@inproceedings{ryoo2015prediction,
      title={Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me?},
      author={M. S. Ryoo and T. J. Fuchs and L. Xia and J. K. Aggarwal and L. Matthies},
      booktitle={ACM/IEEE International Conference on Human-Robot Interaction (HRI)},
      year={2015},
      month={March},
      address={Portland, OR},
      pages={295--302},
}

Michael S. Ryoo - mryoo@indiana.edu

Updated 07/27/2015