Towards Privacy-Preserving Human Activity Recognition
Although extensive research on action recognition has been carried out using standard video cameras, little work has explored recognition performance at extremely low temporal or spatial camera resolutions. Reliable action recognition in such a "degraded" environment would promote the development of privacy-preserving smart rooms that facilitate intelligent interaction with their occupants while mitigating privacy concerns. This work explores the trade-off between action recognition performance, number of cameras, and temporal and spatial resolution in a smart-room environment. As it is impractical to build a physical platform to test every combination of camera positions and resolutions, we use a graphics engine (Unity3D) to simulate a room with various avatars animated using motions captured from real subjects with a Kinect v2 sensor. We study the performance impact of spatial resolutions from a single pixel up to 10×10 pixels, of temporal resolutions from 2 Hz up to 30 Hz, and of using up to 5 ceiling cameras. We found that reliable recognition of smart-room-centric gestures can still occur in environments with extremely low temporal and spatial resolutions. Using five single-pixel cameras at 30 Hz, we achieved a correct classification rate (CCR) of 75.70% across 9 actions, only 13.9% lower than the CCR for the same camera setup at 10×10 pixels. We also found that spatial resolution has the highest impact on action recognition performance, followed by the number of cameras and then temporal resolution (frame rate).
Below left: the simulated room inside Unity3D. Five grayscale cameras are placed inside the room, roughly at the vertices of a pentagon. The room is lit by an omnidirectional light source above the ceiling. On the right: the 8 avatars (3 female, 5 male) used for this dataset.
The images below are examples of extremely low-resolution data. The action performed is Raising Hand, captured with sensor #1.
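Such extremely low-resolution data can be emulated from an ordinary gray-scale video by block-averaging each frame down to the target pixel grid and subsampling in time. The sketch below is illustrative only (the function name, frame sizes, and rates are our assumptions, not the actual simulation pipeline, which renders the low-resolution views directly in Unity3D):

```python
import numpy as np

def downsample_sequence(frames, out_size=(10, 10), in_hz=30, out_hz=2):
    """Emulate a low-resolution camera from a (T, H, W) gray-scale video.

    Spatial: block-average each frame down to `out_size` pixels
    (H, W assumed divisible by the target size).
    Temporal: keep every (in_hz // out_hz)-th frame.
    """
    step = in_hz // out_hz              # temporal subsampling factor
    frames = frames[::step]             # e.g. 30 Hz -> 2 Hz
    t, h, w = frames.shape
    oh, ow = out_size
    # Each output pixel is the mean of an (h//oh) x (w//ow) block.
    blocks = frames.reshape(t, oh, h // oh, ow, w // ow)
    return blocks.mean(axis=(2, 4))
```

Setting `out_size=(1, 1)` yields the single-pixel-camera case studied in the paper, where each camera contributes just one intensity value per frame.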
In order to evaluate action recognition performance at such extremely low resolutions, we developed a simple pixel-wise, time-series-based algorithm. The use of pixel-wise time series is motivated by the lack of reliable estimation algorithms, at the extremely low resolutions that we study, for common features such as optical flow or spatio-temporal interest points.
At a high level, our approach is based on extracting, from a gray-scale video sequence, a spatially-aligned and gray-scale-normalized rectangular spatial-temporal video cuboid which tightly encompasses the avatar’s silhouette tunnel (i.e., a sequence of silhouettes). This accounts for some of the global spatial misalignment between different action instances, and reduces inter-avatar appearance variability (e.g., due to clothing).
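The cuboid extraction described above can be sketched as follows. This is a minimal illustration under our own assumptions (boolean silhouette masks are given, and normalization is zero-mean/unit-variance over the whole cuboid); it is not the exact implementation from the paper:

```python
import numpy as np

def silhouette_cuboid(frames, masks):
    """Extract a spatial-temporal cuboid tightly enclosing the
    silhouette tunnel, then gray-scale-normalize it.

    frames: (T, H, W) gray-scale video; masks: (T, H, W) boolean silhouettes.
    """
    # Tight bounding box over the union of all silhouettes (the "tunnel").
    rows = np.any(masks, axis=(0, 2))
    cols = np.any(masks, axis=(0, 1))
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    cuboid = frames[:, r0:r1 + 1, c0:c1 + 1].astype(float)
    # Normalize gray levels to reduce inter-avatar appearance
    # variability (e.g. due to clothing).
    return (cuboid - cuboid.mean()) / (cuboid.std() + 1e-8)

def pixel_time_series(cuboid):
    """Flatten the cuboid into one time series per pixel: (h'*w', T)."""
    t = cuboid.shape[0]
    return cuboid.reshape(t, -1).T
```

Cropping to a single bounding box shared by all frames (rather than per-frame boxes) is what provides the global spatial alignment between action instances.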
Average CCR for various spatial resolutions.
Average CCR for various temporal resolutions.
Average CCR for various camera counts.
More information can be found at http://vip.bu.edu/projects/vsns/privacy-smartroom/
J. Dai, J. Wu, B. Saghafi, J. Konrad, and P. Ishwar, "Towards privacy-preserving activity recognition using extremely low resolution temporal and spatial cameras," in IEEE Computer Society Workshop on Analysis and Modeling of Faces and Gestures at CVPR15, June 2015. (Oral presentation) [pdf]
J. Dai, B. Saghafi, J. Wu, J. Konrad, and P. Ishwar, "Towards privacy-preserving recognition of human activities," in Proc. IEEE International Conference on Image Processing, June 2015. [pdf]