Abstract
We consider scenarios with zero instances of real pedestrian data (e.g., a newly installed surveillance system in a novel location for which no labeled or unlabeled real data yet exists), where a pedestrian detector must be developed before any pedestrian has been observed. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, our approach infers and generates a large variety of geometrically and photometrically accurate images of synthetic pedestrians, along with perfectly accurate ground-truth labels, using a computer graphics rendering engine. We first present an efficient discriminative learning method that takes these synthetic renders and produces a spatially-varying, geometry-preserving pedestrian appearance classifier customized for every possible location in the scene. To extend our approach to multi-task learning for further analysis (i.e., estimating pose and segmentation of pedestrians in addition to detection), we build a more general model that employs a fully convolutional neural network architecture for multi-task learning, leveraging the "free" ground-truth annotations obtained from our pedestrian synthesizer. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy provides an excellent solution for an array of human activity analysis tasks, including detection, pose estimation, and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators trained on real data, and (3) surprisingly, using purely synthetic data, is able to outperform models trained on real scene-specific data when that data is limited.
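The geometric placement of synthetic pedestrians from camera parameters and scene layout can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple pinhole camera, and all numbers (focal length, camera height, pedestrian height and position) are hypothetical. Given a ground-plane location, the foot and head points of a pedestrian of assumed height are projected into the image, which yields both the placement of the render and its ground-truth bounding box.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixel coordinates (pinhole model)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

# Hypothetical camera intrinsics: 800 px focal length, principal point (640, 360).
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
# Hypothetical extrinsics: camera 5 m above the ground plane (world z up),
# looking horizontally along world +y.
R = np.array([[1.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
C = np.array([0.0, 0.0, 5.0])   # camera center in world coordinates
t = -R @ C

# A pedestrian of assumed height 1.7 m standing 20 m away on the ground plane:
foot = project(K, R, t, np.array([0.0, 20.0, 0.0]))   # bottom of bounding box
head = project(K, R, t, np.array([0.0, 20.0, 1.7]))   # top of bounding box
box_height_px = foot[1] - head[1]
```

Because the synthetic pedestrian is composited to span exactly this projected head-to-foot extent, the bounding box (and, with a full rendering engine, the pose and segmentation masks) comes for free, with no manual annotation.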
Notes
The connectivity graph considered in this paper is a Markov Random Field over all regions, excluding the regions defined as walls and obstacles.
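A minimal sketch of such a connectivity graph, under hypothetical assumptions (a discretized ground plane with 4-connected regions; the paper's actual region definition may differ): walkable cells become nodes, wall/obstacle cells are excluded entirely.

```python
def region_graph(grid):
    """4-connected adjacency over walkable grid cells.

    grid[r][c] == 0 marks a walkable region; 1 marks a wall or obstacle,
    which is excluded from the graph entirely. Returns the set of
    undirected edges as ((r, c), (nr, nc)) pairs.
    """
    edges = set()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0:
                continue  # walls and obstacles contribute no nodes or edges
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols and grid[nr][nc] == 0:
                    edges.add(((r, c), (nr, nc)))
    return edges
```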
Additional information
Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.
Cite this article
Hattori, H., Lee, N., Boddeti, V.N. et al. Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance. Int J Comput Vis 126, 1027–1044 (2018). https://doi.org/10.1007/s11263-018-1077-3