Abstract
We consider scenarios with zero instances of real pedestrian data (e.g., a newly installed surveillance system in a novel location for which no labeled or unlabeled real data yet exists), where a pedestrian detector must be developed before any pedestrian has been observed. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, our approach infers and generates a large variety of geometrically and photometrically accurate images of synthetic pedestrians, along with perfectly accurate ground-truth labels, using a computer graphics rendering engine. We first present an efficient discriminative learning method that takes these synthetic renders and produces a spatially-varying, geometry-preserving pedestrian appearance classifier customized for every possible location in the scene. To extend our approach to multi-task learning for further analysis (i.e., estimating pose and segmentation of pedestrians in addition to detection), we build a more general model that employs a fully convolutional neural network architecture for multi-task learning, leveraging the "free" ground-truth annotations obtained from our pedestrian synthesizer. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy provides an excellent solution for an array of human activity analysis tasks, including detection, pose estimation, and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators trained on real data, and (3) surprisingly, using purely synthetic data, is able to outperform models trained on real scene-specific data when that data is limited.
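The geometric placement of synthetic pedestrians from camera parameters and scene layout can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple pinhole camera, and all numbers (focal length, camera height, pedestrian height and position) are hypothetical. Given a ground-plane location, the foot and head points of a pedestrian of assumed height are projected into the image, which yields both the placement of the render and its ground-truth bounding box.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixel coordinates (pinhole model)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

# Hypothetical camera intrinsics: 800 px focal length, principal point (640, 360).
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
# Hypothetical extrinsics: camera 5 m above the ground plane (world z up),
# looking horizontally along world +y.
R = np.array([[1.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
C = np.array([0.0, 0.0, 5.0])   # camera center in world coordinates
t = -R @ C

# A pedestrian of assumed height 1.7 m standing 20 m away on the ground plane:
foot = project(K, R, t, np.array([0.0, 20.0, 0.0]))   # bottom of bounding box
head = project(K, R, t, np.array([0.0, 20.0, 1.7]))   # top of bounding box
box_height_px = foot[1] - head[1]
```

Because the synthetic pedestrian is composited to span exactly this projected head-to-foot extent, the bounding box (and, with a full rendering engine, the pose and segmentation masks) comes for free, with no manual annotation.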
Notes
The connectivity graph considered in this paper is a Markov Random Field over all regions, excluding the regions defined as walls and obstacles.
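A minimal sketch of such a connectivity graph, under hypothetical assumptions (a discretized ground plane with 4-connected regions; the paper's actual region definition may differ): walkable cells become nodes, wall/obstacle cells are excluded entirely.

```python
def region_graph(grid):
    """4-connected adjacency over walkable grid cells.

    grid[r][c] == 0 marks a walkable region; 1 marks a wall or obstacle,
    which is excluded from the graph entirely. Returns the set of
    undirected edges as ((r, c), (nr, nc)) pairs.
    """
    edges = set()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0:
                continue  # walls and obstacles contribute no nodes or edges
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols and grid[nr][nc] == 0:
                    edges.add(((r, c), (nr, nc)))
    return edges
```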
Additional information
Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.
Cite this article
Hattori, H., Lee, N., Boddeti, V.N. et al. Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance. Int J Comput Vis 126, 1027–1044 (2018). https://doi.org/10.1007/s11263-018-1077-3