DOI: 10.1145/3480001.3480018
research-article

PointerNet: Spatiotemporal Modeling for Crowd Counting in Videos

Published: 12 November 2021

Abstract

Existing deep-learning methods for video crowd counting focus mainly on how to leverage temporal correlation to improve the model. Studies have shown that convolutional neural networks with spatiotemporal three-dimensional kernels (3D CNNs) are promising architectures for video crowd counting. However, existing 3D CNN-based methods are difficult to make as deep as 2D CNNs, owing to their large number of parameters and the scarcity of labeled data; both lead to overfitting and to unsatisfactory counting performance. To address this issue, we propose a novel end-to-end video crowd counting framework named PointerNet (PseudO-3D (P3D) CNNs INtegrated with Temporal channEl-awaRe (TCA) block). Using P3D kernels gives our framework greater structural diversity and allows it to go deeper while keeping computational cost and memory demand limited. In addition, a temporal channel-aware (TCA) block is integrated into our architecture to help exploit the temporal interdependencies among video sequences. Experiments on three benchmark datasets indicate that the proposed method delivers state-of-the-art performance.
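
The parameter savings behind the pseudo-3D idea can be illustrated with simple arithmetic. The sketch below is not taken from the paper: it assumes the common P3D factorization in which a full 3×3×3 convolution is replaced by a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution, and it assumes 64 input and output channels with no bias terms. The function names are hypothetical.

```python
def conv_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a bias-free convolution with a kt x kh x kw kernel."""
    return c_in * c_out * kt * kh * kw


def full_3d_params(c_in, c_out, k=3):
    """Weights in one full k x k x k spatiotemporal convolution."""
    return conv_params(c_in, c_out, k, k, k)


def pseudo_3d_params(c_in, c_out, k=3):
    """Weights in a 1 x k x k spatial conv followed by a k x 1 x 1 temporal conv.

    Assumption: the temporal convolution maps c_out channels to c_out channels.
    """
    spatial = conv_params(c_in, c_out, 1, k, k)
    temporal = conv_params(c_out, c_out, k, 1, 1)
    return spatial + temporal


channels = 64  # assumed width, chosen only for illustration
full = full_3d_params(channels, channels)
pseudo = pseudo_3d_params(channels, channels)
print(f"full 3D: {full} weights, pseudo-3D: {pseudo} weights")
print(f"ratio: {pseudo / full:.3f}")
```

Under these assumptions the factorized pair needs 12 weights per channel pair instead of 27, which is 4/9 of the full 3D kernel. The actual layer widths and block variants used in PointerNet may differ.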

Published In

ICDLT '21: Proceedings of the 2021 5th International Conference on Deep Learning Technologies
July 2021
131 pages
ISBN: 9781450390163
DOI: 10.1145/3480001

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

1. Pseudo-3D CNNs
2. Temporal Channel-Aware Block
3. Video Crowd Counting

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICDLT 2021