A Review of Video Object Detection: Datasets, Metrics and Methods
Figure 1. The precision/recall curve.
Figure 2. Categories of video object detection methods.
Figure 3. DFF (deep feature flow) framework [28].
Figure 4. FGFA (flow-guided feature aggregation) framework [22].
Figure 5. Small and large feature extractors in [70].
Figure 6. Temporal Conv LSTM architecture. State S<sub>t-1</sub> and output H<sub>t-1</sub> are retrieved from the memory. The forget gate, input gate and output gate each apply 3 × 3 convolutions followed by an activation function.
Figure 7. PSLA (progressive sparse local attention) framework [76].
Figure 8. CaTDet framework [78].
Figure 9. D or T framework [80].
Figure 10. STSN (spatiotemporal sampling networks) framework [83].
Figure 11. Timeline of video object detection methods.
Figure 12. Video object detection methods sorted in different groups.
Abstract
1. Introduction
2. Datasets and Evaluation Metrics
2.1. Datasets
2.2. Evaluation Metrics
- (1) The Precision/Recall curve is obtained first. For each Recall level r, the Precision is set to the maximum Precision achieved at any Recall r′ ≥ r.
- (2) The area under the Precision/Recall curve is taken as the Average Precision (AP). The mean of the AP values over all categories is the mAP.
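The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code of any benchmark: the function names are hypothetical, step (1) is read in the usual way (the maximum is taken over recall values greater than or equal to r), and the number of ground-truth objects (`num_gt=3`) is inferred from the recall denominators in the five-detection example tabulated in this article.

```python
from statistics import mean


def precision_recall_points(scores, is_tp, num_gt):
    """Rank detections by confidence; return (precision, recall) after each one."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, tp = [], 0
    for rank, i in enumerate(order, start=1):
        tp += int(is_tp[i])
        points.append((tp / rank, tp / num_gt))
    return points


def average_precision(scores, is_tp, num_gt):
    """Area under the interpolated Precision/Recall curve."""
    points = precision_recall_points(scores, is_tp, num_gt)
    ap, prev_recall = 0.0, 0.0
    for _, r in points:
        # Step (1): maximum precision over all points whose recall is >= r.
        p_interp = max(p for p, r2 in points if r2 >= r)
        # Step (2): add the rectangle between the previous and current recall.
        ap += (r - prev_recall) * p_interp
        prev_recall = r
    return ap


def mean_average_precision(ap_per_category):
    """mAP: the mean of the per-category AP values."""
    return mean(ap_per_category.values())


# Five detections from the worked example (assumed: 3 ground-truth objects).
scores = [0.92, 0.83, 0.85, 0.75, 0.72]
is_tp = [True, True, False, True, False]
print(f"AP = {average_precision(scores, is_tp, num_gt=3):.4f}")  # AP = 0.8333
```

For this example the interpolated precision is 1 up to recall 1/3 and 3/4 from there to recall 1, so the area is 1/3 + 2/3 × 3/4 = 5/6.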
3. Video Object Detection Methods
3.1. Flow-Based
3.2. LSTM-Based
3.3. Attention-Related
3.4. Tracking-Based
3.5. Other Methods
4. Comparison of Video Object Detection Methods
5. Future Trends
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Bateni, S.; Wang, Z.; Zhu, Y.; Hu, Y.; Liu, C. Co-Optimizing Performance and Memory Footprint Via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform. In Proceedings of the 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Sydney, Australia, 21–24 April 2020.
- Lu, J.; Tang, S.; Wang, J.; Zhu, H.; Wang, Y. A Review on Object Detection Based on Deep Convolutional Neural Networks for Autonomous Driving. In Proceedings of the 2019 Chinese Control and Decision Conference (CCDC), Nanchang, China, 3–5 June 2019.
- Wei, H.; Laszewski, M.; Kehtarnavaz, N. Deep Learning-Based Person Detection and Classification for Far Field Video Surveillance. In Proceedings of the 2018 IEEE 13th Dallas Circuits and Systems Conference, Dallas, TX, USA, 12 November 2018.
- Guillermo, M.; Tobias, R.R.; De Jesus, L.C.; Billones, R.K.; Sybingco, E.; Dadios, E.P.; Fillone, A. Detection and Classification of Public Security Threats in the Philippines Using Neural Networks. In Proceedings of the 2020 IEEE 2nd Global Conference on Life Sciences and Technologies (LifeTech), Kyoto, Japan, 10–12 March 2020; pp. 1–4.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
- Shen, Z.; Liu, Z.; Li, J.; Jiang, Y.-G.; Chen, Y.; Xue, X. DSOD: Learning Deeply Supervised Object Detectors from Scratch. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1937–1945.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
- Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 9259–9266.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision – ECCV 2016; Part I; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Computer Vision – ECCV 2014; Part III; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 346–361.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016.
- Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
- Wei, H.; Kehtarnavaz, N. Semi-Supervised Faster RCNN-Based Person Detection and Load Classification for Far Field Video Surveillance. Mach. Learn. Knowl. Extr. 2019, 1, 756–767.
- Zhu, X.; Wang, Y.; Dai, J.; Yuan, L.; Wei, Y. Flow-Guided Feature Aggregation for Video Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 408–417.
- Zhang, R.; Miao, Z.; Zhang, Q.; Hao, S.; Wang, S. Video Object Detection by Aggregating Features across Adjacent Frames. In Proceedings of the 2019 3rd International Conference on Machine Vision and Information Technology, Guangzhou, China, 22–24 February 2019.
- Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Wang, X.; et al. T-CNN: Tubelets With Convolutional Neural Networks for Object Detection from Videos. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2896–2907.
- Chen, Y.; Cao, Y.; Hu, H.; Wang, L. Memory Enhanced Global-Local Aggregation for Video Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020.
- Yang, W.; Liu, B.; Li, W.; Yu, N. Tracking Assisted Faster Video Object Detection. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo, Shanghai, China, 8–12 July 2019; pp. 1750–1755.
- Zhu, X.; Dai, J.; Zhu, X.; Wei, Y.; Yuan, L. Towards High Performance Video Object Detection for Mobiles. arXiv 2018, arXiv:1804.05830.
- Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; Wei, Y. Deep Feature Flow for Video Recognition. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4141–4150.
- Horn, B.K.; Schunck, B.G. Determining Optical Flow. Artif. Intell. 1981, 17, 185–203.
- Nguyen, H.T.; Worring, M.; Dev, A. Detection of moving objects in video using a robust motion similarity measure. IEEE Trans. Image Process. 2000, 9, 137–141.
- Carminati, L.; Benois-Pineau, J. Gaussian mixture classification for moving object detection in video surveillance environment. In Proceedings of the 2005 International Conference on Image Processing, Genova, Italy, 11–14 September 2005; pp. 3361–3364.
- Jayabalan, E.; Krishnan, A. Object Detection and Tracking in Videos Using Snake and Optical Flow Approach. In Computer Networks and Information Technologies; Das, V.V., Stephen, J., Chaba, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; p. 299.
- Jayabalan, E.; Krishnan, A. Detection and Tracking of Moving Object in Compressed Videos. In Computer Networks and Information Technologies; Das, V.V., Stephen, J., Chaba, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 39–43.
- Ghosh, A.; Subudhi, B.N.; Ghosh, S. Object Detection from Videos Captured by Moving Camera by Fuzzy Edge Incorporated Markov Random Field and Local Histogram Matching. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1127–1135.
- Guo, C.; Gao, H. Adaptive graph-cuts algorithm based on higher-order MRF for video moving object detection. Electron. Lett. 2012, 48, 371–373.
- Guo, C.; Liu, D.; Guo, Y.; Sun, Y. An adaptive graph cut algorithm for video moving objects detection. Multimed. Tools Appl. 2014, 72, 2633–2652.
- Yadav, D.K.; Singh, K. A combined approach of Kullback-Leibler divergence and background subtraction for moving object detection in thermal video. Infrared Phys. Technol. 2016, 76, 21–31.
- Oreifej, O.; Li, X.; Shah, M. Simultaneous Video Stabilization and Moving Object Detection in Turbulence. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 450–462.
- Nadimi, S.; Bhanu, B. Physical models for moving shadow and object detection in video. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1079–1087.
- Utsumi, O.; Miura, K.; Ide, I.; Sakai, S.; Tanaka, H. An object detection method for describing soccer games from video. In Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 26–29 August 2002; pp. 45–48.
- Hossain, M.J.; Dewan, M.A.A.; Chae, O. Moving object detection for real time video surveillance: An Edge based approach. IEICE Trans. Commun. 2007, 90, 3654–3664.
- Chiranjeevi, P.; Sengupta, S. Robust detection of moving objects in video sequences through rough set theory framework. Image Vis. Comput. 2012, 30, 829–842.
- Abd Razak, H.; Abd Almisreb, A.; Saleh, M.A.; Tahir, N.M. Anomalous Behaviour Detection using Transfer Learning Algorithm of Series and DAG Network. In Proceedings of the 2019 IEEE 9th International Conference on System Engineering and Technology, Shah Alam, Malaysia, 7 October 2019; pp. 505–509.
- Azarang, A.; Manoochehri, H.E.; Kehtarnavaz, N. Convolutional Autoencoder-Based Multispectral Image Fusion. IEEE Access 2019, 7, 35673–35683.
- Majumder, S.; Elloumi, Y.; Akil, M.; Kachouri, R.; Kehtarnavaz, N. A deep learning-based smartphone app for real-time detection of five stages of diabetic retinopathy. In Proceedings of the Real-Time Image Processing and Deep Learning 2020, Online Only, CA, USA, 27 April–8 May 2020.
- Wang, Z.; Wang, Y.; Lin, Y.; Delord, E.; Khan, L. Few-Sample and Adversarial Representation Learning for Continual Stream Mining. In Proceedings of the WWW ’20: The Web Conference 2020, Taipei, Taiwan, 20–24 April 2020.
- Maor, G.; Zeng, X.; Wang, Z.; Hu, Y. An FPGA Implementation of Stochastic Computing-based LSTM. In Proceedings of the 2019 IEEE 37th International Conference on Computer Design, Abu Dhabi, UAE, 17–20 November 2019; pp. 38–46.
- Chu, X. Human Pose Estimation and Immediacy Prediction with Deep Learning. Ph.D. Thesis, The Chinese University of Hong Kong, Hong Kong, China, August 2017.
- Wang, Z.; Tao, H.; Kong, Z.; Chandra, S.; Khan, L. Metric Learning based Framework for Streaming Classification with Concept Evolution. In Proceedings of the 2019 International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019.
- Li, H.; Meng, L.; Zhang, J.; Tan, Y.; Ren, Y.; Zhang, H. Multiple Description Coding Based on Convolutional Auto-Encoder. IEEE Access 2019, 7, 26013–26021.
- Zheng, S.; Liu, G.; Suo, H.; Lei, Y. Autoencoder-Based Semi-Supervised Curriculum Learning for Out-of-Domain Speaker Verification. In Proceedings of the INTERSPEECH 2019, Graz, Austria, 15–19 September 2019; pp. 4360–4364.
- Wei, H.; Kehtarnavaz, N. Determining Number of Speakers from Single Microphone Speech Signals by Multi-Label Convolutional Neural Network. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018; pp. 2706–2710.
- Zhao, Y.; Wang, D.; Merks, I.; Zhang, T. DNN-Based Enhancement of Noisy and Reverberant Speech. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 6525–6529.
- Tao, F.; Liu, G.; Zhao, Q. An Ensemble Framework of Voice-Based Emotion Recognition System for Films and TV Programs. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 6209–6213.
- Zhao, Y.; Xu, B.; Giri, R.; Zhang, T. Perceptually Guided Speech Enhancement Using Deep Neural Networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5074–5078.
- Tao, F.; Busso, C. Aligning Audiovisual Features for Audiovisual Speech Recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo, San Diego, CA, USA, 23–27 July 2018.
- Wei, H.; Chopada, P.; Kehtarnavaz, N. C-MHAD: Continuous Multimodal Human Action Dataset of Simultaneous Video and Inertial Sensing. Sensors 2020, 20, 2905.
- Brena, R.F.; Aguileta, A.A.; Trejo, L.A.; Molino-Minero-Re, E.; Mayora, O. Choosing the Best Sensor Fusion Method: A Machine-Learning Approach. Sensors 2020, 20, 2350.
- Tao, F.; Busso, C. End-to-End Audiovisual Speech Recognition System with Multitask Learning. IEEE Trans. Multimed. 2020.
- Wei, H.; Kehtarnavaz, N. Simultaneous Utilization of Inertial and Video Sensing for Action Detection and Recognition in Continuous Action Streams. IEEE Sens. J. 2020, 20, 6055–6063.
- Chen, C.; Jafari, R.; Kehtarnavaz, N. A survey of depth and inertial sensor fusion for human action recognition. Multimed. Tools Appl. 2017, 76, 4405–4425.
- Li, M.; Sun, L.; Huo, Q. DFF-DEN: Deep Feature Flow with Detail Enhancement Network for Hand Segmentation in Depth Video. In Proceedings of the 2018 25th IEEE International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 1548–1552.
- Li, M.; Sun, L.; Huo, Q. Flow-guided feature propagation with occlusion aware detail enhancement for hand segmentation in egocentric videos. Comput. Vis. Image Underst. 2019, 187.
- Li, H.; Yang, W.; Liao, Q. Temporal Feature Enhancing Network for Human Pose Estimation in Videos. In Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 579–583.
- Zhou, Q.; Liang, X.; Gong, K.; Lin, L. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. In Proceedings of the 2018 ACM Multimedia Conference, Seoul, Korea, 22–26 October 2018; pp. 1527–1535.
- Pi, Z.; Qin, H.; Gao, C.; Sang, N. Jointly detecting and multiple people tracking by semantic and scene information. Neurocomputing 2020, 412, 244–251.
- Wang, S.; Zhou, Y.; Yan, J.; Deng, Z. Fully Motion-Aware Network for Video Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018.
- Hetang, C.; Qin, H.; Liu, S.; Yan, J. Impression Network for Video Object Detection. arXiv 2017, arXiv:1712.05896.
- Zhu, X.; Dai, J.; Yuan, L.; Wei, Y. Towards High Performance Video Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7210–7218.
- Liu, M.; Zhu, M.; White, M.; Li, Y.; Kalenichenko, D. Looking Fast and Slow: Memory-Guided Mobile Video Object Detection. arXiv 2019, arXiv:1903.10172.
- Liu, M.; Zhu, M. Mobile Video Object Detection with Temporally-Aware Feature Maps. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5686–5695.
- Zhang, C.; Kim, J. Modeling Long- and Short-Term Temporal Context for Video Object Detection. In Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 71–75.
- Lu, Y.; Lu, C.; Tang, C.-K. Online Video Object Detection using Association LSTM. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2363–2371.
- Deng, H.; Hua, Y.; Song, T.; Zhang, Z.; Xue, Z.; Ma, R.; Guan, H. Object Guided External Memory Network for Video Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6677–6686.
- Deng, J.; Pan, Y.; Yao, T.; Zhou, W.; Li, H.; Mei, T. Relation Distillation Networks for Video Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 7022–7031.
- Guo, C.; Fan, B.; Gu, J.; Zhang, Q.; Xiang, S.; Prinet, V.; Pan, C. Progressive Sparse Local Attention for Video Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
- Wu, H.; Chen, Y.; Wang, N.; Zhang, Z. Sequence Level Semantics Aggregation for Video Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Gangnam-gu, Seoul, Korea, 27 October–2 November 2019.
- Mao, H.; Kong, T.; Dally, W.J. CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video. arXiv 2018, arXiv:1810.00434.
- Kim, H.U.; Kim, C.S. CDT: Cooperative Detection and Tracking for Tracing Multiple Objects in Video Sequences. In Computer Vision – ECCV 2016; Part VI; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 851–867.
- Luo, H.; Xie, W.; Wang, X.; Zeng, W. Detect or Track: Towards Cost-Effective Video Object Detection/Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Detect to Track and Track to Detect. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3057–3065.
- Sharma, V.K.; Acharya, B.; Mahapatra, K.K. Online Training of Discriminative Parameter for Object Tracking-by-Detection in a Video. In Soft Computing in Data Analytics; Nayak, J., Abraham, A., Krishna, B., Chandra Sekhar, G., Das, A., Eds.; Springer: Singapore, 2019; pp. 215–223.
- Bertasius, G.; Torresani, L.; Shi, J. Object Detection in Video with Spatiotemporal Sampling Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Chen, K.; Chen, K.; Wang, J.; Yang, S.; Zhang, X.; Xiong, Y.; Loy, C.C.; Lin, D. Optimizing Video Object Detection via a Scale-Time Lattice. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7814–7823.
- Wang, T.; Xiong, J.; Xu, X.; Shi, Y. SCNN: A General Distribution Based Statistical Convolutional Neural Network with Application to Video Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5321–5328.
- Du, Y.; Yuan, C.; Hu, W.; Maybank, S. Spatio-temporal self-organizing map deep network for dynamic object detection from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Xiao, F.; Jae Lee, Y. Video Object Detection with an Aligned Spatial-Temporal Memory. arXiv 2017, arXiv:1712.06317.
- Jiang, Z.; Gao, P.; Guo, C.; Zhang, Q.; Xiang, S.; Pan, C. Video Object Detection with Locally-Weighted Deformable Neighbors. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8529–8536.
- Zhu, H.; Yan, X.; Tang, H.; Chang, Y.; Li, B.; Yuan, X. Moving Object Detection with Deep CNNs. IEEE Access 2020, 8, 29729–29741.
- Chin, T.W.; Ding, R.; Marculescu, D. AdaScale: Towards Real-time Video Object Detection Using Adaptive Scaling. arXiv 2019, arXiv:1902.02910.
- Rybak, Ł.; Dudczyk, J. A Geometrical Divide of Data Particle in Gravitational Classification of Moons and Circles Data Sets. Entropy 2020, 22, 1088.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014; Part V; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Real, E.; Shlens, J.; Mazzocchi, S.; Pan, X.; Vanhoucke, V. YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7464–7473.
- Damen, D.; Doughty, H.; Farinella, G.; Fidler, S.; Furnari, A.; Kazakos, E.; Wray, M. The Epic-Kitchens Dataset: Collection, Challenges and Baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 1, 1.
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 724–732.
- Wang, Y.; Jodoin, P.-M.; Porikli, F.; Konrad, J.; Benezeth, Y.; Ishwar, P. CDnet 2014: An Expanded Change Detection Benchmark Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; p. 393.
- Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Cehovin, L.; Fernandez, G.; Vojir, T.; Hager, G.; Nebehay, G.; Pflugfelder, R.; et al. The Visual Object Tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshop, Santiago, Chile, 7–13 December 2015; pp. 564–586.
- Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv 2015, arXiv:1504.01942.
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
- Kuehne, H.; Jhuang, H.; Stiefelhagen, R.; Serre, T. HMDB: A Large Video Database for Human Motion Recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011.
- Awad, G.; Butt, A.; Fiscus, J.; Joy, D.; Huet, B. TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking. In Proceedings of the TRECVID 2017, Gaithersburg, MD, USA, 13–15 November 2017; Available online: https://hal.archives-ouvertes.fr/hal-01854790 (accessed on 20 August 2020).
- Dollar, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: A benchmark. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 304–311.
- Everingham, M.; Eslami, S.M.A.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
- Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
- Han, G.; Zhang, X.; Li, C. Semi-Supervised DFF: Decoupling Detection and Feature Flow for Video Object Detectors. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea, 22–26 October 2018; pp. 1811–1819.
- Yang, Y.; Shu, G.; Shah, M. Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1650–1657.
- Kumar Singh, K.; Xiao, F.; Jae Lee, Y. Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3548–3556.
- Sharma, P.; Huang, C.; Nevatia, R. Unsupervised Incremental Learning for Improved Object Detection in a Video. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3298–3305.
- Mao, H.; Yang, X.; Dally, W.J. A Delay Metric for Video Object Detection: What Average Precision Fails to Tell. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
- Dosovitskiy, A.; Fischer, P.; Ilg, E.; Haeusser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Luo, C.; Zhan, J.; Wang, L.; Yang, Q. Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks. In International Conference on Artificial Neural Networks; Springer: Cham, Switzerland, 2017.
- Deng, J.; Zhou, Y.; Yu, B.; Chen, Z.; Zafeiriou, S.; Tao, D. Speed/Accuracy Tradeoffs for Object Detection From Video. Available online: http://image-net.org/challenges/talks_2017/Imagenet%202017%20VID.pdf (accessed on 20 August 2020).
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Lipton, Z.C.; Berkowitz, J.; Elkan, C. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv 2015, arXiv:1506.00019.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; NIPS Proceedings: Denver, CO, USA, 2017; Available online: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-30-2017 (accessed on 20 August 2020).
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
- Chen, X.; Yu, J.; Wu, Z. Temporally Identity-Aware SSD With Attentional LSTM. IEEE Trans. Cybern. 2020, 50, 2674–2686.
- Chen, X.; Wu, Z.; Yu, J. TSSD: Temporal Single-Shot Detector Based on Attention and LSTM. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; Maciejewski, A.A., Ed.; IEEE: Piscataway, NJ, USA, 2018; pp. 5758–5763.
- Zhu, H.; Wei, H.; Li, B.; Yuan, X.; Kehtarnavaz, N. Real-Time Moving Object Detection in High-Resolution Video Sensing. Sensors 2020, 20, 3591.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.

Prediction | Label is true | Label is false |
---|---|---|
True | True positive (TP) | False positive (FP) |
False | False negative (FN) | True negative (TN) |
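
As a minimal sketch of how the four cells of the confusion matrix above feed the two evaluation quantities, the following Python function computes Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The function name, the example counts, and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not part of the reviewed paper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts.

    Assumption: an empty denominator yields 0.0 rather than raising.
    """
    # Precision: fraction of positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: fraction of true objects that were detected.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 3 correct detections, 2 false alarms, no misses.
    p, r = precision_recall(tp=3, fp=2, fn=0)
    print(f"precision={p:.2f}, recall={r:.2f}")
```

True negatives do not appear in either formula, which is why detection benchmarks can be scored without enumerating every background region.
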
Image Detection | Confidence Score | TP or FP |
---|---|---|
1 | 0.92 | TP |
2 | 0.83 | TP |
3 | 0.85 | FP |
4 | 0.75 | TP |
5 | 0.72 | FP |

Image Detection | Confidence Score | TP or FP | Precision | Recall | Max. Precision for Any Recall ≥ r |
---|---|---|---|---|---|
1 | 0.92 | TP | 1/1 | 1/3 | 1 |
3 | 0.85 | FP | 1/2 | 1/3 | 1 |
2 | 0.83 | TP | 2/3 | 2/3 | 3/4 |
4 | 0.75 | TP | 3/4 | 3/3 | 3/4 |
5 | 0.72 | FP | 3/5 | 3/3 | 3/4 |
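
The ranked table above can be reproduced with a short script. The sketch below sorts the five example detections by confidence, accumulates Precision and Recall, replaces each Precision with the maximum Precision achieved at any equal or higher Recall, and sums the area under the resulting step curve to obtain the AP. The function and variable names are hypothetical, and the ground-truth count of 3 is read off the Recall denominators in the table; area integration over all points is one common convention and is assumed here rather than taken from the paper.

```python
from fractions import Fraction

# (confidence score, is true positive) for detections 1..5 in the example table.
DETECTIONS = [(0.92, True), (0.83, True), (0.85, False), (0.75, True), (0.72, False)]


def average_precision(detections, num_ground_truth):
    """Return (precisions, recalls, interpolated precisions, AP) as exact fractions."""
    # Rank detections from highest to lowest confidence.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(Fraction(true_positives, rank))
        recalls.append(Fraction(true_positives, num_ground_truth))

    # For each recall r, take the maximum precision at any recall >= r.
    interpolated = [
        max(p for p, r_other in zip(precisions, recalls) if r_other >= r)
        for r in recalls
    ]

    # Area under the interpolated step curve.
    ap = Fraction(0)
    previous_recall = Fraction(0)
    for r, p in zip(recalls, interpolated):
        ap += (r - previous_recall) * p
        previous_recall = r
    return precisions, recalls, interpolated, ap


if __name__ == "__main__":
    _, _, interpolated, ap = average_precision(DETECTIONS, num_ground_truth=3)
    print("interpolated precision:", [str(p) for p in interpolated])
    print("AP =", ap, "≈", round(float(ap), 4))
```

For this example the three recall steps of width 1/3 carry precisions 1, 3/4 and 3/4, so the AP is 1/3 + 1/4 + 1/4 = 5/6. Averaging such per-category AP values over all categories gives the mAP.
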

Type | Framework | Backbone | mAP (%) | Runtime (fps, device) |
---|---|---|---|---|
Single frame | R-FCN [19] | ResNet-101 | 73.9 | 4.05 (K) |
Single frame | R-FCN [19] | ResNet-101 | 70.3 | 12 (XP) |
Flow-based | Impression network [68] | Modified ResNet-101 | 75.5 | 20 (1060) |
Flow-based | FGFA [22] | ResNet-101 | 76.3 | 1.36 (K) |
Flow-based | DFF [28] | ResNet-101 | 73.1 | 20.25 (K) |
Flow-based | THP [69] | ResNet-101 + DCN | 78.6 | 13.0 (X) / 8.6 (K) |
Flow-based | THPM [27] | MobileNet | 60.2 | 25.6 (Huawei Mate 8) |
LSTM-based | Looking fast and slow [70] | Interleaved | 61.4 | 23.5 (Pixel 3 phone) |
LSTM-based | LSTM-SSD [71] | MobileNetV2-SSDLite | 53.5 | − |
LSTM-based | Flow&LSTM [72] | ResNet-101 | 75.5 | − |
Attention-based | OGEMN [74] | ResNet-101 | 79.3 | 8.9 (1080Ti) |
Attention-based | OGEMN [74] | ResNet-101 + DCN | 80.0 | − |
Attention-based | PSLA [76] | ResNet-101 | 77.1 | 30.8 (V) / 18.73 (X) |
Attention-based | PSLA [76] | ResNet-101 + DCN | 80.0 | 26.0 (V) / 13.34 (X) |
Attention-based | SELSA [77] | ResNet-101 | 80.25 | − |
Attention-based | SELSA [77] | ResNeXt-101 | 83.11 | − |
Attention-based | RDN [75] | ResNet-101 | 81.8 | 10.6 (V100) |
Attention-based | RDN [75] | ResNeXt-101 | 83.2 | − |
Attention-based | MEGA [25] | ResNet-101 | 82.9 | 8.73 (2080Ti) |
Attention-based | MEGA [25] | ResNeXt-101 | 84.1 | − |
Tracking-based | D&T loss [81] | ResNet-101 | 75.8 | 7.8 (X) |
Tracking-based | Track assisted [26] | ResNet-101 | 70.0 | 30 (XP) |
Others | TCNN [24] | GoogLeNet | 73.8 | − |
Others | STSN [83] | ResNet-101 + DCN | 78.9 | − |

Type | Framework | Backbone | mAP (%) | Runtime (fps, device) |
---|---|---|---|---|
Flow-based | FGFA [22] | ResNet-101 | 78.4 | − |
Flow-based | FGFA [22] | Inception-ResNet | 80.1 | − |
LSTM-based | Looking fast and slow [70] | Interleaved + Quantization + Async | 59.3 | 72.3 (Pixel 3 phone) |
LSTM-based | MobileNetV2-SSDLite + LSTM [71] | MobileNetV2-SSDLite | 64.1 | 4.1 (Pixel 3 phone) |
LSTM-based | MobileNetV2-SSDLite + LSTM [71] | MobileNetV2-SSDLite | 59.1 | − |
LSTM-based | MobileNetV2-SSDLite + LSTM [71] | MobileNetV2-SSDLite | 50.3 | − |
LSTM-based | MobileNetV2-SSDLite + LSTM [71] | MobileNetV2-SSDLite | 45.1 | 14.6 (Pixel 3 phone) |
Attention-based | OGEMN [74] | ResNet-101 | 80.8 | − |
Attention-based | OGEMN [74] | ResNet-101 + DCN | 81.6 | − |
Attention-based | PSLA [76] | ResNet-101 | 78.6 | 5.7 (X) |
Attention-based | PSLA [76] | ResNet-101 + DCN | 81.4 | 6.31 (V) / 5.13 (X) |
Attention-based | SELSA [77] | ResNet-101 | 80.54 | − |
Attention-based | RDN [75] | ResNet-101 | 83.8 | − |
Attention-based | RDN [75] | ResNeXt-101 | 84.7 | − |
Attention-based | MEGA [25] | ResNet-101 | 84.5 | − |
Attention-based | MEGA [25] | ResNeXt-101 | 85.4 | − |
Tracking-based | D&T () [81] | ResNet-101 | 78.6 | − |
Tracking-based | D&T () [81] | ResNet-101 | 79.8 | 5 (X) |
Tracking-based | D&T [81] | Inception V4 | 82.0 | − |
Others | STSN [83] | ResNet-101 + DCN | 80.4 | − |
Others | STMN [87] | ResNet-101 | 80.5 | − |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Zhu, H.; Wei, H.; Li, B.; Yuan, X.; Kehtarnavaz, N. A Review of Video Object Detection: Datasets, Metrics and Methods. Appl. Sci. 2020, 10, 7834. https://doi.org/10.3390/app10217834