Abstract
Adapting object detection methods from images to video remains an open challenge. When applied to video, image-based detectors often fail to generalize because of motion blur, unusual or ambiguous object poses, low frame quality, and related degradations. The absence of an effective long-term memory in video object detection poses a further challenge. Because the outputs of successive frames are usually very similar, this redundancy can be exploited; moreover, a series of successive or non-successive frames carries more information than any single frame. In this study, we present a novel recurrent cell for feature propagation and identify the optimal placement of its layers to extend the memory interval, achieving higher accuracy than methods proposed in prior studies. Hardware limitations exacerbate these difficulties, so this paper also aims to implement the method efficiently on embedded devices. We achieve 68.7% mAP on the ImageNet VID dataset in real time on embedded devices, at a speed of 52 fps.
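The recurrent feature-propagation idea can be illustrated with a small convolutional GRU-style cell that fuses the current frame's backbone features with a memory carried over from earlier frames. The sketch below is a generic, hedged example and not the actual STARNet cell; the class name ConvGRUCell, the channel sizes, and the toy frame loop are assumptions introduced purely for illustration.

```python
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Illustrative convolutional GRU cell for frame-to-frame feature
    propagation (a generic stand-in, not the paper's exact cell)."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # Gates and candidate both read the concatenation of current features and memory.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 2 * hidden_channels,
                               kernel_size, padding=padding)
        self.candidate = nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                                   kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: backbone features of the current frame (B, C_in, H, W)
        # h: propagated memory from previous frames   (B, C_hid, H, W)
        update, reset = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, reset * h], dim=1)))
        return (1 - update) * h + update * h_tilde


if __name__ == "__main__":
    # Toy usage: propagate a memory over four consecutive frames (random stand-in features).
    cell = ConvGRUCell(in_channels=64, hidden_channels=64)
    h = torch.zeros(1, 64, 20, 20)                  # initial (empty) memory
    for _ in range(4):
        frame_features = torch.randn(1, 64, 20, 20)  # backbone output for one frame
        h = cell(frame_features, h)                  # fused spatio-temporal features
    print(h.shape)  # torch.Size([1, 64, 20, 20])
```

Using convolutional gates keeps the memory spatially aligned with the backbone feature map, so information from earlier frames can be fused at each time step without any explicit motion estimation; where such a cell is placed in the detector determines how long the memory interval effectively is.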
References
Bertasius, G., Torresani, L., Shi, J.: Object detection in video with spatiotemporal sampling networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 331–346 (2018)
Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3464–3468 (2016)
Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp. 1–6 (2018)
Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346 (2020)
Cui, Y., Yan, L., Cao, Z., Liu, D.: Tf-blender: temporal feature blender for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8138–8147 (2021)
Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7023–7032 (2019)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)
Ehteshami Bejnordi, B., Habibian, A., Porikli, F., Ghodrati, A.: Salisa: saliency-based input sampling for efficient video object detection. In: European Conference on Computer Vision, pp. 300–316. Springer (2022)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046 (2017)
Galteri, L., Seidenari, L., Bertini, M., Del Bimbo, A.: Spatio-temporal closed-loop object detection. IEEE Trans. Image Process. 26, 1253–1263 (2017)
Habibian, A., Abati, D., Cohen, T.S., Bejnordi, B.E.: Skip-convolutions for efficient video processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2695–2704 (2021)
Habibian, A., Ben Yahia, H., Abati, D., Gavves, E., Porikli, F.: Delta distillation for efficient video processing. In: European Conference on Computer Vision, pp. 213–229. Springer (2022)
Hajizadeh, M., Sabokrou, M., Rahmani, A.: MobileDenseNet: a new approach to object detection on mobile devices. Expert Syst. Appl. 215, 119348 (2023)
Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S.: Seq-nms for video object detection (2016). arXiv preprint arXiv:1602.08465
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., et al.: T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28, 2896–2907 (2017)
Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Liu, M., Zhu, M.: Mobile video object detection with temporally-aware feature maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5695 (2018)
Liu, M., Zhu, M., White, M., Li, Y., Kalenichenko, D.: Looking fast and slow: memory-guided mobile video object detection (2019). arXiv preprint arXiv:1903.10172
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37. Springer (2016)
Mao, H., Zhu, S., Han, S., Dally, W.J.: Patchnet: short-range template matching for efficient video processing (2021). arXiv preprint arXiv:2103.07371
Qin, Z., Li, Z., Zhang, Z., Bao, Y., Yu, G., Peng, Y., Sun, J.: Thundernet: towards real-time generic object detection on mobile devices. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6718–6727 (2019)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960 (2017)
Tan, M., Pang, R., Le, Q.V.: Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790 (2020)
Tang, Q., Li, J., Shi, Z., Hu, Y.: Lightdet: a lightweight and accurate object detection network. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 2243–2247 (2020)
Wang, S., Zhou, Y., Yan, J., Deng, Z.: Fully motion-aware network for video object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 542–557 (2018)
Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: European Conference on Computer Vision, pp. 107–122. Springer (2020)
Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3645–3649 (2017)
Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
Wu, H., Chen, Y., Wang, N., Zhang, Z.: Sequence level semantics aggregation for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9217–9225 (2019)
Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 485–501 (2018)
Xu, R., Mu, F., Lee, J., Mukherjee, P., Chaterji, S., Bagchi, S., Li, Y.: Smartadapt: multi-branch object detection framework for videos on mobiles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2528–2538 (2023)
Yao, C.H., Fang, C., Shen, X., Wan, Y., Yang, M.H.: Video object detection via object-level temporal aggregation. In: European Conference on Computer Vision, pp. 160–177. Springer (2020)
Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218 (2018)
Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417 (2017)
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)
Author information
Contributions
MH: Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing—Original Draft. MS: Investigation, Writing—Original Draft, Writing—Review and Editing, Visualization, Supervision. AR: Writing—Review and Editing, Supervision, Project administration.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Comparison of small-object detection between the proposed method and EfficientDet [26].
(The localization rows of the original table show the detected bounding boxes as images and are not reproduced here; only the classification outputs are listed.)

Method | Frame 1 | Frame 2 | Frame 3 | Frame 4 |
---|---|---|---|---|
EfficientDet [26] | Bird 96%, Bird 77% | Bird 97%, Bird 58% | Bird 94%, Fail | Squirrel 86%, Bird 93% |
Proposed method | Bird 100%, Bird 98% | Bird 100%, Bird 96% | Bird 100%, Bird 94% | Bird 99%, Bird 99% |
EfficientDet [26] | Dog 93% | Dog 66% | Cat 62% | Bird 51% |
Proposed method | Dog 99% | Dog 98% | Dog 98% | Dog 98% |
EfficientDet [26] | Motorcycle 43%, Car 92% | Motorcycle 48%, Car 91% | Car 57%, Car 97% | Motorcycle 66%, Car 98% |
Proposed method | Motorcycle 69%, Car 96% | Car 80%, Car 97% | Car 89%, Car 97% | Car 92%, Car 99% |
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hajizadeh, M., Sabokrou, M. & Rahmani, A. STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices. Machine Vision and Applications 35, 23 (2024). https://doi.org/10.1007/s00138-023-01504-0
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00138-023-01504-0