Abstract
Adapting object detection methods from images to video remains an open challenge. When applied to video, image-based detectors often fail to generalize because of motion blur, unusual or ambiguous object poses, low frame quality, and related degradations. The absence of an effective long-term memory in video object detection poses a further challenge. Because the outputs of successive frames are usually very similar, this redundancy can be exploited; moreover, a series of successive or non-successive frames carries more information than any single frame. In this study, we present a novel recurrent cell for feature propagation and identify the optimal placement of its layers to extend the memory interval, achieving higher accuracy than methods proposed in prior studies. Hardware limitations exacerbate these difficulties, so this paper also aims to implement the method efficiently on embedded devices. We achieve 68.7% mAP on the ImageNet VID dataset in real time on embedded devices, at a speed of 52 fps.
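The recurrent feature-propagation idea can be illustrated with a small convolutional GRU-style cell that fuses the current frame's backbone features with a memory carried over from earlier frames. The sketch below is a generic, hedged example and not the actual STARNet cell; the class name ConvGRUCell, the channel sizes, and the toy frame loop are assumptions introduced purely for illustration.

```python
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Illustrative convolutional GRU cell for frame-to-frame feature
    propagation (a generic stand-in, not the paper's exact cell)."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # Gates and candidate both read the concatenation of current features and memory.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 2 * hidden_channels,
                               kernel_size, padding=padding)
        self.candidate = nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                                   kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: backbone features of the current frame (B, C_in, H, W)
        # h: propagated memory from previous frames   (B, C_hid, H, W)
        update, reset = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, reset * h], dim=1)))
        return (1 - update) * h + update * h_tilde


if __name__ == "__main__":
    # Toy usage: propagate a memory over four consecutive frames (random stand-in features).
    cell = ConvGRUCell(in_channels=64, hidden_channels=64)
    h = torch.zeros(1, 64, 20, 20)                  # initial (empty) memory
    for _ in range(4):
        frame_features = torch.randn(1, 64, 20, 20)  # backbone output for one frame
        h = cell(frame_features, h)                  # fused spatio-temporal features
    print(h.shape)  # torch.Size([1, 64, 20, 20])
```

Using convolutional gates keeps the memory spatially aligned with the backbone feature map, so information from earlier frames can be fused at each time step without any explicit motion estimation; where such a cell is placed in the detector determines how long the memory interval effectively is.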
References
Bertasius, G., Torresani, L., Shi, J.: Object detection in video with spatiotemporal sampling networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 331–346 (2018)
Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3464–3468 (2016)
Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp. 1–6 (2018)
Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346 (2020)
Cui, Y., Yan, L., Cao, Z., Liu, D.: Tf-blender: temporal feature blender for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8138–8147 (2021)
Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7023–7032 (2019)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)
Ehteshami Bejnordi, B., Habibian, A., Porikli, F., Ghodrati, A.: Salisa: saliency-based input sampling for efficient video object detection. In: European Conference on Computer Vision, pp. 300–316. Springer (2022)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046 (2017)
Galteri, L., Seidenari, L., Bertini, M., Del Bimbo, A.: Spatio-temporal closed-loop object detection. IEEE Trans. Image Process. 26, 1253–1263 (2017)
Habibian, A., Abati, D., Cohen, T.S., Bejnordi, B.E.: Skip-convolutions for efficient video processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2695–2704 (2021)
Habibian, A., Ben Yahia, H., Abati, D., Gavves, E., Porikli, F.: Delta distillation for efficient video processing. In: European Conference on Computer Vision, pp. 213–229. Springer (2022)
Hajizadeh, M., Sabokrou, M., Rahmani, A.: MobileDenseNet: a new approach to object detection on mobile devices. Expert Syst. Appl. 215, 119348 (2023)
Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S.: Seq-nms for video object detection (2016). arXiv preprint arXiv:1602.08465
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., et al.: T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28, 2896–2907 (2017)
Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Liu, M., Zhu, M.: Mobile video object detection with temporally-aware feature maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5695 (2018)
Liu, M., Zhu, M., White, M., Li, Y., Kalenichenko, D.: Looking fast and slow: memory-guided mobile video object detection (2019). arXiv preprint arXiv:1903.10172
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37. Springer (2016)
Mao, H., Zhu, S., Han, S., Dally, W.J.: Patchnet: short-range template matching for efficient video processing (2021). arXiv preprint arXiv:2103.07371
Qin, Z., Li, Z., Zhang, Z., Bao, Y., Yu, G., Peng, Y., Sun, J.: Thundernet: towards real-time generic object detection on mobile devices. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6718–6727 (2019)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960 (2017)
Tan, M., Pang, R., Le, Q.V.: Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790 (2020)
Tang, Q., Li, J., Shi, Z., Hu, Y.: Lightdet: a lightweight and accurate object detection network. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 2243–2247 (2020)
Wang, S., Zhou, Y., Yan, J., Deng, Z.: Fully motion-aware network for video object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 542–557 (2018)
Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: European Conference on Computer Vision, pp. 107–122. Springer (2020)
Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3645–3649 (2017)
Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
Wu, H., Chen, Y., Wang, N., Zhang, Z.: Sequence level semantics aggregation for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9217–9225 (2019)
Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 485–501 (2018)
Xu, R., Mu, F., Lee, J., Mukherjee, P., Chaterji, S., Bagchi, S., Li, Y.: Smartadapt: multi-branch object detection framework for videos on mobiles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2528–2538 (2023)
Yao, C.H., Fang, C., Shen, X., Wan, Y., Yang, M.H.: Video object detection via object-level temporal aggregation. In: European Conference on Computer Vision, pp. 160–177. Springer (2020)
Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218 (2018)
Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417 (2017)
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)
Author information
Contributions
MH: Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing—Original Draft. MS: Investigation, Writing—Original Draft, Writing—Review and Editing, Visualization, Supervision. AR: Writing—Review and Editing, Supervision, Project administration.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Comparison of small-object detection between the proposed method and EfficientDet [26].
(The localization rows of the original table show the detected bounding boxes as images and are not reproduced here; only the classification outputs are listed.)

Method | Frame 1 | Frame 2 | Frame 3 | Frame 4 |
---|---|---|---|---|
EfficientDet [26] | Bird 96%, Bird 77% | Bird 97%, Bird 58% | Bird 94%, Fail | Squirrel 86%, Bird 93% |
Proposed method | Bird 100%, Bird 98% | Bird 100%, Bird 96% | Bird 100%, Bird 94% | Bird 99%, Bird 99% |
EfficientDet [26] | Dog 93% | Dog 66% | Cat 62% | Bird 51% |
Proposed method | Dog 99% | Dog 98% | Dog 98% | Dog 98% |
EfficientDet [26] | Motorcycle 43%, Car 92% | Motorcycle 48%, Car 91% | Car 57%, Car 97% | Motorcycle 66%, Car 98% |
Proposed method | Motorcycle 69%, Car 96% | Car 80%, Car 97% | Car 89%, Car 97% | Car 92%, Car 99% |
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hajizadeh, M., Sabokrou, M. & Rahmani, A. STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices. Machine Vision and Applications 35, 23 (2024). https://doi.org/10.1007/s00138-023-01504-0
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00138-023-01504-0