Abstract
We present a lightweight and efficient semi-supervised video object segmentation network based on the space-time memory framework. To some extent, our method addresses two difficulties encountered in traditional video object segmentation: the computation time for a single frame is too long, and segmentation of the current frame should use more information from past frames. The algorithm uses a global context (GC) module to achieve high-performance, real-time segmentation. The GC module effectively integrates information from multiple frames without increasing memory use and can process each frame in real time. Moreover, because the predicted mask of the previous frame helps segment the current frame, we feed it into a spatial constraint module (SCM), which constrains the segmented areas in the current frame. The SCM effectively alleviates mismatching of similar targets while consuming few additional resources. We also add a refinement module to the decoder to improve boundary segmentation. Our model achieves state-of-the-art results on various datasets, scoring 80.1% on YouTube-VOS 2018 and a \(\mathcal{J}\&\mathcal{F}\) score of 78.0% on DAVIS 2017, while taking 0.05 s per frame on the DAVIS 2016 validation dataset.
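To illustrate how a memory can integrate multiple frames without growing, the following is a minimal sketch of a fixed-size context memory. It is not the authors' implementation: the class name `GlobalContextMemory`, the softmax normalizations, and the running-average update are assumptions chosen for illustration, loosely in the spirit of attention with a key-by-value context matrix.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class GlobalContextMemory:
    """Hypothetical fixed-size memory: one (key_dim, value_dim) matrix.

    The matrix size does not depend on how many frames have been seen,
    which is the property the abstract attributes to the GC module.
    """

    def __init__(self, key_dim, value_dim):
        self.context = np.zeros((key_dim, value_dim))
        self.num_frames = 0

    def update(self, keys, values):
        # keys: (N, key_dim), values: (N, value_dim), with N = H * W pixels.
        # Assumption: keys are normalized over the pixel axis, so each key
        # channel acts as a spatial attention map over the frame.
        weights = softmax(keys, axis=0)
        frame_context = weights.T @ values  # (key_dim, value_dim)
        self.num_frames += 1
        # Running average over frames keeps the memory size constant.
        self.context += (frame_context - self.context) / self.num_frames

    def read(self, queries):
        # queries: (N, key_dim) -> read-out features of shape (N, value_dim).
        # Assumption: queries are normalized over the channel axis.
        return softmax(queries, axis=1) @ self.context
```

With this layout, reading costs one matrix product per frame regardless of video length, whereas a memory that stores every past frame's keys and values grows linearly with the number of frames.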
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 61802197, 62072449, and 61632003), the Science and Technology Development Fund, Macau SAR (Grant Nos. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), the Guangdong Science and Technology Department (Grant No. 2020B1515130001), and the University of Macau (Grant Nos. MYRG2020-00253-FST and MYRG2022-00059-FST).
Additional information
Declaration of competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Yadang Chen received his Ph.D. degree in software engineering from the University of Macau in 2016. He is currently an associate professor at Nanjing University of Information Science and Technology. From 2019 to 2020, he was a postdoctoral researcher in the State Key Laboratory of Internet of Things for Smart City and the Department of Electromechanical Engineering, University of Macau. From 2017 to 2018, he was a visiting scholar at Michigan State University. His main research interests include video segmentation, video enhancement, video editing, and augmented reality.
Duolin Wang is currently a master's student in software engineering at Nanjing University of Information Science and Technology, China. His research interests include deep learning and computer vision.
Zhiguo Chen received his M.S. and Ph.D. degrees from the Division of Internet and Multimedia Engineering at Konkuk University, Seoul, Republic of Korea, in 2014 and 2019, respectively. He is an associate professor with the School of Computer Science, Nanjing University of Information Science and Technology. His research interests include artificial intelligence, information security, and cloud computing.
Zhi-Xin Yang obtained his Ph.D. degree in industrial engineering and engineering management from Hong Kong University of Science and Technology. He is currently an associate professor in the State Key Laboratory of Internet of Things for Smart City, Faculty of Science and Technology, and the director of the Research Service and Knowledge Transfer Office, both at the University of Macau. His current research interests include fault diagnosis and prognosis, machine learning, and computer vision-based robotics.
Enhua Wu completed his B.Sc. studies at Tsinghua University and received his Ph.D. degree from the Department of Computer Science, University of Manchester, UK, in 1984. He has worked at the State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences, since 1985, and served as a director of the Research Department of Fundamental Theory and Advanced Technology, IOS, until 2001. He has also been a full professor at the University of Macau since 1997, where he is now the associate dean of the Faculty of Science and Technology.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Chen, Y., Wang, D., Chen, Z. et al. Global video object segmentation with spatial constraint module. Comp. Visual Media 9, 385–400 (2023). https://doi.org/10.1007/s41095-022-0282-8
DOI: https://doi.org/10.1007/s41095-022-0282-8