DOI: 10.1145/3664647.3681267

MGR-Dark: A Large Multimodal Video Dataset and RGB-IR Benchmark for Gesture Recognition in Darkness

Published: 28 October 2024

Abstract

Gesture recognition plays a crucial role in natural human-computer interaction and sign language recognition. Despite considerable progress under normal daylight, research dedicated to gesture recognition in dark environments is scarce, partly due to the lack of datasets for the task. We bridge this gap by collecting MGR-Dark, a large-scale multimodal video dataset for gesture recognition in darkness. MGR-Dark is distinguished from existing gesture datasets by its collection in darkness, its multimodal videos (RGB, Depth, and Infrared), and its high video quality. To the best of our knowledge, this is the first high-quality multimodal dataset dedicated to human gesture recognition in dark videos. Building upon it, we propose a Modality Translation and Cross-modal Distillation (MTCD) RGB-IR benchmark framework. Specifically, a modality translator first transfers RGB data to pseudo-Infrared data, and a progressive cross-modal feature distillation module then exploits the underlying relations between the RGB, pseudo-Infrared, and Infrared modalities to guide RGB feature learning. The experiments demonstrate that the proposed dataset and benchmark are expected to advance research on gesture recognition in dark videos. Our dataset and code can be found at https://github.com/Grass-Shi/MGR-Dark.
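
The abstract describes a two-stage pipeline: translate RGB frames into pseudo-Infrared frames, then distill features from the pseudo-IR and IR branches into the RGB branch. As a rough illustration of that idea, here is a minimal PyTorch sketch; every module, shape, and loss weight below is an illustrative assumption rather than the authors' implementation (see the linked repository for the actual code).

```python
# Minimal sketch of the MTCD idea: an RGB -> pseudo-IR translator plus
# feature-level cross-modal distillation into an RGB student. All names,
# architectures, and weights here are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGB2IRTranslator(nn.Module):
    """Toy RGB -> pseudo-IR translator (stand-in for a learned one)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),  # 1-channel pseudo-IR frame
        )
    def forward(self, x):            # x: (B, 3, H, W)
        return self.net(x)           # (B, 1, H, W)

class Encoder(nn.Module):
    """Tiny per-frame encoder; one instance per modality branch."""
    def __init__(self, in_ch, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, x):
        return self.conv(x).flatten(1)  # (B, feat_dim)

def distill_loss(student_feat, teacher_feat):
    """Feature distillation: align normalized embeddings (cosine)."""
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat.detach(), dim=1)  # teachers give targets only
    return (1.0 - (s * t).sum(dim=1)).mean()

# Assumed toy batch: paired RGB and IR frames plus gesture labels.
rgb = torch.randn(4, 3, 64, 64)
ir = torch.randn(4, 1, 64, 64)
labels = torch.randint(0, 10, (4,))

translator = RGB2IRTranslator()
rgb_student = Encoder(3)
pseudo_ir_teacher = Encoder(1)
ir_teacher = Encoder(1)
classifier = nn.Linear(64, 10)

pseudo_ir = translator(rgb)               # stage 1: modality translation
f_rgb = rgb_student(rgb)                  # RGB student features
f_pseudo = pseudo_ir_teacher(pseudo_ir)   # pseudo-IR "bridge" features
f_ir = ir_teacher(ir)                     # IR teacher features

# Stage 2: cross-modal distillation -- pseudo-IR bridges RGB and IR;
# the 0.5 weights are an illustrative guess, not the paper's schedule.
loss = (F.cross_entropy(classifier(f_rgb), labels)
        + 0.5 * distill_loss(f_rgb, f_pseudo)
        + 0.5 * distill_loss(f_rgb, f_ir))
loss.backward()
print(f"total loss: {loss.item():.4f}")
```

The pseudo-IR branch acts as an intermediate target that is closer to RGB than raw IR, which is presumably why the paper distills progressively through it rather than from IR alone; at inference, only the RGB student would be needed.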



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11,719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. gesture recognition
2. MGR-Dark dataset
3. modality translation
4. progressive cross-modal distillation

      Qualifiers

      • Research-article

Funding Sources

• Teaching Reform Project of Shaanxi Higher Continuing Education
• National Science and Technology Major Project
• Provincial Key Research and Development Program of Shaanxi
• National Natural Science Foundation of China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
