DOI: 10.1145/3664647.3681228

All rivers run into the sea: Unified Modality Brain-Inspired Emotional Central Mechanism

Published: 28 October 2024

Abstract

In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the way the human brain handles emotions and by the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified-modality affective processing network. The core design of UMBEnet comprises a Dual-Stream (DS) structure that fuses inherent prompts with a Prompt Pool, together with a Sparse Feature Fusion (SFF) module. The Prompt Pool is designed to integrate information from different modalities, while the inherent prompts enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across modalities, the SFF module makes full use of all available sensory data through the sparse integration of modality-fusion prompts and inherent prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, show that UMBEnet consistently outperforms current state-of-the-art methods. Notably, in both Modality Missingness and full multimodal scenarios, UMBEnet significantly surpasses the leading methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information. Code can be obtained at https://github.com/Xinji-Mai/UMBEnet.
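
To make the prompt-based fusion idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration of how a learnable Prompt Pool, class-level inherent prompts, and sparse top-k fusion could fit together for variable sets of modality embeddings. The module name, dimensions, and the top-k selection rule are assumptions made for illustration, not the authors' released implementation; see the repository linked above for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparsePromptFusion(nn.Module):
    """Hypothetical sketch of a prompt-pool + inherent-prompt fusion head."""

    def __init__(self, dim=512, pool_size=16, num_classes=7, top_k=4):
        super().__init__()
        # Prompt Pool: learnable prompts intended to absorb information from
        # whichever modalities happen to be available.
        self.prompt_pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        # Inherent prompts: one per emotion class, acting as predictive guidance.
        self.inherent_prompts = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)
        self.top_k = top_k

    def forward(self, modality_feats):
        # modality_feats: list of (B, dim) embeddings; missing modalities are omitted.
        fused = torch.stack(modality_feats, dim=1).mean(dim=1)              # (B, dim)
        # Score every pooled prompt against the fused feature and keep only the
        # top-k most relevant ones: the "sparse" part of the fusion.
        scores = F.normalize(fused, dim=-1) @ F.normalize(self.prompt_pool, dim=-1).T
        topk_vals, topk_idx = scores.topk(self.top_k, dim=-1)               # (B, k)
        selected = self.prompt_pool[topk_idx]                               # (B, k, dim)
        weights = topk_vals.softmax(dim=-1).unsqueeze(-1)                   # (B, k, 1)
        pooled_prompt = (weights * selected).sum(dim=1)                     # (B, dim)
        # Combine the fused features with the selected prompts, then classify by
        # similarity to the inherent (class-level) prompts.
        query = self.proj(fused + pooled_prompt)
        logits = F.normalize(query, dim=-1) @ F.normalize(self.inherent_prompts, dim=-1).T
        return logits                                                       # (B, num_classes)


# Example: visual and audio embeddings present, text missing.
model = SparsePromptFusion()
vis, aud = torch.randn(8, 512), torch.randn(8, 512)
print(model([vis, aud]).shape)  # torch.Size([8, 7])
```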

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. affective computing
    2. cross-modal plasticity
    3. dynamic facial expression recognition
    4. modality missingness

    Qualifiers

    • Research-article

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
