Research Article | Open Access
DOI: 10.1145/3664647.3681708

Domain Knowledge Enhanced Vision-Language Pretrained Model for Dynamic Facial Expression Recognition

Published: 28 October 2024

Abstract

Dynamic facial expression recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video sequences. However, the complex temporal modeling caused by noisy frames, together with the limited training data, significantly hinders its further development. Previous efforts in this domain have been limited because they tackled these issues separately. Inspired by recent advances in pretrained vision-language models (e.g., CLIP), we propose to leverage such models to jointly address both limitations. Since the raw CLIP model can neither model temporal relationships nor determine the optimal task-related textual prompts, we utilize DFER-specific domain knowledge, including the characteristics of temporal correlations and the relationships between facial behavior descriptions at different levels, to guide the adaptation of CLIP to DFER. Specifically, we enhance CLIP's visual encoder with a hierarchical video encoder that captures both short- and long-term temporal correlations in DFER. Meanwhile, we align facial expressions with action units through prior knowledge to construct semantically rich textual prompts, which are further enhanced with visual content. Furthermore, we introduce a class-aware consistency regularization mechanism that adaptively filters out noisy frames, bolstering the model's robustness against interference. Extensive experiments on three in-the-wild dynamic facial expression datasets demonstrate that our method outperforms state-of-the-art DFER approaches. The code is available at https://github.com/liliupeng28/DK-CLIP.
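To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation (see the repository linked above for that), of its three ingredients: a hierarchical encoder that applies attention first within short frame windows and then across the whole clip, action-unit-enriched class prompts scored by cosine similarity, and a class-aware weighting that down-weights frames whose predictions disagree with the clip-level prediction. All module names, shapes, prompt wordings, and hyperparameters here are illustrative assumptions.

    # Illustrative sketch only; module names and hyperparameters are assumptions,
    # not the authors' released code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalVideoEncoder(nn.Module):
        """Short-term attention within local frame windows, then long-term
        attention across the whole sequence (T must be divisible by window)."""
        def __init__(self, dim=512, window=4, heads=8):
            super().__init__()
            self.window = window
            self.short = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=1)
            self.long = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=1)

        def forward(self, frame_feats):                    # (B, T, D) CLIP-style frame features
            B, T, D = frame_feats.shape
            x = frame_feats.reshape(B * (T // self.window), self.window, D)
            x = self.short(x).reshape(B, T, D)             # short-term correlations
            return self.long(x)                            # long-term correlations

    def class_aware_weights(frame_logits, video_logits, tau=1.0):
        """Down-weight frames whose class posterior disagrees with the
        video-level prediction (a stand-in for the noisy-frame filtering)."""
        p_frame = F.softmax(frame_logits / tau, dim=-1)    # (B, T, C)
        p_video = F.softmax(video_logits / tau, dim=-1)    # (B, C)
        agree = (p_frame * p_video.unsqueeze(1)).sum(-1)   # (B, T) agreement score
        return agree / agree.sum(dim=1, keepdim=True).clamp_min(1e-8)

    # Classes are scored against AU-enriched prompts, e.g. "a face showing
    # happiness, with cheek raiser and lip corner puller" (AU6 + AU12 in FACS).
    B, T, D, C = 2, 16, 512, 7
    frames = torch.randn(B, T, D)                          # stand-in for CLIP frame features
    text = F.normalize(torch.randn(C, D), dim=-1)          # stand-in for prompt features
    video = F.normalize(HierarchicalVideoEncoder()(frames), dim=-1)
    frame_logits = video @ text.t()                        # (B, T, C) cosine similarities
    w = class_aware_weights(frame_logits, frame_logits.mean(dim=1))
    video_logits = (w.unsqueeze(-1) * frame_logits).sum(dim=1)  # (B, C) clip-level scores

Weighting per-frame similarities by their agreement with the clip-level prediction is one simple way to realize the consistency idea; the paper's actual regularization term and prompt construction may differ.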

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. dynamic facial expression recognition
2. vision-language model

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
