Research Article | Open Access
DOI: 10.1145/3664647.3681708

Domain Knowledge Enhanced Vision-Language Pretrained Model for Dynamic Facial Expression Recognition

Published: 28 October 2024

Abstract

Dynamic facial expression recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video sequences. However, the complex temporal modeling caused by noisy frames, together with the limited training data, significantly hinders its further development. Previous efforts in this domain have been limited because they tackled these issues separately. Inspired by recent advances in pretrained vision-language models (e.g., CLIP), we propose to leverage such models to jointly address both limitations. Since the raw CLIP model can neither model temporal relationships nor determine the optimal task-related textual prompts, we utilize DFER-specific domain knowledge, including the characteristics of temporal correlations and the relationships between facial behavior descriptions at different levels, to guide the adaptation of CLIP to DFER. Specifically, we enhance CLIP's visual encoder with a hierarchical video encoder that captures both short- and long-term temporal correlations in DFER. Meanwhile, we align facial expressions with action units through prior knowledge to construct semantically rich textual prompts, which are further enhanced with visual content. Furthermore, we introduce a class-aware consistency regularization mechanism that adaptively filters out noisy frames, bolstering the model's robustness against interference. Extensive experiments on three in-the-wild dynamic facial expression datasets demonstrate that our method outperforms state-of-the-art DFER approaches. The code is available at https://github.com/liliupeng28/DK-CLIP.
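To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation (see the repository linked above for that), of its three ingredients: a hierarchical encoder that applies attention first within short frame windows and then across the whole clip, action-unit-enriched class prompts scored by cosine similarity, and a class-aware weighting that down-weights frames whose predictions disagree with the clip-level prediction. All module names, shapes, prompt wordings, and hyperparameters here are illustrative assumptions.

    # Illustrative sketch only; module names and hyperparameters are assumptions,
    # not the authors' released code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalVideoEncoder(nn.Module):
        """Short-term attention within local frame windows, then long-term
        attention across the whole sequence (T must be divisible by window)."""
        def __init__(self, dim=512, window=4, heads=8):
            super().__init__()
            self.window = window
            self.short = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=1)
            self.long = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=1)

        def forward(self, frame_feats):                    # (B, T, D) CLIP-style frame features
            B, T, D = frame_feats.shape
            x = frame_feats.reshape(B * (T // self.window), self.window, D)
            x = self.short(x).reshape(B, T, D)             # short-term correlations
            return self.long(x)                            # long-term correlations

    def class_aware_weights(frame_logits, video_logits, tau=1.0):
        """Down-weight frames whose class posterior disagrees with the
        video-level prediction (a stand-in for the noisy-frame filtering)."""
        p_frame = F.softmax(frame_logits / tau, dim=-1)    # (B, T, C)
        p_video = F.softmax(video_logits / tau, dim=-1)    # (B, C)
        agree = (p_frame * p_video.unsqueeze(1)).sum(-1)   # (B, T) agreement score
        return agree / agree.sum(dim=1, keepdim=True).clamp_min(1e-8)

    # Classes are scored against AU-enriched prompts, e.g. "a face showing
    # happiness, with cheek raiser and lip corner puller" (AU6 + AU12 in FACS).
    B, T, D, C = 2, 16, 512, 7
    frames = torch.randn(B, T, D)                          # stand-in for CLIP frame features
    text = F.normalize(torch.randn(C, D), dim=-1)          # stand-in for prompt features
    video = F.normalize(HierarchicalVideoEncoder()(frames), dim=-1)
    frame_logits = video @ text.t()                        # (B, T, C) cosine similarities
    w = class_aware_weights(frame_logits, frame_logits.mean(dim=1))
    video_logits = (w.unsqueeze(-1) * frame_logits).sum(dim=1)  # (B, C) clip-level scores

Weighting per-frame similarities by their agreement with the clip-level prediction is one simple way to realize the consistency idea; the paper's actual regularization term and prompt construction may differ.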

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. dynamic facial expression recognition
2. vision-language model

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
