DOI: 10.1145/3664647.3681583 — ACM Multimedia Conference Proceedings
research-article

Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

Published: 28 October 2024

Abstract

In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
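The textual prompting module described above builds on the general idea of learnable prompt contexts for CLIP-style vision-language models. As a hedged illustration only (not the authors' implementation), the sketch below mimics that mechanism in NumPy: shared learnable context vectors are prepended to per-class name embeddings, encoded into text features, and matched against a video feature by cosine similarity. The encoder `encode_text` and all names here are stand-ins, not HESP's actual API.

```python
# Illustrative sketch of learnable textual prompts (CoOp-style), NOT the
# paper's code. `encode_text` is a toy stand-in for CLIP's text Transformer.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, N_CTX = 8, 4  # toy sizes; real CLIP uses 512-dim embeddings

def encode_text(token_embeddings):
    """Toy text encoder: mean-pool the token embeddings."""
    return token_embeddings.mean(axis=0)

# Learnable context vectors shared across classes (optimized during training),
# plus one hypothetical name embedding per known/unknown class.
ctx = rng.normal(size=(N_CTX, EMBED_DIM))
class_name_emb = {c: rng.normal(size=(1, EMBED_DIM))
                  for c in ["happy", "sad", "unknown"]}

def class_text_features():
    """Encode [ctx_1 .. ctx_N, CLASS] for every class; L2-normalize each."""
    feats = []
    for name_emb in class_name_emb.values():
        tokens = np.concatenate([ctx, name_emb], axis=0)
        f = encode_text(tokens)
        feats.append(f / np.linalg.norm(f))
    return np.stack(feats)

def logits(video_feature):
    """Cosine similarity between a video feature and each class text feature."""
    v = video_feature / np.linalg.norm(video_feature)
    return class_text_features() @ v

video_feat = rng.normal(size=EMBED_DIM)
print(logits(video_feat).shape)  # one similarity score per class
```

Training would update `ctx` (and, in HESP's case, the visual prompts as well) by backpropagating a classification loss through these similarities; the sketch only shows the forward pass.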

Supplemental Material

MP4 File - 4871-video.mp4
This video presents a brief overview of our work titled "Open Set Video-based Facial Expression Recognition with Human Expression-Sensitive Prompting." We begin by introducing the Open Set Video-based Facial Expression Recognition task, followed by an analysis of the challenges it presents, which motivated the development of our HESP approach. We then describe our framework, comprising three key modules: a text prompting module, a visual prompting module, and an open-set multi-task learning scheme. The video also covers comparative experiments, ablation studies, and visualization analyses. Finally, we conclude with a summary of our work.
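The abstract reports gains in AUROC and OSCR. As background (not code from the paper), the sketch below shows how these two open-set metrics are commonly computed from per-sample confidence scores: AUROC via the rank-sum formulation, and OSCR as the area under the curve of correct-classification rate on known classes versus false-positive rate on unknown classes, swept over a confidence threshold. All variable names are illustrative.

```python
# Background sketch of open-set metrics (AUROC, OSCR); illustrative only.
import numpy as np

def auroc(known_scores, unknown_scores):
    """AUROC for separating knowns (positives) from unknowns (negatives),
    via the Mann-Whitney U / rank-sum formulation (no ties assumed)."""
    scores = np.concatenate([known_scores, unknown_scores])
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = len(known_scores), len(unknown_scores)
    u = ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def oscr(known_scores, known_correct, unknown_scores):
    """Open-Set Classification Rate: area under the (FPR, CCR) curve,
    sweeping a confidence threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([known_scores, unknown_scores]))[::-1]
    ccr = [np.mean(known_correct & (known_scores >= t)) for t in thresholds]
    fpr = [np.mean(unknown_scores >= t) for t in thresholds]
    area = 0.0
    for i in range(1, len(thresholds)):  # trapezoidal rule
        area += (fpr[i] - fpr[i - 1]) * (ccr[i] + ccr[i - 1]) / 2.0
    return area

known = np.array([0.9, 0.8, 0.7, 0.4])      # max-confidence on known samples
correct = np.array([True, True, False, True])  # whether the argmax class was right
unknown = np.array([0.3, 0.5, 0.2])         # max-confidence on unknown samples
print(round(auroc(known, unknown), 3), round(oscr(known, correct, unknown), 3))
```

Higher is better for both: AUROC measures only known/unknown separability, while OSCR additionally penalizes misclassifying known samples, which is why the two can move independently.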




      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
      ISBN:9798400706868
      DOI:10.1145/3664647

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. clip
      2. open-set recognition
      3. textual prompting
      4. video-based facial expression recognition
      5. visual prompting

      Qualifiers

      • Research-article

      Conference

MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

      MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 49
      • Downloads (last 12 months): 49
      • Downloads (last 6 weeks): 49
      Reflects downloads up to 10 Dec 2024
