DOI: 10.1145/3242969.3264980
short-paper

An Occam's Razor View on Learning Audiovisual Emotion Recognition with Small Training Sets

Published: 02 October 2018

Abstract

This paper presents a light-weight and accurate deep neural model for audiovisual emotion recognition. To design this model, the authors followed a philosophy of simplicity, drastically limiting the number of parameters learned from the target datasets and always choosing the simplest learning methods: i) transfer learning and low-dimensional space embeddings reduce the dimensionality of the representations; ii) visual temporal information is handled by a simple score-per-frame process averaged across time; iii) a simple frame-selection mechanism weights the images within each sequence; iv) the different modalities are fused at prediction level (late fusion). The paper also highlights the inherent challenges of the AFEW dataset and the difficulty of model selection with as few as 383 validation sequences. The proposed real-time emotion classifier achieved a state-of-the-art accuracy of 60.64% on the AFEW test set and ranked 4th at the Emotion in the Wild (EmotiW) 2018 challenge.
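To make the pipeline concrete, the following is a minimal NumPy sketch of the three decision-level ingredients named above: per-frame scoring averaged across time, a frame-weighting mechanism, and late fusion of the audio and visual predictions. The 128-dimensional embeddings, the confidence-based weighting rule, and the fusion coefficient alpha are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the decision pipeline described in the abstract.
# Embedding size, the confidence-based frame weighting, and the fusion
# coefficient are illustrative assumptions, not the paper's exact design.
import numpy as np

NUM_CLASSES = 7  # AFEW uses seven emotion categories


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def score_frames(frame_embeddings, classifier):
    """Score each frame independently: no recurrent model, the temporal
    dimension is handled afterwards by simple statistics."""
    return np.stack([classifier(e) for e in frame_embeddings])  # (T, C)


def weight_frames(frame_scores):
    """Hypothetical frame-selection mechanism: weight every frame by the
    confidence (max class score) of its own prediction, then average."""
    conf = frame_scores.max(axis=1)                  # (T,)
    w = conf / conf.sum()                            # weights sum to 1
    return (w[:, None] * frame_scores).sum(axis=0)   # video-level scores (C,)


def late_fusion(visual_scores, audio_scores, alpha=0.5):
    """Prediction-level (late) fusion: a convex combination of the two
    modalities; alpha would be tuned on the validation set."""
    return alpha * visual_scores + (1.0 - alpha) * audio_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(NUM_CLASSES, 128))        # toy linear classifier head
    clf = lambda e: softmax(W @ e)
    frames = rng.normal(size=(16, 128))            # 16 frames, 128-d embeddings
    visual = weight_frames(score_frames(frames, clf))
    audio = softmax(rng.normal(size=NUM_CLASSES))  # stand-in audio prediction
    print(late_fusion(visual, audio).argmax())     # fused emotion class index
```

This mirrors the paper's stated philosophy: with pretrained embeddings fixed, only a small classification head and the fusion weight remain to be learned from the target data, which keeps model selection tractable on a small validation set.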





Published In

ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
October 2018
687 pages
ISBN:9781450356923
DOI:10.1145/3242969
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 October 2018


Author Tags

  1. deep learning
  2. emotion recognition

Qualifiers

  • Short-paper

Conference

ICMI '18
Sponsor:
  • SIGCHI

Acceptance Rates

ICMI '18 Paper Acceptance Rate 63 of 149 submissions, 42%;
Overall Acceptance Rate 453 of 1,080 submissions, 42%


Cited By

  • (2024) Facial Expression Recognition with Multi-level Integration Disentangled Generative Adversarial Network. 2024 IEEE International Conference on Industrial Technology (ICIT), pp. 1-6. DOI: 10.1109/ICIT58233.2024.10540810. Online publication date: 25-Mar-2024.
  • (2024) Exploring contactless techniques in multimodal emotion recognition: insights into diverse applications, challenges, solutions, and prospects. Multimedia Systems 30:3. DOI: 10.1007/s00530-024-01302-2. Online publication date: 6-Apr-2024.
  • (2023) Self-Adaptive Facial Expression Recognition Based on Local Feature Augmentation and Global Information Correlation. 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1271-1276. DOI: 10.1109/SMC53992.2023.10394580. Online publication date: 1-Oct-2023.
  • (2023) A high-performance and lightweight framework for real-time facial expression recognition. IET Image Processing 17:12, pp. 3500-3509. DOI: 10.1049/ipr2.12881. Online publication date: 26-Jul-2023.
  • (2023) A comparative investigation of machine learning algorithms for predicting safety signs comprehension based on socio-demographic factors and cognitive sign features. Scientific Reports 13:1. DOI: 10.1038/s41598-023-38065-1. Online publication date: 5-Jul-2023.
  • (2023) Cross-view adaptive graph attention network for dynamic facial expression recognition. Multimedia Systems 29:5, pp. 2715-2728. DOI: 10.1007/s00530-023-01122-w. Online publication date: 14-Jun-2023.
  • (2023) SoftClusterMix: learning soft boundaries for empirical risk minimization. Neural Computing and Applications 35:16, pp. 12039-12053. DOI: 10.1007/s00521-023-08338-x. Online publication date: 14-Feb-2023.
  • (2022) Facial expression recognition based on improved residual network. 2nd International Conference on Information Technology and Intelligent Control (CITIC 2022), p. 21. DOI: 10.1117/12.2653443. Online publication date: 27-Sep-2022.
  • (2022) Weighted contrastive learning using pseudo labels for facial expression recognition. The Visual Computer 39:10, pp. 5001-5012. DOI: 10.1007/s00371-022-02642-8. Online publication date: 26-Aug-2022.
  • (2021) Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models. Sensors 21:7, 2344. DOI: 10.3390/s21072344. Online publication date: 27-Mar-2021.
