DOI: 10.1145/3242969.3264980
short-paper

An Occam's Razor View on Learning Audiovisual Emotion Recognition with Small Training Sets

Published: 02 October 2018

Abstract

This paper presents a light-weight and accurate deep neural model for audiovisual emotion recognition. To design this model, the authors followed a philosophy of simplicity, drastically limiting the number of parameters learned from the target datasets and always choosing the simplest learning methods: i) transfer learning and low-dimensional space embeddings reduce the dimensionality of the representations; ii) visual temporal information is handled by a simple score-per-frame process averaged across time; iii) a simple frame-selection mechanism weights the images within each sequence; iv) the different modalities are fused at prediction level (late fusion). The paper also highlights the inherent challenges of the AFEW dataset and the difficulty of model selection with as few as 383 validation sequences. The proposed real-time emotion classifier achieved a state-of-the-art accuracy of 60.64% on the AFEW test set and ranked 4th at the Emotion in the Wild (EmotiW) 2018 challenge.
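To make the pipeline concrete, the following is a minimal NumPy sketch of the three decision-level ingredients named above: per-frame scoring averaged across time, a frame-weighting mechanism, and late fusion of the audio and visual predictions. The 128-dimensional embeddings, the confidence-based weighting rule, and the fusion coefficient alpha are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the decision pipeline described in the abstract.
# Embedding size, the confidence-based frame weighting, and the fusion
# coefficient are illustrative assumptions, not the paper's exact design.
import numpy as np

NUM_CLASSES = 7  # AFEW uses seven emotion categories


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def score_frames(frame_embeddings, classifier):
    """Score each frame independently: no recurrent model, the temporal
    dimension is handled afterwards by simple statistics."""
    return np.stack([classifier(e) for e in frame_embeddings])  # (T, C)


def weight_frames(frame_scores):
    """Hypothetical frame-selection mechanism: weight every frame by the
    confidence (max class score) of its own prediction, then average."""
    conf = frame_scores.max(axis=1)                  # (T,)
    w = conf / conf.sum()                            # weights sum to 1
    return (w[:, None] * frame_scores).sum(axis=0)   # video-level scores (C,)


def late_fusion(visual_scores, audio_scores, alpha=0.5):
    """Prediction-level (late) fusion: a convex combination of the two
    modalities; alpha would be tuned on the validation set."""
    return alpha * visual_scores + (1.0 - alpha) * audio_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(NUM_CLASSES, 128))        # toy linear classifier head
    clf = lambda e: softmax(W @ e)
    frames = rng.normal(size=(16, 128))            # 16 frames, 128-d embeddings
    visual = weight_frames(score_frames(frames, clf))
    audio = softmax(rng.normal(size=NUM_CLASSES))  # stand-in audio prediction
    print(late_fusion(visual, audio).argmax())     # fused emotion class index
```

This mirrors the paper's stated philosophy: with pretrained embeddings fixed, only a small classification head and the fusion weight remain to be learned from the target data, which keeps model selection tractable on a small validation set.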





Published In

ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
October 2018
687 pages
ISBN:9781450356923
DOI:10.1145/3242969
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 October 2018


Author Tags

  1. deep learning
  2. emotion recognition

Qualifiers

  • Short-paper

Conference

ICMI '18
Sponsor:
  • SIGCHI

Acceptance Rates

ICMI '18 Paper Acceptance Rate 63 of 149 submissions, 42%;
Overall Acceptance Rate 453 of 1,080 submissions, 42%


Cited By

  • (2024) Facial Expression Recognition with Multi-level Integration Disentangled Generative Adversarial Network. 2024 IEEE International Conference on Industrial Technology (ICIT), pp. 1-6. DOI: 10.1109/ICIT58233.2024.10540810. Online publication date: 25-Mar-2024.
  • (2024) Exploring contactless techniques in multimodal emotion recognition: insights into diverse applications, challenges, solutions, and prospects. Multimedia Systems 30:3. DOI: 10.1007/s00530-024-01302-2. Online publication date: 6-Apr-2024.
  • (2023) Self-Adaptive Facial Expression Recognition Based on Local Feature Augmentation and Global Information Correlation. 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1271-1276. DOI: 10.1109/SMC53992.2023.10394580. Online publication date: 1-Oct-2023.
  • (2023) A high-performance and lightweight framework for real-time facial expression recognition. IET Image Processing 17:12, pp. 3500-3509. DOI: 10.1049/ipr2.12881. Online publication date: 26-Jul-2023.
  • (2023) A comparative investigation of machine learning algorithms for predicting safety signs comprehension based on socio-demographic factors and cognitive sign features. Scientific Reports 13:1. DOI: 10.1038/s41598-023-38065-1. Online publication date: 5-Jul-2023.
  • (2023) Cross-view adaptive graph attention network for dynamic facial expression recognition. Multimedia Systems 29:5, pp. 2715-2728. DOI: 10.1007/s00530-023-01122-w. Online publication date: 14-Jun-2023.
  • (2023) SoftClusterMix: learning soft boundaries for empirical risk minimization. Neural Computing and Applications 35:16, pp. 12039-12053. DOI: 10.1007/s00521-023-08338-x. Online publication date: 14-Feb-2023.
  • (2022) Facial expression recognition based on improved residual network. 2nd International Conference on Information Technology and Intelligent Control (CITIC 2022), p. 21. DOI: 10.1117/12.2653443. Online publication date: 27-Sep-2022.
  • (2022) Weighted contrastive learning using pseudo labels for facial expression recognition. The Visual Computer 39:10, pp. 5001-5012. DOI: 10.1007/s00371-022-02642-8. Online publication date: 26-Aug-2022.
  • (2021) Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models. Sensors 21:7, 2344. DOI: 10.3390/s21072344. Online publication date: 27-Mar-2021.
