Video-based emotion recognition in the wild using deep transfer learning and score fusion

Published: 01 September 2017

Abstract

Multimodal recognition of affective states is a difficult problem unless the recording conditions are carefully controlled. For recognition in the wild, large variations in face pose and illumination, cluttered backgrounds, occlusions, audio and video noise, and the subtlety of expressive cues are among the challenges to address. In this paper, we describe a multimodal approach for video-based emotion recognition in the wild. We propose using summarizing functionals of complementary visual descriptors for video modeling. These features include deep convolutional neural network (CNN) based features obtained via transfer learning, for which we illustrate the importance of flexible registration and fine-tuning. Our approach combines audio and visual features with least squares regression based classifiers and weighted score-level fusion. We report state-of-the-art results on the EmotiW Challenge for in-the-wild facial expression recognition. Our approach scales to other problems, and ranked first in the ChaLearn-LAP First Impressions Challenge 2016, which used video clips collected in the wild.

Highlights

  • We present transfer learning strategies for robust emotion recognition in the wild.
  • We compare and contrast a set of visual descriptors and video modeling methods.
  • We propose a small but effective set of summarizing functionals for video modeling.
  • We compare feature and score level fusion alternatives.
  • We report state-of-the-art results on the EmotiW, ChaLearn LAP FI, and CK+ corpora.
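To make the pipeline the abstract describes concrete, here is a minimal sketch of its two core steps: pooling frame-level descriptors into a fixed-length video representation with summarizing functionals, and combining per-modality classifier scores with weighted score-level fusion. This is not the authors' code; the choice of functionals (mean and standard deviation), the dimensions, the class count, and the fusion weights are illustrative assumptions.

```python
import numpy as np

def summarize(frame_feats):
    """Pool a (num_frames, feat_dim) matrix of frame-level descriptors
    (e.g. CNN activations) into one fixed-length video descriptor using
    summarizing functionals; mean and standard deviation shown here."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

def fuse_scores(score_matrices, weights):
    """Weighted score-level fusion: each score matrix has shape
    (num_samples, num_classes) and comes from one modality-specific
    classifier; in practice the weights are tuned on a validation set."""
    fused = sum(w * s for w, s in zip(weights, score_matrices))
    return fused.argmax(axis=1)  # predicted class index per sample

# Toy usage with random arrays standing in for real features and scores.
rng = np.random.default_rng(seed=0)
video_descriptor = summarize(rng.normal(size=(120, 4096)))  # shape (8192,)
audio_scores = rng.normal(size=(10, 7))    # 10 clips, 7 emotion classes
visual_scores = rng.normal(size=(10, 7))
predictions = fuse_scores([audio_scores, visual_scores], weights=[0.4, 0.6])
```

In the paper's setting, the per-modality score matrices would come from least squares based learners (the author tags mention kernel extreme learning machines and partial least squares), but any classifier that outputs per-class scores can be fused this way.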

Published In

Image and Vision Computing, Volume 65, Issue C
September 2017
73 pages

Publisher

Butterworth-Heinemann

United States

Author Tags

  1. Convolutional neural networks
  2. EmotiW
  3. Emotion recognition in the wild
  4. Kernel extreme learning machine
  5. Multimodal fusion
  6. Partial least squares

Qualifiers

  • Research-article

Cited By

  • (2024) Facial Expression Recognition Using a Semantic-Based Bottleneck Attention Module. International Journal on Semantic Web & Information Systems 20(1), 1-25. DOI: 10.4018/IJSWIS.352418. Online publication date: 17-Sep-2024.
  • (2024) Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition. IEEE Transactions on Affective Computing 15(3), 1659-1668. DOI: 10.1109/TAFFC.2024.3366767. Online publication date: 1-Jul-2024.
  • (2023) Monitoring Application-driven Continuous Affect Recognition From Video Frames. Proceedings of the 2023 5th International Conference on Image, Video and Signal Processing, 36-42. DOI: 10.1145/3591156.3591161. Online publication date: 24-Mar-2023.
  • (2023) Adversarial Domain Generalized Transformer for Cross-Corpus Speech Emotion Recognition. IEEE Transactions on Affective Computing 15(2), 697-708. DOI: 10.1109/TAFFC.2023.3290795. Online publication date: 29-Jun-2023.
  • (2023) Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition. IEEE Transactions on Affective Computing 14(4), 3231-3243. DOI: 10.1109/TAFFC.2023.3258900. Online publication date: 1-Oct-2023.
  • (2023) SSA-ICL. Neural Networks 158(C), 228-238. DOI: 10.1016/j.neunet.2022.11.025. Online publication date: 1-Jan-2023.
  • (2023) Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects. Knowledge-Based Systems 261(C). DOI: 10.1016/j.knosys.2022.110219. Online publication date: 15-Feb-2023.
  • (2023) Emotion assessment and application in human–computer interaction interface based on backpropagation neural network and artificial bee colony algorithm. Expert Systems with Applications: An International Journal 232(C). DOI: 10.1016/j.eswa.2023.120857. Online publication date: 1-Dec-2023.
  • (2023) Meta-transfer learning for emotion recognition. Neural Computing and Applications 35(14), 10535-10549. DOI: 10.1007/s00521-023-08248-y. Online publication date: 24-Jan-2023.
  • (2022) Sparse Spatial-Temporal Emotion Graph Convolutional Network for Video Emotion Recognition. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/3518879. Online publication date: 1-Jan-2022.