Video-based emotion recognition in the wild using deep transfer learning and score fusion

Published: 01 September 2017

Abstract

Multimodal recognition of affective states is a difficult problem unless the recording conditions are carefully controlled. For recognition in the wild, large variations in face pose and illumination, cluttered backgrounds, occlusions, audio and video noise, and the subtlety of expressive cues are among the challenges to address. In this paper, we describe a multimodal approach for video-based emotion recognition in the wild. We propose using summarizing functionals of complementary visual descriptors for video modeling. These features include deep convolutional neural network (CNN) based features obtained via transfer learning, for which we illustrate the importance of flexible registration and fine-tuning. Our approach combines audio and visual features with least squares regression based classifiers and weighted score-level fusion. We report state-of-the-art results on the EmotiW Challenge for in-the-wild facial expression recognition. Our approach scales to other problems, and ranked first in the ChaLearn-LAP First Impressions Challenge 2016, which used video clips collected in the wild.

Highlights

  • We present transfer learning strategies for robust emotion recognition in the wild.
  • We compare and contrast a set of visual descriptors and video modeling methods.
  • We propose a small but effective set of summarizing functionals for video modeling.
  • We compare feature and score level fusion alternatives.
  • We report state-of-the-art results on the EmotiW, ChaLearn LAP FI, and CK+ corpora.
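To make the pipeline the abstract describes concrete, here is a minimal sketch of its two core steps: pooling frame-level descriptors into a fixed-length video representation with summarizing functionals, and combining per-modality classifier scores with weighted score-level fusion. This is not the authors' code; the choice of functionals (mean and standard deviation), the dimensions, the class count, and the fusion weights are illustrative assumptions.

```python
import numpy as np

def summarize(frame_feats):
    """Pool a (num_frames, feat_dim) matrix of frame-level descriptors
    (e.g. CNN activations) into one fixed-length video descriptor using
    summarizing functionals; mean and standard deviation shown here."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

def fuse_scores(score_matrices, weights):
    """Weighted score-level fusion: each score matrix has shape
    (num_samples, num_classes) and comes from one modality-specific
    classifier; in practice the weights are tuned on a validation set."""
    fused = sum(w * s for w, s in zip(weights, score_matrices))
    return fused.argmax(axis=1)  # predicted class index per sample

# Toy usage with random arrays standing in for real features and scores.
rng = np.random.default_rng(seed=0)
video_descriptor = summarize(rng.normal(size=(120, 4096)))  # shape (8192,)
audio_scores = rng.normal(size=(10, 7))    # 10 clips, 7 emotion classes
visual_scores = rng.normal(size=(10, 7))
predictions = fuse_scores([audio_scores, visual_scores], weights=[0.4, 0.6])
```

In the paper's setting, the per-modality score matrices would come from least squares based learners (the author tags mention kernel extreme learning machines and partial least squares), but any classifier that outputs per-class scores can be fused this way.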

Published In

Image and Vision Computing, Volume 65, Issue C
September 2017
73 pages

Publisher

Butterworth-Heinemann

United States

Author Tags

  1. Convolutional neural networks
  2. EmotiW
  3. Emotion recognition in the wild
  4. Kernel extreme learning machine
  5. Multimodal fusion
  6. Partial least squares

Qualifiers

  • Research-article

Cited By

  • (2024) Facial Expression Recognition Using a Semantic-Based Bottleneck Attention Module. International Journal on Semantic Web & Information Systems 20(1), 1-25. DOI: 10.4018/IJSWIS.352418. Online publication date: 17-Sep-2024.
  • (2024) Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition. IEEE Transactions on Affective Computing 15(3), 1659-1668. DOI: 10.1109/TAFFC.2024.3366767. Online publication date: 1-Jul-2024.
  • (2023) Monitoring Application-driven Continuous Affect Recognition From Video Frames. Proceedings of the 2023 5th International Conference on Image, Video and Signal Processing, 36-42. DOI: 10.1145/3591156.3591161. Online publication date: 24-Mar-2023.
  • (2023) Adversarial Domain Generalized Transformer for Cross-Corpus Speech Emotion Recognition. IEEE Transactions on Affective Computing 15(2), 697-708. DOI: 10.1109/TAFFC.2023.3290795. Online publication date: 29-Jun-2023.
  • (2023) Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition. IEEE Transactions on Affective Computing 14(4), 3231-3243. DOI: 10.1109/TAFFC.2023.3258900. Online publication date: 1-Oct-2023.
  • (2023) SSA-ICL. Neural Networks 158(C), 228-238. DOI: 10.1016/j.neunet.2022.11.025. Online publication date: 1-Jan-2023.
  • (2023) Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects. Knowledge-Based Systems 261(C). DOI: 10.1016/j.knosys.2022.110219. Online publication date: 15-Feb-2023.
  • (2023) Emotion assessment and application in human–computer interaction interface based on backpropagation neural network and artificial bee colony algorithm. Expert Systems with Applications: An International Journal 232(C). DOI: 10.1016/j.eswa.2023.120857. Online publication date: 1-Dec-2023.
  • (2023) Meta-transfer learning for emotion recognition. Neural Computing and Applications 35(14), 10535-10549. DOI: 10.1007/s00521-023-08248-y. Online publication date: 24-Jan-2023.
  • (2022) Sparse Spatial-Temporal Emotion Graph Convolutional Network for Video Emotion Recognition. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/3518879. Online publication date: 1-Jan-2022.