DOI: 10.1145/3382507.3417966

Fusical: Multimodal Fusion for Video Sentiment

Published: 22 October 2020

Abstract

Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge, where the task is to classify group videos containing large variations in language, people, and environment into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. We further develop novel combinations of modalities, such as laughter detection and image captioning, together with transfer learning. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9%, which is 16 percentage points higher than that of the baseline ensemble.
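To make the fusion step concrete, the sketch below shows one way an FC fusion ensemble over independently trained modality models can be wired up. This is a minimal illustration, not the authors' implementation; the modality count, class ordering, layer sizes, and dropout rate are assumptions.

```python
# Hypothetical late-fusion sketch: each modality model emits a 3-way softmax
# (e.g. Positive / Neutral / Negative); a small fully-connected (FC) head
# learns to combine them. All names and sizes here are illustrative only.
import torch
import torch.nn as nn

NUM_CLASSES = 3        # assumed sentiment classes
NUM_MODALITIES = 4     # e.g. scene, pose, audio embedding, caption embedding

class FCFusion(nn.Module):
    def __init__(self, num_modalities=NUM_MODALITIES, num_classes=NUM_CLASSES):
        super().__init__()
        in_dim = num_modalities * num_classes  # concatenated per-modality probabilities
        self.head = nn.Sequential(
            nn.Linear(in_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, num_classes),
        )

    def forward(self, modality_probs):
        # modality_probs: list of (batch, num_classes) tensors, one per modality
        x = torch.cat(modality_probs, dim=1)
        return self.head(x)  # logits for the fused prediction

# Usage: fuse frozen per-modality predictions for a batch of 8 clips.
fusion = FCFusion()
probs = [torch.softmax(torch.randn(8, NUM_CLASSES), dim=1) for _ in range(NUM_MODALITIES)]
pred = fusion(probs).argmax(dim=1)  # predicted sentiment class per clip
```

Keeping the fusion head this small reflects the general idea of ensembling already-trained modality outputs rather than learning a joint representation from scratch.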

Supplementary Material

MP4 File (3382507.3417966.mp4)
Fusical introduces two novel modalities, laughter detection and image captioning, to enhance multimodal video sentiment analysis. We show how we incorporate these modalities and compare our independent-modality and fusion results. Through Fusical, we encourage future scientists to think critically about how our own senses and cognition can inspire creative deep-learning methods for affective computing.

We are on GitHub: https://github.com/fusical/emotiw
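As a rough sketch of how generated captions can feed a sentiment model, the snippet below mean-pools word vectors over per-frame captions to produce a clip-level feature. `generate_caption`, `word_vectors`, and the 300-dimensional default are hypothetical stand-ins, not the pipeline used in the repository above.

```python
# Hedged sketch of the image-caption modality: caption each sampled frame,
# embed the caption's words, and mean-pool into one clip-level feature.
from typing import Callable, Dict, List
import numpy as np

def caption_embedding(frames: List[np.ndarray],
                      generate_caption: Callable[[np.ndarray], str],
                      word_vectors: Dict[str, np.ndarray],
                      dim: int = 300) -> np.ndarray:
    """Mean-pool word vectors over all captions produced for a clip."""
    vectors = []
    for frame in frames:
        caption = generate_caption(frame)   # e.g. "a group of people laughing"
        for token in caption.lower().split():
            if token in word_vectors:
                vectors.append(word_vectors[token])
    if not vectors:                         # no known words: fall back to zeros
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```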


Cited By

  • (2024) Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos. Pattern Recognition, pp. 403-418. https://doi.org/10.1007/978-3-031-78305-0_26. Online publication date: 4 December 2024.
  • (2023) A recent survey on perceived group sentiment analysis. Journal of Visual Communication and Image Representation, 97:103988. https://doi.org/10.1016/j.jvcir.2023.103988. Online publication date: December 2023.
  • (2022) Level fusion analysis of recurrent audio and video neural network for violence detection in railway. 2022 30th European Signal Processing Conference (EUSIPCO), pp. 563-567. https://doi.org/10.23919/EUSIPCO55093.2022.9909622. Online publication date: 29 August 2022.


Information

Published In

ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
October 2020
920 pages
ISBN:9781450375818
DOI:10.1145/3382507
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2020


Author Tags

  1. affective computing
  2. computer vision
  3. emotion
  4. emotiw
  5. ensemble
  6. facial
  7. fer
  8. fusion
  9. image-captioning
  10. laughter
  11. multimodal sentiment classification
  12. neural networks
  13. pose
  14. word embeddings

Qualifiers

  • Research-article

Conference

ICMI '20: International Conference on Multimodal Interaction
October 25-29, 2020
Virtual Event, Netherlands

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 57
  • Downloads (last 6 weeks): 2

Reflects downloads up to 21 Dec 2024

