DOI: 10.1145/3382507.3417966

Fusical: Multimodal Fusion for Video Sentiment

Published: 22 October 2020

Abstract

Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge, where the task is to classify group videos containing large variations in language, people, and environment into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. We further develop novel combinations of modalities, such as laughter detection and image captioning, together with transfer learning. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9%, which is 16 percentage points higher than that of the baseline ensemble.
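To make the fusion step concrete, the sketch below shows one way an FC fusion ensemble over independently trained modality models can be wired up. This is a minimal illustration, not the authors' implementation; the modality count, class ordering, layer sizes, and dropout rate are assumptions.

```python
# Hypothetical late-fusion sketch: each modality model emits a 3-way softmax
# (e.g. Positive / Neutral / Negative); a small fully-connected (FC) head
# learns to combine them. All names and sizes here are illustrative only.
import torch
import torch.nn as nn

NUM_CLASSES = 3        # assumed sentiment classes
NUM_MODALITIES = 4     # e.g. scene, pose, audio embedding, caption embedding

class FCFusion(nn.Module):
    def __init__(self, num_modalities=NUM_MODALITIES, num_classes=NUM_CLASSES):
        super().__init__()
        in_dim = num_modalities * num_classes  # concatenated per-modality probabilities
        self.head = nn.Sequential(
            nn.Linear(in_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, num_classes),
        )

    def forward(self, modality_probs):
        # modality_probs: list of (batch, num_classes) tensors, one per modality
        x = torch.cat(modality_probs, dim=1)
        return self.head(x)  # logits for the fused prediction

# Usage: fuse frozen per-modality predictions for a batch of 8 clips.
fusion = FCFusion()
probs = [torch.softmax(torch.randn(8, NUM_CLASSES), dim=1) for _ in range(NUM_MODALITIES)]
pred = fusion(probs).argmax(dim=1)  # predicted sentiment class per clip
```

Keeping the fusion head this small reflects the general idea of ensembling already-trained modality outputs rather than learning a joint representation from scratch.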

Supplementary Material

MP4 File (3382507.3417966.mp4)
Fusical introduces two novel modalities, laughter detection and image captioning, to enhance multimodal video sentiment analysis. We show how we incorporate these modalities and compare our independent-modality and fusion results. Through Fusical, we encourage future scientists to think critically about how our own senses and cognition can inspire creative deep-learning methods for affective computing.

We are on GitHub: https://github.com/fusical/emotiw
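As a rough sketch of how generated captions can feed a sentiment model, the snippet below mean-pools word vectors over per-frame captions to produce a clip-level feature. `generate_caption`, `word_vectors`, and the 300-dimensional default are hypothetical stand-ins, not the pipeline used in the repository above.

```python
# Hedged sketch of the image-caption modality: caption each sampled frame,
# embed the caption's words, and mean-pool into one clip-level feature.
from typing import Callable, Dict, List
import numpy as np

def caption_embedding(frames: List[np.ndarray],
                      generate_caption: Callable[[np.ndarray], str],
                      word_vectors: Dict[str, np.ndarray],
                      dim: int = 300) -> np.ndarray:
    """Mean-pool word vectors over all captions produced for a clip."""
    vectors = []
    for frame in frames:
        caption = generate_caption(frame)   # e.g. "a group of people laughing"
        for token in caption.lower().split():
            if token in word_vectors:
                vectors.append(word_vectors[token])
    if not vectors:                         # no known words: fall back to zeros
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```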


Cited By

  • (2024) Fusing Multimodal Streams for Improved Group Emotion Recognition in Videos. Pattern Recognition, pp. 403-418. https://doi.org/10.1007/978-3-031-78305-0_26. Online publication date: 4 December 2024.
  • (2023) A recent survey on perceived group sentiment analysis. Journal of Visual Communication and Image Representation, 97:103988. https://doi.org/10.1016/j.jvcir.2023.103988. Online publication date: December 2023.
  • (2022) Level fusion analysis of recurrent audio and video neural network for violence detection in railway. 2022 30th European Signal Processing Conference (EUSIPCO), pp. 563-567. https://doi.org/10.23919/EUSIPCO55093.2022.9909622. Online publication date: 29 August 2022.


Information

Published In

ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
October 2020
920 pages
ISBN:9781450375818
DOI:10.1145/3382507
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2020


Author Tags

  1. affective computing
  2. computer vision
  3. emotion
  4. emotiw
  5. ensemble
  6. facial
  7. fer
  8. fusion
  9. image-captioning
  10. laughter
  11. multimodal sentiment classification
  12. neural networks
  13. pose
  14. word embeddings

Qualifiers

  • Research-article

Conference

ICMI '20: International Conference on Multimodal Interaction
October 25-29, 2020
Virtual Event, Netherlands

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 57
  • Downloads (last 6 weeks): 2

Reflects downloads up to 21 Dec 2024

