DOI: 10.1145/3474085.3475647

Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network

Published: 17 October 2021

Abstract

Recognizing human emotions from videos has attracted significant attention in numerous computer vision and multimedia applications, such as human-computer interaction and health care. The task aims to understand human emotional responses, where candidate emotion categories are generally defined by specific psychological theories. However, as psychological theories develop, emotion categories become increasingly diverse and fine-grained, and samples for them become increasingly difficult to collect. In this paper, we investigate a new task of zero-shot video emotion recognition, which aims to recognize rare, unseen emotions. Specifically, we propose a novel multimodal protagonist-aware transformer network composed of two branches: one is equipped with a novel dynamic emotional attention mechanism and a visual transformer to learn better visual representations; the other is an acoustic transformer that learns discriminative acoustic representations. We align the visual and acoustic representations with semantic embeddings of fine-grained emotion labels by jointly mapping them into a common space under a noise contrastive estimation objective. Extensive experimental results on three datasets demonstrate the effectiveness of the proposed method.
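The two-branch alignment described in the abstract can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the feature dimensions (ResNet-style visual features, wav2vec-style acoustic features, GloVe-style label vectors), the mean-pooling over time, the additive fusion of the two modalities, and the InfoNCE-style loss standing in for the noise contrastive estimation objective are all placeholders, and the paper's dynamic emotional attention mechanism is omitted.

```python
# Minimal sketch of the two-branch alignment idea, assuming precomputed
# frame-level visual features, segment-level acoustic features, and word
# vectors for the emotion labels. All sizes and design choices here are
# illustrative, not the paper's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchAligner(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=768, label_dim=300, embed_dim=256):
        super().__init__()
        # Visual branch: transformer encoder over frame-level features.
        vis_layer = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=8, batch_first=True)
        self.visual_transformer = nn.TransformerEncoder(vis_layer, num_layers=2)
        # Acoustic branch: transformer encoder over audio-segment features.
        aud_layer = nn.TransformerEncoderLayer(d_model=aud_dim, nhead=8, batch_first=True)
        self.acoustic_transformer = nn.TransformerEncoder(aud_layer, num_layers=2)
        # Projections of both modalities and the label word vectors into one space.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.aud_proj = nn.Linear(aud_dim, embed_dim)
        self.label_proj = nn.Linear(label_dim, embed_dim)

    def forward(self, frames, audio, label_vectors):
        # frames: (B, T_v, vis_dim); audio: (B, T_a, aud_dim); label_vectors: (B, label_dim)
        v = self.visual_transformer(frames).mean(dim=1)      # temporal pooling
        a = self.acoustic_transformer(audio).mean(dim=1)
        video = F.normalize(self.vis_proj(v) + self.aud_proj(a), dim=-1)
        labels = F.normalize(self.label_proj(label_vectors), dim=-1)
        return video, labels


def info_nce_loss(video, labels, temperature=0.07):
    """InfoNCE-style stand-in for the noise contrastive estimation objective:
    each video is pulled toward its own label embedding and pushed away from
    the label embeddings of the other videos in the batch."""
    logits = video @ labels.t() / temperature                # (B, B) similarities
    targets = torch.arange(video.size(0), device=video.device)
    return F.cross_entropy(logits, targets)
```

Because the supervision acts only on distances in the shared space, labels never seen during training can still be matched at test time by comparing a video embedding against the semantic embeddings of the unseen emotion words.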

Supplementary Material

MP4 File (MM21-fp2596.mp4)
This video presents our proposed multimodal protagonist-aware transformer network for zero-shot video emotion recognition. Using a dynamic emotional attention mechanism and a visual transformer, we explicitly model the visual context of the protagonist in the video. By constructing an affective embedding space from multimodal features, we effectively bridge the affective gap between visual and semantic features. Experiments on four widely used video emotion datasets show that our method significantly outperforms state-of-the-art approaches for zero-shot video emotion recognition. In the future, we will explore more sophisticated multimodal transformers to further improve fine-grained emotion recognition under zero-shot learning (ZSL).
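Under the same assumptions, zero-shot inference reduces to a nearest-label search in the shared space: embed the test video, embed the word vectors of the unseen emotion labels, and pick the closest one. The helper below is hypothetical and reuses the TwoBranchAligner modules from the sketch above.

```python
# Hypothetical zero-shot prediction step, assuming a trained TwoBranchAligner
# and word vectors (e.g. GloVe) for the unseen emotion classes.
import torch
import torch.nn.functional as F


@torch.no_grad()
def predict_unseen(model, frames, audio, unseen_label_vectors, unseen_names):
    # Embed the test videos exactly as during training.
    v = model.visual_transformer(frames).mean(dim=1)
    a = model.acoustic_transformer(audio).mean(dim=1)
    video = F.normalize(model.vis_proj(v) + model.aud_proj(a), dim=-1)    # (B, d)
    # Embed every unseen emotion label into the same space.
    labels = F.normalize(model.label_proj(unseen_label_vectors), dim=-1)  # (C, d)
    scores = video @ labels.t()          # cosine similarity to each unseen class
    return [unseen_names[i] for i in scores.argmax(dim=-1).tolist()]
```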




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. affective computing
    2. multimodal
    3. zero-shot learning

    Qualifiers

    • Research-article

    Funding Sources

    • Beijing Natural Science Foundation
    • Key Research Program of Frontier Sciences of CAS
    • National Natural Science Foundation of China
    • National Key Research and Development Program of China
    • Pengcheng-Huami Joint Lab

    Conference

    MM '21
    Sponsor:
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 108
    • Downloads (Last 6 weeks): 9

    Reflects downloads up to 11 Dec 2024

    Cited By
    • (2024) Emotion Recognition from Videos Using Multimodal Large Language Models. Future Internet 16(7):247. DOI: 10.3390/fi16070247. Online publication date: 13-Jul-2024
    • (2024) Multimodal few-shot classification without attribute embedding. Journal on Image and Video Processing 2024:1. DOI: 10.1186/s13640-024-00620-9. Online publication date: 8-Jan-2024
    • (2024) A Versatile Multimodal Learning Framework for Zero-Shot Emotion Recognition. IEEE Transactions on Circuits and Systems for Video Technology 34(7):5728-5741. DOI: 10.1109/TCSVT.2024.3362270. Online publication date: Jul-2024
    • (2023) AffectFAL: Federated Active Affective Computing with Non-IID Data. Proceedings of the 31st ACM International Conference on Multimedia, pp. 871-882. DOI: 10.1145/3581783.3612442. Online publication date: 26-Oct-2023
    • (2023) Most Important Person-guided Dual-branch Cross-Patch Attention for Group Affect Recognition. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20541-20551. DOI: 10.1109/ICCV51070.2023.01883. Online publication date: 1-Oct-2023
    • (2023) A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions. IEEE Access 11:14804-14831. DOI: 10.1109/ACCESS.2023.3243854. Online publication date: 2023
    • (2023) Emotion Recognition from Videos Using Transformer Models. Computational Vision and Bio-Inspired Computing, pp. 45-56. DOI: 10.1007/978-981-19-9819-5_4. Online publication date: 8-Apr-2023
