DOI: 10.1145/3474085.3475493
research-article

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

Published: 17 October 2021

Abstract

Recognizing the emotional state of people is a basic but challenging task in video understanding. In this paper, we propose a new task in this field, named Pairwise Emotional Relationship Recognition (PERR), which aims to recognize the emotional relationship between two interacting characters in a given video clip. It differs from traditional emotion and social relation recognition tasks: diverse sources of information, including character appearance, behavior, facial emotion, dialogue, background music, and subtitles, contribute differently to the final result, which makes the task more challenging but also meaningful for developing more advanced multi-modal models. To facilitate the task, we build a new dataset called Emotional RelAtionship of inTeractiOn (ERATO) from dramas and movies. ERATO is a large-scale multi-modal dataset for the PERR task, comprising 31,182 video clips and about 203 hours of video. Unlike existing datasets, ERATO contains interaction-centric videos with multiple shots, varied clip lengths, and multiple modalities including visual, audio, and text. In addition, we propose a baseline model built around a Synchronous Modal-Temporal Attention (SMTA) unit that fuses the multi-modal information for the PERR task. Compared with other prevailing attention mechanisms, the proposed SMTA steadily improves performance by about 1%. We expect ERATO and the proposed SMTA to open up a new direction for the PERR task in video understanding and to further advance research on multi-modal fusion.
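This landing page does not include the model code. Purely as an illustrative sketch of the kind of modal-temporal attention fusion the abstract describes, the PyTorch snippet below attends over time within each modality and then across modality tokens before classifying the clip. Every name, dimension, and the number of relationship classes here is an assumption made for the example; it is not the authors' SMTA implementation.

# Illustrative sketch only -- NOT the authors' SMTA unit. All names,
# dimensions, and the number of output classes are placeholders.
import torch
import torch.nn as nn

class ModalTemporalFusion(nn.Module):
    """Toy fusion block: attention over time within each modality,
    then attention across modality tokens, then classification."""
    def __init__(self, dim=256, num_heads=4, num_classes=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each of shape (batch, time, dim)
        pooled = []
        for feats in modality_feats:
            attended, _ = self.temporal_attn(feats, feats, feats)  # temporal self-attention
            pooled.append(attended.mean(dim=1))                    # (batch, dim)
        tokens = torch.stack(pooled, dim=1)                        # (batch, n_modalities, dim)
        fused, _ = self.modal_attn(tokens, tokens, tokens)         # cross-modal attention
        return self.classifier(fused.mean(dim=1))                  # (batch, num_classes)

# Example with three modalities of different sequence lengths:
visual = torch.randn(2, 32, 256)  # e.g. frame-level visual features
audio = torch.randn(2, 50, 256)   # e.g. audio segment features
text = torch.randn(2, 20, 256)    # e.g. subtitle token features
logits = ModalTemporalFusion()([visual, audio, text])
print(logits.shape)  # torch.Size([2, 4])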

Supplementary Material

ZIP File (mfp1780aux.zip)
Supplementary Material for "Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark"





      Published In

      MM '21: Proceedings of the 29th ACM International Conference on Multimedia
      October 2021
      5796 pages
      ISBN: 9781450386517
      DOI: 10.1145/3474085
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. dataset
      2. emotional relationship
      3. modal-temporal attention
      4. multi-modal learning

      Qualifiers

      • Research-article

      Conference

      MM '21: ACM Multimedia Conference
      October 20 - 24, 2021
      Virtual Event, China

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


      Article Metrics

      • Downloads (Last 12 months): 45
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 03 Jan 2025

      Cited By
      • (2024) MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12830-12840. DOI: 10.1109/CVPR52733.2024.01219. Online publication date: 16-Jun-2024.
      • (2023) Social Relation Atmosphere Recognition with Relevant Visual Concepts. IEICE Transactions on Information and Systems, E106.D:10, 1638-1649. DOI: 10.1587/transinf.2023PCP0008. Online publication date: 1-Oct-2023.
      • (2023) Pairwise-Emotion Data Distribution Smoothing for Emotion Recognition. Pattern Recognition and Computer Vision, 164-175. DOI: 10.1007/978-981-99-8435-0_13. Online publication date: 24-Dec-2023.
      • (2022) Emotion Analysis and Dialogue Breakdown Detection in Dialogue of Chat Systems Based on Deep Neural Networks. Electronics, 11:5, 695. DOI: 10.3390/electronics11050695. Online publication date: 24-Feb-2022.
      • (2022) Enlarging the Long-time Dependencies via RL-based Memory Network in Movie Affective Analysis. Proceedings of the 30th ACM International Conference on Multimedia, 5739-5750. DOI: 10.1145/3503161.3548076. Online publication date: 10-Oct-2022.
