Computer Science > Computer Vision and Pattern Recognition

arXiv:2401.04210 (cs)

[Submitted on 8 Jan 2024]

Title: FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

Authors: Zhi-Song Liu, Robin Courant, Vicky Kalogeiton

Abstract: Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogues and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos. Unlike most methods that rely on ground truth data in the form of subtitles, in this work we exploit modalities that come naturally with videos: (a) video frames as they contain visual information indispensable for scene understanding, (b) audio as it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses and (c) text automatically extracted with a speech-to-text model as it can provide rich information when processed by a Large Language Model. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive experiments and analysis show that FunnyNet-W successfully exploits visual, auditory and textual cues to identify funny moments, while our findings reveal FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets the new state of the art for funny moment detection with multimodal cues on all datasets with and without using ground truth information.

Comments:	22 pages, 14 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2401.04210 [cs.CV]
	(or arXiv:2401.04210v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.04210

Submission history

From: Zhi-Song Liu
[v1] Mon, 8 Jan 2024 19:39:36 UTC (7,407 KB)
