Content-based video parsing and indexing based on audio-visual interaction

S Tsekeridou, I Pitas - IEEE transactions on circuits and systems …, 2001 - ieeexplore.ieee.org
Audio-source parsing and indexing leads to the extraction of a speaker … parsing and
indexing results in the extraction of a talking-face shot mapping over time. Integration of the audio

[BOOK] Content-based audio classification and retrieval for audiovisual data parsing

T Zhang, CCJ Kuo - 2013 - books.google.com
PARSING Based on observations described above, we propose a scheme for video content
parsing by using video models together with audio … demultiplexed into the audio data stream …

Unified multisensory perception: Weakly-supervised audio-visual video parsing

Y Tian, D Li, C Xu - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
… Thus, we introduce a Look, Listen, and Parse dataset for audio-visual video scene parsing,
which contains 11,849 YouTube video clips spanning over 25 categories for a total of 32.9 h …

Exploring heterogeneous clues for weakly-supervised audio-visual video parsing

Y Wu, Y Yang - Proceedings of the IEEE/CVF Conference …, 2021 - openaccess.thecvf.com
… In addition, we also evaluate the overall audio-visual scene parsing performance of our
method by computing aggregated results, i.e., “Type@AV” and “Event@AV”. Specifically, Type@…

Video content parsing based on combined audio and visual information

T Zhang, CCJ Kuo - Multimedia Storage and Archiving …, 1999 - spiedigitallibrary.org
… for automatic parsing of video data based on both audio and visual content analysis. The
accompanying audio signal in video is segmented and classified into basic audio types such …

Multi-modal grouping network for weakly-supervised audio-visual video parsing

S Mo, Y Tian - Advances in Neural Information Processing …, 2022 - proceedings.neurips.cc
… For these non-aligned cases, audio signals … audio-visual video parsing (AVVP) task [3] that
aims to parse a video into temporal event segments and predict the audible, visible, or audio-…

Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing

YB Lin, HY Tseng, HY Lee, YY Lin… - Advances in Neural …, 2021 - proceedings.neurips.cc
… The audio-visual video parsing task aims to temporally parse a video into … audio and visual
parsing branches for the weakly-supervised audio-visual video parsing task as a baseline. …

Label-anticipated event disentanglement for audio-visual video parsing

J Zhou, D Guo, Y Mao, Y Zhong, X Chang… - European Conference on …, 2025 - Springer
… improved performances across audio, visual, and audio-visual event parsing, in contrast to
… metrics, indicative of the comprehensive audio and visual event parsing performance, at the …

Improving audio-visual video parsing with pseudo visual labels

J Zhou, D Guo, Y Zhong, M Wang - arXiv preprint arXiv:2303.02344, 2023 - arxiv.org
… in parsing the audio events and audio-visual events. … comparable performance in audio
event parsing but generally … for the visual event and audio-visual event parsing. These results …

DHHN: Dual hierarchical hybrid network for weakly-supervised audio-visual video parsing

X Jiang, X Xu, Z Chen, J Zhang, J Song… - Proceedings of the 30th …, 2022 - dl.acm.org
… Similar to the previous works [19, 31, 36], we evaluate our method with F-score by parsing
all types of events (audio, visual, and audio-visual events) under both segment-level and event…