Computer Science > Computer Vision and Pattern Recognition

arXiv:2401.06287 (cs)

[Submitted on 11 Jan 2024 (v1), last revised 7 Jun 2024 (this version, v3)]

Title: Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition

Authors: Yukun Zuo, Hantao Yao, Liansheng Zhuang, Changsheng Xu

Abstract: Audio-visual video recognition (AVVR) aims to integrate audio and visual cues to categorize videos accurately. While existing methods train AVVR models using provided datasets and achieve satisfactory results, they struggle to retain historical class knowledge when confronted with new classes in real-world situations. Currently, there are no dedicated methods for addressing this problem, so this paper concentrates on exploring Class Incremental Audio-Visual Video Recognition (CIAVVR). For CIAVVR, since both the stored data and the learned model of past classes contain historical knowledge, the core challenge is how to capture past data knowledge and past model knowledge to prevent catastrophic forgetting. We introduce Hierarchical Augmentation and Distillation (HAD), which comprises the Hierarchical Augmentation Module (HAM) and the Hierarchical Distillation Module (HDM) to efficiently utilize the hierarchical structure of data and models, respectively. Specifically, HAM implements a novel augmentation strategy, segmental feature augmentation, to preserve hierarchical model knowledge. Meanwhile, HDM introduces newly designed hierarchical (video-distribution) logical distillation and hierarchical (snippet-video) correlative distillation to capture and maintain the hierarchical intra-sample knowledge of each sample and the hierarchical inter-sample knowledge between samples, respectively. Evaluations on four benchmarks (AVE, AVK-100, AVK-200, and AVK-400) demonstrate that the proposed HAD effectively captures hierarchical information in both data and models, resulting in better preservation of historical class knowledge and improved performance. Furthermore, we provide a theoretical analysis to support the necessity of the segmental feature augmentation strategy.

Comments:	Accepted by TPAMI
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.06287 [cs.CV]
	(or arXiv:2401.06287v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.06287

Submission history

From: Yukun Zuo
[v1] Thu, 11 Jan 2024 23:00:24 UTC (378 KB)
[v2] Wed, 10 Apr 2024 18:16:32 UTC (7,251 KB)
[v3] Fri, 7 Jun 2024 00:50:18 UTC (7,251 KB)
