Computer Science > Computer Vision and Pattern Recognition

arXiv:2409.02371 (cs)

[Submitted on 4 Sep 2024 (v1), last revised 7 Sep 2024 (this version, v2)]

Title: Unfolding Videos Dynamics via Taylor Expansion

Authors: Siyi Chen, Minkyu Choi, Zesen Zhao, Kuan Han, Qing Qu, Zhongming Liu

Abstract: Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR), pretrain on UCF101 or Kinetics, and test on standard benchmarks including video retrieval, action recognition, and action detection. Performance improves by a significant margin on these benchmarks without the need for large models or extensive datasets.
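
To make the idea of "temporal derivatives of a frame sequence" concrete, here is a minimal sketch. The abstract does not say how the derivatives are computed, so forward finite differences with unit frame spacing are an assumption here, and the function name `temporal_derivatives` is hypothetical rather than taken from the authors' code. For reference, the Taylor expansion the abstract alludes to is x(t + dt) = x(t) + x'(t) dt + (1/2) x''(t) dt^2 + ..., so the original frames and their first and second differences approximate its first three terms.

```python
import numpy as np


def temporal_derivatives(frames: np.ndarray, max_order: int = 2) -> list[np.ndarray]:
    """Return [frames, 1st difference, ..., max_order-th difference] along time.

    frames: array of shape (T, H, W, C). Assumes unit frame spacing (dt = 1)
    and forward differences; both are illustrative assumptions.
    """
    if frames.shape[0] <= max_order:
        raise ValueError("need more than max_order frames")
    views = [frames.astype(np.float32)]
    for _ in range(max_order):
        # Forward difference: d[t] = x[t + 1] - x[t]; each order shortens T by 1.
        views.append(np.diff(views[-1], axis=0))
    return views


# Toy clip: one pixel follows x(t) = t^2 over a static background of value 5.
T, H, W = 6, 4, 4
clip = np.full((T, H, W, 1), 5.0, dtype=np.float32)
clip[:, 0, 0, 0] = np.arange(T) ** 2

x, dx, ddx = temporal_derivatives(clip, max_order=2)
```

In this toy clip the static background differences to zero at every order, while the moving pixel keeps a nonzero first and second difference. This matches the abstract's intuition that derivative views emphasize motion over static content.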

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2409.02371 [cs.CV]
	(or arXiv:2409.02371v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.02371

Submission history

From: Siyi Chen
[v1] Wed, 4 Sep 2024 01:41:09 UTC (18,622 KB)
[v2] Sat, 7 Sep 2024 16:15:11 UTC (18,622 KB)
