This repository contains the code and implementation details for a series of experiments using CLIP and CLAP for zero-shot temporal action localization. The primary goal is to explore how multi-modal learning can detect and localize actions in videos without requiring extensive annotated training data.
## Table of Contents

- Introduction
- Dependencies and Installation
- Dataset Preparation
- Usage Instructions
- Experimental Results
- Citations
- License
## Introduction

This project leverages two powerful pre-trained models, CLIP (Contrastive Language-Image Pretraining) and CLAP (Contrastive Language-Audio Pretraining), to achieve zero-shot temporal action localization in videos. Both models learn joint embeddings from multi-modal data, such as image-text and audio-text pairs, and have demonstrated impressive zero-shot capabilities.

In our experiments, we explore different strategies for adapting these models to temporal action localization and detection without fine-tuning on specific action classes.
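As a concrete illustration of the CLIP branch, the sketch below scores sampled video frames against text prompts built from action-class names; runs of high-scoring frames can then be grouped into temporal segments. This is a minimal sketch, assuming OpenAI's `clip` package is installed (it is not in the dependency list below); the model variant, prompt template, and class names are illustrative, not the project's exact configuration.

```python
# Minimal sketch of zero-shot action scoring with CLIP (assumptions noted in the text above).
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative class names and prompt template; swap in the label set you actually use.
action_classes = ["high jump", "pole vault", "cricket bowling"]
prompts = clip.tokenize([f"a video frame of a person doing {c}" for c in action_classes]).to(device)

def score_frames(frames):
    """Return a (num_frames, num_classes) cosine-similarity matrix for a list of PIL frames."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(images)
        text_feats = model.encode_text(prompts)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).cpu()

# Usage: frames = [Image.open(p) for p in sampled_frame_paths]
# Runs of consecutive frames scoring above a threshold for a class can be grouped
# into temporal segments, which is one simple localization strategy among several.
```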
## Dependencies and Installation

To use this repository, you need Python 3.7 or later and the following Python libraries:
- PyTorch
- torchvision
- torchaudio
- OpenCV
- NumPy
- Pandas
You can install the required packages with conda using the provided `environment.yml` file:

```bash
conda env create -n zstad --file environment.yml
```
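After the environment is created, activate it with `conda activate zstad`. As an optional sanity check, the short snippet below (illustrative, not part of the repository) verifies that the core dependencies import correctly:

```python
# Quick sanity check: confirm the core dependencies import inside the zstad environment.
import torch
import torchvision
import torchaudio
import cv2
import numpy
import pandas

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__, "| torchaudio:", torchaudio.__version__)
print("OpenCV:", cv2.__version__, "| NumPy:", numpy.__version__, "| Pandas:", pandas.__version__)
```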
## Dataset Preparation

Before running the experiments, you need to prepare a dataset for temporal action localization. In this project we used the THUMOS14 dataset. Instructions for downloading and preparing the dataset can be found here.
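For reference, the sketch below shows one way to sample frames from a video with OpenCV (listed in the dependencies) before feeding them to the models. The sampling rate and the `data/thumos14/...` paths are assumptions for illustration, not the project's required preprocessing or layout.

```python
# Minimal sketch: sample frames from a video at a fixed rate with OpenCV.
# The data/thumos14/... paths below are an assumed layout, not a required structure.
import os
import cv2

def extract_frames(video_path, out_dir, target_fps=5):
    """Save roughly `target_fps` frames per second of `video_path` as JPEGs in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example (hypothetical paths):
# extract_frames("data/thumos14/videos/video_validation_0000051.mp4",
#                "data/thumos14/frames/video_validation_0000051")
```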
## Usage Instructions

TBC

## Experimental Results

TBC

## Citations

TBC
## License

This project is released under the MIT License.