
Multi-modal Zero-shot Temporal Action Detection and Localization

This repository contains the code and implementation details for a series of experiments utilizing CLIP and CLAP for zero-shot temporal action localization. The primary goal is to explore the potential of multi-modal learning for detecting and localizing actions in videos without the need for training on extensive annotated data.

Table of Contents

  1. Introduction
  2. Dependencies and Installation
  3. Dataset Preparation
  4. Usage Instructions
  5. Experimental Results
  6. Citations
  7. License

Introduction

This project aims to leverage the powerful pre-trained models CLIP (Contrastive Language-Image Pretraining) and CLAP (Contrastive Language-Audio Pretraining) to achieve zero-shot temporal action localization in videos. Both models learn joint embeddings from paired multi-modal data (image-text and audio-text, respectively) and have demonstrated strong zero-shot transfer without task-specific training.

In our experiments, we explore different strategies to adapt these models for temporal action localization and detection tasks without the need for fine-tuning on specific action classes.
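As a rough illustration of the core idea (a minimal sketch, not the repository's actual pipeline), the snippet below scores a single decoded video frame against a handful of free-form action prompts using the OpenAI clip package and Pillow, neither of which is in the dependency list above; the class names, prompt template, and "frame.jpg" path are illustrative placeholders:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative action classes; real experiments would use a full label set such as THUMOS14's.
actions = ["high jump", "pole vault", "cricket bowling"]
prompts = clip.tokenize([f"a video frame of a person doing {a}" for a in actions]).to(device)

# "frame.jpg" is a placeholder for one decoded frame of the input video.
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # Per-frame class scores; stacking these over time yields score curves
    # that can be thresholded and grouped into temporal action proposals.
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(actions, probs[0].tolist())))

Repeating this per frame (or per clip, and analogously per audio window with CLAP) is what makes zero-shot temporal localization possible without fine-tuning on the target action classes.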

Dependencies and Installation

To use this repository, you will need to have Python 3.7 or later installed. Additionally, you will need the following Python libraries:

  • PyTorch
  • torchvision
  • torchaudio
  • OpenCV
  • NumPy
  • Pandas

You can install the required packages using conda and the environment.yml file:

conda env create -n zstad --file environment.yml
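After the environment is created, activate it before running any scripts:

conda activate zstad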

Dataset Preparation

Before running the experiments, you will need to prepare a dataset for temporal action localization. In this project, we used the THUMOS14 (THUMOS 2014) dataset. Instructions for downloading and preparing the dataset can be found here.
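Until dedicated preparation scripts are added, the sketch below shows one common preprocessing step, extracting frames from each video with OpenCV (one of the listed dependencies); the paths and sampling stride are placeholders, not values the experiments require:

import cv2
import os

def extract_frames(video_path, out_dir, stride=8):
    """Save every `stride`-th frame of a video as a JPEG (illustrative only)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Placeholder paths; point these at wherever the THUMOS14 videos are stored.
extract_frames("thumos14/videos/video_test_0000004.mp4", "thumos14/frames/video_test_0000004")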

Usage Instructions

TBC

Experimental Results

TBC

Citations

TBC

License

This project is released under the MIT License.
