Contextual AD Narration with Interleaved Multimodal Sequence

Hanlin Wang1,3, Zhan Tong2, Kecheng Zheng3, Yujun Shen3, Limin Wang1,4,†
1State Key Laboratory for Novel Software Technology, Nanjing University
2ESAT, KU Leuven 3Ant Group 4Shanghai Artificial Intelligence Laboratory
† Corresponding author

This repository is the official implementation of our CVPR 2025 paper.

Setup

Follow the guide below to set up the environment.

  1. Git clone repo

    git clone https://github.com/ant-research/UniAD
    cd UniAD
    
  2. Download and unzip checkpoints

    Download the necessary files from here

    Download 'CLIP_L14_frames_features_5fps.h5' from MAD

    Use the method in AutoAD-II to obtain 'MAD_examplers.pth.tar'

    Download 'LLAMA2-7B'

  3. Create environment and install packages

    Create environment for MAD:

    conda create -n UniAD_MAD python=3.8 -y
    conda activate UniAD_MAD
    pip install -r requirements_MAD_clean.txt
    

    Create environment for CMDAD & TVAD:

    conda create -n UniAD_CMDAD python=3.8 -y
    conda activate UniAD_CMDAD
    pip install -r requirements_CMD_clean.txt
    pip install --no-deps torchvision==0.13.1
    

    Create environment for critic evaluation in CMDAD & TVAD:

    conda create -n UniAD_critic python=3.9 -y
    conda activate UniAD_critic
    pip install -r requirements_critic.txt
    
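After creating the environments, a quick import check can confirm that the UniAD_CMDAD environment picked up the torchvision build installed with --no-deps (a minimal sketch; it only prints the installed versions):

    conda activate UniAD_CMDAD
    python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"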

Running

We train our model on 8 A100 GPUs and evaluate on a single A6000 GPU.

Evaluation

Conduct evaluation on MAD:

CUDA_VISIBLE_DEVICES=0 python main.py --LLM_path 'LLAMA2-7B path' --batch-size-val 3 --char_feature_path 'MAD_examplers.pth.tar path' --char_prompt_type 0 --resume 'MAD.pt path' --if_finutune_GPT 0 --if_img_only 0 --if_lora 1 --if_only_flamingo 2 --mylogs 'output file directory path' --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' --name MAD_LLAMA2 --previous_video_num 1 --val-data 'MAD_eval_char_refine_final.json path' --workers 4
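
For convenience, the same command can be wrapped in a small shell script that factors the paths into variables (a sketch; the script name eval_mad.sh and the variable names are our own, and the /path/to placeholders must be replaced with your local paths):

    #!/bin/bash
    # eval_mad.sh (hypothetical helper): identical flags to the MAD evaluation
    # command above, with the file paths pulled out into variables.
    LLM_PATH=/path/to/LLAMA2-7B
    CHAR_FEAT=/path/to/MAD_examplers.pth.tar
    MOVIE_FEAT=/path/to/CLIP_L14_frames_features_5fps.h5
    VAL_DATA=/path/to/MAD_eval_char_refine_final.json
    RESUME=/path/to/MAD.pt
    OUT_DIR=/path/to/output_dir

    CUDA_VISIBLE_DEVICES=0 python main.py \
      --LLM_path "$LLM_PATH" --batch-size-val 3 \
      --char_feature_path "$CHAR_FEAT" --char_prompt_type 0 \
      --resume "$RESUME" --if_finutune_GPT 0 --if_img_only 0 --if_lora 1 \
      --if_only_flamingo 2 --mylogs "$OUT_DIR" \
      --movie_feature_path "$MOVIE_FEAT" --name MAD_LLAMA2 \
      --previous_video_num 1 --val-data "$VAL_DATA" --workers 4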

Conduct evaluation on CMDAD & TVAD:

CUDA_VISIBLE_DEVICES=0 python main.py --if_finutune_GPT 0 --mylogs 'output file directory path' --name CMDAD_LLAMA2 --precision fp32 --if_lora 1 --train-data "" --val-data 'cmdad_char_refine_eval.json path' --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 --workers 1 --Visual_Loss 0 --LLM_path 'LLAMA2-7B path' --if_only_flamingo 2 --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 --movie_feature_eval_path 'VideoLLaMa_CMD_eval_fp16.h5 path' --char_feature_path 'chars_all_videollama.pth.tar path' --previous_video_num 1 --lr 0.0001 --resume 'CMDAD_TVAD.pt path' --eval_data_name CMDAD

CUDA_VISIBLE_DEVICES=0 python main.py --if_finutune_GPT 0 --mylogs 'output file directory path' --name TVAD_LLAMA2 --precision fp32 --if_lora 1 --train-data "" --val-data 'tvad_char_refine_eval.json path' --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 --workers 1 --Visual_Loss 0 --LLM_path 'LLAMA2-7B path' --if_only_flamingo 2 --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 --movie_feature_eval_path 'TV_eval_videollama.h5 path' --char_feature_path 'chars_all_videollama.pth.tar path' --previous_video_num 1 --lr 0.0001 --resume 'CMDAD_TVAD.pt path' --eval_data_name TVAD
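
The two commands above differ only in the --name, --val-data, --movie_feature_eval_path, and --eval_data_name flags, so they can be driven by one parameterised script (a sketch; eval_cmdad_tvad.sh and the variable names are our own, and the /path/to placeholders must be replaced with your local paths):

    #!/bin/bash
    # eval_cmdad_tvad.sh (hypothetical helper).
    # Usage: ./eval_cmdad_tvad.sh CMDAD   or   ./eval_cmdad_tvad.sh TVAD
    DATASET=${1:-CMDAD}
    LLM_PATH=/path/to/LLAMA2-7B
    CHAR_FEAT=/path/to/chars_all_videollama.pth.tar
    RESUME=/path/to/CMDAD_TVAD.pt
    OUT_DIR=/path/to/output_dir

    if [ "$DATASET" = "CMDAD" ]; then
      VAL_DATA=/path/to/cmdad_char_refine_eval.json
      MOVIE_FEAT=/path/to/VideoLLaMa_CMD_eval_fp16.h5
    else
      VAL_DATA=/path/to/tvad_char_refine_eval.json
      MOVIE_FEAT=/path/to/TV_eval_videollama.h5
    fi

    CUDA_VISIBLE_DEVICES=0 python main.py \
      --if_finutune_GPT 0 --mylogs "$OUT_DIR" --name "${DATASET}_LLAMA2" \
      --precision fp32 --if_lora 1 --train-data "" --val-data "$VAL_DATA" \
      --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 \
      --workers 1 --Visual_Loss 0 --LLM_path "$LLM_PATH" --if_only_flamingo 2 \
      --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 \
      --movie_feature_eval_path "$MOVIE_FEAT" --char_feature_path "$CHAR_FEAT" \
      --previous_video_num 1 --lr 0.0001 --resume "$RESUME" --eval_data_name "$DATASET"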

Note:

While preparing the open-source release, we ran ablation experiments on CMDAD and TVAD with the latest experimental settings. We found that the context modeling and character refinement modules contribute only marginally once the VideoLLaMA model and the character prediction results from AutoAD-Zero are introduced. Please refer to our updated arXiv paper for details.

Train on MAD

Prepare the training data of MAD from MAD, use the character prediction results from AutoAD, and organize the data into the following format:

[
  {
    "start": "",
    "end": "",
    "ad": "",
    "char": [],
    "ad_chars": [],
    "ad_chars_in_chars": [],
    "context": [
      {
        "start": "",
        "end": "",
        "ad": "",
        "char": [],
        "ad_chars": [],
        "ad_chars_in_chars": []
      }
    ],
    "movie_id": ""
  },
  ...
]
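
Before launching training, the prepared file can be sanity-checked for the required keys with jq (a minimal sketch; MAD_train.json is a placeholder name for your own file, and the check only verifies top-level keys, not the nested context entries):

    jq -e 'all(.[];
               has("start") and has("end") and has("ad") and has("char") and
               has("ad_chars") and has("ad_chars_in_chars") and
               has("context") and has("movie_id"))' MAD_train.json \
      && echo "format OK" || echo "some entries are missing keys"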

Then run:

torchrun --nproc_per_node 8 -m main \
    --if_finutune_GPT 0 --accum-freq 4 --if_lora 1 --if_only_flamingo 2 \
    --num_latents 30 --if_special_prompt 0 --if_img_only 0 --num_char 30 \
    --previous_video_num 1 --AD_pretrained 1 \
    --AD_pretrained_checkpoint 'LLaMA_AD_pretrain.pt path' \
    --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' \
    --char_feature_path 'MAD_examplers.pth.tar path' \
    --dataset 'MAD' --train-data 'train data path' --val-data '' \
    --LLM_path 'LLAMA2-7B path' --batch-size 3 --batch-size-val 3 \
    --epochs 10 --LLM_name 'LLaMA' --lr 0.00005 --warmup 6000 \
    --save-frequency 1 --val-frequency 1 --precision 'fp32' \
    --mylogs 'output file directory path' --name 'output file name'

Citation

Don't forget to cite this source if it proves useful in your research!

@article{wang2024uniad, 
	title={Contextual AD Narration with Interleaved Multimodal Sequence}, 
	author={Hanlin Wang and Zhan Tong and Kecheng Zheng and Yujun Shen and Limin Wang}, 
	year={2025}, 
	eprint={2403.12922}, 
	archivePrefix={arXiv}, 
	primaryClass={cs.CV}}

Acknowledgement

Our implementation is based on open-source code from prior work. Thanks for their remarkable contributions and released code!

Note

This repo is governed by the LLaMA license. We strongly advise users not to knowingly generate, or allow others to knowingly generate, harmful content, including hate speech, violence, pornography, deception, etc.

