Hanlin Wang1,3, Zhan Tong2, Kecheng Zheng3, Yujun Shen3, Limin Wang1,4,†
1State Key Laboratory for Novel Software Technology, Nanjing University
2ESAT, KU Leuven 3Ant Group 4Shanghai Artificial Intelligence Laboratory
†corresponding author
Follow this guide to set up the environment.
- Git clone the repo
git clone https://github.com/ant-research/UniAD
cd UniAD
- Download and unzip checkpoints
Download the necessary files from here
Download 'CLIP_L14_frames_features_5fps.h5' from MAD
Use the method in AutoAD-II to get 'MAD_examplers.pth.tar'
Download 'LLAMA2-7B'
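Once the files are in place, a quick sanity check can confirm that the MAD visual features load. The snippet below is a minimal sketch, assuming h5py is available and that the file maps each movie id directly to an array of frame features (the exact key layout may differ):
python - <<'EOF'
import h5py

# Replace with the local path of the downloaded feature file.
path = 'CLIP_L14_frames_features_5fps.h5'
with h5py.File(path, 'r') as f:
    keys = list(f.keys())
    print(f'{len(keys)} movies found')
    # Inspect the first entry; each value is expected to be a (num_frames, feature_dim) array.
    print(keys[0], f[keys[0]].shape)
EOF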
- Create environments and install packages
Create an environment for MAD:
conda create -n UniAD_MAD python=3.8 -y
conda activate UniAD_MAD
pip install -r requirements_MAD_clean.txt
Create an environment for CMDAD & TVAD:
conda create -n UniAD_CMDAD python=3.8 -y
conda activate UniAD_CMDAD
pip install -r requirements_CMD_clean.txt
pip install --no-deps torchvision==0.13.1
Create an environment for critic evaluation on CMDAD & TVAD:
conda create -n UniAD_critic python=3.9 -y
conda activate UniAD_critic
pip install -r requirements_critic.txt
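Optionally, verify that each environment resolves PyTorch and CUDA as expected. This is a quick sanity check only; the exact versions depend on the requirements files. Repeat for the other environments:
conda activate UniAD_MAD
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"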
We train our model on 8 A100 GPUs and evaluate on a single A6000 GPU.
Conduct evaluation on MAD:
CUDA_VISIBLE_DEVICES=0 python main.py --LLM_path 'LLAMA2-7B path' --batch-size-val 3 --char_feature_path 'MAD_examplers.pth.tar path' --char_prompt_type 0 --resume 'MAD.pt path' --if_finutune_GPT 0 --if_img_only 0 --if_lora 1 --if_only_flamingo 2 --mylogs 'output file directory path' --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' --name MAD_LLAMA2 --previous_video_num 1 --val-data 'MAD_eval_char_refine_final.json path' --workers 4
Conduct evaluation on CMDAD & TVAD:
CUDA_VISIBLE_DEVICES=0 python main.py --if_finutune_GPT 0 --mylogs 'output file directory path' --name CMDAD_LLAMA2 --precision fp32 --if_lora 1 --train-data "" --val-data 'cmdad_char_refine_eval.json path' --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 --workers 1 --Visual_Loss 0 --LLM_path 'LLAMA2-7B path' --if_only_flamingo 2 --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 --movie_feature_eval_path 'VideoLLaMa_CMD_eval_fp16.h5 path' --char_feature_path 'chars_all_videollama.pth.tar path' --previous_video_num 1 --lr 0.0001 --resume 'CMDAD_TVAD.pt path' --eval_data_name CMDAD
CUDA_VISIBLE_DEVICES=0 python main.py --if_finutune_GPT 0 --mylogs 'output file directory path' --name TVAD_LLAMA2 --precision fp32 --if_lora 1 --train-data "" --val-data 'tvad_char_refine_eval.json path' --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 --workers 1 --Visual_Loss 0 --LLM_path 'LLAMA2-7B path' --if_only_flamingo 2 --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 --movie_feature_eval_path 'TV_eval_videollama.h5 path' --char_feature_path 'chars_all_videollama.pth.tar path' --previous_video_num 1 --lr 0.0001 --resume 'CMDAD_TVAD.pt path' --eval_data_name TVAD
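For convenience, the placeholder paths in the commands above can be collected into a small launch script. The sketch below mirrors the MAD evaluation command; every path variable is an assumption about your local layout and should be adjusted accordingly (the same pattern applies to the CMDAD and TVAD commands):
#!/bin/bash
# Hypothetical local paths; adjust to your setup.
LLM_PATH=/path/to/LLAMA2-7B
CHAR_FEAT=/path/to/MAD_examplers.pth.tar
MOVIE_FEAT=/path/to/CLIP_L14_frames_features_5fps.h5
VAL_DATA=/path/to/MAD_eval_char_refine_final.json
CKPT=/path/to/MAD.pt
OUT_DIR=/path/to/output

CUDA_VISIBLE_DEVICES=0 python main.py \
    --LLM_path "$LLM_PATH" --batch-size-val 3 \
    --char_feature_path "$CHAR_FEAT" --char_prompt_type 0 \
    --resume "$CKPT" --if_finutune_GPT 0 --if_img_only 0 --if_lora 1 \
    --if_only_flamingo 2 --mylogs "$OUT_DIR" \
    --movie_feature_path "$MOVIE_FEAT" --name MAD_LLAMA2 \
    --previous_video_num 1 --val-data "$VAL_DATA" --workers 4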
While preparing the code for open-sourcing, we conducted ablation experiments on CMDAD and TVAD using the latest experimental settings. We found that the effects of the context modeling and character refinement modules were minimal after introducing the VideoLLaMA model and the character prediction results from AutoAD-Zero. For more details, please refer to our updated arXiv paper.
Prepare the training data from MAD and use the character prediction results from AutoAD to organize it into the following format (a validation sketch follows the listing):
[
{
"start": "",
"end": "",
"ad": "",
"char": [],
"ad_chars": [],
"ad_chars_in_chars": [],
"context": [
{
"start": "",
"end": "",
"ad": "",
"char": [],
"ad_chars": [],
"ad_chars_in_chars": []
}
],
"movie_id": ""
},
...
]
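Before launching training, you may want to check that the assembled JSON matches this schema. Below is a minimal validation sketch; 'train data path' is a placeholder for your prepared training file:
python - 'train data path' <<'EOF'
import json
import sys

# sys.argv[1] is the path to the prepared training JSON.
required = {"start", "end", "ad", "char", "ad_chars", "ad_chars_in_chars", "context", "movie_id"}
context_required = required - {"context", "movie_id"}

with open(sys.argv[1]) as f:
    data = json.load(f)

for i, item in enumerate(data):
    missing = required - item.keys()
    assert not missing, f"entry {i} is missing keys: {missing}"
    for j, ctx in enumerate(item["context"]):
        ctx_missing = context_required - ctx.keys()
        assert not ctx_missing, f"entry {i}, context {j} is missing keys: {ctx_missing}"

print(f"OK: {len(data)} AD entries validated")
EOF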
Then run:
torchrun --nproc_per_node 8 -m main --if_finutune_GPT 0 --accum-freq 4 --if_lora 1 --if_only_flamingo 2 --num_latents 30 \
    --if_special_prompt 0 \
    --if_img_only 0 \
    --num_char 30 \
    --previous_video_num 1 \
    --AD_pretrained 1 \
    --AD_pretrained_checkpoint 'LLaMA_AD_pretrain.pt path' \
    --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' \
    --char_feature_path 'MAD_examplers.pth.tar path' \
    --dataset 'MAD' \
    --train-data 'train data path' \
    --val-data '' \
    --LLM_path 'LLAMA2-7B path' \
    --batch-size 3 \
    --batch-size-val 3 \
    --epochs 10 \
    --LLM_name 'LLaMA' \
    --lr 0.00005 \
    --warmup 6000 \
    --save-frequency 1 \
    --val-frequency 1 \
    --precision 'fp32' \
    --mylogs 'output file directory path' \
    --name 'output file name'
Please cite our work if it proves useful in your research!
@article{wang2024uniad,
title={Contextual AD Narration with Interleaved Multimodal Sequence},
author={Hanlin Wang and Zhan Tong and Kecheng Zheng and Yujun Shen and Limin Wang},
year={2025},
eprint={2403.12922},
archivePrefix={arXiv},
primaryClass={cs.CV}}
Our implementation is based on several open-source projects. Thanks for their remarkable contributions and released code!
Note: This repo is governed by the LLaMA license. We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.