Hanlin Wang1,3, Zhan Tong2, Kecheng Zheng3, Yujun Shen3, Limin Wang1,4,†
1State Key Laboratory for Novel Software Technology, Nanjing University
2ESAT, KU Leuven 3Ant Group 4Shanghai Artificial Intelligence Laboratory
†corresponding author
Follow this guide to set up the environment.
- Git clone the repo
git clone https://github.com/ant-research/UniAD
cd UniAD
- Download and unzip checkpoints
Download the necessary files from here
Download 'CLIP_L14_frames_features_5fps.h5' from MAD
Use the method in AutoAD-II to get 'MAD_examplers.pth.tar'
Download 'LLAMA2-7B'
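Once the files are in place, a quick sanity check can confirm that the MAD visual features load. The snippet below is a minimal sketch, assuming h5py is available and that the file maps each movie id directly to an array of frame features (the exact key layout may differ):
python - <<'EOF'
import h5py

# Replace with the local path of the downloaded feature file.
path = 'CLIP_L14_frames_features_5fps.h5'
with h5py.File(path, 'r') as f:
    keys = list(f.keys())
    print(f'{len(keys)} movies found')
    # Inspect the first entry; each value is expected to be a (num_frames, feature_dim) array.
    print(keys[0], f[keys[0]].shape)
EOF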
- Create environments and install packages
Create an environment for MAD:
conda create -n UniAD_MAD python=3.8 -y
conda activate UniAD_MAD
pip install -r requirements_MAD_clean.txt
Create an environment for CMDAD & TVAD:
conda create -n UniAD_CMDAD python=3.8 -y
conda activate UniAD_CMDAD
pip install -r requirements_CMD_clean.txt
pip install --no-deps torchvision==0.13.1
Create an environment for critic evaluation on CMDAD & TVAD:
conda create -n UniAD_critic python=3.9 -y
conda activate UniAD_critic
pip install -r requirements_critic.txt
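Optionally, verify that each environment resolves PyTorch and CUDA as expected. This is a quick sanity check only; the exact versions depend on the requirements files. Repeat for the other environments:
conda activate UniAD_MAD
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"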
We train our model on 8 A100 GPUs and evaluate on a single A6000 GPU.
Conduct evaluation on MAD:
CUDA_VISIBLE_DEVICES=0 python main.py --LLM_path 'LLAMA2-7B path' --batch-size-val 3 --char_feature_path 'MAD_examplers.pth.tar path' --char_prompt_type 0 --resume 'MAD.pt path' --if_finutune_GPT 0 --if_img_only 0 --if_lora 1 --if_only_flamingo 2 --mylogs 'output file directory path' --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' --name MAD_LLAMA2 --previous_video_num 1 --val-data 'MAD_eval_char_refine_final.json path' --workers 4
Conduct evaluation on CMDAD & TVAD:
CUDA_VISIBLE_DEVICES=0 python main.py --if_finutune_GPT 0 --mylogs 'output file directory path' --name CMDAD_LLAMA2 --precision fp32 --if_lora 1 --train-data "" --val-data 'cmdad_char_refine_eval.json path' --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 --workers 1 --Visual_Loss 0 --LLM_path 'LLAMA2-7B path' --if_only_flamingo 2 --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 --movie_feature_eval_path 'VideoLLaMa_CMD_eval_fp16.h5 path' --char_feature_path 'chars_all_videollama.pth.tar path' --previous_video_num 1 --lr 0.0001 --resume 'CMDAD_TVAD.pt path' --eval_data_name CMDAD
CUDA_VISIBLE_DEVICES=0 python main.py --if_finutune_GPT 0 --mylogs 'output file directory path' --name TVAD_LLAMA2 --precision fp32 --if_lora 1 --train-data "" --val-data 'tvad_char_refine_eval.json path' --log-every-n-steps 1 --dataset-type json --batch-size 1 --batch-size-val 1 --workers 1 --Visual_Loss 0 --LLM_path 'LLAMA2-7B path' --if_only_flamingo 2 --if_special_prompt 0 --num_latents 32 --num_char 32 --if_img_only 0 --movie_feature_eval_path 'TV_eval_videollama.h5 path' --char_feature_path 'chars_all_videollama.pth.tar path' --previous_video_num 1 --lr 0.0001 --resume 'CMDAD_TVAD.pt path' --eval_data_name TVAD
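For convenience, the placeholder paths in the commands above can be collected into a small launch script. The sketch below mirrors the MAD evaluation command; every path variable is an assumption about your local layout and should be adjusted accordingly (the same pattern applies to the CMDAD and TVAD commands):
#!/bin/bash
# Hypothetical local paths; adjust to your setup.
LLM_PATH=/path/to/LLAMA2-7B
CHAR_FEAT=/path/to/MAD_examplers.pth.tar
MOVIE_FEAT=/path/to/CLIP_L14_frames_features_5fps.h5
VAL_DATA=/path/to/MAD_eval_char_refine_final.json
CKPT=/path/to/MAD.pt
OUT_DIR=/path/to/output

CUDA_VISIBLE_DEVICES=0 python main.py \
    --LLM_path "$LLM_PATH" --batch-size-val 3 \
    --char_feature_path "$CHAR_FEAT" --char_prompt_type 0 \
    --resume "$CKPT" --if_finutune_GPT 0 --if_img_only 0 --if_lora 1 \
    --if_only_flamingo 2 --mylogs "$OUT_DIR" \
    --movie_feature_path "$MOVIE_FEAT" --name MAD_LLAMA2 \
    --previous_video_num 1 --val-data "$VAL_DATA" --workers 4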
While preparing the code for open-sourcing, we conducted ablation experiments on CMDAD and TVAD using the latest experimental settings. We found that the effects of the context modeling and character refinement modules were minimal after introducing the VideoLLaMA model and the character prediction results from AutoAD-Zero. For more details, please refer to our updated arXiv paper.
Prepare the training data from MAD and use the character prediction results from AutoAD to organize it into the following format (a validation sketch follows the listing):
[
{
"start": "",
"end": "",
"ad": "",
"char": [],
"ad_chars": [],
"ad_chars_in_chars": [],
"context": [
{
"start": "",
"end": "",
"ad": "",
"char": [],
"ad_chars": [],
"ad_chars_in_chars": []
}
],
"movie_id": ""
},
...
]
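Before launching training, you may want to check that the assembled JSON matches this schema. Below is a minimal validation sketch; 'train data path' is a placeholder for your prepared training file:
python - 'train data path' <<'EOF'
import json
import sys

# sys.argv[1] is the path to the prepared training JSON.
required = {"start", "end", "ad", "char", "ad_chars", "ad_chars_in_chars", "context", "movie_id"}
context_required = required - {"context", "movie_id"}

with open(sys.argv[1]) as f:
    data = json.load(f)

for i, item in enumerate(data):
    missing = required - item.keys()
    assert not missing, f"entry {i} is missing keys: {missing}"
    for j, ctx in enumerate(item["context"]):
        ctx_missing = context_required - ctx.keys()
        assert not ctx_missing, f"entry {i}, context {j} is missing keys: {ctx_missing}"

print(f"OK: {len(data)} AD entries validated")
EOF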
Then run:
torchrun --nproc_per_node 8 -m main --if_finutune_GPT 0 --accum-freq 4 --if_lora 1 --if_only_flamingo 2 --num_latents 30 \
    --if_special_prompt 0 \
    --if_img_only 0 \
    --num_char 30 \
    --previous_video_num 1 \
    --AD_pretrained 1 \
    --AD_pretrained_checkpoint 'LLaMA_AD_pretrain.pt path' \
    --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' \
    --char_feature_path 'MAD_examplers.pth.tar path' \
    --dataset 'MAD' \
    --train-data 'train data path' \
    --val-data '' \
    --LLM_path 'LLAMA2-7B path' \
    --batch-size 3 \
    --batch-size-val 3 \
    --epochs 10 \
    --LLM_name 'LLaMA' \
    --lr 0.00005 \
    --warmup 6000 \
    --save-frequency 1 \
    --val-frequency 1 \
    --precision 'fp32' \
    --mylogs 'output file directory path' \
    --name 'output file name'
Please cite our work if it proves useful in your research!
@article{wang2024uniad,
title={Contextual AD Narration with Interleaved Multimodal Sequence},
author={Hanlin Wang and Zhan Tong and Kecheng Zheng and Yujun Shen and Limin Wang},
year={2025},
eprint={2403.12922},
archivePrefix={arXiv},
primaryClass={cs.CV}}
Our implementation is based on several open-source projects. Thanks for their remarkable contributions and released code!
Note: This repo is governed by the LLaMA license. We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.