PyTorch implementation for our ICME2025 submission "STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing".
- inference code
- paper & supplementary material
- YouTube demo
- training code
- fine-tuning code
| chinese.mp4 | korean.mp4 | japanese.mp4 | spanish.mp4 |
| :---: | :---: | :---: | :---: |
We compare our method with DiffTalk (CVPR'23), DINet (AAAI'23), IP-LAP (CVPR'23), MuseTalk (arXiv 2024), PC-AVS (CVPR'21), TalkLip (CVPR'23), and Wav2Lip (MM'20).
| Ours.mp4 | DiffTalk.mp4 | DINet.mp4 | IP-LAP.mp4 |
| :---: | :---: | :---: | :---: |
| MuseTalk.mp4 | PC-AVS.mp4 | TalkLIp.mp4 | Wav2Lip.mp4 |
- Python 3.8.7
- torch 1.12.1
- torchvision 0.13.1
- librosa 0.9.2
- ffmpeg
First, create and activate a conda environment:
conda create -n stsa python=3.8
conda activate stsa
PyTorch 1.12.1 is used; the other requirements are listed in "requirements.txt". Please run:
pip install -r requirements.txt
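To double-check that the installed versions match the prerequisites listed above, you can run a small sanity check like the sketch below. It is not part of this repository and only assumes the dependencies above plus ffmpeg on your PATH:

```python
# Optional sanity check for the stsa environment (not an official script).
import shutil

import torch
import torchvision
import librosa

print("torch         :", torch.__version__)        # expected 1.12.1
print("torchvision   :", torchvision.__version__)  # expected 0.13.1
print("librosa       :", librosa.__version__)      # expected 0.9.2
print("CUDA available:", torch.cuda.is_available())
print("ffmpeg found  :", shutil.which("ffmpeg") is not None)
```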
Download the pretrained weights and put them under ./checkpoints. After that, run the following command:
python inference.py --video_path "demo_templates/video/speakerine.mp4" --audio_path "demo_templates/audio/education.wav"
You can specify the --video_path and --audio_path options to run inference on other videos, as in the sketch below.
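If you want to dub one template video with several audio tracks in a row, a simple wrapper can loop over the audio files and call inference.py once per track. This is only a sketch, not an official script: it assumes inference.py accepts exactly the --video_path and --audio_path flags shown above, and the audio directory used here is a placeholder:

```python
# Hypothetical batch driver around inference.py (not part of the repo).
import subprocess
from pathlib import Path

video = "demo_templates/video/speakerine.mp4"   # template video from the demo
audio_dir = Path("demo_templates/audio")        # placeholder: any folder of .wav files

for audio in sorted(audio_dir.glob("*.wav")):
    print(f"Dubbing {video} with {audio} ...")
    subprocess.run(
        ["python", "inference.py",
         "--video_path", video,
         "--audio_path", str(audio)],
        check=True,  # stop immediately if one run fails
    )
```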