PyTorch implementation for our ICME2025 submission "STSA: Spatial-Temporal Semantic Alignment for Facial Visual Dubbing".
- inference code
- paper & supplementary material
- YouTube demo
- training code
- fine-tuning code
| chinese.mp4 | korean.mp4 | japanese.mp4 | spanish.mp4 |
| :---: | :---: | :---: | :---: |
We compare our method with DiffTalk (CVPR'23), DINet (AAAI'23), IP-LAP (CVPR'23), MuseTalk (arXiv 2024), PC-AVS (CVPR'21), TalkLip (CVPR'23), and Wav2Lip (MM'20).
| Ours.mp4 | DiffTalk.mp4 | DINet.mp4 | IP-LAP.mp4 |
| :---: | :---: | :---: | :---: |
| MuseTalk.mp4 | PC-AVS.mp4 | TalkLIp.mp4 | Wav2Lip.mp4 |
- Python 3.8.7
- torch 1.12.1
- torchvision 0.13.1
- librosa 0.9.2
- ffmpeg
First, create and activate a conda environment:
conda create -n stsa python=3.8
conda activate stsa
PyTorch 1.12.1 is used; the other requirements are listed in "requirements.txt". Please run:
pip install -r requirements.txt
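To double-check that the installed versions match the prerequisites listed above, you can run a small sanity check like the sketch below. It is not part of this repository and only assumes the dependencies above plus ffmpeg on your PATH:

```python
# Optional sanity check for the stsa environment (not an official script).
import shutil

import torch
import torchvision
import librosa

print("torch         :", torch.__version__)        # expected 1.12.1
print("torchvision   :", torchvision.__version__)  # expected 0.13.1
print("librosa       :", librosa.__version__)      # expected 0.9.2
print("CUDA available:", torch.cuda.is_available())
print("ffmpeg found  :", shutil.which("ffmpeg") is not None)
```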
Download the pretrained weights and put them under ./checkpoints. After that, run the following command:
python inference.py --video_path "demo_templates/video/speakerine.mp4" --audio_path "demo_templates/audio/education.wav"
You can specify the --video_path and --audio_path options to run inference on other videos, as in the sketch below.
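If you want to dub one template video with several audio tracks in a row, a simple wrapper can loop over the audio files and call inference.py once per track. This is only a sketch, not an official script: it assumes inference.py accepts exactly the --video_path and --audio_path flags shown above, and the audio directory used here is a placeholder:

```python
# Hypothetical batch driver around inference.py (not part of the repo).
import subprocess
from pathlib import Path

video = "demo_templates/video/speakerine.mp4"   # template video from the demo
audio_dir = Path("demo_templates/audio")        # placeholder: any folder of .wav files

for audio in sorted(audio_dir.glob("*.wav")):
    print(f"Dubbing {video} with {audio} ...")
    subprocess.run(
        ["python", "inference.py",
         "--video_path", video,
         "--audio_path", str(audio)],
        check=True,  # stop immediately if one run fails
    )
```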