This is the official repository for our work “Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception”, available on arXiv at https://arxiv.org/pdf/2504.06753.
This project requires downloading four datasets independently:
- Speech: ASVspoof2019
- Sound: Codecfake-A3
- Singing Voice: CtrSVDD_train&dev, CtrSVDD_eval
- Music: FakeMusicCaps
After downloading all datasets, arrange them according to the directory structure outlined below. If any path errors occur, modify the 'Data folder prepare' section in config.py accordingly (see the sketch after the directory tree).
# Project Directory Structure
## ASVspoof2019 Dataset
│ ├── ASVspoof2019
│ │ ├── LA
│ │ │ ├── ASVspoof2019_LA_train
│ │ │ │ └── flac
│ │ │ │ └── *.flac (25,380 audio files)
│ │ │ ├── ASVspoof2019_LA_dev
│ │ │ │ └── flac
│ │ │ │ └── *.flac (24,844 audio files)
│ │ │ ├── ASVspoof2019_LA_eval
│ │ │ │ └── flac
│ │ │ │ └── *.flac (71,237 audio files)
│ │ │ ├── ASVspoof2019_LA_cm_protocols
│ │ │ │ ├── ASVspoof2019.LA.cm.train.trn.txt (training labels)
│ │ │ │ ├── ASVspoof2019.LA.cm.dev.trl.txt (development labels)
│ │ │ │ ├── ASVspoof2019.LA.cm.eval.trl.txt (evaluation labels)
## CtrSVDD Dataset
│ ├── CtrSVDD
│ │ ├── train
│ │ │ └── *.wav (84,404 audio files)
│ │ ├── dev
│ │ │ └── *.wav (43,625 audio files)
│ │ ├── eval
│ │ │ └── *.wav (92,769 audio files)
│ │ ├── label
│ │ │ ├── train.txt (training labels)
│ │ │ ├── dev.txt (development labels)
│ │ │ ├── eval.txt (evaluation labels)
## FakeMusicCaps Dataset
│ ├── Fakemusiccaps
│ │ ├── audio
│ │ │ └── *.wav (33,041 audio files)
│ │ ├── label
│ │ │ ├── train.txt (training labels)
│ │ │ ├── dev.txt (development labels)
│ │ │ ├── eval.txt (evaluation labels)
## Codecfake-A3 (Fake Sound) Dataset
│ ├── Codecfake_A3
│ │ ├── 16kaudio
│ │ │ └── *.wav (99,112 audio files)
│ │ ├── label
│ │ │ ├── train.txt (training labels)
│ │ │ ├── dev.txt (development labels)
│ │ │ ├── eval.txt (evaluation labels)
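For reference, the path entries in the 'Data folder prepare' section of config.py might look roughly like the following; the variable names here are illustrative, so match them against the actual names in your copy of config.py.

```python
# Hypothetical sketch of the 'Data folder prepare' section in config.py;
# the variable names are illustrative and may differ from the repo's.
DATA_ROOT = "/yourpath/data"

asvspoof2019_path  = f"{DATA_ROOT}/ASVspoof2019/LA"    # speech
codecfake_a3_path  = f"{DATA_ROOT}/Codecfake_A3"       # sound
ctrsvdd_path       = f"{DATA_ROOT}/CtrSVDD"            # singing voice
fakemusiccaps_path = f"{DATA_ROOT}/Fakemusiccaps"      # music
```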
conda create -n add python=3.9.18
conda activate add
pip install -r requirements.txt
All SSL features use the Hugging Face versions. This requires downloading the corresponding SSL models offline to your own directory; the --local-dir argument corresponds to the SSL path in config.py.
huggingface-cli download facebook/wav2vec2-xls-r-300m --local-dir yourpath/huggingface/wav2vec2-xls-r-300m/
huggingface-cli download microsoft/wavlm-large --local-dir yourpath/huggingface/wavlm-large/
huggingface-cli download m-a-p/MERT-v1-330M --local-dir yourpath/huggingface/MERT-300M/
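Once downloaded, the checkpoints can be loaded fully offline with the standard transformers API. Below is a minimal sanity check, assuming the --local-dir paths from the commands above:

```python
# Minimal check that an offline SSL checkpoint loads correctly.
# The path is the --local-dir target from the commands above.
# Note: m-a-p/MERT-v1-330M may additionally need trust_remote_code=True.
from transformers import AutoFeatureExtractor, AutoModel

local_path = "yourpath/huggingface/wav2vec2-xls-r-300m/"
extractor = AutoFeatureExtractor.from_pretrained(local_path)
model = AutoModel.from_pretrained(local_path)
print(model.config.hidden_size)  # 1024 for XLS-R 300M
```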
Example: Speech-trained WPT-XLSRAASIST
python main_train.py --gpu 0 --train_task speech --model wpt-w2v2aasist --batch_size 32 --o ./ckpt_pt/speech_wpt-w2v2aasist
To change the training data, please refer to the --train_task argument of main_train.py:
choices=["speech", "sound", "singing", "music", "cotrain"]
To change the CM, please refer to the --model argument in config.py:
choices=['aasist', 'specresnet',
         'fr-w2v2aasist', 'fr-wavlmaasist', 'fr-mertaasist',    ❄
         'ft-w2v2aasist', 'ft-wavlmaasist', 'ft-mertaasist',    🔥
         'pt-w2v2aasist', 'pt-wavlmaasist', 'pt-mertaasist',    ⭐
         'wpt-w2v2aasist', 'wpt-wavlmaasist', 'wpt-mertaasist'  ⭐⭐⭐
]
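For instance, following the pattern of the speech example above, a co-training run with the WavLM-based WPT model would look like the command below (illustrative; adjust the batch size and output path to your setup):

python main_train.py --gpu 0 --train_task cotrain --model wpt-wavlmaasist --batch_size 32 --o ./ckpt_pt/cotrain_wpt-wavlmaasist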
All training scripts for this paper can be found in script/train_ref.sh
All inference scripts for this paper can be found in script/test_ref.sh
Compute the EER scores. This iterates through all result.txt files in the ckpt folder and returns the EER for each.
python evaluate_all.py -p ckpt_best/cotrain_wpt_xlsraasist
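For reference, the EER is the operating point where the false-acceptance and false-rejection rates are equal. The sketch below is a generic way to compute it from scores and binary labels, not necessarily what evaluate_all.py does internally:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """labels: 1 for bonafide, 0 for fake; scores: higher = more bonafide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FPR ≈ FNR
    return (fpr[idx] + fnr[idx]) / 2
```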
You can generate the attention maps using script/visual.sh, and the t-SNE figure using script/T-SNE.sh.
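Under the hood, a t-SNE figure of this kind is typically produced by projecting the model's embeddings to 2-D. A generic sketch with scikit-learn and matplotlib follows; the file names are hypothetical and the actual script may differ:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, D) array of CM features; labels: (N,) ints (0=fake, 1=real).
# Both files are hypothetical dumps of the model's outputs.
embeddings = np.load("embeddings.npy")
labels = np.load("labels.npy")

coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
for lab, name in [(0, "fake"), (1, "real")]:
    mask = labels == lab
    plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=name)
plt.legend()
plt.savefig("tsne.png", dpi=300)
```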
If you find this repository useful for your research, please cite it as follows:
@article{xie2025detect,
  title={Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception},
  author={Xie, Yuankun and Fu, Ruibo and Wang, Zhiyong and Wang, Xiaopeng and Cao, Songjun and Ma, Long and Cheng, Haonan and Ye, Long},
  journal={arXiv preprint arXiv:2504.06753},
  year={2025}
}