AudioX: Diffusion Transformer for Anything-to-Audio Generation

This is the repository for "AudioX: Diffusion Transformer for Anything-to-Audio Generation".

📺 Demo Video

AudioX_DEMO.mp4

✨ Abstract

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture.
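The masked training idea described above can be illustrated with a toy sketch. The code for this repository has not been released, so this is not the authors' implementation: the function name `mask_inputs`, the `mask_ratio` parameter, the `MASK_VALUE` placeholder, and the representation of each modality as a flat list of token values are all assumptions made for illustration only.

```python
import random

# Hypothetical placeholder written in place of a masked token (an assumption).
MASK_VALUE = 0.0


def mask_inputs(inputs, mask_ratio=0.3, seed=None):
    """Independently mask tokens in every modality.

    inputs: dict mapping a modality name (e.g. "text", "video") to a list
            of token values. Real features would be vectors; scalars keep
            the sketch short.
    Returns (masked_inputs, keep_flags), where keep_flags[name][i] is True
    when token i of that modality was left untouched.
    """
    rng = random.Random(seed)
    masked, keep_flags = {}, {}
    for name, tokens in inputs.items():
        # Each token is masked with probability mask_ratio.
        keep = [rng.random() >= mask_ratio for _ in tokens]
        keep_flags[name] = keep
        masked[name] = [t if k else MASK_VALUE for t, k in zip(tokens, keep)]
    return masked, keep_flags


if __name__ == "__main__":
    batch = {"text": [1.0, 2.0, 3.0], "video": [4.0, 5.0, 6.0, 7.0]}
    masked, keep = mask_inputs(batch, mask_ratio=0.5, seed=0)
    print(masked)
    print(keep)
```

In a training loop, a model would receive `masked` as its conditioning input and would still be asked to generate the full target audio, so it has to rely on whatever modalities and tokens remain visible.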

✨ Teaser

(a) Overview of AudioX, illustrating its capabilities across various tasks. (b) Radar chart comparing the performance of different methods across multiple benchmarks. AudioX demonstrates superior Inception Scores (IS) across a diverse set of datasets in audio and music generation tasks.
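For readers unfamiliar with the metric in the radar chart, the Inception Score is commonly defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's label distribution for one generated sample and p(y) is the average of those distributions over all samples. The sketch below computes that standard definition from precomputed class probabilities. Which classifier and evaluation setup AudioX uses is not stated here, so the function name `inception_score` and its inputs are illustrative assumptions.

```python
import math


def inception_score(probs, eps=1e-12):
    """Compute exp(mean KL(p(y|x) || p(y))) from per-sample class probabilities.

    probs: list of lists; probs[i][c] is the classifier probability of
           class c for generated sample i. Each row should sum to 1.
    eps:   small constant to avoid log(0).
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y), averaged over all samples.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_total = 0.0
    for p in probs:
        kl_total += sum(
            pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
            for c, pc in enumerate(p)
            if pc > 0
        )
    return math.exp(kl_total / n)


if __name__ == "__main__":
    confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
    uninformative = [[0.5, 0.5], [0.5, 0.5]]
    print(inception_score(confident_and_diverse))
    print(inception_score(uninformative))
```

Higher scores indicate samples that the classifier labels confidently while the labels remain diverse across the set; the maximum possible value equals the number of classes.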

✨ Method

Overview of the AudioX Framework.

Code

To be released.

