Starred repositories
Official implementation of the paper: Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
The TTSDS benchmark evaluates synthetic speech quality by considering prosody, speaker identity, and intelligibility, comparing these factors with real speech and noise datasets.
Unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization (MELLE)"
Awesome Unified Multimodal Models
OmniGen2: Exploration to Advanced Multimodal Generation.
ZIQI-Eval: A Music Evaluation Benchmark for Large Language Models
Official repository for the paper - SLAP: Siamese Language-Audio Pretraining without negative samples for Music Understanding
Official implementation of "Contrastive Audio-Language Learning for Music" (ISMIR 2022)
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Mel cepstral distortion (MCD) computations in Python.
LightVoc: An Upsampling-Free GAN Vocoder Based on Conformer and Inverse Short-Time Fourier Transform
A low-bitrate single-codebook 16 kHz speech codec based on focal modulation
A collection of literature after or concurrent with Masked Autoencoder (MAE) (Kaiming He et al.).
PyTorch implementation (unofficial) of the paper "Mean Flows for One-step Generative Modeling" by Geng et al.
Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.