This is a list of speech tasks and datasets that can provide training data for generative AI (AIGC), AI model training, intelligent speech tool development, and speech applications.
I will continue to add new tasks and datasets to this repo.
You are welcome to open an issue or email me at hwang258@jhu.edu to point out any unlisted tasks and datasets!

Task | Datasets | Input Mode | Output Mode | Modeling Target | Level | Description |
---|---|---|---|---|---|---|
Accent Classification | AccentDB Extended Dataset | Audio | Label | Classification | Acoustic, Language | Accent classification involves the recognition and classification of specific speech accents. The possible answers include American, Australian, Bangla, British, Indian, Malayalam, Odiya, Telugu, or Welsh. The objective is to correctly identify these accents based on the given speech samples, contributing to a system's ability to understand and interact with various speakers. |
Accented Text-to-speech | L2-ARCTIC | Text, Audio | Audio | Generation | Acoustic, Language | Accented text-to-speech (TTS) synthesis aims to synthesize speech with a given foreign accent instead of native speech. |
Acoustic Echo Cancellation | AEC Challenge | Audio | Audio | Regression | Acoustic | Acoustic echo cancellation removes echoes, reverberation, and other unwanted added sounds from a signal that has passed through an acoustic space. |
Automatic Speech Recognition | LibriSpeech, Common Voice, VoxPopuli, MLS, Libri-light, AISHELL, GigaSpeech, CoVoST, Libriheavy, TED-LIUM, TIMIT, WenetSpeech | Audio | Text | Classification | Content | Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to process human speech into a written format. Sketches of loading an ASR dataset and computing the word error rate metric appear below the table. |
DeepFake Detection (Spoof Detection) | ASVspoof 2015 Dataset, ASVspoof 2017 Dataset, ASVspoof2019, ASVspoof2021, ASVspoof5, ADD Challenge, In-the-Wild, WaveFake, SingFake | Audio | Binary Label | Binary Classification | Acoustic | Audio deepfake detection is a task that aims to distinguish genuine utterances from fake ones via machine learning techniques. |
Dialogue Act Classification | DailyTalk Dataset | Audio | Label | Classification | Understanding | Dialogue act classification aims to identify the primary purpose or function of an utterance within its dialogue context. The aim is to identify the action in the audio; the possible answers could be question, inform, directive, or commissive. These identification tasks are important, as dialogue acts are central to understanding human conversation and dialogue-based AI system communication. |
Dialogue Act Pairing | DailyTalk Dataset | Audio, Label | Binary Label | Binary Classification | Understanding | Dialogue act pairing involves assessing the congruence of dialogue acts—that is, whether a response dialogue act is appropriate given a query dialogue act. The objective is to determine whether a given dialogue act pairing is congruent or not. The answer could either be true or false. Being able to accurately judge the appropriateness of dialogue acts is key for a universal speech model to understand and participate in human conversations effectively. |
Dialogue Emotion Classification | DailyTalk Dataset | Audio | Label | Classification | Emotion | Dialogue emotion classification is a task that assesses an AI model's ability to identify the most suitable emotion in a given dialogue extract. The main goal of this task is to correctly identify the communicated emotion in an audio clip. Possible answers include anger, disgust, fear, sadness, happiness, surprise, or no emotion. It is an evaluation of the model's capacity to interpret and distinguish emotions conveyed through speech, accounting both for linguistic content and paralinguistic indicators. |
Dysarthric Speech Assessments | UASpeech, TORGO | Audio | Scalar | Regression | Acoustic | Dysarthric speech assessments of speech intelligibility are conducted to check a patient's status and track the effectiveness of treatment. |
Dysarthric Speech Recognition | UASpeech, TORGO | Audio | Text | Classification | Content | Dysarthric speech recognition aims to transcribe the speech of people with dysarthria, a motor speech disorder caused by conditions such as Parkinson's disease or amyotrophic lateral sclerosis (ALS). |
Emotion Recognition | Multimodal EmotionLines Dataset, IEMOCAP, MELD, CREMA-D, MSP-Podcast, SAVEE, MESD, CMU-MOSEI, MEAD | Audio | Label | Classification | Emotion | Emotion recognition aims to identify the most appropriate emotional category for a given utterance. Recognizing the emotion expressed in an utterance can be quite challenging: while emotion can sometimes be identified from the linguistic content alone, the more important cues often lie in paralinguistic features such as pitch, rhythm, and other prosodic elements. For a universal speech model, understanding these paralinguistic features is crucial, as they distinguish speech from mere text in a significant manner. |
Emotional TTS | RAVDESS, EMOV-DB, LJSpeech Dataset, IEMOCAP | Text, Label | Audio | Generation | Acoustic, Emotion | Emotional text-to-speech (TTS) aims to synthesize speech with specific emotional types. |
Enhancement Detection | LibriTTS-TestClean | Audio | Binary Label | Binary Classification | Acoustic | Enhancement detection determines whether a given audio clip has been created or modified by a speech enhancement model. The expected answer is either yes or no. The task poses a challenging problem because the speech model must not only process the content of the speech but also detect minute modifications that might indicate enhancement. |
Expressive TTS | Expresso | Text, Label | Audio | Generation | Acoustic, Understanding | Expressive text-to-speech (TTS) aims to synthesize speech with specific reading types or improvised styles. |
HowFarAreYou | 3DSpeaker Dataset, Spatial LibriSpeech | Audio | Scalar | Regression | Acoustic | The HowFarAreYou task aims to ascertain the approximate distance of the speaker based on the provided audio. The response could be an exact value, such as 0.4m, 2.0m, or 4.0m. Gauging the speaker's distance provides insights into the audio's spatial characteristics, which is a crucial aspect of auditory scene analysis. |
Instruct TTS | None available | Text | Audio | Generation | Acoustic, Understanding | Instruct text-to-speech (TTS) aims to synthesize speech with varying speaking styles to better reflect human speech patterns, given a certain instruction. |
Intent Classification | FluentSpeechCommands Dataset, SLURP, ATIS, Snips | Audio | Label | Classification | Understanding | Intent classification aims to identify the actionable item behind a spoken message. The objective is to understand and categorize the intent expressed in a spoken message; recognized actions can include activate, bring, change language, deactivate, decrease, or increase. Identifying the intent accurately is pivotal for building reliable speech-based applications and interfaces. We categorize this task into three types: Action, Location, and Object. |
Keyword Spotting | Google Speech Commands V1 Dataset, LibriPhrase | Audio, Text | Binary Label | Binary Classification | Content | Keyword spotting detects keywords or phrases in phone calls or audio recordings. The detected words and phrases can then be used to adjust the urgency of a call, train employees, or gauge customer satisfaction. |
Language Identification | VoxForge Dataset, Common Voice, VoxLingua107 | Audio | Label | Classification | Language | Language identification aims to determine the language spoken in a given speech recording. This is an essential part of speech processing, as it facilitates understanding and translation across languages. The language spoken could be German, English, Spanish, Italian, Russian, or French. |
Laughter Synthesis | Laughterscape | Audio, Audio | Audio | Generation | Acoustic | Laughter synthesis aims to generate the sound of a given speaker's laughter. |
Multilingual Speech Recognition | Common Voice, VoxLingua107, MLS, FLEURS, CMU Wilderness, YODAS | Audio | Text | Classification | Content, Language | Multilingual speech recognition (MSR) involves developing systems that can accurately transcribe speech across multiple languages. Unlike traditional speech recognition systems designed for a specific language, MSR systems aim to handle diverse languages and dialects. |
MultiSpeaker Detection | LibriSpeech-TestClean Dataset, VCTK Dataset | Audio | Binary Label | Binary Classification | Speaker | MultiSpeaker detection analyzes speech audio to determine whether more than one speaker is present. It is crucial for a universal speech model to detect this, as the presence of multiple speakers can alter the context and understanding of the spoken content. |
Noise Detection | LJSpeech Dataset, VCTK Dataset, Musan Dataset | Audio | Binary Label | Binary Classification | Acoustic | Noise detection aims to identify whether the speech audio is clean or mixed with noise, i.e., to ascertain whether noise has been added to an audio file. The expected answer is either yes or no. There are many types of noise, such as music, speech, or Gaussian noise. The task poses a challenging problem because the speech model must not only process the content of the speech but also understand the degradation of the speech. |
Noise SNR Level Prediction | VCTK Dataset, Musan Dataset | Audio | Scalar | Regression | Acoustic | Noise SNR level prediction aims to predict the signal-to-noise ratio (SNR) of the speech audio. The expected answer could be zero, five, ten, or fifteen. There are many types of noise, such as music, speech, or Gaussian noise. The task poses a challenging problem because the speech model must not only process the content of the speech but also understand the degree of noise degradation. A worked SNR example appears below the table. |
Non-verbal Voice Recognition | CNVVE | Audio | Label | Classification | Content | Non-verbal voice recognition aims to recognize non-verbal or non-lexical voice expressions, like humming. |
Offensive Language Identification | OLID | Audio | Label | Classification | Understanding | Offensive language identification aims to identify the type and target of offensive language in social media content. |
Overlapping Speech Detection | AMI Meeting Corpus, DIHARD I Challenge Data, DIHARD II Challenge Data, VoxConverse | Audio | Label, Timestamp | Classification | Content, Speaker | Overlapped speech detection (OSD) estimates the onsets and offsets of segments (i.e., small parts of an audio clip) within an audio clip (i.e., an utterance, session, or conversation as a whole) where more than one speaker is speaking simultaneously. |
Reverberation Detection | LJSpeech Dataset, VCTK Dataset, RIRs Noises Dataset | Audio | Binary Label | Binary Classification | Acoustic | Reverberation detection aims to detect whether the speech audio is clean or mixed with room impulse responses (RIRs) and noise, that is, reverberation. The expected answer is either clean or noisy. The reverberation can originate from a large, medium, or small room. The task poses a challenging problem because the speech model must not only process the content of the speech but also understand the degradation of speech under reverberation. |
Sarcasm Detection | MUStARD Dataset | Audio | Binary Label | Binary Classification | Understanding | Sarcasm detection aims to detect whether sarcasm or irony is present in the speech audio. The expected answer is either true or false. The task poses a challenging problem because the speech model must understand higher-level semantic information. |
Slot Filling | SLURP, ATIS, Snips | Audio | Text | Classification | Understanding | The goal of slot filling is to identify, from a running dialogue, different slots that correspond to different parameters of the user's query. For instance, when a user queries for nearby restaurants, key slots for location and preferred food are required for a dialogue system to retrieve the appropriate information. Thus, the main challenge in the slot-filling task is to extract the target entity. |
Speaker Counting | MUStARD Dataset | Audio | Label | Classification | Speaker | Speaker counting aims to identify the total number of speakers in the speech audio. The expected answer should be one, two, three, four, or five. The task poses a challenging problem because the speech model must understand the patterns of different speakers. |
Speaker Diarization | CHIME 5, CHIME 6, DIHARD II, LibriCSS, AISHELL-4, VoxConverse | Audio | Label, Timestamp | Classification | Speaker | Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. |
Speaker Identification | LibriSpeech-TestClean Dataset, VCTK Dataset, VoxCeleb1, VoxCeleb2, CN-Celeb, AVSpeech, VoxTube | Audio | Label | Classification | Speaker | Speaker recognition deals with the identification of the speaker in an audio stream. |
Speaker Verification | LibriSpeech-TestClean Dataset, VCTK Dataset, VoxCeleb1, VoxCeleb2, CN-Celeb | Audio, Audio | Binary Label | Binary Classification | Speaker | Speaker verification aims to verify whether two given speech audios are from the same speaker, i.e., to examine whether the voice patterns in the two recordings come from the same person. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand the patterns of different speakers. A cosine-similarity sketch appears below the table. |
Speech Edit | LibriTTS, VCTK Dataset, LJSpeech Dataset | Audio, Text | Audio | Generation | Acoustic, Content | Speech editing allows the user to edit recorded speech, e.g., insert missed words, replace mispronounced words, and/or remove unwanted speech or non-speech events, without degrading the quality and naturalness of the edited speech. |
Speech Command Recognition | Google Speech Commands V1 Dataset | Audio | Label | Classification | Content | Speech command recognition aims to identify the spoken command. The expected answer should be yes, no, up, down, left, right, on, off, stop, go, zero, one, two, three, four, five, six, seven, eight, nine, bed, bird, cat, dog, happy, house, marvin, sheila, tree, wow, or silence. The task poses a challenging problem because the speech model must understand the content information in the speech audio. |
Speech Dereverberation | Reverb-WSJ0, WHAMR!, CHIME 5, CHIME 6 | Audio | Audio | Regression | Acoustic | Speech dereverberation is the process by which the effects of reverberation are removed from sound after the reverberant sound has been picked up by microphones. |
Speech Detection | LJSpeech Dataset, LibriSpeech-TestClean Dataset, LibriSpeech-TestOther Dataset, InaGVAD | Audio | Binary Label | Binary Classification | Content | Speech detection, also known as voice activity detection or speech activity detection, aims to identify whether a given audio clip contains real speech. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand not only the content information in the speech audio but also the patterns of the human voice. A toy energy-based detector is sketched below the table. |
Speech Enhancement | VoiceBank+DEMAND, DNS-Challenge, WHAM!, WHAMR! | Audio | Audio | Regression | Acoustic | Speech enhancement aims to improve speech quality by using various algorithms. The objective is to improve the intelligibility and/or overall perceptual quality of a degraded speech signal using audio signal processing techniques. |
Speech Separation | WSJ0-2mix, LibriMix, Real-M, WHAM!, WHAMR!, CHIME 5, CHIME 6, AISHELL-4 | Audio | Audio, Audio | Regression | Speaker | Speech separation is the extraction of multiple speech signals from a mixture. |
Speech Text Matching | LJSpeech Dataset, LibriSpeech-TestClean Dataset, LibriSpeech-TestOther Dataset | Audio, Text | Binary Label | Binary Classification | Content | Speech text matching aims to determine whether the speech and text are matched, i.e., whether they share the same underlying message. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand the content information in the speech audio. |
Speech-to-speech Translation | CVSS, CoVoST 2 | Audio | Audio | Generation | Language, Content | Speech-to-speech translation translates speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. |
Speech-to-text Translation | MuST-C | Audio | Text | Generation | Language, Content | Speech-to-text translation, combining automatic speech recognition (ASR) and machine translation (MT), refers to the process in which spoken language is not only converted to text but also translated into another language. |
Speech Quality Assessment | VCC2018, BVCC | Audio | Scalar | Regression | Acoustic | Speech quality assessment estimates the quality of speech, such as a mean opinion score (MOS). |
Spoken Question Answering | Spoken-SQuAD, ODSQA, NMSQA | Audio | Text | Generation | Understanding | Spoken question answering (SQA) aims to find the answer in a spoken document, given a question in either text or spoken form. SQA is crucial for personal assistants replying to users' spoken queries. |
Spoken Term Detection | LJSpeech Dataset, LibriSpeech-TestClean Dataset, LibriSpeech-TestOther Dataset | Audio, Text | Binary Label | Binary Classification | Content | Spoken term detection checks for the existence of a given word in the speech. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand the content information in the speech audio. |
Stress Detection | MIR-SD Dataset | Audio | Binary Label | Binary Classification | Acoustic | Stress detection aims to determine stress placement in English words. The expected answer should be zero, one, two, three, four, or five. For a universal speech model, understanding these paralinguistic features is crucial, as they distinguish speech from mere text in a significant manner. |
Target Speaker Extraction | WSJ0-2mix, LibriMix, Real-M, WHAM!, WHAMR!, CHIME 5, CHIME 6 | Audio, Audio | Audio | Regression | Speaker | Target speaker extraction aims to segregate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. |
Text-To-Speech Synthesis | LJ Speech, LibriTTS, AISHELL-3, LibriTTS-R, YTTTS | Text | Audio | Generation | Acoustic | Text-to-speech (TTS) synthesis converts normal language text into speech. |
Vocal Sound Classification | VocalSound | Audio | Label | Classification | Acoustic | Vocal sound classification aims at automatic recognition of human vocal sounds such as laughter, sighs, coughs, throat clearing, sneezes, and sniffs. |
Voice Conversion | LibriTTS, VCTK Dataset, ESD | Audio, Audio | Audio | Generation | Acoustic, Speaker | Voice conversion is a technology that modifies the speech of a source speaker to make it sound like that of a target speaker without changing the linguistic information. |
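
As a quick orientation for the recognition rows above, the snippet below loads one LibriSpeech example with torchaudio's documented `LIBRISPEECH` dataset class. Treat it as a minimal sketch rather than a recommended pipeline; the `./data` path is just a placeholder.

```python
# Minimal sketch: fetch one LibriSpeech utterance for ASR experiments.
# Assumes torchaudio is installed; "./data" is a placeholder download path.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)
```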
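ASR systems are typically scored with word error rate (WER): the word-level edit distance divided by the reference length. The from-scratch implementation below is a hedged illustration of the metric, not any benchmark's official scorer.

```python
# Word error rate: (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```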
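For the Noise SNR Level Prediction row, the target quantity is the signal-to-noise ratio in decibels, SNR = 10·log10(P_signal / P_noise). The sketch below mixes a clean signal with noise at a requested SNR; the random arrays merely stand in for real speech and noise.

```python
# Minimal sketch of the SNR quantity behind "Noise SNR Level Prediction".
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR in dB: 10 * log10(signal power / noise power)."""
    return 10.0 * np.log10(np.mean(signal**2) / np.mean(noise**2))

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, target_db: float) -> np.ndarray:
    """Scale the noise so the mixture sits at the requested SNR, then add it."""
    scale = np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10 ** (target_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # stand-in for 1 s of clean speech at 16 kHz
noise = rng.standard_normal(16000)       # stand-in for additive noise
mixture = mix_at_snr(clean, noise, 5.0)  # one of the task's candidate labels: 5 dB
```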
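The Speech Detection row reduces, in its simplest classical form, to thresholding short-time energy. The toy detector below only illustrates that idea; the frame sizes and thresholds are assumptions, and modern systems learn the decision instead.

```python
# Toy energy-threshold voice activity detection (illustrative values only).
import numpy as np

def frame_energies(audio: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time energy per frame (25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    return np.array([np.sum(audio[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def contains_speech(audio: np.ndarray, threshold: float = 1e-3, min_frames: int = 5) -> bool:
    """Answer the task's yes/no question: enough high-energy frames => speech."""
    return int(np.sum(frame_energies(audio) > threshold)) >= min_frames
```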
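Speaker verification is commonly implemented by embedding both utterances with a speaker encoder and thresholding their cosine similarity. In the sketch below the encoder is a random-projection stand-in (a trained model such as an x-vector or ECAPA-TDNN network would be used in practice), and the 0.7 threshold is an arbitrary assumption.

```python
# Cosine-similarity recipe behind speaker verification (encoder is a stand-in).
import numpy as np

rng = np.random.default_rng(0)
FAKE_ENCODER = rng.standard_normal((192, 16000))  # placeholder for trained weights

def embed(audio: np.ndarray) -> np.ndarray:
    """Stand-in speaker embedding; replace with a trained speaker encoder."""
    x = np.zeros(16000)
    x[: min(len(audio), 16000)] = audio[:16000]  # crop/pad to 1 s at 16 kHz
    vec = FAKE_ENCODER @ x
    return vec / np.linalg.norm(vec)

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.7) -> bool:
    """The task's yes/no answer: cosine similarity above a tuned threshold."""
    return float(np.dot(embed(a), embed(b))) >= threshold
```
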
Input | Output | Tasks |
---|---|---|
Audio | Audio | Acoustic Echo Cancellation, Speech Dereverberation, Speech Enhancement, Speech-to-speech Translation |
Audio | Audio, Audio | Speech Separation
Audio | Binary Label | DeepFake Detection (Spoof Detection), Enhancement Detection, MultiSpeaker Detection, Noise Detection, Reverberation Detection, Sarcasm Detection, Speech Detection, Stress Detection
Audio | Label | Accent Classification, Dialogue Act Classification, Dialogue Emotion Classification, Emotion Recognition, Intent Classification, Language Identification, Non-verbal Voice Recognition, Offensive Language Identification, Speaker Counting, Speaker Identification, Speech Command Recognition, Vocal Sound Classification |
Audio | Label, Timestamp | Overlapping Speech Detection, Speaker Diarization
Audio | Scalar | Dysarthric Speech Assessments, HowFarAreYou, Noise SNR Level Prediction, Speech Quality Assessment |
Audio | Text | Automatic Speech Recognition, Dysarthric Speech Recognition, Multilingual Speech Recognition, Slot Filling, Speech-to-text Translation, Spoken Question Answering |
Audio, Audio | Audio | Laughter Synthesis, Target Speaker Extraction, Voice Conversion
Audio, Audio | Binary Label | Speaker Verification |
Audio, Label | Binary Label | Dialogue Act Pairing
Audio, Text | Audio | Accented Text-to-speech, Speech Edit |
Audio, Text | Binary Label | Keyword Spotting, Speech Text Matching, Spoken Term Detection
Text | Audio | Instruct TTS, Text-To-Speech Synthesis |
Text, Label | Audio | Emotional TTS, Expressive TTS |

Level | Tasks
---|---|
Acoustic | Accent Classification, Accented Text-to-speech, Acoustic Echo Cancellation, DeepFake Detection (Spoof Detection), Dysarthric Speech Assessments, Emotional TTS, Enhancement Detection, Expressive TTS, HowFarAreYou, Instruct TTS, Laughter Synthesis, Noise Detection, Noise SNR Level Prediction, Reverberation Detection, Speech Dereverberation, Speech Edit, Speech Enhancement, Speech Quality Assessment, Stress Detection, Text-To-Speech Synthesis, Vocal Sound Classification, Voice Conversion
Content | Automatic Speech Recognition, Dysarthric Speech Recognition, Keyword Spotting, Multilingual Speech Recognition, Non-verbal Voice Recognition, Overlapping Speech Detection, Speech Command Recognition, Speech Detection, Speech Edit, Speech Text Matching, Speech-to-speech Translation, Speech-to-text Translation, Spoken Term Detection, Vocal Sound Classification
Emotion | Dialogue Emotion Classification, Emotion Recognition, Emotional TTS |
Language | Accent Classification, Accented Text-to-speech, Language Identification, Multilingual Speech Recognition, Speech-to-speech Translation, Speech-to-text Translation |
Speaker | MultiSpeaker Detection, Overlapping Speech Detection, Speaker Counting, Speaker Diarization, Speaker Identification, Speaker Verification, Speech Separation, Target Speaker Extraction, Voice Conversion |
Understanding | Dialogue Act Classification, Dialogue Act Pairing, Expressive TTS, Instruct TTS, Intent Classification, Offensive Language Identification, Sarcasm Detection, Slot Filling, Spoken Question Answering |