Mote et al., 2024
Unsupervised domain adaptation for speech emotion recognition using K-Nearest neighbors voice conversion
- Document ID: 13684695535790690964
- Authors: Mote P; Sisman B; Busso C
- Publication year: 2024
- Publication venue: Proceedings of INTERSPEECH
Snippet
Abundant speech data for speech emotion recognition (SER) is often unlabeled, rendering it ineffective for model training. Models trained on existing labeled datasets struggle with unlabeled data due to mismatches in data distributions. To avoid the cost of annotating …
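The snippet is truncated before the method itself, so the following is a minimal sketch of the core idea named in the title: kNN voice conversion (kNN-VC, Baas et al., 2023), where each frame-level feature of a source utterance is replaced by the average of its k nearest neighbors drawn from a pool of target-domain features. All function names, shapes, and the choice of cosine similarity below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of kNN feature matching, the core of kNN voice
# conversion; feature dimensions and names are illustrative assumptions.
import numpy as np

def knn_convert(source_feats: np.ndarray,
                target_pool: np.ndarray,
                k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest
    neighbors (by cosine similarity) from the target-domain pool.

    source_feats: (T, D) frame-level features of a source utterance,
                  e.g., self-supervised speech representations.
    target_pool:  (N, D) features pooled from unlabeled target data.
    """
    # L2-normalize so that dot products equal cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    pool = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)

    sims = src @ pool.T                      # (T, N) similarity matrix
    idx = np.argsort(-sims, axis=1)[:, :k]   # top-k neighbor indices per frame
    return target_pool[idx].mean(axis=1)     # (T, D) converted frames

# Toy usage: convert a 100-frame utterance against a 5000-frame pool.
rng = np.random.default_rng(0)
converted = knn_convert(rng.normal(size=(100, 768)),
                        rng.normal(size=(5000, 768)))
print(converted.shape)  # (100, 768)
```

In the domain-adaptation setting the snippet describes, converting features this way would let a model trained on a labeled source corpus consume target-style inputs without requiring any target labels.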
Classifications
- G10L15/07 — Speech recognition; training of speech recognition systems; adaptation to the speaker
- G10L15/18 — Speech recognition; speech classification or search using natural language modelling
- G06K9/6247 — Pattern recognition; extracting features by transforming the feature space based on an approximation criterion, e.g. principal component analysis
- G06F17/2872 — Processing or translating of natural language; rule-based translation
- G06K9/6268 — Pattern recognition; classification techniques relating to the classification paradigm, e.g. parametric or non-parametric approaches
- G10L17/26 — Speaker identification or verification; recognition of special voice characteristics
- G06F17/30 — Information retrieval; database structures therefor; file system structures therefor
- G10L25/66 — Speech or voice analysis for comparison or discrimination; extracting parameters related to health condition
- G10L13/06 — Speech synthesis; elementary speech units used in speech synthesisers; concatenation rules
- G06N99/005 — Learning machines, i.e. computers whose programme changes according to experience gained during a run
- G10L21/00 — Processing of the speech or voice signal to produce another signal, in order to modify its quality or intelligibility
- G10L19/00 — Speech or audio analysis-synthesis techniques for redundancy reduction; coding or decoding using source-filter models or psychoacoustic analysis
- G06N3/02 — Computer systems based on biological models using neural network models
Similar Documents
| Publication | Title |
|---|---|
| US9058811B2 | Speech synthesis with fuzzy heteronym prediction using decision trees |
| Esling et al. | Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces |
| Denisov et al. | Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal teacher-student learning |
| Mote et al. | Unsupervised domain adaptation for speech emotion recognition using K-Nearest neighbors voice conversion |
| US20230087916A1 | Transforming text data into acoustic feature |
| Abdelwahab et al. | Incremental adaptation using active learning for acoustic emotion recognition |
| CN113505611B | Training methods and systems for better speech translation models in generative adversarial |
| Lee et al. | Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities |
| Naderi et al. | Cross corpus speech emotion recognition using transfer learning and attention-based fusion of wav2vec2 and prosody features |
| Liao et al. | Incorporating symbolic sequential modeling for speech enhancement |
| Sorin et al. | Principal style components: Expressive style control and cross-speaker transfer in neural TTS |
| Fernandez-Lopez et al. | End-to-end lip-reading without large-scale data |
| Dey et al. | Cross-corpora spoken language identification with domain diversification and generalization |
| Kaur et al. | Impact of feature extraction and feature selection algorithms on Punjabi speech emotion recognition using convolutional neural network |
| Abdulsalam et al. | Speech emotion recognition using minimum extracted features |
| Liu et al. | Controllable accented text-to-speech synthesis |
| Xia et al. | Learning salient segments for speech emotion recognition using attentive temporal pooling |
| Martinez-Quezada et al. | English mispronunciation detection module using a Transformer network integrated into a chatbot |
| Cheng et al. | Audio texture manipulation by exemplar-based analogy |
| Álvarez et al. | A comparison using different speech parameters in the automatic emotion recognition using feature subset selection based on evolutionary algorithms |
| Sahu | Towards building generalizable speech emotion recognition models |
| Cohen | A survey of machine learning methods for predicting prosody in radio speech |
| Alasiry et al. | Efficient audio-visual emotion recognition approach |
| Salvi | Data-driven techniques for speech and multimodal deepfake detection |
| US20250191577A1 | Model training device, model training method and automatic speech recognition apparatus for improving speech recognition of non-native speakers |