Mote et al., 2024
Unsupervised domain adaptation for speech emotion recognition using K-Nearest neighbors voice conversion
- Document ID: 13684695535790690964
- Authors: Mote P; Sisman B; Busso C
- Publication year: 2024
- Publication venue: Proceedings of INTERSPEECH
Snippet
Abundant speech data for speech emotion recognition (SER) is often unlabeled, rendering it ineffective for model training. Models trained on existing labeled datasets struggle with unlabeled data due to mismatches in data distributions. To avoid the cost of annotating …
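The snippet is truncated before the method itself, so the following is a minimal sketch of the core idea named in the title: kNN voice conversion (kNN-VC, Baas et al., 2023), where each frame-level feature of a source utterance is replaced by the average of its k nearest neighbors drawn from a pool of target-domain features. All function names, shapes, and the choice of cosine similarity below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of kNN feature matching, the core of kNN voice
# conversion; feature dimensions and names are illustrative assumptions.
import numpy as np

def knn_convert(source_feats: np.ndarray,
                target_pool: np.ndarray,
                k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest
    neighbors (by cosine similarity) from the target-domain pool.

    source_feats: (T, D) frame-level features of a source utterance,
                  e.g., self-supervised speech representations.
    target_pool:  (N, D) features pooled from unlabeled target data.
    """
    # L2-normalize so that dot products equal cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    pool = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)

    sims = src @ pool.T                      # (T, N) similarity matrix
    idx = np.argsort(-sims, axis=1)[:, :k]   # top-k neighbor indices per frame
    return target_pool[idx].mean(axis=1)     # (T, D) converted frames

# Toy usage: convert a 100-frame utterance against a 5000-frame pool.
rng = np.random.default_rng(0)
converted = knn_convert(rng.normal(size=(100, 768)),
                        rng.normal(size=(5000, 768)))
print(converted.shape)  # (100, 768)
```

In the domain-adaptation setting the snippet describes, converting features this way would let a model trained on a labeled source corpus consume target-style inputs without requiring any target labels.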
Classifications
- G10L15/07 — Speech recognition; training of speech recognition systems; adaptation to the speaker
- G10L15/18 — Speech recognition; speech classification or search using natural language modelling
- G06K9/6247 — Pattern recognition; extracting features by transforming the feature space based on an approximation criterion, e.g. principal component analysis
- G06F17/2872 — Processing or translating of natural language; rule-based translation
- G06K9/6268 — Pattern recognition; classification techniques relating to the classification paradigm, e.g. parametric or non-parametric approaches
- G10L17/26 — Speaker identification or verification; recognition of special voice characteristics
- G06F17/30 — Information retrieval; database structures therefor; file system structures therefor
- G10L25/66 — Speech or voice analysis for comparison or discrimination; extracting parameters related to health condition
- G10L13/06 — Speech synthesis; elementary speech units used in speech synthesisers; concatenation rules
- G06N99/005 — Learning machines, i.e. computers whose programme changes according to experience gained during a run
- G10L21/00 — Processing of the speech or voice signal to produce another signal, in order to modify its quality or intelligibility
- G10L19/00 — Speech or audio analysis-synthesis techniques for redundancy reduction; coding or decoding using source-filter models or psychoacoustic analysis
- G06N3/02 — Computer systems based on biological models using neural network models
Similar Documents
| Publication | Title |
|---|---|
| US9058811B2 | Speech synthesis with fuzzy heteronym prediction using decision trees |
| Esling et al. | Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces |
| Denisov et al. | Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal teacher-student learning |
| Mote et al. | Unsupervised domain adaptation for speech emotion recognition using K-Nearest neighbors voice conversion |
| US20230087916A1 | Transforming text data into acoustic feature |
| Abdelwahab et al. | Incremental adaptation using active learning for acoustic emotion recognition |
| CN113505611B | Training methods and systems for better speech translation models in generative adversarial |
| Lee et al. | Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities |
| Naderi et al. | Cross corpus speech emotion recognition using transfer learning and attention-based fusion of wav2vec2 and prosody features |
| Liao et al. | Incorporating symbolic sequential modeling for speech enhancement |
| Sorin et al. | Principal style components: Expressive style control and cross-speaker transfer in neural TTS |
| Fernandez-Lopez et al. | End-to-end lip-reading without large-scale data |
| Dey et al. | Cross-corpora spoken language identification with domain diversification and generalization |
| Kaur et al. | Impact of feature extraction and feature selection algorithms on Punjabi speech emotion recognition using convolutional neural network |
| Abdulsalam et al. | Speech emotion recognition using minimum extracted features |
| Liu et al. | Controllable accented text-to-speech synthesis |
| Xia et al. | Learning salient segments for speech emotion recognition using attentive temporal pooling |
| Martinez-Quezada et al. | English mispronunciation detection module using a Transformer network integrated into a chatbot |
| Cheng et al. | Audio texture manipulation by exemplar-based analogy |
| Álvarez et al. | A comparison using different speech parameters in the automatic emotion recognition using feature subset selection based on evolutionary algorithms |
| Sahu | Towards building generalizable speech emotion recognition models |
| Cohen | A survey of machine learning methods for predicting prosody in radio speech |
| Alasiry et al. | Efficient audio-visual emotion recognition approach |
| Salvi | Data-driven techniques for speech and multimodal deepfake detection |
| US20250191577A1 | Model training device, model training method and automatic speech recognition apparatus for improving speech recognition of non-native speakers |