Ma et al., 2008 - Google Patents
A probabilistic principal component analysis based hidden markov model for audio-visual speech recognitionMa et al., 2008
View PDF- Document ID
- 4155981611434297575
- Author
- Ma Z
- Leijon A
- Publication year
- Publication venue
- 2008 42nd Asilomar Conference on Signals, Systems and Computers
External Links
Snippet
Lipreading is an efficient method among those proposed to improve the performance of speech recognition systems, especially in acoustic noisy environments. This paper proposes a simple audio-visual speech recognition (AVSR) system, which could improve the …
- 238000000513 principal component analysis 0 title abstract description 42
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. hidden Markov models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G06K9/00268—Feature extraction; Face representation
- G06K9/00281—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G06K9/00288—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G06K9/00268—Feature extraction; Face representation
- G06K9/00275—Holistic features and representations, i.e. based on the facial image taken as a whole
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00335—Recognising movements or behaviour, e.g. recognition of gestures, dynamic facial expressions; Lip-reading
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Potamianos et al. | Recent advances in the automatic recognition of audiovisual speech | |
Sargin et al. | Audiovisual synchronization and fusion using canonical correlation analysis | |
Neti et al. | Audio visual speech recognition | |
US7636662B2 (en) | System and method for audio-visual content synthesis | |
CN103856689B (en) | Character dialogue subtitle extraction method oriented to news video | |
Neti et al. | Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop | |
JP2001092974A (en) | Speaker recognizing method, device for executing the same, method and device for confirming audio generation | |
Gurbuz et al. | Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition | |
Tao et al. | Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection. | |
Jiang et al. | Improved face and feature finding for audio-visual speech recognition in visually challenging environments | |
Kozlov et al. | SVID speaker recognition system for NIST SRE 2012 | |
Navarathna et al. | Multiple cameras for audio-visual speech recognition in an automotive environment | |
Liu et al. | MSDWild: Multi-modal Speaker Diarization Dataset in the Wild. | |
Kanak et al. | Joint audio-video processing for biometric speaker identification | |
Wong et al. | A new multi-purpose audio-visual UNMC-VIER database with multiple variabilities | |
Gimeno-Gómez et al. | LIP-RTVE: An audiovisual database for continuous Spanish in the wild | |
Paleček et al. | Audio-visual speech recognition in noisy audio environments | |
Sui et al. | A 3D audio-visual corpus for speech recognition | |
Ma et al. | A probabilistic principal component analysis based hidden markov model for audio-visual speech recognition | |
Paleček | Experimenting with lipreading for large vocabulary continuous speech recognition | |
Stiefelhagen et al. | Audio-visual perception of a lecturer in a smart seminar room | |
Narwekar et al. | PRAV: A Phonetically Rich Audio Visual Corpus. | |
Ibrahim et al. | A lip geometry approach for feature-fusion based audio-visual speech recognition | |
Chetty et al. | Biometric person authentication with liveness detection based on audio-visual fusion | |
Rallabandi et al. | Investigating Utterance Level Representations for Detecting Intent from Acoustics. |