Abstract
Audio-visual speech recognition, which uses visual information about lip dynamics together with audio information, has attracted attention in recent years as a technique for robust speech recognition in noisy environments. Because visual information plays a large role in audio-visual speech recognition, the choice of visual feature is important. For spoken word recognition, this paper proposes using the c parameter (combined parameter) as the visual feature, extracted by an Active Appearance Model (AAM) applied to a face image that includes the lip area. The combined parameter carries both coordinate values and intensity values. The proposed feature improved the recognition rate compared with conventional features such as DCT and principal component scores. Finally, we integrated the phoneme score from audio information with the viseme score from visual information, achieving high accuracy.
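The abstract does not say how the combined parameter is computed. As a rough illustration only, the sketch below follows the standard AAM formulation rather than the authors' implementation: PCA on landmark coordinates, PCA on sampled intensities, then a third PCA on the weighted concatenation of the two score vectors. The function names, the scalar weight `w`, and the component counts are all assumptions made for this example.

```python
import numpy as np

def pca(X, k):
    """Return (mean, basis); basis holds the top-k principal directions as columns."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def combined_parameters(shapes, textures, k_s, k_g, k_c):
    """Hypothetical helper: one combined-parameter vector c per training image.

    shapes   -- (n_images, 2 * n_landmarks) landmark coordinates
    textures -- (n_images, n_pixels) shape-normalized intensity samples
    """
    s_mean, P_s = pca(shapes, k_s)
    g_mean, P_g = pca(textures, k_g)
    b_s = (shapes - s_mean) @ P_s      # shape (coordinate) scores
    b_g = (textures - g_mean) @ P_g    # texture (intensity) scores

    # Assumed scalar weight that puts coordinate and intensity units on a
    # comparable scale: ratio of total retained variances.
    w = np.sqrt(b_g.var(axis=0).sum() / b_s.var(axis=0).sum())

    b = np.hstack([w * b_s, b_g])
    b_mean, Q = pca(b, k_c)
    return (b - b_mean) @ Q            # combined parameter c, one row per image
```

Each row of the returned array mixes coordinate and intensity information in a single low-dimensional vector, which is the property the abstract attributes to the combined parameter. Fitting the model to a new frame, which a real tracker needs, is not covered here.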
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Komai, Y., Ariki, Y., Takiguchi, T. (2011). Audio-Visual Speech Recognition Based on AAM Parameter and Phoneme Analysis of Visual Feature. In: Ho, YS. (eds) Advances in Image and Video Technology. PSIVT 2011. Lecture Notes in Computer Science, vol 7087. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25367-6_9
DOI: https://doi.org/10.1007/978-3-642-25367-6_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25366-9
Online ISBN: 978-3-642-25367-6
eBook Packages: Computer Science (R0)