Abstract
Collecting an audio-visual data corpus grounded in linguistic rules is an essential first step for major research in multimedia fields such as audio-visual speech recognition (AVSR), lip synchronization, and visual speech synthesis. Building a reliable corpus that covers all phonemes in every phonemic combination of a language is a difficult and time-consuming task. To partially address this problem, this work uses VC, CV, and VCV (vowel-consonant, consonant-vowel, and vowel-consonant-vowel) combinations instead of the entire set of possible phonemic combinations, since these carry the most linguistic information. This paper describes the resulting new data corpus, which captures 14 speakers. To better capture the coarticulation effect in speech, continuous speech was recorded rather than isolated words and continuous digits. This makes the collection process considerably more time- and cost-efficient while keeping its coverage high.
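The combination strategy described above can be sketched with a short script. This is a minimal illustration, not the paper's actual prompt-generation procedure, and the phoneme sets below are hypothetical placeholders (a rough subset of Persian vowels and consonants), not the corpus's true inventory:

```python
from itertools import product

# Illustrative phoneme sets (assumed, not the paper's actual inventory):
# a rough romanization of some Persian vowels and consonants.
vowels = ["a", "e", "o", "i", "u", "A"]
consonants = ["b", "p", "t", "d", "k", "g", "m", "n", "s", "z"]

# Enumerate the three combination types the corpus targets,
# instead of all possible phonemic combinations of the language.
cv = [c + v for c, v in product(consonants, vowels)]
vc = [v + c for v, c in product(vowels, consonants)]
vcv = [v1 + c + v2 for v1, c, v2 in product(vowels, consonants, vowels)]

print(len(cv), len(vc), len(vcv))  # 60 60 360
```

Even with these small placeholder sets, the VCV list grows multiplicatively, which is why restricting recording to VC, CV, and VCV combinations (rather than all phoneme sequences) keeps corpus collection tractable.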
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Bastanfard, A., Fazel, M., Kelishami, A.A., Aghaahmadi, M. (2010). The Persian Linguistic Based Audio-Visual Data Corpus, AVA II, Considering Coarticulation. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, YP.P. (eds) Advances in Multimedia Modeling. MMM 2010. Lecture Notes in Computer Science, vol 5916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11301-7_30
Print ISBN: 978-3-642-11300-0
Online ISBN: 978-3-642-11301-7