Abstract
Collecting an audio-visual data corpus grounded in linguistic rules is an essential first step for major research in multimedia fields such as audio-visual speech recognition (AVSR), lip synchronization, and visual speech synthesis. Building a reliable corpus that covers all phonemes in every phonemic combination of a language is a difficult and time-consuming task. To partially address this problem, this work uses VC, CV, and VCV (vowel-consonant, consonant-vowel, and vowel-consonant-vowel) combinations instead of the entire set of possible phonemic combinations, since these carry the most linguistic information. This paper describes the resulting new data corpus, which captures 14 speakers. To better capture the coarticulation effect in speech, continuous speech was recorded rather than isolated words and continuous digits. This makes the collection process considerably more time- and cost-efficient while keeping its coverage high.
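The combination strategy described above can be sketched with a short script. This is a minimal illustration, not the paper's actual prompt-generation procedure, and the phoneme sets below are hypothetical placeholders (a rough subset of Persian vowels and consonants), not the corpus's true inventory:

```python
from itertools import product

# Illustrative phoneme sets (assumed, not the paper's actual inventory):
# a rough romanization of some Persian vowels and consonants.
vowels = ["a", "e", "o", "i", "u", "A"]
consonants = ["b", "p", "t", "d", "k", "g", "m", "n", "s", "z"]

# Enumerate the three combination types the corpus targets,
# instead of all possible phonemic combinations of the language.
cv = [c + v for c, v in product(consonants, vowels)]
vc = [v + c for v, c in product(vowels, consonants)]
vcv = [v1 + c + v2 for v1, c, v2 in product(vowels, consonants, vowels)]

print(len(cv), len(vc), len(vcv))  # 60 60 360
```

Even with these small placeholder sets, the VCV list grows multiplicatively, which is why restricting recording to VC, CV, and VCV combinations (rather than all phoneme sequences) keeps corpus collection tractable.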
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Bastanfard, A., Fazel, M., Kelishami, A.A., Aghaahmadi, M. (2010). The Persian Linguistic Based Audio-Visual Data Corpus, AVA II, Considering Coarticulation. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, YP.P. (eds) Advances in Multimedia Modeling. MMM 2010. Lecture Notes in Computer Science, vol 5916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11301-7_30
Print ISBN: 978-3-642-11300-0
Online ISBN: 978-3-642-11301-7