DOI: 10.1145/3639592.3639623

Multi-stage Multi-modalities Fusion of Lip, Tongue and Acoustics Information for Speech Recognition

Published: 13 April 2024

Abstract

Ultrasound tongue imaging (UTI) and lip video are commonly used to capture the visual articulatory information of speakers as a complement to the acoustic signal. However, a single UTI or lip-video stream cannot fully represent a speaker's pronunciation process. In this paper, we propose a convolutional neural network (CNN)-based framework that fuses lip and tongue movement information to represent the pronunciation process of speakers. In addition, we design a multi-stage fusion framework (MF-SR) that fuses the lip-tongue visual information with acoustic features extracted from the speech signal. To evaluate the proposed method, we conduct data-stream comparison experiments, speech-pattern comparison experiments, and data-increment experiments on the TaL1 dataset. The results show that the best word error rate (WER) of the proposed method on the audio-visual speech recognition task is 20.03%, and the best WER on the visual-only speech recognition task is 23.34%, a reduction of 1.75% compared with the baseline method. These results illustrate that the proposed method further improves the performance of lip-tongue-audio fusion speech recognition.
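The abstract outlines a two-stage architecture: a CNN-based fusion of the lip and tongue streams, followed by a fusion of that visual representation with acoustic features. Below is a rough, minimal PyTorch sketch of how such a pipeline could be wired; it is not the authors' implementation, and all encoders, fusion operators, layer sizes, input shapes, and the vocabulary size are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact architecture) of a two-stage
# multimodal fusion model: stage 1 fuses lip-video and ultrasound-tongue
# (UTI) streams with CNN encoders; stage 2 fuses the resulting visual
# features with acoustic features before a CTC output head.
# All dimensions and shapes below are illustrative assumptions.
import torch
import torch.nn as nn


class FrameCNN(nn.Module):
    """Per-frame 2D CNN encoder; the lip and UTI streams each get one instance."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, 1, H, W) -> (batch, time, out_dim)
        b, t = x.shape[:2]
        feats = self.conv(x.flatten(0, 1)).flatten(1)
        return self.proj(feats).view(b, t, -1)


class MultiStageFusionASR(nn.Module):
    def __init__(self, audio_dim: int = 80, vocab_size: int = 1000, d: int = 256):
        super().__init__()
        self.lip_enc = FrameCNN(d)
        self.uti_enc = FrameCNN(d)
        # Stage 1: fuse lip and tongue features into one visual representation.
        self.visual_fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        # Stage 2: fuse visual and acoustic features, then model temporal context.
        self.audio_proj = nn.Linear(audio_dim, d)
        self.av_fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.temporal = nn.GRU(d, d, num_layers=2, batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * d, vocab_size)

    def forward(self, lip, uti, audio):
        # lip, uti: (batch, time, 1, H, W); audio: (batch, time, audio_dim),
        # assumed already aligned to the video frame rate.
        visual = self.visual_fusion(
            torch.cat([self.lip_enc(lip), self.uti_enc(uti)], dim=-1))
        av = self.av_fusion(torch.cat([visual, self.audio_proj(audio)], dim=-1))
        out, _ = self.temporal(av)
        return self.ctc_head(out).log_softmax(dim=-1)  # CTC log-probabilities


# Shape check with dummy inputs.
if __name__ == "__main__":
    model = MultiStageFusionASR()
    lip = torch.randn(2, 50, 1, 64, 64)
    uti = torch.randn(2, 50, 1, 64, 64)
    audio = torch.randn(2, 50, 80)
    print(model(lip, uti, audio).shape)  # torch.Size([2, 50, 1000])
```

In practice the acoustic features would have to be aligned to the video frame rate (for example by stacking or downsampling acoustic frames) before the second fusion stage; the sketch simply assumes the three streams already share a common time axis.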




    Published In

    AICCC '23: Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference
    December 2023
    280 pages
    ISBN:9798400716225
    DOI:10.1145/3639592

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Articulatory Speech Recognition
    2. Lip and Tongue Fusion
    3. Multi-stage Audio-Visual Fusion

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Humanities and Social Sciences Research Planning Fund of the Ministry of Education of China
    • Humanity and Social Science Youth Foundation of Ministry of Education of China

    Conference

    AICCC 2023
