DOI: 10.1145/3639592.3639623

Multi-stage Multi-modalities Fusion of Lip, Tongue and Acoustics Information for Speech Recognition

Published: 13 April 2024

Abstract

Ultrasound tongue imaging (UTI) and lip video are commonly used to capture the visual articulatory information of speakers as a complement to the acoustic signal. However, a single UTI or lip-video stream cannot fully represent a speaker's pronunciation process. In this paper, we propose a convolutional neural network (CNN)-based framework that fuses lip and tongue movement information to represent the pronunciation process of speakers. In addition, we design a multi-stage fusion framework (MF-SR) that fuses the lip-tongue visual information with acoustic features extracted from the speech signal. To evaluate the proposed method, we conduct data-stream comparison experiments, speech-pattern comparison experiments, and data-increment experiments on the TaL1 dataset. The results show that the best word error rate (WER) of the proposed method on the audio-visual speech recognition task is 20.03%, and the best WER on the visual-only speech recognition task is 23.34%, a reduction of 1.75% compared with the baseline method. These results illustrate that the proposed method further improves the performance of lip-tongue-audio fusion speech recognition.
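The abstract outlines a two-stage architecture: a CNN-based fusion of the lip and tongue streams, followed by a fusion of that visual representation with acoustic features. Below is a rough, minimal PyTorch sketch of how such a pipeline could be wired; it is not the authors' implementation, and all encoders, fusion operators, layer sizes, input shapes, and the vocabulary size are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact architecture) of a two-stage
# multimodal fusion model: stage 1 fuses lip-video and ultrasound-tongue
# (UTI) streams with CNN encoders; stage 2 fuses the resulting visual
# features with acoustic features before a CTC output head.
# All dimensions and shapes below are illustrative assumptions.
import torch
import torch.nn as nn


class FrameCNN(nn.Module):
    """Per-frame 2D CNN encoder; the lip and UTI streams each get one instance."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, 1, H, W) -> (batch, time, out_dim)
        b, t = x.shape[:2]
        feats = self.conv(x.flatten(0, 1)).flatten(1)
        return self.proj(feats).view(b, t, -1)


class MultiStageFusionASR(nn.Module):
    def __init__(self, audio_dim: int = 80, vocab_size: int = 1000, d: int = 256):
        super().__init__()
        self.lip_enc = FrameCNN(d)
        self.uti_enc = FrameCNN(d)
        # Stage 1: fuse lip and tongue features into one visual representation.
        self.visual_fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        # Stage 2: fuse visual and acoustic features, then model temporal context.
        self.audio_proj = nn.Linear(audio_dim, d)
        self.av_fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.temporal = nn.GRU(d, d, num_layers=2, batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * d, vocab_size)

    def forward(self, lip, uti, audio):
        # lip, uti: (batch, time, 1, H, W); audio: (batch, time, audio_dim),
        # assumed already aligned to the video frame rate.
        visual = self.visual_fusion(
            torch.cat([self.lip_enc(lip), self.uti_enc(uti)], dim=-1))
        av = self.av_fusion(torch.cat([visual, self.audio_proj(audio)], dim=-1))
        out, _ = self.temporal(av)
        return self.ctc_head(out).log_softmax(dim=-1)  # CTC log-probabilities


# Shape check with dummy inputs.
if __name__ == "__main__":
    model = MultiStageFusionASR()
    lip = torch.randn(2, 50, 1, 64, 64)
    uti = torch.randn(2, 50, 1, 64, 64)
    audio = torch.randn(2, 50, 80)
    print(model(lip, uti, audio).shape)  # torch.Size([2, 50, 1000])
```

In practice the acoustic features would have to be aligned to the video frame rate (for example by stacking or downsampling acoustic frames) before the second fusion stage; the sketch simply assumes the three streams already share a common time axis.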




    Published In

    AICCC '23: Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference
    December 2023
    280 pages
    ISBN:9798400716225
    DOI:10.1145/3639592

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Articulatory Speech Recognition
    2. Lip and Tongue Fusion
    3. Multi-stage Audio-Visual Fusion

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Humanities and Social Sciences Research Planning Fund of the Ministry of Education of China
    • Humanity and Social Science Youth Foundation of Ministry of Education of China

    Conference

    AICCC 2023
