Abstract
Silent speech interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. Extracting information about the tongue movement requires us to efficiently process the whole sequence of images, not just a single image. Several approaches have been suggested for processing such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers, which process the images separately, with recurrent layers (e.g., an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that extracts information along both the spatial and the temporal axes in parallel, achieving a similar accuracy while being less time-consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types on a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.
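To make the hybrid architecture concrete, the sketch below assembles a small 3D-CNN + ConvLSTM network of the kind the abstract describes, in Keras (the toolkit the paper builds on). It is a minimal illustrative sketch, not the authors' exact architecture: the 25-frame 64×128 input geometry, the layer widths, kernel shapes, and the 80-dimensional regression target are all assumptions made here for illustration.

```python
# Minimal sketch of a hybrid 3D-CNN + ConvLSTM network (assumed
# hyperparameters, not the exact model from the paper).
import numpy as np
from tensorflow.keras import layers, models

n_frames, height, width = 25, 64, 128  # assumed ultrasound video chunk geometry
n_targets = 80                          # assumed size of the spectral target vector

model = models.Sequential([
    # 3D convolution extracts joint spatio-temporal features from the video chunk
    layers.Conv3D(32, kernel_size=(3, 3, 3), strides=(1, 2, 2),
                  activation='relu',
                  input_shape=(n_frames, height, width, 1)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    # ConvLSTM fuses information along time while preserving spatial structure:
    # convolution replaces the matrix multiplications inside the LSTM gates
    layers.ConvLSTM2D(32, kernel_size=(3, 3), return_sequences=False),
    layers.Flatten(),
    layers.Dense(500, activation='relu'),
    # linear output layer for regressing the speech synthesis parameters
    layers.Dense(n_targets, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')
model.summary()

# a dummy batch demonstrates the expected tensor shapes
x = np.zeros((2, n_frames, height, width, 1), dtype='float32')
print(model.predict(x).shape)  # (2, 80)
```

Note the division of labor: the 3D convolution looks at short local spatio-temporal patterns in parallel, while the ConvLSTM integrates them over the whole sequence, which is what distinguishes this hybrid from a pure 2D-CNN + LSTM stack.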
Notes
1. The dataset is available upon request from csapot@tmit.bme.hu.
Acknowledgments
Project no. TKP2021-NVA-09 has been implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021-NVA funding scheme, and also within the framework of the Artificial Intelligence National Laboratory Programme. The RTX A5000 GPU used in the experiments was donated by NVIDIA.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Shandiz, A.H., Tóth, L. (2022). Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks. In: Fujita, H., Fournier-Viger, P., Ali, M., Wang, Y. (eds.) Advances and Trends in Artificial Intelligence. Theory and Practices in Artificial Intelligence. IEA/AIE 2022. Lecture Notes in Computer Science, vol. 13343. Springer, Cham. https://doi.org/10.1007/978-3-031-08530-7_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08529-1
Online ISBN: 978-3-031-08530-7
eBook Packages: Computer Science, Computer Science (R0)