
Collaborative Viseme Subword and End-to-End Modeling for Word-Level Lip Reading

Published: 01 January 2024

Abstract

We propose a viseme subword modeling (VSM) approach to improve the generalizability and interpretability of deep neural network based lip reading. A comprehensive analysis of preliminary experimental results reveals the complementary nature of the conventional end-to-end (E2E) and proposed VSM frameworks, especially with respect to speaker head movements. To increase lip reading accuracy, we propose hybrid viseme subword and end-to-end modeling (HVSEM), which exploits the strengths of both approaches through multitask learning. As an extension to HVSEM, we also propose collaborative viseme subword and end-to-end modeling (CVSEM), which further explores the synergy between the VSM and E2E frameworks by integrating a state-mapped temporal mask (SMTM) into joint modeling. Experimental evaluations using different model backbones on both the LRW and LRW-1000 datasets confirm the superior performance and generalizability of the proposed frameworks. Specifically, VSM outperforms the baseline E2E framework, while HVSEM, which hybridizes VSM and E2E modeling, outperforms VSM. Building on HVSEM, CVSEM further achieves accuracies of 90.75% and 58.89% on LRW and LRW-1000, respectively, setting new benchmarks for both datasets.
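As described above, HVSEM couples a word-level E2E branch with a viseme subword branch through multitask learning. The sketch below illustrates, in a minimal and hypothetical form, how such a two-branch multitask objective could be wired up in PyTorch: a shared visual backbone feeds an utterance-level word classifier and a frame-level viseme classifier, and the two losses are combined with a fixed weight. The module names, the GRU placeholder backbone, and the weight `lambda_vsm` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical two-branch multitask objective for word-level lip reading:
# a shared backbone feeds (i) an end-to-end word classifier and
# (ii) a frame-level viseme-subword classifier.
import torch
import torch.nn as nn


class HybridLipReader(nn.Module):
    def __init__(self, feat_dim=512, num_words=1000, num_visemes=40):
        super().__init__()
        # Placeholder temporal backbone over pre-extracted frame features
        # (B, T, feat_dim); a real system would use a 3D-CNN/ResNet front-end.
        self.backbone = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.word_head = nn.Linear(feat_dim, num_words)      # E2E branch (per clip)
        self.viseme_head = nn.Linear(feat_dim, num_visemes)  # VSM branch (per frame)

    def forward(self, frames):
        feats, _ = self.backbone(frames)                  # (B, T, feat_dim)
        word_logits = self.word_head(feats.mean(dim=1))   # temporal pooling -> (B, num_words)
        viseme_logits = self.viseme_head(feats)           # (B, T, num_visemes)
        return word_logits, viseme_logits


def multitask_loss(word_logits, viseme_logits, word_labels, viseme_labels, lambda_vsm=0.5):
    """Weighted sum of the E2E word loss and the frame-level viseme loss."""
    e2e_loss = nn.functional.cross_entropy(word_logits, word_labels)
    vsm_loss = nn.functional.cross_entropy(
        viseme_logits.reshape(-1, viseme_logits.size(-1)), viseme_labels.reshape(-1)
    )
    return e2e_loss + lambda_vsm * vsm_loss
```

In this sketch the frame-level viseme labels are assumed to come from an external alignment (e.g., HMM-based forced alignment); the actual VSM, HVSEM, and CVSEM designs, including the state-mapped temporal mask used in CVSEM, are detailed in the paper itself.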



Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 11427 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2024

Qualifiers

  • Research-article
