
Collaborative Viseme Subword and End-to-End Modeling for Word-Level Lip Reading

Published: 01 January 2024

Abstract

We propose a viseme subword modeling (VSM) approach to improve the generalizability and interpretability of deep neural network based lip reading. A comprehensive analysis of preliminary experimental results reveals the complementary nature of the conventional end-to-end (E2E) and proposed VSM frameworks, especially with respect to speaker head movements. To increase lip reading accuracy, we propose hybrid viseme subword and end-to-end modeling (HVSEM), which exploits the strengths of both approaches through multitask learning. As an extension to HVSEM, we also propose collaborative viseme subword and end-to-end modeling (CVSEM), which further explores the synergy between the VSM and E2E frameworks by integrating a state-mapped temporal mask (SMTM) into joint modeling. Experimental evaluations using different model backbones on both the LRW and LRW-1000 datasets confirm the superior performance and generalizability of the proposed frameworks. Specifically, VSM outperforms the baseline E2E framework, while HVSEM, which hybridizes VSM and E2E modeling, outperforms VSM. Building on HVSEM, CVSEM further achieves accuracies of 90.75% and 58.89% on LRW and LRW-1000, respectively, setting new benchmarks for both datasets.
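As described above, HVSEM couples a word-level E2E branch with a viseme subword branch through multitask learning. The sketch below illustrates, in a minimal and hypothetical form, how such a two-branch multitask objective could be wired up in PyTorch: a shared visual backbone feeds an utterance-level word classifier and a frame-level viseme classifier, and the two losses are combined with a fixed weight. The module names, the GRU placeholder backbone, and the weight `lambda_vsm` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical two-branch multitask objective for word-level lip reading:
# a shared backbone feeds (i) an end-to-end word classifier and
# (ii) a frame-level viseme-subword classifier.
import torch
import torch.nn as nn


class HybridLipReader(nn.Module):
    def __init__(self, feat_dim=512, num_words=1000, num_visemes=40):
        super().__init__()
        # Placeholder temporal backbone over pre-extracted frame features
        # (B, T, feat_dim); a real system would use a 3D-CNN/ResNet front-end.
        self.backbone = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.word_head = nn.Linear(feat_dim, num_words)      # E2E branch (per clip)
        self.viseme_head = nn.Linear(feat_dim, num_visemes)  # VSM branch (per frame)

    def forward(self, frames):
        feats, _ = self.backbone(frames)                  # (B, T, feat_dim)
        word_logits = self.word_head(feats.mean(dim=1))   # temporal pooling -> (B, num_words)
        viseme_logits = self.viseme_head(feats)           # (B, T, num_visemes)
        return word_logits, viseme_logits


def multitask_loss(word_logits, viseme_logits, word_labels, viseme_labels, lambda_vsm=0.5):
    """Weighted sum of the E2E word loss and the frame-level viseme loss."""
    e2e_loss = nn.functional.cross_entropy(word_logits, word_labels)
    vsm_loss = nn.functional.cross_entropy(
        viseme_logits.reshape(-1, viseme_logits.size(-1)), viseme_labels.reshape(-1)
    )
    return e2e_loss + lambda_vsm * vsm_loss
```

In this sketch the frame-level viseme labels are assumed to come from an external alignment (e.g., HMM-based forced alignment); the actual VSM, HVSEM, and CVSEM designs, including the state-mapped temporal mask used in CVSEM, are detailed in the paper itself.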



Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 11427 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2024

Qualifiers

  • Research-article
