DOI: 10.1145/3611659.3615682
Research Article · Public Access

Visual Hearing Aids: Artificial Visual Speech Stimuli for Audiovisual Speech Perception in Noise

Published: 09 October 2023

Abstract

Speech perception is optimal in quiet environments, but noise can impair comprehension and increase errors. In these situations, lip reading can help, but it is not always possible, such as during an audio call or when wearing a face mask. One approach to improving speech perception in these situations is an artificial visual lip-reading aid. In this paper, we present a user study (N = 17) in which we compared three levels of audio stimulus visualization and two levels of modulating the visualization's appearance based on the speech signal, and compared them against two control conditions: an audio-only condition and a real human speaking. We measured participants’ speech reception thresholds (SRTs) to understand the effects of these visualizations on speech perception in noise. These thresholds indicate the decibel level of the speech signal a listener needs to receive the speech correctly 50% of the time. Additionally, we measured the usability and user experience of the approaches. We found that the artificial visualizations improved participants’ speech reception compared to the audio-only baseline, but speech reception remained significantly poorer than with the real human speaker. This suggests that such visualizations can improve speech perception when the speaker’s face is not available. However, we also discuss the limitations of current plug-and-play lip sync software and of abstract representations of the speaker in the context of speech perception.
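To make the SRT definition above concrete, the sketch below shows one common way such a threshold is estimated: an adaptive up-down (staircase) procedure that lowers the signal-to-noise ratio after a correct response and raises it after an incorrect one, so the track converges on the level at which the listener is correct about 50% of the time. This is a minimal, hypothetical Python illustration; the function names (estimate_srt, simulated_listener), the 2 dB step size, and the trial count are assumptions for demonstration, not the procedure reported in the paper.

import math
import random

def estimate_srt(run_trial, start_snr_db=0.0, step_db=2.0, n_trials=20):
    """Estimate a speech reception threshold (SRT) with a simple
    1-up/1-down adaptive track: lower the SNR after a correct response,
    raise it after an incorrect one. The SRT is taken here as the mean
    presentation level across the adaptive trials (illustrative choice)."""
    snr_db = start_snr_db
    levels = []
    for _ in range(n_trials):
        levels.append(snr_db)
        correct = run_trial(snr_db)  # present one sentence at this SNR, score the response
        snr_db += -step_db if correct else step_db
    return sum(levels) / len(levels)

# Toy stand-in for a listener: the probability of a correct response
# rises with SNR following a logistic psychometric function.
def simulated_listener(snr_db, true_srt_db=-6.0, slope=0.5):
    p_correct = 1.0 / (1.0 + math.exp(-slope * (snr_db - true_srt_db)))
    return random.random() < p_correct

if __name__ == "__main__":
    print(f"Estimated SRT: {estimate_srt(simulated_listener):.1f} dB SNR")

With the simulated listener following a logistic psychometric function, the estimate should land near the assumed true SRT of -6 dB SNR; in a real experiment, run_trial would present a sentence in noise at the given SNR and score the participant's repetition.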



    Published In

    VRST '23: Proceedings of the 29th ACM Symposium on Virtual Reality Software and Technology
    October 2023
    542 pages
    ISBN: 9798400703287
    DOI: 10.1145/3611659
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 October 2023


    Author Tags

    1. Speech perception
    2. background noise
    3. hearing
    4. speechreading
    5. user study
    6. virtual humans
    7. visualizations

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    VRST 2023

    Acceptance Rates

    Overall Acceptance Rate 66 of 254 submissions, 26%

