DOI: 10.1145/3630106.3658996
research-article
Open access

Careless Whisper: Speech-to-Text Hallucination Harms

Published: 05 June 2024

Abstract

Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service that, as of 2023, outperformed industry competitors. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences that did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations, a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.
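To make the measurement concrete, below is a minimal sketch of the kind of pipeline the abstract describes: transcribing a clip with Whisper and estimating a speaker's share of non-vocal time with a voice activity detector. This is an illustration only, not the authors' code; the use of the open-source openai-whisper package, the Silero VAD model, the chosen model size, and the file name are all assumptions.

```python
# Minimal sketch (not the authors' pipeline): transcribe one clip with Whisper
# and estimate the fraction of the clip with no detected speech ("non-vocal share").
import torch
import whisper

AUDIO_PATH = "example.wav"  # hypothetical input clip
SAMPLE_RATE = 16000

# 1. Transcribe with an off-the-shelf Whisper model (model size is an assumption).
asr_model = whisper.load_model("base")
result = asr_model.transcribe(AUDIO_PATH)
print("Transcript:", result["text"])

# 2. Detect speech regions with Silero VAD, loaded via torch.hub.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio(AUDIO_PATH, sampling_rate=SAMPLE_RATE)
speech_segments = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

# 3. Non-vocal share = 1 - (voiced seconds / total seconds).
voiced_seconds = sum(seg["end"] - seg["start"] for seg in speech_segments) / SAMPLE_RATE
total_seconds = len(wav) / SAMPLE_RATE
non_vocal_share = 1.0 - voiced_seconds / total_seconds
print(f"Non-vocal share of clip: {non_vocal_share:.2%}")
```

Over a corpus, one would then compare this non-vocal share between aphasia and control speakers against the hallucination rate observed per transcript.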



Information

Published In

FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
June 2024
2580 pages
ISBN:9798400704505
DOI:10.1145/3630106
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 June 2024


Author Tags

  1. Algorithmic Fairness
  2. Automated Speech Recognition
  3. Generative AI
  4. Thematic Coding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Pulitzer Center
  • Cornell Center for Social Sciences

Conference

FAccT '24


Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 2,129
  • Downloads (Last 6 weeks): 1,337

Reflects downloads up to 11 Dec 2024

Cited By

  • (2024) The uses and misuses of artificial intelligence in psychiatry: Promises and challenges. Australasian Psychiatry. https://doi.org/10.1177/10398562241280348. Online publication date: 2-Sep-2024.
  • (2024) The AI Act in a law enforcement context: The case of automatic speech recognition for transcribing investigative interviews. Forensic Science International: Synergy 9, 100563. https://doi.org/10.1016/j.fsisyn.2024.100563. Online publication date: 2024.
  • (2024) Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription. Document Analysis and Recognition – ICDAR 2024 Workshops, 57–70. https://doi.org/10.1007/978-3-031-70642-4_4. Online publication date: 30-Aug-2024.
  • (2024) Quantification of Automatic Speech Recognition System Performance on d/Deaf and Hard of Hearing Speech. The Laryngoscope. https://doi.org/10.1002/lary.31713. Online publication date: 19-Aug-2024.
