CARTGPT: Improving CART Captioning using Large Language Models

Published: 27 October 2024

Abstract

Communication Access Realtime Translation (CART) is a real-time captioning technology widely used by deaf and hard of hearing (DHH) people because of its accuracy, reliability, and ability to provide a holistic view of the conversational environment (e.g., by displaying speaker names). However, in many real-world situations (e.g., noisy environments, long meetings), CART captioning accuracy can decline considerably, affecting DHH people's comprehension. In this work-in-progress paper, we introduce CARTGPT, a system that assists CART captioners in improving their transcription accuracy. CARTGPT takes error-prone CART captions and inaccurate automatic speech recognition (ASR) captions as input and uses a large language model to generate corrected captions in real time. We quantified performance on a noisy speech dataset, showing that our system outperforms both CART (+5.6% accuracy) and a state-of-the-art ASR model (+17.3%). A preliminary evaluation with three DHH users further demonstrates the promise of our approach.
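To make the pipeline concrete, below is a minimal, hypothetical sketch of the correction step described above: one error-prone CART caption segment and the corresponding ASR output for the same audio are combined into a prompt and sent to a large language model, which returns a corrected caption. The prompt wording, the correct_caption helper, and the model choice are illustrative assumptions only; the abstract states that a large language model is used but does not specify the authors' actual prompts or implementation.

# Hypothetical sketch of the caption-correction step described in the abstract.
# Fuses an imperfect CART transcript with an (also imperfect) ASR transcript and
# asks an LLM to produce a corrected caption segment. Prompt text, model name,
# and function names are illustrative assumptions, not the authors' code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correct_caption(cart_segment: str, asr_segment: str) -> str:
    """Return an LLM-corrected caption for one short real-time segment."""
    prompt = (
        "Two imperfect transcripts of the same spoken segment are given.\n"
        f"CART (human stenographer): {cart_segment}\n"
        f"ASR (automatic): {asr_segment}\n"
        "Merge them into a single corrected caption. Output only the caption."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output, suitable for captioning
    )
    return response.choices[0].message.content.strip()

# Example usage on a single noisy segment:
# print(correct_caption("the pay shunt has a cuff", "the patient has a cough"))

A streaming deployment would invoke this correction on each short segment as it arrives, so that corrected captions are produced in near real time, as the abstract describes.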

Information

Published In

ASSETS '24: Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility
October 2024
1475 pages
ISBN:9798400706776
DOI:10.1145/3663548
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2024

Badges

  • Best Poster

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

ASSETS '24

Acceptance Rates

Overall Acceptance Rate 436 of 1,556 submissions, 28%
