CARTGPT: Improving CART Captioning using Large Language Models

Published: 27 October 2024

Abstract

Communication Access Realtime Translation (CART) is a real-time captioning technology widely used by deaf and hard of hearing (DHH) people because of its accuracy, reliability, and ability to provide a holistic view of the conversational environment (e.g., by displaying speaker names). However, in many real-world situations (e.g., noisy environments, long meetings), CART captioning accuracy can decline considerably, affecting DHH people's comprehension. In this work-in-progress paper, we introduce CARTGPT, a system that assists CART captioners in improving their transcription accuracy. CARTGPT takes error-prone CART captions and inaccurate automatic speech recognition (ASR) captions as input and uses a large language model to generate corrected captions in real time. We quantified performance on a noisy speech dataset, showing that our system outperforms both CART (+5.6% accuracy) and a state-of-the-art ASR model (+17.3%). A preliminary evaluation with three DHH users further demonstrates the promise of our approach.
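To make the pipeline concrete, below is a minimal, hypothetical sketch of the correction step described above: one error-prone CART caption segment and the corresponding ASR output for the same audio are combined into a prompt and sent to a large language model, which returns a corrected caption. The prompt wording, the correct_caption helper, and the model choice are illustrative assumptions only; the abstract states that a large language model is used but does not specify the authors' actual prompts or implementation.

# Hypothetical sketch of the caption-correction step described in the abstract.
# Fuses an imperfect CART transcript with an (also imperfect) ASR transcript and
# asks an LLM to produce a corrected caption segment. Prompt text, model name,
# and function names are illustrative assumptions, not the authors' code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correct_caption(cart_segment: str, asr_segment: str) -> str:
    """Return an LLM-corrected caption for one short real-time segment."""
    prompt = (
        "Two imperfect transcripts of the same spoken segment are given.\n"
        f"CART (human stenographer): {cart_segment}\n"
        f"ASR (automatic): {asr_segment}\n"
        "Merge them into a single corrected caption. Output only the caption."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output, suitable for captioning
    )
    return response.choices[0].message.content.strip()

# Example usage on a single noisy segment:
# print(correct_caption("the pay shunt has a cuff", "the patient has a cough"))

A streaming deployment would invoke this correction on each short segment as it arrives, so that corrected captions are produced in near real time, as the abstract describes.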

Information

Published In

ASSETS '24: Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility
October 2024
1475 pages
ISBN:9798400706776
DOI:10.1145/3663548
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2024

Badges

  • Best Poster

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

ASSETS '24

Acceptance Rates

Overall Acceptance Rate 436 of 1,556 submissions, 28%
