
research-article

Scribe: deep integration of human and machine intelligence to caption speech in real time

Authors: Walter S. Lasecki, Christopher D. Miller, Raja Kushalnagar, Jeffrey P. Bigham

Communications of the ACM, Volume 60, Issue 9

Pages 93 - 100

https://doi.org/10.1145/3068663

Published: 23 August 2017

Abstract

Quickly converting speech to text allows deaf and hard of hearing people to interactively follow along with live speech. Doing so reliably requires a combination of perception, understanding, and speed that neither humans nor machines possess alone. In this article, we discuss how our Scribe system combines human labor and machine intelligence in real time to reliably convert speech to text with less than 4s latency. To achieve this speed while maintaining high accuracy, Scribe integrates automated assistance in two ways. First, its user interface directs workers to different portions of the audio stream, slows down the portion they are asked to type, and adaptively determines segment length based on typing speed. Second, it automatically merges the partial input of multiple workers into a single transcript using a custom version of multiple-sequence alignment. Scribe illustrates the broad potential for deeply interleaving human labor and machine intelligence to provide intelligent interactive services that neither can currently achieve alone.
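
As a rough illustration of the second idea, the sketch below merges overlapping partial word sequences typed by several workers into one transcript. It is not Scribe's algorithm: the abstract describes a custom version of multiple-sequence alignment, whereas this sketch swaps in greedy progressive pairwise alignment built on Python's standard `difflib.SequenceMatcher`. The function names `merge_pair` and `merge_partials`, the sample phrases, and the rule for resolving disagreements are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a, b):
    """Align two word lists and keep the words contributed by either one.

    Simplified stand-in for multiple-sequence alignment, not Scribe's method.
    """
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "delete"):
            # Words both workers agree on, or words only the first one typed.
            merged.extend(a[i1:i2])
        elif tag == "insert":
            # Words only the second worker typed.
            merged.extend(b[j1:j2])
        else:
            # "replace": the workers disagree. Hypothetical tie-break:
            # keep the longer span, preferring the first on equal length.
            merged.extend(a[i1:i2] if (i2 - i1) >= (j2 - j1) else b[j1:j2])
    return merged


def merge_partials(partials):
    """Fold the partial transcripts together from left to right."""
    return reduce(merge_pair, partials, [])


# Hypothetical partial captions, each covering an overlapping slice of audio.
partials = [
    "the quick brown fox".split(),
    "brown fox jumps over".split(),
    "jumps over the lazy dog".split(),
]
print(" ".join(merge_partials(partials)))
# → the quick brown fox jumps over the lazy dog
```

Folding the inputs pairwise makes the result depend on the order in which partial transcripts arrive, and a real system would also need timing information and a principled way to vote on conflicting words. The sketch only shows why overlapping coverage lets partial inputs be stitched into one sequence.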

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Graphical user interfaces

Recommendations

E-Scribe: ubiquitous real-time speech transcription for the hearing-impaired
ICCHP'10: Proceedings of the 12th international conference on Computers helping people with special needs

Availability of real-time speech transcription anywhere, anytime, represents a potentially life-changing opportunity for the hearing-impaired to improve their communication capability. We present e-Scribe, a prototype web-based online centre for real-...

Legion scribe: real-time captioning by the non-experts
W4A '13: Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility

Real-time captioning provides people who are deaf or hard of hearing access to aural speech in the classroom and at live events. The only reliable approach currently is to recruit a local or remote expert stenographer who is able to type at natural ...

Legion scribe: real-time captioning by non-experts
ASSETS '14: Proceedings of the 16th international ACM SIGACCESS conference on Computers & accessibility

The promise of affordable, automatic approaches to real-time captioning imagines a future in which deaf and hard of hearing (DHH) users have immediate access to speech in the world around them by simply picking up their phone or other mobile device. ...

Information

Published In

Communications of the ACM, Volume 60, Issue 9
September 2017
94 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3134526

Editor: Andrew A. Chien
Association for Computing Machinery, New York, NY

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
University of Michigan
Google

Bibliometrics & Citations

Article Metrics

Total Citations: 15
Total Downloads: 14,533
Downloads (Last 12 months): 1,598
Downloads (Last 6 weeks): 114

Reflects downloads up to 25 Dec 2024

Cited By

Bolaño García M, Duarte Acosta N, González Castro K (2023). Scientific production on the use of ICT as a tool for social inclusion for deaf people: a bibliometric analysis. Salud Ciencia y Tecnología. Online publication date: 21-Mar-2023.
https://doi.org/10.56294/saludcyt2023318

Longjaroen W, Chay-Intr T, Funakoshi K, Chotimongkol A, Usanavasin S (2023). Simple2In1: A Simple Method for Fusing Two Sequences from Different Captioning Systems into One Sequence for a Small-scale Thai Dataset. 2023 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 1-6. Online publication date: 27-Nov-2023.
https://doi.org/10.1109/iSAI-NLP60301.2023.10355030

Lian J (2022). Optimization of Music Teaching Management System for College Students Based on Similarity Distribution Method. Mathematical Problems in Engineering 2022, 1-11. Online publication date: 22-Mar-2022.
https://doi.org/10.1155/2022/6444554

Li F, Lu C, Lu Z, Carrington P, Truong K (2022). An Exploration of Captioning Practices and Challenges of Individual Content Creators on YouTube for People with Hearing Impairments. Proceedings of the ACM on Human-Computer Interaction 6:CSCW1, 1-26. Online publication date: 7-Apr-2022.
https://dl.acm.org/doi/10.1145/3512922

Sadiq S, Aryani A, Demartini G, Hua W, Indulska M, Burton-Jones A, Khosravi H, Benavides-Prado D, Sellis T, Someh I, Vaithianathan R, Wang S, Zhou X (2022). Information Resilience: the nexus of responsible and agile approaches to information use. The VLDB Journal — The International Journal on Very Large Data Bases 31:5, 1059-1084. Online publication date: 16-Jan-2022.
https://dl.acm.org/doi/10.1007/s00778-021-00720-2

Liu H, Lai V, Tan C (2021). Understanding the Effect of Out-of-distribution Examples and Interactive Explanations on Human-AI Decision Making. Proceedings of the ACM on Human-Computer Interaction 5:CSCW2, 1-45. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3479552

Sayin B, Krivosheev E, Ramirez J, Casati F, Taran E, Malanina V, Yang J (2021). Crowd-Powered Hybrid Classification Services: Calibration is all you need. 2021 IEEE International Conference on Web Services (ICWS), 42-50. Online publication date: Sep-2021.
https://doi.org/10.1109/ICWS53863.2021.00019

Nezami M, Rukham R (2021). Crowdsourced NLP Retraining Engine in Chatbots. Proceedings of International Conference on Emerging Technologies and Intelligent Systems, 311-320. Online publication date: 3-Dec-2021.
https://doi.org/10.1007/978-3-030-85990-9_26

Kumar A, Bawa S (2020). DAIS: dynamic access and integration services framework for cloud-oriented storage systems. Cluster Computing 23:4, 3289-3308. Online publication date: 1-Dec-2020.
https://dl.acm.org/doi/10.1007/s10586-020-03088-0

Lee S, Willette A, Koutra D, Lasecki W, Dow S, Maher M, Kerne A, Latulipe C (2019). The Effect of Social Interaction on Facilitating Audience Participation in a Live Music Performance. Proceedings of the 2019 Conference on Creativity and Cognition, 108-120. Online publication date: 13-Jun-2019.
https://dl.acm.org/doi/10.1145/3325480.3325509
