Scribe: deep integration of human and machine intelligence to caption speech in real time

Published: 23 August 2017

Abstract

Quickly converting speech to text allows deaf and hard of hearing people to interactively follow along with live speech. Doing so reliably requires a combination of perception, understanding, and speed that neither humans nor machines possess alone. In this article, we discuss how our Scribe system combines human labor and machine intelligence in real time to reliably convert speech to text with less than 4s latency. To achieve this speed while maintaining high accuracy, Scribe integrates automated assistance in two ways. First, its user interface directs workers to different portions of the audio stream, slows down the portion they are asked to type, and adaptively determines segment length based on typing speed. Second, it automatically merges the partial input of multiple workers into a single transcript using a custom version of multiple-sequence alignment. Scribe illustrates the broad potential for deeply interleaving human labor and machine intelligence to provide intelligent interactive services that neither can currently achieve alone.
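The merging step described above can be illustrated with a small sketch. This is not Scribe's actual algorithm: the abstract mentions a custom version of multiple-sequence alignment, whereas the code below assumes a much simpler greedy approach that folds workers' partial transcripts together one pair at a time using Python's standard `difflib.SequenceMatcher`. The function names `merge_pair` and `merge_transcripts` and the sample inputs are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a, b):
    """Align two word lists and return their ordered union.

    Assumption: words typed by both workers mark the overlap; words typed
    by only one worker are kept in place around that overlap.
    """
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both workers agree on this run of words; keep one copy.
            merged.extend(a[i1:i2])
        else:
            # Only one side has these words (or the sides disagree).
            # This sketch keeps both sides, a's words first; a real
            # system would need to resolve conflicts such as typos.
            merged.extend(a[i1:i2])
            merged.extend(b[j1:j2])
    return merged


def merge_transcripts(partials):
    """Fold a list of partial transcripts (word lists) into one."""
    return reduce(merge_pair, partials, [])


if __name__ == "__main__":
    # Hypothetical partial inputs from three workers, in stream order.
    workers = [
        "the quick brown".split(),
        "brown fox jumps over".split(),
        "jumps over the lazy dog".split(),
    ]
    print(" ".join(merge_transcripts(workers)))
    # prints: the quick brown fox jumps over the lazy dog
```

Pairwise folding is order-dependent and does not vote between conflicting words, which is part of what a true multiple-sequence alignment addresses; the sketch only shows how overlapping partial inputs can be stitched into a single transcript.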

    Published In

    Communications of the ACM, Volume 60, Issue 9
    September 2017
    94 pages
    ISSN: 0001-0782
    EISSN: 1557-7317
    DOI: 10.1145/3134526
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed
