DOI: 10.1145/3586182.3615789
Demonstration

LiveLocalizer: Augmenting Mobile Speech-to-Text with Microphone Arrays, Optimized Localization and Beamforming

Published: 29 October 2023

Abstract

Speech-to-text capabilities on mobile devices have proven helpful for language translation, note-taking, hearing and speech accessibility, and meeting transcripts. However, their usefulness is limited because they cannot distinguish between multiple speakers, track which direction speech is coming from, or perform acceptably in noisy environments.
This work introduces efficient real-time audio localization and adaptive beamforming algorithms that run on custom sound perception hardware built around a low-power microcontroller and four integrated microphones. A prototype is implemented in a phone case form factor and is plug-and-play with modern smartphones.
We characterize the performance in technical evaluations of localization, beamforming, and diarization. We demonstrate how the phone case extends existing smartphones with speaker diarization in a speech-to-text app, sound direction visualization, and sound enhancement through beamforming. In the future, we hope our approach will inspire the widespread adoption of advanced microphone arrays that natively unlock the potential of spatial sound processing and perception in mobile and wearable devices.
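
The abstract does not spell out the algorithms, so the following Python sketch only illustrates the general idea behind localization and beamforming on a microphone pair. It uses plain time-domain cross-correlation to estimate the time difference of arrival and a delay-and-sum beamformer to align and average the channels. These are textbook stand-ins, not the authors' optimized or adaptive methods. The function names, the speed-of-sound constant, and the sample rate and microphone spacing in the usage lines are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) by which `sig` trails `ref`.

    Brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    n = len(ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ref[i] * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def angle_from_delay(lag, sample_rate, mic_spacing):
    """Far-field angle of arrival (degrees) for one microphone pair."""
    tau = lag / sample_rate  # delay in seconds
    s = tau * SPEED_OF_SOUND / mic_spacing
    s = max(-1.0, min(1.0, s))  # clamp against rounding and noise
    return math.degrees(math.asin(s))


def delay_and_sum(channels, lags):
    """Shift each channel by its integer lag, then average the channels."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, lag in zip(channels, lags):
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                out[i] += ch[j]
    return [v / len(channels) for v in out]


if __name__ == "__main__":
    # Hypothetical two-microphone capture: the same pulse, 3 samples apart.
    mic_a = [0.0] * 64
    mic_b = [0.0] * 64
    mic_a[20] = 1.0
    mic_b[23] = 1.0

    lag = estimate_delay(mic_a, mic_b, max_lag=8)
    print("lag in samples:", lag)
    print("angle in degrees:", angle_from_delay(lag, 16000, 0.10))
    enhanced = delay_and_sum([mic_a, mic_b], [0, lag])
    print("aligned peak:", max(enhanced))
```

With four microphones, the same pairwise estimate could be repeated across microphone pairs and the resulting lags passed to `delay_and_sum`. A real-time implementation on a microcontroller would more likely use an FFT-based correlation than the quadratic loop shown here.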

Published In

UIST '23 Adjunct: Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology
October 2023
424 pages
ISBN: 9798400700965
DOI: 10.1145/3586182
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. ASR
2. STT
3. Speech-to-text
4. accessibility
5. audio
6. beamforming
7. microphone array
8. speech

Qualifiers

• Demonstration
• Research
• Refereed limited

Conference

UIST '23

Acceptance Rates

Overall Acceptance Rate: 355 of 1,733 submissions, 20%
