research-article

Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing

Authors: Julia Kaiwen Lau, Kelvin Kai Wen Kong, Julian Hao Yong, Joshua Chern Wey Low, Chun Yong Chong, David Lo

ISSTA 2023: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis

Pages 1169 - 1181

https://doi.org/10.1145/3597926.3598126

Published: 13 July 2023

Abstract

Recent studies have proposed using Text-To-Speech (TTS) systems to automatically synthesise speech test cases at scale, uncovering a large number of failures in automated speech recognition (ASR) systems. However, a failure uncovered by a synthetic test case may not reflect how the ASR system actually performs when it transcribes human audio; we refer to such failures as false alarms. To detect them, we take each failed test case synthesised by a TTS system, which consists of TTS-generated audio and the corresponding ground-truth text, and feed human audio of the same text to the ASR system. If the human audio is transcribed correctly, the failure is counted as a false alarm.
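The detection rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and it assumes that "correctly transcribed" means an exact match after lowercasing and stripping punctuation, whereas the paper may use a different criterion.

```python
import string


def normalise(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def is_failed(reference: str, transcript: str) -> bool:
    """Assumption: a test case fails when the normalised texts differ."""
    return normalise(reference) != normalise(transcript)


def classify_test_case(reference: str,
                       tts_transcript: str,
                       human_transcript: str) -> str:
    """Label one test case from the ASR output on TTS and human audio.

    reference        -- ground-truth text of the test case
    tts_transcript   -- ASR output for the TTS-generated audio
    human_transcript -- ASR output for human audio of the same text
    """
    if not is_failed(reference, tts_transcript):
        return "pass"  # the synthetic test case did not fail at all
    if is_failed(reference, human_transcript):
        return "true failure"  # the ASR system also fails on human audio
    return "false alarm"  # fails on synthetic audio only


if __name__ == "__main__":
    print(classify_test_case("The cat sat.", "the cat sad", "the cat sat"))
```

Only test cases that fail on the synthetic audio are candidates; the human recording then decides whether the failure is genuine or a false alarm.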

In this study, we investigate false alarm occurrences in five popular ASR systems, using synthetic audio generated by four TTS systems and human audio obtained from two commonly used datasets. Our results show that the fewest false alarms are identified when testing Deepspeech and the most when testing Wav2vec2. On average, false alarm rates range from 21% to 34% across the five ASR systems. Among the four TTS systems, Google TTS produces the fewest false alarms (17%) and Espeak TTS the most (32%). We also build a false alarm estimator that flags potential false alarms; it achieves promising results: a precision of 98.3%, a recall of 96.4%, an accuracy of 98.5%, and an F1 score of 97.3%. Our study provides insight into selecting TTS systems that generate high-quality speech for testing ASR systems. In addition, a false alarm estimator can help minimise the impact of false alarms and help developers choose suitable test inputs when evaluating ASR systems. The source code used in this paper is publicly available on GitHub at https://github.com/julianyonghao/FAinASRtest.
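For readers unfamiliar with the four metrics reported for the estimator, the sketch below shows how they are conventionally computed from a binary confusion matrix in which "false alarm" is the positive class. The function name and the counts in the example are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp -- false alarms correctly flagged
    fp -- genuine failures wrongly flagged as false alarms
    tn -- genuine failures correctly left unflagged
    fn -- false alarms the estimator missed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(classification_metrics(tp=3, fp=1, tn=5, fn=1))
```

As a consistency check on the reported figures, the harmonic mean of a precision of 0.983 and a recall of 0.964 is about 0.973, which agrees with the stated F1 score.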

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Published In

ISSTA 2023: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis

July 2023

1554 pages

ISBN:9798400702211

DOI:10.1145/3597926

General Chair: René Just, University of Washington, USA

Program Chair: Gordon Fraser, University of Passau, Germany

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Ministry of Education, Singapore

Conference

ISSTA '23

Sponsor:

SIGSOFT

ISSTA '23: 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis

July 17 - 21, 2023

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%
