DOI: 10.1145/3597926.3598126

Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing

Published: 13 July 2023

Abstract

Recent studies have proposed using Text-To-Speech (TTS) systems to automatically synthesise speech test cases at scale, uncovering a large number of failures in Automated Speech Recognition (ASR) systems. However, a failure uncovered by a synthetic test case may not reflect how the ASR system actually performs when it transcribes human audio; we refer to such failures as false alarms. To detect them, we take each failed test case synthesised by a TTS system, which consists of TTS-generated audio and the corresponding ground-truth text, and feed human audio of the same text to the ASR system. If the human audio is transcribed correctly, the failed test case is counted as a false alarm.
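The false alarm check described above can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper's code: `SpeechTestCase`, `normalise`, `classify`, and the `transcribe` callable standing in for an ASR system. The sketch also assumes that "correctly transcribed" means an exact match after simple text normalisation, which may differ from the criterion the authors use.

```python
import string
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTestCase:
    """Hypothetical container for one test case."""
    text: str           # ground-truth text
    tts_audio: bytes    # audio synthesised by a TTS system
    human_audio: bytes  # human recording of the same text


def normalise(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def classify(case: SpeechTestCase, transcribe: Callable[[bytes], str]) -> str:
    """Label a test case as 'pass', 'false_alarm', or 'true_failure'.

    Assumption: a transcription is correct when it equals the ground-truth
    text after normalisation.
    """
    expected = normalise(case.text)
    if normalise(transcribe(case.tts_audio)) == expected:
        return "pass"  # the synthetic test case did not fail at all
    if normalise(transcribe(case.human_audio)) == expected:
        return "false_alarm"  # fails on TTS audio, succeeds on human audio
    return "true_failure"  # fails on both synthetic and human audio
```

In practice `transcribe` would wrap a real ASR system; for a quick check, a dictionary lookup from audio bytes to a canned transcript is enough to exercise all three outcomes.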
In this study, we investigate the occurrence of false alarms in five popular ASR systems, using synthetic audio generated by four TTS systems and human audio obtained from two commonly used datasets. Our results show that the fewest false alarms are identified when testing Deepspeech and the most when testing Wav2vec2. On average, false alarm rates range from 21% to 34% across the five ASR systems. Among the four TTS systems, Google TTS produces the fewest false alarms (17%) and Espeak TTS the most (32%). Additionally, we build a false alarm estimator that flags potential false alarms and achieves promising results: a precision of 98.3%, a recall of 96.4%, an accuracy of 98.5%, and an F1 score of 97.3%. Our study provides insight into selecting appropriate TTS systems to generate high-quality speech for testing ASR systems. The false alarm estimator can also help minimise the impact of false alarms and help developers choose suitable test inputs when evaluating ASR systems. The source code used in this paper is publicly available on GitHub at https://github.com/julianyonghao/FAinASRtest.
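For readers less familiar with the estimator metrics quoted above, the following Python sketch shows how precision, recall, accuracy, and F1 are conventionally derived from confusion-matrix counts. The function names and the example counts are hypothetical; only the reported precision (98.3%) and recall (96.4%) come from the abstract. The last lines check that the reported F1 score is consistent with those two figures.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts.

    Here a 'positive' would be a test case flagged as a false alarm.
    Assumes tp + fp, tp + fn, and precision + recall are all non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, fn=2, tn=88)

# Consistency check using the precision and recall reported in the abstract.
reported_f1 = f1_from(0.983, 0.964)
print(f"{reported_f1:.3f}")  # 0.973, matching the reported F1 of 97.3%
```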


    Published In

    ISSTA 2023: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
    July 2023
    1554 pages
    ISBN: 9798400702211
    DOI: 10.1145/3597926
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Automated Speech Recognition
    2. False Alarms
    3. Software Testing

    Qualifiers

    • Research-article

    Funding Sources

    • Ministry of Education, Singapore

    Conference

    ISSTA '23

    Acceptance Rates

    Overall Acceptance Rate 58 of 213 submissions, 27%


