Abstract
Speech quality assessment (SQA) is important for modern communication systems and Quality of Service (QoS). Non-intrusive SQA has become the main research direction because it does not require the original speech. However, intrusive algorithms still outperform non-intrusive methods, since prior information about the original signal is available during the test. The objective of this paper is to perform non-intrusive evaluation of noisy speech quality in “an intrusive way”. To reconstruct the original speech, a meta-reinforcement learning method, MetaRL-SR, is proposed, which focuses on reconstructing quasi-clean speech from noisy speech with few training samples. First, a reinforcement-learning-based meta-learner is proposed that initializes its actions as a finite set of time-frequency (T-F) masks, and the corresponding action-value function is developed. Second, to optimize the model, the reward for reinforcement learning is computed from user perception. Third, the model-agnostic meta-learning (MAML) algorithm is applied to make full use of the limited data, improving the generalization of the meta-learner to new tasks. Finally, the quasi-clean speech is used as the reference in the International Telecommunication Union (ITU) standard PESQ intrusive model, and the distortion between the noisy speech and the quasi-clean speech is calculated to estimate the Mean Opinion Score (MOS) of the noisy speech. Experimental results show that, in terms of Pearson correlation and standard deviation of the error, this work achieves improvements of at least 5.8%~7.3% for 1-shot cases and 5.4%~6.8% for 5-shot cases over state-of-the-art DNN-based SQA methods under challenging conditions, where the environmental noises are diverse and the signals are non-stationary.
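To make the pipeline described above concrete, the following minimal sketch illustrates the idea under stated assumptions: the `QNet` action-value network, the task tuples, and the `compute_pesq` scorer are hypothetical placeholders (not the authors' code), the meta-update is a first-order MAML simplification rather than the paper's exact training procedure, and the reward is assumed to be a PESQ-style perceptual gain. Actions index a finite set of candidate T-F masks; each meta-learning task corresponds to one noise condition.

```python
import copy
import torch

# Minimal sketch of the few-shot idea in the abstract, not the authors' code.
# QNet, the task tuples, and compute_pesq are hypothetical placeholders; the
# meta-update below is a first-order MAML simplification.

class QNet(torch.nn.Module):
    """Action-value function Q(s, a) over a finite set of T-F mask actions."""
    def __init__(self, n_freq_bins: int, n_masks: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_freq_bins, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, n_masks))        # one Q-value per candidate mask

    def forward(self, state):                     # state: (batch, n_freq_bins)
        return self.net(state)


def inner_adapt(q_net, support, lr_inner=1e-2):
    """One gradient step on a task's support set (the few-shot adaptation)."""
    states, actions, rewards = support            # rewards: perceptual (PESQ-style) gains
    adapted = copy.deepcopy(q_net)
    q = adapted(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.mse_loss(q, rewards)
    grads = torch.autograd.grad(loss, list(adapted.parameters()))
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), grads):
            p -= lr_inner * g
    return adapted


def meta_train_step(q_net, tasks, lr_meta=1e-3):
    """First-order MAML meta-update: each task is one noise condition."""
    meta_grads = [torch.zeros_like(p) for p in q_net.parameters()]
    for support, query in tasks:
        adapted = inner_adapt(q_net, support)
        states, actions, rewards = query
        q = adapted(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = torch.nn.functional.mse_loss(q, rewards)
        grads = torch.autograd.grad(loss, list(adapted.parameters()))
        for mg, g in zip(meta_grads, grads):
            mg += g / len(tasks)
    with torch.no_grad():
        for p, mg in zip(q_net.parameters(), meta_grads):
            p -= lr_meta * mg                     # update the shared initialization


def estimate_mos(noisy_wav, quasi_clean_wav, compute_pesq):
    """Score the noisy speech intrusively against the reconstructed quasi-clean
    reference; compute_pesq is a placeholder for an ITU-T P.862 implementation."""
    return compute_pesq(ref=quasi_clean_wav, deg=noisy_wav)
```

After meta-training, a few support samples from a new noise condition adapt the initialization, the adapted policy selects T-F masks frame by frame to produce the quasi-clean reference, and the noisy signal is then scored against that reference to obtain the MOS estimate.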
Acknowledgments
This work is supported by the Foshan University Research Foundation for Advanced Talents (GG07005), the Natural Science Foundation of Guangdong Province (2018A0303130082, 2019A1515111148), and the Guangdong Province Colleges and Universities Young Innovative Talent Project (2019KQNCX168).
About this article
Cite this article
Zhou, W., Lai, J., Liao, Y. et al. Meta-reinforcement learning based few-shot speech reconstruction for non-intrusive speech quality assessment. Appl Intell 53, 14146–14161 (2023). https://doi.org/10.1007/s10489-022-04165-0