Abstract
Speech quality assessment (SQA) is important for modern communication systems and Quality of Service (QoS). Non-intrusive SQA has become the main research direction because it does not require the original speech. However, intrusive algorithms still outperform non-intrusive methods, since prior information about the original signal is available during the test. The objective of this paper is to perform non-intrusive evaluation of noisy speech quality in “an intrusive way”. To reconstruct the original speech, a meta-reinforcement learning method, MetaRL-SR, is proposed, which focuses on reconstructing quasi-clean speech from noisy speech with few training samples. First, a reinforcement-learning-based meta-learner is proposed that initializes its actions as a finite set of time-frequency (T-F) masks, and the corresponding action-value function is developed. Second, to optimize the model, the reward for reinforcement learning is computed from user perception. Third, the model-agnostic meta-learning (MAML) algorithm is applied to make full use of the limited data, improving the generalization of the meta-learner to new tasks. Finally, the quasi-clean speech is used as the reference in the International Telecommunication Union (ITU) standard PESQ intrusive model, and the distortion between the noisy speech and the quasi-clean speech is calculated to estimate the Mean Opinion Score (MOS) of the noisy speech. Experimental results show that, in terms of Pearson correlation and standard deviation of the error, this work achieves improvements of at least 5.8%~7.3% for 1-shot cases and 5.4%~6.8% for 5-shot cases over state-of-the-art DNN-based SQA methods under challenging conditions, where the environmental noises are diverse and the signals are non-stationary.
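To make the pipeline described above concrete, the following minimal sketch illustrates the idea under stated assumptions: the `QNet` action-value network, the task tuples, and the `compute_pesq` scorer are hypothetical placeholders (not the authors' code), the meta-update is a first-order MAML simplification rather than the paper's exact training procedure, and the reward is assumed to be a PESQ-style perceptual gain. Actions index a finite set of candidate T-F masks; each meta-learning task corresponds to one noise condition.

```python
import copy
import torch

# Minimal sketch of the few-shot idea in the abstract, not the authors' code.
# QNet, the task tuples, and compute_pesq are hypothetical placeholders; the
# meta-update below is a first-order MAML simplification.

class QNet(torch.nn.Module):
    """Action-value function Q(s, a) over a finite set of T-F mask actions."""
    def __init__(self, n_freq_bins: int, n_masks: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_freq_bins, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, n_masks))        # one Q-value per candidate mask

    def forward(self, state):                     # state: (batch, n_freq_bins)
        return self.net(state)


def inner_adapt(q_net, support, lr_inner=1e-2):
    """One gradient step on a task's support set (the few-shot adaptation)."""
    states, actions, rewards = support            # rewards: perceptual (PESQ-style) gains
    adapted = copy.deepcopy(q_net)
    q = adapted(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.mse_loss(q, rewards)
    grads = torch.autograd.grad(loss, list(adapted.parameters()))
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), grads):
            p -= lr_inner * g
    return adapted


def meta_train_step(q_net, tasks, lr_meta=1e-3):
    """First-order MAML meta-update: each task is one noise condition."""
    meta_grads = [torch.zeros_like(p) for p in q_net.parameters()]
    for support, query in tasks:
        adapted = inner_adapt(q_net, support)
        states, actions, rewards = query
        q = adapted(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = torch.nn.functional.mse_loss(q, rewards)
        grads = torch.autograd.grad(loss, list(adapted.parameters()))
        for mg, g in zip(meta_grads, grads):
            mg += g / len(tasks)
    with torch.no_grad():
        for p, mg in zip(q_net.parameters(), meta_grads):
            p -= lr_meta * mg                     # update the shared initialization


def estimate_mos(noisy_wav, quasi_clean_wav, compute_pesq):
    """Score the noisy speech intrusively against the reconstructed quasi-clean
    reference; compute_pesq is a placeholder for an ITU-T P.862 implementation."""
    return compute_pesq(ref=quasi_clean_wav, deg=noisy_wav)
```

After meta-training, a few support samples from a new noise condition adapt the initialization, the adapted policy selects T-F masks frame by frame to produce the quasi-clean reference, and the noisy signal is then scored against that reference to obtain the MOS estimate.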
Acknowledgments
This work is supported by the Foshan University Research Foundation for Advanced Talents (GG07005), the Natural Science Foundation of Guangdong Province (2018A0303130082, 2019A1515111148), and the Guangdong Province Colleges and Universities Young Innovative Talent Project (2019KQNCX168).
About this article
Cite this article
Zhou, W., Lai, J., Liao, Y. et al. Meta-reinforcement learning based few-shot speech reconstruction for non-intrusive speech quality assessment. Appl Intell 53, 14146–14161 (2023). https://doi.org/10.1007/s10489-022-04165-0