Abstract
Speech quality evaluation (SQE) in complex noisy environments is important for audio processing systems and quality of service. Recently, non-intrusive SQE has attracted increasing attention because of its efficiency and ease of use. However, non-intrusive methods are expected to underperform intrusive ones because they have no prior knowledge of the clean speech. In this paper, a novel quasi-clean speech reconstruction method for non-intrusive SQE is proposed. The method combines Bayesian non-negative matrix factorization (BNMF) with a deep neural network (DNN) to take advantage of both. BNMF is used to compute the basic spectro-temporal matrices of the target speech, and the obtained matrices are integrated into the DNN model as an individual layer. The DNN is then trained to learn the complex mapping between the target source and the mixture signal and to reconstruct the magnitude spectrograms of the quasi-clean speech. Finally, the reconstructed speech serves as the reference for the perceptual model, which estimates the mean opinion score (MOS) of the tested noisy sample. Experimental results show that the proposed method outperforms the compared non-intrusive SQE algorithms under challenging conditions in terms of objective measures.
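The idea of placing learned speech bases inside the network as a fixed layer can be illustrated with a minimal sketch. It is not the authors' implementation: plain KL-divergence NMF with multiplicative updates stands in for BNMF, the network weights are random and untrained, the spectrograms are synthetic, and every name (`nmf_bases`, `reconstruct`, `params`, the layer sizes) is a hypothetical choice made for illustration only.

```python
import numpy as np


def nmf_bases(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Plain KL-divergence NMF (multiplicative updates), V ~= W @ H.

    Assumption: used here as a simple stand-in for the Bayesian NMF
    named in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    # Normalise each basis vector to unit sum and rescale H so W @ H is unchanged.
    scale = W.sum(axis=0, keepdims=True) + eps
    return W / scale, H * scale.T


def reconstruct(noisy_mag, params, W_speech):
    """Forward pass: ReLU hidden layer -> non-negative activations -> fixed basis layer."""
    hidden = np.maximum(0.0, params["W1"] @ noisy_mag + params["b1"])
    # Softplus (computed stably with logaddexp) keeps the activations non-negative.
    activations = np.logaddexp(0.0, params["W2"] @ hidden + params["b2"])
    # The NMF bases act as the final layer, so the output is a
    # non-negative combination of speech bases.
    return W_speech @ activations


# Synthetic stand-ins for magnitude spectrograms (frequency bins x frames).
rng = np.random.default_rng(1)
n_freq, n_frames, rank, n_hidden = 64, 200, 8, 32
clean_mag = rng.random((n_freq, n_frames))
noisy_mag = clean_mag + 0.5 * rng.random((n_freq, n_frames))

# Step 1: learn speech bases from clean speech.
W_speech, _ = nmf_bases(clean_mag, rank)

# Step 2: untrained network parameters (training is omitted in this sketch).
params = {
    "W1": 0.1 * rng.standard_normal((n_hidden, n_freq)),
    "b1": np.zeros((n_hidden, 1)),
    "W2": 0.1 * rng.standard_normal((rank, n_hidden)),
    "b2": np.zeros((rank, 1)),
}

# Step 3: map the noisy spectrogram to a quasi-clean magnitude spectrogram.
quasi_clean_mag = reconstruct(noisy_mag, params, W_speech)
print(quasi_clean_mag.shape)
```

In a full system the weights in `params` would be trained on pairs of noisy and clean spectrograms, and the reconstructed spectrogram would be converted back to a waveform and passed to the perceptual model as the reference signal. Fixing the last layer to the learned bases constrains the output to lie in the span of speech-like spectral patterns, which is one plausible reading of why the two models are combined.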
Acknowledgements
This work is supported by the Foshan University Research Foundation for Advanced Talents (GG07005), the Natural Science Foundation of Guangdong Province (2019A1515111148), and the Guangdong Province Colleges and Universities Young Innovative Talent Project (2019KQNCX168).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Zhou, W., Zhu, Z. A novel BNMF-DNN based speech reconstruction method for speech quality evaluation under complex environments. Int. J. Mach. Learn. & Cyber. 12, 959–972 (2021). https://doi.org/10.1007/s13042-020-01214-3
DOI: https://doi.org/10.1007/s13042-020-01214-3