Abstract
Most existing deep learning-based speech denoising methods rely heavily on clean speech data. The conventional view holds that a large number of paired noisy and clean speech samples is required to achieve good denoising performance. However, collecting such data is a practical barrier, particularly in economically challenged regions and for low-resource languages. Training deep denoising networks with only noisy speech samples is a viable way to avoid this dependence. In this study, both the input and the target of a deep complex U-Net (DCU-Net) were constructed from noisy speech alone. Experimental results demonstrate that, compared with conventional speech denoising techniques, the proposed approach reduces not only the dependence on clean targets but also the dependence on large training sets.
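To make the training setup concrete, the following is a minimal sketch of noisy-target training in the spirit described above: the network input is a noisy utterance with additional noise mixed in, and the training target is the original noisy utterance itself, so no clean speech is ever used. The dataset class `NoisyPairDataset`, the placeholder convolutional denoiser, and the SNR and segment-length settings are illustrative assumptions, not the DCU-Net configuration or pairing strategy used in the paper.

```python
# Sketch of training a speech denoiser without clean targets.
# Assumption: we only have noisy speech clips and separate noise clips.
# Input  = noisy speech + extra noise; Target = original noisy speech.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class NoisyPairDataset(Dataset):
    """Yields (doubly noisy input, noisy target) pairs built from noisy speech only."""

    def __init__(self, noisy_waveforms, noise_waveforms, snr_db=5.0, segment_len=16000):
        self.noisy = noisy_waveforms      # list of 1-D float tensors (noisy speech)
        self.noise = noise_waveforms      # list of 1-D float tensors (noise clips)
        self.snr_db = snr_db
        self.segment_len = segment_len

    def __len__(self):
        return len(self.noisy)

    def _fix_length(self, x):
        # Crop or zero-pad to a fixed segment length so batching works.
        if x.numel() >= self.segment_len:
            return x[: self.segment_len]
        return torch.cat([x, x.new_zeros(self.segment_len - x.numel())])

    def _add_noise(self, speech):
        noise = self.noise[torch.randint(len(self.noise), (1,)).item()]
        noise = noise.repeat(speech.numel() // noise.numel() + 1)[: speech.numel()]
        p_s = speech.pow(2).mean()
        p_n = noise.pow(2).mean().clamp_min(1e-10)
        gain = torch.sqrt(p_s / (p_n * 10 ** (self.snr_db / 10)))
        return speech + gain * noise

    def __getitem__(self, idx):
        target = self._fix_length(self.noisy[idx])   # noisy speech, used as target
        return self._add_noise(target), target       # extra-noisy input, noisy target


# Placeholder 1-D convolutional denoiser (stand-in for the DCU-Net architecture).
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=9, padding=4),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()


def train(dataset, epochs=10, batch_size=8):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for noisy_in, noisy_tgt in loader:
            # The loss is computed against a *noisy* target; clean speech
            # is never seen by the network during training.
            pred = model(noisy_in.unsqueeze(1))
            loss = criterion(pred, noisy_tgt.unsqueeze(1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The key point of the sketch is the pairing: because the extra noise on the input is statistically independent of the noise already present in the target, regressing one onto the other pushes the network toward the underlying speech, which is the intuition behind noise2noise-style training.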
Data Availability
The data that support the findings of this study are available from the corresponding author on request.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Poluboina, V., Pulikala, A. & Pitchaimuthu, A.N. Deep Speech Denoising with Minimal Dependence on Clean Speech Data. Circuits Syst Signal Process 43, 3909–3926 (2024). https://doi.org/10.1007/s00034-024-02644-y