Abstract
Deep-learning-based speech enhancement algorithms have greatly improved the perceptual quality and intelligibility of speech. Complex-valued neural networks, such as the deep complex convolution recurrent network (DCCRN), make full use of the phase information in audio signals and achieve superior performance, but complex-valued operations increase computational complexity. Inspired by the deep cosine transform convolutional recurrent network (DCTCRN), this paper replaces the complex-valued Fourier transform with the real-valued discrete cosine transform. The ideal cosine mask is employed as the training target, and a real-valued convolutional recurrent network (CRNN) enhances the speech while reducing algorithmic complexity. A frequency-time-LSTM (F-T-LSTM) module is used for better temporal modeling, and a convolutional skip-connection module is introduced between the encoders and the decoders to integrate information across features. Moreover, an improved scale-invariant source-to-noise ratio (SI-SNR) is used as the loss function, which enables the model to focus more on the varying parts of the signal and thus obtain better noise suppression. With only 1.31M parameters, the proposed method achieves noise suppression performance that exceeds that of DCCRN and DCTCRN.
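For readers unfamiliar with the loss mentioned above, the following is a minimal sketch of the standard SI-SNR, not the improved variant proposed in the paper, whose details are not given in this abstract. The function names `si_snr` and `si_snr_loss` and the stabilizer `eps` are illustrative assumptions. The sketch uses plain Python lists so it runs without extra libraries.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Standard scale-invariant SNR in dB (not the paper's improved variant).

    `estimate` and `target` are equal-length sequences of float samples.
    `eps` is an assumed small constant that avoids division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")

    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: this is the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever the projection does not explain is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(n * n for n in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_loss(estimate, target):
    """Negative SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    return -si_snr(estimate, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by a constant leaves the score essentially unchanged, which is what "scale-invariant" refers to.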
Copyright information
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Guo, J., Zhou, Y., Liu, H., Ma, Y. (2023). Convolutional Recurrent Neural Network Based on Short-Time Discrete Cosine Transform for Monaural Speech Enhancement. In: Gao, F., Wu, J., Li, Y., Gao, H. (eds) Communications and Networking. ChinaCom 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 500. Springer, Cham. https://doi.org/10.1007/978-3-031-34790-0_13
DOI: https://doi.org/10.1007/978-3-031-34790-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34789-4
Online ISBN: 978-3-031-34790-0
eBook Packages: Computer Science, Computer Science (R0)