Abstract
With the advent of deep learning, research on noise-robust sound event detection (SED) has progressed rapidly. However, the performance of single-channel SED systems in noisy conditions remains unsatisfactory. Recently, several speech enhancement (SE) methods have been applied as an SED front-end to reduce the effect of noise, but in these systems enhancement and detection are two entirely separate models that handle the two tasks independently. In this work, we introduce a network trained by a two-stage method that performs signal denoising and SED together, with a neural denoising block followed by a neural SED block. In addition, we design a new objective function that takes into account the Euclidean distance between the output of the denoising block and the corresponding clean audio amplitude spectrum, which better limits the distortion of the output features. The two-stage model is then jointly trained to optimize the proposed objective function. The results show that the proposed network outperforms a single-stage network without noise suppression. Compared with other recent state-of-the-art networks in the SED field, the proposed network is competitive, especially in noisy environments.
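The abstract does not give the exact form of the objective, so the following is only a minimal sketch of how such a joint loss might be combined. It assumes a frame-wise binary cross-entropy for the SED term and a weighting factor `alpha` on the denoising term; the function name `joint_loss`, the parameter `alpha`, and the array shapes are all hypothetical and not taken from the paper.

```python
import numpy as np

def joint_loss(denoised_mag, clean_mag, event_prob, event_label,
               alpha=1.0, eps=1e-7):
    """Hypothetical joint objective: SED loss plus a weighted denoising term.

    denoised_mag, clean_mag: amplitude spectra, shape (batch, frames, bins).
    event_prob, event_label: frame-wise event probabilities and 0/1 targets.
    alpha: assumed weight on the denoising term (not from the paper).
    """
    # Denoising term: Euclidean distance between the denoised and clean
    # amplitude spectra, computed per example and averaged over the batch.
    diff = (denoised_mag - clean_mag).reshape(denoised_mag.shape[0], -1)
    denoise_term = np.linalg.norm(diff, axis=1).mean()

    # Detection term: binary cross-entropy (an assumption for this sketch).
    p = np.clip(event_prob, eps, 1.0 - eps)
    sed_term = -(event_label * np.log(p)
                 + (1.0 - event_label) * np.log(1.0 - p)).mean()

    return sed_term + alpha * denoise_term, sed_term, denoise_term


# Toy usage with random data standing in for network outputs.
rng = np.random.default_rng(0)
clean = rng.random((2, 3, 4))
noisy_estimate = clean + 0.1 * rng.standard_normal((2, 3, 4))
prob = rng.random((2, 3))
label = (rng.random((2, 3)) > 0.5).astype(float)
total, sed, den = joint_loss(noisy_estimate, clean, prob, label, alpha=0.5)
```

In a jointly trained model, gradients of the combined value would flow through both blocks, so the denoising term acts as a constraint on how far the intermediate features may drift from the clean spectrum.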
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary material 1 (mp4 43308 KB)
Copyright information
© 2022 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Ou, J., Liu, H., Zhou, Y., Gan, L. (2022). Robust Sound Event Detection by a Two-Stage Network in the Presence of Background Noise. In: Gao, H., Wun, J., Yin, J., Shen, F., Shen, Y., Yu, J. (eds) Communications and Networking. ChinaCom 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 433. Springer, Cham. https://doi.org/10.1007/978-3-030-99200-2_34
DOI: https://doi.org/10.1007/978-3-030-99200-2_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99199-9
Online ISBN: 978-3-030-99200-2
eBook Packages: Computer Science, Computer Science (R0)