
Compute and Memory Efficient Universal Sound Source Separation

Journal of Signal Processing Systems

Abstract

Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem. In this study, we provide a family of efficient neural network architectures for general-purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios. The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), together with their aggregation, which is performed through simple one-dimensional convolutions. This mechanism enables our models to obtain high-fidelity signal separation in a wide variety of settings where a variable number of sources are present, and to do so with limited computational resources (e.g., floating-point operations, memory footprint, number of parameters, and latency). Our experiments show that SuDoRM-RF models perform comparably to, and even surpass, several state-of-the-art benchmarks that have significantly higher computational resource requirements. The causal variation of SuDoRM-RF obtains competitive performance in real-time speech separation, around 10 dB of scale-invariant signal-to-distortion ratio improvement (SI-SDRi), while remaining up to 20 times faster than real-time on a laptop device.
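The separation quality quoted above is reported in scale-invariant signal-to-distortion ratio improvement (SI-SDRi), i.e. how much the SI-SDR of a separated estimate improves over using the unprocessed mixture itself as the estimate. Below is a minimal PyTorch sketch of this metric for reference; the function names and the eps stabilizer are illustrative choices, not taken from the paper or its code.

```python
import torch

def si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB, computed over the last (time) dimension."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Optimal scaling of the target so that the metric is invariant to gain.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))

def si_sdri(estimate: torch.Tensor, target: torch.Tensor, mixture: torch.Tensor) -> torch.Tensor:
    """Improvement of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

An improvement of about 10 dB therefore means that, in this scale-invariant sense, the separated source is roughly 10 dB closer to its reference than the raw mixture was.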


Notes

  1. Code: https://github.com/etzinis/sudo_rm_rf
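For intuition about the backbone described in the abstract, the sketch below illustrates the successive downsampling/resampling-and-aggregation idea in PyTorch. It is a simplified, hypothetical block (the module name, kernel sizes, and number of resolutions are assumptions), not the implementation in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownUpBlock(nn.Module):
    """Toy block: extract features at several time resolutions via successive
    strided depthwise convolutions, resample them back to the input length,
    and aggregate everything with a cheap one-dimensional convolution."""

    def __init__(self, channels: int = 128, depth: int = 4):
        super().__init__()
        # Each stage halves the temporal resolution (stride 2, depthwise).
        self.down = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=5, stride=2,
                      padding=2, groups=channels)
            for _ in range(depth)
        ])
        # Simple 1x1 convolution that fuses the aggregated multi-resolution features.
        self.fuse = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        feats, h = [], x
        for conv in self.down:
            h = F.relu(conv(h))
            feats.append(h)
        # Resample every resolution back to the input length and sum them.
        agg = sum(F.interpolate(f, size=x.shape[-1], mode="nearest") for f in feats)
        return x + self.fuse(agg)  # residual connection keeps blocks stackable

block = DownUpBlock()
print(block(torch.randn(1, 128, 16000)).shape)  # torch.Size([1, 128, 16000])
```

Only depthwise and 1x1 convolutions appear in this sketch, which is the kind of design that keeps the per-block parameter count and floating-point cost low.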


Author information

Corresponding author

Correspondence to Efthymios Tzinis.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tzinis, E., Wang, Z., Jiang, X. et al. Compute and Memory Efficient Universal Sound Source Separation. J Sign Process Syst 94, 245–259 (2022). https://doi.org/10.1007/s11265-021-01683-x

