DOI: 10.1145/3343031.3351086

On Learning Disentangled Representation for Acoustic Event Detection

Published: 15 October 2019

Abstract

Polyphonic Acoustic Event Detection (AED) is a challenging task because sounds from different events are mixed together, and features extracted from the mixture do not match well with features calculated from the same sounds in isolation, which leads to suboptimal AED performance. In this paper, we propose a supervised β-VAE model for AED that adds a novel event-specific disentangling loss to the objective function of disentangled learning. By incorporating either latent factor blocks or latent attention in disentangling, supervised β-VAE learns a set of discriminative features for each event. Extensive experiments on benchmark datasets show that our approach outperforms the current state of the art (the top-1 performers in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 AED challenge). Supervised β-VAE is particularly successful on challenging AED tasks with a large variety of events and imbalanced data.
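
The abstract does not spell out the loss, so the following is only a rough sketch of how a supervised term might sit alongside the usual β-VAE objective (reconstruction error plus a β-weighted KL divergence to a standard normal prior). The multi-label binary cross-entropy used here is an assumed stand-in for the paper's event-specific disentangling loss, not its actual definition; the function names, the weight `lam`, and the default `beta=4.0` are hypothetical, and latent factor blocks and latent attention are not modelled. It is written in plain Python so it runs without dependencies.

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def squared_error(x, x_recon):
    """Sum of squared differences between an input frame and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_recon))

def multilabel_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy summed over event classes (polyphonic = multi-label).

    ASSUMPTION: a stand-in for the event-specific disentangling loss,
    whose real form is not given in the abstract.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def supervised_beta_vae_loss(x, x_recon, mu, logvar, event_probs, event_labels,
                             beta=4.0, lam=1.0):
    """Reconstruction + beta * KL + lam * supervised term (all hypothetical weights)."""
    recon = squared_error(x, x_recon)
    kl = gaussian_kl(mu, logvar)
    sup = multilabel_bce(event_probs, event_labels)
    return recon + beta * kl + lam * sup

# Example call with made-up numbers: 3 input features, 2 latent dims, 2 event classes.
loss = supervised_beta_vae_loss(
    x=[0.2, 0.5, 0.1], x_recon=[0.25, 0.4, 0.1],
    mu=[0.3, -0.1], logvar=[0.0, -0.2],
    event_probs=[0.9, 0.2], event_labels=[1, 0],
)
```

In this sketch, raising `beta` penalises the latent code more heavily for deviating from the prior, while `lam` controls how strongly the event labels shape the latent space.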


Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. acoustic event detection
  2. disentangled latent representation
  3. supervised variational autoencoder

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 84
  • Downloads (last 6 weeks): 8
Reflects downloads up to 01 Mar 2025

Cited By

  • (2024) Why do variational autoencoders really promote disentanglement? Proceedings of the 41st International Conference on Machine Learning, 3817-3849. DOI: 10.5555/3692070.3692224. Online publication date: 21-Jul-2024
  • (2024) On Local Temporal Embedding for Semi-Supervised Sound Event Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, 1687-1698. DOI: 10.1109/TASLP.2024.3369529. Online publication date: 2024
  • (2023) Towards a unified framework of contrastive learning for disentangled representations. Proceedings of the 37th International Conference on Neural Information Processing Systems, 67459-67470. DOI: 10.5555/3666122.3669070. Online publication date: 10-Dec-2023
  • (2022) Adaptive Hierarchical Pooling for Weakly-supervised Sound Event Detection. Proceedings of the 30th ACM International Conference on Multimedia, 1779-1787. DOI: 10.1145/3503161.3548097. Online publication date: 10-Oct-2022
  • (2021) Reproducibility Companion Paper: On Learning Disentangled Representation for Acoustic Event Detection. Proceedings of the 29th ACM International Conference on Multimedia, 3638-3641. DOI: 10.1145/3474085.3477938. Online publication date: 17-Oct-2021
  • (2021) Learning to disentangle emotion factors for facial expression recognition in the wild. International Journal of Intelligent Systems, Vol. 36, 6, 2511-2527. DOI: 10.1002/int.22391. Online publication date: 25-Feb-2021
