research-article

On Learning Disentangled Representation for Acoustic Event Detection

Authors:

Ratna Chinnam

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

Pages 2006 - 2014

https://doi.org/10.1145/3343031.3351086

Published: 15 October 2019

Abstract

Polyphonic Acoustic Event Detection (AED) is challenging because the sounds of different events are mixed in a single signal, and features extracted from the mixture do not match well with features calculated from sounds in isolation, leading to suboptimal AED performance. In this paper, we propose a supervised β-VAE model for AED, which adds a novel event-specific disentangling loss to the objective function of disentangled learning. By incorporating either latent factor blocks or latent attention in disentangling, supervised β-VAE learns a set of discriminative features for each event. Extensive experiments on benchmark datasets show that our approach outperforms the current state of the art (the top-1 performers in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 AED challenge). Supervised β-VAE also performs well on challenging AED tasks with a large variety of events and imbalanced data.
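
The abstract does not state the objective function, so the following is only an illustrative sketch of how a supervised term could be attached to the generic β-VAE objective (reconstruction error plus a β-weighted KL divergence to a standard normal prior). The squared-error reconstruction, the per-event binary cross-entropy term, and the names `supervised_beta_vae_loss`, `event_probs`, `event_labels` and `lam` are assumptions made for illustration; they are not taken from the paper, and the paper's event-specific disentangling loss may differ.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Element-wise binary cross-entropy; probabilities are clipped for stability."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))


def supervised_beta_vae_loss(x, x_recon, mu, log_var,
                             event_probs, event_labels,
                             beta=4.0, lam=1.0):
    """Hypothetical loss: reconstruction + beta * KL + lam * per-event BCE.

    x, x_recon:   (batch, features) input and its reconstruction
    mu, log_var:  (batch, latent) parameters of the approximate posterior
    event_probs:  (batch, events) predicted activity of each event (assumed
                  to be computed from that event's latent features)
    event_labels: (batch, events) multi-label ground truth in {0, 1}
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # assumed squared error
    kl = kl_to_standard_normal(mu, log_var)
    sup = np.sum(binary_cross_entropy(event_probs, event_labels), axis=-1)
    return float(np.mean(recon + beta * kl + lam * sup))
```

Because polyphonic AED is a multi-label problem, the sketch scores each event independently rather than with a softmax over events. Setting `lam=0` recovers the unsupervised β-VAE objective.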

Index Terms

On Learning Disentangled Representation for Acoustic Event Detection
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection
    2. Machine learning approaches
      1. Learning latent representations

Information & Contributors

Information

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

October 2019

2794 pages

ISBN:9781450368896

DOI:10.1145/3343031

General Chairs:
Laurent Amsaleg (CNRS-IRISA, France)
Benoit Huet (EURECOM, France)
Martha Larson (Radboud University and TU Delft, Netherlands)

Program Chairs:
Guillaume Gravier (CNRS-IRISA, France)
Hayley Hung (Delft University of Technology, Netherlands)
Chong-Wah Ngo (City University of Hong Kong, Hong Kong)
Wei Tsang Ooi (National University of Singapore, Singapore)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '19

Sponsor:

SIGMM

MM '19: The 27th ACM International Conference on Multimedia

October 21 - 25, 2019

Nice, France

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Citations

Cited By

Bhowal P, Soni A, Rambhatla S, Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, Berkenkamp F (2024). Why do variational autoencoders really promote disentanglement? Proceedings of the 41st International Conference on Machine Learning, 3817-3849. Online publication date: 21-Jul-2024.
https://dl.acm.org/doi/10.5555/3692070.3692224

Gao L, Mao Q, Dong M (2024). On Local Temporal Embedding for Semi-Supervised Sound Event Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, 1687-1698. Online publication date: 2024.
https://doi.org/10.1109/TASLP.2024.3369529

Matthes S, Han Z, Shen H, Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (2023). Towards a unified framework of contrastive learning for disentangled representations. Proceedings of the 37th International Conference on Neural Information Processing Systems, 67459-67470. Online publication date: 10-Dec-2023.
https://dl.acm.org/doi/10.5555/3666122.3669070

Gao L, Zhou L, Mao Q, Dong M, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L (2022). Adaptive Hierarchical Pooling for Weakly-supervised Sound Event Detection. Proceedings of the 30th ACM International Conference on Multimedia, 1779-1787. Online publication date: 10-Oct-2022.
https://dl.acm.org/doi/10.1145/3503161.3548097

Gao L, Mao Q, Chen J, Dong M, Chinnam R, Sassatelli L, Romero Rondon M, Sharma U, Shen H, Zhuang Y, Smith J, Yang Y, Cesar P, Metze F, Prabhakaran B (2021). Reproducibility Companion Paper: On Learning Disentangled Representation for Acoustic Event Detection. Proceedings of the 29th ACM International Conference on Multimedia, 3638-3641. Online publication date: 17-Oct-2021.
https://dl.acm.org/doi/10.1145/3474085.3477938

Zhu Q, Gao L, Song H, Mao Q (2021). Learning to disentangle emotion factors for facial expression recognition in the wild. International Journal of Intelligent Systems 36:6, 2511-2527. Online publication date: 25-Feb-2021.
https://doi.org/10.1002/int.22391
