DOI: 10.1145/3302506.3310402
Research Article | Public Access

SoundSemantics: exploiting semantic knowledge in text for embedded acoustic event classification

Published: 16 April 2019

Abstract

In this paper, we propose a fundamentally different approach to acoustic event classification that exploits knowledge from the textual domain to address a well-known pain point in audio event classification: the lack of adequate training examples. We show that by exploiting an existing context-aware semantic representation of English words (e.g., Google's word2vec [33]), generated from a massive amount of English text on the Internet, it is possible to classify acoustic events even when there are few or no training examples for a wide variety of sounds. Our approach targets application scenarios where a system must learn a number of predefined categories of acoustic events but does not have enough training examples per category, and/or has no training examples at all for some categories. We solve this problem by combining a robust audio representation step with a cross-modal projection of the audio representation onto the textual representation. Our approach differs from techniques such as one-shot learning and data augmentation, which do not consider cross-domain knowledge transfer. We develop a generic mobile application for audio event detection in which a user inputs a list of desired sound types, along with training audio clips for some of those sound types, and the system recognizes all of the listed sound types (at varying levels of accuracy, depending on the number of classes that have no training examples), which is not achievable by any existing audio classifier that we are aware of. We evaluate the performance of the proposed system on an empirical dataset [41] as well as by deploying the application in two real-world scenarios. The accuracy of the classifier lies between 60% and 90% for a 6- to 10-class problem when the number of classes without any training examples is varied between 2 and 5.
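
To make the cross-modal projection concrete, the following is a minimal sketch (in Python, and not the authors' implementation) of zero-shot acoustic classification as the abstract describes it: an audio clip is encoded into an embedding, linearly projected into a word-vector space, and labeled by the nearest label word vector. The names audio_encoder, W, and label_vectors are hypothetical stand-ins for, respectively, a trained audio representation, a projection learned only from the classes that have audio examples, and the word2vec [33] vectors of the user's label list.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity; the small epsilon guards against zero vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def classify_zero_shot(clip, audio_encoder, W, label_vectors):
        # Encode the audio, project into the text space, return the nearest label.
        z = audio_encoder(clip)            # audio embedding, shape (d_a,)
        t = W @ z                          # projected embedding, shape (d_t,)
        scores = {label: cosine(t, vec) for label, vec in label_vectors.items()}
        return max(scores, key=scores.get)

    # Toy usage with random stand-ins for the learned components.
    rng = np.random.default_rng(0)
    d_a, d_t = 128, 300                                # audio / word-vector dims
    W = rng.normal(size=(d_t, d_a))                    # would be learned from labeled classes
    label_vectors = {
        "siren":    rng.normal(size=d_t),              # would come from word2vec
        "doorbell": rng.normal(size=d_t),
        "dog bark": rng.normal(size=d_t),              # may have zero audio examples
    }
    audio_encoder = lambda clip: rng.normal(size=d_a)  # placeholder audio model
    print(classify_zero_shot(None, audio_encoder, W, label_vectors))

Because labels with no audio training data still have word vectors, they remain valid classification targets; this is what lets the system recognize sound types for which the user supplied no clips.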

References

[1]
TensorFlow Lite. https://www.tensorflow.org/lite/.
[2]
Aytar, Y., Vondrick, C., and Torralba, A. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (2016), pp. 892--900.
[3]
Blumstein, D. T., Mennill, D. J., Clemins, P., Girod, L., Yao, K., Patricelli, G., Deffe, J. L., Krakauer, A. H., Clark, C., Cortopassi, K. A., et al. Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus. Journal of Applied Ecology (2011).
[4]
Chen, J., Kam, A. H., Zhang, J., Liu, N., and Shue, L. Bathroom activity monitoring based on sound. In International Conference on Pervasive Computing (2005), Springer, pp. 47--61.
[5]
Chen, Z., Lin, M., Chen, F., Lane, N. D., Cardone, G., Wang, R., Li, T., Chen, Y., Choudhury, T., and Campbell, A. T. Unobtrusive sleep monitoring using smartphones. In Proceedings of the 7th International Conference on Pervasive Computing Technologies for Healthcare (2013).
[6]
Clavel, C., Ehrette, T., and Richard, G. Events detection for an audio-based surveillance system. In 2005 IEEE International Conference on Multimedia and Expo (2005), IEEE, pp. 1306--1309.
[7]
de Godoy, D., Islam, B., Xia, S., Islam, M. T., Chandrasekaran, R., Chen, Y.-C., Nirjon, S., Kinget, P. R., and Jiang, X. PAWS: A wearable acoustic system for pedestrian safety. In Internet-of-Things Design and Implementation (IoTDI), 2018 IEEE/ACM Third International Conference on (2018), IEEE, pp. 237--248.
[8]
Dedeoglu, Y., Toreyin, B. U., Gudukbay, U., and Cetin, A. E. Surveillance using both video and audio. In Multimodal Processing and Interaction.
[9]
Dickerson, R. F., Hoque, E., Asare, P., Nirjon, S., and Stankovic, J. A. Resonate: reverberation environment simulation for improved classification of speech models. In Proceedings of the 13th international symposium on Information processing in sensor networks (2014), IEEE Press, pp. 107--118.
[10]
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (2013), pp. 2121--2129.
[11]
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, vol. 1. MIT Press, Cambridge, 2016.
[12]
Guvensan, M. A., Taysi, Z. C., and Melodia, T. Energy monitoring in residential spaces with audio sensor nodes: TinyEars. Ad Hoc Networks 11, 5 (2013), 1539--1555.
[13]
Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) (2006), IEEE, pp. 1735--1742.
[14]
Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (2017), IEEE, pp. 131--135.
[15]
Ho, T. K., Hull, J. J., and Srihari, S. N. Decision combination in multiple classifier systems. IEEE transactions on pattern analysis and machine intelligence 16, 1 (1994), 66--75.
[16]
Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (2012).
[17]
Islam, M. T., Islam, B., and Nirjon, S. SoundSifter: Mitigating overhearing of continuous listening devices. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (2017), ACM, pp. 29--41.
[18]
Jaitly, N., and Hinton, G. E. Vocal tract length perturbation (VTLP) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language (2013).
[19]
Japkowicz, N., and Stephen, S. The class imbalance problem: A systematic study. Intelligent data analysis 6, 5 (2002), 429--449.
[20]
Jordan, M. I., and Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (2015), 255--260.
[21]
Kanda, N., Takeda, R., and Obuchi, Y. Elastic spectral distortion for low resource speech recognition with deep neural networks. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on.
[22]
Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[23]
Koch, G. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop (2015).
[24]
Kodirov, E., Xiang, T., and Gong, S. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345 (2017).
[25]
Kumar, A., Khadkevich, M., and Fügen, C. Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), IEEE, pp. 326--330.
[26]
Kuncheva, L. I. A theoretical study on six classifier fusion strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 2 (2002), 281--286.
[27]
Kuncheva, L. I., Bezdek, J. C., and Duin, R. P. Decision templates for multiple classifier fusion: an experimental comparison. Pattern recognition (2001).
[28]
Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., Warmuth, M., and Wolf, P. The CMU Sphinx-4 speech recognition system. In IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong (2003), vol. 1, pp. 2--5.
[29]
Lampert, C. H., Nickisch, H., and Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 3 (2014), 453--465.
[30]
Lee, H., Pham, P., Largman, Y., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems (2009), pp. 1096--1104.
[31]
Lehmann, E. L., and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006.
[32]
Lu, H., Brush, A. B., Priyantha, B., Karlson, A. K., and Liu, J. SpeakerSense: Energy efficient unobtrusive speaker identification on mobile phones. In International Conference on Pervasive Computing (2011), Springer, pp. 188--205.
[33]
Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[34]
Miluzzo, E., Cornelius, C. T., Ramaswamy, A., Choudhury, T., Liu, Z., and Campbell, A. T. Darwin phones: the evolution of sensing and inference on mobile phones. In Proceedings of the 8th international conference on Mobile systems, applications, and services (2010), ACM, pp. 5--20.
[35]
Mithun, N. C., Munir, S., Guo, K., and Shelton, C. ODDS: real-time object detection using depth sensors on embedded GPUs. In IPSN (2018).
[36]
Morales, N., Gu, L., and Gao, Y. Adding noise to improve noise robustness in speech recognition. In INTERSPEECH (2007), pp. 930--933.
[37]
Nirjon, S., Dickerson, R. F., Asare, P., Li, Q., Hong, D., Stankovic, J. A., Hu, P., Shen, G., and Jiang, X. Auditeur: A mobile-cloud service platform for acoustic event detection on smartphones. In Proceeding of the 11th annual international conference on Mobile systems, applications, and services (2013), ACM, pp. 403--416.
[38]
Nirjon, S., Dickerson, R. F., Li, Q., Asare, P., Stankovic, J. A., Hong, D., Zhang, B., Jiang, X., Shen, G., and Zhao, F. MusicalHeart: A hearty way of listening to music. In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems (2012), ACM, pp. 43--56.
[39]
Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., and Dean, J. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650 (2013).
[40]
Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. Zero-shot learning with semantic output codes. In Advances in neural information processing systems (2009), pp. 1410--1418.
[41]
Piczak, K. J. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia (2015), ACM Press, pp. 1015--1018.
[42]
Ra, H.-K., Salekin, A., Yoon, H.-J., Kim, J., Nirjon, S. S., Stone, D. J., Kim, S., Lee, J.-M., Son, S. H., Stankovic, J. A., et al. AsthmaGuide: an asthma monitoring and advice ecosystem. In Wireless Health (2016), pp. 128--135.
[43]
Ruta, D., and Gabrys, B. An overview of classifier fusion methods. Computing and Information systems 7, 1 (2000), 1--10.
[44]
Sailor, H. B., Agrawal, D. M., and Patil, H. A. Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. Proc. Interspeech 2017 (2017), 3107--3111.
[45]
Salekin, A., Eberle, J. W., Glenn, J. J., Teachman, B. A., and Stankovic, J. A. A weakly supervised learning framework for detecting social anxiety and depression. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (2018).
[46]
Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 815--823.
[47]
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems (2013), pp. 935--943.
[48]
Tanner, M. A., and Wong, W. H. The calculation of posterior distributions by data augmentation. Journal of the American statistical Association 82, 398 (1987), 528--540.
[49]
Tokozume, Y., Ushiku, Y., and Harada, T. Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282 (2017).
[50]
Töreyin, B. U., Dedeoğlu, Y., and Çetin, A. E. Hmm based falling person detection using both audio and video. In International Workshop on Human-Computer Interaction (2005), Springer, pp. 211--220.
[51]
Tzanetakis, G., and Cook, P. Musical genre classification of audio signals. IEEE Transactions on speech and audio processing 10, 5 (2002), 293--302.
[52]
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (2016), pp. 3630--3638.
[53]
Wang, A. The Shazam music recognition service. Communications of the ACM 49, 8 (2006), 44--48.
[54]
Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., and Schiele, B. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 69--77.
[55]
Zhang, Z., and Saligrama, V. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision (2015), pp. 4166--4174.

Cited By

• Multi-Label Zero-Shot Audio Classification with Temporal Attention. 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 250--254 (Sep 2024). DOI: 10.1109/IWAENC61483.2024.10694459
• Semantic Proximity Alignment: Towards Human Perception-Consistent Audio Tagging by Aligning with Label Text Description. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 541--545 (Apr 2024). DOI: 10.1109/ICASSP48485.2024.10446928
• Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance. IEEE Access 12 (2024), 155136--155150. DOI: 10.1109/ACCESS.2024.3482970
• Empowering few-shot learning: a multimodal optimization framework. Neural Computing and Applications (Dec 2024). DOI: 10.1007/s00521-024-10780-4
• Generic Multimodal Gradient-based Meta Learner Framework. 2023 26th International Conference on Information Fusion (FUSION), pp. 1--8 (Jun 2023). DOI: 10.23919/FUSION52260.2023.10224143
• Multiple Time-sensitive Inferences Scheduling on Energy-harvesting IoT Devices. Proceedings of the 2023 International Conference on Research in Adaptive and Convergent Systems, pp. 1--7 (Aug 2023). DOI: 10.1145/3599957.3606214
• Enhanced Embeddings in Zero-Shot Learning for Environmental Audio. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5 (Jun 2023). DOI: 10.1109/ICASSP49357.2023.10096134
• Generalized Zero-Shot Audio-to-Intent Classification. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1--8 (Dec 2023). DOI: 10.1109/ASRU57964.2023.10389657
• Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers. 2022 30th European Signal Processing Conference (EUSIPCO), pp. 410--413 (Aug 2022). DOI: 10.23919/EUSIPCO55093.2022.9909760
• Wikitag: Wikipedia-Based Knowledge Embeddings Towards Improved Acoustic Event Classification. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 136--140 (May 2022). DOI: 10.1109/ICASSP43922.2022.9747648


      Information

      Published In

      IPSN '19: Proceedings of the 18th International Conference on Information Processing in Sensor Networks
      April 2019
      365 pages
ISBN: 9781450362849
DOI: 10.1145/3302506
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      In-Cooperation

      • IEEE-SPS: Signal Processing Society

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. audio classification
      2. zero-shot learning

      Acceptance Rates

IPSN '19 Paper Acceptance Rate: 25 of 91 submissions, 27%.
Overall Acceptance Rate: 143 of 593 submissions, 24%.

      Article Metrics

• Downloads (last 12 months): 142
• Downloads (last 6 weeks): 12
      Reflects downloads up to 11 Dec 2024

