DOI: 10.1145/3302506.3310402
Research Article | Public Access

SoundSemantics: exploiting semantic knowledge in text for embedded acoustic event classification

Published: 16 April 2019

Abstract

In this paper, we propose a fundamentally different approach to acoustic event classification that exploits knowledge from the textual domain to address a well-known pain point in audio event classification: the lack of adequate training examples. We show that by exploiting an existing context-aware semantic representation of English words (e.g., Google's word2vec [33]), generated from a massive amount of English text on the Internet, it is possible to classify acoustic events even when there are few or no training examples for a wide variety of sounds. Our approach targets application scenarios where a system must learn a number of predefined categories of acoustic events but does not have enough training examples per category, and/or has no training examples at all for some categories. We solve this problem by combining a robust audio representation step with a cross-modal projection of the audio representation onto the textual representation. Our approach differs from techniques such as one-shot learning and data augmentation, which do not consider cross-domain knowledge transfer. We develop a generic mobile application for audio event detection in which a user inputs a list of desired sound types, along with training audio clips for some of those sound types, and the system recognizes all of the listed sound types (at varying levels of accuracy, depending on the number of classes that have no training examples), which is not achievable by any existing audio classifier that we are aware of. We evaluate the performance of the proposed system on an empirical dataset [41] as well as by deploying the application in two real-world scenarios. The accuracy of the classifier lies between 60% and 90% for a 6- to 10-class problem when the number of classes without any training examples is varied between 2 and 5.
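
To make the cross-modal projection concrete, the following is a minimal sketch (in Python, and not the authors' implementation) of zero-shot acoustic classification as the abstract describes it: an audio clip is encoded into an embedding, linearly projected into a word-vector space, and labeled by the nearest label word vector. The names audio_encoder, W, and label_vectors are hypothetical stand-ins for, respectively, a trained audio representation, a projection learned only from the classes that have audio examples, and the word2vec [33] vectors of the user's label list.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity; the small epsilon guards against zero vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def classify_zero_shot(clip, audio_encoder, W, label_vectors):
        # Encode the audio, project into the text space, return the nearest label.
        z = audio_encoder(clip)            # audio embedding, shape (d_a,)
        t = W @ z                          # projected embedding, shape (d_t,)
        scores = {label: cosine(t, vec) for label, vec in label_vectors.items()}
        return max(scores, key=scores.get)

    # Toy usage with random stand-ins for the learned components.
    rng = np.random.default_rng(0)
    d_a, d_t = 128, 300                                # audio / word-vector dims
    W = rng.normal(size=(d_t, d_a))                    # would be learned from labeled classes
    label_vectors = {
        "siren":    rng.normal(size=d_t),              # would come from word2vec
        "doorbell": rng.normal(size=d_t),
        "dog bark": rng.normal(size=d_t),              # may have zero audio examples
    }
    audio_encoder = lambda clip: rng.normal(size=d_a)  # placeholder audio model
    print(classify_zero_shot(None, audio_encoder, W, label_vectors))

Because labels with no audio training data still have word vectors, they remain valid classification targets; this is what lets the system recognize sound types for which the user supplied no clips.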

References

[1]
TensorFlow Lite. https://www.tensorflow.org/lite/.
[2]
Aytar, Y., Vondrick, C., and Torralba, A. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (2016), pp. 892--900.
[3]
Blumstein, D. T., Mennill, D. J., Clemins, P., Girod, L., Yao, K., Patricelli, G., Deffe, J. L., Krakauer, A. H., Clark, C., Cortopassi, K. A., et al. Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus. Journal of Applied Ecology (2011).
[4]
Chen, J., Kam, A. H., Zhang, J., Liu, N., and Shue, L. Bathroom activity monitoring based on sound. In International Conference on Pervasive Computing (2005), Springer, pp. 47--61.
[5]
Chen, Z., Lin, M., Chen, F., Lane, N. D., Cardone, G., Wang, R., Li, T., Chen, Y., Choudhury, T., and Campbell, A. T. Unobtrusive sleep monitoring using smartphones. In Proceedings of the 7th International Conference on Pervasive Computing Technologies for Healthcare (2013).
[6]
Clavel, C., Ehrette, T., and Richard, G. Events detection for an audio-based surveillance system. In 2005 IEEE International Conference on Multimedia and Expo (2005), IEEE, pp. 1306--1309.
[7]
de Godoy, D., Islam, B., Xia, S., Islam, M. T., Chandrasekaran, R., Chen, Y.-C., Nirjon, S., Kinget, P. R., and Jiang, X. PAWS: A wearable acoustic system for pedestrian safety. In Internet-of-Things Design and Implementation (IoTDI), 2018 IEEE/ACM Third International Conference on (2018), IEEE, pp. 237--248.
[8]
Dedeoglu, Y., Toreyin, B. U., Gudukbay, U., and Cetin, A. E. Surveillance using both video and audio. In Multimodal Processing and Interaction.
[9]
Dickerson, R. F., Hoque, E., Asare, P., Nirjon, S., and Stankovic, J. A. Resonate: reverberation environment simulation for improved classification of speech models. In Proceedings of the 13th international symposium on Information processing in sensor networks (2014), IEEE Press, pp. 107--118.
[10]
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (2013), pp. 2121--2129.
[11]
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, vol. 1. MIT Press, Cambridge, 2016.
[12]
Guvensan, M. A., Taysi, Z. C., and Melodia, T. Energy monitoring in residential spaces with audio sensor nodes: TinyEars. Ad Hoc Networks 11, 5 (2013), 1539--1555.
[13]
Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) (2006), IEEE, pp. 1735--1742.
[14]
Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (2017), IEEE, pp. 131--135.
[15]
Ho, T. K., Hull, J. J., and Srihari, S. N. Decision combination in multiple classifier systems. IEEE transactions on pattern analysis and machine intelligence 16, 1 (1994), 66--75.
[16]
Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (2012).
[17]
Islam, M. T., Islam, B., and Nirjon, S. SoundSifter: Mitigating overhearing of continuous listening devices. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (2017), ACM, pp. 29--41.
[18]
Jaitly, N., and Hinton, G. E. Vocal tract length perturbation (VTLP) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language (2013).
[19]
Japkowicz, N., and Stephen, S. The class imbalance problem: A systematic study. Intelligent data analysis 6, 5 (2002), 429--449.
[20]
Jordan, M. I., and Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (2015), 255--260.
[21]
Kanda, N., Takeda, R., and Obuchi, Y. Elastic spectral distortion for low resource speech recognition with deep neural networks. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on.
[22]
Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[23]
Koch, G. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop (2015).
[24]
Kodirov, E., Xiang, T., and Gong, S. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345 (2017).
[25]
Kumar, A., Khadkevich, M., and Fügen, C. Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), IEEE, pp. 326--330.
[26]
Kuncheva, L. I. A theoretical study on six classifier fusion strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 2 (2002), 281--286.
[27]
Kuncheva, L. I., Bezdek, J. C., and Duin, R. P. Decision templates for multiple classifier fusion: an experimental comparison. Pattern recognition (2001).
[28]
Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., Warmuth, M., and Wolf, P. The CMU Sphinx-4 speech recognition system. In IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong (2003), vol. 1, pp. 2--5.
[29]
Lampert, C. H., Nickisch, H., and Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 3 (2014), 453--465.
[30]
Lee, H., Pham, P., Largman, Y., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems (2009), pp. 1096--1104.
[31]
Lehmann, E. L., and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006.
[32]
Lu, H., Brush, A. B., Priyantha, B., Karlson, A. K., and Liu, J. SpeakerSense: Energy efficient unobtrusive speaker identification on mobile phones. In International Conference on Pervasive Computing (2011), Springer, pp. 188--205.
[33]
Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[34]
Miluzzo, E., Cornelius, C. T., Ramaswamy, A., Choudhury, T., Liu, Z., and Campbell, A. T. Darwin phones: the evolution of sensing and inference on mobile phones. In Proceedings of the 8th international conference on Mobile systems, applications, and services (2010), ACM, pp. 5--20.
[35]
Mithun, N. C., Munir, S., Guo, K., and Shelton, C. ODDS: real-time object detection using depth sensors on embedded GPUs. In IPSN (2018).
[36]
Morales, N., Gu, L., and Gao, Y. Adding noise to improve noise robustness in speech recognition. In INTERSPEECH (2007), pp. 930--933.
[37]
Nirjon, S., Dickerson, R. F., Asare, P., Li, Q., Hong, D., Stankovic, J. A., Hu, P., Shen, G., and Jiang, X. Auditeur: A mobile-cloud service platform for acoustic event detection on smartphones. In Proceeding of the 11th annual international conference on Mobile systems, applications, and services (2013), ACM, pp. 403--416.
[38]
Nirjon, S., Dickerson, R. F., Li, Q., Asare, P., Stankovic, J. A., Hong, D., Zhang, B., Jiang, X., Shen, G., and Zhao, F. MusicalHeart: A hearty way of listening to music. In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems (2012), ACM, pp. 43--56.
[39]
Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., and Dean, J. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650 (2013).
[40]
Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. Zero-shot learning with semantic output codes. In Advances in neural information processing systems (2009), pp. 1410--1418.
[41]
Piczak, K. J. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia (2015), ACM Press, pp. 1015--1018.
[42]
Ra, H.-K., Salekin, A., Yoon, H.-J., Kim, J., Nirjon, S. S., Stone, D. J., Kim, S., Lee, J.-M., Son, S. H., Stankovic, J. A., et al. AsthmaGuide: an asthma monitoring and advice ecosystem. In Wireless Health (2016), pp. 128--135.
[43]
Ruta, D., and Gabrys, B. An overview of classifier fusion methods. Computing and Information systems 7, 1 (2000), 1--10.
[44]
Sailor, H. B., Agrawal, D. M., and Patil, H. A. Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. Proc. Interspeech 2017 (2017), 3107--3111.
[45]
Salekin, A., Eberle, J. W., Glenn, J. J., Teachman, B. A., and Stankovic, J. A. A weakly supervised learning framework for detecting social anxiety and depression. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (2018).
[46]
Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 815--823.
[47]
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems (2013), pp. 935--943.
[48]
Tanner, M. A., and Wong, W. H. The calculation of posterior distributions by data augmentation. Journal of the American statistical Association 82, 398 (1987), 528--540.
[49]
Tokozume, Y., Ushiku, Y., and Harada, T. Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282 (2017).
[50]
Töreyin, B. U., Dedeoğlu, Y., and Çetin, A. E. Hmm based falling person detection using both audio and video. In International Workshop on Human-Computer Interaction (2005), Springer, pp. 211--220.
[51]
Tzanetakis, G., and Cook, P. Musical genre classification of audio signals. IEEE Transactions on speech and audio processing 10, 5 (2002), 293--302.
[52]
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (2016), pp. 3630--3638.
[53]
Wang, A. The Shazam music recognition service. Communications of the ACM 49, 8 (2006), 44--48.
[54]
Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., and Schiele, B. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 69--77.
[55]
Zhang, Z., and Saligrama, V. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision (2015), pp. 4166--4174.

Cited By

• Multi-Label Zero-Shot Audio Classification with Temporal Attention. 2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 250--254 (Sep 2024). DOI: 10.1109/IWAENC61483.2024.10694459
• Semantic Proximity Alignment: Towards Human Perception-Consistent Audio Tagging by Aligning with Label Text Description. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 541--545 (Apr 2024). DOI: 10.1109/ICASSP48485.2024.10446928
• Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance. IEEE Access 12 (2024), 155136--155150. DOI: 10.1109/ACCESS.2024.3482970
• Empowering few-shot learning: a multimodal optimization framework. Neural Computing and Applications (Dec 2024). DOI: 10.1007/s00521-024-10780-4
• Generic Multimodal Gradient-based Meta Learner Framework. 2023 26th International Conference on Information Fusion (FUSION), pp. 1--8 (Jun 2023). DOI: 10.23919/FUSION52260.2023.10224143
• Multiple Time-sensitive Inferences Scheduling on Energy-harvesting IoT Devices. Proceedings of the 2023 International Conference on Research in Adaptive and Convergent Systems, pp. 1--7 (Aug 2023). DOI: 10.1145/3599957.3606214
• Enhanced Embeddings in Zero-Shot Learning for Environmental Audio. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5 (Jun 2023). DOI: 10.1109/ICASSP49357.2023.10096134
• Generalized Zero-Shot Audio-to-Intent Classification. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1--8 (Dec 2023). DOI: 10.1109/ASRU57964.2023.10389657
• Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers. 2022 30th European Signal Processing Conference (EUSIPCO), pp. 410--413 (Aug 2022). DOI: 10.23919/EUSIPCO55093.2022.9909760
• Wikitag: Wikipedia-Based Knowledge Embeddings Towards Improved Acoustic Event Classification. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 136--140 (May 2022). DOI: 10.1109/ICASSP43922.2022.9747648


      Information

      Published In

      IPSN '19: Proceedings of the 18th International Conference on Information Processing in Sensor Networks
      April 2019
      365 pages
ISBN: 9781450362849
DOI: 10.1145/3302506
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      In-Cooperation

      • IEEE-SPS: Signal Processing Society

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. audio classification
      2. zero-shot learning

      Acceptance Rates

IPSN '19 Paper Acceptance Rate: 25 of 91 submissions, 27%.
Overall Acceptance Rate: 143 of 593 submissions, 24%.

      Article Metrics

• Downloads (last 12 months): 142
• Downloads (last 6 weeks): 12
      Reflects downloads up to 11 Dec 2024

