DOI: 10.1145/3468784.3471604

Real-time Sound Visualization via Multidimensional Clustering and Projections

Published: 20 July 2021

Abstract

Sound plays a vital role in every aspect of human life, since it is one of the primary forms of sensory information that our auditory system collects and that allows us to perceive the world. Sound clustering and visualization is the process of collecting and analyzing audio samples; that process is a prerequisite of sound classification, which is the core of automatic speech recognition, virtual assistants, and text-to-speech applications. Nevertheless, recognizing and properly interpreting complex, high-dimensional audio data is the most significant challenge in sound clustering and visualization. This paper proposes a web-based platform to visualize and cluster similar sound samples of musical notes and human speech in real time. To visualize high-dimensional data such as audio, Mel-Frequency Cepstral Coefficients (MFCCs), which were originally developed to represent the sounds made by the human vocal tract, are first extracted. Then, t-distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique designed for high-dimensional datasets, is applied. This paper focuses on both data clustering and high-dimensional visualization methods in order to present the clustering results meaningfully and to uncover potentially interesting behavioral patterns of musical notes played by different instruments.
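The pipeline the abstract describes, summarizing each clip with MFCCs and then projecting the feature vectors to two dimensions with t-SNE for plotting, can be sketched as a minimal Python example. This is an illustration rather than the authors' implementation: the file names, the choice of 13 coefficients, the time-averaging of MFCC frames, and the t-SNE perplexity are all assumptions, and librosa and scikit-learn stand in for whatever the web platform uses internally.

import numpy as np
import librosa                     # audio loading and MFCC extraction
from sklearn.manifold import TSNE  # t-distributed Stochastic Neighbor Embedding

def mfcc_vector(path, sr=22050, n_mfcc=13):
    """Load one clip and summarize it as a fixed-length MFCC feature vector."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # Average over time so every clip, regardless of length, maps to n_mfcc values.
    return mfcc.mean(axis=1)

# Hypothetical audio samples (musical notes and single spoken words).
clips = ["piano_C4.wav", "violin_C4.wav", "speech_yes.wav", "speech_no.wav"]
features = np.stack([mfcc_vector(p) for p in clips])  # shape: (n_clips, n_mfcc)

# Project the high-dimensional MFCC space to 2-D; perplexity must stay below the
# number of samples, so it is set very low for this tiny illustrative dataset.
embedding = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(features)

for path, (x, y) in zip(clips, embedding):
    print(f"{path}: ({x:.2f}, {y:.2f})")  # coordinates for a 2-D scatter plot

Points that land near each other in the resulting embedding correspond to clips with similar spectral envelopes, which is the property such a visualization exploits to group notes from the same instrument or repetitions of the same word.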




          Published In

          IAIT '21: Proceedings of the 12th International Conference on Advances in Information Technology
          June 2021
          281 pages
ISBN: 9781450390125
DOI: 10.1145/3468784

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. Human Speech Recognition
          2. Mel-Frequency Cepstral Coefficients
          3. Multivariate Clustering
4. Principal Component Analysis
          5. Sound visualization
          6. t-distributed Stochastic Neighbor Embedding

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          IAIT2021

          Acceptance Rates

          Overall Acceptance Rate 20 of 47 submissions, 43%

