DOI: 10.1145/1873951.1874246
Short paper

openSMILE: the Munich versatile and fast open-source audio feature extractor

Published: 25 October 2010

Abstract

We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
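The processing chain the abstract describes (frame-level low-level descriptors, delta regression coefficients, then statistical functionals that map a variable-length contour to a fixed-size feature vector) can be sketched as follows. This is an illustrative reconstruction in plain Python, not openSMILE's actual C++ API; all function names here are invented.

```python
# Illustrative sketch of the LLD -> delta regression -> functionals pipeline
# described in the abstract. Names are invented; this is not the openSMILE API.
import math

def delta_regression(contour, w=2):
    """Delta coefficients via the standard regression formula over a +/-w frame window."""
    n = len(contour)
    denom = 2 * sum(t * t for t in range(1, w + 1))
    deltas = []
    for i in range(n):
        # Edge frames reuse the first/last value (a common padding choice).
        num = sum(
            t * (contour[min(i + t, n - 1)] - contour[max(i - t, 0)])
            for t in range(1, w + 1)
        )
        deltas.append(num / denom)
    return deltas

def functionals(contour):
    """A few statistical functionals: they map a contour of any length to fixed values."""
    n = len(contour)
    mean = sum(contour) / n
    var = sum((x - mean) ** 2 for x in contour) / n
    return {
        "mean": mean,
        "stddev": math.sqrt(var),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }

# One hypothetical frame-level LLD contour (e.g. per-frame loudness).
loudness = [0.1, 0.4, 0.9, 0.7, 0.3]
d = delta_regression(loudness)
feats = {**functionals(loudness),
         **{"delta_" + k: v for k, v in functionals(d).items()}}
# feats is a fixed-size (here 10-dimensional) vector regardless of how many
# frames the utterance had.
```

The real tool applies far larger sets of functionals (percentiles, regression slopes, moments, and so on) to many LLDs and their deltas, which is how brute-forced feature spaces of thousands of dimensions arise; the sketch above only shows the shape of the computation.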



Published In

MM '10: Proceedings of the 18th ACM international conference on Multimedia
October 2010
1836 pages
ISBN:9781605589336
DOI:10.1145/1873951

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. audio feature extraction
  2. emotion
  3. music
  4. signal processing
  5. speech
  6. statistical functionals

Qualifiers

  • Short-paper

Conference

MM '10: ACM Multimedia Conference
October 25-29, 2010
Firenze, Italy

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 537
  • Downloads (last 6 weeks): 69
Reflects downloads up to 12 Jan 2025


Cited By

  • (2025) Incorporating Multimodal Directional Interpersonal Synchrony into Empathetic Response Generation. Sensors 25(2):434. DOI: 10.3390/s25020434. Online publication date: 13-Jan-2025.
  • (2025) Sounds and Natures Do Often Agree: Prediction of Esports Players' Performance in Fighting Games Based on the Operating Sounds of Game Controllers. Applied Sciences 15(2):719. DOI: 10.3390/app15020719. Online publication date: 13-Jan-2025.
  • (2025) Hybrid Self-Aligned Fusion With Dual-Weight Attention Network for Alzheimer's Detection. IEEE Signal Processing Letters 32:346-350. DOI: 10.1109/LSP.2024.3514803. Online publication date: 2025.
  • (2025) Continuous Speech-Based Fatigue Detection and Transition State Prediction for Air Traffic Controllers. IEEE Access 13:3298-3319. DOI: 10.1109/ACCESS.2024.3524452. Online publication date: 2025.
  • (2025) AMGCN: An adaptive multi-graph convolutional network for speech emotion recognition. Speech Communication 103184. DOI: 10.1016/j.specom.2024.103184. Online publication date: Jan-2025.
  • (2025) Facial action units guided graph representation learning for multimodal depression detection. Neurocomputing 619:129106. DOI: 10.1016/j.neucom.2024.129106. Online publication date: Feb-2025.
  • (2025) Feature-Enhanced Multimodal Interaction model for emotion recognition in conversation. Knowledge-Based Systems 309:112876. DOI: 10.1016/j.knosys.2024.112876. Online publication date: Jan-2025.
  • (2025) ProxyLabel: A framework to evaluate techniques for survey fatigue reduction leveraging auxiliary modalities. Expert Systems with Applications 265:125913. DOI: 10.1016/j.eswa.2024.125913. Online publication date: Mar-2025.
  • (2025) Beyond breathalyzers: AI-powered speech analysis for alcohol intoxication detection. Expert Systems with Applications 262:125656. DOI: 10.1016/j.eswa.2024.125656. Online publication date: Mar-2025.
  • (2025) A cross-modal collaborative guiding network for sarcasm explanation in multi-modal multi-party dialogues. Engineering Applications of Artificial Intelligence 142:109884. DOI: 10.1016/j.engappai.2024.109884. Online publication date: Feb-2025.
