DOI: 10.1145/1873951.1874246
Short paper

openSMILE: the Munich versatile and fast open-source audio feature extractor

Published: 25 October 2010

Abstract

We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
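The processing chain the abstract describes (frame-level low-level descriptors, delta regression coefficients, then statistical functionals that map a variable-length contour to a fixed-size feature vector) can be sketched as follows. This is an illustrative reconstruction in plain Python, not openSMILE's actual C++ API; all function names here are invented.

```python
# Illustrative sketch of the LLD -> delta regression -> functionals pipeline
# described in the abstract. Names are invented; this is not the openSMILE API.
import math

def delta_regression(contour, w=2):
    """Delta coefficients via the standard regression formula over a +/-w frame window."""
    n = len(contour)
    denom = 2 * sum(t * t for t in range(1, w + 1))
    deltas = []
    for i in range(n):
        # Edge frames reuse the first/last value (a common padding choice).
        num = sum(
            t * (contour[min(i + t, n - 1)] - contour[max(i - t, 0)])
            for t in range(1, w + 1)
        )
        deltas.append(num / denom)
    return deltas

def functionals(contour):
    """A few statistical functionals: they map a contour of any length to fixed values."""
    n = len(contour)
    mean = sum(contour) / n
    var = sum((x - mean) ** 2 for x in contour) / n
    return {
        "mean": mean,
        "stddev": math.sqrt(var),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }

# One hypothetical frame-level LLD contour (e.g. per-frame loudness).
loudness = [0.1, 0.4, 0.9, 0.7, 0.3]
d = delta_regression(loudness)
feats = {**functionals(loudness),
         **{"delta_" + k: v for k, v in functionals(d).items()}}
# feats is a fixed-size (here 10-dimensional) vector regardless of how many
# frames the utterance had.
```

The real tool applies far larger sets of functionals (percentiles, regression slopes, moments, and so on) to many LLDs and their deltas, which is how brute-forced feature spaces of thousands of dimensions arise; the sketch above only shows the shape of the computation.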



Published In

MM '10: Proceedings of the 18th ACM international conference on Multimedia
October 2010
1836 pages
ISBN:9781605589336
DOI:10.1145/1873951

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. audio feature extraction
  2. emotion
  3. music
  4. signal processing
  5. speech
  6. statistical functionals

Qualifiers

  • Short-paper

Conference

MM '10: ACM Multimedia Conference
October 25-29, 2010
Firenze, Italy

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 537
  • Downloads (last 6 weeks): 69
Reflects downloads up to 12 Jan 2025


Cited By

  • (2025) Incorporating Multimodal Directional Interpersonal Synchrony into Empathetic Response Generation. Sensors 25(2):434. DOI: 10.3390/s25020434. Online publication date: 13-Jan-2025.
  • (2025) Sounds and Natures Do Often Agree: Prediction of Esports Players' Performance in Fighting Games Based on the Operating Sounds of Game Controllers. Applied Sciences 15(2):719. DOI: 10.3390/app15020719. Online publication date: 13-Jan-2025.
  • (2025) Hybrid Self-Aligned Fusion With Dual-Weight Attention Network for Alzheimer's Detection. IEEE Signal Processing Letters 32:346-350. DOI: 10.1109/LSP.2024.3514803. Online publication date: 2025.
  • (2025) Continuous Speech-Based Fatigue Detection and Transition State Prediction for Air Traffic Controllers. IEEE Access 13:3298-3319. DOI: 10.1109/ACCESS.2024.3524452. Online publication date: 2025.
  • (2025) AMGCN: An adaptive multi-graph convolutional network for speech emotion recognition. Speech Communication 103184. DOI: 10.1016/j.specom.2024.103184. Online publication date: Jan-2025.
  • (2025) Facial action units guided graph representation learning for multimodal depression detection. Neurocomputing 619:129106. DOI: 10.1016/j.neucom.2024.129106. Online publication date: Feb-2025.
  • (2025) Feature-Enhanced Multimodal Interaction model for emotion recognition in conversation. Knowledge-Based Systems 309:112876. DOI: 10.1016/j.knosys.2024.112876. Online publication date: Jan-2025.
  • (2025) ProxyLabel: A framework to evaluate techniques for survey fatigue reduction leveraging auxiliary modalities. Expert Systems with Applications 265:125913. DOI: 10.1016/j.eswa.2024.125913. Online publication date: Mar-2025.
  • (2025) Beyond breathalyzers: AI-powered speech analysis for alcohol intoxication detection. Expert Systems with Applications 262:125656. DOI: 10.1016/j.eswa.2024.125656. Online publication date: Mar-2025.
  • (2025) A cross-modal collaborative guiding network for sarcasm explanation in multi-modal multi-party dialogues. Engineering Applications of Artificial Intelligence 142:109884. DOI: 10.1016/j.engappai.2024.109884. Online publication date: Feb-2025.
