DOI: 10.1145/3123266.3123338
research-article

A Paralinguistic Approach To Speaker Diarisation: Using Age, Gender, Voice Likability and Personality Traits

Published: 19 October 2017

Abstract

In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact that speakers can be well discriminated by humans according to various perceived characteristics. Thus, we advocate a novel paralinguistic approach that combines speaker diarisation with speaker characterisation by automatically identifying the speakers according to their individual traits. In a three-tier processing flow, speaker segmentation by voice activity detection (VAD) is initially performed to detect speaker turns. Next, speaker attributes are predicted using pre-trained paralinguistic models. To tag the speakers, clustering algorithms are applied to the predicted traits. We evaluate our methods against state-of-the-art open source and commercial systems on a corpus of realistic, spontaneous dyadic conversations recorded in the wild from three different cultures (Chinese, English, German). Our results provide clear evidence that using paralinguistic features for speaker diarisation is a promising avenue of research.
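
The three-tier flow described in the abstract (segmentation by VAD, trait prediction, clustering of the predicted traits) can be illustrated with a toy sketch. This is not the authors' implementation: the energy threshold VAD, the `predict_traits` stand-in (which simply rescales mean pitch into a single made-up "trait" score in place of pre-trained paralinguistic models), and the two-cluster k-means for a dyadic conversation are all assumptions chosen to keep the example self-contained.

```python
import math


def detect_turns(energies, threshold=0.1):
    """Tier 1 (toy VAD): return (start, end) frame ranges, end exclusive,
    where frame energy stays at or above the threshold."""
    turns, start = [], None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i
        elif energy < threshold and start is not None:
            turns.append((start, i))
            start = None
    if start is not None:
        turns.append((start, len(energies)))
    return turns


def predict_traits(pitches):
    """Tier 2 (hypothetical stand-in): a real system would apply pre-trained
    models for age, gender, likability, and personality. Here the mean pitch
    is rescaled into one illustrative score."""
    mean_pitch = sum(pitches) / len(pitches)
    return (mean_pitch / 300.0,)


def kmeans_two(points, iterations=20):
    """Tier 3: plain k-means with k=2 (dyadic conversation assumed).
    Deterministic init: first point and the point farthest from it."""
    if not points:
        return []
    first = points[0]
    centroids = [first, max(points, key=lambda p: math.dist(p, first))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def diarise(energies, pitches, threshold=0.1):
    """Run the three tiers and tag each detected turn with a speaker label."""
    turns = detect_turns(energies, threshold)
    traits = [predict_traits(pitches[s:e]) for s, e in turns]
    labels = kmeans_two(traits)
    return [(s, e, f"speaker_{lab}") for (s, e), lab in zip(turns, labels)]


if __name__ == "__main__":
    # Synthetic per-frame energy and pitch values, invented for illustration.
    energies = [0.0, 0.5, 0.6, 0.0, 0.4, 0.5, 0.0, 0.7, 0.6, 0.0]
    pitches = [0, 110, 120, 0, 210, 220, 0, 115, 105, 0]
    for turn in diarise(energies, pitches):
        print(turn)
```

Clustering on trait scores rather than on low-level spectral features is the point of the sketch; in practice the trait vector would have several dimensions and the number of speakers might not be known in advance.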

Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. computational paralinguistics
  2. speaker diarisation

Qualifiers

  • Research-article

Funding Sources

  • European Commission

Conference

MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 37
  • Downloads (last 6 weeks): 3

Reflects downloads up to 14 Dec 2024

Cited By

  • (2024) The Impact of Perceived Tone, Age, and Gender on Voice Assistant Persuasiveness in the Context of Product Recommendations. Proceedings of the 6th ACM Conference on Conversational User Interfaces, 1-15. DOI: 10.1145/3640794.3665545. Online publication date: 8-Jul-2024
  • (2021) The Power of Voice: Using Audio Podcasts to Teach Vocal Performance and Digital Communication. Journal of Communication Pedagogy, 38-50. DOI: 10.31446/JCP.2021.1.044. Online publication date: 2021
  • (2021) Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments. 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 1-5. DOI: 10.1109/ISCSLP49672.2021.9362084. Online publication date: 24-Jan-2021
  • (2021) Convolutional and Deep Neural Networks based techniques for extracting the age-relevant features of the speaker. Journal of Ambient Intelligence and Humanized Computing 13:12, 5655-5667. DOI: 10.1007/s12652-021-03238-1. Online publication date: 25-Apr-2021
  • (2020) A review of CALL-based ASR and its potential application for Malay cued Speech learning tool application. Proceedings of Advanced Material, Engineering & Technology, 020007. DOI: 10.1063/5.0023095. Online publication date: 2020
  • (2020) Heterogeneous ensemble classifiers for Malay syllables classification. Proceedings of Advanced Material, Engineering & Technology, 020074. DOI: 10.1063/5.0023094. Online publication date: 2020
  • (2019) Efficient band selection for improving the robustness of the EMD-based cepstral features. Sādhanā 44:3. DOI: 10.1007/s12046-019-1052-x. Online publication date: 9-Feb-2019
