Perception of Paralinguistic Traits in Synthesized Voices

Authors: Stina Hasse Jørgensen, Emilia Parada-Cabaleiro, Nicholas Cummins, Björn Schuller

AM '17: Proceedings of the 12th International Audio Mostly Conference on Augmented and Participatory Sound and Music Experiences

Article No.: 17, Pages 1 - 5

https://doi.org/10.1145/3123514.3123528

Published: 23 August 2017

Abstract

Along with the rise of artificial intelligence and the internet-of-things, synthesized voices are now common in daily life, providing us with guidance, assistance, and even companionship. From formant to concatenative synthesis, the synthesized voice continues to be defined by the same traits we ascribe to ourselves. When the recorded voice is synthesized, does our perception of its new machine embodiment change, and can we consider an alternative, more inclusive form? To begin evaluating the impact of aesthetic design, this study presents a first-step perception test to explore the paralinguistic traits of the synthesized voice. Using a corpus of 13 synthesized voices, constructed from acoustic concatenative speech synthesis, we assessed the responses of 23 listeners from differing cultural backgrounds. To evaluate whether perception shifts from the defined traits, we asked listeners to assign traits of age, gender, accent origin, and human-likeness. Results show a difference in perception for age and human-likeness across voices, and a general agreement across listeners for both gender and accent origin. Connections found between age, gender, and human-likeness call for further exploration into a more participatory and inclusive synthesized vocal identity.

Index Terms

Perception of Paralinguistic Traits in Synthesized Voices
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods
      1. User studies
    2. Interaction devices
      1. Sound-based input / output
  2. Ubiquitous and mobile computing
    1. Ubiquitous and mobile devices
      1. Personal digital assistants

Information & Contributors

Information

Published In

AM '17: Proceedings of the 12th International Audio Mostly Conference on Augmented and Participatory Sound and Music Experiences

August 2017

337 pages

ISBN: 9781450353731

DOI: 10.1145/3123514

Conference Chair:
George Fazekas,
Program Chairs:
Mathieu Barthet,
Tony Stockman

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Queen Mary, University of London

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 August 2017

Qualifiers

Research-article
Research
Refereed limited

Conference

AM '17

AM '17: Audio Mostly 2017

August 23 - 26, 2017

London, United Kingdom

Acceptance Rates

AM '17 Paper Acceptance Rate 54 of 77 submissions, 70%;

Overall Acceptance Rate 177 of 275 submissions, 64%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 10
Total Downloads: 317
Downloads (Last 12 months): 40
Downloads (Last 6 weeks): 7

Reflects downloads up to 12 Dec 2024

Citations

Cited By

Pias S, Huang R, Williamson D, Kim M, Kapadia A (2024). The Impact of Perceived Tone, Age, and Gender on Voice Assistant Persuasiveness in the Context of Product Recommendations. Proceedings of the 6th ACM Conference on Conversational User Interfaces, pp. 1-15. DOI: 10.1145/3640794.3665545. Online publication date: 8-Jul-2024
https://dl.acm.org/doi/10.1145/3640794.3665545
Holliday N (2023). Siri, you've changed! Acoustic properties and racialized judgments of voice assistants. Frontiers in Communication 8. DOI: 10.3389/fcomm.2023.1116955. Online publication date: 26-Apr-2023
https://doi.org/10.3389/fcomm.2023.1116955
Seaborn K, Nam S, Keckeis J, Itagaki T (2023). Can Voice Assistants Sound Cute? Towards a Model of Kawaii Vocalics. Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1-7. DOI: 10.1145/3544549.3585656. Online publication date: 19-Apr-2023
https://dl.acm.org/doi/10.1145/3544549.3585656
Do T, Akter M, Choudhary Z, Azevedo R, McMahan R (2022). The Effects of an Embodied Pedagogical Agent’s Synthetic Speech Accent on Learning Outcomes. Proceedings of the 2022 International Conference on Multimodal Interaction, pp. 198-206. DOI: 10.1145/3536221.3556587. Online publication date: 7-Nov-2022
https://dl.acm.org/doi/10.1145/3536221.3556587
Seaborn K, Miyake N, Pennefather P, Otake-Matsuura M (2021). Voice in Human–Agent Interaction. ACM Computing Surveys 54:4, pp. 1-43. DOI: 10.1145/3386867. Online publication date: 3-May-2021
https://dl.acm.org/doi/10.1145/3386867
Baird A, Schuller B (2020). Considerations for a More Ethical Approach to Data in AI: On Data Representation and Infrastructure. Frontiers in Big Data 3. DOI: 10.3389/fdata.2020.00025. Online publication date: 2-Sep-2020
https://doi.org/10.3389/fdata.2020.00025
Chavez-Sanchez F, Franco G, de la Peña G, Carrillo E, Torres M, Schlögl S, Clark L, Porcheron M (2020). Beyond What is Said. Proceedings of the 2nd Conference on Conversational User Interfaces, pp. 1-3. DOI: 10.1145/3405755.3406145. Online publication date: 22-Jul-2020
https://dl.acm.org/doi/10.1145/3405755.3406145
Thakur N, Han C, Janowicz K, Kuhn W, Cena F, Haller A, Vamvoudakis K (2018). An approach to analyze the social acceptance of virtual assistants by elderly people. Proceedings of the 8th International Conference on the Internet of Things, pp. 1-6. DOI: 10.1145/3277593.3277616. Online publication date: 15-Oct-2018
https://dl.acm.org/doi/10.1145/3277593.3277616
Fazal M, Ferguson S, Johnston A, Cunningham S, Picking R (2018). Investigating Concurrent Speech-based Designs for Information Communication. Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion, pp. 1-8. DOI: 10.1145/3243274.3243284. Online publication date: 12-Sep-2018
https://dl.acm.org/doi/10.1145/3243274.3243284
Søndergaard M, Hansen L, Koskinen I, Lim Y, Cerratto-Pargman T, Chow K, Odom W (2018). Intimate Futures. Proceedings of the 2018 Designing Interactive Systems Conference, pp. 869-880. DOI: 10.1145/3196709.3196766. Online publication date: 8-Jun-2018
https://dl.acm.org/doi/10.1145/3196709.3196766
