research-article

Open access

Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content

Authors:

Jessica Colnago,

Jofish Kaye

CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems

Pages 1 - 13

https://doi.org/10.1145/3313831.3376789

Published: 23 April 2020

Abstract

The advancement of text-to-speech (TTS) voices and a rise of commercial TTS platforms allow people to easily experience TTS voices across a variety of technologies, applications, and form factors. As such, we evaluated TTS voices for long-form content: not individual words or sentences, but voices that are pleasant to listen to for several minutes at a time. We introduce a method using a crowdsourcing platform and an online survey to evaluate voices based on listening experience, perception of clarity and quality, and comprehension. We evaluated 18 TTS voices, three human voices, and a text-only control condition. We found that TTS voices are close to rivaling human voices, yet no single voice outperforms the others across all evaluation dimensions. We conclude with considerations for selecting text-to-speech voices for long-form content.

Supplementary Material

paper660aux.zip (ZIP file, 15.75 KB): Survey protocol

Cited By

Längle S, Schlögl S, Ecker A, van Kooten W, Spieß T. (2024). Nonbinary Voices for Digital Assistants: An Investigation of User Perceptions and Gender Stereotypes. Robotics 13:8 (111). DOI: 10.3390/robotics13080111. Online publication date: 23-Jul-2024.
https://doi.org/10.3390/robotics13080111
Feijóo-García P, Wrenn C, Gomes de Siqueira A, Ghosh R, Stuart J, Yao H, Lok B. (2024). Exploring the Effects of User-Agent and User-Designer Similarity in Virtual Human Design to Promote Mental Health Intentions for College Students. ACM Transactions on Applied Perception 22:1 (1-41). DOI: 10.1145/3689822. Online publication date: 28-Nov-2024.
https://dl.acm.org/doi/10.1145/3689822
Kang W, Hughes M, Roy D. (2024). Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard. Proceedings of the ACM on Human-Computer Interaction 8:CSCW2 (1-22). DOI: 10.1145/3687021. Online publication date: 8-Nov-2024.
https://dl.acm.org/doi/10.1145/3687021

Index Terms

Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction devices
      1. Sound-based input / output
    2. Interaction paradigms
      1. Natural language interfaces

Information & Contributors

Information

Published In

CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems

April 2020

10688 pages

ISBN:9781450367080

DOI:10.1145/3313831

General Chairs:
Regina Bernhaupt (Eindhoven University of Technology, Netherlands)
Florian 'Floyd' Mueller (Monash University, Australia)
David Verweij (Newcastle University, UK)
Josh Andres (RMIT, Australia)

Program Chairs:
Joanna McGrenere (University of British Columbia, Canada)
Andy Cockburn (University of Canterbury, New Zealand)
Ignacio Avellino (University of Maryland Baltimore County, USA)
Alix Goguey (Grenoble Alpes University, France)
Pernille Bjørn (University of Copenhagen, Denmark)
Shengdong (Shen) Zhao (National University of Singapore, Singapore)
Briane Paul Samson (Future University Hakodate, Japan & De La Salle University, Philippines)
Rafal Kocielnik (University of Washington, USA)

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CHI '20

Sponsor:

SIGCHI

CHI '20: CHI Conference on Human Factors in Computing Systems

April 25 - 30, 2020

Honolulu, HI, USA

Acceptance Rates

Overall Acceptance Rate 6,199 of 26,314 submissions, 24%

Bibliometrics

Total Citations: 41
Total Downloads: 9,579
Downloads (Last 12 months): 1,954
Downloads (Last 6 weeks): 312

Reflects downloads up to 11 Dec 2024

Citations

Cited By

Seaborn K, Urakami J, Pennefather P, Miyake N. (2024). Qualitative Approaches to Voice UX. ACM Computing Surveys 56:12 (1-34). DOI: 10.1145/3658666. Online publication date: 20-Apr-2024.
https://dl.acm.org/doi/10.1145/3658666
Oppenlaender J, Abbas T, Gadiraju U. (2024). The State of Pilot Study Reporting in Crowdsourcing: A Reflection on Best Practices and Guidelines. Proceedings of the ACM on Human-Computer Interaction 8:CSCW1 (1-45). DOI: 10.1145/3641023. Online publication date: 26-Apr-2024.
https://dl.acm.org/doi/10.1145/3641023
Oh J, Im H, Lee S. (2024). Toward a Third-Kind Voice for Conversational Agents in an Era of Blurring Boundaries Between Machine and Human Sounds. Proceedings of the 6th ACM Conference on Conversational User Interfaces (1-7). DOI: 10.1145/3640794.3665880. Online publication date: 8-Jul-2024.
https://dl.acm.org/doi/10.1145/3640794.3665880
Dubiel M, Sergeeva A, Leiva L. (2024). Impact of Voice Fidelity on Decision Making: A Potential Dark Pattern? Proceedings of the 29th International Conference on Intelligent User Interfaces (181-194). DOI: 10.1145/3640543.3645202. Online publication date: 18-Mar-2024.
https://dl.acm.org/doi/10.1145/3640543.3645202
Hutiri W, Papakyriakopoulos O, Xiang A. (2024). Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (359-376). DOI: 10.1145/3630106.3658911. Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3630106.3658911
Taylor J, Simpson E, Tran A, Brubaker J, Fox S, Zhu H. (2024). Cruising Queer HCI on the DL: A Literature Review of LGBTQ+ People in HCI. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (1-21). DOI: 10.1145/3613904.3642494. Online publication date: 11-May-2024.
https://dl.acm.org/doi/10.1145/3613904.3642494
Faruk L, Babakerkhell M, Mongkolnam P, Chongsuphajaisiddhi V, Funilkul S, Pal D. (2024). A Review of Subjective Scales Measuring the User Experience of Voice Assistants. IEEE Access 12 (14893-14917). DOI: 10.1109/ACCESS.2024.3358423. Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3358423
