DOI: 10.1145/3313831.3376789
research-article
Open access

Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content

Published: 23 April 2020

Abstract

The advancement of text-to-speech (TTS) voices and a rise of commercial TTS platforms allow people to easily experience TTS voices across a variety of technologies, applications, and form factors. As such, we evaluated TTS voices for long-form content: not individual words or sentences, but voices that are pleasant to listen to for several minutes at a time. We introduce a method using a crowdsourcing platform and an online survey to evaluate voices based on listening experience, perception of clarity and quality, and comprehension. We evaluated 18 TTS voices, three human voices, and a text-only control condition. We found that TTS voices are close to rivaling human voices, yet no single voice outperforms the others across all evaluation dimensions. We conclude with considerations for selecting text-to-speech voices for long-form content.

Supplementary Material

ZIP File (paper660aux.zip)
Survey protocol

      Published In

      CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems
      April 2020
      10688 pages
      ISBN: 9781450367080
      DOI: 10.1145/3313831
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. listening experience
      2. long-form
      3. synthesized speech
      4. text-to-speech
      5. tts
      6. voice interface
      7. voice quality

      Qualifiers

      • Research-article

      Conference

      CHI '20
      Acceptance Rates

      Overall Acceptance Rate 6,199 of 26,314 submissions, 24%

      Article Metrics

      • Downloads (last 12 months): 1,954
      • Downloads (last 6 weeks): 312
      Reflects downloads up to 11 Dec 2024

      Cited By

      • (2024) Nonbinary Voices for Digital Assistants—An Investigation of User Perceptions and Gender Stereotypes. Robotics 13:8 (111). DOI: 10.3390/robotics13080111. Online publication date: 23-Jul-2024
      • (2024) Exploring the Effects of User-Agent and User-Designer Similarity in Virtual Human Design to Promote Mental Health Intentions for College Students. ACM Transactions on Applied Perception 22:1 (1-41). DOI: 10.1145/3689822. Online publication date: 28-Nov-2024
      • (2024) Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard. Proceedings of the ACM on Human-Computer Interaction 8:CSCW2 (1-22). DOI: 10.1145/3687021. Online publication date: 8-Nov-2024
      • (2024) Qualitative Approaches to Voice UX. ACM Computing Surveys 56:12 (1-34). DOI: 10.1145/3658666. Online publication date: 20-Apr-2024
      • (2024) The State of Pilot Study Reporting in Crowdsourcing: A Reflection on Best Practices and Guidelines. Proceedings of the ACM on Human-Computer Interaction 8:CSCW1 (1-45). DOI: 10.1145/3641023. Online publication date: 26-Apr-2024
      • (2024) Toward a Third-Kind Voice for Conversational Agents in an Era of Blurring Boundaries Between Machine and Human Sounds. Proceedings of the 6th ACM Conference on Conversational User Interfaces (1-7). DOI: 10.1145/3640794.3665880. Online publication date: 8-Jul-2024
      • (2024) Impact of Voice Fidelity on Decision Making: A Potential Dark Pattern? Proceedings of the 29th International Conference on Intelligent User Interfaces (181-194). DOI: 10.1145/3640543.3645202. Online publication date: 18-Mar-2024
      • (2024) Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (359-376). DOI: 10.1145/3630106.3658911. Online publication date: 3-Jun-2024
      • (2024) Cruising Queer HCI on the DL: A Literature Review of LGBTQ+ People in HCI. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (1-21). DOI: 10.1145/3613904.3642494. Online publication date: 11-May-2024
      • (2024) A Review of Subjective Scales Measuring the User Experience of Voice Assistants. IEEE Access 12 (14893-14917). DOI: 10.1109/ACCESS.2024.3358423. Online publication date: 2024