Abstract
The Mean Opinion Scale (MOS) is a questionnaire used to obtain listeners' subjective assessments of synthetic speech. This paper documents the motivation, method, and results of six experiments conducted from 1999 to 2002 that investigated the psychometric properties of the MOS and expanded the range of speech characteristics it evaluates. Our initial experiments documented the reliability, validity, sensitivity, and factor structure of the MOS of P.L. Salza et al. (Acta Acustica, Vol. 82, pp. 650–656, 1996) and used psychometric principles to revise and improve the scale. This work resulted in the MOS-Revised (MOS-R). Four subsequent experiments expanded the MOS-R beyond its previous focus on Intelligibility and Naturalness to include measurement of the Prosody and Social Impression of synthetic voices. As a result of this work, we created the MOS-Expanded (MOS-X), a rating scale shown to be reliable, valid, and sensitive for high-quality evaluation of synthetic speech in applied industrial settings.
References
Baken, R. (1978). Clinical Measurement of Speech and Voice. Boston: Allyn & Bacon.
Berry, D. (1992). Vocal types and stereotypes: Joint effects of vocal attractiveness and vocal maturity on person perception. Journal of Nonverbal Behavior, 16:41-45.
Bloom, K., Zajac, D., and Titus, J. (1999). The influence of nasality of voice on sex-stereotyped perceptions. Journal of Nonverbal Behavior, 23:271-281.
Bradlow, A., Torretta, G., and Pisoni, D. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20:255-272.
Brown, B., Strong, W., and Rencher, A. (1973). Perceptions of personality from speech: Effects of manipulations of acoustical parameters. Journal of the Acoustical Society of America, 54:29-35.
Brown, B., Strong, W., and Rencher, A. (1975). Acoustic determinants of perceptions of personality from speech. International Journal of the Sociology of Language, 6:1-32.
Cliff, N. (1987). Analyzing Multivariate Data. San Diego, CA: Harcourt Brace Jovanovich.
Coovert, M.D. and McNelis, K. (1988). Determining the number of common factors in factor analysis: A review and program. Educational and Psychological Measurement, 48:687-693.
Ekman, P., O'Sullivan, M., Friesen, W., and Scherer, K. (1991). Face, voice, and body in detecting deceit. Journal of Nonverbal Behavior, 15:125-135.
Francis, A.L. and Nusbaum, H.C. (1999). Evaluating the quality of synthetic speech. In D. Gardner-Bonneau (Ed.), Human Factors and Voice Interactive Systems. Boston, MA: Kluwer, pp. 63-97.
Goldstein, M. (1995). Classification of methods used for assessment of text-to-speech systems according to the demands placed on the listener. Speech Communication, 16:225-244.
Gorenflo, D. and Gorenflo, C. (1997). Effects of synthetic speech, gender, and perceived similarity on attitudes toward the augmented communicator. AAC: Augmentative and Alternative Communication, 13:87-91.
Granstrom, B. and Nord, L. (1992). Neglected dimensions in speech synthesis. Speech Communication, 11:459-462.
Greene, B., Logan, J., and Pisoni, D. (1986). Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems. Behavior Research Methods, Instruments, and Computers, 18:100-107.
Hieda, I. and Kuchinomachi, Y. (1997). Preliminary study of relations between physical characteristics and psychological impressions of natural voices. Perceptual and Motor Skills, 85:1483-1491.
Higashikawa, M. and Minifie, F. (1999). Acoustical-perceptual correlates of 'whisper pitch' in synthetically generated vowels. Journal of Speech, Language, and Hearing Research, 42:583-591.
Hillenbrand, J. (1988). Perception of aperiodicities in synthetically generated voices. Journal of the Acoustical Society of America, 83:2361-2371.
Hoag, L. and Bedrosian, J. (1992). Effects of speech output type, message length, and reauditorization on perceptions of the communicative competence of an adult AAC user. Journal of Speech and Hearing Research, 35:1363-1366.
Holtgraves, T. and Lasky, B. (1999). Linguistic power and persuasion. Journal of Language and Social Psychology, 18:196-205.
Hosman, L. (1989). The evaluative consequences of hedges, hesitations, and intensifiers: Powerful and powerless speech styles. Human Communication Research, 15:383-406.
International Telecommunication Union (1994). A Method for Subjective Performance Assessment of the Quality of Speech Voice Output Devices (ITU-T Recommendation P.85). Geneva, Switzerland: ITU.
Johnston, R.D. (1996). Beyond intelligibility: The performance of text-to-speech synthesisers. BT Technology Journal, 14:100-111.
Johnson, W., Emde, R., Scherer, K., and Klinnert, M. (1986). Recognition of emotion from vocal cues. Archives of General Psychiatry, 43:280-283.
Klatt, D. and Klatt, L. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87:820-857.
Koopmans-Van Beinum, F. (1992). The role of focus words in natural and in synthetic continuous speech: Acoustic aspects. Speech Communication, 11:439-452.
Kraft, V. and Portele, T. (1995). Quality evaluation of five German speech synthesis systems. Acta Acustica, 3:351-365.
Landauer, T.K. (1988). Research methods in human-computer interaction. In M. Helander (Ed.), Handbook of Human-Computer Interaction. New York: Elsevier.
Lavner, Y., Gath, I., and Rosenhouse, J. (2000). The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels. Speech Communication, 30:9-26.
Lewis, J.R. (1993). Multipoint scales: Mean and median differences and observed significance levels. International Journal of Human-Computer Interaction, 5:383-392.
Lewis, J.R. (2001a). Psychometric properties of the Mean Opinion Scale. In Proceedings of HCI International 2001: Usability Evaluation and Interface Design. Mahwah, NJ: Lawrence Erlbaum, pp. 149-153.
Lewis, J.R. (2001b). The Revised Mean Opinion Scale (MOS-R): Preliminary Psychometric Evaluation (Tech. Report 29.3414). Raleigh, NC: International Business Machines Corp.
Martin, R. and Haroldson, S. (1992). Stuttering and speech naturalness: Audio and audiovisual judgments. Journal of Speech and Hearing Research, 35:521-528.
Massaro, D. and Egan, P. (1996). Perceiving affect from the voice and the face. Psychonomic Bulletin & Review, 3:215-221.
Miyake, K. and Zuckerman, M. (1993). Beyond personality: Effects of physical and vocal attractiveness on false consensus, social comparison, affiliation, and assumed and perceived personality. Journal of Personality, 61:411-437.
Moller, S., Jekosch, U., Mersdorf, J., and Kraft, V. (2001). Auditory assessment of synthesized speech in application scenarios: Two case studies. Speech Communication, 34:229-246.
Munsterburg, H. (1913). Psychology and industrial efficiency. In L.T. Benjamin Jr. (Ed.), A History of Psychology: Original Sources and Contemporary Research, 2nd edn. Boston: McGraw-Hill, pp. 584-593.
Murray, I. and Arnott, J. (1993). Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. Journal of the Acoustical Society of America, 93:1097-1108.
Murray, I. and Arnott, J. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16:369-390.
Murray, I., Arnott, J., and Rohwer, E. (1996). Emotional stress in synthetic speech: Progress and future directions. Speech Communication, 20:85-91.
Nunnally, J.C. (1978). Psychometric Theory. New York: McGraw-Hill.
Paddock, J. and Nowicki, S. (1986). Paralanguage and the interpersonal impact of dysphoria: It's not what you say but how you say it. Social Behavior and Personality, 14:29-44.
Page, R. and Balloun, J. (1978). The effect of voice volume on the perception of personality. Journal of Social Psychology, 105:65-72.
Paris, C.R., Thomas, M.H., Gilson, R.D., and Kincaid, J.P. (2000). Linguistic cues and memory for synthetic and natural speech. Human Factors, 42:421-431.
Pelachaud, C., Badler, N., and Steedman, M. (1996). Generating facial expressions for speech. Cognitive Science, 20:1-46.
Pisoni, D. (1997). Perception of synthetic speech. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis. New York: Springer, pp. 541-560.
Pols, L. and Jekosch, U. (1997). A structured way of looking at the performance of text-to-speech systems. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis. New York: Springer, pp. 519-528.
Portele, T. and Heuft, B. (1997). Toward a prominence-based synthesis system. Speech Communication, 21:61-72.
Salza, P.L., Foti, E., Nebbia, L., and Oreglia, M. (1996). MOS and pair comparison combined methods for quality evaluation of text to speech systems. Acta Acustica, 82:650-656.
Schmidt-Nielsen, A. (1995). Intelligibility and acceptability testing for speech technology. In A. Syrdal, R. Bennett, and S. Greenspan (Eds.), Applied Speech Technology. Boca Raton: CRC Press.
Shipley, K. and McAfee, J. (1992). Assessment in Speech Language Pathology: A Resource Manual. San Diego: Singular.
Sonntag, G.P. and Portele, T. (1998). PURR: A method for prosody evaluation and investigation. Computer Speech and Language, 12:437-451.
Sonntag, G.P., Portele, T., Haas, F., and Kohler, J. (1999). Comparative evaluation of six German TTS systems. Eurospeech '99. Budapest: Technical University of Budapest, pp. 251-254.
Slowiaczek, L. and Nusbaum, H. (1985). Effects of speech rate and pitch contour on the perception of synthetic speech. Human Factors, 27:701-712.
Stern, S., Mullennix, J., Dyson, C., and Wilson, S. (1999). The persuasiveness of synthetic speech versus human speech. Human Factors, 41:588-595.
Tartter, V. and Braun, D. (1994). Hearing smiles and frowns in normal and whisper registers. Journal of the Acoustical Society of America, 96:2101-2107.
van Bezooijen, R. and van Heuven, V. (1997). Assessment of synthesis systems. In D. Gibbon, R. Moore, and R. Winski (Eds.), Handbook of Standards and Resources for Spoken Language Systems. New York, NY: Mouton de Gruyter.
Wang, H. and Lewis, J.R. (2001). Intelligibility and acceptability of short phrases generated by embedded text-to-speech engines. In Proceedings of HCI International 2001: Usability Evaluation and Interface Design. Mahwah, NJ: Lawrence Erlbaum, pp. 144-148.
Whalen, D. and Hoequist, C. (1995). The effects of breath sounds on the perception of synthetic speech. Journal of the Acoustical Society of America, 97:3147-3153.
Whitmore, J. and Fisher, S. (1996). Speech during sustained operations. Speech Communication, 20:55-70.
Yabuoka, H., Nakayama, T., Kitabayashi, Y., and Asakawa, Y. (2000). Investigations of independence of distortion scales in objective evaluation of synthesized speech quality. Electronics and Communications in Japan, Part 3, 83:14-22.
van Riper, C. and Emerick, L. (1990). Speech Correction. Englewood Cliffs, NJ: Prentice Hall.
Yaeger-Dror, M. (1996). Register as a variable in prosodic analysis: The case of the English negative. Speech Communication, 19:39-60.
Zuckerman, M., Miyake, K., and Hodgins, H. (1991). Cross-channel effects of vocal and physical attractiveness and their implications for interpersonal perception. Journal of Personality and Social Psychology, 60:545-554.
Cite this article
Polkosky, M.D., Lewis, J.R. Expanding the MOS: Development and Psychometric Evaluation of the MOS-R and MOS-X. International Journal of Speech Technology 6, 161–182 (2003). https://doi.org/10.1023/A:1022390615396