Spoken language, in addition to serving as a primary vehicle for externalizing linguistic structure and meaning, carries many other kinds of information about the speaker, including background, age, gender, membership in social structures, and physiological, pathological and emotional states. This information is more than ancillary to the main purpose of linguistic communication: humans react to the non-linguistic factors encoded in the speech signal, shaping and adjusting their interactions to satisfy interpersonal and social protocols.
Computer science, artificial intelligence and computational linguistics have devoted much active research to systems that model the production and recovery of linguistic lexico-semantic structures from speech. Far less attention has been devoted to systems that model and understand the paralinguistic and extralinguistic information in the signal. As the breadth and nature of human-computer interaction expand to levels previously reserved for human-to-human communication, there is a growing need to endow computational systems with human-like abilities that facilitate the interaction and make it more natural. Paramount among these is the human ability to infer the affective content of our exchanges.
This thesis proposes a framework for recognizing affective qualifiers from prosodic-acoustic parameters extracted from spoken language. It is argued that the affective prosodic variation of speech can be modeled by integrating acoustic parameters across prosodic time scales, summarizing information from more localized phenomena (e.g., at the syllable level) to more global ones (e.g., at the utterance level). In this framework, speech is represented as a dynamically evolving hierarchical structure whose levels are determined by prosodic constituency and whose parameters evolve according to dynamical systems. The acoustic parameters are chosen to capture four main components of speech thought to carry paralinguistic and affect-specific information: intonation, loudness, rhythm and voice quality. The thesis addresses the contribution of each of these components separately, and evaluates the full model on datasets of acted and of spontaneous speech perceptually annotated with affective labels, comparing its performance against human benchmarks.
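As a rough illustration of the multi-scale idea described above (not of the thesis's actual implementation), the Python sketch below summarizes frame-level pitch and energy contours both per syllable and over the whole utterance, then stacks the summaries into a single feature vector. The contours, the syllable boundaries, and all function names are hypothetical placeholders for whatever prosodic front end is used.

```python
import numpy as np

def summarize(contour):
    """Summary statistics for one segment of a prosodic contour:
    mean, standard deviation, range, and linear slope."""
    x = np.arange(len(contour))
    slope = np.polyfit(x, contour, 1)[0]
    return np.array([contour.mean(), contour.std(),
                     contour.max() - contour.min(), slope])

def hierarchical_features(f0, energy, syllable_bounds):
    """Concatenate utterance-level summaries with averaged syllable-level
    summaries of the pitch (f0) and energy contours."""
    utterance = np.concatenate([summarize(f0), summarize(energy)])
    per_syllable = [np.concatenate([summarize(f0[a:b]), summarize(energy[a:b])])
                    for a, b in syllable_bounds]
    local = np.mean(per_syllable, axis=0)  # fixed-size summary of local behaviour
    return np.concatenate([utterance, local])

if __name__ == "__main__":
    # Toy contours standing in for the output of a pitch/energy tracker.
    t = np.linspace(0.0, 1.0, 200)
    f0 = 120.0 + 30.0 * np.sin(2 * np.pi * 3 * t)            # Hz
    energy = 0.5 + 0.4 * np.abs(np.sin(2 * np.pi * 5 * t))   # arbitrary units
    syllable_bounds = [(0, 50), (50, 120), (120, 200)]       # hypothetical frame spans
    features = hierarchical_features(f0, energy, syllable_bounds)
    print(features.shape)  # (16,): 8 utterance-level + 8 averaged syllable-level values
```

The sketch reduces each level to static summary statistics only to show how local and global prosodic evidence can be combined into one representation; in the thesis's framework the levels are instead modeled as dynamical systems.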
Cited By
- Li Y. Semi-Supervised Learning for Multimodal Speech and Emotion Recognition. Proceedings of the 2021 International Conference on Multimodal Interaction, (817-821)
- Mottelson A and Hornbæk K. An affect detection technique using mobile commodity sensors in the wild. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, (781-792)
- Li Y, Contreras J and Salazar L. Predicting Voice Elicited Emotions. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (1969-1978)
- Reyes-Vargas M, Sánchez-Gutiérrez M, Rufiner L, Albornoz M, Vignolo L, Martínez-Licona F and Goddard-Close J. Hierarchical Clustering and Classification of Emotions in Human Speech Using Confusion Matrices. Proceedings of the 15th International Conference on Speech and Computer - Volume 8113, (162-169)
- Hennig S. Candidacy of physiological measurements for implicit control of emotional speech synthesis. Proceedings of the 4th international conference on Affective computing and intelligent interaction - Volume Part II, (208-215)
- Chang K, Fisher D, Canny J and Hartmann B. How's my mood and stress? Proceedings of the 6th International Conference on Body Area Networks, (71-77)
- Eyben F, Wöllmer M and Schuller B. Opensmile. Proceedings of the 18th ACM international conference on Multimedia, (1459-1462)
- Álvarez A, Cearreta I, López J, Arruti A, Lazkano E, Sierra B and Garay N. Application of feature subset selection based on evolutionary algorithms for automatic emotion recognition in speech. Proceedings of the 2007 international conference on Advances in nonlinear speech processing, (273-281)
- Inanoglu Z and Caneel R. Emotive alert. Proceedings of the 10th international conference on Intelligent user interfaces, (251-253)
- Vemuri S and Bender W (2004). Next-Generation Personal Memory Aids, BT Technology Journal, 22:4, (125-138), Online publication date: 1-Oct-2004.