research-article

Latent mixture of discriminative experts for multimodal prediction modeling

Authors:

Louis-Philippe Morency

COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics

Pages 860 - 868

Published: 23 August 2010

Abstract

During face-to-face conversation, people naturally integrate speech, gestures and higher-level language interpretations to predict the right time to start talking or to give backchannel feedback. In this paper we introduce a new model called Latent Mixture of Discriminative Experts, which addresses some of the key issues with multimodal language processing: (1) temporal synchrony/asynchrony between modalities, (2) micro-dynamics and (3) integration of different levels of interpretation. We present an empirical evaluation on listener nonverbal feedback prediction (e.g., head nod), based on observable behaviors of the speaker. We confirm the importance of combining four types of multimodal features: lexical, syntactic structure, eye gaze, and prosody. We show that our Latent Mixture of Discriminative Experts model outperforms previous approaches based on Conditional Random Fields (CRFs) and Latent-Dynamic CRFs.

Cited By

de Kok I, Heylen D (2012). Integrating backchannel prediction models into embodied conversational agents. Proceedings of the 12th international conference on Intelligent Virtual Agents, 10.1007/978-3-642-33197-8_28 (268-274). Online publication date: 12-Sep-2012
https://dl.acm.org/doi/10.1007/978-3-642-33197-8_28
Ozkan D, Morency L, Lin D (2011). Modeling wisdom of crowds using latent mixture of discriminative experts. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, 10.5555/2002736.2002806 (335-340). Online publication date: 19-Jun-2011
https://dl.acm.org/doi/10.5555/2002736.2002806
Morency L, Cucchiara R, Pantic M, Daoudi M, Del Bimbo A, Pentland A, Vinciarelli A (2011). Computational study of human communication dynamic. Proceedings of the 2011 joint ACM workshop on Human gesture and behavior understanding, 10.1145/2072572.2072578 (13-18). Online publication date: 1-Dec-2011
https://dl.acm.org/doi/10.1145/2072572.2072578

Latent mixture of discriminative experts for multimodal prediction modeling
1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published In

COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics

August 2010

1408 pages

General Chair: Aravind K. Joshi (University of Pennsylvania)

Program Chairs: Chu-Ren Huang (The Hong Kong Polytechnic University), Dan Jurafsky (Stanford University)

Publisher

Association for Computational Linguistics

United States

Acceptance Rates

Overall Acceptance Rate 1,537 of 1,537 submissions, 100%
