Latent mixture of discriminative experts for multimodal prediction modeling

Published: 23 August 2010

Abstract

During face-to-face conversation, people naturally integrate speech, gestures and higher-level language interpretations to predict the right time to start talking or to give backchannel feedback. In this paper we introduce a new model, the Latent Mixture of Discriminative Experts, which addresses three key issues in multimodal language processing: (1) temporal synchrony and asynchrony between modalities, (2) micro dynamics and (3) integration of different levels of interpretation. We present an empirical evaluation on the prediction of listener nonverbal feedback (e.g., head nods) from observable behaviors of the speaker. We confirm the importance of combining four types of multimodal features: lexical, syntactic structure, eye gaze, and prosody. We show that our Latent Mixture of Discriminative Experts model outperforms previous approaches based on Conditional Random Fields (CRFs) and Latent-Dynamic CRFs.

Cited By

  • (2012) Integrating backchannel prediction models into embodied conversational agents. Proceedings of the 12th international conference on Intelligent Virtual Agents, pages 268-274. DOI: 10.1007/978-3-642-33197-8_28. Online publication date: 12-Sep-2012
  • (2011) Modeling wisdom of crowds using latent mixture of discriminative experts. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 335-340. DOI: 10.5555/2002736.2002806. Online publication date: 19-Jun-2011
  • (2011) Computational study of human communication dynamic. Proceedings of the 2011 joint ACM workshop on Human gesture and behavior understanding, pages 13-18. DOI: 10.1145/2072572.2072578. Online publication date: 1-Dec-2011

Published In

COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics
August 2010
1408 pages

Publisher

Association for Computational Linguistics, United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 1,537 of 1,537 submissions, 100%
