
Predicting meeting extracts in group discussions using multimodal convolutional neural networks

Published: 03 November 2017

Abstract

This study proposes the use of multimodal fusion models employing Convolutional Neural Networks (CNNs) to extract meeting minutes from a group discussion corpus. First, unimodal models are created using raw behavioral data such as speech, head motion, and face tracking. These models are then integrated into a fusion model that works as a classifier. The main advantage of this work is that the proposed models were trained without any hand-crafted features, yet they outperformed a baseline model that was trained using hand-crafted features. It was also found that multimodal fusion is useful for applying the CNN approach to modeling multimodal multiparty interaction.
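The abstract describes unimodal CNNs over raw behavioral signals whose outputs are combined by a fusion model that acts as a classifier. The sketch below illustrates one way such a late-fusion architecture can be wired up in PyTorch; the paper does not publish code, so the modality channel counts, kernel sizes, and embedding width here are illustrative assumptions, not the authors' actual configuration.

```python
# A minimal late-fusion CNN sketch, assuming three raw behavioral inputs
# (speech waveform, head motion, face tracking). All layer sizes are
# illustrative guesses; this is not the paper's published architecture.
import torch
import torch.nn as nn

class UnimodalCNN(nn.Module):
    """1-D CNN mapping one raw behavioral signal to a fixed-size embedding."""
    def __init__(self, in_channels: int, embed_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # global average over time
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):                         # x: (batch, channels, time)
        return self.proj(self.features(x).squeeze(-1))

class FusionClassifier(nn.Module):
    """Concatenates unimodal embeddings and predicts extract / non-extract."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.speech = UnimodalCNN(1, embed_dim)   # raw audio waveform
        self.head = UnimodalCNN(3, embed_dim)     # head rotation, 3 axes
        self.face = UnimodalCNN(6, embed_dim)     # face-tracking parameters
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim * 3, 64), nn.ReLU(),
            nn.Linear(64, 2),                     # meeting extract vs. not
        )

    def forward(self, speech, head, face):
        fused = torch.cat(
            [self.speech(speech), self.head(head), self.face(face)], dim=1)
        return self.classifier(fused)

# Forward pass on dummy tensors: a batch of 4 one-second utterances.
model = FusionClassifier()
logits = model(
    torch.randn(4, 1, 16000),   # 16 kHz audio
    torch.randn(4, 3, 100),     # 100 head-motion frames
    torch.randn(4, 6, 100),     # 100 face-tracking frames
)
print(logits.shape)             # torch.Size([4, 2])
```

Late fusion (concatenating per-modality embeddings before a shared classifier) is one common realization of the fusion step; the abstract does not specify the exact fusion point, so the concatenation layer above is an assumption.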

Published In

ICMI '17: Proceedings of the 19th ACM International Conference on Multimodal Interaction
November 2017, 676 pages
ISBN: 9781450355438
DOI: 10.1145/3136755
Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. Deep neural network
      2. Important utterances for meeting summarization
      3. Multimodal fusion

      Qualifiers

      • Short-paper

      Acceptance Rates

ICMI '17 paper acceptance rate: 65 of 149 submissions (44%)
Overall acceptance rate: 453 of 1,080 submissions (42%)


Cited By

• (2020) Multimodal Data Fusion in Learning Analytics: A Systematic Review. Sensors 20(23): 6856. DOI: 10.3390/s20236856. Online publication date: 30 Nov 2020.
• (2019) Task-independent Multimodal Prediction of Group Performance Based on Product Dimensions. 2019 International Conference on Multimodal Interaction, 264–273. DOI: 10.1145/3340555.3353729. Online publication date: 14 Oct 2019.
• (2019) REsCUE. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–13. DOI: 10.1145/3290605.3300802. Online publication date: 2 May 2019.
• (2019) Towards Collaboration Translucence. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–16. DOI: 10.1145/3290605.3300269. Online publication date: 2 May 2019.
• (2018) Fusing Verbal and Nonverbal Information for Extractive Meeting Summarization. Proceedings of the Group Interaction Frontiers in Technology, 1–9. DOI: 10.1145/3279981.3279987. Online publication date: 16 Oct 2018.
• (2018) Using Parallel Episodes of Speech to Represent and Identify Interaction Dynamics for Group Meetings. Proceedings of the Group Interaction Frontiers in Technology, 1–7. DOI: 10.1145/3279981.3279983. Online publication date: 16 Oct 2018.
• (2018) Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks. Proceedings of the 20th ACM International Conference on Multimodal Interaction, 191–199. DOI: 10.1145/3242969.3242973. Online publication date: 2 Oct 2018.
