Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks

Published: 02 October 2018

Abstract

Convolutional neural networks (CNNs) are employed to estimate the visual focus of attention (VFoA), also called gaze direction, in multiparty face-to-face meetings on the basis of multimodal nonverbal behaviors, including head pose, eyeball direction, and the presence/absence of utterances. To reveal the potential of CNNs, we focus on aspects of multimodal and multiparty fusion, including individual/group models, early/late fusion, and robustness when using inputs from image-based trackers. In contrast to the individual model, which separately targets each person at a specific seat, the group model jointly estimates the gaze directions of all participants. Experiments confirmed that the group model outperformed the individual model, especially in predicting listeners' VFoA when the inputs did not include eyeball directions. This result indicates that the group CNN model can implicitly learn underlying conversation structures, e.g., that listeners' gazes converge on the speaker. When the eyeball direction feature is available, both models outperformed the Bayes models used for comparison; in this case, the individual model was superior to the group model, particularly in estimating the speaker's VFoA. Moreover, among the group models, two-stage late fusion, which integrates each participant's individual features first and multiparty features second, outperformed the other structures. Furthermore, our experiments confirmed that image-based tracking can provide a level of performance comparable to that of sensor-based measurements. Overall, the results suggest that CNNs are a promising approach to VFoA estimation.
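As a concrete illustration of the two-stage late fusion described above, the following sketch shows one way such a group model could be structured. It is a minimal, hypothetical PyTorch implementation written for this summary; the framework choice, the TwoStageGroupModel class, the layer sizes, the participant count, and the per-person feature layout (head pose, eyeball direction, utterance) are all illustrative assumptions, not the authors' published architecture.

import torch
import torch.nn as nn

N_PARTICIPANTS = 4     # assumed meeting size; the paper studies multiparty meetings
N_FEATURES = 6         # assumed per-person features: head pose (3), eyeball direction (2), utterance (1)
WINDOW = 32            # assumed number of time steps in one input window
N_TARGETS = N_PARTICIPANTS  # per person: the (N-1) other participants + gaze aversion

class TwoStageGroupModel(nn.Module):
    """Hypothetical group model with two-stage late fusion."""
    def __init__(self):
        super().__init__()
        # Stage 1: one 1-D CNN branch per participant fuses that person's own
        # multimodal features across the time window (individual fusion).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(N_FEATURES, 32, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Conv1d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),      # -> (batch, 64, 1)
            ) for _ in range(N_PARTICIPANTS)
        ])
        # Stage 2: the per-person representations are concatenated and fused
        # across participants (multiparty fusion).
        self.fusion = nn.Sequential(nn.Linear(64 * N_PARTICIPANTS, 128), nn.ReLU())
        # One classification head per participant, so all VFoA labels are
        # predicted jointly from the shared multiparty representation.
        self.heads = nn.ModuleList(
            [nn.Linear(128, N_TARGETS) for _ in range(N_PARTICIPANTS)])

    def forward(self, x):
        # x: (batch, N_PARTICIPANTS, N_FEATURES, WINDOW)
        per_person = [self.branches[i](x[:, i]).squeeze(-1)
                      for i in range(N_PARTICIPANTS)]
        fused = self.fusion(torch.cat(per_person, dim=1))
        return [head(fused) for head in self.heads]   # one logit vector per person

model = TwoStageGroupModel()
logits = model(torch.randn(8, N_PARTICIPANTS, N_FEATURES, WINDOW))
print([t.shape for t in logits])   # four tensors of shape (8, N_TARGETS)

The point of the sketch is the ordering of fusion: each branch first integrates one participant's own multimodal features over a time window, and only afterwards are the per-person representations combined, so the joint heads can exploit multiparty regularities such as listeners' gazes converging on the speaker.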





      Published In

      ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
      October 2018
      687 pages
      ISBN:9781450356923
      DOI:10.1145/3242969

      Sponsors

• SIGCHI: Special Interest Group on Computer-Human Interaction of the ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. convolutional neural networks
      2. deep learning
      3. gaze
      4. meeting analysis
      5. multimodal fusion
      6. visual focus of attention

      Qualifiers

      • Research-article

      Conference

      ICMI '18
      Sponsor:
      • SIGCHI

      Acceptance Rates

ICMI '18 Paper Acceptance Rate: 63 of 149 submissions, 42%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%


      Cited By

• (2024) Exploring Interlocutor Gaze Interactions in Conversations based on Functional Spectrum Analysis. Proceedings of the 26th International Conference on Multimodal Interaction, 86-94. DOI: 10.1145/3678957.3685708. Online publication date: 4-Nov-2024.
• (2024) Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 11390-11396. DOI: 10.1145/3664647.3688987. Online publication date: 28-Oct-2024.
• (2024) Improving collaborative problem-solving skills via automated feedback and scaffolding: a quasi-experimental study with CPSCoach 2.0. User Modeling and User-Adapted Interaction, 34(4), 1087-1125. DOI: 10.1007/s11257-023-09387-6. Online publication date: 14-Feb-2024.
• (2023) The AI4Autism Project: A Multimodal and Interdisciplinary Approach to Autism Diagnosis and Stratification. Companion Publication of the 25th International Conference on Multimodal Interaction, 414-425. DOI: 10.1145/3610661.3616239. Online publication date: 9-Oct-2023.
• (2023) Analyzing and Recognizing Interlocutors' Gaze Functions from Multimodal Nonverbal Cues. Proceedings of the 25th International Conference on Multimodal Interaction, 33-41. DOI: 10.1145/3577190.3614152. Online publication date: 9-Oct-2023.
• (2023) WiFiTuned: Monitoring Engagement in Online Participation by Harmonizing WiFi and Audio. Proceedings of the 25th International Conference on Multimodal Interaction, 670-678. DOI: 10.1145/3577190.3614108. Online publication date: 9-Oct-2023.
• (2023) Instructor-in-the-Loop Exploratory Analytics to Support Group Work. LAK23: 13th International Learning Analytics and Knowledge Conference, 284-292. DOI: 10.1145/3576050.3576093. Online publication date: 13-Mar-2023.
• (2022) TA-CNN. Proceedings of the 30th ACM International Conference on Multimedia, 7099-7103. DOI: 10.1145/3503161.3551587. Online publication date: 10-Oct-2022.
• (2022) A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 5037-5046. DOI: 10.1109/CVPRW56347.2022.00552. Online publication date: Jun-2022.
• (2021) MultiMediate. Proceedings of the 29th ACM International Conference on Multimedia, 4878-4882. DOI: 10.1145/3474085.3479219. Online publication date: 17-Oct-2021.
