Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks

Published: 02 October 2018

Abstract

Convolutional neural networks (CNNs) are employed to estimate the visual focus of attention (VFoA), also called gaze direction, in multiparty face-to-face meetings on the basis of multimodal nonverbal behaviors, including head pose, eyeball direction, and the presence/absence of utterances. To reveal the potential of CNNs, we focus on aspects of multimodal and multiparty fusion, including individual/group models, early/late fusion, and robustness when using inputs from image-based trackers. In contrast to the individual model, which separately targets each person at a specific seat, the group model jointly estimates the gaze directions of all participants. Experiments confirmed that the group model outperformed the individual model, especially in predicting listeners' VFoA when the inputs did not include eyeball directions. This result indicates that the group CNN model can implicitly learn underlying conversation structures, e.g., that listeners' gazes converge on the speaker. When the eyeball direction feature is available, both models outperformed the Bayes models used for comparison; in this case, the individual model was superior to the group model, particularly in estimating the speaker's VFoA. Moreover, among the group models, two-stage late fusion, which integrates each participant's individual features first and multiparty features second, outperformed the other structures. Furthermore, our experiments confirmed that image-based tracking can provide a level of performance comparable to that of sensor-based measurements. Overall, the results suggest that CNNs are a promising approach to VFoA estimation.
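As a concrete illustration of the two-stage late fusion described above, the following sketch shows one way such a group model could be structured. It is a minimal, hypothetical PyTorch implementation written for this summary; the framework choice, the TwoStageGroupModel class, the layer sizes, the participant count, and the per-person feature layout (head pose, eyeball direction, utterance) are all illustrative assumptions, not the authors' published architecture.

import torch
import torch.nn as nn

N_PARTICIPANTS = 4     # assumed meeting size; the paper studies multiparty meetings
N_FEATURES = 6         # assumed per-person features: head pose (3), eyeball direction (2), utterance (1)
WINDOW = 32            # assumed number of time steps in one input window
N_TARGETS = N_PARTICIPANTS  # per person: the (N-1) other participants + gaze aversion

class TwoStageGroupModel(nn.Module):
    """Hypothetical group model with two-stage late fusion."""
    def __init__(self):
        super().__init__()
        # Stage 1: one 1-D CNN branch per participant fuses that person's own
        # multimodal features across the time window (individual fusion).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(N_FEATURES, 32, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Conv1d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),      # -> (batch, 64, 1)
            ) for _ in range(N_PARTICIPANTS)
        ])
        # Stage 2: the per-person representations are concatenated and fused
        # across participants (multiparty fusion).
        self.fusion = nn.Sequential(nn.Linear(64 * N_PARTICIPANTS, 128), nn.ReLU())
        # One classification head per participant, so all VFoA labels are
        # predicted jointly from the shared multiparty representation.
        self.heads = nn.ModuleList(
            [nn.Linear(128, N_TARGETS) for _ in range(N_PARTICIPANTS)])

    def forward(self, x):
        # x: (batch, N_PARTICIPANTS, N_FEATURES, WINDOW)
        per_person = [self.branches[i](x[:, i]).squeeze(-1)
                      for i in range(N_PARTICIPANTS)]
        fused = self.fusion(torch.cat(per_person, dim=1))
        return [head(fused) for head in self.heads]   # one logit vector per person

model = TwoStageGroupModel()
logits = model(torch.randn(8, N_PARTICIPANTS, N_FEATURES, WINDOW))
print([t.shape for t in logits])   # four tensors of shape (8, N_TARGETS)

The point of the sketch is the ordering of fusion: each branch first integrates one participant's own multimodal features over a time window, and only afterwards are the per-person representations combined, so the joint heads can exploit multiparty regularities such as listeners' gazes converging on the speaker.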





      Published In

      ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
      October 2018
      687 pages
      ISBN:9781450356923
      DOI:10.1145/3242969

      Sponsors

• SIGCHI: Special Interest Group on Computer-Human Interaction of the ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. convolutional neural networks
      2. deep learning
      3. gaze
      4. meeting analysis
      5. multimodal fusion
      6. visual focus of attention

      Qualifiers

      • Research-article

      Conference

      ICMI '18
      Sponsor:
      • SIGCHI

      Acceptance Rates

ICMI '18 Paper Acceptance Rate: 63 of 149 submissions, 42%
Overall Acceptance Rate: 453 of 1,080 submissions, 42%


      Cited By

• (2024) Exploring Interlocutor Gaze Interactions in Conversations based on Functional Spectrum Analysis. Proceedings of the 26th International Conference on Multimodal Interaction, 86-94. DOI: 10.1145/3678957.3685708. Online publication date: 4-Nov-2024.
• (2024) Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 11390-11396. DOI: 10.1145/3664647.3688987. Online publication date: 28-Oct-2024.
• (2024) Improving collaborative problem-solving skills via automated feedback and scaffolding: a quasi-experimental study with CPSCoach 2.0. User Modeling and User-Adapted Interaction, 34(4), 1087-1125. DOI: 10.1007/s11257-023-09387-6. Online publication date: 14-Feb-2024.
• (2023) The AI4Autism Project: A Multimodal and Interdisciplinary Approach to Autism Diagnosis and Stratification. Companion Publication of the 25th International Conference on Multimodal Interaction, 414-425. DOI: 10.1145/3610661.3616239. Online publication date: 9-Oct-2023.
• (2023) Analyzing and Recognizing Interlocutors' Gaze Functions from Multimodal Nonverbal Cues. Proceedings of the 25th International Conference on Multimodal Interaction, 33-41. DOI: 10.1145/3577190.3614152. Online publication date: 9-Oct-2023.
• (2023) WiFiTuned: Monitoring Engagement in Online Participation by Harmonizing WiFi and Audio. Proceedings of the 25th International Conference on Multimodal Interaction, 670-678. DOI: 10.1145/3577190.3614108. Online publication date: 9-Oct-2023.
• (2023) Instructor-in-the-Loop Exploratory Analytics to Support Group Work. LAK23: 13th International Learning Analytics and Knowledge Conference, 284-292. DOI: 10.1145/3576050.3576093. Online publication date: 13-Mar-2023.
• (2022) TA-CNN. Proceedings of the 30th ACM International Conference on Multimedia, 7099-7103. DOI: 10.1145/3503161.3551587. Online publication date: 10-Oct-2022.
• (2022) A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 5037-5046. DOI: 10.1109/CVPRW56347.2022.00552. Online publication date: Jun-2022.
• (2021) MultiMediate. Proceedings of the 29th ACM International Conference on Multimedia, 4878-4882. DOI: 10.1145/3474085.3479219. Online publication date: 17-Oct-2021.
