research-article

A Graphical Model for Audiovisual Object Tracking

Authors: Matthew J. Beal, Hagai Attias

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 25, Issue 7

Pages 828 - 836

https://doi.org/10.1109/TPAMI.2003.1206512

Published: 01 July 2003

Abstract

We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real-world scenarios using off-the-shelf equipment.
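
The abstract does not spell out the model, so the following is only a toy illustration of the general idea of Bayesian inference over an unobserved variable from two microphone signals. It is not the authors' algorithm. It assumes a hypothetical generative model in which the second microphone receives the first signal shifted by an integer delay `tau` plus Gaussian noise, places a uniform prior on `tau`, and computes the posterior by normalizing the likelihoods. The function name `delay_posterior`, the noise level, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def delay_posterior(x1, x2, max_delay, noise_std):
    """Posterior over an integer delay tau, assuming x2[n] = x1[n - tau] + Gaussian noise
    and a uniform prior over tau in [-max_delay, max_delay] (illustrative model only)."""
    delays = list(range(-max_delay, max_delay + 1))
    log_liks = []
    for tau in delays:
        # Use the same sample window for every tau so the likelihoods are comparable.
        sq_err = sum(
            (x2[n] - x1[n - tau]) ** 2
            for n in range(max_delay, len(x1) - max_delay)
        )
        log_liks.append(-sq_err / (2.0 * noise_std ** 2))
    # Normalize in the log domain to avoid numerical underflow.
    peak = max(log_liks)
    weights = [math.exp(ll - peak) for ll in log_liks]
    total = sum(weights)
    return {tau: w / total for tau, w in zip(delays, weights)}


# Synthetic data (hypothetical): a white-noise source reaches microphone 2 three samples late.
rng = random.Random(0)
true_delay = 3
source = [rng.gauss(0.0, 1.0) for _ in range(200)]
x1 = [s + rng.gauss(0.0, 0.1) for s in source]
x2 = [(source[n - true_delay] if n >= true_delay else 0.0) + rng.gauss(0.0, 0.1)
      for n in range(len(source))]

posterior = delay_posterior(x1, x2, max_delay=5, noise_std=0.2)
map_delay = max(posterior, key=posterior.get)  # most probable delay under the toy model
print(map_delay)
```

In a fuller model of the kind the abstract describes, such a hidden audio variable would be tied to the object's location in the video frame, and the noise and calibration parameters would be learned by EM rather than fixed by hand as they are here.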

Index Terms

A Graphical Model for Audiovisual Object Tracking
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
  2. Machine learning
    1. Learning paradigms
    2. Machine learning approaches
      1. Markov decision processes
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Markov decision processes

Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence Volume 25, Issue 7

July 2003

144 pages

ISSN:0162-8828

Copyright © 2003 IEEE. All Rights Reserved.

Publisher

IEEE Computer Society

United States
