
A Graphical Model for Audiovisual Object Tracking

Published: 01 July 2003

Abstract

We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real-world scenarios using off-the-shelf equipment.
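
As a rough illustration of what "Bayesian inference of the object location" from two microphones and a camera can look like, the sketch below fuses one audio cue and one video cue on a grid of candidate horizontal positions. It is not the model from the paper. It assumes, purely for illustration, that the inter-microphone time delay is a noisy linear function of position and that the camera gives a noisy direct reading of position. All names and parameter values (`delay_slope`, `delay_offset`, `delay_std`, `pixel_std`, the grid range) are hypothetical, and the calibration parameters are fixed by hand here rather than learned with EM.

```python
import math


def gaussian_pdf(value, mean, std):
    """Density of a one-dimensional Gaussian evaluated at `value`."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def location_posterior(grid, observed_delay, observed_pixel,
                       delay_slope, delay_offset, delay_std, pixel_std):
    """Posterior over candidate locations under a uniform prior.

    Assumed (hypothetical) observation model:
      audio: delay ~ Normal(delay_slope * x + delay_offset, delay_std)
      video: pixel ~ Normal(x, pixel_std)
    The two cues are treated as conditionally independent given x.
    """
    weights = []
    for x in grid:
        audio_lik = gaussian_pdf(observed_delay,
                                 delay_slope * x + delay_offset, delay_std)
        video_lik = gaussian_pdf(observed_pixel, x, pixel_std)
        weights.append(audio_lik * video_lik)
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1


# Candidate horizontal positions, in made-up pixel units.
grid = list(range(0, 101))

posterior = location_posterior(
    grid,
    observed_delay=0.2,    # made-up time-delay measurement
    observed_pixel=64.0,   # made-up visual position measurement
    delay_slope=0.01,      # hand-picked calibration, not learned
    delay_offset=-0.5,
    delay_std=0.05,
    pixel_std=10.0,
)

# Most probable location on the grid (maximum a posteriori estimate).
map_location = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(map_location)
```

With these made-up numbers the audio cue alone points to a position near 70 and the video cue to 64. The fused estimate lands between the two and closer to the audio cue, because the audio likelihood is the narrower one when expressed in position units.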

Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence  Volume 25, Issue 7
July 2003
144 pages

Publisher

IEEE Computer Society

United States

Publication History

Published: 01 July 2003

Author Tags

  1. audio
  2. Bayesian inference
  3. audiovisual
  4. automatic calibration
  5. cameras
  6. expectation-maximization (EM) algorithm
  7. generative models
  8. graphical models
  9. microphone arrays
  10. multimedia
  11. multimodal
  12. probabilistic inference
  13. speaker modeling
  14. speech
  15. tracking
  16. variational methods
  17. video
  18. vision

Qualifiers

  • Research-article

Cited By

  • (2023) WebUAV-3M: A Benchmark for Unveiling the Power of Million-Scale Deep UAV Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:7 (9186-9205). DOI: 10.1109/TPAMI.2022.3232854. Online publication date: 1-Jul-2023
  • (2019) Multi-Speaker Tracking From an Audio–Visual Sensing Device. IEEE Transactions on Multimedia 21:10 (2576-2588). DOI: 10.1109/TMM.2019.2902489. Online publication date: 1-Oct-2019
  • (2018) 3D Mouth Tracking from a Compact Microphone Array Co-Located with a Camera. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (3071-3075). DOI: 10.1109/ICASSP.2018.8461323. Online publication date: 15-Apr-2018
  • (2017) Large-Scale Video Classification with Elastic Streaming Sequential Data Processing System. Proceedings of the Workshop on Large-Scale Video Classification Challenge (1-7). DOI: 10.1145/3134263.3134264. Online publication date: 27-Oct-2017
  • (2016) Audio Surveillance. ACM Computing Surveys 48:4 (1-46). DOI: 10.1145/2871183. Online publication date: 22-Feb-2016
  • (2015) Vision-guided robot hearing. International Journal of Robotics Research 34:4-5 (437-456). DOI: 10.1177/0278364914548050. Online publication date: 1-Apr-2015
  • (2015) Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering. IEEE Transactions on Multimedia 17:2 (186-200). DOI: 10.1109/TMM.2014.2377515. Online publication date: 1-Feb-2015
  • (2014) What's Making that Sound? Proceedings of the 22nd ACM international conference on Multimedia (147-156). DOI: 10.1145/2647868.2654936. Online publication date: 3-Nov-2014
  • (2014) Joint Audio-Visual Words for Violent Scenes Detection in Movies. Proceedings of International Conference on Multimedia Retrieval (483-486). DOI: 10.1145/2578726.2578799. Online publication date: 1-Apr-2014
  • (2014) Discovering joint audio-visual codewords for video event detection. Machine Vision and Applications 25:1 (33-47). DOI: 10.1007/s00138-013-0567-0. Online publication date: 1-Jan-2014