Article

Audio-video array source separation for perceptual user interfaces

Authors:

Kevin Wilson,

Neal Checka,

David Demirdjian,

Trevor Darrell

PUI '01: Proceedings of the 2001 workshop on Perceptive user interfaces

Pages 1 - 7

https://doi.org/10.1145/971478.971500

Published: 15 November 2001

Abstract

Steerable microphone arrays provide a flexible infrastructure for audio source separation. In order for them to be used effectively in perceptual user interfaces, there must be a mechanism in place for steering the focus of the array to the sound source. Audio-only steering techniques often perform poorly in the presence of multiple sound sources or strong reverberation. Video-only techniques can achieve high spatial precision but require that the audio and video subsystems be accurately calibrated to preserve this precision. We present an audio-video localization technique that combines the benefits of the two modalities. We implement our technique in a test environment containing multiple stereo cameras and a room-sized microphone array. Our technique achieves an 8.9 dB improvement over a single far-field microphone and a 6.7 dB improvement over source separation based on video-only localization.
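The abstract does not say which beamforming algorithm the array uses. As a rough illustration of what "steering the focus of the array" to a known location can mean, the sketch below uses textbook delay-and-sum beamforming, which is a swapped-in technique and not necessarily the authors' method. The names `steering_delays`, `delay_and_sum`, and `SPEED_OF_SOUND`, the whole-sample rounding of delays, and the coordinate convention (metres) are all assumptions made for this example; the focus point stands in for whatever position a localization step would supply.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(mic_positions, focus_point, sample_rate):
    """Per-microphone delay, in whole samples, relative to the closest microphone."""
    distances = [math.dist(mic, focus_point) for mic in mic_positions]
    nearest = min(distances)
    # Sound reaches farther microphones later; convert the extra path to samples.
    return [round((d - nearest) / SPEED_OF_SOUND * sample_rate) for d in distances]


def delay_and_sum(signals, delays):
    """Advance each channel by its delay, then average the aligned channels."""
    # Only keep the samples for which every shifted channel still has data.
    n_out = min(len(sig) - d for sig, d in zip(signals, delays))
    return [
        sum(sig[d + i] for sig, d in zip(signals, delays)) / len(signals)
        for i in range(n_out)
    ]


if __name__ == "__main__":
    # Hypothetical three-microphone layout and focus point, in metres.
    mics = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
    focus = (0.5, 0.5, 1.0)
    delays = steering_delays(mics, focus, sample_rate=16000)
    print("steering delays (samples):", delays)
```

Sound from the focus point adds coherently after alignment, while sound from other locations is misaligned and partially cancels in the average. Because the delays depend directly on the assumed source position, an error in that position translates into misaligned channels, which is why the accuracy of the localization step matters for the separation result.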

Index Terms

Audio-video array source separation for perceptual user interfaces
1. Applied computing

Information & Contributors

Information

Published In

PUI '01: Proceedings of the 2001 workshop on Perceptive user interfaces

November 2001

241 pages

ISBN:9781450374736

DOI:10.1145/971478

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

PUI01

PUI01: Workshop on Perceptive User Interfaces

November 15 - 16, 2001

Orlando, Florida, USA
