SIFT-Bag kernel for video event analysis

Published: 26 October 2008

Abstract

In this work, we present a SIFT-Bag based generative-to-discriminative framework for video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, whose distribution is described by a Gaussian Mixture Model (GMM). In the discriminative stage, the SIFT-Bag Kernel is designed to characterize the Kullback-Leibler divergence between the specialized GMMs of any two video clips, and this kernel is then used for supervised learning in two ways. On one hand, the kernel's discriminating power is further refined for centroid-based video event classification by using the Within-Class Covariance Normalization approach, which suppresses the kernel components with high variability among video clips of the same event. On the other hand, the SIFT-Bag Kernel is used in a Support Vector Machine for margin-based video event classification. Finally, the outputs of these two classifiers are fused for the final decision. Experiments on the TRECVID 2005 corpus demonstrate that the mean average precision is boosted from the best reported 38.2% in [36] to 60.4% with our new framework.
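The abstract does not give the kernel's formula, so the following is only a minimal sketch of one common way to build a KL-divergence-motivated kernel between per-clip GMMs. It assumes (these are assumptions, not details from the abstract) that each clip's "specialized" GMM is obtained by MAP-adapting only the means of a shared global GMM with diagonal covariances. Under that assumption the KL divergence between two clip GMMs is upper-bounded by half the squared distance between their normalized mean "supervectors", so a linear kernel on those supervectors is a natural choice. The function names, the `relevance` factor, and the random stand-in data are all hypothetical; the WCCN, SVM, and fusion stages are not shown.

```python
import numpy as np

def adapt_means(features, weights, means, variances, relevance=16.0):
    """MAP-adapt the means of a shared diagonal-covariance GMM to one clip.

    features:  (n, d) local descriptors of one clip (e.g. SIFT vectors)
    weights:   (K,)   mixture weights of the global GMM
    means:     (K, d) component means of the global GMM
    variances: (K, d) diagonal covariances of the global GMM
    relevance: assumed smoothing factor; larger values stay closer to the prior
    """
    # Log-density of every descriptor under every diagonal Gaussian component.
    diff = features[:, None, :] - means[None, :, :]                 # (n, K, d)
    log_prob = -0.5 * np.sum(diff**2 / variances
                             + np.log(2.0 * np.pi * variances), axis=2)
    log_post = np.log(weights) + log_prob                           # (n, K)
    log_post -= log_post.max(axis=1, keepdims=True)                 # stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                         # responsibilities

    counts = post.sum(axis=0)                                       # soft counts, (K,)
    clip_means = (post.T @ features) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]                # adaptation weight
    return alpha * clip_means + (1.0 - alpha) * means

def supervector(adapted_means, weights, variances):
    """Stack sqrt(w_k) * Sigma_k^{-1/2} * mu_k over all K components."""
    return (np.sqrt(weights)[:, None] * adapted_means / np.sqrt(variances)).ravel()

def supervector_kernel(sv_a, sv_b):
    """Linear kernel on supervectors; 0.5 * ||sv_a - sv_b||^2 bounds the KL divergence."""
    return float(sv_a @ sv_b)

# Usage with random stand-in data (real SIFT descriptors are 128-dimensional).
rng = np.random.default_rng(0)
K, d = 4, 8
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, d))
var = rng.uniform(0.5, 2.0, size=(K, d))
clip_a = rng.normal(size=(200, d))
clip_b = rng.normal(size=(150, d)) + 0.3
sv_a = supervector(adapt_means(clip_a, w, mu, var), w, var)
sv_b = supervector(adapt_means(clip_b, w, mu, var), w, var)
print(supervector_kernel(sv_a, sv_b))
```

Because the kernel is an ordinary inner product, the resulting Gram matrix can be passed to any SVM implementation that accepts precomputed kernels, and a linear transform of the supervectors (such as a within-class covariance normalization) can be applied before taking the inner product.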

References

[1]
A. Amir et al., IBM Research TRECVID-2005 Video Retrieval System, NIST TRECVID Workshop, 2005.
[2]
O. Boiman and M. Irani, Detecting irregularities in images and in video, IEEE International Conference on Computer Vision, pp. 462--469, 2005.
[3]
M. Brand, N. Oliver, and A. Pentland, Coupled Hidden Markov Models for Complex Action Recognition, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 994--999, 1997.
[4]
A. Hatch and A. Stolcke, Generalized Linear Kernels for One-Versus-All Classification: Application to Speaker Recognition, ICASSP, vol. V, pp. 585--588, 2006.
[5]
C. Chang and C. Lin, LIBSVM: A Library for Support Vector Machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm
[6]
Columbia University's Baseline Detectors for 374 LSCOM Semantic Visual Concepts, http://www.ee.columbia.edu/ln/dvmm/columbia374/.
[7]
D. Lewis, Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, Proceedings of ECML-98, 10th European Conference on Machine Learning, pp. 4--15, 1998.
[8]
P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-temporal Features, Proceedings of IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65--72, 2005.
[9]
DTO LSCOM Lexicon Definitions and Annotations, http://www.ee.columbia.edu/dvmm/lscom/.
[10]
S. Ebadollahi, L. Xie, S. Chang, and J. Smith, Visual Event Detection Using Multi-Dimensional Concept Dynamics, IEEE International Conference on Multimedia and Expo, pp. 881--884, 2006.
[11]
A. Efros, A. Berg, G. Mori, and J. Malik, Recognizing Action at a Distance, Proceedings of IEEE International Conference on Computer Vision, pp. 726--733, 2003.
[12]
K. Grauman and T. Darrell, The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, Proceedings of IEEE International Conference on Computer Vision, pp. 1458--1465, 2005.
[13]
A. Hauptmann et al., Multi-Lingual Broadcast News Retrieval, In NIST TRECVID Workshop, Gaithersburg, MD, Nov. 2006.
[14]
C. Harris and M. Stephens, A Combined Corner and Edge Detector, Alvey Vision Conference, 1988.
[15]
X. He and P. Niyogi, Locality Preserving Projections, Proceedings of the Conference on Advances in Neural Information Processing Systems, 2003.
[16]
F. Jing, M. Li, H. Zhang, and B. Zhang, An Efficient and Effective Region-based Image Retrieval Framework, IEEE Transactions on Image Processing, vol. 13, no. 5, pp. 699--709, 2004.
[17]
Y. Ke, R. Sukthankar, and M. Hebert, Efficient Visual Event Detection Using Volumetric Features, Proceedings of IEEE International Conference on Computer Vision, pp. 166--173, 2005.
[18]
I. Laptev and T. Lindeberg, Space-time Interest Points, Proceedings of IEEE International Conference on Computer Vision, pp. 432--439, 2003.
[19]
S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2169--2178, 2006.
[20]
C. Lee, C. Lin, and B. Juang, A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models, IEEE Transactions on Signal Processing, vol. 39, no. 4, pp. 806--814, 1991.
[21]
E. Levina and P. Bickel, The Earth Mover's Distance is the Mallows Distance: Some Insights from Statistics, Proceedings of IEEE International Conference on Computer Vision, pp. 251--256, 2001.
[22]
J. Liu et al., University of Central Florida at TRECVID 2006 High-Level Feature Extraction and Video Search, In NIST TRECVID Workshop, Gaithersburg, MD, Nov. 2006.
[23]
D. Lowe, Object Recognition from Local Scale-Invariant Features, Proceedings of IEEE International Conference on Computer Vision, pp. 1150--1157, 1999.
[24]
P. Moreno, P. Ho, and N. Vasconcelos, A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications, Proceedings of Neural Information Processing Systems, Dec. 2003.
[25]
M. Naphade, J. Smith, J. Tesic, S. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, Large-Scale Concept Ontology for Multimedia, IEEE Multimedia Magazine, vol. 13, no. 3, pp. 86--91, 2006.
[26]
J. Niebles, H. Wang, and L. Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial Temporal Words, British Machine Vision Conference, 2006.
[27]
N. Oliver, B. Rosario, and A. Pentland, A Bayesian Computer Vision System for Modeling Human Interactions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831--843, 2000.
[28]
P. Peursum, S. Venkatesh, G. West, and H. Bui, Object Labelling from Human Action Recognition, Proceedings of IEEE International Conference on Pervasive Computing and Communications, pp. 399--406, 2003.
[29]
J. Platt, Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, Advances in Large Margin Classifiers, 1999.
[30]
D. Reynolds, T. Quatieri, and R. Dunn, Speaker Verification using Adapted Gaussian Mixture Models. Digital Signal Processing, vol. 10, pp. 19--41, 2000.
[31]
Y. Rubner, C. Tomasi, and L. Guibas, The Earth Mover's Distance as a Metric for Image Retrieval, International Journal of Computer Vision, vol. 40, no. 2, pp. 99--121, 2000.
[32]
C. Schuldt, I. Laptev, and B. Caputo, Recognizing Human Actions: A Local SVM Approach, Proceedings of IEEE International Conference on Pattern Recognition, pp. 32--36, 2004.
[33]
A. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns and TRECVid, Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pp. 321--330, 2006.
[34]
TRECVID, http://www-nlpir.nist.gov/projects/trecvid.
[35]
D. Xu and S. Chang, Visual event recognition in news video using kernel methods with multi-level temporal alignment, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[36]
D. Xu and S. Chang, Video Event Recognition using Kernel Methods with Multi-Level Temporal Alignment, Accepted for future publication in IEEE Transactions on Pattern Analysis and Machine Intelligence.
[37]
D. Zhang, D. Perez, S. Bengio, and I. McCowan, Semi-supervised Adapted HMMs for Unusual Event Detection, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 611--618, 2005.
[38]
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, International Journal of Computer Vision, vol. 73, no. 2, pp. 213--238, 2007.
[39]
J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 1470--1477, 2003.
[40]
P. Quelhas, F. Monay, J.M. Odobez, D. Gatica-Perez, and T. Tuytelaars, A Thousand Words in a Scene, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 79--86, 2007.
[41]
A.D. Bagdanov, L. Ballan, M. Bertini, and A. Del Bimbo, Trademark Matching and Retrieval in Sports Video Databases, Proceedings of the International Workshop on Multimedia Information Retrieval, pp. 1575--1589, 2007.

Published In

MM '08: Proceedings of the 16th ACM international conference on Multimedia
October 2008
1206 pages
ISBN:9781605583037
DOI:10.1145/1459359
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. kernel design
  2. sift-bag
  3. video event recognition
  4. within-class covariation normalization

Qualifiers

  • Research-article

Conference

MM08: ACM Multimedia Conference 2008
October 26 - 31, 2008
Vancouver, British Columbia, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) Active learning for image retrieval via visual similarity metrics and semantic features. Engineering Applications of Artificial Intelligence, 138:PA. DOI: 10.1016/j.engappai.2024.109239. Online publication date: 1-Dec-2024
  • (2019) Action Parsing-Driven Video Summarization Based on Reinforcement Learning. IEEE Transactions on Circuits and Systems for Video Technology, 29:7 (2126-2137). DOI: 10.1109/TCSVT.2018.2860797. Online publication date: Jul-2019
  • (2019) Fusing depth and colour information for human action recognition. Multimedia Tools and Applications, 78:5 (5919-5939). DOI: 10.1007/s11042-018-6875-7. Online publication date: 1-Mar-2019
  • (2018) Classification of Multidimensional Time-Evolving Data Using Histograms of Grassmannian Points. IEEE Transactions on Circuits and Systems for Video Technology, 28:4 (892-905). DOI: 10.1109/TCSVT.2016.2631719. Online publication date: Apr-2018
  • (2018) Fast car Crash Detection in Video. 2018 XLIV Latin American Computer Conference (CLEI) (632-637). DOI: 10.1109/CLEI.2018.00081. Online publication date: Oct-2018
  • (2018) Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep convolutional neural networks and foreground driven concept co-occurrence matrix. Applied Intelligence, 48:8 (2047-2066). DOI: 10.1007/s10489-017-1033-x. Online publication date: 1-Aug-2018
  • (2017) Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA, 7:3 (447-508). DOI: 10.1093/imaiai/iax015. Online publication date: 22-Dec-2017
  • (2016) Category driven deep recurrent neural network for video summarization. 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (1-6). DOI: 10.1109/ICMEW.2016.7574720. Online publication date: Jul-2016
  • (2016) Human action recognition with DeepAction Kernel Gaussian Process. 2016 International Conference on Advanced Robotics and Mechatronics (ICARM) (165-170). DOI: 10.1109/ICARM.2016.7606913. Online publication date: Aug-2016
  • (2016) Modeling spatio-temporal layout with Lie Algebrized Gaussians for action recognition. Multimedia Tools and Applications, 75:17 (10335-10355). DOI: 10.1007/s11042-015-3008-4. Online publication date: 1-Sep-2016
