research-article

SIFT-Bag kernel for video event analysis

Authors:

Xiaodan Zhuang,

Mark Hasegawa-Johnson,

Thomas S. Huang

MM '08: Proceedings of the 16th ACM international conference on Multimedia

Pages 229 - 238

https://doi.org/10.1145/1459359.1459391

Published: 26 October 2008

Abstract

In this work, we present a SIFT-Bag based generative-to-discriminative framework for video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, whose distribution is described by a Gaussian Mixture Model (GMM). In the discriminative stage, the SIFT-Bag Kernel is designed to characterize the Kullback-Leibler divergence between the specialized GMMs of any two video clips, and this kernel is then used for supervised learning in two ways. First, the kernel's discriminating power is refined for centroid-based video event classification using Within-Class Covariance Normalization, which suppresses the kernel components that vary strongly among video clips of the same event. Second, the SIFT-Bag Kernel is used in a Support Vector Machine for margin-based video event classification. Finally, the outputs of these two classifiers are fused for the final decision. Experiments on the TRECVID 2005 corpus demonstrate that the mean average precision is boosted from the best reported 38.2% in [36] to 60.4% with our new framework.
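
The abstract gives no formulas, so the following is only a minimal sketch of how a kernel between two bags of descriptors could be built from per-clip GMMs. It assumes the supervector construction familiar from speaker verification (compare [30]): a shared global GMM with diagonal covariances whose means alone are adapted to each clip. The function names, the relevance factor `tau`, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def map_adapt_means(X, weights, means, variances, tau=10.0):
    """Adapt the means of a shared diagonal-covariance GMM to one bag X.

    X: (n, d) descriptors of one clip; weights: (K,); means, variances: (K, d).
    tau is a hypothetical relevance factor, not a value from the paper.
    """
    diff = X[:, None, :] - means[None, :, :]                      # (n, K, d)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                        # (n, K)
    log_post -= log_post.max(axis=1, keepdims=True)               # stabilise exp
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                       # responsibilities
    n_k = post.sum(axis=0)                                        # soft counts
    data_means = (post.T @ X) / np.maximum(n_k, 1e-12)[:, None]   # (K, d)
    alpha = (n_k / (n_k + tau))[:, None]
    # Interpolate between the clip's own statistics and the global means.
    return alpha * data_means + (1.0 - alpha) * means


def supervector(adapted_means, weights, variances):
    """Stack sqrt(w_k) * Sigma_k^{-1/2} * mu_k over all components."""
    return (np.sqrt(weights)[:, None] * adapted_means / np.sqrt(variances)).ravel()


def bag_kernel(sv_a, sv_b):
    """Linear kernel between the supervectors of two clips."""
    return float(sv_a @ sv_b)


# Toy data standing in for SIFT descriptors (8-D here to keep the example small).
rng = np.random.default_rng(0)
K, d = 4, 8
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, d))
variances = np.ones((K, d))
clip_a = rng.normal(size=(200, d))
clip_b = rng.normal(loc=0.5, size=(150, d))

mu_a = map_adapt_means(clip_a, weights, means, variances)
mu_b = map_adapt_means(clip_b, weights, means, variances)
sv_a = supervector(mu_a, weights, variances)
sv_b = supervector(mu_b, weights, variances)
print(bag_kernel(sv_a, sv_b))
```

Under these assumptions (shared weights and covariances, adapted means only), half the squared Euclidean distance between two supervectors equals the usual component-wise upper bound on the Kullback-Leibler divergence between the two adapted GMMs, which is one way a linear kernel on supervectors can be said to reflect that divergence.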

References

[1]

A. Amir et al., IBM Research TRECVID-2005 Video Retrieval System, NIST TRECVID Workshop, 2005.

[2]

O. Boiman and M. Irani, Detecting irregularities in images and in video, IEEE International Conference on Computer Vision, pp. 462--469, 2005.

[3]

M. Brand, N. Oliver, and A. Pentland, Coupled Hidden Markov Models for Complex Action Recognition, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 994--999, 1997.

[4]

A. Hatch and A. Stolcke, Generalized Linear Kernels for One-Versus-All Classification: Application to Speaker Recognition, ICASSP, vol. V, pp. 585--588, 2006.

[5]

C. Chang and C. Lin, LIBSVM: A Library for Support Vector Machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm

[6]

Columbia University's Baseline Detectors for 374 LSCOM Semantic Visual Concepts, http://www.ee.columbia.edu/ln/dvmm/columbia374/.

[7]

D. Lewis, Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, Proceedings of ECML-98, 10th European Conference on Machine Learning, pp. 4--15, 1998.

[8]

P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-temporal Features, Proceedings of IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65--72, 2005.

[9]

DTO LSCOM Lexicon Definitions and Annotations, http://www.ee.columbia.edu/dvmm/lscom/.

[10]

S. Ebadollahi, L. Xie, S. Chang, and J. Smith, Visual Event Detection Using Multi-Dimensional Concept Dynamics, IEEE International Conference on Multimedia and Expo, pp. 881--884, 2006.

[11]

A. Efros, A. Berg, G. Mori, and J. Malik, Recognizing Action at a Distance, Proceedings of IEEE International Conference on Computer Vision, pp. 726--733, 2003.

[12]

K. Grauman and T. Darrell, The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, Proceedings of IEEE International Conference on Computer Vision, pp. 1458--1465, 2005.

[13]

A. Hauptmann et al., Multi-Lingual Broadcast News Retrieval, In NIST TRECVID Workshop, Gaithersburg, MD, Nov. 2006.

[14]

C. Harris and M. Stephens, A Combined Corner and Edge Detector, Alvey Vision Conference, 1988.

[15]

X. He and P. Niyogi, Locality Preserving Projections, Proceedings of the Conference on Advances in Neural Information Processing Systems, 2003.

[16]

F. Jing, M. Li, H. Zhang, and B. Zhang, An Efficient and Effective Region-based Image Retrieval Framework, IEEE Transactions on Image Processing, vol. 13, no. 5, pp. 699--709, 2004.

[17]

Y. Ke, R. Sukthankar, and M. Hebert, Efficient Visual Event Detection Using Volumetric Features, Proceedings of IEEE International Conference on Computer Vision, pp. 166--173, 2005.

[18]

I. Laptev and T. Lindeberg, Space-time Interest Points, Proceedings of IEEE International Conference on Computer Vision, pp. 432--439, 2003.

[19]

S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2169--2178, 2006.

[20]

C. Lee, C. Lin, and B. Juang, A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models, IEEE Transactions on Signal Processing, vol. 39, no. 4, pp. 806--814, 1991.

[21]

E. Levina and P. Bickel, The Earth Mover's Distance is the Mallows Distance: Some Insights from Statistics, Proceedings of IEEE International Conference on Computer Vision, pp. 251--256, 2001.

[22]

J. Liu et al., University of Central Florida at TRECVID 2006 High-Level Feature Extraction and Video Search, In NIST TRECVID Workshop, Gaithersburg, MD, Nov. 2006.

[23]

D. Lowe, Object Recognition from Local Scale-Invariant Features, Proceedings of IEEE International Conference on Computer Vision, pp. 1150--1157, 1999.

[24]

P. Moreno, P. Ho, and N. Vasconcelos, A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications, Proceedings of Neural Information Processing Systems, Dec. 2003.

[25]

M. Naphade, J. Smith, J. Tesic, S. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, Large-Scale Concept Ontology for Multimedia, IEEE Multimedia Magazine, vol. 13, no. 3, pp. 86--91, 2006.

[26]

J. Niebles, H. Wang, and L. Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial Temporal Words, British Machine Vision Conference, 2006.

[27]

N. Oliver, B. Rosario, and A. Pentland, A Bayesian Computer Vision System for Modeling Human Interactions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831--843, 2000.

[28]

P. Peursum, S. Venkatesh, G. West, and H. Bui, Object Labelling from Human Action Recognition, Proceedings of IEEE International Conference on Pervasive Computing and Communications, pp. 399--406, 2003.

[29]

J. Platt, Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, Advances in Large Margin Classifiers, 1999.

[30]

D. Reynolds, T. Quatieri, and R. Dunn, Speaker Verification using Adapted Gaussian Mixture Models. Digital Signal Processing, vol. 10, pp. 19--41, 2000.

[31]

Y. Rubner, C. Tomasi, and L. Guibas, The Earth Mover's Distance as a Metric for Image Retrieval, International Journal of Computer Vision, vol. 40, no. 2, pp. 99--121, 2000.

[32]

C. Schuldt, I. Laptev, and B. Caputo, Recognizing Human Actions: A Local SVM Approach, Proceedings of IEEE International Conference on Pattern Recognition, pp. 32--36, 2004.

[33]

A. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns and TRECVid, Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pp. 321--330, 2006.

[34]

TRECVID, http://www-nlpir.nist.gov/projects/trecvid.

[35]

D. Xu and S. Chang, Visual event recognition in news video using kernel methods with multi-level temporal alignment, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[36]

D. Xu and S. Chang, Video Event Recognition using Kernel Methods with Multi-Level Temporal Alignment, Accepted for future publication in IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]

D. Zhang, D. Perez, S. Bengio, and I. McCowan, Semi-supervised Adapted HMMs for Unusual Event Detection, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 611--618, 2005.

[38]

J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, International Journal of Computer Vision, vol. 73, no. 2, pp. 213--238, 2007.

[39]

J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 1470--1477, 2003.

[40]

P. Quelhas, F. Monay, J.M. Odobez, D. Gatica-Perez, and T. Tuytelaars, A Thousand Words in a Scene, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 79--86, 2007.

[41]

A.D. Bagdanov, L. Ballan, M. Bertini, and A. Del Bimbo, Trademark Matching and Retrieval in Sports Video Databases, Proceedings of the International Workshop on Multimedia Information Retrieval, pp. 1575--1589, 2007.

Cited By

Casado-Coscolla A, Sanchez-Belenguer C, Wolfart E, Angorrilla-Bustamante C, Sequeira V (2024) Active learning for image retrieval via visual similarity metrics and semantic features. Engineering Applications of Artificial Intelligence 138:PA. Online publication date: 1-Dec-2024
https://dl.acm.org/doi/10.1016/j.engappai.2024.109239
Lei J, Luan Q, Song X, Liu X, Tao D, Song M (2019) Action Parsing-Driven Video Summarization Based on Reinforcement Learning. IEEE Transactions on Circuits and Systems for Video Technology 29:7 (2126-2137). Online publication date: Jul-2019
https://doi.org/10.1109/TCSVT.2018.2860797
Avola D, Bernardi M, Foresti G (2019) Fusing depth and colour information for human action recognition. Multimedia Tools and Applications 78:5 (5919-5939). Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s11042-018-6875-7
Index Terms

SIFT-Bag kernel for video event analysis
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks

Information & Contributors

Information

Published In

MM '08: Proceedings of the 16th ACM international conference on Multimedia

October 2008

1206 pages

ISBN:9781605583037

DOI:10.1145/1459359

General Chairs: Abdulmotaleb El Saddik (University of Ottawa), Son Vuong (University of British Columbia)

Program Chairs: Carsten Griwodz (University of Oslo), Alberto Del Bimbo (Università degli Studi di Firenze), K. Selcuk Candan (Arizona State University), Alejandro Jaimes (Telefonica R&D, Madrid, Spain)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MM08

MM08: ACM Multimedia Conference 2008

October 26 - 31, 2008

Vancouver, British Columbia, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 80
Total Downloads: 763
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 2

Reflects downloads up to 02 Mar 2025

Citations

Cited By

Dimitropoulos K, Barmpoutis P, Kitsikidis A, Grammalidis N (2018) Classification of Multidimensional Time-Evolving Data Using Histograms of Grassmannian Points. IEEE Transactions on Circuits and Systems for Video Technology 28:4 (892-905). Online publication date: Apr-2018
https://doi.org/10.1109/TCSVT.2016.2631719
Machaca Arceda V, Laura Riveros E (2018) Fast car Crash Detection in Video. 2018 XLIV Latin American Computer Conference (CLEI) (632-637). Online publication date: Oct-2018
https://doi.org/10.1109/CLEI.2018.00081
Janwe N, Bhoyar K (2018) Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep convolutional neural networks and foreground driven concept co-occurrence matrix. Applied Intelligence 48:8 (2047-2066). Online publication date: 1-Aug-2018
https://dl.acm.org/doi/10.1007/s10489-017-1033-x
Keriven N, Bourrier A, Gribonval R, Pérez P (2017) Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA 7:3 (447-508). Online publication date: 22-Dec-2017
https://doi.org/10.1093/imaiai/iax015
Xinhui Song, Ke Chen, Jie Lei, Li Sun, Zhiyuan Wang, Lei Xie, Mingli Song (2016) Category driven deep recurrent neural network for video summarization. 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (1-6). Online publication date: Jul-2016
https://doi.org/10.1109/ICMEW.2016.7574720
Wang Y, Li L, Qiao Y (2016) Human action recognition with DeepAction Kernel Gaussian Process. 2016 International Conference on Advanced Robotics and Mechatronics (ICARM) (165-170). Online publication date: Aug-2016
https://doi.org/10.1109/ICARM.2016.7606913
Chen M, Gong L, Wang T, Liu F, Feng Q (2016) Modeling spatio-temporal layout with Lie Algebrized Gaussians for action recognition. Multimedia Tools and Applications 75:17 (10335-10355). Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1007/s11042-015-3008-4