research-article

Multimodal feature generation framework for semantic image classification

Authors: Adrian Popescu, Hervé Le Borgne, Céline Hudelot

ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval

Article No.: 38, Pages 1 - 8

https://doi.org/10.1145/2324796.2324842

Published: 05 June 2012

Abstract

The automatic attribution of semantic labels to unlabeled or weakly labeled images has received considerable attention but, given the complexity of the problem, remains a hard research topic. Here we propose a unified classification framework that combines textual and visual information seamlessly. Unlike most recent work, we use computer vision techniques as inspiration for processing textual information. To do so, we consider two complementary types of tag similarity, computed respectively from a conceptual hierarchy and from data collected from a photo-sharing platform. Visual content is processed using recent techniques for bag-of-visual-words feature generation. A central contribution of our work is to perform the coding step of the general bag-of-words framework with these similarities and to aggregate the resulting tag codes by max-pooling into a single representative vector (signature). Final image annotations are obtained via late fusion, in which the three modalities (two text-based and one visual) are merged during the classification step. Experimental results on the Pascal VOC 2007 and MIR Flickr datasets show an improvement over state-of-the-art methods, while significantly decreasing the computational complexity of the learning system.
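The coding, max-pooling, and late-fusion steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tag_signature`, `late_fusion`, `toy_similarity`), the three-concept codebook, and the similarity values are all hypothetical, and the abstract does not specify the fusion rule, so the weighted average used here is only an assumption. Similarities are assumed to lie in [0, 1].

```python
from typing import Callable, List, Optional, Sequence


def tag_signature(
    image_tags: Sequence[str],
    codebook: Sequence[str],
    similarity: Callable[[str, str], float],
) -> List[float]:
    """Code each tag against the codebook, then max-pool the codes.

    Coding: a tag becomes a vector of similarities to every codebook entry.
    Pooling: for each codebook entry, keep the maximum over all tags.
    Assumes similarities are non-negative, so an image without tags
    yields an all-zero signature.
    """
    signature = [0.0] * len(codebook)
    for tag in image_tags:
        code = [similarity(tag, concept) for concept in codebook]
        signature = [max(s, c) for s, c in zip(signature, code)]
    return signature


def late_fusion(
    scores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Merge per-modality classifier scores by a weighted average.

    The averaging rule is an assumption made for illustration only.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores) or sum(weights) <= 0:
        raise ValueError("weights must match scores and sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical similarity table standing in for a hierarchy-based or
# photo-sharing-based tag similarity.
TOY_SIMILARITY = {
    ("puppy", "dog"): 0.9,
    ("puppy", "car"): 0.1,
    ("road", "car"): 0.6,
    ("road", "tree"): 0.2,
}


def toy_similarity(tag: str, concept: str) -> float:
    """Look up a toy similarity; identical strings score 1.0, unknown pairs 0.0."""
    if tag == concept:
        return 1.0
    return TOY_SIMILARITY.get((tag, concept), 0.0)


codebook = ["dog", "car", "tree"]
signature = tag_signature(["puppy", "road"], codebook, toy_similarity)
print(signature)  # [0.9, 0.6, 0.2]

# One score per modality: two text-based signatures and one visual descriptor.
print(late_fusion([0.8, 0.6, 0.4]))
```

In this sketch each modality would produce its own signature and its own classifier score, and only the scores are combined, which is what distinguishes late fusion from concatenating the features before training.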

Index Terms

Multimodal feature generation framework for semantic image classification
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
      2. Computer vision tasks
        Scene understanding

Published In

ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval

June 2012

489 pages

ISBN:9781450313292

DOI:10.1145/2324796

Conference Chairs: Horace H. S. Ip (City University of Hong Kong), Yong Rui (Microsoft, China)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICMR '12: International Conference on Multimedia Retrieval

June 5 - 8, 2012

Hong Kong, China

Sponsor: SIGMM

Acceptance Rates

ICMR '12 Paper Acceptance Rate 50 of 145 submissions, 34%;

Overall Acceptance Rate 254 of 830 submissions, 31%

Cited By

Hanouti C, Le Borgne H (2023). Learning semantic ambiguities for zero-shot learning. Multimedia Tools and Applications, 82:26 (40745-40759). Online publication date: 31-Mar-2023.
https://doi.org/10.1007/s11042-023-14877-1

Xu Z, Yang X, Jin Y, Chen S (2023). Research on Feature Fusion Methods for Multimodal Medical Data. Computer Applications (96-114). Online publication date: 14-Dec-2023.
https://doi.org/10.1007/978-981-99-8764-1_8

Yang C, Chang C, Liang D (2018). A Novel GMM-Based Behavioral Modeling Approach for Smartwatch-Based Driver Authentication. Sensors, 18:4 (1007). Online publication date: 28-Mar-2018.
https://doi.org/10.3390/s18041007

Zhang X, Li Z, Lv X, Chen X (2016). Integrating multiple types of features for event identification in social images. Multimedia Tools and Applications, 75:6 (3301-3322). Online publication date: 1-Mar-2016.
https://dl.acm.org/doi/10.1007/s11042-014-2436-x

Xie L, Pan P, Lu Y (2015). Markov random field based fusion for supervised and semi-supervised multi-modal image classification. Multimedia Tools and Applications, 74:2 (613-634). Online publication date: 1-Jan-2015.
https://dl.acm.org/doi/10.1007/s11042-014-2018-y

Xie L, Pan P, Lu Y, Wang S, Li J, Wang X, Garofalakis M, Soboroff I, Suel T, Wang M (2014). A Cross-modal Multi-task Learning Framework for Image Annotation. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (431-440). Online publication date: 3-Nov-2014.
https://dl.acm.org/doi/10.1145/2661829.2662023

Costa A, Traina A, Traina C, Cho Y, Shin S, Kim S, Hung C, Hong J (2014). MFS-Map. Proceedings of the 29th Annual ACM Symposium on Applied Computing (945-950). Online publication date: 24-Mar-2014.
https://dl.acm.org/doi/10.1145/2554850.2554868

Znaidia A, Le Borgne H, Hudelot C, Jain R, Prabhakaran B, Worring M, Smith J, Chua T (2013). Tag completion based on belief theory and neighbor voting. Proceedings of the 3rd ACM conference on International conference on multimedia retrieval (49-56). Online publication date: 16-Apr-2013.
https://dl.acm.org/doi/10.1145/2461466.2461476
