DOI: 10.1145/2324796.2324842
research-article

Multimodal feature generation framework for semantic image classification

Published: 05 June 2012

Abstract

The automatic attribution of semantic labels to unlabeled or weakly labeled images has received considerable attention but, given the complexity of the problem, remains a hard research topic. Here we propose a unified classification framework that combines textual and visual information seamlessly. Unlike most previous work, we use computer vision techniques as inspiration for processing textual information. To this end, we consider two complementary types of tag similarity, computed respectively from a conceptual hierarchy and from data collected from a photo-sharing platform. Visual content is processed with recent bag-of-visual-words feature-generation techniques. A central contribution of our work is to perform the coding step of the general bag-of-words framework using these similarities and to aggregate the resulting tag codes by max-pooling into a single representative vector (signature). Final image annotations are obtained via late fusion, in which the three modalities (two text-based and one visual) are merged during the classification step. Experimental results on the PASCAL VOC 2007 and MIR Flickr datasets show an improvement over state-of-the-art methods while significantly decreasing the computational complexity of the learning system.
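
The abstract names two complementary tag similarities, one computed from a conceptual hierarchy and one from photo-sharing data, but this page does not define them. The Python sketch below is a minimal illustration under stated assumptions: it uses Wu-Palmer similarity over a small hypothetical is-a tree (a stand-in for a lexical hierarchy such as WordNet) and Jaccard overlap of the photos that carry each tag (a stand-in for platform statistics). The tree `PARENT`, the photo data `PHOTOS`, and all function names are hypothetical and are not the paper's exact formulation.

```python
# Hypothetical toy "is-a" tree (child -> parent), standing in for a
# lexical hierarchy such as WordNet. Not taken from the paper.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "bus": "vehicle",
}

# Hypothetical tag sets of four photos, standing in for data collected
# from a photo-sharing platform.
PHOTOS = [
    {"dog", "park"},
    {"dog", "cat"},
    {"cat", "sofa"},
    {"car", "street"},
]


def path_to_root(node):
    """Return [node, parent, ..., root] for a node of PARENT."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted from 1 at the root."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a
    # is the lowest common subsumer (lcs) in a tree.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def cooccurrence_similarity(a, b, photos):
    """Jaccard overlap between the sets of photos tagged with a and with b."""
    with_a = {i for i, tags in enumerate(photos) if a in tags}
    with_b = {i for i, tags in enumerate(photos) if b in tags}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(round(wu_palmer("dog", "cat"), 3))                        # 0.667
    print(round(cooccurrence_similarity("dog", "cat", PHOTOS), 3))  # 0.333
```

Here "dog" and "cat" score 2·2/(3+3) ≈ 0.667 in the hierarchy because they share the parent "animal", while on the toy collection they co-occur in one of the three photos that carry either tag. The two measures can disagree, which is one reason to treat them as separate modalities.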

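The central contribution described in the abstract, coding tags with similarities and max-pooling the codes into one signature, mirrors the coding and pooling steps of a bag-of-visual-words pipeline, with tags playing the role of local descriptors and a set of concepts playing the role of the codebook. The Python sketch below assumes a hypothetical three-concept vocabulary and a hand-filled similarity table (`VOCAB`, `SIM`, and all function names are illustrative). It also assumes that late fusion is a fixed weighted average of the three per-modality classifier scores; the page does not say which combiner the paper uses, and the per-modality classifiers themselves are not shown.

```python
# Hypothetical concept vocabulary, playing the role of the codebook
# in a bag-of-visual-words pipeline.
VOCAB = ["dog", "cat", "car"]

# Hypothetical precomputed tag-to-concept similarities in [0, 1]; in
# practice these would come from a hierarchy or from tag co-occurrence.
SIM = {
    ("puppy", "dog"): 0.9, ("puppy", "cat"): 0.6, ("puppy", "car"): 0.2,
    ("road", "dog"): 0.1, ("road", "cat"): 0.1, ("road", "car"): 0.7,
}


def code_tag(tag, vocab, sim):
    """Coding step: one tag -> vector of its similarities to each concept."""
    return [sim.get((tag, concept), 0.0) for concept in vocab]


def max_pool(codes, dim):
    """Pooling step: element-wise maximum over all tag codes of an image."""
    if not codes:
        return [0.0] * dim
    return [max(column) for column in zip(*codes)]


def tag_signature(tags, vocab, sim):
    """Fixed-length signature of an image's tag set (code, then max-pool)."""
    return max_pool([code_tag(t, vocab, sim) for t in tags], len(vocab))


def late_fusion(scores, weights):
    """Weighted average of per-modality classifier scores for one class."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    print(tag_signature(["puppy", "road"], VOCAB, SIM))  # [0.9, 0.6, 0.7]
    # Hypothetical scores from two text-based and one visual classifier.
    print(late_fusion([0.8, 0.4, 0.6], [1.0, 1.0, 1.0]))
```

Max-pooling keeps, for each concept, the strongest response among the image's tags, so the signature length depends only on the vocabulary size and not on how many tags an image carries.
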
      Information & Contributors

      Published In

      ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
      June 2012
      489 pages
      ISBN:9781450313292
      DOI:10.1145/2324796
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 June 2012

      Author Tags

      1. bags of words
      2. classification
      3. image annotation
      4. multimedia fusion

      Qualifiers

      • Research-article

      Conference

      ICMR '12

      Acceptance Rates

      ICMR '12 Paper Acceptance Rate 50 of 145 submissions, 34%;
      Overall Acceptance Rate 254 of 830 submissions, 31%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 17
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 09 Mar 2025

      Cited By

      • (2023) Learning semantic ambiguities for zero-shot learning. Multimedia Tools and Applications 82:26 (40745-40759). DOI: 10.1007/s11042-023-14877-1. Online publication date: 31-Mar-2023.
      • (2023) Research on Feature Fusion Methods for Multimodal Medical Data. Computer Applications (96-114). DOI: 10.1007/978-981-99-8764-1_8. Online publication date: 14-Dec-2023.
      • (2018) A Novel GMM-Based Behavioral Modeling Approach for Smartwatch-Based Driver Authentication. Sensors 18:4 (1007). DOI: 10.3390/s18041007. Online publication date: 28-Mar-2018.
      • (2016) Integrating multiple types of features for event identification in social images. Multimedia Tools and Applications 75:6 (3301-3322). DOI: 10.1007/s11042-014-2436-x. Online publication date: 1-Mar-2016.
      • (2015) Markov random field based fusion for supervised and semi-supervised multi-modal image classification. Multimedia Tools and Applications 74:2 (613-634). DOI: 10.1007/s11042-014-2018-y. Online publication date: 1-Jan-2015.
      • (2014) A Cross-modal Multi-task Learning Framework for Image Annotation. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (431-440). DOI: 10.1145/2661829.2662023. Online publication date: 3-Nov-2014.
      • (2014) MFS-Map. Proceedings of the 29th Annual ACM Symposium on Applied Computing (945-950). DOI: 10.1145/2554850.2554868. Online publication date: 24-Mar-2014.
      • (2013) Tag completion based on belief theory and neighbor voting. Proceedings of the 3rd ACM conference on International conference on multimedia retrieval (49-56). DOI: 10.1145/2461466.2461476. Online publication date: 16-Apr-2013.
