DOI: 10.5555/2999792.2999849
Article

DeViSE: a deep visual-semantic embedding model

Published: 05 December 2013

Abstract

Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources, such as text data, both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions, achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.
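
The abstract does not spell out how a zero-shot prediction is made, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes that labels and images have already been mapped into one shared vector space, and that a prediction is the label whose vector is most similar to the image vector. Cosine similarity, the three-dimensional toy vectors, and the helper names `cosine`, `rank_labels`, and `hit_at_k` are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_labels(image_vec, label_vecs):
    """Return label names sorted from most to least similar to image_vec."""
    return sorted(
        label_vecs,
        key=lambda name: cosine(image_vec, label_vecs[name]),
        reverse=True,
    )


def hit_at_k(ranking, true_label, k):
    """True if the correct label appears among the top k ranked labels."""
    return true_label in ranking[:k]


# Hand-made toy "word vectors" (hypothetical values, not from the paper).
label_vecs = {
    "cat": [0.9, 0.1, 0.0],
    "tiger": [0.8, 0.3, 0.1],  # imagine no labeled images existed for this one
    "truck": [0.0, 0.2, 0.9],
}

# Stand-in for a visual model's output projected into the same space.
image_vec = [0.75, 0.35, 0.1]

ranking = rank_labels(image_vec, label_vecs)
print(ranking)
print(hit_at_k(ranking, "tiger", 1))
```

Because the label vectors come from text rather than from labeled images, a label such as "tiger" can be ranked even if the visual model never saw an example of it. A hit rate at k is then the fraction of test images for which `hit_at_k` is true.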

Published In

NIPS'13: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2
December 2013
3236 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

Cited By

  • (2023) Localized symbolic knowledge distillation for visual commonsense models. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 11338-11352. DOI: 10.5555/3666122.3666623. Online publication date: 10-Dec-2023.
  • (2023) VS-Boost. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 1107-1115. DOI: 10.24963/ijcai.2023/123. Online publication date: 19-Aug-2023.
  • (2023) DUET. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, pp. 405-413. DOI: 10.1609/aaai.v37i1.25114. Online publication date: 7-Feb-2023.
  • (2023) Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of Change Detection. Proceedings of the 20th International Conference on Content-based Multimedia Indexing, pp. 14-20. DOI: 10.1145/3617233.3617242. Online publication date: 20-Sep-2023.
  • (2023) Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions. Companion Publication of the 25th International Conference on Multimodal Interaction, pp. 101-104. DOI: 10.1145/3610661.3617602. Online publication date: 9-Oct-2023.
  • (2023) C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal Data. Proceedings of the 31st ACM International Conference on Multimedia, pp. 8963-8974. DOI: 10.1145/3581783.3611919. Online publication date: 26-Oct-2023.
  • (2022) Mind the gap. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 17612-17625. DOI: 10.5555/3600270.3601550. Online publication date: 28-Nov-2022.
  • (2022) I2DFormer. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 12283-12294. DOI: 10.5555/3600270.3601162. Online publication date: 28-Nov-2022.
  • (2022) Improved Feature Generating Networks for Zero-Shot Learning. Proceedings of the 2022 4th International Conference on Image, Video and Signal Processing, pp. 211-217. DOI: 10.1145/3531232.3531263. Online publication date: 18-Mar-2022.
  • (2022) Cross-Modal Retrieval between Event-Dense Text and Image. Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 229-238. DOI: 10.1145/3512527.3531374. Online publication date: 27-Jun-2022.
