Abstract
New applications that exploit the huge volume of data in community photo collections emerge every day. Most focus on popular subsets, e.g., images that contain landmarks or are associated with Wikipedia articles. In this work we address the problem of accurately finding the location where a photo was taken without any metadata, that is, solely from its visual content. We also recognize landmarks where applicable, automatically linking them to Wikipedia. We show that the time is right for automating the geo-tagging process, and we show how this can work at large scale. In doing so, we exploit the redundancy of content in popular locations but, unlike most existing solutions, we do not restrict ourselves to landmarks. In other words, we can compactly represent the visual content of the thousands of images depicting, e.g., the Parthenon and still retrieve any single, isolated, non-landmark image, such as a house or graffiti on a wall. Starting from an existing, geo-tagged dataset, we cluster images into sets of different views of the same scene. This is a very efficient, scalable, and fully automated mining process. We then align all views in a set to one reference image and construct a 2D scene map. Our indexing scheme operates directly on scene maps. We evaluate our solution on a challenging dataset of one million urban images and provide public access to our service through our online application, VIRaL.
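As a rough illustration of aligning views to a reference image and building a 2D scene map, the following minimal sketch projects the features of each view into the reference frame and merges duplicates. It is not the authors' implementation. It assumes that each view comes with a 3x3 homography to the reference image, that features are points labelled with a visual word id, and that two features are duplicates when they share a word and fall in the same grid cell. The names `project`, `build_scene_map`, and `cell` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Homography = List[List[float]]          # 3x3 matrix, row-major
Feature = Tuple[Point, int]             # (position, visual word id)
View = Tuple[Homography, List[Feature]]


def project(H: Homography, p: Point) -> Point:
    """Map a point into the reference frame through homography H."""
    x, y = p
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def build_scene_map(views: List[View], cell: float = 4.0) -> Dict[Tuple[int, int, int], Point]:
    """Collect the features of all aligned views in one 2D map.

    Assumption (not from the paper): features with the same visual word
    that land in the same grid cell are treated as one scene feature,
    and the first one seen is kept.
    """
    scene_map: Dict[Tuple[int, int, int], Point] = {}
    for H, features in views:
        for point, word in features:
            u, v = project(H, point)
            key = (word, int(u // cell), int(v // cell))
            scene_map.setdefault(key, (u, v))
    return scene_map


# Toy data: the reference view uses the identity; the second view is
# shifted 10 pixels to the left of the reference.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
views = [
    (identity, [((5.0, 5.0), 7), ((20.0, 8.0), 3)]),
    (shift, [((10.0, 8.0), 3), ((30.0, 1.0), 9)]),
]
scene_map = build_scene_map(views)
print(len(scene_map))  # 3: the repeated feature with word 3 is merged
```

In this toy example the feature with word 3 appears in both views at the same scene position, so the map stores it once. This is the kind of redundancy a compact representation of a popular scene would remove.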
Notes
We shall use the terms photo, image and view interchangeably in the following.
Photo titles and user tags are the ones provided by users at the Flickr website.
We have published the dataset online at http://image.ntua.gr/iva/datasets/ec1m/.
Acknowledgements
This work was partially supported by the European Commission under contract FP7-215453 WeKnowIt.
Cite this article
Kalantidis, Y., Tolias, G., Avrithis, Y. et al. VIRaL: Visual Image Retrieval and Localization. Multimed Tools Appl 51, 555–592 (2011). https://doi.org/10.1007/s11042-010-0651-7
DOI: https://doi.org/10.1007/s11042-010-0651-7