Abstract
In nearest neighbor searching we are given a set of n data points in real d-dimensional space, ℜd, and the problem is to preprocess these points into a data structure, so that given a query point, the nearest data point to the query point can be reported efficiently. Because data sets can be quite large, we are interested in data structures that use optimal O(dn) storage.
In this paper we consider a novel approach to nearest neighbor searching, in which the search returns the correct nearest neighbor with a given probability assuming that the queries are drawn from some known distribution. The query distribution is represented by providing a set of training query points at preprocessing time.
The data structure, called the overlapped split tree, is an augmented BSP-tree in which each node is associated with a cover region, which is used to determine whether the search should visit this node. We use principal component analysis and support vector machines to analyze the structure of the data and training points in order to better adapt the tree structure to the data sets. We show empirically that this new approach provides improved predictability over the kd-tree in average query performance.
The support of the National Science Foundation under grant CCR-9712379 is gratefully acknowledged.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proc. 4th ACM-SIAM Sympos. Discrete Algorithms, pages 271–280, 1993.
S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. In Proc. 5th ACM-SIAM Sympos. Discrete Algorithms, pages 573–582, 1994.
S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891–923, 1998.
N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Conf. on Management of Data, pages 322–331, 1990.
S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proc. 22nd VLDB Conference, pages 28–39, 1996.
M. Bern. Approximate closest-point queries in high dimensions. Inform. Process. Lett., 45:95–99, 1993.
C. J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
T. Chan. Approximate nearest neighbor queries revisited. In Proc. 13th Annu. ACM Sympos. Comput. Geom., pages 352–358, 1997.
C. Chang and C. Lin. LIBSVM: Introduction and benchmarks. LIBSVM can be obtained from URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm, 1999.
P. Ciaccia and M. Patella. Using the distance distribution for approximate similarity queries in high-dimensional metric spaces. In Proc. 10th Workshop Database and Expert Systems Applications, pages 200–205, 1999.
K. L. Clarkson. An algorithm for approximate closest-point queries. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 160–164, 1994.
S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10:57–78, 1993.
T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, 13:57–67, 1967.
M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, 1997.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci., 41(6):391–407, 1990.
L. Devroye and T. J. Wagner. Nearest neighbor methods in discrimination. In P. R. Krishnaiah and L. N. Kanal, editors, Handbook of Statistics, volume 2. North-Holland, 1982.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, NY, 1973.
C. Duncan, M. Goodrich, and S. Kobourov. Balanced aspect ratio trees: Combining the advantages of k-d trees and octrees. In Proc. 10th ACM-SIAM Sympos. Discrete Algorithms, pages 300–309, 1999.
Christos Faloutsos and Ibrahim Kamel. Beyond uniformity and independence: Analysis of r-trees using the concept of fractal dimension. In Proc. Annu. ACM Sympos. Principles Database Syst., pages 4–13, 1994.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: The QBIC system. IEEE Computer, 28:23–32, 1995.
J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software, 3(3):209–226, 1977.
K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic, Boston, 1992.
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput., pages 604–613, 1998.
N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proc. ACM SIGMOD Conf. on Management of Data, pages 369–380, 1997.
J. M. Kleinberg. Two algorithms for nearest-neighbor search in high dimension. In Proc. 29th Annu. ACM Sympos. Theory Comput., pages 599–608, 1997.
E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimemsional spaces. In Proc. 30th Annu. ACM Sympos. Theory Comput., pages 614–623, 1998.
K. I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517–542, 1994.
D. M. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching. Center for Geometric Computing 2nd Annual Fall Workshop on Computational Geometry, URL: http://www.cs.umd.edu/~mount/ANN, 1997.
F. Murtagh. PC A (principal components analysis): C program. PC A program can be obtained from URL: http://astro.u-strasbg.fr/~fmurtagh/mda-sw/pca.c, 1989.
D. Saupe, 1994. Private communication.
T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In Proc. 13th VLDB Conference, pages 507–517, 1987.
V. Vapnik. Statistical Learning Theory. John Wiley & Sons, NY, 1998.
K. Zatloukal, M. H. Johnson, and R. Ladner. Nearest neighbor search for data compression. (Presented at the 6th DIMACS Implementation Challenge Workshop), URL: http://www.cs.washington.edu/homes/ladner/nns.ps, 1999.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maneewongvatana, S., Mount, D.M. (2001). An Empirical Study of a New Approach to Nearest Neighbor Searching. In: Buchsbaum, A.L., Snoeyink, J. (eds) Algorithm Engineering and Experimentation. ALENEX 2001. Lecture Notes in Computer Science, vol 2153. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44808-X_14
Download citation
DOI: https://doi.org/10.1007/3-540-44808-X_14
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42560-1
Online ISBN: 978-3-540-44808-2
eBook Packages: Springer Book Archive