[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2882903.2882930acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index

Published: 26 June 2016 Publication History

Abstract

Due to the "curse of dimensionality" problem, it is very expensive to process the nearest neighbor (NN) query in high-dimensional spaces; and hence, approximate approaches, such as Locality-Sensitive Hashing (LSH), are widely used for their theoretical guarantees and empirical performance. Current LSH-based approaches target at the L1 and L2 spaces, while as shown in previous work, the fractional distance metrics (Lp metrics with 0 < p < 1) can provide more insightful results than the usual L1 and L2 metrics for data mining and multimedia applications. However, none of the existing work can support multiple fractional distance metrics using one index. In this paper, we propose LazyLSH that answers approximate nearest neighbor queries for multiple Lp metrics with theoretical guarantees. Different from previous LSH approaches which need to build one dedicated index for every query space, LazyLSH uses a single base index to support the computations in multiple Lp spaces, significantly reducing the maintenance overhead. Extensive experiments show that LazyLSH provides more accurate results for approximate kNN search under fractional distance metrics.

References

[1]
C. Aggarwal, A. Hinneburg, and D. Keim. On the surprising behavior of distance metrics in high dimensional space. In ICDT, volume 1973, pages 420--434. 2001.
[2]
N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In STOC, pages 557--563, 2006.
[3]
A. Andoni and P. Indyk. Efficient algorithms for substring near neighbor problem. In SODA, pages 1203--1212, 2006.
[4]
A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. COMMUNICATIONS OF THE ACM, 51(1):117--122, 2008.
[5]
S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In SODA, pages 271--280, 1993.
[6]
N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The r*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322--331, 1990.
[7]
K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In ICDT, pages 217--235. 1999.
[8]
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In SIGKDD, pages 39--48, 2003.
[9]
K. Binder and D. Heermann. Monte Carlo simulation in statistical physics: an introduction. Springer Science & Business Media, 2010.
[10]
A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21--29, 1997.
[11]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, pages 1157--1166, 1997.
[12]
A. M. Bronstein, M. M. Bronstein, A. M. Bruckstein, and R. Kimmel. Partial similarity of objects, or how to compare a centaur to a horse. International Journal of Computer Vision, 84(2):163--183, 2008.
[13]
G. Calafiore, F. Dabbene, and R. Tempo. Uniform sample generation in lp balls for probabilistic robustness analysis. In CDC, volume 3, pages 3335--3340, 1998.
[14]
M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, 2002.
[15]
J. G. Cleary. Analysis of an algorithm for finding nearest neighbors in euclidean space. ACM Trans. Math. Softw., 5(2):183--192, 1979.
[16]
G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan. Fast mining of massive tabular data via approximate distance computations. In ICDE, pages 605--614, 2002.
[17]
T. Cover and P. Hart. Nearest neighbor pattern classification. Information Theory, IEEE Trans. on, 13(1):21--27, 1967.
[18]
M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253--262, 2004.
[19]
M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid. Evaluation of gist descriptors for web-scale image search. In CIVR, pages 19:1--19:8, 2009.
[20]
D. Francois, V. Wertz, and M. Verleysen. The concentration of fractional distances. TKDE, 19(7):873--886, 2007.
[21]
J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, pages 541--552, 2012.
[22]
J. Gao, H. V. Jagadish, W. Lu, and B. C. Ooi. Dsh: Data sensitive hashing for high-dimensional k-nn search. In SIGMOD, pages 1127--1138, 2014.
[23]
A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47--57, 1984.
[24]
A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In VLDB, pages 506--515, 2000.
[25]
P. Howarth and S. Rüger. Fractional distance measures for content-based image retrieval. In ECIR, pages 447--456, 2005.
[26]
P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604--613, 1998.
[27]
H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, pages 304--317, 2008.
[28]
L. J. Latecki, R. Lakaemper, and D. Wolter. Optimal partial shape similarity. Image and Vision Computing, 23(2):227 -- 236, 2005.
[29]
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998.
[30]
Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen. Sk-lsh: An efficient index structure for approximate nearest neighbor search. PVLDB., 7(9):745--756, 2014.
[31]
D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91--110, 2004.
[32]
Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe lsh: efficient indexing for high-dimensional similarity search. In VLDB, pages 950--961, 2007.
[33]
M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2, 2009.
[34]
V. Pestov. Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions. Algorithmica, 66:310--328, 2013.
[35]
D. Ravichandran, P. Pantel, and E. Hovy. Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun clustering. In ACL, pages 622--629, 2005.
[36]
N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71--79, 1995.
[37]
B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77(1--3):157--173, 2008.
[38]
H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., 2005.
[39]
B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285--295, 2001.
[40]
V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430--441, 2012.
[41]
T. K. Sellis, N. Roussopoulos, and C. Faloutsos. The r+-tree: A dynamic index for multi-dimensional objects. In VLDB, pages 507--518, 1987.
[42]
E. W. Stacy. A generalization of the gamma distribution. The Annals of Mathematical Statistics, pages 1187--1192, 1962.
[43]
Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. Srs: Solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. PVLDB, 8(1):1--12, 2014.
[44]
X. Tan, S. Chen, Z.-H. Zhou, and J. Liu. Face recognition under occlusions and variant expressions with partial similarity. Information Forensics and Security, IEEE Transactions on, 4(2):217--230, 2009.
[45]
Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst., 35(3):20:1--20:46, 2010.
[46]
A. K. H. Tung, R. Zhang, N. Koudas, and B. C. Ooi. Similarity search: A matching based approach. In VLDB, pages 631--642, 2006.
[47]
H. Wang, S. Zhang, W. Liang, F. Wang, and Y. Yao. Content-based image retrieval using fractional distance metric. In IASP, pages 1--5, 2012.
[48]
J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485--3492, 2010.
[49]
M.-L. Zhang and Z.-H. Zhou. Ml-knn: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038 -- 2048, 2007.
[50]
J. Zhou and A. K. Tung. Smiler: A semi-lazy time series prediction system for sensors. In SIGMOD, pages 1871--1886, 2015.
[51]
J. Zhou, A. K. Tung, W. Wu, and W. S. Ng. A "semi-lazy" approach to probabilistic path prediction in dynamic environments. In SIGKDD, pages 748--756, 2013.
[52]
X. Zhu and A. B. Goldberg. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1--130, 2009.

Cited By

View all
  • (2025)Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor SearchProceedings of the ACM on Management of Data10.1145/37097293:1(1-29)Online publication date: 11-Feb-2025
  • (2024)Fast vector query processing for large datasets beyond GPU memory with reordered pipeliningProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691827(23-40)Online publication date: 16-Apr-2024
  • (2024)Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language ModelsProceedings of the VLDB Endowment10.14778/3696435.369643918:1(42-52)Online publication date: 1-Sep-2024
  • Show More Cited By

Index Terms

  1. LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN:9781450335317
    DOI:10.1145/2882903
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 June 2016

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. LP metrics
    2. locality sensitive hashing
    3. nearest neighbor search

    Qualifiers

    • Research-article

    Funding Sources

    • National Basic Research Program of China 973
    • NUS-ZJU Sensor-Enhanced socialMedia (SeSaMe) Centre
    • NUS FRC Grant

    Conference

    SIGMOD/PODS'16
    Sponsor:
    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    California, San Francisco, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)32
    • Downloads (Last 6 weeks)7
    Reflects downloads up to 02 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor SearchProceedings of the ACM on Management of Data10.1145/37097293:1(1-29)Online publication date: 11-Feb-2025
    • (2024)Fast vector query processing for large datasets beyond GPU memory with reordered pipeliningProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691827(23-40)Online publication date: 16-Apr-2024
    • (2024)Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language ModelsProceedings of the VLDB Endowment10.14778/3696435.369643918:1(42-52)Online publication date: 1-Sep-2024
    • (2024)DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor SearchProceedings of the VLDB Endowment10.14778/3665844.366585417:9(2241-2254)Online publication date: 1-May-2024
    • (2024)DB-LSH 2.0: Locality-Sensitive Hashing With Query-Based Dynamic BucketingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.329583136:3(1000-1015)Online publication date: Mar-2024
    • (2023)FARGO: Fast Maximum Inner Product Search via Global Multi-ProbingProceedings of the VLDB Endowment10.14778/3579075.357908416:5(1100-1112)Online publication date: 1-Jan-2023
    • (2023)Efficient Approximate Nearest Neighbor Search in Multi-dimensional DatabasesProceedings of the ACM on Management of Data10.1145/35889081:1(1-27)Online publication date: 30-May-2023
    • (2023)Co-design Hardware and Algorithm for Vector SearchProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607045(1-15)Online publication date: 12-Nov-2023
    • (2023)Density-Based Clustering by Means of Bridge Point IdentificationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.323231535:11(11274-11287)Online publication date: 1-Nov-2023
    • (2023)Cover Trees Revisited: Exploiting Unused Distance and Direction InformationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.323178135:11(11231-11245)Online publication date: 1-Nov-2023
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media