Abstract
Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries. It provides novel optimality guarantees that exhibit good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that multiple real-world data sets from mass spectrometry, natural language process, and computer vision exhibit a certain form of data skewness and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.
Similar content being viewed by others
Notes
There is no need to include pairs with zero values in the list.
This optimization is equally applicable to the TA’s problem: Scan only the lists that correspond to dimensions that actually affect the function F.
Notice, the unit vector constraint enables inference about the collective weight of the unseen coordinates of a vector.
The inner product threshold problem is the special case where fi(s[i]) = qi ⋅s[i].
[d] is the set \(\{1, \dots , d\}\)
Recall that Li[0] = 1.
We denote by fi(Li) the list \([f_{i}(L_{i}[0]), f_{i}(L_{i}[1]), \dots ]\) for every Li.
where each fi of the decomposable function F is the identity function
The linear scan is feasible for join because the naive approach of join is quadratic.
References
Aebersold, R., Mann, M.: Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016)
Ahle, T.D., Pagh, R., Razenshteyn, I., Silvestri, F.: On the complexity of inner product similarity join. In: PODS, pp 151–164 (2016)
Akbarinia, R., Pacitti, E., Valduriez, P.: Best position algorithms for top-k queries. In: VLDB, pp 495–506 (2007)
Anastasiu, D.C., Karypis, G.: L2AP: Fast cosine similarity search with prefix L-2 norm bounds. In: ICDE, pp 784–795 (2014)
Anastasiu, D.C., Karypis, G.: PL2AP: Fast parallel cosine similarity search. In: IA3, pp 8:1–8:8 (2015)
Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., Schmidt, L.: Practical and optimal lsh for angular distance. In: NIPS, pp 1225–1233 (2015)
André, F., Kermarrec, A.-M., Scouarnec, N.L.: Cache locality is not enough: High-performance nearest neighbor search with product quantization fast scan. PVLDB 9(4), 288–299 (2015)
Arora, A., Sinha, S., Kumar, P., Bhattacharya, A.: HD-Index: Pushing the scalability-accuracy boundary for approximate knn search in high-dimensional spaces. PVLDB 11(8), 906–919 (2018)
Bast, H., Majumdar, D., Schenkel, R., Theobald, M., Weikum, G.: Io-top-k: Index-access optimized top-k query processing. In: VLDB, pp 475–486 (2006)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp 131–140 (2007)
Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: ICML, pp 97–104 (2006)
Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. CSUR 33(3), 322–373 (2001)
Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press, Cambridge (2004)
Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: CIKM, pp 426–434 (2003)
Bruno, N., Gravano, L., Marian, A.: Evaluating top-k queries over web-accessible databases. In: ICDE, pp 369–380 (2002)
Chakrabarti, K., Chaudhuri, S., Ganti, V.: Interval-based pruning for top-k processing over compressed lists. In: ICDE, pp 709–720 (2011)
Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. PVLDB 10(10), 1058–1069 (2017)
Craig, R., Cortens, J.C, Fenyo, D., Beavis, R.C.: Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 5 (8), 1843–1849 (2006)
Cui, B., Zhao, J., Cong, G.: ISIS: A new approach for efficient similarity search in sparse databases. In: DASFAA, pp 231–245 (2010)
Curtin, R.R., Gray, A.G., Ram, P.: Fast exact max-kernel search. In: SDM, pp 1–9 (2013)
Dasari, S., Chambers, M.C., Martinez, M.A., Carpenter, K.L., Ham, A.-J.L., Vega-Montoto, L.J., Tabb, D.L.: Pepitome: Evaluating improved spectral library search for identification complementarity and quality assessment. J. Proteome Res. 11(3), 1686–95 (2012)
De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational Geometry: Introduction. Springer, Berlin (2008)
Deshpande, P.M., Deepak, P., Kummamuru, K.: Efficient online top-k retrieval with arbitrary similarity measures. In: EDBT, pp 356–367 (2008)
Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: SIGIR, pp 993–1002 (2011)
Doc2Vec. https://radimrehurek.com/gensim/models/doc2vec.html
Dong, W., Moses, C., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: WWW, pp 577–586 (2011)
Dutta, D., Chen, T.: Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search. Bioinformatics 23(5), 612–618 (2007)
Eghbali, S., Tahvildari, L.: Cosine similarity search with multi index hashing. arXiv:1610.00574
Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5(11), 976–989 (1994)
Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS, pp 102–113 (2001)
Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. JCSS 66(4), 614–656 (2003)
Fraccaro, M., Paquet, U., Winther, O.: Indexable probabilistic matrix factorization for maximum inner product search. In: AAAI, pp 1554–1560 (2016)
Fu, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with navigating spreading-out graphs. arXiv:1707.00143 (2017)
Graham, R.L.: An efficient algorithm for determining the convex hull of a finite planar set. Info. Pro. Lett. 1, 132–133 (1972)
Güntzer, U., Balke, W.-T., Kiebling, W.: Optimizing multi-feature queries for image databases. In: VLDB, pp 419–428 (2000)
Houle, M.E., Nett, M.: Rank-based similarity search: Reducing the dimensional dependence. PAMI 37(1), 136–150 (2015)
Hristidis, V., Koudas, N., Papakonstantinou, Y: PREFER: A system for the efficient execution of multi-parametric ranked queries. In: SIGMOD, pp 259–270 (2001)
Hu, X., Tao, Y., Yi, K.: Output-optimal parallel algorithms for similarity joins. In: PODS, pp 79–90 (2017)
Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. CSUR 40(4), 1–58 (2008)
Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: ICDT, pp 604–613 (1998)
Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)
Jin, W., Patel, J.M.: Efficient and generic evaluation of ranked queries. In: SIGMOD, pp 601–612 (2011)
Johnson, W.B., Lindenstrauss, J., Schechtman, G.: Extensions of lipschitz maps into banach spaces. Israel J. Math. 54(2), 129–138 (1986)
Keivani, O., Sinha, K., Ram, P.: Improved maximum inner product search with better theoretical guarantees. In: IJCNN, pp 2927–2934 (2017)
Kong, A.T., Leprevost, F.V., Avtonomov, D.M., Mellacheruvu, D., Nesvizhskii, A.I.: Msfragger: Ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017)
Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Traces and Emergence of Nonlinear Programming, pp 247–258 (2014)
Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: ICCV, pp 2130–2137 (2009)
Lam, H., Deutsch, E.W., Eddes, J.S., Eng, J.K., King, N., Stein, S.E., Aebersold, R.: Development and validation of a spectral library searching method for peptide identification from ms/ms. Proteomics 7(5) (2007)
Learned-Miller, E., Huang, G.B., RoyChowdhury, A., Li, H., Hua, G.: Labeled faces in the wild: A survey. In: Advances in face detection and facial image analysis, pp 189–248. Springer (2016)
Lee, J., Cho, H., Hwang, S.-W.: Efficient dual-resolution layer indexing for top-k queries. In: ICDE, pp 1084–1095 (2012)
Lempitsky, V.: The inverted multi-index. In: CVPR, pp 3069–3076 (2012)
Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G.: Clustering for approximate similarity search in high-dimensional spaces. TKDE 14(4), 792–808 (2002)
Li, H, Chan, T.N., Yiu, M.L., Mamoulis, N.: Fexipro: Fast and exact inner product retrieval in recommender systems. In: SIGMOD, pp 835–850 (2017)
Li, W., Deng, L., Li, Y., Li, C.: Zigzag: Supporting similarity queries on vector space models. In: SIGMOD (2018)
Li, Y., Wang, J., Pullman, B., Bandeira, N., Papakonstantinou, Y.: Index-Based High-Dimensional, Cosine Threshold Querying with Optimality Guarantees. In: ICDT, vol. 127, pp 11:1–11:20 (2019)
Lian, X., Chen, L.: General cost models for evaluating dimensionality reduction in high-dimensional spaces. TKDE 21(10), 1447–1460 (2009)
Liaw, Y.-C., Leou, M.-L., Wu, C.-M.: Fast exact k nearest neighbors search using an orthogonal search tree. PR 43(6), 2351–2358 (2010)
Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. PAMI 36(11), 2227–2240 (2014)
Mussmann, S., Ermon, S.: Learning and inference via maximum inner product search. In: ICML, pp 2587–2596 (2016)
Qin, J., Wang, Y., Xiao, C., Wang, W., Lin, X., Ishikawa, Y.: GPH: Similarity search in hamming space. In: ICDE (2018)
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)
Ram, P., Gray, A.G.: Maximum inner-product search using cone trees. In: SIGKDD, pp 931–939 (2012)
Ramaswamy, S., Rose, K.: Adaptive cluster distance bounding for high-dimensional indexing. TKDE 23(6), 815–830 (2011)
Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San Francisco (2005)
Savage, J.E.: Models of computation–exploring the power of computing (1998)
Shen, F., Liu, W., Zhang, S., Yang, Y., Shen, H.T.: Learning binary codes for maximum inner product search. In: ICCV, pp 4148–4156 (2015)
Silpa-Anan, C., Hartley, R.: Optimised kd-trees for fast image descriptor matching. In: CVPR, pp 1–8 (2008)
Tang, W.H., Halpern, B.R., Shilov, I.V., Seymour, S.L., Keating, S.P., Loboda, A., Patel, A.A., Schaeffer, D.A., Nuwaysir, L.M.: Discovering known and unanticipated protein modifications using ms/ms database searching. Anal. Chem. 77(13), 3931–3946 (2005)
Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Quality and efficiency in high dimensional nearest neighbor search. In: SIGMOD, pp 563–576 (2009)
Teflioudi, C., Gemulla, R.: Exact and approximate maximum inner product search with lemp. TODS 42(1), 5:1–5:49 (2016)
Teflioudi, C., Gemulla, R., Mykytiuk, O.: Lemp: Fast retrieval of large entries in a matrix product. In: SIGMOD, pp 107–122 (2015)
The Booking.com Dataset. https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe
Wang, J., Pérez-Santiago, J., Katz, J.E., Mallick, P., Bandeira, N.: Peptide identification from mixture tandem mass spectra. MCP 9(7), 1476–1485 (2010)
Wang, J.: Query Processing of Sorted Lists on Modern Hardware. University of California, San Diego (2019)
Wang, J., Lin, C., He, R., Chae, M., Papakonstantinou, Y., Swanson, S.: MILC: Inverted list compression in memory. PVLDB 10(8), 853–864 (2017)
Wang, M., Bandeira, N.: Spectral library generating function for assessing spectrum-spectrum match significance. J. Proteome Res. 12 (9), 3944–3951 (2013)
Wang, M., Carver, J.J., Bandeira, N.: Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34(8), 828–837 (2016)
Wang, Y., Shrivastava, A., Wang, J., Ryu, J.: Randomized algorithms accelerated over cpu-gpu for ultra-high dimensional similarity search. In: SIGMOD (2018)
Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, pp 194–205 (1998)
Wu, Y., Jin, R., Zhang, X.: Fast and unified local search for random walk based k-nearest-neighbor query in large graphs. In: SIGMOD, pp 1139–1150 (2014)
Xin, D., Han, J., Chang, K.C.: Progressive and selective merge: Computing top-k with ad-hoc ranking functions. In: SIGMOD, pp 103–114 (2007)
Yates, J.R., Morgan, S.F., Gatlin, C.L., Griffin, P.R., Eng, J.K.: Method to compare collision-induced dissociation spectra of peptides: Potential for library searching and subtractive analysis. Anal. Chem. 70(17), 3557–3565 (1998)
Yu, A., Agarwal, P.K., Yang, J.: Top-k preferences in high dimensions. In: ICDE, pp 748–759 (2014)
Zhang, S., Sun, C., He, Z.: Listmerge: Accelerating top-k aggregation queries over large number of lists. In: DASFAA, pp 67–81 (2016)
Zhang, Z., Hwang, S.-W., Chang, K.C.-C., Wang, M., Lang, C.A., Chang, Y.-C.: Boolean + ranking: Querying a database by k-constrained optimization. In: SIGMOD, pp 359–370 (2006)
Acknowledgments
We thank the anonymous reviewers of ICDT and ToCS for the very constructive and helpful comments. We are also grateful to Victor Vianu who helped us improve the presentation of the paper. This work was supported in part by the National Science Foundation (NSF) under awards BIGDATA 1447943 and ABI 1759980, and by the National Institutes of Health (NIH) under awards P41GM103484 and R24GM127667.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Database Theory (ICDT 2019)
Guest Editor: Pablo Baceló
Rights and permissions
About this article
Cite this article
Li, Y., Wang, J., Pullman, B. et al. Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees. Theory Comput Syst 65, 42–83 (2021). https://doi.org/10.1007/s00224-020-10009-6
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00224-020-10009-6