An Efficient Similarity Join Algorithm with Cosine Similarity Predicate

Dongjoo Lee, Jaehui Park, Junho Shim & Sang-goo Lee

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6262)

Included in the following conference series:

International Conference on Database and Expert Systems Applications


Abstract

Given a large collection of objects, finding all pairs of similar objects, known as similarity join, is widely used to solve problems in many application domains. Computation time is a critical issue for similarity join, since it requires computing similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similarity computation; however, these algorithms filter out object pairs inefficiently, particularly when an aggregate weighted similarity function, such as cosine similarity, is used to quantify the similarity between objects. This is mostly caused by the large prefixes the algorithms select. In this paper, we propose an alternative method that selects small prefixes by exploiting the relationship between the arithmetic mean and the geometric mean of elements’ weights. A new algorithm, MMJoin, which implements the proposed method, dramatically reduces the average prefix size without much overhead and, as a result, saves considerable computation time. We demonstrate through empirical evaluation on large-scale real-world datasets that our algorithm outperforms a state-of-the-art one.
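The abstract does not spell out how MMJoin selects its prefixes, so the following is only a minimal sketch of the general prefix-filtering idea it builds on, not the authors' method. It assumes unit-normalized sparse vectors stored as Python dicts and uses a Cauchy-Schwarz bound on the suffix norm to choose each prefix (a substitute for the arithmetic-mean/geometric-mean selection described in the paper). All function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

EPS = 1e-9  # slack so floating-point rounding never shortens a prefix unsafely


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def cosine(x, y):
    """Cosine similarity of two unit-norm sparse vectors (their dot product)."""
    return sum(w * y[f] for f, w in x.items() if f in y)


def naive_join(vectors, threshold):
    """Baseline: compute the similarity of every pair."""
    return {(i, j) for i, j in combinations(range(len(vectors)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold}


def prefix(vec, threshold):
    """Shortest leading feature list whose remaining suffix has L2 norm < threshold.

    If y shares no feature with this prefix, then by Cauchy-Schwarz
    dot(vec, y) <= ||suffix|| * ||y|| < threshold, so the pair can be skipped.
    Assumes 0 < threshold <= 1 and a unit-norm input vector.
    """
    feats = sorted(vec)  # any fixed global feature order works
    suffix_sq = sum(w * w for w in vec.values())
    for k, f in enumerate(feats):
        if math.sqrt(max(suffix_sq, 0.0)) < threshold - EPS:
            return feats[:k]
        suffix_sq -= vec[f] ** 2
    return feats


def candidate_pairs(vectors, threshold):
    """Index only prefix features, then probe with every feature of each vector."""
    index = defaultdict(list)
    for i, vec in enumerate(vectors):
        for f in prefix(vec, threshold):
            index[f].append(i)
    pairs = set()
    for j, vec in enumerate(vectors):
        for f in vec:
            for i in index.get(f, ()):
                if i != j:
                    pairs.add((min(i, j), max(i, j)))
    return pairs


def prefix_join(vectors, threshold):
    """Verify only the candidate pairs that survive the prefix filter."""
    return {(i, j) for i, j in candidate_pairs(vectors, threshold)
            if cosine(vectors[i], vectors[j]) >= threshold}


# Hypothetical toy data: four small weighted "documents".
docs = [normalize(d) for d in (
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.1},
    {"c": 1.0, "d": 1.0},
    {"d": 1.0, "e": 2.0},
)]
```

On this toy input the filter verifies fewer pairs than the six the naive join computes, while returning the same result. Shorter prefixes mean smaller index lists and fewer candidates, which is the quantity the paper aims to reduce.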



Author information

Authors and Affiliations

School of Computer Science & Engineering, Seoul National University, Seoul, 151-742, Korea
Dongjoo Lee, Jaehui Park & Sang-goo Lee
Dept of Computer Science, Sookmyung Women’s University, Seoul, 140-742, Korea
Junho Shim


Editor information

Editors and Affiliations

DeustoTech Computing, University of Deusto, Avda. Universidades, 24, 48007, Bilbao, Spain
Pablo García Bringas
Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Faculty of Computer Science, Department of Distributed Systems and Multimedia Systems, University of Vienna, Liebiggasse 4/3-4, 1010, Vienna, Austria
Gerald Quirchmayr


Cite this paper

Lee, D., Park, J., Shim, J., Lee, Sg. (2010). An Efficient Similarity Join Algorithm with Cosine Similarity Predicate. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds) Database and Expert Systems Applications. DEXA 2010. Lecture Notes in Computer Science, vol 6262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15251-1_33


DOI: https://doi.org/10.1007/978-3-642-15251-1_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15250-4
Online ISBN: 978-3-642-15251-1
eBook Packages: Computer Science, Computer Science (R0)
