[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Fast set intersection in memory

Published: 01 January 2011 Publication History

Abstract

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, we will show how to compute their intersection in expected time [EQUATION], where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.

References

[1]
R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In CPM, pages 400--408, 2004.
[2]
R. A. Baeza-Yates and A. Salinger. Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. In SPIRE, pages 13--24, 2005.
[3]
J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In SODA, pages 390--399, 2002.
[4]
J. Barbay, A. López-Ortiz, and T. Lu. Faster Adaptive Set Intersections for Text Searching. In 5th WEA, pages 146--157, 2006.
[5]
J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 14:7--24, 2010.
[6]
P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages 739--750, 2007.
[7]
G. E. Blelloch and M. Reid-Miller. Fast Set Operations using Treaps. In ACM SPAA, pages 16--26, 1998.
[8]
A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426--434, 2003.
[9]
M. R. Brown and R. E. Tarjan. A Fast Merging Algorithm. Journal of the ACM, 26(2):211--226, 1979.
[10]
J. Brutlag. Speed Matters for Google Web Search. http://code.google.com/speed/files/delayexp.pdf, 2009.
[11]
E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In ALENEX, pages 179--190, 2001.
[12]
E. Demaine, A. López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, pages 743--752, 2000.
[13]
E. Demaine, A. López-Ortiz, and J. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In ALENEX, pages 91--104, 2001.
[14]
D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, 2009.
[15]
R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, pages 102--113, 2001.
[16]
F. K. Hwang and S. Lin. A Simple Algorithm for Merging Two Disjoint Linearly Ordered Sets. SIAM Journal, 1(1):31--39, 1972.
[17]
G. Linden. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html.
[18]
W. Pugh. A skip list cookbook. Technical Report UMIACS-TR-89-72.1, University of Maryland, 1990.
[19]
P. Sanders and F. Transier. Intersection in Integer Inverted Indices. In ALENEX, pages 71--83, 2007.
[20]
S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On Efficient Posting List Intersection with Multicore Processors. In ACM SIGIR, pages 738--739, 2009.
[21]
F. Transier and P. Sanders. Compressed inverted indexes for in-memory search engines. In ALENEX, pages 3--12, 2008.
[22]
D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1):838--849, 2009.
[23]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes - Compressing and Indexing Documents and Images. Morgan Kaufman Publishers, 1999.

Cited By

View all
  • (2024)Chimera: A System Design of Dual Storage and Traversal-Join Unified Query Processing for SQL/PGQProceedings of the VLDB Endowment10.14778/3705829.370584518:2(279-292)Online publication date: 1-Oct-2024
  • (2024)HERO: A Hierarchical Set Partitioning and Join Framework for Speeding up the Set Intersection Over GraphsProceedings of the ACM on Management of Data10.1145/36392842:1(1-25)Online publication date: 26-Mar-2024
  • (2024)BigSet: An Efficient Set Intersection ApproachIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.343259536:12(7677-7691)Online publication date: 1-Dec-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 4, Issue 4
January 2011
59 pages

Publisher

VLDB Endowment

Publication History

Published: 01 January 2011
Published in PVLDB Volume 4, Issue 4

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Chimera: A System Design of Dual Storage and Traversal-Join Unified Query Processing for SQL/PGQProceedings of the VLDB Endowment10.14778/3705829.370584518:2(279-292)Online publication date: 1-Oct-2024
  • (2024)HERO: A Hierarchical Set Partitioning and Join Framework for Speeding up the Set Intersection Over GraphsProceedings of the ACM on Management of Data10.1145/36392842:1(1-25)Online publication date: 26-Mar-2024
  • (2024)BigSet: An Efficient Set Intersection ApproachIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.343259536:12(7677-7691)Online publication date: 1-Dec-2024
  • (2023)An Index for Set Intersection With Post-FilteringIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.332914536:7(2862-2876)Online publication date: 1-Nov-2023
  • (2023)Reduction of LiDAR Point Cloud Maps for Localization of Resource-Constrained Robotic SystemsIEEE Systems Journal10.1109/JSYST.2022.316292617:1(916-927)Online publication date: Mar-2023
  • (2022)A separation logic for negative dependenceProceedings of the ACM on Programming Languages10.1145/34987196:POPL(1-29)Online publication date: 12-Jan-2022
  • (2022)Exploiting Reuse for GPU Subgraph EnumerationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.303556434:9(4231-4244)Online publication date: 1-Sep-2022
  • (2022)Efficient Regular Expression Matching Based on Positional Inverted IndexIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.299229534:3(1133-1148)Online publication date: 1-Mar-2022
  • (2022)Efficient $k-\text{clique}$ Listing with Set Intersection Speedup2022 IEEE 38th International Conference on Data Engineering (ICDE)10.1109/ICDE53745.2022.00192(1955-1968)Online publication date: May-2022
  • (2022)Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA2022 IEEE High Performance Extreme Computing Conference (HPEC)10.1109/HPEC55821.2022.9926326(1-10)Online publication date: 19-Sep-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media