[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article
Free access

Fast and scalable inequality joins

Published: 01 February 2017 Publication History

Abstract

Inequality joins, which is to join relations with inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research ranging from efficient join algorithms such as sort-merge join, to the use of efficient indices such as $$B^+$$B+-tree, $$R^*$$Rź-tree and Bitmap. However, inequality joins have received little attention and queries containing such joins are notably very slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins which is then used to optimize multiple predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show our solution is more scalable and several orders of magnitude faster.

References

[1]
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading (1995)
[2]
Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: EDBT, pp. 99---110 (2010)
[3]
Agrawal, D., Chawla, S., Elmagarmid, A.K., Ouzzani, Z.K.M., Papotti, P., Quiané-Ruiz, J., Tang, N., Zaki, M.J.: Road to freedom in big data analytics. In: EDBT, pp. 479---484 (2016)
[4]
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: relational data processing in spark. In: SIGMOD, pp. 1383---1394 (2015)
[5]
Bender, M.A., Hu, H.: An adaptive packed-memory array. TODS 32(4) 26:1---26:43 (2007)
[6]
Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: ICDE, pp. 746---755 (2007)
[7]
Böhm, C., Klump, G., Kriegel, H.-P.: XZ-Ordering: A space-filling curve for objects with spatial extension. In: SSD, pp. 75---90 (1999)
[8]
Chan, C.-Y., Ioannidis, Y. E.: Bitmap index design and evaluation. In: SIGMOD, pp. 355---366 (1998)
[9]
Chan, C.-Y., Ioannidis, Y.E.: An efficient bitmap encoding scheme for selection queries. In: SIGMOD, pp. 215---226 (1999)
[10]
Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE, pp. 458---469 (2013)
[11]
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F. Ouzzani, M., Tang, N.: NADEEF: a commodity data cleaning system. In: SIGMOD (2013)
[12]
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51(1), 107---113 (2008)
[13]
DeWitt, D.J., Naughton, J.F., Schneider, D.A.: An evaluation of non-equijoin algorithms. In: VLDB, pp. 443---452 (1991)
[14]
Dittrich, J., Quiané-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 515---529 (2010)
[15]
Ebaid, A., Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J., Tang, N., Yin, S.: NADEEF: a generalized data cleaning system. PVLDB 6(12), 1218---1221 (2013)
[16]
Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J., Tang, N., Yin, S.: NADEEF/ER: generic and interactive entity resolution. In: SIGMOD, pp. 1071---1074 (2014)
[17]
Enderle, J., Hampel, M., Seidl, T.: Joining interval data in relational databases. In: SIGMOD, pp. 683---694 (2004)
[18]
Gao, D., Jensen, C.S., Snodgrass, R.T., Soo, M.D.: Join operations in temporal databases. VLDB J. 14(1), 2---29 (2005)
[19]
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems. Pearson Education (2009)
[20]
Govindaraju, N.K., Gray, J., Kumar, R., Manocha, D.: GPUTeraSort: high performance graphics co-processor sorting for large database management. In: SIGMOD, pp. 325---336 (2006)
[21]
Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD, pp. 47---57 (1984)
[22]
Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: VLDB, pp. 562---573 (1995)
[23]
Kemper, A., Kossmann, D., Wiesner, C.: Generalised hash teams for join and group-by. In: VLDB, pp. 30---41 (1999)
[24]
Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Yin, S.: BigDansing: a system for big data cleansing. In: SIGMOD, pp. 1215---1230 (2015)
[25]
Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Kalnis, P.: Lightning fast and space efficient inequality joins. PVLDB 8(13), 2074---2085 (2015)
[26]
Kiukkonen, N., Blom, J., Dousse, O., Gatica-Perez, D., Laurila, J.: Towards rich mobile phone datasets: lausanne data collection campaign. In: ICPS (2010)
[27]
Knuth, D. E.: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, Reading (1973)
[28]
Laurila, J.K., Gatica-Perez, D., Aad, I., Bornet, O., Do, T.-M.-T., Dousse, O., Eberle, J., Miettinen, M.: The mobile data challenge: big data for mobile computing research. In: Pervasive Computing (2012)
[29]
Lohman, G., Mohan, C., Haas, L., Daniels, D., Lindsay, B., Selinger, P., Wilms, P.: Query processing in R*. In: Query Processing in Database Systems, pp. 31---47 (1985)
[30]
Lopes Siqueira, T.L., Ciferri, R.R., Times, V.C., de Aguiar Ciferri, C.D.: A spatial bitmap-based index for geographical data warehouses. In: SAC, pp. 1336---1342 (2009)
[31]
Mamoulis, N., Papadias, D.: Multiway spatial joins. TODS 26(4), 424---475 (2001)
[32]
Morris, J., Ramesh, B.: Dynamic Partition Enhanced Inequality Joining Using a Value-count Index, 1 2011. US Patent 7,873,629 B1
[33]
Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In: SIGMOD, pp. 949---960 (2011)
[34]
Schneider, D.A., DeWitt, D.J.: A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In: SIGMOD (1989)
[35]
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23---34 (1979)
[36]
Stockinger, K., Wu, K.: Bitmap indices for data warehouses. Data Wareh OLAP Concepts Archit Solut 5, 157---178 (2007)
[37]
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud, pp. 10---10 (2010)
[38]
Zhang, X., Chen, L., Wang, M.: Efficient multi-way theta-join processing using MapReduce. PVLDB 5(11), 1184---1195 (2012)

Cited By

View all
  • (2024)Efficient Stream Join Processing: Novel Approaches and ChallengesProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658833(409-412)Online publication date: 3-Jun-2024
  • (2023)Online list labeling with predictionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668755(60278-60290)Online publication date: 10-Dec-2023
  • (2023)Nereus: A Distributed Stream Band Join System With Adaptive Range PartitioningIEEE Transactions on Consumer Electronics10.1109/TCE.2023.324929269:4(949-961)Online publication date: 1-Nov-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The VLDB Journal — The International Journal on Very Large Data Bases
The VLDB Journal — The International Journal on Very Large Data Bases  Volume 26, Issue 1
February 2017
148 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 February 2017

Author Tags

  1. Incremental
  2. Inequality join
  3. PostgreSQL
  4. Selectivity estimation
  5. Spark SQL

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)59
  • Downloads (Last 6 weeks)7
Reflects downloads up to 17 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Efficient Stream Join Processing: Novel Approaches and ChallengesProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658833(409-412)Online publication date: 3-Jun-2024
  • (2023)Online list labeling with predictionsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668755(60278-60290)Online publication date: 10-Dec-2023
  • (2023)Nereus: A Distributed Stream Band Join System With Adaptive Range PartitioningIEEE Transactions on Consumer Electronics10.1109/TCE.2023.324929269:4(949-961)Online publication date: 1-Nov-2023
  • (2023)Incremental discovery of denial constraintsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00788-y32:6(1289-1313)Online publication date: 17-Mar-2023
  • (2023)Boosting Knowledge Graph Generation from Tabular Data with RML ViewsThe Semantic Web10.1007/978-3-031-33455-9_29(484-501)Online publication date: 28-May-2023
  • (2022)Fast approximate denial constraint discoveryProceedings of the VLDB Endowment10.14778/3565816.356582816:2(269-281)Online publication date: 1-Oct-2022
  • (2022)What is the price for joining securely?Proceedings of the VLDB Endowment10.14778/3494124.349414615:3(659-672)Online publication date: 4-Feb-2022
  • (2021)Beyond equi-joinsProceedings of the VLDB Endowment10.14778/3476249.347630614:11(2599-2612)Online publication date: 27-Oct-2021
  • (2021)Fast incremental discovery of pointwise order dependenciesProceedings of the VLDB Endowment10.14778/3401960.340196513:10(1669-1681)Online publication date: 10-Mar-2021
  • (2021)Leveraging range joins for the computation of overlap joinsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-021-00692-331:1(75-99)Online publication date: 28-Aug-2021
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media