[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2902251.2902303acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article
Public Access

Parallel Algorithms for Constructing Range and Nearest-Neighbor Searching Data Structures

Published: 15 June 2016 Publication History

Abstract

With the massive amounts of data available today, it is common to store and process data using multiple machines. Parallel programming platforms such as MapReduce and its variants are popular frameworks for handling such large data. We present the first provably efficient algorithms to compute, store, and query data structures for range queries and approximate nearest neighbor queries in a popular parallel computing abstraction that captures the salient features of MapReduce and other massively parallel communication (MPC) models. In particular, we describe algorithms for $kd$-trees, range trees, and BBD-trees that only require O(1) rounds of communication for both preprocessing and querying while staying competitive in terms of running time and workload to their classical counterparts. Our algorithms are randomized, but they can be made deterministic at some increase in their running time and workload while keeping the number of rounds of communication to be constant.

References

[1]
http://pig.apache.org/.
[2]
F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a MapReduce computation. Proc. VLDB Endow., 6(4):277--288, 2013.
[3]
P. K. Agarwal, L. Arge, O. Procopiuc, and J. S. Vitter. A framework for index bulk loading and dynamization. In Proc. 28th Int. Colloq. Automata, Lang. and Program., pages 115--127, 2001.
[4]
P. K. Agarwal, M. de Berg, J. Gudmundsson, M. Hammar, and H. J. Haverkort. Box-trees and R-trees with near-optimal query time. Disc. Comput. Geom., 28(3):291--312, 2002.
[5]
P. K. Agarwal and J. Erickson. Geometric range searching and its relatives. In B. Chazelle, J. Goodman, and R. Pollack, editors, Advances in Discrete and Computational Geometry, pages 1--56. American Mathematical Society, 1998.
[6]
A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Comm. ACM, 31(9):1116--1127, 1988.
[7]
T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernandez-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The Dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow., 8(12):1792--1803, 2015.
[8]
L. Alarabi, A. Eldawy, R. Alghamdi, and M. F. Mokbel. TAREEG: a MapReduce-based system for extracting spatial data from OpenStreetMap. In Proc. 22nd ACM SIGSPATIAL Int. Conf. Advances Geo. Inf. Systems, pages 83--92, 2014.
[9]
A. Andoni, A. Nikolov, K. Onak, and G. Yaroslavtsev. Parallel algorithms for geometric graph problems. In Proc. 46th ACM Annu. Symp. Theory Comput., pages 574--583, 2014.
[10]
L. Arge. External memory data structures. In J. Abello, P. M. Pardalos, and M. G. Resende, editors, Handbook of Massive Data Sets, pages 313--357. Springer, 2002.
[11]
L. Arge, G. S. Brodal, and R. Fagerberg. Cache-oblivious data structures. In D. Mehta and S. Sahni, editors, Handbook of Data Structures and Applications. CRC Press, 2005.
[12]
S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891--923, 1998.
[13]
B. Bahmani, A. Goel, and R. Shinde. Efficient distributed locality sensitive hashing. In Proc. 21st ACM Int. Conf. Info. Know. Manage., pages 2174--2178, 2012.
[14]
P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In Proc. 32nd Symp. Princ. Database Systems, pages 273--284, 2013.
[15]
T. M. Chan. Optimal partition trees. Disc. Comput. Geom., 47(4):661--690, 2012.
[16]
B. Chazelle. The Discrepancy Method - Randomness and Complexity. Cambridge University Press, 2001.
[17]
M. de Berg, M. van Kreveld, M. Overmars, and O. Cheong. Computational geometry. Springer, 3rd edition, 2000.
[18]
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Comm. ACM, 51(1):107--113, 2008.
[19]
F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th Annu. Symp. Comput. Geom., pages 298--307, 1993.
[20]
A. Eldawy, L. Alarabi, and M. F. Mokbel. Spatial partitioning techniques in SpatialHadoop. Proc. VLDB Endow., 8(12):1602--1613, 2015.
[21]
A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. CG_Hadoop: computational geometry in MapReduce. In Proc. 21st ACM SIGSPATIAL Int. Conf. Advances Geo. Inf. Systems, pages 284--293, 2013.
[22]
A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In Proc. 31st IEEE Int. Conf. Data Eng., pages 1352--1363, 2015.
[23]
A. Ene, S. Im, and B. Moseley. Fast clustering using MapReduce. In Proc. 17th ACM SIGKDD Int. Conf. Know. Discovery and Data Mining, pages 681--689, 2011.
[24]
G. Evangelidis, D. B. Lomet, and B. Salzberg. The hB-Pi-tree: A multi-attribute index supporting concurrency, recovery and node consolidation. VLDB J., 6(1):1--25, 1997.
[25]
M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. 40th IEEE Annu. Symp. Found. Comp. Sci., pages 285--298, 1999.
[26]
A. Goel and K. Munagala. Complexity measures for MapReduce, and comparison to parallel computing. CoRR, abs/1211.6526, 2012.
[27]
M. T. Goodrich. Communication-efficient parallel sorting. SIAM J. Comput., 29(2):416--432, 1999.
[28]
M. T. Goodrich. Parallel algorithms in geometry. In Handbook of Discrete and Computational Geometry, Second Edition., pages 953--967. 2004.
[29]
M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. In Proc. 22nd Int. Symp. Alg. Comput., pages 374--383, 2011.
[30]
E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. Proc. VLDB Endow., 4(6):385--396, 2011.
[31]
H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In Proc. 21st ACM-SIAM Annu. Symp. Disc. Alg., pages 938--948, 2010.
[32]
R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. ACM Trans. Parallel Comput., 2(3):14:1--14:22, 2015.
[33]
S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: A method for solving graph problems in MapReduce. In Proc. 23rd ACM Annu. Symp. Parall. Alg. Arch., pages 85--94, 2011.
[34]
K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon. Parallel data processing with MapReduce: A survey. SIGMOD Rec., 40(4):11--20, 2012.
[35]
Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. J. Comp. Syst. Sci., 62(3):516--527, 2001.
[36]
D. B. Lomet and B. Salzberg. The hb-tree: A multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst., 15(4):625--658, 1990.
[37]
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. 2010 ACM SIGMOD Int. Conf. Manage. Data, pages 135--146, 2010.
[38]
J. Matousek. Approximations and optimal geometric divide-and-conquer. J. Comp. Syst. Sci., 50(2):203--208, 1995.
[39]
M. Mitzenmacher and E. Upfal. Probability and Computing - Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[40]
L. Qin, J. X. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. Scalable big graph processing in MapReduce. In Proc. 2014 ACM SIGMOD Int. Conf. Manage. Data, pages 827--838, 2014.
[41]
K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Proc. 26th IEEE Symp. Mass Storage Syst. Tech., pages 1--10, 2010.
[42]
L. G. Valiant. A bridging model for parallel computation. Comm. ACM, 33(8):103--111, 1990.
[43]
T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[44]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proc. 2nd USENIX Conf. Hot Topics Cloud Comput., pages 10--10, 2010.

Cited By

View all
  • (2025)Parallel kd-tree with Batch UpdatesProceedings of the ACM on Management of Data10.1145/37097123:1(1-26)Online publication date: 11-Feb-2025
  • (2024)Topology-aware Parallel JoinsProceedings of the ACM on Management of Data10.1145/36515982:2(1-25)Online publication date: 14-May-2024
  • (2022)General Data Search Algorithms for Earth Simulation Systems with Cyclic BoundariesISPRS International Journal of Geo-Information10.3390/ijgi1107039211:7(392)Online publication date: 12-Jul-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2016
504 pages
ISBN:9781450341912
DOI:10.1145/2902251
  • General Chair:
  • Tova Milo,
  • Program Chair:
  • Wang-Chiew Tan
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. computational geometry
  2. data structures
  3. mapreduce
  4. nearest-neighbor search
  5. range search
  6. sampling

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS'16
Sponsor:
SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
California, San Francisco, USA

Acceptance Rates

PODS '16 Paper Acceptance Rate 31 of 94 submissions, 33%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)136
  • Downloads (Last 6 weeks)21
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Parallel kd-tree with Batch UpdatesProceedings of the ACM on Management of Data10.1145/37097123:1(1-26)Online publication date: 11-Feb-2025
  • (2024)Topology-aware Parallel JoinsProceedings of the ACM on Management of Data10.1145/36515982:2(1-25)Online publication date: 14-May-2024
  • (2022)General Data Search Algorithms for Earth Simulation Systems with Cyclic BoundariesISPRS International Journal of Geo-Information10.3390/ijgi1107039211:7(392)Online publication date: 12-Jul-2022
  • (2022)Incremental User Identification Across Social Networks Based on User-Guider Similarity IndexJournal of Computer Science and Technology10.1007/s11390-022-2430-037:5(1086-1104)Online publication date: 1-Oct-2022
  • (2022)IndexingEncyclopedia of Big Data Technologies10.1007/978-3-319-63962-8_217-2(1-6)Online publication date: 15-Jun-2022
  • (2021)Algorithms for a Topology-aware Massively Parallel Computation ModelProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458318(199-214)Online publication date: 20-Jun-2021
  • (2020)Packing R-trees with Space-filling CurvesACM Transactions on Database Systems10.1145/339750645:3(1-47)Online publication date: 26-Aug-2020
  • (2019)Output-Optimal Massively Parallel Algorithms for Similarity JoinsACM Transactions on Database Systems10.1145/331196744:2(1-36)Online publication date: 8-Apr-2019
  • (2019)IndexingEncyclopedia of Big Data Technologies10.1007/978-3-319-77525-8_217(1015-1019)Online publication date: 20-Feb-2019
  • (2018)PAMACM SIGPLAN Notices10.1145/3200691.317850953:1(290-304)Online publication date: 10-Feb-2018
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media