[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3329785.3329920acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

GPU-Accelerated Similarity Self-Join for Multi-Dimensional Data

Published: 01 July 2019 Publication History

Abstract

The similarity self-join finds all objects in a dataset that are within a search distance, ∈, of each other. As such, the self-join is a building block of many algorithms. In high dimensions, indexing structures become increasingly ineffective at pruning the search, making the self-join challenging to compute efficiently. We advance a GPU-accelerated self-join algorithm targeted towards high dimensional data. The massive parallelism afforded by the GPU and high aggregate memory bandwidth makes the architecture well-suited for data-intensive workloads. We leverage a grid-based GPU-tailored index to perform range queries, and propose the following optimizations: (i) a trade-off between candidate set filtering and index search overhead by exploiting properties of the index; (ii) reordering the data based on variance in each dimension to improve the filtering power of the index; and (iii) a pruning method for reducing the number of expensive distance calculations. Our algorithm generally outperforms a parallel CPU state-of-the-art approach.

References

[1]
{n.d.}. Nvidia Volta. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Accessed: Oct. 5, 2018.
[2]
{n.d.}. Top500. https://www.top500.org/lists/2018/06/. Accessed: Apr. 29, 2019.
[3]
Rakesh Agrawal, Christos Faloutsos, and Arun Swami. 1993. Efficient similarity search in sequence databases. Foundations of data organization and algorithms (1993), 69--84.
[4]
Alexandr Andoni and Piotr Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In IEEE Symp. on Foundations of Computer Science. 459--468.
[5]
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. OPTICS: Ordering Points to Identify the Clustering Structure. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data. 49--60.
[6]
Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-similarity Joins. In Proc. of the Intl. Conf. on Very Large Data Bases. 918--929.
[7]
P. Baldi, P. Sadowski, and D. Whiteson. 2014. Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5, Article 4308 (July 2014). arXiv:hep-ph/1402.4735
[8]
Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling Up All Pairs Similarity Search. In Proc. of the Intl. Conf. on World Wide Web. 131--140.
[9]
Richard E Bellman. 1961. Adaptive control processes: a guided tour. Princeton University press.
[10]
Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18, 9 (1975), 509--517.
[11]
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 2011. The Million Song Dataset. In Proc. of the 12th Intl. Conf. on Music Information Retrieval.
[12]
Christian Böhm, Bernhard Braunmüller, Markus Breunig, and Hans-Peter Kriegel. 2000. High Performance Clustering Based on the Similarity Join. In Proc. of the Intl. Conf. on Information and Knowledge Management. 298--305.
[13]
Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel. 2001. Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data. In ACM SIGMOD Record, Vol. 30. 379--388.
[14]
C. Bohm and H. P. Kriegel. 2001. A cost model and index architecture for the similarity join. In Proc. 17th Intl. Conf. on Data Engineering. 411--420.
[15]
Christian Böhm, Robert Noll, Claudia Plant, and Andrew Zherdin. 2009. Index-supported Similarity Join on Graphics Processors. In BTW. 57--66.
[16]
Brent Bryan, Frederick Eberhardt, and Christos Faloutsos. 2008. Compact similarity joins. In IEEE 24th Intl. Conf. on Data Engineering. 346--355.
[17]
Gang Chen, Keyu Yang, Lu Chen, Yunjun Gao, Baihua Zheng, and Chun Chen. 2017. Metric similarity joins using MapReduce. IEEE Transactions on Knowledge and Data Engineering 29, 3 (2017), 656--669.
[18]
Akash Das Sarma, Yeye He, and Surajit Chaudhuri. 2014. ClusterJoin: A Similarity Joins Framework using Map-Reduce. Proceedings of the VLDB Endowment 7, 12 (2014), 1059--1070.
[19]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. CACM 51, 1 (2008), 107--113.
[20]
Jens-Peter Dittrich and Bernhard Seeger. 2001. GESS: A Scalable Similarity-join Algorithm for Mining Large Data Sets in High Dimensional Spaces. In Proc. of the ACM Intl. Conf. on Knowledge Discovery and Data Mining. 47--56.
[21]
Robert J. Durrant and Ata Kabán. 2009. When is 'Nearest Neighbour' Meaningful: A Converse Theorem and Implications. Journal of Complexity 25, 4 (2009), 385--397.
[22]
Martin Ester, Hans Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd KDD. 226--231.
[23]
Raphael A. Finkel and Jon Louis Bentley. 1974. Quad trees a data structure for retrieval on composite keys. Acta informatica 4, 1 (1974), 1--9.
[24]
S. Fries, B. Boden, G. Stepien, and T. Seidl. 2014. PHiDJ: Parallel similarity self-join for high-dimensional vector data with MapReduce. In 2014 IEEE 30th Intl. Conf. on Data Engineering. 796--807.
[25]
Benoit Gallet and Michael Gowanlock. 2019. Load Imbalance Mitigation Optimizations for GPU-Accelerated Similarity Joins. In Proc. of the 2019 IEEE Intl. Parallel and Distributed Processing Symp. Workshops (IPDPSW), to appear.
[26]
Michael Gowanlock and Henri Casanova. 2016. Distance Threshold Similarity Searches: Efficient Trajectory Indexing on the GPU. IEEE Transactions on Parallel and Distributed Systems 27, 9 (2016), 2533--2545.
[27]
M. Gowanlock and B. Karsin. 2018. GPU Accelerated Self-Join for the Distance Similarity Metric. In 2018 IEEE Intl. Parallel and Distributed Processing Symp. Workshops (IPDPSW). 477--486.
[28]
Michael Gowanlock, Cody M Rude, David M Blair, Justin D Li, and Victor Pankratius. 2017. Clustering Throughput Optimization on the GPU. In Proc. of the IEEE Intl. Parallel and Distributed Processing Symp. 832--841.
[29]
Antonin Guttman. 1984. R-trees: a dynamic index structure for spatial searching. In Proc. of Intl. Conf. on Management of Data. 47--57.
[30]
Gibran Hemani, Athanasios Theocharidis, Wenhua Wei, and Chris Haley. 2011. EpiGPU: exhaustive pairwise epistasis scans parallelized on consumer level graphics cards. Bioinformatics 27, 11 (2011), 1462--1465.
[31]
Edwin H. Jacox and Hanan Samet. 2007. Spatial Join Techniques. ACM Trans. Database Syst. 32, 1, Article 7 (2007).
[32]
Edwin H. Jacox and Hanan Samet. 2008. Metric Space Similarity Joins. ACM Trans. Database Syst. 33, 2, Article 7 (2008), 38 pages.
[33]
Dmitri V Kalashnikov. 2013. Super-EGO: fast multi-dimensional similarity join. The VLDB Journal 22, 4 (2013), 561--585.
[34]
Jinwoong Kim, Won-Ki Jeong, and Beomseok Nam. 2015. Exploiting Massive Parallelism for Indexing Multi-Dimensional Datasets on the GPU. IEEE Transactions on Parallel and Distributed Systems 26, 8 (2015), 2258--2271.
[35]
Jinwoong Kim and Beomseok Nam. 2018. Co-processing heterogeneous parallel index for multi-dimensional datasets. J. Parallel and Distrib. Comput. 113 (2018), 195--203.
[36]
Krzysztof Koperski and Jiawei Han. 1995. Discovery of spatial association rules in geographic information databases. In Advances in spatial databases. Springer, 47--66.
[37]
Francesco Lettich, Salvatore Orlando, Claudio Silvestri, and Christian S Jensen. 2017. Manycore GPU processing of repeated range queries over streams of moving objects observations. Concurrency and Computation: Practice and Experience 29, 4 (2017), e3881.
[38]
M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[39]
Michael D Lieberman, Jagan Sankaranarayanan, and Hanan Samet. 2008. A fast similarity join algorithm using graphics processing units. In IEEE 24th Intl. Conf. on Data Engineering. 1111--1120.
[40]
NVIDIA. 2018. Pascal Tuning Guide. http://docs.nvidia.com/cuda/pascal-tuning-guide/index.html. Accessed 18-June-2018.
[41]
S. K. Prasad, M. McDermott, X. He, and S. Puri. 2015. GPU-based Parallel R-tree Construction and Querying. In 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. 618--627.
[42]
Mahsan Rofouei, Thanos Stathopoulos, Sebi Ryffel, William Kaiser, and Majid Sarrafzadeh. 2008. Energy-aware high performance computing with graphic processing units. In Workshop on power aware computing and system.
[43]
Thomas Seidl, Sergej Fries, and Brigitte Boden. 2013. MR-DSJ: Distance-Based Self-Join for Large-Scale Vector Data Analysis with MapReduce. In BTW, Vol. 214. 37--56.
[44]
H. So, J. Chen, B. Yiu, and A. Yu. 2011. Medical Ultrasound Imaging: To GPU or Not to GPU? IEEE Micro 31, 5 (2011), 54--65.
[45]
Ilya Volnyansky and Vladimir Pestov. 2009. Curse of Dimensionality in Pivot Based Indexes. In Proc. of the Second Intl. Workshop on Similarity Search and Applications. 39--46.
[46]
Jianting Zhang, Simin You, and Le Gruenwald. 2012. U2STRA: High-performance Data Management of Ubiquitous Urban Sensing Trajectories on GPGPUs. In Proc. of the ACM Workshop on City Data Management. 5--12.

Cited By

View all
  • (2024)The Solar System Notification Alert Processing System (SNAPS): Asteroid Population Outlier DetectionThe Astronomical Journal10.3847/1538-3881/ad4da5168:2(56)Online publication date: 5-Jul-2024
  • (2023)Optimization and Comparison of Coordinate- and Metric-Based Indexes on GPUs for Distance Similarity SearchesComputational Science – ICCS 202310.1007/978-3-031-36021-3_37(357-364)Online publication date: 26-Jun-2023
  • (2022)Leveraging GPU Tensor Cores for Double Precision Euclidean Distance Calculations2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)10.1109/HiPC56025.2022.00029(135-144)Online publication date: Dec-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
DaMoN'19: Proceedings of the 15th International Workshop on Data Management on New Hardware
July 2019
150 pages
ISBN:9781450368018
DOI:10.1145/3329785
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPGPU
  2. High-dimensional data
  3. In-memory data-base
  4. Index structure
  5. Query optimization
  6. Self-join

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

Conference

SIGMOD/PODS '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 94 of 127 submissions, 74%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)74
  • Downloads (Last 6 weeks)11
Reflects downloads up to 31 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)The Solar System Notification Alert Processing System (SNAPS): Asteroid Population Outlier DetectionThe Astronomical Journal10.3847/1538-3881/ad4da5168:2(56)Online publication date: 5-Jul-2024
  • (2023)Optimization and Comparison of Coordinate- and Metric-Based Indexes on GPUs for Distance Similarity SearchesComputational Science – ICCS 202310.1007/978-3-031-36021-3_37(357-364)Online publication date: 26-Jun-2023
  • (2022)Leveraging GPU Tensor Cores for Double Precision Euclidean Distance Calculations2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)10.1109/HiPC56025.2022.00029(135-144)Online publication date: Dec-2022
  • (2020)A coordinate-oblivious index for high-dimensional distance similarity searches on the GPUProceedings of the 34th ACM International Conference on Supercomputing10.1145/3392717.3392768(1-12)Online publication date: 29-Jun-2020
  • (2020)GPU-based efficient join algorithms on HadoopThe Journal of Supercomputing10.1007/s11227-020-03262-6Online publication date: 3-Apr-2020

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media