research-article

kNN-join Query Processing Algorithm on Mapreduce for Large Amounts of Data

Authors:

Hyunjo Lee,

Jae-Woo Chang,

Cheoljoo ChaeAuthors Info & Claims

ISEEIE 2021: 2021 International Symposium on Electrical, Electronics and Information Engineering

Pages 538 - 544

https://doi.org/10.1145/3459104.3459192

Published: 20 July 2021 Publication History

Get Access

Abstract

Recently, the amount of data is rapidly increasing with the continuous development of computation and communication capabilities. So, it has been actively studied for the effective data analysis schemes of the large amounts of data on MapReduce which supports efficient parallel data processing for large-scale data. Among various queries for analysing data, k nearest neighbour (kNN) join query, which aims to combine the k nearest neighbours of each point of dataset R with those from another dataset S, has been considered typical. However, existing kNN join schemes on MapReduce require high computation cost for constructing and managing index structures. To solve the problems, we propose a kNN-join query processing algorithm on MapReduce for analysing large-scale data. First, our algorithm can reduce the overhead for constructing the index structure by using the seed-based dynamic partitioning. Second, it can reduce the computational overhead to find candidate partitions by using the average distance between a pair of neighbouring seeds. We show that our algorithm outperforms the existing scheme in terms of the query processing time.

References

[1]

J. Dean, S. Ghemawat, “MapReduce: Simplified data processing on large clusters”, Communications of the ACM, Vol. 51, No. 1, pp. 107-113, 2008.

Digital Library

Google Scholar

[2]

D. Jiang, BC. Ooi, L. Shi, S. Wu, “The performance of MapReduce: An in-depth study”, Proceedings of the VLDB Endowment, Vol 3.1-2, pp. 472-483, 2010.

Digital Library

Google Scholar

[3]

C. Ji, Y. Li, W. Qiu, U. Awada, K. Li, “Big Data Processing in Cloud Computing Environments”, ISPAN 2012, pp.17-23, 2012.

Digital Library

Google Scholar

[4]

C. Bohm, HP. Kriegel, “A cost model and index architecture for the similarity join”, ICDE 2001, pp. 411-420, 2001.

Crossref

Google Scholar

[5]

C. Xia, H. Lu, BC. Ooi, J. Hu, “Gorder: An efficient method for knn join processing”, VLDB 2004, pp. 756-767, 2004.

Google Scholar

[6]

W. Lu., Y. Shen, Y. Chen, BC. Ooi, “Efficient processing of k nearest neighbor joins using MapReduce”, Proceedings of the VLDB Endowment, Vol. 5(10), pp. 1016–1027, 2012.

Digital Library

Google Scholar

[7]

C. Yu, B. Cui, S. Wang, J. Su, “Efficient index-based knn join processing for high-dimensional data”, Information and Software Technology, Vol. 49, No.4, pp. 332-344, 2007.

Digital Library

Google Scholar

[8]

B. Yao, F. Li, P. Kumar, “K nearest neighbor queries and knn-joins in large relational databases (almost) for free”, ICDE 2010, pp 4-15, 2010.

Crossref

Google Scholar

[9]

R. Vernica, M. J. Carey, C. Li, “Efficient parallel set-similarity joins using MapReduce”, SIGMOD 2010, pp. 495–506, 2010.

Digital Library

Google Scholar

[10]

A. Metwally, C. Faloutsos, “V-smart-join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors”, PVLDB 2012, Vol. 5(8), pp. 704–715, 2012.

Google Scholar

[11]

R. Walpole, R. Myers, S. Myers, K. Ye, “Probability and Statistics for Engineers and Scientists”, Prentice-Hall 20

Google Scholar

Cited By

View all

Ukey NZhang GYang ZWang XLi BSaydam SZhang W(2024)A Cluster-Based Approach to kNN Join Over Batch-Dynamic High-Dimensional DataAdvanced Data Mining and Applications10.1007/978-981-96-0814-0_6(81-96)Online publication date: 13-Dec-2024
https://doi.org/10.1007/978-981-96-0814-0_6
Ukey NZhang GYang ZLi BLi WZhang W(2023)Efficient continuous kNN join over dynamic high-dimensional dataWorld Wide Web10.1007/s11280-023-01204-926:6(3759-3794)Online publication date: 11-Sep-2023
https://dl.acm.org/doi/10.1007/s11280-023-01204-9

Index Terms

kNN-join Query Processing Algorithm on Mapreduce for Large Amounts of Data
1. Information systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Efficient parallel kNN joins for large data in MapReduce
EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, ...
A comparison of join algorithms for log processing in MaPreduce
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As ...
Efficient big data processing in Hadoop MapReduce

This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

ISEEIE 2021: 2021 International Symposium on Electrical, Electronics and Information Engineering

February 2021

644 pages

ISBN:9781450389839

DOI:10.1145/3459104

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2021

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article
Research
Refereed limited

Conference

ISEEIE 2021

ISEEIE 2021: 2021 International Symposium on Electrical, Electronics and Information Engineering

February 19 - 21, 2021

Seoul, Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
47
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)0

Reflects downloads up to 11 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Ukey NZhang GYang ZWang XLi BSaydam SZhang W(2024)A Cluster-Based Approach to kNN Join Over Batch-Dynamic High-Dimensional DataAdvanced Data Mining and Applications10.1007/978-981-96-0814-0_6(81-96)Online publication date: 13-Dec-2024
https://doi.org/10.1007/978-981-96-0814-0_6
Ukey NZhang GYang ZLi BLi WZhang W(2023)Efficient continuous kNN join over dynamic high-dimensional dataWorld Wide Web10.1007/s11280-023-01204-926:6(3759-3794)Online publication date: 11-Sep-2023
https://dl.acm.org/doi/10.1007/s11280-023-01204-9

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Abstract

References

Cited By

Index Terms

Recommendations

Efficient parallel kNN joins for large data in MapReduce

A comparison of join algorithms for log processing in MaPreduce

Efficient big data processing in Hadoop MapReduce

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

HTML Format

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations