DOI: 10.1145/2983323.2983669
Research article

Reuse-based Optimization for Pig Latin

Published: 24 October 2016

Abstract

Pig Latin is a popular language widely used for the parallel processing of massive data sets. Currently, a subexpression that occurs repeatedly in Pig Latin scripts is executed as many times as it appears, because the current Pig Latin optimizer does not identify reuse opportunities. We present a novel optimization approach that identifies and reuses repeated subexpressions in Pig Latin scripts. Our optimization algorithm, named PigReuse, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and reuses their results as needed in order to compute exactly the same output as the original scripts. Our experiments demonstrate the effectiveness of our approach.
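
The three steps named in the abstract (identify repeated subexpressions, select which ones to share using a cost function, reuse their results) can be illustrated with a minimal Python sketch. This is not the PigReuse algorithm: the tuple-based plan encoding, the toy cost function, the `store_cost` parameter, and the greedy selection are all assumptions made for illustration, and every function name below is hypothetical.

```python
from collections import Counter

# Assumed encoding (not from the paper): a plan is a nested tuple
# (operator, argument, *children); a leaf is ("LOAD", source_name).

def subexpressions(plan):
    """Yield every subtree of a plan, including the plan itself."""
    yield plan
    for child in plan[2:]:
        yield from subexpressions(child)

def cost(plan):
    """Toy cost function: the number of operators in the subtree."""
    return 1 + sum(cost(child) for child in plan[2:])

def select_reuse(plans, store_cost=1):
    """Greedily pick repeated subexpressions worth computing only once."""
    counts = Counter()
    for plan in plans:
        counts.update(subexpressions(plan))
    # Saving = work avoided by sharing minus an assumed cost of keeping the result.
    savings = {e: cost(e) * (n - 1) - store_cost
               for e, n in counts.items() if n > 1}
    chosen = []
    for expr in sorted(savings, key=savings.get, reverse=True):
        if savings[expr] <= 0:
            continue  # sharing would not pay off under the toy cost model
        if any(expr in subexpressions(c) for c in chosen):
            continue  # already covered by a larger shared subexpression
        chosen.append(expr)
    return chosen

# Two hypothetical scripts that both filter the same input.
load_users = ("LOAD", "users")
adults = ("FILTER", "age > 18", load_users)
script1 = ("GROUP", "city", adults)
script2 = ("JOIN", "id", adults, ("LOAD", "orders"))

reused = select_reuse([script1, script2])
print(reused)
```

In this example the shared `FILTER` subtree is selected, while the bare `LOAD` is skipped because its saving does not exceed the assumed storage cost. A real optimizer would also need to decide when two syntactically different subexpressions are equivalent, which this sketch does not attempt.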


Published In

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN:9781450340731
DOI:10.1145/2983323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. linear programming
  2. piglatin
  3. reuse-based optimization

Qualifiers

  • Research-article

Conference

CIKM'16: ACM Conference on Information and Knowledge Management
October 24 - 28, 2016
Indianapolis, Indiana, USA

Acceptance Rates

CIKM '16 Paper Acceptance Rate 160 of 701 submissions, 23%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 1
Reflects downloads up to 07 Mar 2025

Cited By

  • (2023) Optimizing Data Pipelines for Machine Learning in Feature Stores. Proceedings of the VLDB Endowment 16:13 (4230-4239). DOI: 10.14778/3625054.3625060. Online publication date: 1-Sep-2023
  • (2022) Hippo. Proceedings of the VLDB Endowment 15:5 (1038-1052). DOI: 10.14778/3510397.3510402. Online publication date: 18-May-2022
  • (2022) AutoView: An Autonomous Materialized View Management System with Encoder-Reducer. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2022.3163195. Online publication date: 2022
  • (2021) Exploiting Reused-Based Sharing Work Opportunities in Big Data Multiquery Optimization with Flink. Big Data 9:6 (454-479). DOI: 10.1089/big.2020.0141. Online publication date: 1-Dec-2021
  • (2020) Automatic View Generation with Deep Learning and Reinforcement Learning. 2020 IEEE 36th International Conference on Data Engineering (ICDE) (1501-1512). DOI: 10.1109/ICDE48307.2020.00133. Online publication date: Apr-2020
  • (2020) A gray-box modeling methodology for runtime prediction of Apache Spark jobs. Distributed and Parallel Databases. DOI: 10.1007/s10619-020-07286-y. Online publication date: 10-Mar-2020
  • (2019) Peregrine. Proceedings of the ACM Symposium on Cloud Computing (416-427). DOI: 10.1145/3357223.3362726. Online publication date: 20-Nov-2019
  • (2019) Acorn. Proceedings of the ACM Symposium on Cloud Computing (206-219). DOI: 10.1145/3357223.3362702. Online publication date: 20-Nov-2019
  • (2019) Apache Hive. Proceedings of the 2019 International Conference on Management of Data (1773-1786). DOI: 10.1145/3299869.3314045. Online publication date: 25-Jun-2019
  • (2018) Selecting subexpressions to materialize at datacenter scale. Proceedings of the VLDB Endowment 11:7 (800-812). DOI: 10.14778/3192965.3192971. Online publication date: 1-Mar-2018
