
Optimization of Complex Dataflows with User-Defined Functions

Published: 26 May 2017

Abstract

In many fields, recent years have brought a sharp rise in the size of the data to be analyzed and the complexity of the analysis to be performed. Such analyses are often described as dataflows specified in declarative dataflow languages. A key technique to achieve scalability for such analyses is the optimization of the declarative programs; however, many real-life dataflows are dominated by user-defined functions (UDFs) to perform, for instance, text analysis, graph traversal, classification, or clustering. This calls for specific optimization techniques as the semantics of such UDFs are unknown to the optimizer.
In this article, we survey techniques for optimizing dataflows with UDFs. We consider methods developed over decades of research in relational database systems as well as more recent approaches spurred by the popularity of Map/Reduce-style data processing frameworks. We present techniques for syntactical dataflow modification, approaches for inferring semantics and rewrite options for UDFs, and methods for dataflow transformations on both the logical and the physical levels. Furthermore, we give a comprehensive overview of declarative dataflow languages for Big Data processing systems from the perspective of their built-in optimization techniques. Finally, we highlight open research challenges with the intention of fostering more research into optimizing dataflows that contain UDFs.
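To make the idea of a logical dataflow transformation concrete, the following is a minimal sketch, not a method taken from the article. It assumes each UDF is record-at-a-time and carries hand-written annotations of the fields it reads and writes; the names `Operator`, `can_swap`, `push_filters_early`, `tag_language`, and `recent` are hypothetical. Under these assumptions, a cheap filter may be moved ahead of an expensive UDF whenever their read and write sets do not conflict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Operator:
    """A record-at-a-time UDF with assumed (hand-annotated) read/write sets."""
    name: str
    fn: Callable[[Record], List[Record]]  # one input record -> zero or more outputs
    reads: FrozenSet[str]
    writes: FrozenSet[str]
    is_filter: bool = False


def can_swap(a: Operator, b: Operator) -> bool:
    """True if neither operator writes a field the other reads or writes."""
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)


def push_filters_early(plan: List[Operator]) -> List[Operator]:
    """Move filters ahead of adjacent non-filter operators when conflict-free."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if b.is_filter and not a.is_filter and can_swap(a, b):
                plan[i], plan[i + 1] = b, a
                changed = True
    return plan


def run(plan: List[Operator], records: List[Record]) -> List[Record]:
    """Apply the operators in order, one record at a time."""
    for op in plan:
        records = [out for rec in records for out in op.fn(rec)]
    return records


calls = {"tag": 0}  # counts invocations of the (pretend) expensive UDF


def tag_language(rec: Record) -> List[Record]:
    calls["tag"] += 1
    lang = "en" if " the " in f" {rec['text']} " else "other"
    return [{**rec, "lang": lang}]


def recent(rec: Record) -> List[Record]:
    return [rec] if rec["year"] >= 2015 else []


tag = Operator("tag_language", tag_language, frozenset({"text"}), frozenset({"lang"}))
recent_op = Operator("recent", recent, frozenset({"year"}), frozenset(), is_filter=True)

docs = [
    {"text": "the cat", "year": 2012},
    {"text": "le chat", "year": 2016},
    {"text": "on the mat", "year": 2017},
]
plan = [tag, recent_op]
optimized = push_filters_early(plan)  # filter now runs before the UDF
```

Running `run(optimized, docs)` yields the same records as `run(plan, docs)` while invoking `tag_language` only on the records that survive the filter. A filter that reads `lang` would conflict with the UDF's write set, so `can_swap` would reject that reordering.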



Information & Contributors

Information

Published In

ACM Computing Surveys, Volume 50, Issue 3
May 2018
550 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3101309
  • Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017
Accepted: 01 April 2017
Revised: 01 March 2017
Received: 01 November 2016
Published in CSUR Volume 50, Issue 3


Author Tags

  1. Dataflow optimization
  2. big data processing systems
  3. user-defined functions

Qualifiers

  • Survey
  • Research
  • Refereed

Funding Sources

  • German Research Foundation
