survey

Optimization of Complex Dataflows with User-Defined Functions

Authors:

Astrid Rheinländer,

Goetz Graefe

ACM Computing Surveys (CSUR), Volume 50, Issue 3

Article No.: 38, Pages 1 - 39

https://doi.org/10.1145/3078752

Published: 26 May 2017

Abstract

In many fields, recent years have brought a sharp rise in the size of the data to be analyzed and the complexity of the analysis to be performed. Such analyses are often described as dataflows specified in declarative dataflow languages. A key technique to achieve scalability for such analyses is the optimization of the declarative programs; however, many real-life dataflows are dominated by user-defined functions (UDFs) to perform, for instance, text analysis, graph traversal, classification, or clustering. This calls for specific optimization techniques as the semantics of such UDFs are unknown to the optimizer.

In this article, we survey techniques for optimizing dataflows with UDFs. We consider methods developed over decades of research in relational database systems as well as more recent approaches spurred by the popularity of Map/Reduce-style data processing frameworks. We present techniques for syntactical dataflow modification, approaches for inferring semantics and rewrite options for UDFs, and methods for dataflow transformations on both the logical and the physical levels. Furthermore, we give a comprehensive overview of declarative dataflow languages for Big Data processing systems from the perspective of their built-in optimization techniques. Finally, we highlight open research challenges with the intention of fostering more research into optimizing dataflows that contain UDFs.
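
To make concrete why UDF semantics matter to an optimizer, the following is a minimal sketch and not a technique quoted from the survey. It assumes each UDF is a deterministic, side-effect-free, one-record-in/one-record-out function that leaves unwritten fields unchanged, and that it carries hand-written annotations of the fields it reads and writes. All names (`MapUDF`, `Filter`, `can_push_filter_below`, `run`) are hypothetical. Under these assumptions, a filter may be evaluated before the UDF whenever it reads none of the fields the UDF writes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Record = Dict[str, object]


@dataclass(frozen=True)
class MapUDF:
    """A record-at-a-time UDF with (assumed) read/write field annotations."""
    fn: Callable[[Record], Record]
    reads: FrozenSet[str]
    writes: FrozenSet[str]


@dataclass(frozen=True)
class Filter:
    """A selection predicate annotated with the fields it reads."""
    pred: Callable[[Record], bool]
    reads: FrozenSet[str]


def can_push_filter_below(udf: MapUDF, flt: Filter) -> bool:
    # The filter sees the same values before and after the UDF
    # only if it reads no field that the UDF writes.
    return not (flt.reads & udf.writes)


def run(records: List[Record], udf: MapUDF, flt: Filter) -> List[Record]:
    if can_push_filter_below(udf, flt):
        # Reordered plan: filter first, so the UDF runs on fewer records.
        return [udf.fn(r) for r in records if flt.pred(r)]
    # Original plan: the filter depends on the UDF's output.
    return [out for out in (udf.fn(r) for r in records) if flt.pred(out)]


def add_length(r: Record) -> Record:
    out = dict(r)  # copy, so the input record is not mutated
    out["length"] = len(str(r["text"]))
    return out


udf = MapUDF(add_length, frozenset({"text"}), frozenset({"length"}))
by_lang = Filter(lambda r: r["lang"] == "en", frozenset({"lang"}))
by_len = Filter(lambda r: r["length"] > 3, frozenset({"length"}))

print(can_push_filter_below(udf, by_lang))  # True
print(can_push_filter_below(udf, by_len))   # False
```

Without the `reads` and `writes` annotations, the optimizer would have to treat `add_length` as a black box and keep the original operator order, even though filtering on `lang` first is safe and avoids unnecessary UDF calls.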

Index Terms

Optimization of Complex Dataflows with User-Defined Functions
1. General and reference
  1. Document types
    1. Surveys and overviews
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query optimization
      2. Parallel and distributed DBMSs
        MapReduce-based systems
    2. Query languages
      1. Query languages for non-relational engines
        MapReduce languages

Information & Contributors

Information

Published In

ACM Computing Surveys Volume 50, Issue 3

May 2018

550 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3101309

Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017

Accepted: 01 April 2017

Revised: 01 March 2017

Received: 01 November 2016

Published in CSUR Volume 50, Issue 3

Qualifiers

Survey
Research
Refereed

Funding Sources

German Research Foundation

Bibliometrics & Citations

Bibliometrics

Total Citations: 24
Total Downloads: 1,117
Downloads (Last 12 months): 88
Downloads (Last 6 weeks): 9

Reflects downloads up to 09 Jan 2025

Citations

Cited By

Mecquenem N, Bosse S, Bountris V, Mohammadi S, Reinert K, Leser U (2024) CuttleFlow: Infrastructure-Specific Workflow Adaption for Improved Reusability. 2024 IEEE 20th International Conference on e-Science (e-Science), 1-10. Online publication date: 16-Sep-2024
https://doi.org/10.1109/e-Science62913.2024.10678732
Bai Y, Song W, Liu Q (2024) Study on motion behaviour of coal water slurry particles under vibration conditions. Heliyon 10:3 (e24629). Online publication date: Feb-2024
https://doi.org/10.1016/j.heliyon.2024.e24629
Gounaris A, Michailidou A, Dustdar S (2023) Toward Building Edge Learning Pipelines. IEEE Internet Computing 27:1 (61-69). Online publication date: 1-Jan-2023
https://doi.org/10.1109/MIC.2022.3171643
Saur K, Mirmira T, Karanasos K, Camacho-Rodríguez J (2022) Containerized execution of UDFs. Proceedings of the VLDB Endowment 15:11 (3158-3171). Online publication date: 29-Sep-2022
https://dl.acm.org/doi/10.14778/3551793.3551860
Grulich P, Zeuch S, Markl V (2022) Babelfish. Proceedings of the VLDB Endowment 15:2 (196-210). Online publication date: 4-Feb-2022
https://dl.acm.org/doi/10.14778/3489496.3489501
Tsakalidis G, Vergidis K (2021) A Roadmap to Critical Redesign Choices That Increase the Robustness of Business Process Redesign Initiatives. Journal of Open Innovation: Technology, Market, and Complexity 7:3 (178). Online publication date: Sep-2021
https://doi.org/10.3390/joitmc7030178
Makrynioti N, Vassalos V (2021) Declarative Data Analytics: A Survey. IEEE Transactions on Knowledge and Data Engineering 33:6 (2392-2411). Online publication date: 1-Jun-2021
https://doi.org/10.1109/TKDE.2019.2958084
Rao B, Liu Z, Zhang H, Lu S, Wang L (2021) SODA: A Semantics-Aware Optimization Framework for Data-Intensive Applications Using Hybrid Program Analysis. 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 433-444. Online publication date: Sep-2021
https://doi.org/10.1109/CLOUD53861.2021.00058
De Capitani di Vimercati S, Foresti S, Jajodia S, Livraga G, Paraboschi S, Samarati P (2021) An authorization model for query execution in the cloud. The VLDB Journal 31:3 (555-579). Online publication date: 6-Nov-2021
https://doi.org/10.1007/s00778-021-00709-x
De Capitani di Vimercati S, Foresti S, Jajodia S, Livraga G, Paraboschi S, Samarati P (2021) Distributed Query Evaluation over Encrypted Data. Data and Applications Security and Privacy XXXV, 96-114. Online publication date: 14-Jul-2021
https://doi.org/10.1007/978-3-030-81242-3_6