[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3299869.3324959acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

Published: 25 June 2019 Publication History

Abstract

Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the before-mentioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.

References

[1]
{n. d.}. Arabic-2005. http://law.di.unimi.it/webdata/arabic-2005/.
[2]
{n. d.}. Cypher for Apache Spark. https://github.com/opencypher/ cypher-for-apache-spark.
[3]
{n. d.}. Floyd-Warshall Algorithm. https://en.wikipedia.org/wiki/ Floyd_Warshall_algorithm.
[4]
{n. d.}. GTgraph. http://www.cse.psu.edu/~kxm85/ software/GTgraph.
[5]
{n. d.}. LiveJournal social network. http://snap.stanford.edu/data/ com-LiveJournal.html.
[6]
{n. d.}. MATCH (SQL Graph). https://docs.microsoft.com/en-us/sql/ t-sql/queries/match-sql-graph.
[7]
{n. d.}. Multi level Marketing Model. https://en.wikipedia.org/wiki/ Multilevelmarketing.
[8]
{n. d.}. Orkut network. http://snap.stanford.edu/data/com-Orkut.html.
[9]
{n. d.}. Presto. http://prestodb.io.
[10]
{n. d.}. Recursion Example: Bill Of Materials. https://www.ibm.com/ support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql. ref.doc/doc/r0059242.html.
[11]
{n. d.}. The RaSQL language. http://rasql.org.
[12]
Foto N. Afrati, Vinayak R. Borkar, et al. 2011. Map-reduce extensions and recursive queries. In EDBT. 1--8.
[13]
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, et al. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD. 1383--1394.
[14]
David F. Bacon, Nathan Bales, et al. 2017. Spanner: Becoming a SQL System. In SIGMOD 2017. 331--343.
[15]
Francois Bancilhon. 1986. Naive Evaluation of Recursively Defined Relations. In KBMS. Springer-Verlag, 165--178.
[16]
Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). arXiv:1508.03619 http://arxiv.org/abs/1508.03619
[17]
Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE. 1151--1162.
[18]
Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin. 2014. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB 7, 13 (2014), 1441--1451.
[19]
Yingyi Bu, Vinayak R. Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. 2014. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. PVLDB 8, 2 (2014), 161--172.
[20]
Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB 8, 12 (2015), 1804--1815.
[21]
Tyson Condie, Ariyam Das, Matteo Interlandi, Alexander Shkapsky, Mohan Yang, and Carlo Zaniolo. 2018. Scaling-up reasoning and advanced analytics on BigData. TPLP 18, 5--6 (2018), 806--845.
[22]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
[23]
Nadime Francis, Alastair Green, et al. 2018. Cypher: An Evolving Query Language for Property Graphs. In SIGMOD. 1433--1445.
[24]
Filippo Furfaro, Sergio Greco, Sumit Ganguly, and Carlo Zaniolo. 2002. Pushing Extrema Aggregates to Optimize Logic Queries. Inf. Syst. 27, 5 (July 2002), 321--343.
[25]
Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1991. Minimum and Maximum Predicates in Logic Programming. In PODS. 154--163.
[26]
Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1995. Extrema Predicates in Deductive Databases. JCS 51, 2 (1995), 244--259.
[27]
Allen Gelder. 1993. Foundations of aggregation in deductive databases. In Deductive and Object-Oriented Databases. Springer, 13--34.
[28]
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In USENIX, OSDI. 17--30.
[29]
Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI. 599--613.
[30]
Goetz Graefe and William McKenna. 1991. The Volcano optimizer generator. Technical Report. Colorado Univ at Boulder.
[31]
Huahai He and Ambuj K. Singh. 2008. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD. 405--418.
[32]
David B. Kemp and Peter J. Stuckey. 1991. Semantics of Logic Programs with Aggregates. In ISLP. 387--401.
[33]
Jay Kreps, Neha Narkhede, Jun Rao, et al. 2011. Kafka: A distributed messaging system for log processing.
[34]
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. 2010. What is Twitter, a social network or a news media?. In WWW, Raleigh, North Carolina, USA, April 26--30, 2010. 591--600.
[35]
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 5, 8 (2012), 716--727.
[36]
Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation. PVLDB 8, 3 (2014), 281--292.
[37]
Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In SIGMOD. 135--146.
[38]
Mirjana Mazuran, Edoardo Serra, and Carlo Zaniolo. 2013. Extending the Power of Datalog Recursion. VLDB J. 22, 4 (2013), 471--493.
[39]
Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS {XV}).
[40]
Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In CIDR.
[41]
Svilen R. Mihaylov, Zachary G. Ives, and Sudipto Guha. 2012. REX: Recursive, Delta-based Data-centric Computation. PVLDB 5, 11 (July 2012), 1280--1291.
[42]
Jack Minker, Dietmar Seipel, and Carlo Zaniolo. 2014. Logic and Databases: A History of Deductive Databases. In Computational Logic. 571--627.
[43]
Inderpal Singh Mumick, Hamid Pirahesh, and Raghu Ramakrishnan. 1990. The Magic of Duplicates and Aggregates. In PVLDB. 264--277.
[44]
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In SOSP. 439--455.
[45]
Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. PVLDB 4, 9 (2011), 539--550.
[46]
Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The SQL++ Semi-structured Data Model and Query Language. CoRR abs/1405.3631 (2014).
[47]
Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. 2009. Semantics and complexity of SPARQL. Database Syst. 34, 3 (2009), 16:1--16:45.
[48]
Kenneth A Ross and Yehoshua Sagiv. 1992. Monotonic aggregation in deductive databases. In PODS. 114--126.
[49]
Ariyam Das Sahil, M. Gandhi, and Carlo Zaniolo. 2018. ASTRO: A Datalog System for Advanced Stream Reasoning. CIKM (2018).
[50]
Marianne Shaw, Paraschos Koutris, Bill Howe, and Dan Suciu. 2012. Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop. In Datalog in Academia and Industry. 165--176.
[51]
Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big Data Analytics with Datalog Queries on Spark. In SIGMOD 2016. 1135--1149.
[52]
S. Sudarshan and Raghu Ramakrishnan. 1991. Aggregation and Relevance in Deductive Databases. In PVLDB. 501--511.
[53]
Ashish Thusoo, Joydeep Sen Sarma, et al. 2010. Hive - a petabyte scale data warehouse using Hadoop. In ICDE 2010. 996--1005.
[54]
Oskar van Rest, Sungpack Hong, et al. 2016. PGQL: a property graph query language. In GRADES Workshop. 7.
[55]
Jingjing Wang, Tobin Baker, et al. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR.
[56]
Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. 2015. Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. PVLDB 8, 12 (2015), 1542--1553.
[57]
Markus Weimer, Tyson Condie, and Raghu Ramakrishnan. 2011. Machine learning in ScalOps, a higher order cloud computing language. In BigLearn.
[58]
Mohan Yang and Carlo Zaniolo. 2014. Main memory evaluation of recursive queries on multicore machines. In IEEE Big Data. 251--260.
[59]
Yuan Yu, Michael Isard, et al. 2008. DryadLINQ: A System for General- Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI. 1--14.
[60]
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing withWorking Sets. In HotCloud.
[61]
Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari. 1997. Advanced Database Systems. Morgan Kaufmann.
[62]
Carlo Zaniolo, Mohan Yang, et al. 2017. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP 17, 5--6 (2017), 1048--1065.
[63]
Carlo Zaniolo, Mohan Yang, et al. 2018. Declarative BigData Algorithms via Aggregates and Relational Database Dependencies. In CEUR Workshop Proceedings 2018.
[64]
Yanfeng Zhang, Qixin Gao, Lixin Gao, and CuirongWang. 2011. PrIter: A Distributed Framework for Prioritized Iterative Computations. In SOCC. Article 13, 13:1--13:14 pages.
[65]
Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636.
[66]
Xin Zhou, Fusheng Wang, and Carlo Zaniolo. 2006. Efficient temporal coalescing query support in relational database systems. In DEXA. Springer, 676--686.

Cited By

View all

Index Terms

  1. RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
      June 2019
      2106 pages
      ISBN:9781450356435
      DOI:10.1145/3299869
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 June 2019

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. aggregates in recursion
      2. declarative algorithms in sql
      3. distributed computation
      4. recursive query

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '19
      Sponsor:
      SIGMOD/PODS '19: International Conference on Management of Data
      June 30 - July 5, 2019
      Amsterdam, Netherlands

      Acceptance Rates

      SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)53
      • Downloads (Last 6 weeks)10
      Reflects downloads up to 09 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Serendipitous, Open Big Data Management and Analytics: The SeDaSOMA FrameworkModelling10.3390/modelling50300615:3(1173-1196)Online publication date: 4-Sep-2024
      • (2024)Convergence of datalog over (Pre-) SemiringsJournal of the ACM10.1145/364302771:2(1-55)Online publication date: 10-Apr-2024
      • (2024)Optimizing Nested Recursive QueriesProceedings of the ACM on Management of Data10.1145/36392712:1(1-27)Online publication date: 26-Mar-2024
      • (2023)Communication-Avoiding Recursive Aggregation2023 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER52292.2023.00024(197-208)Online publication date: 31-Oct-2023
      • (2023)Finding a Second Wind: Speeding Up Graph Traversal Queries in RDBMSs Using Column-Oriented ProcessingModel and Data Engineering10.1007/978-3-031-49333-1_14(186-199)Online publication date: 22-Dec-2023
      • (2022)Optimizing Recursive Queries with Progam SynthesisProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517827(79-93)Online publication date: 10-Jun-2022
      • (2022)Recursive rules with aggregation: a simple unified semanticsJournal of Logic and Computation10.1093/logcom/exac07232:8(1659-1693)Online publication date: 14-Nov-2022
      • (2022)Fast datalog evaluation for batch and stream graph processingWorld Wide Web10.1007/s11280-021-00960-w25:2(971-1003)Online publication date: 20-Jan-2022
      • (2022)Functional Programming on Top of SQL EnginesPractical Aspects of Declarative Languages10.1007/978-3-030-94479-7_5(59-78)Online publication date: 17-Jan-2022
      • (2022)Recursive Rules with Aggregation: A Simple Unified SemanticsLogical Foundations of Computer Science10.1007/978-3-030-93100-1_11(156-179)Online publication date: 10-Jan-2022
      • Show More Cited By

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media