FlexpushdownDB: rethinking computation pushdown for cloud OLAP DBMSs

  • Regular Paper
The VLDB Journal

Abstract

Modern cloud-native OLAP databases adopt a storage-disaggregation architecture that separates the management of computation and storage. A major bottleneck in such an architecture is the network connecting the computation and storage layers. Computation pushdown, which offloads some computation tasks to the storage layer to reduce network traffic, is a promising solution to this problem. This paper presents FlexPushdownDB (FPDB), in which we revisit the design of computation pushdown in a storage-disaggregation architecture and introduce several optimizations to further accelerate query processing. First, FPDB supports hybrid query execution, which combines local computation on cached data with computation pushdown to cloud storage at a fine granularity. Within the cache, FPDB uses a novel Weighted-LFU cache replacement policy that takes into account the cost of pushdown computation. Second, we design adaptive pushdown, a new mechanism that avoids throttling storage-layer computation during pushdown by pushing a request back to the computation layer at runtime if storage-layer computational resources are insufficient. Finally, we derive a general principle for identifying pushdown-amenable computational tasks by summarizing common patterns of pushdown capabilities in existing systems, and we propose two new pushdown operators: selection bitmap and distributed data shuffle. Evaluation on SSB and TPC-H shows that the three optimizations can improve performance by 2.2×, 1.9×, and 3×, respectively.
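
To make the adaptive pushdown idea from the abstract concrete, the following is a minimal sketch in Python and not FPDB's actual implementation. It assumes a toy storage node with a fixed number of execution slots; the names `StorageNode`, `try_pushdown`, `get_raw`, and `scan` are hypothetical and chosen only for illustration. When no slot is free, the storage node declines the request and the computation layer falls back to reading the raw rows and filtering them locally.

```python
from dataclasses import dataclass


@dataclass
class StorageNode:
    """Toy storage node with a fixed number of pushdown slots (hypothetical model)."""
    rows: list
    slots: int     # how many pushdown requests the node is willing to run at once
    busy: int = 0  # slots currently in use

    def try_pushdown(self, predicate):
        """Run the filter at storage if a slot is free; return None to push back."""
        if self.busy >= self.slots:
            return None  # storage-side compute is insufficient: push the request back
        self.busy += 1
        try:
            return [r for r in self.rows if predicate(r)]
        finally:
            self.busy -= 1

    def get_raw(self):
        """Plain read with no storage-side computation."""
        return list(self.rows)


def scan(node, predicate):
    """Return (matching_rows, where_filtered), falling back to local filtering."""
    result = node.try_pushdown(predicate)
    if result is not None:
        return result, "storage"
    # Pushback path: fetch the raw data and evaluate the predicate in the compute layer.
    return [r for r in node.get_raw() if predicate(r)], "compute"


if __name__ == "__main__":
    node = StorageNode(rows=[{"x": 1}, {"x": 5}, {"x": 9}], slots=1)
    print(scan(node, lambda r: r["x"] > 4))  # filtered at storage
    node.busy = 1                            # simulate a saturated storage node
    print(scan(node, lambda r: r["x"] > 4))  # same rows, filtered in the compute layer
```

Both paths return the same rows; only the place where the predicate is evaluated changes. In this sketch the pushback path transfers every row over the network, which is the trade-off a runtime decision of this kind has to weigh against waiting for storage-side resources.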

Notes

  1. https://github.com/cloud-olap/FlexPushdownDB/tree/vldbj_24

Author information

Corresponding author

Correspondence to Yifei Yang.

About this article

Cite this article

Yang, Y., Yu, X., Serafini, M. et al. FlexpushdownDB: rethinking computation pushdown for cloud OLAP DBMSs. The VLDB Journal 33, 1643–1670 (2024). https://doi.org/10.1007/s00778-024-00867-8

  • DOI: https://doi.org/10.1007/s00778-024-00867-8
