DOI: 10.1145/2987550.2987553
research-article

Adaptive Caching in Big SQL using the HDFS Cache

Published: 05 October 2016

Abstract

The memory and storage hierarchy in database systems is currently undergoing a radical evolution in the context of Big Data systems. SQL-on-Hadoop systems share data with other applications in the Big Data ecosystem by storing their data in HDFS, using open file formats. However, they do not provide automatic caching mechanisms for storing data in memory. In this paper, we describe the architecture of IBM Big SQL and its use of the HDFS cache as an alternative to the traditional buffer pool, allowing in-memory data to be shared with other Big Data applications. We design novel adaptive caching algorithms for Big SQL tailored to the challenges of such an external cache scenario. Our experimental evaluation shows that only our adaptive algorithms perform well for diverse workload characteristics, and are able to adapt to evolving data access patterns. Finally, we discuss our experiences in addressing the new challenges imposed by external caching and summarize our insights about how to direct ongoing architectural evolution of external caching mechanisms.
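To make the external-cache setting concrete, the following is a minimal sketch of a file-granularity cache that a query engine manages from the outside, deciding which files occupy a fixed memory budget. It is not the adaptive algorithm from the paper and it does not talk to HDFS: the class name `ExternalCacheSim`, its methods, and the plain LRU eviction rule are all assumptions chosen only for illustration.

```python
from collections import OrderedDict


class ExternalCacheSim:
    """Toy model of an external, file-granularity cache with LRU eviction.

    Hypothetical illustration only: this is plain LRU, not the paper's
    adaptive algorithm, and nothing here calls into HDFS.
    """

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Maps path -> size; insertion order doubles as recency order.
        self._files = OrderedDict()

    def access(self, path: str, size_bytes: int) -> bool:
        """Record one read of `path`; return True on a cache hit."""
        if path in self._files:
            self._files.move_to_end(path)  # mark as most recently used
            return True
        if size_bytes > self.capacity_bytes:
            return False  # file can never fit, so bypass the cache
        # Evict least recently used files until the new file fits.
        while self.used_bytes + size_bytes > self.capacity_bytes:
            _, evicted_size = self._files.popitem(last=False)
            self.used_bytes -= evicted_size
        self._files[path] = size_bytes
        self.used_bytes += size_bytes
        return False

    def cached_paths(self) -> list:
        """Return cached paths from least to most recently used."""
        return list(self._files)


if __name__ == "__main__":
    cache = ExternalCacheSim(capacity_bytes=100)
    for path, size in [("/t/a", 60), ("/t/b", 30), ("/t/a", 60), ("/t/c", 40)]:
        print(path, "hit" if cache.access(path, size) else "miss")
    print(cache.cached_paths())
```

A fixed policy such as this one is the kind of baseline an adaptive scheme would be compared against; the sketch only shows where the caching decision sits when the cache lives outside the engine's own buffer pool.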




    Published In

    SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
    October 2016
    534 pages
    ISBN: 9781450345255
    DOI: 10.1145/2987550
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. HDFS Caching
    2. SQL-on-Hadoop

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '16: ACM Symposium on Cloud Computing
    October 5 - 7, 2016
    Santa Clara, CA, USA

    Acceptance Rates

    SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;
    Overall Acceptance Rate 169 of 722 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 18
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 23 Dec 2024

    Cited By

    • (2024) M4: A Framework for Per-Flow Quantile Estimation. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4787-4800). DOI: 10.1109/ICDE60146.2024.00364. Online publication date: 13-May-2024
    • (2024) DAG-aware harmonizing job scheduling and data caching for disaggregated analytics frameworks. Future Generation Computer Systems 156 (116-129). DOI: 10.1016/j.future.2024.03.005. Online publication date: Jul-2024
    • (2023) Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems. ACM Transactions on Database Systems 48:4 (1-40). DOI: 10.1145/3625389. Online publication date: 13-Nov-2023
    • (2023) Adaptive and Scalable Caching With Erasure Codes in Distributed Cloud-Edge Storage Systems. IEEE Transactions on Cloud Computing 11:2 (1840-1853). DOI: 10.1109/TCC.2022.3168662. Online publication date: 1-Apr-2023
    • (2022) CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning. Journal of Cloud Computing: Advances, Systems and Applications 11:1. DOI: 10.1186/s13677-022-00322-5. Online publication date: 19-Sep-2022
    • (2022) PeriodicSketch: Finding Periodic Items in Data Streams. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (96-109). DOI: 10.1109/ICDE53745.2022.00012. Online publication date: May-2022
    • (2021) Dynamic Cache Management of Cloud RAN and Multi-Access Edge Computing for 5G Networks. Research Anthology on Developing and Optimizing 5G Networks and the Impact on Society (276-308). DOI: 10.4018/978-1-7998-7708-0.ch013. Online publication date: 2021
    • (2021) Trident. Proceedings of the VLDB Endowment 14:9 (1570-1582). DOI: 10.14778/3461535.3461545. Online publication date: 22-Oct-2021
    • (2021) Towards High-Efficient Transaction Commitment in a Virtualized and Sustainable RDBMS. IEEE Transactions on Sustainable Computing 6:3 (507-521). DOI: 10.1109/TSUSC.2019.2890841. Online publication date: 1-Jul-2021
    • (2020) On Efficient Cache Management of Cloud Radio Access Networks for 5G Mobile Networks. Fundamental and Supportive Technologies for 5G Mobile Networks (159-186). DOI: 10.4018/978-1-7998-1152-7.ch007. Online publication date: 2020
