Abstract
This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e., a representative workload, over time. To construct such a workload in a principled manner, we formalize the notions of workload representativity and coverage. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is computationally hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs.
Similar content being viewed by others
Notes
K‑medoids is an iterative greedy algorithm that chooses \(k\) cluster centers, assigns all points to the closest center and iteratively refines the points in each cluster.
Hierarchical clustering is a top-down approach where all points start in a single cluster and the algorithm recursively splits the points into \(k\) disjoint clusters.
A small sample is chosen to make sure that clustering algorithms can finish execution.
References
(2020) SQL Server execution statistics. https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-stats-transact-sql?view=sql-server-ver15. Accessed 17 Nov 2020
(2020) TPC‑H Benchmark. http://www.tpc.org/tpch. Accessed 17 Nov 2020
Agrawal S, Chaudhuri S, Kollar L, Marathe A, Narasayya V, Syamala M (2005) Database tuning advisor for microsoft sql server 2005. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp 930–932
Chaudhuri S, Gupta Kumar A, Narasayya V (2002) Compressing sql workloads. In: Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, pp 488–499
Chaudhuri S, Narasayya V, Ganesan P (2003) Primitives for workload summarization and implications for SQL. In: Proceedings 2003 VLDB Conference. Elsevier, pp 730–741
Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on Cloud computing, pp 143–154
Shaleen Deep, Gruenheid A, Paraschos Koutris, Naughton J, Viglas S (2020) Comprehensive and efficient workload compression. https://arxiv.org/abs/2011.05549. Accessed 17 Nov 2020
DeWitt DJ (1993) The wisconsin benchmark: past, present, and future
Jain S, Howe B (2019) Query2Vec: NLP meets databases for generalized workload Analytics. CIDR.
Krause A, Guestrin C (2005) A note on the budgeted maximization of submodular functions. Carnegie Mellon University. Center for Automated Learning and Discovery
Kul G, Luong D, Xie T, Coonan P, Chandola V, Kennedy O, Upadhyaya S (2016) Ettu: Analyzing query intents in corporate databases. In: International World Wide Web Conferences Steering Committee (ed) Proceedings of the 25th International Conference Companion on World Wide Web, pp 463–466
Kul G, Luong D, Xie T, Coonan P, Chandola V, Kennedy O, Upadhyaya S (2016) Summarizing large query logs in ettu. arXiv preprint arXiv:1608.01013
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 420–429
Macke S, Yiming Z, Silu H, Parameswaran A (2018) Adaptive sampling for rapidly matching histograms. Proc Vldb Endow 11(10):1262–1275
O’Neil PE, O’Neil EJ, Chen X (2007) The star schema benchmark (SSB)
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the article.
Additional information
This work was done while the authors Gruenheid, Viglas and Naughton were employed at Google.
Rights and permissions
Springer Nature oder sein Lizenzgeber (z.B. eine Gesellschaft oder ein*e andere*r Vertragspartner*in) hält die ausschließlichen Nutzungsrechte an diesem Artikel kraft eines Verlagsvertrags mit dem/den Autor*in(nen) oder anderen Rechteinhaber*in(nen); die Selbstarchivierung der akzeptierten Manuskriptversion dieses Artikels durch Autor*in(nen) unterliegt ausschließlich den Bedingungen dieses Verlagsvertrags und dem geltenden Recht.
About this article
Cite this article
Deep, S., Gruenheid, A., Koutris, P. et al. Comprehensive and Efficient Workload Summarization. Datenbank Spektrum 22, 249–256 (2022). https://doi.org/10.1007/s13222-022-00427-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13222-022-00427-w