[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data

Published: 01 September 2016 Publication History

Abstract

Turn's online advertising campaigns produce petabytes of data. This data is composed of trillions of events, e.g. impressions, clicks, etc., spanning multiple years. In addition to a timestamp, each event includes hundreds of fields describing the user's attributes, campaign's attributes, attributes of where the ad was served, etc.
Advertisers need advanced analytics to monitor their running campaigns' performance, as well as to optimize future campaigns. This involves slicing and dicing the data over tens of dimensions over arbitrary time ranges. Many of these queries need to power the web portal to provide reports and dashboards. For an interactive response time, they have to have tens of milliseconds latency. At Turn's scale of operations, no existing system was able to deliver this performance in a cost effective manner.
Kodiak, a distributed analytical data platform for web-scale high-dimensional data, was built to serve this need. It relies on pre-computations to materialize thousands of views to serve these advanced queries. These views are partitioned and replicated across Kodiak's storage nodes for scalability and reliability. They are system maintained as new events arrive. At query time, the system auto-selects the most suitable view to serve each query.
Kodiak has been used in production for over a year. It hosts 2490 views for over three petabytes of raw data serving over 200K queries daily. It has median and 99% query latencies of 8 ms and 252 ms respectively. Our experiments show that its query latency is 3 orders of magnitude faster than leading big data platforms on head-to-head comparisons using Turn's query workload. Moreover, Kodiak uses 4 orders of magnitude less resources to run the same workload.

References

[1]
P. Agrawal, A. Silberstein, B. F. Cooper, U. Srivastava, and R. Ramakrishnan. Asynchronous view maintenance for vlsd databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 179--192, New York, NY, USA, 2009. ACM.
[2]
HBase: the Hadoop database. http:///hbase.apache.org/.
[3]
The Apache Cassandra database. http:///cassandra.apache.org/.
[4]
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383--1394, New York, NY, USA, 2015. ACM.
[5]
A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.
[6]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Computer Systems, 26(2), 2008.
[7]
S. Chen. Cheetah: A high performance, custom data warehouse on top of mapreduce. Proc. VLDB Endow., 3(1-2):1459--1468, Sept. 2010.
[8]
L. S. Colby, T. Griffin, L. Libkin, I. S. Mumick, and H. Trickey. Algorithms for deferred view maintenance. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 469--480, New York, NY, USA, 1996. ACM.
[9]
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277--1288, Aug. 2008.
[10]
J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI'12, pages 251--264, Berkeley, CA, USA, 2012. USENIX Association.
[11]
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205--220, New York, NY, USA, 2007. ACM.
[12]
H. Elmeleegy, Y. Li, Y. Qi, P. Wilmot, M. Wu, S. Kolay, A. Dasdan, and S. Chen. Overview of turn data management platform for digital advertising. PVLDB, 6(11):1138--1149, 2013.
[13]
K. Elmeleegy. Piranha: Optimizing short jobs in hadoop. Proc. VLDB Endow., 6(11):985--996, Aug. 2013.
[14]
A. Gupta and I. S. Mumick. Materialized views. chapter Maintenance of Materialized Views: Problems, Techniques, and Applications, pages 145--157. MIT Press, Cambridge, MA, USA, 1999.
[15]
P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: wait-free coordination for internet-scale systems. In USENIXATC'10: Proceedings of the 2010 USENIX conference on USENIX annual technical conference, pages 11--11, 2010.
[16]
K. Y. Lee and M. H. Kim. Efficient incremental maintenance of data cubes. In Proceedings of the 32Nd International Conference on Very Large Data Bases, VLDB '06, pages 823--833. VLDB Endowment, 2006.
[17]
C. Mohan. History repeats itself: Sensible and nonsensql aspects of the nosql hoopla. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 11--16, New York, NY, USA, 2013. ACM.
[18]
I. S. Mumick, D. Quass, and B. S. Mumick. Maintenance of data cubes and summary tables in a warehouse. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD '97, pages 100--111, New York, NY, USA, 1997. ACM.
[19]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, 2008.
[20]
Oracle. Automatic Storage Management. http://docs.oracle.com/cd/E11882_01/server.112/e18951/asmcon.htm.
[21]
Oracle Clusterware. http://www.oracle.com/technetwork/database/database-technologies/clusterware/overview/index.html.
[22]
L. Qiao, K. Surlaker, S. Das, T. Quiggle, B. Schulman, B. Ghosh, A. Curtis, O. Seeliger, Z. Zhang, A. Auradar, C. Beaver, G. Brandt, M. Gandhi, K. Gopalakrishna, W. Ip, S. Jgadish, S. Lu, A. Pachev, A. Ramesh, A. Sebastian, R. Shanbhag, S. Subramaniam, Y. Sun, S. Topiwala, C. Tran, J. Westerman, and D. Zhang. On brewing fresh espresso: Linkedin's distributed data serving platform. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1135--1146, New York, NY, USA, 2013. ACM.
[23]
K. A. Ross, D. Srivastava, and S. Sudarshan. Materialized view maintenance and integrity constraint checking: Trading space for time. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 447--458, New York, NY, USA, 1996. ACM.
[24]
K. Salem, K. Beyer, B. Lindsay, and R. Cochrane. How to roll a join: Asynchronous incremental view maintenance. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 129--140, New York, NY, USA, 2000. ACM.
[25]
J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte. F1: A distributed sql database that scales. In VLDB, 2013.
[26]
B. Song, S. Liu, S. Kolay, and L. Lo. Antsboa: A new time series pipeline for big data processing, analyzing and querying in online advertising application. In First IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015, Redwood City, CA, USA, March 30 - April 2, 2015, pages 223--232, 2015.
[27]
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626--1629, Aug. 2009.
[28]
M. Traverso. Presto: Interacting with petabytes of data at Facebook. https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920.
[29]
Big Data Benchmark - AMPLab. https://amplab.cs.berkeley.edu/benchmark.
[30]
VMware, Inc. Tungsten Replicator 3.0 Manual. Technical report, VMware, Inc, 2015.
[31]
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 13--24, New York, NY, USA, 2013. ACM.
[32]
F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli. Druid: A real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 157--168, New York, NY, USA, 2014. ACM.

Cited By

View all
  • (2020)Shared Execution Techniques for Business Data Analytics over Big Data StreamsProceedings of the 32nd International Conference on Scientific and Statistical Database Management10.1145/3400903.3400932(1-4)Online publication date: 7-Jul-2020
  • (2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
  • (2018)Moment-based quantile sketches for efficient high cardinality aggregation queriesProceedings of the VLDB Endowment10.14778/3236187.323621211:11(1647-1660)Online publication date: 1-Jul-2018
  • Show More Cited By
  1. Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 9, Issue 13
    September 2016
    378 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 September 2016
    Published in PVLDB Volume 9, Issue 13

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)7
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 03 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2020)Shared Execution Techniques for Business Data Analytics over Big Data StreamsProceedings of the 32nd International Conference on Scientific and Statistical Database Management10.1145/3400903.3400932(1-4)Online publication date: 7-Jul-2020
    • (2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
    • (2018)Moment-based quantile sketches for efficient high cardinality aggregation queriesProceedings of the VLDB Endowment10.14778/3236187.323621211:11(1647-1660)Online publication date: 1-Jul-2018
    • (2018)Selecting subexpressions to materialize at datacenter scaleProceedings of the VLDB Endowment10.14778/3192965.319297111:7(800-812)Online publication date: 1-Mar-2018
    • (2018)ScrubProceedings of the Thirteenth EuroSys Conference10.1145/3190508.3190513(1-15)Online publication date: 23-Apr-2018
    • (2018)Computation Reuse in Analytics Job Service at MicrosoftProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3190656(191-203)Online publication date: 27-May-2018

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media