[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Supporting scalable analytics with latency constraints

Published: 01 July 2015 Publication History

Abstract

Recently there has been a significant interest in building big data analytics systems that can handle both "big data" and "fast data". Our work is strongly motivated by recent real-world use cases that point to the need for a general, unified data processing framework to support analytical queries with different latency requirements. Toward this goal, we start with an analysis of existing big data systems to understand the causes of high latency. We then propose an extended architecture with mini-batches as granularity for computation and shuffling, and augment it with new model-driven resource allocation and runtime scheduling techniques to meet user latency requirements while maximizing throughput. Results from real-world workloads show that our techniques, implemented in Incremental Hadoop, reduce its latency from tens of seconds to sub-second, with 2x-5x increase in throughput. Our system also outperforms state-of-the-art distributed stream systems, Storm and Spark Streaming, by 1-2 orders of magnitude when combining latency and throughput.

References

[1]
D. J. Abadi, et al. The Design of the Borealis Stream Processing Engine. In CIDR, 2005.
[2]
T. Akidau, et al. Millwheel: Fault-tolerant stream processing at internet scale. In VLDB, pp. 734--746, 2013.
[3]
R. Ananthanarayanan, et al. Photon: fault-tolerant and scalable joining of continuous data streams. In SIGMOD, pp. 577--588, 2013.
[4]
S. Asmussen. Applied probability and queues; 2nd ed. Stochastic Modelling and Applied Probability. Springer, New York, 2003.
[5]
D. Borthakur, et al. Apache hadoop goes realtime at facebook. In SIGMOD, pp. 1071--1080, 2011.
[6]
A. Brito, et al. Scalable and low-latency data processing with stream mapreduce. In CloudCom, pp. 48--58, 2011.
[7]
D. Carney, et al. Operator scheduling in a data stream manager. In VLDB, pp. 333--353, 2003.
[8]
R. Castro Fernandez, et al. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD, pp. 725--736, 2013.
[9]
S. Chandrasekaran, et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
[10]
F. Y. Chin and S. P. Fung. Improved competitive algorithms for online scheduling with partial job values. Theoretical Computer Science, 325(3):467--478, 2004.
[11]
T. Condie, et al. Mapreduce online. In NSDI, pp. 21--21, 2010.
[12]
H. David and H. Nagaraja. Order Statistics. Wiley, 2004.
[13]
M. L. Dertouzos. Control robotics: The procedural control of physical processes. In IFIF Congress, pp. 807--813, 1974.
[14]
A. D. Ferguson, et al. Jockey: Guaranteed job latency in data parallel clusters. In EuroSys, pp. 99--112, 2012.
[15]
A. Gates, et al. Building a highlevel dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414--1425, 2009.
[16]
V. Gulisano, et al. Streamcloud: An elastic and scalable data streaming system. IEEE Trans. Parallel Distrib. Syst., 23(12):2351--2365, 2012.
[17]
H. Herodotou. Hadoop performance models. Technical Report CS-2011-05, Duke Computer Science, 2011.
[18]
H. Herodotou, et al. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB, pp. 1111--1122, 2011.
[19]
V. Jalaparti, et al. Bridging the tenant-provider gap in cloud services. In SoCC, pp. 10:1--10:14, 2012.
[20]
B. Jansson. Choosing a good appointment system-a study of queues of the type (d, m, 1). Operations Research, 14(2):pp. 292--312, 1966.
[21]
J. F. C. Kingman. Some inequalities for the queue gi/g/1. Biometrika, 49(3/4):pp. 315--324, 1962.
[22]
G. Koren and D. Shasha. Dover; an optimal on-line scheduling algorithm for overloaded real-time systems. In RTSS, 1992.
[23]
V. Kumar, et al. Apache hadoop yarn: yet another resource negotiator. In SOCC, pp. 5:1--5:16, 2013.
[24]
W. Lam, et al. Muppet: Mapreduce-style processing of fast data. PVLDB, 5(12):1814--1825, 2012.
[25]
B. Li, et al. Supporting scalable analytics with latency constraints. Technical Report http://scalla.cs.umass.edu/papers/techreport2015.pdf.
[26]
B. Li, et al. A platform for scalable one-pass analytics using mapreduce. In SIGMOD, pp. 985--996, 2011.
[27]
D. Logothetis, et al. Stateful bulk processing for incremental analytics. In SoCC, pp. 51--62, 2010.
[28]
E. Mazur, et al. Towards scalable one-pass analytics using mapreduce. In IPDPS Workshops, pages 1102--1111, 2011.
[29]
G. Mishne, et al. Fast data in the era of big data: Twitter's real-time related query suggestion architecture. In SIGMOD, 2013.
[30]
K. Morton, et al. Paratimer: a progress indicator for mapreduce dags. In SIGMOD, pp. 507--518, 2010.
[31]
D. G. Murray, et al. Naiad: A timely dataflow system. In SOSP, pp. 439--455, 2013.
[32]
L. Neumeyer, et al. S4: Distributed stream computing platform. In ICDMW, pp. 170--177, 2010.
[33]
T. J. Ott. Simple inequalities for the d/g/1 queue. Operations Research, 35(4):pp. 589--597, 1987.
[34]
K. Ousterhout, et al. Sparrow: distributed, low latency scheduling. In SOSP, pp. 69--84, 2013.
[35]
Z. Qian, et al. Timestream: Reliable stream computation in the cloud. In EuroSys, pp. 1--14, 2013.
[36]
N. Tatbul, et al. Staying fit: Efficient load shedding techniques for distributed stream processing. In VLDB, pp. 159--170, 2007.
[37]
A. Toshniwal, et al. Storm@twitter. In SIGMOD, pp. 147--156, 2014.
[38]
F. Yang, et al. Sonora: A platform for continuous mobile-cloud computing. Technical Report, Microsoft Research Aisa, 2012.
[39]
M. Zaharia, et al. Discretized streams: fault-tolerant streaming computation at scale. In SOSP, pp. 423--438, 2013.
[40]
E. Zeitler and T. Risch. Massive scale-out of expensive continuous queries. PVLDB, 4(11):1181--1188, 2011.
[41]
Q. Zou, et al. From a stream of relational queries to distributed stream processing. PVLDB, 3(1-2):1394--1405, 2010.

Cited By

View all
  • (2024)Draconis: Network-Accelerated Scheduling for Microsecond-Scale WorkloadsProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650060(333-348)Online publication date: 22-Apr-2024
  • (2021)ECNNProceedings of the 2021 ACM International Conference on Intelligent Computing and its Emerging Applications10.1145/3491396.3506519(170-175)Online publication date: 28-Dec-2021
  • (2020)Resource Management and Scheduling in Distributed Stream Processing SystemsACM Computing Surveys10.1145/335539953:3(1-41)Online publication date: 28-May-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 8, Issue 11
July 2015
264 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2015
Published in PVLDB Volume 8, Issue 11

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)1
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Draconis: Network-Accelerated Scheduling for Microsecond-Scale WorkloadsProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650060(333-348)Online publication date: 22-Apr-2024
  • (2021)ECNNProceedings of the 2021 ACM International Conference on Intelligent Computing and its Emerging Applications10.1145/3491396.3506519(170-175)Online publication date: 28-Dec-2021
  • (2020)Resource Management and Scheduling in Distributed Stream Processing SystemsACM Computing Surveys10.1145/335539953:3(1-41)Online publication date: 28-May-2020
  • (2019)UDAOProceedings of the VLDB Endowment10.14778/3352063.335210312:12(1934-1937)Online publication date: 1-Aug-2019
  • (2018)HengeProceedings of the ACM Symposium on Cloud Computing10.1145/3267809.3267832(249-262)Online publication date: 11-Oct-2018
  • (2018)Recent Advancements in Event ProcessingACM Computing Surveys10.1145/317043251:2(1-36)Online publication date: 13-Feb-2018
  • (2018)Deep Learning for IoT Big Data and Streaming Analytics: A SurveyIEEE Communications Surveys & Tutorials10.1109/COMST.2018.284434120:4(2923-2960)Online publication date: 1-Oct-2018
  • (2018)Elastic CPU Cap Mechanism for Timely Dataflow ApplicationsComputational Science – ICCS 201810.1007/978-3-319-93698-7_42(554-568)Online publication date: 11-Jun-2018
  • (2017)Online Reconstruction of Structural Information from Datacenter LogsProceedings of the Twelfth European Conference on Computer Systems10.1145/3064176.3064195(344-358)Online publication date: 23-Apr-2017
  • (2017)Composed sketch framework for quantiles and cardinality queries over big data streamsProceedings of the ACM Turing 50th Celebration Conference - China10.1145/3063955.3063995(1-10)Online publication date: 12-May-2017
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media