[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Load Balancing with Job-Size Testing: Performance Improvement or Degradation?

Published: 17 April 2024 Publication History

Abstract

In the context of decision making under explorable uncertainty, scheduling with testing is a powerful technique used in the management of computer systems to improve performance via better job-dispatching decisions. Upon job arrival, a scheduler may run some testing algorithm against the job to extract some information about its structure, e.g., its size, and properly classify it. The acquisition of such knowledge comes with a cost because the testing algorithm delays the dispatching decisions, though this is under control. In this article, we analyze the impact of such extra cost in a load balancing setting by investigating the following questions: does it really pay off to test jobs? If so, under which conditions? Under mild assumptions connecting the information extracted by the testing algorithm in relationship with its running time, we show that whether scheduling with testing brings a performance degradation or improvement strongly depends on the traffic conditions, system size and the coefficient of variation of job sizes. Thus, the general answer to the above questions is non-trivial and some care should be considered when deploying a testing policy. Our results are achieved by proposing a load balancing model for scheduling with testing that we analyze in two limiting regimes. When the number of servers grows to infinity in proportion to the network demand, we show that job-size testing actually degrades performance unless short jobs can be predicted reliably almost instantaneously and the network load is sufficiently high. When the coefficient of variation of job sizes grows to infinity, we construct testing policies inducing an arbitrarily large performance gain with respect to running jobs untested.

References

[1]
Susanne Albers and Alexander Eckl. 2020. Explorable uncertainty in scheduling with non-uniform testing times. In Approximation and Online Algorithms: 18th International Workshop, WAOA 2020, Virtual Event, September 9–10, 2020, Revised Selected Articles. Springer-Verlag, Berlin, 127–142.
[2]
Susanne Albers and Alexander Eckl. 2021. Scheduling with testing on multiple identical parallel machines. In Algorithms and Data Structures: 17th International Symposium, WADS 2021, August 9-11, 2021, Proceedings. Springer-Verlag, Berlin, 29–42.
[3]
Saed Alizamir, Francis de Véricourt, and Peng Sun. 2013. Diagnostic accuracy under congestion. Management Science 59, 1 (2013), 157–171.
[4]
Jonatha Anselmi. 2020. Combining size-based load balancing with round-robin for scalable low latency. IEEE Transactions on Parallel and Distributed Systems 31, 4 (2020), 886–896.
[5]
J. Anselmi and J. Doncel. 2019. Asymptotically optimal size-interval task assignments. IEEE Transactions on Parallel and Distributed Systems 30, 11 (nov2019), 2422–2433.
[6]
S. Asmussen and. 2003. Applied Probability and Queues. Springer. 86013173
[7]
Eitan Bachmat and Josu Doncel. 2021. Size-based routing policies: Non-asymptotic analysis and design of decentralized systems. Sensors 21, 8 (2021). https://www.mdpi.com/1424-8220/21/8/2701
[8]
Eitan Bachmat, Josu Doncel, and Hagit Sarfati. 2019. Performance and stability analysis of the task assignment based on guessing size routing policy. In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). 1–13.
[9]
Eitan Bachmat and Hagit Sarfati. 2010. Analysis of SITA policies. Perform. Eval. 67, 2 (Feb.2010), 102–120.
[10]
Cynthia Bailey Lee, Yael Schwartzman, Jennifer Hardy, and Allan Snavely. 2005. Are user runtime estimates inherently inaccurate?. In Job Scheduling Strategies for Parallel Processing, Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer Berlin, Berlin, 253–263.
[11]
Luis Diego Briceno, Bhavesh Khemka, Howard Jay Siegel, Anthony A. Maciejewski, Christopher Groër, Gregory Koenig, Gene Okonski, and Steve Poole. 2011. Time utility functions for modeling and evaluating resource allocations in a heterogeneous computing system. In IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 7–19.
[12]
João M. P. Cardoso, José Gabriel F. Coutinho, and Pedro C. Diniz. 2017. Chapter 5 - Source code transformations and optimizations. In Embedded Computing for High Performance, João M. P. Cardoso, José Gabriel F. Coutinho, and Pedro C. Diniz (Eds.). Morgan Kaufmann, Boston, 137–183.
[13]
Wenyan Chen, Kejiang Ye, Yang Wang, Guoyao Xu, and Cheng-Zhong Xu. 2018. How does the workload look like in production cloud? Analysis and clustering of workloads on Alibaba cluster trace. In 2018 IEEE 24th Int. Conf. on Parallel and Distributed Systems (ICPADS). 102–109.
[14]
Robert J. Adler, Raisa E. Feldman, and Murad S. Taqqu (Eds.). 1998. A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Birkhauser Boston Inc.
[15]
Mark Van der Boor, Sem C. Borst, Johan S. H. Van Leeuwaarden, and Debankur Mukherjee. 2022. Scalable load balancing in networked systems: A survey of recent advances. SIAM Rev. 64, 3 (2022), 554–622.
[16]
Sheng Di, Derrick Kondo, and Walfredo Cirne. 2012. Characterization and comparison of cloud versus grid workloads. In 2012 IEEE International Conference on Cluster Computing. 230–238.
[17]
Fanny Dufossé, Christoph Dürr, Noël Nadal, Denis Trystram, and Óscar C. Vásquez. 2022. Scheduling with a processing time oracle. Applied Mathematical Modelling 104 (2022), 701–720.
[18]
Christoph Dürr, Thomas Erlebach, Nicole Megow, and Julie Meißner. 2018. Scheduling with explorable uncertainty. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018)(Leibniz International Proceedings in Informatics (LIPIcs), Vol. 94), Anna R. Karlin (Ed.). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 30:1–30:14.
[19]
Christoph Dürr, Thomas Erlebach, Nicole Megow, and Julie Meißner. 2020. An adversarial model for scheduling with testing. Algorithmica 82, 12 (2020), 3630–3675.
[20]
Muhammad El-Taha and Bacel Maddah. 2006. Allocation of service time in a multiserver system. Management Science 52, 4 (2006), 623–637.
[21]
Yuping Fan, Paul Rich, William E. Allcock, Michael E. Papka, and Zhiling Lan. 2017. Trade-off between prediction accuracy and underestimation rate in job runtime estimates. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). 530–540.
[22]
Dror G. Feitelson. 2015. Workload Modeling for Computer Systems Performance Evaluation (1st ed.). Cambridge University Press, USA.
[23]
Ana Gainaru, Hongyang Sun, Guillaume Aupy, Yuankai Huo, Bennett A. Landman, and Padma Raghavan. 2019. On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows. The International Journal of High Performance Computing Applications 33, 6 (2019), 1140–1158.
[24]
Mor Harchol-Balter. 2002. Task assignment with unknown duration. J. ACM 49, 2 (2002), 260–288.
[25]
Mor Harchol-Balter, Mark E. Crovella, and Cristina D. Murta. 1999. On choosing a task assignment policy for a distributed server system. J. Parallel and Distrib. Comput. 59, 2 (1999), 204–228.
[26]
Mor Harchol-Balter and Allen B. Downey. 1996. Exploiting process lifetime distributions for dynamic load balancing. In Proceedings of the Int. Conf. on Measurement and Modeling of Computer Systems (Philadelphia, Pennsylvania, USA) (SIGMETRICS ’96). ACM, New York, NY, USA, 13–24.
[27]
Mor Harchol-Balter, Alan Scheller-Wolf, and Andrew R. Young. 2009. Surprising results on task assignment in server farms with high-variability workloads. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (Seattle, WA, USA) (SIGMETRICS ’09). ACM, New York, NY, USA, 287–298.
[28]
Esa Hyytiä and Rhonda Righter. 2024. Towards the optimal dynamic size-aware dispatching. Performance Evaluation 164 (2024), 102396.
[29]
Retsef Levi, Thomas Magnanti, and Yaron Shaposhnik. 2019. Scheduling with testing. Management Science 65, 2 (2019), 776–793.
[30]
Alex F. Mills, Nilay Tanık Argon, and Serhan Ziya. 2013. Resource-based patient prioritization in mass-casualty incidents. Manufacturing & Service Operations Management 15, 3 (2013), 361–377.
[31]
Mina Naghshnejad and Mukesh Singhal. 2020. A hybrid scheduling platform: A runtime prediction reliability aware scheduling platform to improve HPC scheduling performance. The Journal of Supercomputing 76 (2020), 28 pages.
[32]
G. R. Nudd, D. J. Kerbyson, E. Papaefstathiou, S. C. Perry, J. S. Harper, and D. V. Wilcox. 2000. Pace—A toolset for the performance prediction of parallel and distributed systems. The International Journal of High Performance Computing Applications 14, 3 (2000), 228–251.
[33]
Alan Scheller-Wolf and Karl Sigman. 1997. Delay moments for FIFO GI/GI/s queues. Queueing Systems 25, 1/4 (Jan.1997), 77–95.
[34]
Dan Tsafrir, Yoav Etsion, and Dror G. Feitelson. 2007. Backfilling using system-generated predictions rather than user runtime estimates. IEEE Transactions on Parallel and Distributed Systems 18, 6 (2007), 789–803.
[35]
Y. Wiseman, K. Schwan, and P. Widener. 2004. Efficient end to end data exchange using configurable compression. In 24th Int. Conf. on Distributed Computing Systems. 228–235.
[36]
Carl Witt, Marc Bux, Wladislaw Gusew, and Ulf Leser. 2019. Predictive performance modeling for distributed batch processing using black box monitoring and machine learning. Information Systems 82 (2019), 33–52.
[37]
Bo Zhang and Bert Zwart. 2013. Steady-state analysis for multiserver queues under size interval task assignment in the quality-driven regime. Math. of Op. Res. 38, 3 (2013), 504–525.
[38]
Darko Zivanovic, Milan Pavlovic, Milan Radulovic, Hyunsung Shin, Jongpil Son, Sally A. Mckee, Paul M. Carpenter, Petar Radojković, and Eduard Ayguadé. 2017. Main memory in HPC: Do we need more or could we live with less? ACM Trans. Archit. Code Optim. 14, 1, Article 3 (Mar.2017), 26 pages.
[39]
Salah Zrigui, Raphael Y. de Camargo, Arnaud Legrand, and Denis Trystram. 2022. Improving the performance of batch schedulers using online job runtime classification. J. Parallel and Distrib. Comput. 164 (2022), 83–95.

Cited By

View all
  • (2025)Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job ModelIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349363136:1(43-54)Online publication date: Jan-2025

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Modeling and Performance Evaluation of Computing Systems
ACM Transactions on Modeling and Performance Evaluation of Computing Systems  Volume 9, Issue 2
June 2024
103 pages
EISSN:2376-3647
DOI:10.1145/3613566
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2024
Online AM: 04 March 2024
Accepted: 25 February 2024
Revised: 17 January 2024
Received: 17 October 2023
Published in TOMPECS Volume 9, Issue 2

Check for updates

Author Tags

  1. Load balancing
  2. dispatching policies
  3. size-based routing
  4. scheduling with testing
  5. explorable uncertainty

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)137
  • Downloads (Last 6 weeks)4
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job ModelIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349363136:1(43-54)Online publication date: Jan-2025

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media