Load Balancing with Job-Size Testing: Performance Improvement or Degradation?

Authors: Jonatha Anselmi, Josu Doncel

ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Volume 9, Issue 2

Article No.: 8, Pages 1 - 27

https://doi.org/10.1145/3651154

Published: 17 April 2024

Abstract

In the context of decision making under explorable uncertainty, scheduling with testing is a powerful technique used in the management of computer systems to improve performance via better job-dispatching decisions. Upon job arrival, a scheduler may run a testing algorithm against the job to extract information about its structure, e.g., its size, and classify it accordingly. Acquiring this knowledge comes at a cost because the testing algorithm delays the dispatching decision, although the delay is under the scheduler's control. In this article, we analyze the impact of this extra cost in a load balancing setting by investigating the following questions: does it really pay off to test jobs? If so, under which conditions? Under mild assumptions relating the information extracted by the testing algorithm to its running time, we show that whether scheduling with testing degrades or improves performance strongly depends on the traffic conditions, the system size, and the coefficient of variation of job sizes. Thus, the general answer to the above questions is non-trivial, and some care should be taken when deploying a testing policy. Our results are obtained by proposing a load balancing model for scheduling with testing that we analyze in two limiting regimes. When the number of servers grows to infinity in proportion to the network demand, we show that job-size testing actually degrades performance unless short jobs can be predicted reliably almost instantaneously and the network load is sufficiently high. When the coefficient of variation of job sizes grows to infinity, we construct testing policies inducing an arbitrarily large performance gain with respect to running jobs untested.
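
The abstract's point that the outcome depends on the load and on the coefficient of variation of job sizes can be illustrated with a textbook formula. The sketch below is not the authors' model. It only evaluates the Pollaczek-Khinchine mean waiting time of a single M/G/1 FIFO queue, E[W] = ρ E[S] (1 + c²) / (2 (1 − ρ)), where ρ is the load, E[S] the mean job size, and c the coefficient of variation. The function name `mg1_mean_wait` and its parameters are illustrative assumptions.

```python
def mg1_mean_wait(load: float, mean_size: float, cv: float) -> float:
    """Mean waiting time in queue for one M/G/1 FIFO server.

    Uses the Pollaczek-Khinchine formula, an illustrative stand-in and
    not the load balancing model analyzed in the article.

    load:      utilization rho = arrival_rate * mean_size, in [0, 1)
    mean_size: mean job size E[S], strictly positive
    cv:        coefficient of variation of job sizes (std / mean), >= 0
    """
    if not 0.0 <= load < 1.0:
        raise ValueError("load must be in [0, 1) for the queue to be stable")
    if mean_size <= 0.0:
        raise ValueError("mean_size must be positive")
    if cv < 0.0:
        raise ValueError("cv must be non-negative")
    # E[S^2] = E[S]^2 * (1 + cv^2), so lambda * E[S^2] = load * E[S] * (1 + cv^2).
    return load * mean_size * (1.0 + cv ** 2) / (2.0 * (1.0 - load))


if __name__ == "__main__":
    # Compare deterministic (cv = 0), exponential (cv = 1), and
    # highly variable (cv = 10) job sizes at moderate and high load.
    for cv in (0.0, 1.0, 10.0):
        for load in (0.5, 0.9):
            wait = mg1_mean_wait(load, mean_size=1.0, cv=cv)
            print(f"cv={cv:5.1f} load={load:.1f} mean wait={wait:.2f}")
```

In this single-queue setting the waiting time scales with 1 + c², which gives some intuition for why separating short jobs from long ones becomes more valuable as job-size variability grows, and why any delay spent on testing has to be weighed against that gain.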

Cited By

J. Anselmi and J. Doncel. 2025. Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job Model. IEEE Transactions on Parallel and Distributed Systems 36, 1 (Jan. 2025), 43–54. https://doi.org/10.1109/TPDS.2024.3493631

Index Terms

Load Balancing with Job-Size Testing: Performance Improvement or Degradation?
1. Mathematics of computing
  1. Probability and statistics
    1. Stochastic processes
      1. Markov processes
2. Networks
  1. Network performance evaluation
    1. Network performance analysis

Published In

ACM Transactions on Modeling and Performance Evaluation of Computing Systems Volume 9, Issue 2

June 2024

103 pages

EISSN:2376-3647

DOI:10.1145/3613566

Editor:
Leana Golubchik
University of Southern California, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2024

Online AM: 04 March 2024

Accepted: 25 February 2024

Revised: 17 January 2024

Received: 17 October 2023

Published in TOMPECS Volume 9, Issue 2

Qualifiers

Research-article
