[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

Optimal Scheduling in the Multiserver-job Model under Heavy Traffic

Published: 08 December 2022 Publication History

Abstract

Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response time. In simpler settings, such as various known-size single-server-job settings, minimizing mean response time is merely a matter of prioritizing small jobs. However, for the multiserver-job system, prioritizing small jobs is not enough, because we must also ensure servers are not unnecessarily left idle. Thus, minimizing mean response time requires prioritizing small jobs while simultaneously maximizing throughput. Our question is how to achieve these joint objectives.
We devise the ServerFilling-SRPT scheduling policy, which is the first policy to minimize mean response time in the multiserver-job model in the heavy traffic limit. In addition to proving this heavy-traffic result, we present empirical evidence that ServerFilling-SRPT outperforms all existing scheduling policies for all loads, with improvements by orders of magnitude at higher loads.
Because ServerFilling-SRPT requires knowing job sizes, we also define the ServerFilling-Gittins policy, which is optimal when sizes are unknown or partially known.

References

[1]
Timothy G. Armstrong, Zhao Zhang, Daniel S. Katz, Michael Wilde, and Ian T. Foster. 2010. Scheduling many-task workloads on supercomputers: Dealing with trailing tasks. In 2010 3rd Workshop on Many-Task Computing on Grids and Supercomputers. 1--10.
[2]
Edward Arthurs and Joseph S. Kaufman. 1979. Sizing a message store subject to blocking criteria. In Proceedings of the third international symposium on modelling and performance evaluation of computer systems: Performance of computer systems. 547--564.
[3]
Percy H. Brill and Linda Green. 1984. Queues in Which Customers Receive Simultaneous Service from a Random Number of Servers: A System Point Approach. Management Science, Vol. 30, 1 (1984), 51--68.
[4]
Danilo Carastan-Santos, Raphael Y. De Camargo, Denis Trystram, and Salah Zrigui. 2019. One Can Only Gain by Replacing EASY Backfilling: A Simple Scheduling Policies Case Study. In 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). 1--10.
[5]
Walfredo Cirne and Francine Berman. 2001. A model for moldable supercomputer jobs. In Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001. 8 pp.
[6]
Allen B. Downey. 1997. Using queue time predictions for processor allocation. In workshop on Job Scheduling Strategies for Parallel Processing. Springer, 35--57.
[7]
Yoav Etsion and Dan Tsafrir. 2005. A short survey of commercial cluster batch schedulers. School of Computer Science and Engineering, The Hebrew University of Jerusalem, Vol. 44221 (2005), 2005--13.
[8]
Dror G. Feitelson and Larry Rudolph. 1996. Toward convergence in job schedulers for parallel supercomputers. In Job Scheduling Strategies for Parallel Processing, Dror G. Feitelson and Larry Rudolph (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--26.
[9]
Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn. 2004. Parallel job scheduling--a status report. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1--16.
[10]
Dimitrios Filippopoulos and Helen Karatza. 2007. An M/M/2 parallel system model with pure space sharing among rigid jobs. Mathematical and Computer Modelling, Vol. 45, 5 (2007), 491 -- 530.
[11]
Javad Ghaderi. 2016. Randomized algorithms for scheduling VMs in the cloud. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications. 1--9.
[12]
John Gittins, Kevin Glazebrook, and Richard Weber. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
[13]
Isaac Grosof, Mor Harchol-Balter, and Alan Scheller-Wolf. 2021. WCFS: A new framework for analyzing multiserver systems. arXiv preprint arXiv:2109.12663 (2021). Electronic companion (full version) of the paper by the same name in Queueing Systems.
[14]
Isaac Grosof, Mor Harchol-Balter, and Alan Scheller-Wolf. 2022. WCFS: A new framework for analyzing multiserver systems. Queueing Systems (2022).
[15]
Isaac Grosof, Ziv Scully, and Mor Harchol-Balter. 2018. SRPT for multiserver systems. Performance Evaluation, Vol. 127--128 (2018), 154--175.
[16]
Isaac Grosof, Ziv Scully, and Mor Harchol-Balter. 2019. Load Balancing Guardrails: Keeping Your Heavy Traffic on the Road to Low Response Times. Proc. ACM Meas. Anal. Comput. Syst., Vol. 3, 2, Article 42 (jun 2019), 31 pages.
[17]
Mian Guo, Quansheng Guan, and Wende Ke. 2018. Optimal Scheduling of VMs in Queueing Cloud Computing Systems With a Heterogeneous Workload. IEEE Access, Vol. 6 (2018), 15178--15191.
[18]
Mor Harchol-Balter. 2013. Performance modeling and design of computer systems: queueing theory in action. Cambridge University Press.
[19]
Mor Harchol-Balter. 2022. The multiserver job queueing model. Queueing Systems, Vol. 100, 3 (2022), 201--203.
[20]
Yige Hong. 2022. Sharp Zero-Queueing Bounds for Multi-Server Jobs. SIGMETRICS Perform. Eval. Rev., Vol. 49, 2 (jan 2022), 66--68.
[21]
Minnesota Supercomputing Institute. 2020. Queues. https://www.msi.umn.edu/queues
[22]
James Patton Jones and Bill Nitzberg. 1999. Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization. In Job Scheduling Strategies for Parallel Processing, Dror G. Feitelson and Larry Rudolph (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--16.
[23]
Sung Shick Kim. 1979. M/M/s queueing system where customers demand multiple server use. Ph.,D. Dissertation. Southern Methodist University.
[24]
Siva Theja Maguluri, Rayadurgam Srikant, and Lei Ying. 2012. Stochastic models of load balancing and scheduling in cloud computing clusters. In 2012 Proceedings IEEE Infocom. IEEE, 702--710.
[25]
Konstantinos Psychas and Javad Ghaderi. 2018. Randomized Algorithms for Scheduling Multi-Resource Jobs in the Cloud. IEEE/ACM Transactions on Networking, Vol. 26, 5 (2018), 2202--2215.
[26]
Alexander Rumyantsev, Robert Basmadjian, Sergey Astafiev, and Alexander Golovin. 2022. Three-level modeling of a speed-scaling supercomputer. Annals of Operations Research (2022), 1--29.
[27]
Alexander Rumyantsev and Evsey Morozov. 2017. Stability criterion of a multiserver model with simultaneous service. Annals of Operations Research, Vol. 252, 1 (2017), 29--39.
[28]
Linus Schrage. 1968. A Proof of the Optimality of the Shortest Remaining Processing Time Discipline. Operations Research, Vol. 16, 3 (1968), 687--690.
[29]
Linus E. Schrage and Louis W. Miller. 1966. The Queue M/G/1 with the Shortest Remaining Processing Time Discipline. Operations Research, Vol. 14, 4 (1966), 670--684.
[30]
Ziv Scully. 2021. WINE: A New Queueing Identity for Analyzing Scheduling Policies in Multiserver Systems. https://ziv.codes/pdf/wine-talk.pdf INFORMS Annual Meeting.
[31]
Ziv Scully, Isaac Grosof, and Mor Harchol-Balter. 2020. The Gittins Policy is Nearly Optimal in the M/G/k under Extremely General Conditions. Proc. ACM Meas. Anal. Comput. Syst., Vol. 4, 3, Article 43 (Nov. 2020), 29 pages.
[32]
Ziv Scully, Isaac Grosof, and Mor Harchol-Balter. 2021. Optimal multiserver scheduling with unknown job sizes in heavy traffic. Performance Evaluation, Vol. 145 (2021), 102150.
[33]
Ziv Scully and Mor Harchol-Balter. 2021. The Gittins Policy in the M/G/1 Queue. In 2021 19th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt). 1--8.
[34]
Srividya Srinivasan, Rajkumar Kettimuthu, Vijay Subramani, and Ponnuswamy Sadayappan. 2002. Characterization of backfilling strategies for parallel job scheduling. In Proceedings. International Conference on Parallel Processing Workshop. 514--519.
[35]
Wei Tang, Zhiling Lan, Narayan Desai, Daniel Buettner, and Yongen Yu. 2011. Reducing Fragmentation on Torus-Connected Supercomputers. In 2011 IEEE International Parallel Distributed Processing Symposium. 828--839.
[36]
Wei Tang, Dongxu Ren, Zhiling Lan, and Narayan Desai. 2012. Adaptive Metric-Aware Job Scheduling for Production Supercomputers. In 2012 41st International Conference on Parallel Processing Workshops. 107--115.
[37]
Oleg M. Tikhonenko. 2005. Generalized Erlang problem for service systems with finite total capacity. Problems of Information Transmission, Vol. 41, 3 (2005), 243--253.
[38]
Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. 2020. Borg: The next Generation. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys '20). Association for Computing Machinery, New York, NY, USA, Article 30, 14 pages.
[39]
Nico M. van Dijk. 1989. Blocking of finite source inputs which require simultaneous servers with general think and holding times. Operations Research Letters, Vol. 8, 1 (1989), 45 -- 52.
[40]
Chad Vizino, Nathan Stone, John Kochmar, and J. Ray Scott. 2005. Batch Scheduling on the Cray XT3. CUG 2005 (2005).
[41]
Juan Wang and Wenming Guo. 2009. The Application of Backfilling in Cluster Systems. In 2009 WRI International Conference on Communications and Mobile Computing, Vol. 3. 55--59.
[42]
Weina Wang, Qiaomin Xie, and Mor Harchol-Balter. 2021. Zero Queueing for Multi-Server Jobs. In Abstract Proceedings of the 2021 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (Virtual Event, China) (SIGMETRICS '21). Association for Computing Machinery, NY, NY, USA, 13--14.
[43]
Ward Whitt. 1985. Blocking when service is required from several facilities simultaneously. AT&T Technical Journal, Vol. 64, 8 (1985), 1807--1856.

Cited By

View all
  • (2025)The Impact of Service Demand Variability on Data Center PerformanceIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349779236:2(120-132)Online publication date: Feb-2025
  • (2025)Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job ModelIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349363136:1(43-54)Online publication date: Jan-2025
  • (2024)Scheduling for Reduced Tail Task Latencies in Highly Utilized DatacentersProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698522(302-321)Online publication date: 20-Nov-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Measurement and Analysis of Computing Systems
Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 6, Issue 3
POMACS
December 2022
534 pages
EISSN:2476-1249
DOI:10.1145/3576048
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2022
Published in POMACS Volume 6, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. asymptotic optimality
  2. gittins
  3. heavy traffic
  4. latency
  5. multiserver-job
  6. response time
  7. scheduling
  8. sojurn time
  9. srpt

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)249
  • Downloads (Last 6 weeks)31
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2025)The Impact of Service Demand Variability on Data Center PerformanceIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349779236:2(120-132)Online publication date: Feb-2025
  • (2025)Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job ModelIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349363136:1(43-54)Online publication date: Jan-2025
  • (2024)Scheduling for Reduced Tail Task Latencies in Highly Utilized DatacentersProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698522(302-321)Online publication date: 20-Nov-2024
  • (2024)Bounds on M/G/k Scheduling Under Moderate Load Improving on SRPT-k and Tightening Lower BoundsACM SIGMETRICS Performance Evaluation Review10.1145/3695411.369542152:2(24-26)Online publication date: 6-Sep-2024
  • (2024)Strongly Tail-Optimal Scheduling in the Light-Tailed M/G/1Proceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/36560118:2(1-33)Online publication date: 29-May-2024
  • (2024)The Saturated Multiserver Job Queuing Model with Two Classes of Jobs: Exact and Approximate ResultsACM SIGMETRICS Performance Evaluation Review10.1145/3649477.364949351:4(28-29)Online publication date: 23-Feb-2024
  • (2024)Performance of the Gittins Policy in the G/G/1 and G/G/k, With and Without Setup TimesACM SIGMETRICS Performance Evaluation Review10.1145/3649477.364948551:4(12-13)Online publication date: 23-Feb-2024
  • (2024)The RESET and MARC Techniques, with Application to Multiserver-Job AnalysisACM SIGMETRICS Performance Evaluation Review10.1145/3649477.364948251:4(6-7)Online publication date: 23-Feb-2024
  • (2024)Scheduling Multi-Server Jobs is Not EasyProceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing10.1145/3641512.3686374(171-180)Online publication date: 14-Oct-2024
  • (2024)On Optimal Server Allocation for Moldable Jobs with Concave Speed-UpProceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing10.1145/3641512.3686370(191-200)Online publication date: 14-Oct-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media