More Web Proxy on the site http://driver.im/

research-article

Open access

Optimal Scheduling in the Multiserver-job Model under Heavy Traffic

Authors:

Mor Harchol-Balter,

Alan Scheller-WolfAuthors Info & Claims

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 6, Issue 3

Article No.: 51, Pages 1 - 32

https://doi.org/10.1145/3570612

Published: 08 December 2022 Publication History

Abstract

Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response time. In simpler settings, such as various known-size single-server-job settings, minimizing mean response time is merely a matter of prioritizing small jobs. However, for the multiserver-job system, prioritizing small jobs is not enough, because we must also ensure servers are not unnecessarily left idle. Thus, minimizing mean response time requires prioritizing small jobs while simultaneously maximizing throughput. Our question is how to achieve these joint objectives.

We devise the ServerFilling-SRPT scheduling policy, which is the first policy to minimize mean response time in the multiserver-job model in the heavy traffic limit. In addition to proving this heavy-traffic result, we present empirical evidence that ServerFilling-SRPT outperforms all existing scheduling policies for all loads, with improvements by orders of magnitude at higher loads.

Because ServerFilling-SRPT requires knowing job sizes, we also define the ServerFilling-Gittins policy, which is optimal when sizes are unknown or partially known.

References

[1]

Timothy G. Armstrong, Zhao Zhang, Daniel S. Katz, Michael Wilde, and Ian T. Foster. 2010. Scheduling many-task workloads on supercomputers: Dealing with trailing tasks. In 2010 3rd Workshop on Many-Task Computing on Grids and Supercomputers. 1--10.

[2]

Edward Arthurs and Joseph S. Kaufman. 1979. Sizing a message store subject to blocking criteria. In Proceedings of the third international symposium on modelling and performance evaluation of computer systems: Performance of computer systems. 547--564.

[3]

Percy H. Brill and Linda Green. 1984. Queues in Which Customers Receive Simultaneous Service from a Random Number of Servers: A System Point Approach. Management Science, Vol. 30, 1 (1984), 51--68.

Digital Library

[4]

Danilo Carastan-Santos, Raphael Y. De Camargo, Denis Trystram, and Salah Zrigui. 2019. One Can Only Gain by Replacing EASY Backfilling: A Simple Scheduling Policies Case Study. In 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). 1--10.

[5]

Walfredo Cirne and Francine Berman. 2001. A model for moldable supercomputer jobs. In Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001. 8 pp.

[6]

Allen B. Downey. 1997. Using queue time predictions for processor allocation. In workshop on Job Scheduling Strategies for Parallel Processing. Springer, 35--57.

[7]

Yoav Etsion and Dan Tsafrir. 2005. A short survey of commercial cluster batch schedulers. School of Computer Science and Engineering, The Hebrew University of Jerusalem, Vol. 44221 (2005), 2005--13.

[8]

Dror G. Feitelson and Larry Rudolph. 1996. Toward convergence in job schedulers for parallel supercomputers. In Job Scheduling Strategies for Parallel Processing, Dror G. Feitelson and Larry Rudolph (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--26.

Digital Library

[9]

Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn. 2004. Parallel job scheduling--a status report. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1--16.

[10]

Dimitrios Filippopoulos and Helen Karatza. 2007. An M/M/2 parallel system model with pure space sharing among rigid jobs. Mathematical and Computer Modelling, Vol. 45, 5 (2007), 491 -- 530.

Digital Library

[11]

Javad Ghaderi. 2016. Randomized algorithms for scheduling VMs in the cloud. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications. 1--9.

Digital Library

[12]

John Gittins, Kevin Glazebrook, and Richard Weber. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.

[13]

Isaac Grosof, Mor Harchol-Balter, and Alan Scheller-Wolf. 2021. WCFS: A new framework for analyzing multiserver systems. arXiv preprint arXiv:2109.12663 (2021). Electronic companion (full version) of the paper by the same name in Queueing Systems.

[14]

Isaac Grosof, Mor Harchol-Balter, and Alan Scheller-Wolf. 2022. WCFS: A new framework for analyzing multiserver systems. Queueing Systems (2022).

[15]

Isaac Grosof, Ziv Scully, and Mor Harchol-Balter. 2018. SRPT for multiserver systems. Performance Evaluation, Vol. 127--128 (2018), 154--175.

[16]

Isaac Grosof, Ziv Scully, and Mor Harchol-Balter. 2019. Load Balancing Guardrails: Keeping Your Heavy Traffic on the Road to Low Response Times. Proc. ACM Meas. Anal. Comput. Syst., Vol. 3, 2, Article 42 (jun 2019), 31 pages.

Digital Library

[17]

Mian Guo, Quansheng Guan, and Wende Ke. 2018. Optimal Scheduling of VMs in Queueing Cloud Computing Systems With a Heterogeneous Workload. IEEE Access, Vol. 6 (2018), 15178--15191.

[18]

Mor Harchol-Balter. 2013. Performance modeling and design of computer systems: queueing theory in action. Cambridge University Press.

Digital Library

[19]

Mor Harchol-Balter. 2022. The multiserver job queueing model. Queueing Systems, Vol. 100, 3 (2022), 201--203.

Digital Library

[20]

Yige Hong. 2022. Sharp Zero-Queueing Bounds for Multi-Server Jobs. SIGMETRICS Perform. Eval. Rev., Vol. 49, 2 (jan 2022), 66--68.

Digital Library

[21]

Minnesota Supercomputing Institute. 2020. Queues. https://www.msi.umn.edu/queues

[22]

James Patton Jones and Bill Nitzberg. 1999. Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization. In Job Scheduling Strategies for Parallel Processing, Dror G. Feitelson and Larry Rudolph (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--16.

[23]

Sung Shick Kim. 1979. M/M/s queueing system where customers demand multiple server use. Ph.,D. Dissertation. Southern Methodist University.

[24]

Siva Theja Maguluri, Rayadurgam Srikant, and Lei Ying. 2012. Stochastic models of load balancing and scheduling in cloud computing clusters. In 2012 Proceedings IEEE Infocom. IEEE, 702--710.

[25]

Konstantinos Psychas and Javad Ghaderi. 2018. Randomized Algorithms for Scheduling Multi-Resource Jobs in the Cloud. IEEE/ACM Transactions on Networking, Vol. 26, 5 (2018), 2202--2215.

Digital Library

[26]

Alexander Rumyantsev, Robert Basmadjian, Sergey Astafiev, and Alexander Golovin. 2022. Three-level modeling of a speed-scaling supercomputer. Annals of Operations Research (2022), 1--29.

[27]

Alexander Rumyantsev and Evsey Morozov. 2017. Stability criterion of a multiserver model with simultaneous service. Annals of Operations Research, Vol. 252, 1 (2017), 29--39.

[28]

Linus Schrage. 1968. A Proof of the Optimality of the Shortest Remaining Processing Time Discipline. Operations Research, Vol. 16, 3 (1968), 687--690.

Digital Library

[29]

Linus E. Schrage and Louis W. Miller. 1966. The Queue M/G/1 with the Shortest Remaining Processing Time Discipline. Operations Research, Vol. 14, 4 (1966), 670--684.

Digital Library

[30]

Ziv Scully. 2021. WINE: A New Queueing Identity for Analyzing Scheduling Policies in Multiserver Systems. https://ziv.codes/pdf/wine-talk.pdf INFORMS Annual Meeting.

[31]

Ziv Scully, Isaac Grosof, and Mor Harchol-Balter. 2020. The Gittins Policy is Nearly Optimal in the M/G/k under Extremely General Conditions. Proc. ACM Meas. Anal. Comput. Syst., Vol. 4, 3, Article 43 (Nov. 2020), 29 pages.

Digital Library

[32]

Ziv Scully, Isaac Grosof, and Mor Harchol-Balter. 2021. Optimal multiserver scheduling with unknown job sizes in heavy traffic. Performance Evaluation, Vol. 145 (2021), 102150.

[33]

Ziv Scully and Mor Harchol-Balter. 2021. The Gittins Policy in the M/G/1 Queue. In 2021 19th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt). 1--8.

[34]

Srividya Srinivasan, Rajkumar Kettimuthu, Vijay Subramani, and Ponnuswamy Sadayappan. 2002. Characterization of backfilling strategies for parallel job scheduling. In Proceedings. International Conference on Parallel Processing Workshop. 514--519.

[35]

Wei Tang, Zhiling Lan, Narayan Desai, Daniel Buettner, and Yongen Yu. 2011. Reducing Fragmentation on Torus-Connected Supercomputers. In 2011 IEEE International Parallel Distributed Processing Symposium. 828--839.

[36]

Wei Tang, Dongxu Ren, Zhiling Lan, and Narayan Desai. 2012. Adaptive Metric-Aware Job Scheduling for Production Supercomputers. In 2012 41st International Conference on Parallel Processing Workshops. 107--115.

[37]

Oleg M. Tikhonenko. 2005. Generalized Erlang problem for service systems with finite total capacity. Problems of Information Transmission, Vol. 41, 3 (2005), 243--253.

Digital Library

[38]

Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. 2020. Borg: The next Generation. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys '20). Association for Computing Machinery, New York, NY, USA, Article 30, 14 pages.

Digital Library

[39]

Nico M. van Dijk. 1989. Blocking of finite source inputs which require simultaneous servers with general think and holding times. Operations Research Letters, Vol. 8, 1 (1989), 45 -- 52.

Digital Library

[40]

Chad Vizino, Nathan Stone, John Kochmar, and J. Ray Scott. 2005. Batch Scheduling on the Cray XT3. CUG 2005 (2005).

[41]

Juan Wang and Wenming Guo. 2009. The Application of Backfilling in Cluster Systems. In 2009 WRI International Conference on Communications and Mobile Computing, Vol. 3. 55--59.

[42]

Weina Wang, Qiaomin Xie, and Mor Harchol-Balter. 2021. Zero Queueing for Multi-Server Jobs. In Abstract Proceedings of the 2021 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (Virtual Event, China) (SIGMETRICS '21). Association for Computing Machinery, NY, NY, USA, 13--14.

[43]

Ward Whitt. 1985. Blocking when service is required from several facilities simultaneously. AT&T Technical Journal, Vol. 64, 8 (1985), 1807--1856.

Cited By

Olliaro DAnggraito AMarsan MBalsamo SMarin A(2025)The Impact of Service Demand Variability on Data Center PerformanceIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349779236:2(120-132)Online publication date: Feb-2025
https://doi.org/10.1109/TPDS.2024.3497792
Anselmi JDoncel J(2025)Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job ModelIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349363136:1(43-54)Online publication date: Jan-2025
https://doi.org/10.1109/TPDS.2024.3493631
Vijayakumar SMadhavapeddy AKalyvianaki E(2024)Scheduling for Reduced Tail Task Latencies in Highly Utilized DatacentersProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698522(302-321)Online publication date: 20-Nov-2024
https://dl.acm.org/doi/10.1145/3698038.3698522
Show More Cited By

Index Terms

Optimal Scheduling in the Multiserver-job Model under Heavy Traffic

Recommendations

Optimal Scheduling in the Multiserver-job Model under Heavy Traffic
SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response ...
Optimal Scheduling in the Multiserver-job Model under Heavy Traffic
SIGMETRICS '23

Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response ...
Invited Paper: ServerFilling: A better approach to packing multiserver jobs
ApPLIED 2023: Proceedings of the 5th workshop on Advanced tools, programming languages, and PLatforms for Implementing and Evaluating algorithms for Distributed systems

Ever since the advent of "multiserver jobs" (jobs that require more than one server or core simultaneously), practitioners have been faced with the question of how to pack these jobs into a compute cluster. While many policies have been proposed, ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Measurement and Analysis of Computing Systems

Proceedings of the ACM on Measurement and Analysis of Computing Systems Volume 6, Issue 3

POMACS

December 2022

534 pages

EISSN:2476-1249

DOI:10.1145/3576048

Editors:
Augustin Chaintreau
Columbia University
,
Leana Golubchik
University of Southern California, United States
,
Zhi-Li Zhang
University of Minnesota, United States

Issue’s Table of Contents

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2022

Published in POMACS Volume 6, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Science Foundation

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
417
Total Downloads

Downloads (Last 12 months)249
Downloads (Last 6 weeks)31

Reflects downloads up to 09 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Olliaro DAnggraito AMarsan MBalsamo SMarin A(2025)The Impact of Service Demand Variability on Data Center PerformanceIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349779236:2(120-132)Online publication date: Feb-2025
https://doi.org/10.1109/TPDS.2024.3497792
Anselmi JDoncel J(2025)Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job ModelIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.349363136:1(43-54)Online publication date: Jan-2025
https://doi.org/10.1109/TPDS.2024.3493631
Vijayakumar SMadhavapeddy AKalyvianaki E(2024)Scheduling for Reduced Tail Task Latencies in Highly Utilized DatacentersProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698522(302-321)Online publication date: 20-Nov-2024
https://dl.acm.org/doi/10.1145/3698038.3698522
Grosof IWang Z(2024)Bounds on M/G/k Scheduling Under Moderate Load Improving on SRPT-k and Tightening Lower BoundsACM SIGMETRICS Performance Evaluation Review10.1145/3695411.369542152:2(24-26)Online publication date: 6-Sep-2024
https://dl.acm.org/doi/10.1145/3695411.3695421
Yu GScully Z(2024)Strongly Tail-Optimal Scheduling in the Light-Tailed M/G/1Proceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/36560118:2(1-33)Online publication date: 29-May-2024
https://dl.acm.org/doi/10.1145/3656011
Olliaro DAjmone Marsan MBalsamo SMarin A(2024)The Saturated Multiserver Job Queuing Model with Two Classes of Jobs: Exact and Approximate ResultsACM SIGMETRICS Performance Evaluation Review10.1145/3649477.364949351:4(28-29)Online publication date: 23-Feb-2024
https://dl.acm.org/doi/10.1145/3649477.3649493
Hong YScully Z(2024)Performance of the Gittins Policy in the G/G/1 and G/G/k, With and Without Setup TimesACM SIGMETRICS Performance Evaluation Review10.1145/3649477.364948551:4(12-13)Online publication date: 23-Feb-2024
https://dl.acm.org/doi/10.1145/3649477.3649485
Grosof IHong YHarchol-Balter MScheller-Wolf A(2024)The RESET and MARC Techniques, with Application to Multiserver-Job AnalysisACM SIGMETRICS Performance Evaluation Review10.1145/3649477.364948251:4(6-7)Online publication date: 23-Feb-2024
https://dl.acm.org/doi/10.1145/3649477.3649482
Vaze RPapavassiliou SSchmid S(2024)Scheduling Multi-Server Jobs is Not EasyProceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing10.1145/3641512.3686374(171-180)Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3641512.3686374
Ghanbarian SMukhopadhyay AMazumdar RGuillemin FPapavassiliou SSchmid S(2024)On Optimal Server Allocation for Moldable Jobs with Concave Speed-UpProceedings of the Twenty-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing10.1145/3641512.3686370(191-200)Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3641512.3686370
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents