DOI: 10.1145/3337821.3337890

Speculative Scheduling for Stochastic HPC Applications

Published: 05 August 2019

Abstract

Emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic and data-intensive requirements. These applications put a high emphasis on productivity and thus are not tuned to run efficiently on today's high performance computing (HPC) systems. Some of them, such as neuroscience workloads and those that use adaptive numerical algorithms, develop modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. Deploying them on current HPC systems with existing resource management solutions can result in a loss of efficiency for users and a decrease in effective system utilization for platform providers.
In this paper, we consider the current HPC scheduling model and describe the challenge it poses for stochastic applications due to the strict requirements of its job deployment policies. To address this challenge, we present speculative scheduling techniques that adapt the resource requirements of a stochastic application on the fly, based on its past execution behavior instead of relying on estimates given by the user. We focus on improving the overall system utilization and application response time without disrupting the current HPC scheduling model or the application development process. Our solution can operate alongside existing HPC batch schedulers without interfering with their usage modes. We show that speculative scheduling can improve the system utilization and average application response time by 25-30% compared to the classical HPC approach.
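
The idea of deriving resource requests from past execution behavior can be illustrated with a minimal Python sketch. This is not the algorithm from the paper: the quantile-based policy, the growth factor, and the function names `next_request`, `request_sequence`, and `consumed_time` are assumptions made only for illustration. The sketch picks a first walltime request from an empirical quantile of earlier run times, escalates the request after each failed attempt, and accounts for the time consumed when an attempt is killed at its limit.

```python
import math


def next_request(history, quantile=0.75, floor=1.0):
    """Pick a walltime request from past run times (hypothetical policy)."""
    if not history:
        return floor
    ordered = sorted(history)
    # Nearest-rank empirical quantile of the observed run times.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return max(floor, ordered[rank - 1])


def request_sequence(history, growth=1.5, quantile=0.75):
    """Escalating requests that end at the longest observed run time."""
    first = next_request(history, quantile)
    longest = max(history) if history else first
    sequence = [first]
    while sequence[-1] < longest:
        # Grow the request, but never beyond the longest run seen so far.
        sequence.append(min(sequence[-1] * growth, longest))
    return sequence


def consumed_time(actual, sequence):
    """Time used when a job of length `actual` walks the sequence.

    Assumption: a failed attempt burns its full request, while the
    successful attempt only uses the actual run time.
    """
    total = 0.0
    for request in sequence:
        if actual <= request:
            return total + actual
        total += request
    raise ValueError("run time exceeds every request in the sequence")


history = [10, 12, 15, 20, 40]  # made-up past run times, in minutes
sequence = request_sequence(history)
print(sequence)
print(consumed_time(35, sequence))
```

A lower quantile makes the first request cheaper but increases the chance of a failed attempt, so the choice trades wasted reservations against over-requested time.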

Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
ISBN:9781450362955
DOI:10.1145/3337821
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HPC runtime
  2. Scheduling algorithm
  3. stochastic applications

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019

Cited By

  • (2024) Qualitatively Analyzing Optimization Objectives in the Design of HPC Resource Manager. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 9:4 (1-28). DOI: 10.1145/3701986. Online publication date: 29-Oct-2024
  • (2024) Automated HPC Workload Generation Combining Statistical Modeling and Autoregressive Analysis. Benchmarking, Measuring, and Optimizing (153-170). DOI: 10.1007/978-981-97-0316-6_10. Online publication date: 14-Feb-2024
  • (2023) Toward automated algorithm configuration for distributed hybrid flow shop scheduling with multiprocessor tasks. Knowledge-Based Systems 264:C. DOI: 10.1016/j.knosys.2023.110309. Online publication date: 15-Mar-2023
  • (2023) Optimization Metrics for the Evaluation of Batch Schedulers in HPC. Job Scheduling Strategies for Parallel Processing (97-115). DOI: 10.1007/978-3-031-43943-8_5. Online publication date: 15-Sep-2023
  • (2022) Adaptively Periodic I/O Scheduling for Concurrent HPC Applications. Electronics 11:9 (1318). DOI: 10.3390/electronics11091318. Online publication date: 21-Apr-2022
  • (2021) Analytical and Numerical Evaluation of Co-Scheduling Strategies and Their Application. Computers 10:10 (122). DOI: 10.3390/computers10100122. Online publication date: 2-Oct-2021
  • (2021) A HPC Co-scheduler with Reinforcement Learning. Job Scheduling Strategies for Parallel Processing (126-148). DOI: 10.1007/978-3-030-88224-2_7. Online publication date: 6-Oct-2021
  • (2020) Profiles of Upcoming HPC Applications and Their Impact on Reservation Strategies. IEEE Transactions on Parallel and Distributed Systems 32:5 (1178-1190). DOI: 10.1109/TPDS.2020.3039728. Online publication date: 23-Dec-2020
  • (2020) Reservation and Checkpointing Strategies for Stochastic Jobs. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (853-863). DOI: 10.1109/IPDPS47924.2020.00092. Online publication date: May-2020