[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3698038.3698551acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

Cloud-native Workflow Scheduling using a Hybrid Priority Rule, Dynamic Resource Allocation, and Dynamic Task Partition

Published: 20 November 2024 Publication History

Abstract

As cloud-native workflow orchestration tools become increasingly important for complex data science workloads, there is a growing need for more efficient scheduling. Existing cloud schedulers rely on basic heuristics and user choice for task partitioning for parallel computing, leading to under-utilization of cluster resources and prolonged job completion times. To address this, we propose a novel workflow scheduling algorithm that leverages workflow characteristics to enhance resource utilization and reduce weighted job completion time. The algorithm combines three sub-algorithms, each reflecting a distinct aspect of the scheduling strategy: 1) Hybrid Maximum Children (MC) -Weighted Shortest Critical Path Time (WSCPT) rule alternates between two heuristics, MC and WSCPT, which prioritize jobs based on workflow structure and critical path, respectively. The choice between these heuristics is dynamically adjusted according to the cluster queue size. 2) Dynamic Resource Allocation (DRA), which dynamically adjusts the number of executors assigned to each workflow, and 3) Dynamic Task Partition (DTP), which autonomously determines the task parallelism level. We tested our algorithm with extensive experiments on various workflow types using Spark-imitated simulation. Our algorithm outperformed other schedulers, including learning-based models, by reducing 21-47% of the combined performance of average job completion time and makespan for unweighted workflows and reducing at least 50% of weighted job completion time for weighted workflows.

References

[1]
Thomas L Adam, K. Mani Chandy, and JR Dickson. 1974. A comparison of list schedules for parallel processing systems. Commun. ACM 17, 12 (1974), 685--690.
[2]
Kunal Agrawal, Yuxiong He, Wen Jing Hsu, and Charles E Leiserson. 2006. Adaptive scheduling with parallelism feedback. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. 100--109.
[3]
Kunal Agrawal, Jing Li, Kefu Lu, and Benjamin Moseley. 2016. Scheduling parallel DAG jobs online to minimize average flow time. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 176--189.
[4]
Mohammed Amoon, Nirmeen El-Bahnasawy, and Mai ElKazaz. 2019. An efficient cost-based algorithm for scheduling workflow tasks in cloud computing systems. Neural Computing and Applications 31 (2019), 1353--1363.
[5]
Savaş Balin. 2011. Parallel machine scheduling with fuzzy processing times using a robust genetic algorithm and simulation. Information Sciences 181, 17 (2011), 3551--3569.
[6]
Angel Beltre, Pankaj Saha, and Madhusudhan Govindaraju. 2019. Kube-sphere: An approach to multi-tenant fair scheduling for kubernetes clusters. In 2019 IEEE Cloud Summit. IEEE, 14--20.
[7]
The TPC-H Benchmarks. 2018. TPC-H 2018. www.tpc.org/tpch/
[8]
David Bernstein. 2014. Containers and Cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing 1, 3 (2014), 81--84. https://doi.org/10.1109/MCC.2014.51
[9]
Tracy D Braun, Howard Jay Siegel, Noah Beck, Ladislau L Bölöni, Muthucumaru Maheswaran, Albert I Reuther, James P Robertson, Mitchell D Theys, Bin Yao, Debra Hensgen, et al. 2001. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed computing 61, 6 (2001), 810--837.
[10]
Nigel Brown. 2017. Scheduling Services on a Docker Swarm Mode Cluster. https://semaphoreci.com/community/tutorials/scheduling-services-on-a-docker-swarm-mode-cluster. [Online; accessed 02-June-2022].
[11]
Aniket Chakrabarti, Srinivasan Parthasarathy, and Christopher Stewart. 2016. Green-and heterogeneity-aware partitioning for data analytics. In 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 366--371.
[12]
Kit Yan Chan, Bilal Abu-Salih, Raneem Qaddoura, Al-Zoubi Ala'M, Vasile Palade, Duc-Son Pham, Javier Del Ser, and Khan Muhammad. 2023. Deep neural networks in the cloud: Review, applications, challenges and research directions. Neurocomputing 545 (2023), 126327.
[13]
Mukoe Cheong, Hyunsung Lee, Ikjun Yeom, and Honguk Woo. 2019. SCARL: Attentive reinforcement learning-based scheduling in a multi-resource heterogeneous cluster. IEEE Access 7 (2019), 153432--153444.
[14]
Tainã Coleman, Henri Casanova, Loïc Pottier, Manav Kaushik, Ewa Deelman, and Rafael Ferreira da Silva. 2022. WfCommons: A Framework for Enabling Scientific Workflow Research and Development. Future Generation Computer Systems 128 (2022), 16--27. https://doi.org/10.1016/j.future.2021.09.043
[15]
Carlo Curino, Evan Philip Charles Jones, Yang Zhang, and Samuel R Madden. 2010. Schism: a workload-driven approach to database replication and partitioning. (2010).
[16]
Rafael Ferreira da Silva, Rosa Filgueira, Ewa Deelman, Erola Pairo-Castineira, Ian M Overton, and Malcolm P Atkinson. 2019. Using simple pid-inspired controllers for online resilient resource management of distributed scientific workflows. Future Generation Computer Systems 95 (2019), 615--628.
[17]
Ciprian Dobre and Fatos Xhafa. 2014. Parallel programming paradigms and frameworks in big data era. International Journal of Parallel Programming 42, 5 (2014), 710--738.
[18]
Tingting Dong, Fei Xue, Chuangbai Xiao, and Juntao Li. 2020. Task scheduling based on deep reinforcement learning in a cloud manufacturing environment. Concurrency and Computation: Practice and Experience 32, 11 (2020), e5654.
[19]
J Geetha and NG Harshit. 2019. Implementation and performance comparison of partitioning techniques in apache spark. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 1--5.
[20]
Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert NM Watson, and Steven Hand. 2016. Firmament: Fast, centralized cluster scheduling at scale. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 99--115.
[21]
Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. 2015. The rise of "big data" on cloud computing: Review and open research issues. Information systems 47 (2015), 98--115.
[22]
Te C Hu. 1961. Parallel sequencing and assembly line problems. Operations research 9, 6 (1961), 841--848.
[23]
Alibaba Inc. 2018. Alibaba production cluster data v2018. https://github.com/alibaba/clusterdata/tree/v2018
[24]
JK Jeevitha and G Athisha. 2021. A novel scheduling approach to improve the energy efficiency in cloud computing data centers. Journal of Ambient Intelligence and Humanized Computing 12 (2021), 6639--6649.
[25]
Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas, Kishore Chaliparambil, Giovanni Matteo Fumarola, Solom Heddaya, Raghu Ramakrishnan, and Sarvesh Sakalanaga. 2015. Mercury: Hybrid centralized and distributed scheduling in large shared clusters. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). 485--497.
[26]
AV Karthick, E Ramaraj, and R Ganapathy Subramanian. 2014. An efficient multi queue job scheduling for cloud computing. In 2014 World Congress on Computing and Communication Technologies. IEEE, 164--166.
[27]
Walter H. Kohler. 1975. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems. IEEE Trans. Comput. 100, 12 (1975), 1235--1238.
[28]
KubernetesDoc2024 2024. Kubernetes Documentation. https://kubernetes.io/docs/home/. [Online; accessed Mar-15-2024].
[29]
Yu-Kwong Kwok and Ishfaq Ahmad. 1995. Bubble scheduling: A quasi dynamic algorithm for static allocation of tasks to parallel architectures. In Proceedings. Seventh IEEE Symposium on Parallel and Distributed Processing. IEEE, 36--43.
[30]
Yu-Kwong Kwok and Ishfaq Ahmad. 1996. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE transactions on parallel and distributed systems 7, 5 (1996), 506--521.
[31]
Chunlin Li, Qianqian Cai, and Youlong Luo. 2022. Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment. The Journal of Supercomputing 78, 3 (2022), 3561--3604.
[32]
Suyi Li, Wei Wang, Jun Yang, Guangzhen Chen, and Daohe Lu. 2023. Golgi: Performance-aware, resource-efficient function scheduling for serverless computing. In Proceedings of the 2023 ACM Symposium on Cloud Computing. 32--47.
[33]
Hongyun Liu, Peng Chen, Xue Ouyang, Hui Gao, Bing Yan, Paola Grosso, and Zhiming Zhao. 2023. Robustness challenges in Reinforcement Learning based time-critical cloud resource scheduling: A Meta-Learning based solution. Future Generation Computer Systems 146 (2023), 18--33.
[34]
Florian Lonsing and Armin Biere. 2010. DepQBF: A dependency-aware QBF solver. Journal on Satisfiability, Boolean Modeling and Computation 7, 2--3 (2010), 71--76.
[35]
Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum, Tamer Z Emara, and Kuanishbay Sadatdiynov. 2020. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics 3, 2 (2020), 85--101.
[36]
Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. 2019. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication. 270--288.
[37]
Rimma Nehme and Nicolas Bruno. 2011. Automated partitioning design in parallel database systems. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 1137--1148.
[38]
Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 61--72.
[39]
Amir Masoud Rahmani and Mohammad Ali Vahedi. 2008. A novel task scheduling in multiprocessor systems with genetic algorithm by using elitism stepping method. INFOCOMP Journal of Computer Science 7, 2 (2008), 58--64.
[40]
Sheldon M Ross. 2014. Introduction to probability models. Academic press.
[41]
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems. 351--364.
[42]
S Selvarani and G Sudha Sadhasivam. 2010. Improved cost-based algorithm for task scheduling in cloud computing. In 2010 IEEE International Conference on Computational Intelligence and Computing Research. IEEE, 1--5.
[43]
Rathijit Sen, Alekh Jindal, Hiren Patel, and Shi Qiao. 2020. Autotoken: Predicting peak parallelism for big data analytics at microsoft. Proceedings of the VLDB Endowment 13, 12 (2020), 3326--3339.
[44]
Anil Shanbhag, Alekh Jindal, Samuel Madden, Jorge Quiane, and Aaron J Elmore. 2017. A robust partitioning scheme for ad-hoc query workloads. In Proceedings of the 2017 symposium on Cloud Computing. 229--241.
[45]
Zhao Tong, Hongjian Chen, Xiaomei Deng, Kenli Li, and Keqin Li. 2020. A scheduling scheme in the cloud computing environment using deep Q-learning. Information Sciences 512 (2020), 1170--1191.
[46]
Haluk Topcuoglu, Salim Hariri, and Min-You Wu. 2002. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE transactions on parallel and distributed systems 13, 3 (2002), 260--274.
[47]
Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael A Kozuch, Mor Harchol-Balter, and Gregory R Ganger. 2016. TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceedings of the Eleventh European Conference on Computer Systems. 1--16.
[48]
Milan Vojnovic, Fei Xu, and Jingren Zhou. 2012. Sampling based range partition methods for big data analytics. Technical report, Microsoft Research (2012).
[49]
Argo Workflows. [n. d.]. The workflow engine for Kubernetes. https://argoproj.github.io/argo-workflows/. [Online; accessed 05-June-2022].
[50]
Min-You Wu and Daniel D Gajski. [n. d.]. Hypertool: A programming aid for message-passing systems. ([n. d.]).
[51]
Yichao Yang, Yanbo Zhou, Zhili Sun, and Haitham Cruickshank. 2013. Heuristic scheduling algorithms for allocation of virtualized network and computing resources. Journal of Software Engineering and Applications 6, 1 (2013), 1--13.
[52]
Lingjuan Ye, Yuanqing Xia, Liwen Yang, and Ce Yan. 2021. SHWS: Stochastic hybrid workflows dynamic scheduling in cloud container services. IEEE Transactions on Automation Science and Engineering 19, 3 (2021), 2620--2636.
[53]
Guangyao Zhou, Wenhong Tian, Rajkumar Buyya, Ruini Xue, and Liang Song. 2024. Deep reinforcement learning-based methods for resource scheduling in cloud computing: A review and future directions. Artificial Intelligence Review 57, 5 (2024), 124.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SoCC '24: Proceedings of the 2024 ACM Symposium on Cloud Computing
November 2024
1062 pages
ISBN:9798400712869
DOI:10.1145/3698038
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 November 2024

Check for updates

Author Tags

  1. Cloud native computing
  2. Dynamic resource allocation
  3. job scheduling
  4. task partitioning
  5. workflow scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • IBMIllinois Discovery Accelerator Institute

Conference

SoCC '24
Sponsor:
SoCC '24: ACM Symposium on Cloud Computing
November 20 - 22, 2024
WA, Redmond, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 100
    Total Downloads
  • Downloads (Last 12 months)100
  • Downloads (Last 6 weeks)100
Reflects downloads up to 18 Dec 2024

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media