
TERMS: Task management policies to achieve high performance for mixed workloads using surplus resources

Published: 01 December 2022

Abstract

Resource contention and performance interference can degrade workload performance in clusters that co-locate mixed workloads. Previous work guarantees the resource requirements of latency-sensitive tasks and reduces performance losses for batch jobs by reclaiming surplus resources from over-provisioned tasks. However, resource fragmentation causes a mismatch between provisioned resources and task requirements, resulting in high operation overheads and a loss of task fairness. This paper proposes TERMS, a set of task management policies based on task relevance, resource distribution, and task fairness that achieves efficient, low-cost task management. TERMS comprises three types of management policies. A task scheduling policy schedules new tasks according to task relevance. Task selection strategies select tasks for resource provisioning and task resumption based on resource requirements and task fairness. When eliminating straggler tasks, a node selection strategy can choose suitable target nodes for task migration based on task relevance and node resource information. Evaluation results show that TERMS further improves the performance of latency-sensitive services and batch jobs, reduces management overhead, and avoids operation failures.
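
To make the task selection idea concrete, the following is a minimal, hypothetical Python sketch: given a latency-sensitive task's resource demand, pick batch tasks to reclaim resources from, preferring a tight resource fit and penalizing tasks that have already been preempted often (fairness). All names, fields, and the scoring weight are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class BatchTask:
        name: str
        cpu: float          # cores currently held by the task
        mem: float          # GiB currently held by the task
        preempt_count: int  # how often this task was preempted before

    def select_victims(batch_tasks, cpu_needed, mem_needed):
        """Pick batch tasks whose reclaimed resources cover a
        latency-sensitive task's demand, preferring a tight resource
        fit and rarely-preempted tasks (fairness)."""
        def score(t):
            # A tight fit limits fragmentation; the fairness term
            # (weight 10.0 is an illustrative assumption) steers
            # preemption away from repeatedly preempted tasks.
            fit = abs(t.cpu - cpu_needed) + abs(t.mem - mem_needed)
            return fit + 10.0 * t.preempt_count

        chosen, cpu_got, mem_got = [], 0.0, 0.0
        for t in sorted(batch_tasks, key=score):
            if cpu_got >= cpu_needed and mem_got >= mem_needed:
                break
            chosen.append(t)
            cpu_got += t.cpu
            mem_got += t.mem
        return chosen

    # Reclaim 2 cores / 4 GiB: b1 fits exactly and was never preempted,
    # so it wins over the oversized b2 and the often-preempted b3.
    tasks = [BatchTask("b1", 2.0, 4.0, 0),
             BatchTask("b2", 8.0, 16.0, 0),
             BatchTask("b3", 2.0, 4.0, 3)]
    print([t.name for t in select_victims(tasks, 2.0, 4.0)])  # ['b1']

The fairness weight here is arbitrary; TERMS bases its actual selection on resource requirements and task fairness as described in the abstract.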

Highlights

Place latency-sensitive new tasks in the queue according to task relevance.
Select batch tasks to provision resources for latency-sensitive new tasks.
Select batch tasks to provision resources for latency-sensitive straggler tasks.
Ensure task fairness when tasks are preempted or resumed.
Choose suitable target nodes for latency-sensitive straggler task migration (sketched below).
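
As referenced in the last highlight, here is a minimal, hypothetical sketch of node selection for straggler migration: among nodes with enough free capacity, prefer those already hosting tasks related to the straggler (task relevance), breaking ties by free capacity. All names and fields are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        free_cpu: float     # idle cores on the node
        free_mem: float     # idle GiB on the node
        related_tasks: int  # co-located tasks relevant to the straggler

    def pick_target_node(nodes, cpu_req, mem_req):
        """Choose a migration target for a latency-sensitive straggler:
        the node must fit the task's demand; among feasible nodes,
        prefer more related tasks, then more free capacity."""
        feasible = [n for n in nodes
                    if n.free_cpu >= cpu_req and n.free_mem >= mem_req]
        if not feasible:
            return None  # no node can host the task; migration would fail
        return max(feasible,
                   key=lambda n: (n.related_tasks,
                                  n.free_cpu + n.free_mem))

    # n3 lacks capacity; n1 beats n2 because it hosts two related tasks.
    nodes = [Node("n1", 4.0, 8.0, 2),
             Node("n2", 16.0, 32.0, 0),
             Node("n3", 1.0, 2.0, 5)]
    print(pick_target_node(nodes, 2.0, 4.0).name)  # n1

Here task relevance is reduced to a count of related co-located tasks; the feasibility filter stands in for the node resource information the strategy consults.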


Cited By

  • (2024) Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning, IEEE Transactions on Parallel and Distributed Systems 35(1):169–185, DOI: 10.1109/TPDS.2023.3334519. Online publication date: 1 Jan 2024.



Published In

Journal of Parallel and Distributed Computing, Volume 170, Issue C
Dec 2022
87 pages

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 December 2022

Author Tags

  1. Resource management
  2. Task management policy
  3. Surplus resource
  4. Resource reclamation
  5. Task selection

Qualifiers

  • Research-article
