
TERMS: Task management policies to achieve high performance for mixed workloads using surplus resources

Published: 01 December 2022

Abstract

Resource contention and performance interference can degrade workload performance in clusters that co-locate mixed workloads. Previous work guarantees the resource requirements of latency-sensitive tasks and reduces performance losses for batch jobs by reclaiming surplus resources from over-provisioned tasks. However, resource fragmentation causes a mismatch between provisioned resources and task requirements, resulting in high operation overheads and a loss of task fairness. This paper proposes TERMS, a set of task management policies based on task relevance, resource distribution, and task fairness that achieves efficient, low-cost task management. TERMS comprises three types of management policies. A task scheduling policy schedules new tasks according to task relevance. Task selection strategies select tasks for resource provisioning and task resumption based on resource requirements and task fairness. When eliminating straggler tasks, a node selection strategy can choose suitable target nodes for task migration based on task relevance and node resource information. Evaluation results show that TERMS further improves the performance of latency-sensitive services and batch jobs, reduces management overhead, and avoids operation failures.
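
To make the task selection idea concrete, the following is a minimal, hypothetical Python sketch: given a latency-sensitive task's resource demand, pick batch tasks to reclaim resources from, preferring a tight resource fit and penalizing tasks that have already been preempted often (fairness). All names, fields, and the scoring weight are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class BatchTask:
        name: str
        cpu: float          # cores currently held by the task
        mem: float          # GiB currently held by the task
        preempt_count: int  # how often this task was preempted before

    def select_victims(batch_tasks, cpu_needed, mem_needed):
        """Pick batch tasks whose reclaimed resources cover a
        latency-sensitive task's demand, preferring a tight resource
        fit and rarely-preempted tasks (fairness)."""
        def score(t):
            # A tight fit limits fragmentation; the fairness term
            # (weight 10.0 is an illustrative assumption) steers
            # preemption away from repeatedly preempted tasks.
            fit = abs(t.cpu - cpu_needed) + abs(t.mem - mem_needed)
            return fit + 10.0 * t.preempt_count

        chosen, cpu_got, mem_got = [], 0.0, 0.0
        for t in sorted(batch_tasks, key=score):
            if cpu_got >= cpu_needed and mem_got >= mem_needed:
                break
            chosen.append(t)
            cpu_got += t.cpu
            mem_got += t.mem
        return chosen

    # Reclaim 2 cores / 4 GiB: b1 fits exactly and was never preempted,
    # so it wins over the oversized b2 and the often-preempted b3.
    tasks = [BatchTask("b1", 2.0, 4.0, 0),
             BatchTask("b2", 8.0, 16.0, 0),
             BatchTask("b3", 2.0, 4.0, 3)]
    print([t.name for t in select_victims(tasks, 2.0, 4.0)])  # ['b1']

The fairness weight here is arbitrary; TERMS bases its actual selection on resource requirements and task fairness as described in the abstract.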

Highlights

Place latency-sensitive new tasks in the queue according to task relevance.
Select batch tasks to provision resources for latency-sensitive new tasks.
Select batch tasks to provision resources for latency-sensitive straggler tasks.
Ensure task fairness when tasks are preempted or resumed.
Choose suitable target nodes for latency-sensitive straggler task migration (sketched below).
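
As referenced in the last highlight, here is a minimal, hypothetical sketch of node selection for straggler migration: among nodes with enough free capacity, prefer those already hosting tasks related to the straggler (task relevance), breaking ties by free capacity. All names and fields are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        free_cpu: float     # idle cores on the node
        free_mem: float     # idle GiB on the node
        related_tasks: int  # co-located tasks relevant to the straggler

    def pick_target_node(nodes, cpu_req, mem_req):
        """Choose a migration target for a latency-sensitive straggler:
        the node must fit the task's demand; among feasible nodes,
        prefer more related tasks, then more free capacity."""
        feasible = [n for n in nodes
                    if n.free_cpu >= cpu_req and n.free_mem >= mem_req]
        if not feasible:
            return None  # no node can host the task; migration would fail
        return max(feasible,
                   key=lambda n: (n.related_tasks,
                                  n.free_cpu + n.free_mem))

    # n3 lacks capacity; n1 beats n2 because it hosts two related tasks.
    nodes = [Node("n1", 4.0, 8.0, 2),
             Node("n2", 16.0, 32.0, 0),
             Node("n3", 1.0, 2.0, 5)]
    print(pick_target_node(nodes, 2.0, 4.0).name)  # n1

Here task relevance is reduced to a count of related co-located tasks; the feasibility filter stands in for the node resource information the strategy consults.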


Cited By

  • (2024) Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning, IEEE Transactions on Parallel and Distributed Systems 35(1):169–185, DOI: 10.1109/TPDS.2023.3334519. Online publication date: 1 Jan 2024.



Published In

Journal of Parallel and Distributed Computing, Volume 170, Issue C
Dec 2022
87 pages

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 December 2022

Author Tags

  1. Resource management
  2. Task management policy
  3. Surplus resource
  4. Resource reclamation
  5. Task selection

Qualifiers

  • Research-article
