DOI: 10.1145/3472456.3472459

CERES: Container-Based Elastic Resource Management System for Mixed Workloads

Published: 05 October 2021

Abstract

It is common to deploy multiple workloads in one cluster to achieve high resource utilization, which tends to introduce more resource contention and performance interference. If the allocable resources cannot satisfy the resource requirements of a task, the task has to wait for resources, which significantly increases its scheduling latency. Inappropriate resource requirements may turn a running task into a swollen task or a straggler task: a swollen task leaves many of its allocated resources underutilized, while a straggler task is processed slowly. Therefore, guaranteeing the QoS of various services in a cluster with mixed workload deployment is challenging. Existing solutions preempt resources from batch jobs to satisfy the resource requirements of latency-sensitive tasks without taking the underutilized resources of swollen tasks into account, which inevitably compromises the performance of batch jobs. We therefore try to meet the resource requirements of newly arriving latency-sensitive tasks and straggler tasks with these underutilized resources instead of directly preempting resources from batch jobs.
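The notions of swollen and straggler tasks can be illustrated with a small classification sketch. The metrics and thresholds below (a CPU-utilization ratio and a progress lag) are assumptions made purely for illustration; the abstract does not describe how CERES actually identifies such tasks.

# Illustrative sketch only: the metrics (CPU utilization, progress lag)
# and the thresholds below are assumptions; the abstract does not
# specify how CERES identifies swollen or straggler tasks.
from dataclasses import dataclass

@dataclass
class TaskStats:
    allocated_cpu: float      # cores reserved for the task
    used_cpu: float           # cores actually consumed on average
    progress: float           # fraction of work completed (0.0 - 1.0)
    expected_progress: float  # progress expected at this point in time

def classify(task: TaskStats,
             util_threshold: float = 0.5,
             lag_threshold: float = 0.2) -> str:
    """Label a task as 'swollen', 'straggler', or 'normal'."""
    utilization = task.used_cpu / task.allocated_cpu
    lag = task.expected_progress - task.progress
    if utilization < util_threshold:
        return "swollen"    # holds far more resources than it uses
    if lag > lag_threshold:
        return "straggler"  # falls well behind its expected progress
    return "normal"

# A task using 1 of its 4 allocated cores is flagged as swollen.
print(classify(TaskStats(allocated_cpu=4, used_cpu=1,
                         progress=0.6, expected_progress=0.65)))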
This paper presents CERES, which aims to ensure the QoS of latency-sensitive services while reducing the performance impact on batch jobs. First, CERES periodically screens out swollen tasks from batch jobs and straggler tasks from latency-sensitive services. Second, when the idle resources in the cluster cannot meet the resource requirements of newly arriving latency-sensitive tasks and the straggler tasks, CERES reclaims resources from the swollen tasks and, if necessary, preempts resources from common batch tasks. If there are sufficient allocable resources in the cluster, CERES instead expands the resources of the straggler tasks. We have implemented CERES on Hadoop YARN and conducted comprehensive experiments. The results show that, compared with the state-of-the-art approach, CERES decreases the task completion time of latency-sensitive services by 20.77%, reduces performance losses to batch jobs by 15.46%, and improves cluster resource utilization by 27.06%.
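Read as a control loop, the behaviour described above (screen tasks, reclaim from swollen tasks, preempt common batch tasks only as a last resort, or expand stragglers when capacity allows) can be sketched as follows. The one-dimensional CPU model, the class and function names, and the example numbers are illustrative assumptions, not CERES's actual YARN-based implementation.

# Self-contained sketch of the rebalancing behaviour described above.
# The one-dimensional CPU model and all names are assumptions made for
# illustration; CERES itself manages YARN containers.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    allocated: int   # cores currently held
    needed: int      # cores the task actually uses
    kind: str        # "batch" or "latency-sensitive"

@dataclass
class Cluster:
    capacity: int
    running: List[Task] = field(default_factory=list)

    def idle(self) -> int:
        return self.capacity - sum(t.allocated for t in self.running)

    def reclaim_swollen(self, task: Task) -> None:
        task.allocated = task.needed   # shrink to what is really used

    def preempt(self, task: Task) -> None:
        self.running.remove(task)      # free all of the task's cores

def rebalance(cluster: Cluster, new_ls_demand: int, straggler_demand: int) -> None:
    demand = new_ls_demand + straggler_demand
    if cluster.idle() >= demand:
        return  # enough spare capacity: stragglers can simply be expanded
    # Step 1: reclaim underutilized resources from swollen batch tasks.
    for t in cluster.running:
        if t.kind == "batch" and t.allocated > t.needed:
            cluster.reclaim_swollen(t)
            if cluster.idle() >= demand:
                return
    # Step 2: only if still short, preempt common batch tasks.
    for t in list(cluster.running):
        if t.kind == "batch":
            cluster.preempt(t)
            if cluster.idle() >= demand:
                return

# Example: a 16-core cluster with one swollen batch task (holds 8, needs 3).
c = Cluster(capacity=16, running=[Task("batch-1", 8, 3, "batch"),
                                  Task("ls-1", 6, 6, "latency-sensitive")])
rebalance(c, new_ls_demand=5, straggler_demand=0)
print(c.idle())  # 7: shrinking the swollen task freed 5 cores, no preemption needed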



Information & Contributors

Information

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2021


Author Tags

  1. containerized task
  2. elastic resource management
  3. mixed workload deployment
  4. task filter

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2024) A Survey on Spatio-Temporal Big Data Analytics Ecosystem: Resource Management, Processing Platform, and Applications. IEEE Transactions on Big Data 10(2), 174–193. https://doi.org/10.1109/TBDATA.2023.3342619. Online publication date: Apr-2024
  • (2023) Tango: Harmonious Management and Scheduling for Mixed Services Co-located among Distributed Edge-Clouds. Proceedings of the 52nd International Conference on Parallel Processing, 595–604. https://doi.org/10.1145/3605573.3605589. Online publication date: 13-Sep-2023
  • (2023) Topology-Aware Self-Adaptive Resource Provisioning for Microservices. 2023 IEEE International Conference on Web Services (ICWS), 28–35. https://doi.org/10.1109/ICWS60048.2023.00016. Online publication date: Jul-2023
  • (2022) TERMS. Journal of Parallel and Distributed Computing 170(C), 74–85. https://doi.org/10.1016/j.jpdc.2022.08.005. Online publication date: 1-Dec-2022
