research-article

Learning Cooperative Oversubscription for Cloud by Chance-Constrained Multi-Agent Reinforcement Learning

Authors:

Xiangfeng Wang,

Saravan Rajmohan,

Dongmei Zhang

WWW '23: Proceedings of the ACM Web Conference 2023

Pages 2927 - 2936

https://doi.org/10.1145/3543507.3583298

Published: 30 April 2023

Abstract

Oversubscription is a common practice for improving cloud resource utilization. It allows the cloud service provider to sell more resources than the physical limit, assuming not all users would fully utilize the resources simultaneously. However, how to design an oversubscription policy that improves utilization while satisfying some safety constraints remains an open problem. Existing methods and industrial practices are over-conservative, ignoring the coordination of diverse resource usage patterns and probabilistic constraints. To address these two limitations, this paper formulates cloud oversubscription as a chance-constrained optimization problem and proposes an effective Chance-Constrained Multi-Agent Reinforcement Learning (C2MARL) method to solve this problem. Specifically, C2MARL reduces the number of constraints by considering their upper bounds and leverages a multi-agent reinforcement learning paradigm to learn a safe and optimal coordination policy. We evaluate C2MARL on an internal cloud platform and public cloud datasets. Experiments show that C2MARL outperforms existing methods in improving utilization under different levels of safety constraints.
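
To make the idea of a chance constraint concrete, the following minimal sketch estimates how many virtual machines a single host could accept while keeping the probability of exceeding physical capacity below a threshold. It is not the C2MARL method described in the abstract: it is a plain Monte Carlo search on one host, and everything in it is an assumption made for illustration, including the function names `violation_rate` and `max_safe_vms`, the parameter `delta` for the allowed violation probability, and the choice of independent, uniformly distributed per-VM utilization.

```python
import random


def violation_rate(n_vms, capacity, vm_size, n_samples, rng):
    """Estimate P(total demand > capacity) for n_vms placed on one host.

    Assumption for illustration: each VM uses a uniformly random fraction
    of its allocated size, independently of the other VMs.
    """
    violations = 0
    for _ in range(n_samples):
        demand = sum(vm_size * rng.random() for _ in range(n_vms))
        if demand > capacity:
            violations += 1
    return violations / n_samples


def max_safe_vms(capacity, vm_size, delta, n_samples=2000, seed=0):
    """Largest VM count whose estimated violation rate stays <= delta."""
    rng = random.Random(seed)
    # Without oversubscription the host holds capacity // vm_size VMs,
    # and demand can never exceed capacity.
    n = int(capacity // vm_size)
    # Keep admitting one more VM while the chance constraint still holds.
    while violation_rate(n + 1, capacity, vm_size, n_samples, rng) <= delta:
        n += 1
    return n


if __name__ == "__main__":
    capacity, vm_size = 64, 4
    baseline = capacity // vm_size
    n = max_safe_vms(capacity, vm_size, delta=0.05)
    print(f"baseline VMs: {baseline}, VMs under chance constraint: {n}")
    print(f"oversubscription ratio: {n / baseline:.2f}")
```

Under these toy assumptions, relaxing `delta` admits more VMs and tightening it admits fewer, which is the trade-off between utilization and safety that the abstract refers to. A realistic setting would replace the uniform sampler with observed usage traces.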

Index Terms

1. Computing methodologies
  1. Artificial intelligence
2. Networks
  1. Network services
    1. Cloud computing

Published In

WWW '23: Proceedings of the ACM Web Conference 2023

April 2023

4293 pages

ISBN:9781450394161

DOI:10.1145/3543507

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
