Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale

Authors:

Ali Hossein Abbasi Abyaneh,

Seyed Majid Zahedi

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 6, Issue 3

Article No.: 59, Pages 1 - 25

https://doi.org/10.1145/3570611

Published: 08 December 2022

Abstract

We consider the problem of balancing the load among servers in dense racks for microsecond-scale workloads. To balance the load in such settings, tens of millions of scheduling decisions have to be made per second. Achieving this throughput while providing microsecond-scale latency and high availability is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework in which servers collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent-learning theory. We empirically show that our proposed algorithm, Malcolm, is adaptive and scalable while outperforming state-of-the-art alternatives. In homogeneous settings, Malcolm performs as well as the best alternative among other baselines. In heterogeneous settings at lower loads, Malcolm improves tail latency by up to a factor of four compared to other baselines. For the same tail latency, Malcolm achieves up to 60% more throughput than the best alternative among other baselines.
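The "tens of millions of scheduling decisions per second" figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the rack size, core count, service time, and utilization are assumed values chosen for the example, not numbers taken from the paper, and the helper `decisions_per_second` is a hypothetical name.

```python
def decisions_per_second(servers: int, cores_per_server: int,
                         service_time_us: float, utilization: float) -> float:
    """Estimate rack-wide request rate, assuming one scheduling decision per request."""
    # Requests per second that a single fully busy core completes.
    per_core_rate = 1_000_000 / service_time_us
    # Scale by the number of cores in the rack and by how busy they are.
    return servers * cores_per_server * per_core_rate * utilization


# Assumed example: 32 servers x 32 cores, 20-microsecond requests, 50% load.
rate = decisions_per_second(servers=32, cores_per_server=32,
                            service_time_us=20.0, utilization=0.5)
print(f"{rate:,.0f} decisions per second")
```

Under these assumed parameters the estimate lands in the tens of millions per second, which is the regime the abstract describes; shorter service times or denser racks push the figure higher.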

Cited By

Zadeh S, Zandi F, Buckley M, Ganjali Y. (2023). Meta-Migration: Reducing Switch Migration Tail Latency Through Competition. 2023 IFIP Networking Conference (IFIP Networking), 1-9. Online publication date: 12-Jun-2023.
https://doi.org/10.23919/IFIPNetworking57963.2023.10186446

Index Terms

Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
    2. Other architectures
      1. Heterogeneous (hybrid) systems
2. Computing methodologies
  1. Artificial intelligence
    1. Distributed artificial intelligence
      1. Cooperation and coordination
      2. Multi-agent systems
  2. Machine learning
    1. Machine learning algorithms

Information & Contributors

Information

Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems Volume 6, Issue 3

POMACS

December 2022

534 pages

EISSN:2476-1249

DOI:10.1145/3576048

Editors:
Augustin Chaintreau, Columbia University
Leana Golubchik, University of Southern California, United States
Zhi-Li Zhang, University of Minnesota, United States

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

CFI-JELF
ORF-RI

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 1
Total Downloads: 204
Downloads (Last 12 months): 53
Downloads (Last 6 weeks): 9

Reflects downloads up to 09 Jan 2025
