DOI: 10.1145/3588195.3592997 | HPDC Conference Proceedings
Research article | Open access

Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources

Published: 07 August 2023

Abstract

Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demand. Despite their revenue-generating capability, these services must operate under tight Quality-of-Service (QoS) and cost-budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead and to distribute inference queries optimally across that pool at runtime. Our evaluation using industry-grade machine learning (ML) models shows that KAIROS yields up to 2x the throughput of an optimal homogeneous solution and outperforms state-of-the-art schemes by up to 70%, even when the competing schemes are given advantageous implementations that ignore their exploration overhead.
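The abstract describes the runtime behavior only at a high level: a pool of heterogeneous instances is built offline, and incoming queries are then distributed across it subject to a QoS target. As a purely illustrative sketch of that query-distribution idea (not KAIROS's actual policy; the instance types, per-query latencies, and arrival times below are hypothetical), the following Python snippet greedily sends each query to the instance expected to finish it earliest and flags queries that would miss the QoS target.

from dataclasses import dataclass

@dataclass
class Instance:
    # One cloud instance in a heterogeneous pool (hypothetical model, not from the paper).
    name: str
    service_time: float   # per-query service latency on this hardware, in seconds
    free_at: float = 0.0  # time at which the instance next becomes free

def dispatch(queries, pool, qos_target):
    """Greedy earliest-finish-time dispatch: each query goes to the instance that
    would complete it soonest; queries that would exceed the QoS target are flagged.
    This is an illustrative baseline only, not the scheduling policy from the paper."""
    placements = []
    for arrival, qid in queries:  # queries are (arrival_time, id), sorted by arrival
        best = min(pool, key=lambda inst: max(arrival, inst.free_at) + inst.service_time)
        start = max(arrival, best.free_at)
        finish = start + best.service_time
        best.free_at = finish
        placements.append((qid, best.name, round(finish, 3), finish - arrival <= qos_target))
    return placements

# Example with made-up instance types, per-query latencies, and arrival times.
pool = [
    Instance("gpu.large", service_time=0.025),
    Instance("cpu.xlarge", service_time=0.040),
    Instance("cpu.large", service_time=0.070),
]
queries = [(0.000, "q1"), (0.001, "q2"), (0.002, "q3"), (0.003, "q4")]
for row in dispatch(queries, pool, qos_target=0.100):
    print(row)  # (query id, chosen instance, finish time, met QoS?)

Note that this baseline ignores the two harder parts of the problem the abstract highlights: choosing the heterogeneous pool itself under a cost budget, and doing so without online exploration overhead.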





        Published In

        HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
        August 2023, 350 pages
        ISBN: 9798400701559
        DOI: 10.1145/3588195
        General Chair: Ali R. Butt
        Program Chairs: Ningfang Mi, Kyle Chard


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 07 August 2023


        Author Tags

        1. heterogeneous hardware
        2. inference systems
        3. machine learning

        Qualifiers

        • Research-article

        Funding Sources

        • United States Air Force Research Laboratory
        • Assistant Secretary of Defense for Research and Engineering

        Conference

        HPDC '23

        Acceptance Rates

        Overall Acceptance Rate 166 of 966 submissions, 17%


        Article Metrics

        • Downloads (last 12 months): 763
        • Downloads (last 6 weeks): 50
        Reflects downloads up to 11 Dec 2024.


        Cited By

        • (2024) Flexible Deployment of Machine Learning Inference Pipelines in the Cloud–Edge–IoT Continuum. Electronics, 13(10):1888. DOI: 10.3390/electronics13101888. Online publication date: 11-May-2024.
        • (2024) Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 267-280. DOI: 10.1145/3625549.3658688. Online publication date: 3-Jun-2024.
        • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757. Online publication date: 27-Dec-2023.
        • (2023) Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3581784.3607034. Online publication date: 12-Nov-2023.
