DOI: 10.1145/3588195.3592997 | HPDC Conference Proceedings
Research article | Open access

Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources

Published: 07 August 2023

Abstract

Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demand. Despite their revenue-generating capability, these services must operate under tight Quality-of-Service (QoS) and cost-budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead and to distribute inference queries optimally across that pool at runtime. Our evaluation using industry-grade machine learning (ML) models shows that KAIROS yields up to 2x the throughput of an optimal homogeneous solution and outperforms state-of-the-art schemes by up to 70%, even when the competing schemes are given advantageous implementations that ignore their exploration overhead.
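The abstract describes the runtime behavior only at a high level: a pool of heterogeneous instances is built offline, and incoming queries are then distributed across it subject to a QoS target. As a purely illustrative sketch of that query-distribution idea (not KAIROS's actual policy; the instance types, per-query latencies, and arrival times below are hypothetical), the following Python snippet greedily sends each query to the instance expected to finish it earliest and flags queries that would miss the QoS target.

from dataclasses import dataclass

@dataclass
class Instance:
    # One cloud instance in a heterogeneous pool (hypothetical model, not from the paper).
    name: str
    service_time: float   # per-query service latency on this hardware, in seconds
    free_at: float = 0.0  # time at which the instance next becomes free

def dispatch(queries, pool, qos_target):
    """Greedy earliest-finish-time dispatch: each query goes to the instance that
    would complete it soonest; queries that would exceed the QoS target are flagged.
    This is an illustrative baseline only, not the scheduling policy from the paper."""
    placements = []
    for arrival, qid in queries:  # queries are (arrival_time, id), sorted by arrival
        best = min(pool, key=lambda inst: max(arrival, inst.free_at) + inst.service_time)
        start = max(arrival, best.free_at)
        finish = start + best.service_time
        best.free_at = finish
        placements.append((qid, best.name, round(finish, 3), finish - arrival <= qos_target))
    return placements

# Example with made-up instance types, per-query latencies, and arrival times.
pool = [
    Instance("gpu.large", service_time=0.025),
    Instance("cpu.xlarge", service_time=0.040),
    Instance("cpu.large", service_time=0.070),
]
queries = [(0.000, "q1"), (0.001, "q2"), (0.002, "q3"), (0.003, "q4")]
for row in dispatch(queries, pool, qos_target=0.100):
    print(row)  # (query id, chosen instance, finish time, met QoS?)

Note that this baseline ignores the two harder parts of the problem the abstract highlights: choosing the heterogeneous pool itself under a cost budget, and doing so without online exploration overhead.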





        Published In

        HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
        August 2023, 350 pages
        ISBN: 9798400701559
        DOI: 10.1145/3588195
        General Chair: Ali R. Butt
        Program Chairs: Ningfang Mi, Kyle Chard


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 07 August 2023


        Author Tags

        1. heterogeneous hardware
        2. inference systems
        3. machine learning

        Qualifiers

        • Research-article

        Funding Sources

        • United States Air Force Research Laboratory
        • Assistant Secretary of Defense for Research and Engineering

        Conference

        HPDC '23

        Acceptance Rates

        Overall Acceptance Rate 166 of 966 submissions, 17%


        Article Metrics

        • Downloads (last 12 months): 763
        • Downloads (last 6 weeks): 50
        Reflects downloads up to 11 Dec 2024.


        Cited By

        • (2024) Flexible Deployment of Machine Learning Inference Pipelines in the Cloud–Edge–IoT Continuum. Electronics, 13(10):1888. DOI: 10.3390/electronics13101888. Online publication date: 11-May-2024.
        • (2024) Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 267-280. DOI: 10.1145/3625549.3658688. Online publication date: 3-Jun-2024.
        • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757. Online publication date: 27-Dec-2023.
        • (2023) Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3581784.3607034. Online publication date: 12-Nov-2023.
