research-article

Open access

SPLIT: QoS-Aware DNN Inference on Shared GPU via Evenly-Sized Model Splitting

Authors:

Wenbo Zhang

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing

Pages 605 - 614

https://doi.org/10.1145/3605573.3605627

Published: 13 September 2023

Abstract

Improving QoS by simultaneously reducing the latency violation rate and jitter remains a challenge when multiple deep learning inference (DLI) tasks share a single edge computing processor. Existing DLI systems at the edge are designed to maximize throughput and struggle when confronted with requests that have varying QoS requirements.

In this paper, we present SPLIT, a QoS-aware DNN inference system on shared GPU via evenly-sized model splitting to improve QoS by reducing the latency violation rate and jitter. SPLIT applies a genetic algorithm to evenly split models into diverse operator combinations, or blocks, thereby minimizing the standard deviation of block execution time to reduce jitter. Furthermore, we develop a preemption method based on a greedy algorithm to swiftly assess whether an incoming request should preempt to minimize latency. We evaluate SPLIT with five common deep learning models and the experimental results reveal that SPLIT outperforms state-of-the-art approaches, reducing the latency violation rate by up to 43% and jitter by up to 69.3%.
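
To make the objective of evenly-sized splitting concrete, the following is a minimal sketch, not the authors' implementation. It assumes a model can be viewed as a linear chain of operators with known execution times, and it replaces the genetic algorithm named in the abstract with a plain exhaustive search over cut points, which is only practical for short chains. The names `block_times` and `split_evenly` and the sample timings are hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def block_times(op_times, cuts):
    """Sum operator times inside each contiguous block defined by the cut indices."""
    bounds = [0, *cuts, len(op_times)]
    return [sum(op_times[a:b]) for a, b in zip(bounds, bounds[1:])]


def split_evenly(op_times, num_blocks):
    """Return (cuts, per-block times) minimizing the standard deviation of block time.

    Exhaustive search stands in for the genetic algorithm described in the
    abstract; this is an assumption made for illustration only.
    """
    n = len(op_times)
    if not 1 <= num_blocks <= n:
        raise ValueError("num_blocks must be between 1 and the number of operators")
    # Every choice of num_blocks - 1 cut positions yields one candidate split.
    best = min(
        combinations(range(1, n), num_blocks - 1),
        key=lambda cuts: pstdev(block_times(op_times, cuts)),
    )
    return list(best), block_times(op_times, best)


if __name__ == "__main__":
    # Hypothetical per-operator execution times in milliseconds.
    times = [4, 1, 1, 2, 3, 1]
    cuts, blocks = split_evenly(times, 3)
    print(cuts, blocks)  # [1, 4] [4, 4, 4]
```

The number of candidate splits grows combinatorially with the number of operators, which suggests why a search heuristic is attractive for real models; the sketch only illustrates the objective being minimized.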

Supplemental Material

PDF File (281.63 KB)
ZIP File (195.69 KB)
ZIP File (199.35 KB)

Stage 1-Artifact Description: apdx187s1-filled_template.zip
Stage 2-Badges: apdx187s2-filled_template_compiled_pdf_outfn.pdf & apdx187s2-filled_template.zip

Cited By

Lyu X, Li Y, He Y, Ren C, Ni W, Liu R, Zhu P, Cui Q. 2024. Objective-Driven Differentiable Optimization of Traffic Prediction and Resource Allocation for Split AI Inference Edge Networks. IEEE Transactions on Machine Learning in Communications and Networking 2, 1178–1192. Online publication date: 2024. https://doi.org/10.1109/TMLCN.2024.3449831
Nabavinejad S, Reda S, Guo T. 2024. MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs. In 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), 502–512. Online publication date: 7-Jul-2024. https://doi.org/10.1109/CLOUD62652.2024.00063

Index Terms

SPLIT: QoS-Aware DNN Inference on Shared GPU via Evenly-Sized Model Splitting
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
      1. Embedded software
2. Computing methodologies
  1. Machine learning

Information & Contributors

Information

Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing

August 2023

858 pages

ISBN:9798400708435

DOI:10.1145/3605573

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Provincial Key Research and Development Program of Shandong, China
National Natural Science Foundation of China

Conference

ICPP 2023

ICPP 2023: 52nd International Conference on Parallel Processing

August 7 - 10, 2023

Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Bibliometrics & Citations

Total Citations: 1
Total Downloads: 673
Downloads (Last 12 months): 474
Downloads (Last 6 weeks): 58

Reflects downloads up to 08 Mar 2025
