DOI: 10.1145/3605573.3605627
research-article
Open access

SPLIT: QoS-Aware DNN Inference on Shared GPU via Evenly-Sized Model Splitting

Published: 13 September 2023

Abstract

When multiple deep learning inference (DLI) tasks share a single edge computing processor, improving QoS by simultaneously reducing the latency violation rate and jitter remains a challenge. Existing DLI systems at the edge are designed to maximize throughput, and they perform poorly when confronted with requests that have varying QoS requirements.
In this paper, we present SPLIT, a QoS-aware DNN inference system for a shared GPU that uses evenly-sized model splitting to reduce the latency violation rate and jitter. SPLIT applies a genetic algorithm to split each model evenly into diverse operator combinations, called blocks, minimizing the standard deviation of block execution time in order to reduce jitter. Furthermore, we develop a preemption method based on a greedy algorithm that swiftly assesses whether an incoming request should preempt, so as to minimize latency. We evaluate SPLIT with five common deep learning models. The experimental results show that SPLIT outperforms state-of-the-art approaches, reducing the latency violation rate by up to 43% and jitter by up to 69.3%.
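
The idea of evenly-sized splitting can be illustrated with a small sketch. The Python code below is not the authors' implementation: it assumes a model is a linear sequence of operators with known per-operator execution times, that a split is a set of cut points between operators, and that fitness is the population standard deviation of block execution times. All function names, the population size, the mutation rate, and the choice to seed the population with an equal-operator-count split are hypothetical choices made for illustration.

```python
import random
import statistics


def block_times(op_times, cuts):
    """Sum the operator times inside each block delimited by the cut indices."""
    bounds = [0, *sorted(cuts), len(op_times)]
    return [sum(op_times[a:b]) for a, b in zip(bounds, bounds[1:])]


def fitness(op_times, cuts):
    """Population standard deviation of block times (lower means more even)."""
    return statistics.pstdev(block_times(op_times, cuts))


def equal_count_cuts(n_ops, n_blocks):
    """Baseline split: roughly the same number of operators per block."""
    return tuple(i * n_ops // n_blocks for i in range(1, n_blocks))


def random_cuts(n_ops, n_blocks, rng):
    """Pick n_blocks - 1 distinct cut positions uniformly at random."""
    return tuple(sorted(rng.sample(range(1, n_ops), n_blocks - 1)))


def crossover(a, b, n_blocks, rng):
    """The child draws its cut points from the union of both parents' cuts."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, n_blocks - 1)))


def mutate(cuts, n_ops, rng):
    """Move one cut point to a randomly chosen unused position."""
    cuts = list(cuts)
    free = [i for i in range(1, n_ops) if i not in cuts]
    if free:
        cuts[rng.randrange(len(cuts))] = rng.choice(free)
    return tuple(sorted(cuts))


def split_evenly(op_times, n_blocks, pop_size=30, generations=200, seed=0):
    """Search for cut points that minimize the spread of block times."""
    n_ops = len(op_times)
    if not 2 <= n_blocks <= n_ops:
        raise ValueError("need 2 <= n_blocks <= number of operators")
    rng = random.Random(seed)
    # Seed with the equal-count baseline so the result is never worse than it.
    pop = [equal_count_cuts(n_ops, n_blocks)]
    pop += [random_cuts(n_ops, n_blocks, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(op_times, c))
        elite = pop[: pop_size // 2]  # elitism: the best half always survives
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = crossover(a, b, n_blocks, rng)
            if rng.random() < 0.3:  # hypothetical mutation rate
                child = mutate(child, n_ops, rng)
            children.append(child)
        pop = elite + children
    best = min(pop, key=lambda c: fitness(op_times, c))
    return best, block_times(op_times, best)
```

Calling `split_evenly([5.0, 1.0, 1.0, 4.0, 2.0, 2.0, 3.0, 6.0], 3)` returns the chosen cut points together with the resulting block times. Because the best half of each generation is carried over unchanged, the returned split has a standard deviation no larger than that of the equal-operator-count baseline.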

Supplemental Material

Stage 1 (Artifact Description): apdx187s1-filled_template.zip (ZIP file)
Stage 2 (Badges): apdx187s2-filled_template_compiled_pdf_outfn.pdf (PDF file) and apdx187s2-filled_template.zip (ZIP file)


Cited By

  • (2024) Objective-Driven Differentiable Optimization of Traffic Prediction and Resource Allocation for Split AI Inference Edge Networks. IEEE Transactions on Machine Learning in Communications and Networking 2, 1178–1192. DOI: 10.1109/TMLCN.2024.3449831. Online publication date: 2024
  • (2024) MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), 502–512. DOI: 10.1109/CLOUD62652.2024.00063. Online publication date: 7-Jul-2024


      Information & Contributors

      Information

      Published In

      ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
      August 2023
      858 pages
      ISBN: 9798400708435
      DOI: 10.1145/3605573
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. QoS
      2. block
      3. deep learning inference
      4. evenly-sized
      5. model splitting

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Provincial Key Research and Development Program of Shandong, China
      • National Natural Science Foundation of China

      Conference

      ICPP 2023
      ICPP 2023: 52nd International Conference on Parallel Processing
      August 7 - 10, 2023
      Salt Lake City, UT, USA

      Acceptance Rates

      Overall Acceptance Rate 91 of 313 submissions, 29%


      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 474
      • Downloads (Last 6 weeks): 58
      Reflects downloads up to 08 Mar 2025
