Research Article · ICPP '21 Conference Proceedings · DOI: 10.1145/3472456.3472501

AMPS-Inf: Automatic Model Partitioning for Serverless Inference with Cost Efficiency

Published: 05 October 2021

Abstract

The salient pay-per-use nature of serverless computing has driven its continuous adoption as an alternative computing paradigm for various workloads. Yet challenges arise and remain open when shifting machine learning workloads to the serverless environment. Specifically, the restriction on deployment size over serverless platforms, combined with the complexity of neural network models, makes it difficult to deploy large models in a single serverless function. In this paper, we aim to fully exploit the advantages of the serverless computing paradigm for machine learning workloads, reducing management effort and overall cost while meeting the response-time Service Level Objective (SLO). We design and implement AMPS-Inf, an autonomous framework customized for model inference in serverless computing. Driven by cost-efficiency and timely response, AMPS-Inf automatically generates the optimal execution and resource-provisioning plans for inference workloads. The core of AMPS-Inf relies on the formulation and solution of a Mixed-Integer Quadratic Programming (MIQP) problem for model partitioning and resource provisioning, with the objective of minimizing cost without violating the response-time SLO. We deploy AMPS-Inf on the AWS Lambda platform, evaluate it with state-of-the-art pre-trained Keras models including ResNet50, Inception-V3 and Xception, and compare it with Amazon SageMaker and three baselines. Experimental results demonstrate that AMPS-Inf achieves up to 98% cost savings without degrading response-time performance.
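
The paper's full formulation is not reproduced on this page, but the shape of such an MIQP can be sketched. The following is a minimal, hypothetical illustration of layer partitioning, not AMPS-Inf's actual model: the per-layer memory and latency profiles, the deployment cap, the SLO, and the per-cut transfer penalty are all invented for the example, and contiguity constraints along the layer chain are omitted for brevity. It uses the Gurobi Python API (Gurobi appears among the paper's references as the solver) to assign layers to serverless functions, with a quadratic memory-times-duration cost proxy in the spirit of Lambda's GB-second pricing.

    # Hypothetical MIQP sketch: split a chain of model layers across serverless
    # functions, minimizing a pay-per-use cost proxy under a deployment-size cap
    # and a response-time SLO. Illustrative only -- not the AMPS-Inf formulation.
    import gurobipy as gp
    from gurobipy import GRB

    mem = [120, 95, 140, 80, 60]   # assumed per-layer weight footprint (MB)
    lat = [30, 25, 40, 20, 15]     # assumed per-layer compute latency (ms)
    n_layers, n_parts = len(mem), 3
    MEM_CAP = 250                  # assumed per-function deployment-size limit (MB)
    SLO = 400                      # assumed end-to-end response-time SLO (ms)
    XFER = 50                      # assumed transfer latency per partition cut (ms)

    m = gp.Model("partition-sketch")
    m.Params.NonConvex = 2         # allow products of binary variables

    # x[i, j] = 1 iff layer i is placed in function (partition) j.
    x = m.addVars(n_layers, n_parts, vtype=GRB.BINARY, name="x")

    # Each layer is assigned to exactly one partition.
    m.addConstrs(x.sum(i, "*") == 1 for i in range(n_layers))

    # Each partition must fit within the deployment-size limit.
    m.addConstrs(
        gp.quicksum(mem[i] * x[i, j] for i in range(n_layers)) <= MEM_CAP
        for j in range(n_parts))

    # Consecutive layers in different partitions incur a transfer penalty; the
    # binary product x[i, j] * x[i + 1, j] is what makes the program quadratic.
    colocated = gp.quicksum(x[i, j] * x[i + 1, j]
                            for i in range(n_layers - 1)
                            for j in range(n_parts))
    m.addConstr(sum(lat) + XFER * ((n_layers - 1) - colocated) <= SLO, "slo")

    # Cost proxy per partition: provisioned memory times running time,
    # echoing Lambda's memory-duration pricing.
    m.setObjective(
        gp.quicksum(
            gp.quicksum(mem[i] * x[i, j] for i in range(n_layers)) *
            gp.quicksum(lat[i] * x[i, j] for i in range(n_layers))
            for j in range(n_parts)),
        GRB.MINIMIZE)

    m.optimize()
    for j in range(n_parts):
        layers = [i for i in range(n_layers) if x[i, j].X > 0.5]
        print(f"function {j}: layers {layers}")

A real formulation would additionally decide each function's memory configuration, which on Lambda sets both the price per millisecond and the execution speed, and would require each partition to be contiguous along the model's layer graph.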

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021, 927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. cost efficiency
  2. machine learning inference
  3. serverless computing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 159
  • Downloads (last 6 weeks): 12
Reflects downloads up to 25 Dec 2024

Cited By

  • (2024) Advancing Serverless Computing for Scalable AI Model Inference: Challenges and Opportunities. Proceedings of the 10th International Workshop on Serverless Computing, 1-6. https://doi.org/10.1145/3702634.3702950. Online publication date: 2-Dec-2024.
  • (2024) Pre-Warming is Not Enough: Accelerating Serverless Inference With Opportunistic Pre-Loading. Proceedings of the 2024 ACM Symposium on Cloud Computing, 178-195. https://doi.org/10.1145/3698038.3698509. Online publication date: 20-Nov-2024.
  • (2024) StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8(1), 1-29. https://doi.org/10.1145/3639028. Online publication date: 21-Feb-2024.
  • (2024) Demystifying the Cost of Serverless Computing: Towards a Win-Win Deal. IEEE Transactions on Parallel and Distributed Systems 35(1), 59-72. https://doi.org/10.1109/TPDS.2023.3330849. Online publication date: 1-Jan-2024.
  • (2024) Exploiting Processor Heterogeneity to Improve Throughput and Reduce Latency for Deep Neural Network Inference. 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 37-48. https://doi.org/10.1109/SBAC-PAD63648.2024.00012. Online publication date: 13-Nov-2024.
  • (2024) HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 1-10. https://doi.org/10.1109/IWQoS61813.2024.10682915. Online publication date: 19-Jun-2024.
  • (2024) On Efficient Zygote Container Planning and Task Scheduling for Edge Native Application Acceleration. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 2259-2268. https://doi.org/10.1109/INFOCOM52122.2024.10621106. Online publication date: 20-May-2024.
  • (2024) Proactive Elastic Scheduling for Serverless Ensemble Inference Services. 2024 IEEE International Conference on Web Services (ICWS), 1025-1035. https://doi.org/10.1109/ICWS62655.2024.00121. Online publication date: 7-Jul-2024.
  • (2024) FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2109-2122. https://doi.org/10.1109/ICDE60146.2024.00168. Online publication date: 13-May-2024.
  • (2024) Serverless application composition leveraging function fusion. Future Generation Computer Systems 153(C), 403-418. https://doi.org/10.1016/j.future.2023.12.010. Online publication date: 16-May-2024.
