Research Article · ICPP '21 Conference Proceedings · DOI: 10.1145/3472456.3472501

AMPS-Inf: Automatic Model Partitioning for Serverless Inference with Cost Efficiency

Published: 05 October 2021

Abstract

The salient pay-per-use nature of serverless computing has driven its continuous adoption as an alternative computing paradigm for various workloads. Yet challenges arise and remain open when shifting machine learning workloads to the serverless environment. Specifically, the restriction on deployment size over serverless platforms, combined with the complexity of neural network models, makes it difficult to deploy large models in a single serverless function. In this paper, we aim to fully exploit the advantages of the serverless computing paradigm for machine learning workloads, reducing management effort and overall cost while meeting the response-time Service Level Objective (SLO). We design and implement AMPS-Inf, an autonomous framework customized for model inference in serverless computing. Driven by cost-efficiency and timely response, AMPS-Inf automatically generates the optimal execution and resource-provisioning plans for inference workloads. The core of AMPS-Inf relies on the formulation and solution of a Mixed-Integer Quadratic Programming (MIQP) problem for model partitioning and resource provisioning, with the objective of minimizing cost without violating the response-time SLO. We deploy AMPS-Inf on the AWS Lambda platform, evaluate it with state-of-the-art pre-trained Keras models including ResNet50, Inception-V3 and Xception, and compare it with Amazon SageMaker and three baselines. Experimental results demonstrate that AMPS-Inf achieves up to 98% cost savings without degrading response-time performance.
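
The paper's full formulation is not reproduced on this page, but the shape of such an MIQP can be sketched. The following is a minimal, hypothetical illustration of layer partitioning, not AMPS-Inf's actual model: the per-layer memory and latency profiles, the deployment cap, the SLO, and the per-cut transfer penalty are all invented for the example, and contiguity constraints along the layer chain are omitted for brevity. It uses the Gurobi Python API (Gurobi appears among the paper's references as the solver) to assign layers to serverless functions, with a quadratic memory-times-duration cost proxy in the spirit of Lambda's GB-second pricing.

    # Hypothetical MIQP sketch: split a chain of model layers across serverless
    # functions, minimizing a pay-per-use cost proxy under a deployment-size cap
    # and a response-time SLO. Illustrative only -- not the AMPS-Inf formulation.
    import gurobipy as gp
    from gurobipy import GRB

    mem = [120, 95, 140, 80, 60]   # assumed per-layer weight footprint (MB)
    lat = [30, 25, 40, 20, 15]     # assumed per-layer compute latency (ms)
    n_layers, n_parts = len(mem), 3
    MEM_CAP = 250                  # assumed per-function deployment-size limit (MB)
    SLO = 400                      # assumed end-to-end response-time SLO (ms)
    XFER = 50                      # assumed transfer latency per partition cut (ms)

    m = gp.Model("partition-sketch")
    m.Params.NonConvex = 2         # allow products of binary variables

    # x[i, j] = 1 iff layer i is placed in function (partition) j.
    x = m.addVars(n_layers, n_parts, vtype=GRB.BINARY, name="x")

    # Each layer is assigned to exactly one partition.
    m.addConstrs(x.sum(i, "*") == 1 for i in range(n_layers))

    # Each partition must fit within the deployment-size limit.
    m.addConstrs(
        gp.quicksum(mem[i] * x[i, j] for i in range(n_layers)) <= MEM_CAP
        for j in range(n_parts))

    # Consecutive layers in different partitions incur a transfer penalty; the
    # binary product x[i, j] * x[i + 1, j] is what makes the program quadratic.
    colocated = gp.quicksum(x[i, j] * x[i + 1, j]
                            for i in range(n_layers - 1)
                            for j in range(n_parts))
    m.addConstr(sum(lat) + XFER * ((n_layers - 1) - colocated) <= SLO, "slo")

    # Cost proxy per partition: provisioned memory times running time,
    # echoing Lambda's memory-duration pricing.
    m.setObjective(
        gp.quicksum(
            gp.quicksum(mem[i] * x[i, j] for i in range(n_layers)) *
            gp.quicksum(lat[i] * x[i, j] for i in range(n_layers))
            for j in range(n_parts)),
        GRB.MINIMIZE)

    m.optimize()
    for j in range(n_parts):
        layers = [i for i in range(n_layers) if x[i, j].X > 0.5]
        print(f"function {j}: layers {layers}")

A real formulation would additionally decide each function's memory configuration, which on Lambda sets both the price per millisecond and the execution speed, and would require each partition to be contiguous along the model's layer graph.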

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021, 927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. cost efficiency
  2. machine learning inference
  3. serverless computing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions, 29%

Article Metrics

  • Downloads (last 12 months): 159
  • Downloads (last 6 weeks): 12
Reflects downloads up to 25 Dec 2024

Cited By

  • (2024) Advancing Serverless Computing for Scalable AI Model Inference: Challenges and Opportunities. Proceedings of the 10th International Workshop on Serverless Computing, 1-6. https://doi.org/10.1145/3702634.3702950. Online publication date: 2-Dec-2024.
  • (2024) Pre-Warming is Not Enough: Accelerating Serverless Inference With Opportunistic Pre-Loading. Proceedings of the 2024 ACM Symposium on Cloud Computing, 178-195. https://doi.org/10.1145/3698038.3698509. Online publication date: 20-Nov-2024.
  • (2024) StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8(1), 1-29. https://doi.org/10.1145/3639028. Online publication date: 21-Feb-2024.
  • (2024) Demystifying the Cost of Serverless Computing: Towards a Win-Win Deal. IEEE Transactions on Parallel and Distributed Systems 35(1), 59-72. https://doi.org/10.1109/TPDS.2023.3330849. Online publication date: 1-Jan-2024.
  • (2024) Exploiting Processor Heterogeneity to Improve Throughput and Reduce Latency for Deep Neural Network Inference. 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 37-48. https://doi.org/10.1109/SBAC-PAD63648.2024.00012. Online publication date: 13-Nov-2024.
  • (2024) HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 1-10. https://doi.org/10.1109/IWQoS61813.2024.10682915. Online publication date: 19-Jun-2024.
  • (2024) On Efficient Zygote Container Planning and Task Scheduling for Edge Native Application Acceleration. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 2259-2268. https://doi.org/10.1109/INFOCOM52122.2024.10621106. Online publication date: 20-May-2024.
  • (2024) Proactive Elastic Scheduling for Serverless Ensemble Inference Services. 2024 IEEE International Conference on Web Services (ICWS), 1025-1035. https://doi.org/10.1109/ICWS62655.2024.00121. Online publication date: 7-Jul-2024.
  • (2024) FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2109-2122. https://doi.org/10.1109/ICDE60146.2024.00168. Online publication date: 13-May-2024.
  • (2024) Serverless application composition leveraging function fusion. Future Generation Computer Systems 153(C), 403-418. https://doi.org/10.1016/j.future.2023.12.010. Online publication date: 16-May-2024.
