DOI: 10.1145/3472883.3486977
research-article
Open access

Cloud-Scale Runtime Verification of Serverless Applications

Published: 01 November 2021

Abstract

Serverless platforms aim to simplify the deployment, scaling, and management of cloud applications. Serverless applications are inherently distributed, and are executed using short-lived, ephemeral processes. The use of short-lived, ephemeral processes simplifies application scaling and management, but also means that existing approaches to monitoring distributed systems and detecting bugs cannot be applied to serverless applications. In this paper we propose Watchtower, a framework that enables runtime monitoring of serverless applications. Watchtower takes program properties as inputs, and can detect cases where applications violate these properties. We design Watchtower to minimize application changes, and to scale at the same rate as the application. We achieve the former by instrumenting libraries rather than application code, and the latter by structuring Watchtower as a serverless application. Once a bug is found, developers can use the Watchtower debugger to identify and address the root cause of the bug.
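
The idea of instrumenting a library rather than application code, and then checking a property over the events that library emits, can be illustrated with a small sketch. This is not Watchtower's API or implementation: the `KVStore` client, the `instrument` helper, the `NoReadAfterDelete` property, and the `handler` function are all hypothetical names invented for illustration, and the monitor here runs in the same process, whereas the abstract describes Watchtower as being structured as a serverless application.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Event:
    op: str   # "put", "get", or "delete"
    key: str


class KVStore:
    """Hypothetical stand-in for a cloud storage client library."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def instrument(store: KVStore, sink: Callable[[Event], None]) -> KVStore:
    """Wrap the library's methods so that each call emits an Event.

    Only the library object is modified; code that calls it is unchanged.
    """
    for op in ("put", "get", "delete"):
        original = getattr(store, op)

        # Default arguments bind the current op and method for this wrapper.
        def wrapper(key, *args, _op=op, _original=original):
            sink(Event(_op, key))
            return _original(key, *args)

        setattr(store, op, wrapper)
    return store


class NoReadAfterDelete:
    """Assumed example property: a deleted key is not read until rewritten."""

    def __init__(self) -> None:
        self.deleted: Set[str] = set()
        self.violations: List[Event] = []

    def observe(self, event: Event) -> None:
        if event.op == "delete":
            self.deleted.add(event.key)
        elif event.op == "put":
            self.deleted.discard(event.key)
        elif event.op == "get" and event.key in self.deleted:
            self.violations.append(event)


def handler(store: KVStore):
    """Application code, unaware of monitoring; contains a deliberate bug."""
    store.put("user:1", "alice")
    store.delete("user:1")
    return store.get("user:1")  # reads a record that was just deleted


if __name__ == "__main__":
    monitor = NoReadAfterDelete()
    handler(instrument(KVStore(), monitor.observe))
    print(monitor.violations)
```

In this sketch the property is supplied as an object with an `observe` method and the violation is recorded rather than raised, so the application's behavior is not altered by the monitor. How Watchtower actually expresses properties and transports events is described in the paper, not here.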

Supplementary Material

MP4 File (Day1_Session2_Order_3_Watchtower.mp4)
Presentation video


Information & Contributors

Information

Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
November 2021
685 pages
ISBN: 9781450386388
DOI: 10.1145/3472883
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '21: ACM Symposium on Cloud Computing
November 1 - 4, 2021
Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 338
  • Downloads (Last 6 weeks): 49
Reflects downloads up to 15 Jan 2025

Cited By

  • (2024) FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 415-426. DOI: 10.1109/ISSRE62328.2024.00047. Online publication date: 28-Oct-2024
  • (2023) Halfmoon: Log-Optimal Fault-Tolerant Stateful Serverless Computing. Proceedings of the 29th Symposium on Operating Systems Principles, 314-330. DOI: 10.1145/3600006.3613154. Online publication date: 23-Oct-2023
  • (2023) Rise of the Planet of Serverless Computing: A Systematic Review. ACM Transactions on Software Engineering and Methodology 32:5, 1-61. DOI: 10.1145/3579643. Online publication date: 21-Jul-2023
  • (2023) Executing Microservice Applications on Serverless, Correctly. Proceedings of the ACM on Programming Languages 7:POPL, 367-395. DOI: 10.1145/3571206. Online publication date: 11-Jan-2023
  • (2023) Fine-Grained Performance and Cost Modeling and Optimization for FaaS Applications. IEEE Transactions on Parallel and Distributed Systems 34:1, 180-194. DOI: 10.1109/TPDS.2022.3214783. Online publication date: 1-Jan-2023
  • (2023) Comparison of Integration Coverage Criteria for Serverless Applications. 2023 IEEE International Conference on Service-Oriented System Engineering (SOSE), 67-74. DOI: 10.1109/SOSE58276.2023.00014. Online publication date: Jul-2023
  • (2023) Cloud Computing Based (Serverless computing) using Serverless architecture for Dynamic Web Hosting and cost Optimization. 2023 International Conference on Computer Communication and Informatics (ICCCI), 1-6. DOI: 10.1109/ICCCI56745.2023.10128286. Online publication date: 23-Jan-2023
  • (2023) Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform. Journal of Systems and Software 198:C. DOI: 10.1016/j.jss.2023.111611. Online publication date: 1-Apr-2023
  • (2022) Astrea: Auto-Serverless Analytics Towards Cost-Efficiency and QoS-Awareness. IEEE Transactions on Parallel and Distributed Systems 33:12, 3833-3849. DOI: 10.1109/TPDS.2022.3172069. Online publication date: 1-Dec-2022
  • (2022) Canary: Fault-Tolerant FaaS for Stateful Time-Sensitive Applications. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-16. DOI: 10.1109/SC41404.2022.00046. Online publication date: Nov-2022
