[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3373376.3378484acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article
Public Access

Effective Concurrency Testing for Distributed Systems

Published: 13 March 2020 Publication History

Abstract

Despite their wide deployment, distributed systems remain notoriously hard to reason about. Unexpected interleavings of concurrent operations and failures may lead to undefined behaviors and cause serious consequences. We present Morpheus, the first concurrency testing tool leveraging partial order sampling, a randomized testing method formally analyzed and empirically validated to provide strong probabilistic guarantees of error-detection, for real-world distributed systems. Morpheus introduces conflict analysis to further improve randomized testing by predicting and focusing on operations that affect the testing result. Inspired by the recent shift in building distributed systems using higher-level languages and frameworks, Morpheus targets Erlang. Evaluation on four popular distributed systems in Erlang including RabbitMQ, a message broker service, and Mnesia, a distributed database in the Erlang standard libraries, shows that Morpheus is effective: It found previously unknown errors in every system checked, 11 total, all of which are flaws in their core protocols that may cause deadlocks, unexpected crashes, or inconsistent states.

References

[1]
Amazon simpledb. https://aws.amazon.com/simpledb.
[2]
Apache cassandra. http://cassandra.apache.org.
[3]
Cassandra-6023. https://issues.apache.org/jira/browse/CASSANDRA-6023.
[4]
Erlang programming language. https://erlang.org.
[5]
The go programming language specification. https://golang.org/ref/spec.
[6]
Why whatsapp only needs 50 engineers for its 900m users. https://www.wired.com/2015/09/whatsapp-serves-900-million-users-50-engineers/.
[7]
Parosh Abdulla, Stavros Aronis, Bengt Jonsson, and Konstantinos Sagonas. Optimal dynamic partial order reduction. In Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '14, pages 373--384, New York, NY, USA, 2014. ACM.
[8]
Elvira Albert, Puri Arenas, María García de la Banda, Miguel Gómez-Zamalloa, and Peter J. Stuckey. Context-sensitive dynamic partial order reduction. In Rupak Majumdar and Viktor Kunvc ak, editors, Computer Aided Verification, pages 526--543, Cham, 2017. Springer International Publishing.
[9]
Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, and Samer Al-Kiswany. An analysis of network-partitioning failures in cloud systems. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pages 51--68, Berkeley, CA, USA, 2018. USENIX Association.
[10]
Stavros Aronis, Scott Lystig Fritchie, and Konstantinos Sagonas. Testing and verifying chain repair methods for corfu using stateless model checking. In Nadia Polikarpova and Steve Schneider, editors, Integrated Formal Methods, pages 227--242, Cham, 2017. Springer International Publishing.
[11]
Stavros Aronis, Bengt Jonsson, Magnus Lång, and Konstantinos Sagonas. Optimal dynamic partial order reduction with observers. In Dirk Beyer and Marieke Huisman, editors, Tools and Algorithms for the Construction and Analysis of Systems, pages 229--248, Cham, 2018. Springer International Publishing.
[12]
Radu Banabic and George Candea. Fast black-box testing of system recovery code. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 281--294, New York, NY, USA, 2012. ACM.
[13]
Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, and Santosh Nagarakatte. A randomized scheduler with probabilistic guarantees of finding bugs. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 167--178, New York, NY, USA, 2010. ACM.
[14]
Richard Carlsson, Björn Gustavsson, Erik Johansson, Thomas Lindgren, Sven-Olof Nyström, Mikael Pettersson, and Robert Virding. Core erlang 1.0.3 language specification. 2004.
[15]
M. Christakis, A. Gotovos, and K. Sagonas. Systematic testing for detecting concurrency errors in erlang programs. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pages 154--163, March 2013.
[16]
Koen Claessen, Michal Palka, Nicholas Smallbone, John Hughes, Hans Svensson, Thomas Arts, and Ulf Wiger. Finding race conditions in erlang with quickcheck and pulse. In Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming, ICFP '09, pages 149--160, New York, NY, USA, 2009. ACM.
[17]
Pantazis Deligiannis, Alastair F. Donaldson, Jeroen Ketema, Akash Lal, and Paul Thomson. Asynchronous programming, analysis and testing with state machines. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '15, pages 154--164, New York, NY, USA, 2015. ACM.
[18]
Cormac Flanagan and Stephen N. Freund. Fasttrack: Efficient and precise dynamic race detection. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, pages 121--133, New York, NY, USA, 2009. ACM.
[19]
Cormac Flanagan and Patrice Godefroid. Dynamic partial-order reduction for model checking software. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05, pages 110--121, New York, NY, USA, 2005. ACM.
[20]
Lars-Åke Fredlund and Hans Svensson. Mcerlang: A model checker for a distributed functional programming language. In Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming, ICFP '07, pages 125--136, New York, NY, USA, 2007. ACM.
[21]
Patrice Godefroid. Partial-Order Methods for the Verification of Concurrent Systems: An Approach to the State-Explosion Problem. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
[22]
Patrice Godefroid. Model checking for programming languages using verisoft. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '97, pages 174--186, New York, NY, USA, 1997. ACM.
[23]
Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. Practical software model checking via dynamic interface reduction. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP '11, pages 265--278, New York, NY, USA, 2011. ACM.
[24]
Daniel Jackson. Software Abstractions: Logic, Language, and Analysis. The MIT Press, 2006.
[25]
Charles Edwin Killian, James W. Anderson, Ryan Braud, Ranjit Jhala, and Amin M. Vahdat. Mace: Language support for building distributed systems. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 179--188, New York, NY, USA, 2007. ACM.
[26]
Kyle Kingsbury. Jepsen, 2016--2019.
[27]
Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. Samc: Semantic-aware model checking for fast discovery of deep bugs in cloud systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pages 399--414, Berkeley, CA, USA, 2014. USENIX Association.
[28]
Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. Taxdc: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, pages 517--530, New York, NY, USA, 2016. ACM.
[29]
Haopeng Liu, Guangpu Li, Jeffrey F. Lukman, Jiaxin Li, Shan Lu, Haryadi S. Gunawi, and Chen Tian. Dcatch: Automatically detecting distributed concurrency bugs in cloud systems. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 677--691, New York, NY, USA, 2017. ACM.
[30]
Haopeng Liu, Xu Wang, Guangpu Li, Shan Lu, Feng Ye, and Chen Tian. Fcatch: Automatically detecting time-of-fault bugs in cloud systems. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 419--431, New York, NY, USA, 2018. ACM.
[31]
Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. Avio: Detecting atomicity violations via access interleaving invariants. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pages 37--48, New York, NY, USA, 2006. ACM.
[32]
Jeffrey F. Lukman, Huan Ke, Cesar A. Stuardo, Riza O. Suminto, Daniar H. Kurniawan, Dikaimin Simon, Satria Priambada, Chen Tian, Feng Ye, Tanakorn Leesatapornwongsa, Aarti Gupta, Shan Lu, and Haryadi S. Gunawi. Flymc: Highly scalable testing of complex interleavings in distributed systems. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys '19, pages 20:1--20:16, New York, NY, USA, 2019. ACM.
[33]
Rupak Majumdar and Filip Niksic. Why is random testing effective for partition tolerance bugs? Proc. ACM Program. Lang., 2(POPL):46:1--46:24, December 2017.
[34]
Dahlia Malkhi, Mahesh Balakrishnan, John D. Davis, Vijayan Prabhakaran, and Ted Wobber. From paxos to corfu: A flash-speed shared log. SIGOPS Oper. Syst. Rev., 46(1):47--51, February 2012.
[35]
Friedemann Mattern. Virtual time and global states of distributed systems. Parallel and Distributed Algorithms, pages 215--226, 1989.
[36]
Madanlal Musuvathi, David Y. W. Park, Andy Chou, Dawson R. Engler, and David L. Dill. Cmc: A pragmatic approach to model checking real code. In Proceedings of the 5th Symposium on Operating Systems Design and implementation, OSDI '02, pages 75--88, Berkeley, CA, USA, 2002. USENIX Association.
[37]
Madanlal Musuvathi and Shaz Qadeer. Iterative context bounding for systematic testing of multithreaded programs. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 446--455, New York, NY, USA, 2007. ACM.
[38]
Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gerard Basler, Piramanayagam Arumuga Nainar, and Iulian Neamtiu. Finding and reproducing heisenbugs in concurrent programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI'08, pages 267--280, Berkeley, CA, USA, 2008. USENIX Association.
[39]
Mayur Naik, Alex Aiken, and John Whaley. Effective static race detection for java. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pages 308--319, New York, NY, USA, 2006. ACM.
[40]
Robert O'Callahan and Jong-Deok Choi. Hybrid dynamic data race detection. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '03, pages 167--178, New York, NY, USA, 2003. ACM.
[41]
Diego Ongaro and John Ousterhout. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 305--320, Berkeley, CA, USA, 2014. USENIX Association.
[42]
Burcu Kulahcioglu Ozkan, Rupak Majumdar, Filip Niksic, Mitra Tabaei Befrouei, and Georg Weissenbacher. Randomized testing of distributed systems with probabilistic guarantees. Proc. ACM Program. Lang., 2(OOPSLA):160:1--160:28, October 2018.
[43]
Burcu Kulahcioglu Ozkan, Rupak Majumdar, and Simin Oraee. Trace aware random testing for distributed systems. Proc. ACM Program. Lang., 3(OOPSLA), October 2019.
[44]
Soyeon Park, Shan Lu, and Yuanyuan Zhou. Ctrigger: Exposing atomicity violation bugs from their hiding places. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 25--36, New York, NY, USA, 2009. ACM.
[45]
Koushik Sen. Effective random testing of concurrent programs. In Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ASE '07, pages 323--332, New York, NY, USA, 2007. ACM.
[46]
Koushik Sen. Race directed random testing of concurrent programs. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 11--21, New York, NY, USA, 2008. ACM.
[47]
Jiri Simsa, Randy Bryant, and Garth Gibson. dbug: Systematic evaluation of distributed systems. In Proceedings of the 5th International Conference on Systems Software Verification, SSV'10, pages 3--3, Berkeley, CA, USA, 2010. USENIX Association.
[48]
Paul Thomson, Alastair F. Donaldson, and Adam Betts. Concurrency testing using controlled schedulers: An empirical study. ACM Trans. Parallel Comput., 2(4):23:1--23:37, February 2016.
[49]
Robbert van Renesse and Fred B. Schneider. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, pages 7--7, Berkeley, CA, USA, 2004. USENIX Association.
[50]
Jan Wen Voung, Ranjit Jhala, and Sorin Lerner. Relay: Static race detection on millions of lines of code. In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE '07, pages 205--214, New York, NY, USA, 2007. ACM.
[51]
Chao Wang, Mahmoud Said, and Aarti Gupta. Coverage guided systematic concurrency testing. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 221--230, New York, NY, USA, 2011. ACM.
[52]
Ulf Wiger. Github - locks, 2017.
[53]
Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. Modist: Transparent model checking of unmodified distributed systems. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI'09, pages 213--228, Berkeley, CA, USA, 2009. USENIX Association.
[54]
Junfeng Yang, Can Sar, and Dawson Engler. Explode: A lightweight, general system for finding serious storage system errors. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, OSDI '06, pages 10--10, Berkeley, CA, USA, 2006. USENIX Association.
[55]
Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using model checking to find serious file system errors. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, pages 19--19, Berkeley, CA, USA, 2004. USENIX Association.
[56]
Jie Yu, Satish Narayanasamy, Cristiano Pereira, and Gilles Pokam. Maple: A coverage-driven testing tool for multithreaded programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 485--502, New York, NY, USA, 2012. ACM.
[57]
Xinhao Yuan, Junfeng Yang, and Ronghui Gu. Partial order aware concurrency sampling. In Hana Chockler and Georg Weissenbacher, editors, Computer Aided Verification, pages 317--335, Cham, 2018. Springer International Publishing.

Cited By

View all
  • (2024)VConMC: Enabling Consistency Verification for Distributed Systems Using Implementation-Level Model Checkers and Consistency OraclesElectronics10.3390/electronics1306115313:6(1153)Online publication date: 21-Mar-2024
  • (2024)Generalized Concurrency Testing Tool for Distributed SystemsProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3685309(1861-1865)Online publication date: 11-Sep-2024
  • (2024)Greybox Fuzzing for Concurrency TestingProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640389(482-498)Online publication date: 27-Apr-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2020
1412 pages
ISBN:9781450371025
DOI:10.1145/3373376
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 March 2020

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. conflict analysis
  2. distributed systems
  3. partial order sampling
  4. partial-order reduction
  5. randomized testing

Qualifiers

  • Research-article

Funding Sources

Conference

ASPLOS '20

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)585
  • Downloads (Last 6 weeks)61
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)VConMC: Enabling Consistency Verification for Distributed Systems Using Implementation-Level Model Checkers and Consistency OraclesElectronics10.3390/electronics1306115313:6(1153)Online publication date: 21-Mar-2024
  • (2024)Generalized Concurrency Testing Tool for Distributed SystemsProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3685309(1861-1865)Online publication date: 11-Sep-2024
  • (2024)Greybox Fuzzing for Concurrency TestingProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640389(482-498)Online publication date: 27-Apr-2024
  • (2023)Greybox Fuzzing of Distributed SystemsProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security10.1145/3576915.3623097(1615-1629)Online publication date: 15-Nov-2023
  • (2023)Model Checking Guided Testing for Distributed SystemsProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587442(127-143)Online publication date: 8-May-2023
  • (2023)Coverage Guided Fault Injection for Cloud Systems2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)10.1109/ICSE48619.2023.00186(2211-2223)Online publication date: May-2023
  • (2023)Evolutionary Approach for Concurrency Testing of Ripple Blockchain Consensus Algorithm2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)10.1109/ICSE-SEIP58684.2023.00009(36-47)Online publication date: May-2023
  • (2023)Model‐checking‐driven explorative testing of CRDT designs and implementationsJournal of Software: Evolution and Process10.1002/smr.2555Online publication date: 15-Mar-2023
  • (2022)Controlled concurrency testing via periodical schedulingProceedings of the 44th International Conference on Software Engineering10.1145/3510003.3510178(474-486)Online publication date: 21-May-2022
  • (2022)Who goes first? detecting go concurrency bugs via message reorderingProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507753(888-902)Online publication date: 28-Feb-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media