Abstract
Large-scale HPC systems experience failures arising from faults in hardware, software, and/or networking, and failure rates continue to grow as systems scale up and out. Crash fault tolerance has so far been the focus of efforts to augment the Message Passing Interface (MPI) for fault-tolerant operation. This narrow fault model (usually restricted to process or node failures) is insufficient: without a more general model for consensus, gaps will arise in the ability to detect, isolate, mitigate, and recover HPC applications efficiently. Focusing on crash failures alone is inadequate because a chain of faults in underlying components can lead to system-level failures visible to MPI. Moreover, clusters and leadership-class machines alike often provide Reliability, Availability, and Serviceability (RAS) systems that convey predictive and real-time fault and error information, which does not map strictly onto process and node crashes. A broader study of failures beyond crashes in MPI, together with consensus mechanisms, will therefore be useful to developers as they continue to design, develop, and implement fault-tolerant HPC systems that reflect the faults observable in real systems. We describe key factors that must be considered during consensus-mechanism design and illustrate current MPI fault tolerance models through use cases. We then offer a novel classification of common consensus mechanisms based on these factors, including the network model, the failure types tolerated, and the use cases of consensus in the computation (e.g., fault detection, synchronization), with crash fault tolerance as one category.
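To make the fault-detection and consensus use case concrete, the following minimal sketch (an illustration assuming the ULFM extensions to Open MPI, not code from this paper) uses MPIX_Comm_agree, the crash-fault-tolerant agreement primitive of ULFM, to let surviving ranks decide uniformly whether to enter a common recovery path:

/* Minimal sketch (assumption: ULFM extensions available via <mpi-ext.h>).
 * Each rank contributes a local "healthy" flag; the agreement returns the
 * bitwise AND of the flags of the surviving ranks, so all survivors decide
 * the same value even if some ranks crash mid-call. Error-handler details
 * and recovery (e.g., MPIX_Comm_shrink) are omitted for brevity. */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int local_ok = 1;               /* e.g., result of a local health check */
    int flag = local_ok;
    int rc = MPIX_Comm_agree(comm, &flag);

    if (rc != MPI_SUCCESS || flag == 0) {
        /* A rank failed or reported trouble: every survivor sees this
         * uniformly and can enter the same recovery path. */
        fprintf(stderr, "agreement detected a fault; initiating recovery\n");
    }

    MPI_Finalize();
    return 0;
}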
Notes
Consensus under fault-free operations is also an inherent property of typical bulk-synchronous parallel (BSP) and data-parallel programs; see the sketch following these notes.
Byzantine failures include crash failures.
Reaching agreement is never guaranteed in theory, but is often achievable heuristically in practice (cf. FLP [34]).
There can be security concerns about enabling a parallel program to receive fault information from the exterior of the parallel system. Coping with any possible covert channels through translation and vetting of such information appears tractable in practice.
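As a concrete illustration of the first note above (a minimal sketch, not taken from the paper): each superstep of a bulk-synchronous MPI program typically ends in a collective such as MPI_Allreduce, and under fault-free operation that collective already forces every rank to agree on a single decision value.

/* Minimal sketch: implicit consensus in a fault-free BSP program.
 * The collective ending each superstep leaves all ranks holding the
 * same decision value (here, a global convergence flag). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int converged_local = 0, converged_global = 0;
    while (!converged_global) {
        /* ... local computation and halo exchange for this superstep ... */
        converged_local = 1;   /* placeholder local convergence test */

        /* Every rank leaves the superstep with the same decision value:
         * consensus as a by-product of the collective. */
        MPI_Allreduce(&converged_local, &converged_global, 1,
                      MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}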
References
Fromentin, E., Raynal, M., Tronel, F.: On classes of problems in asynchronous distributed systems with process crashes. In: Proceedings of the 19th IEEE International Conference on Distributed Computing Systems, pp. 470–477 (1999). https://doi.org/10.1109/ICDCS.1999.776549
Hassani, A., Skjellum, A., Brightwell, R.: Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 750–755, (2014). https://doi.org/10.1109/DSN.2014.78
Sultana, N., Rüfenacht, M., Skjellum, A., Laguna, I., Mohror, K.: Failure recovery for bulk synchronous applications with MPI stages. Parallel Comput. 84, 1–14 (2019). https://doi.org/10.1016/j.parco.2019.02.007
Hassani, A.: Toward a scalable, transactional, fault-tolerant message passing interface for petascale and exascale machines. PhD dissertation, The University of Alabama at Birmingham (2014)
Altarawneh, A., Herschberg, T., Medury, S., Kandah, F., Skjellum, A.: Buterin’s scalability trilemma viewed through a state-change-based classification for common consensus algorithms. In: 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0727–0736 (2020). https://doi.org/10.1109/CCWC47524.2020.9031204
Dolev, D., Reischuk, R.: Bounds on information exchange for byzantine agreement. J. ACM (JACM) 32(1), 191–204 (1985)
Giménez, A., Gamblin, T., Bhatele, A., Wood, C., Shoga, K., Marathe, A., Bremer, P.-T., Hamann, B., Schulz, M.: ScrubJay: Deriving knowledge from the disarray of HPC performance data. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, SC ’17, New York (2017). https://doi.org/10.1145/3126908.3126935
Guo, H., Di, S., Gupta, R., Peterka, T., Cappello, F.: La VALSE: scalable log visualization for fault characterization in supercomputers. In: Childs, H., Cucchietti, F. (eds.) Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association (2018)
Martino, C. D., Jha, S., Kramer, W., Kalbarczyk, Z., Iyer, R. K.: LogDiver: A tool for measuring resilience of extreme-scale systems and applications. In: Proceedings of the 5th Workshop on Fault Tolerance for HPC at EXtreme Scale, pp. 11–18. Association for Computing Machinery, FTXS ’15, New York (2015). https://doi.org/10.1145/2751504.2751511
Buntinas, D.: Scalable distributed consensus to support MPI fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249 (2012). https://doi.org/10.1109/IPDPS.2012.113
Nowakowski, W.: Network management software for redundant ethernet ring. Theor. Appl. Sci. 48, 24–29 (2017)
Libby, R.: Effective HPC hardware management and failure prediction strategy using IPMI. In: Proceedings of the Linux Symposium. Citeseer, (2003)
Baudet, M., Ching, A., Chursin, A., Danezis, G., Garillot, F., Li, Z., Malkhi, D., Naor, O., Perelman, D., Sonnino, A.: State machine replication in the libra blockchain (2019)
Driscoll, K., Hall, B., Paulitsch, M., Zumsteg, P., Sivencrona, H.: The real byzantine generals. In: The 23rd Digital Avionics Systems Conference (IEEE Cat. No.04CH37576), vol. 2, pp. 6.D.4–61 (2004). https://doi.org/10.1109/DASC.2004.1390734
Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Version 3.1. High-Performance Computing Center Stuttgart, University of Stuttgart (2015). https://books.google.com/books?id=Fbv7jwEACAAJ
Bar-Noy, A., Dolev, D.: Consensus algorithms with one-bit messages. Distrib. Comput. 4(3), 105–110 (1991)
Castro, M., Liskov, B.: Practical byzantine fault tolerance. In: Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, Louisiana, USA, pp. 173–186, (1999). URL https://dl.acm.org/citation.cfm?id=296824
El-Sayed, N., Schroeder, B.: Reading between the lines of failure logs: Understanding how HPC systems fail. In: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1–12. IEEE (2013)
King, S., Nadal, S.: PPCoin: Peer-to-peer crypto-currency with proof-of-stake. Self-published paper (2012)
Ismail, L., Materwala, H.: A review of blockchain architecture and consensus protocols: use cases, challenges, and solutions. Symmetry (2019). https://doi.org/10.3390/sym11101198
Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. USENIX Association, USENIX ATC’14, pp. 305–320, USA (2014)
Ferreira, K., Stearley, J., Laros, J. H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P. G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, SC ’11, New York (2011a). https://doi.org/10.1145/2063384.2063443
Chang, T.-H., Hong, M., Liao, W.-C., Wang, X.: Asynchronous distributed admm for large-scale optimization-part i: algorithm and convergence analysis. IEEE Trans. Signal Process. 64(12), 3118–3130 (2016). https://doi.org/10.1109/TSP.2016.2537271
Yin, M., Malkhi, D., Reiter, M. K., Gueta, G. G., Abraham, I.: HotStuff: BFT consensus with linearity and responsiveness. In: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pp. 347–356. ACM (2019)
Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Perform. Comput. Appl. 27(3), 244–254 (2013). https://doi.org/10.1177/1094342013488238
Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users’ Group Meeting. Association for Computing Machinery, EuroMPI ’15, New York, (2015) https://doi.org/10.1145/2802658.2802660
Popov, S.: The tangle. White Paper 1(3) (2018)
Altarawneh, A., Skjellum, A.: The security ingredients for correct and byzantine fault-tolerant blockchain consensus algorithms. In: 2020 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1–9, (2020). https://doi.org/10.1109/ISNCC49221.2020.9297326
Al-Mamun, A., Li, T., Sadoghi, M., Jiang, L., Shen, H.-T., Zhao, D.: HPChain: an MPI-based blockchain framework for data fidelity in high-performance computing systems (2019)
Cachin, C., Vukolić, M.: Blockchain consensus protocols in the wild. arXiv preprint arXiv:1707.01873, (2017)
De Angelis, S.: Assessing security and performances of consensus algorithms for permissioned blockchains. arXiv preprint arXiv:1805.03490, (2018)
Dwork, C., Naor, M.: Pricing via processing or combatting junk mail. In: Brickell, E.F. (ed.) Advances in Cryptology – CRYPTO’ 92, pp. 139–147. Springer, Berlin Heidelberg (1993)
Lamport, L.: The weak byzantine generals problem. J. ACM 30(3), 668–676 (1983). https://doi.org/10.1145/2402.322398
Bosilca, G., Bouteiller, A., Herault, T., Le Fèvre, V., Robert, Y., Dongarra, J.: Revisiting credit distribution algorithms for distributed termination detection. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 611–620 (2021). https://doi.org/10.1109/IPDPSW52791.2021.00095
Moise, I.: Efficient agreement protocols in asynchronous distributed systems. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 2022–2025. IEEE, (2011)
Ropars, T., Lefray, A., Kim, D., Schiper, A.: Efficient process replication for MPI applications: Sharing work between replicas. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 645–654, (2015). https://doi.org/10.1109/IPDPS.2015.29
Fischer, M.J.: The consensus problem in unreliable distributed systems (a brief survey). In: Karpinski, M. (ed.) Foundations of Computation Theory, pp. 127–140. Springer, Berlin Heidelberg (1983)
Sankaran, S., Squyres, J.M., Barrett, B., Sahay, V., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005). https://doi.org/10.1177/1094342005056139
Ferreira, K., Stearley, J., Laros, J. H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P. G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: SC ’11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. (2011b). https://doi.org/10.1145/2063384.2063443
Woo, S., Lang, S., Latham, R., Ross, R., Thakur, R.: Reliable MPI-IO through layout-aware replication (2011)
Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998). https://doi.org/10.1145/279227.279229
Borowsky, E., Gafni, E.: Generalized FLP impossibility result for t-resilient asynchronous computations. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’93, pp. 91–100. Association for Computing Machinery, New York (1993). ISBN 0897915917. https://doi.org/10.1145/167088.167119
Brokaw, T., Koziuk, G.: The intelligent platform management interface (IPMI) and enclosure management. Electron. Eng. (Lond.) 72, 19 (2000)
Costa, C. H. A., Park, Y., Rosenburg, B. S., Cher, C.-Y., Ryu, K. D.: A system software approach to proactive memory-error avoidance. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pp. 707–718. IEEE Press, (2014). https://doi.org/10.1109/SC.2014.63
Di, S., Gupta, R., Snir, M., Pershey, E., Cappello, F.: LogAider: A tool for mining potential correlations of HPC log events. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 442–451 (2017). https://doi.org/10.1109/CCGRID.2017.18
Leners, J.B., Wu, H., Hung, W.-L., Aguilera, M.K, Walfish, M.: Detecting failures in distributed systems with the falcon spy network, In: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 279–294. New York, NY, Association for Computing Machinery (2011). https://doi.org/10.1145/2043556.2043583
Moses, Y., Raynal, M.: Revisiting simultaneous consensus with crash failures. J. Parallel Distrib. Comput. 69(4), 400–409 (2009). https://doi.org/10.1016/j.jpdc.2009.01.001
Bano, S., Sonnino, A., Al-Bassam, M., Azouvi, S., McCorry, P., Meiklejohn, S., Danezis, G.: Sok: Consensus in the age of blockchains. In: Proceedings of the 1st ACM Conference on Advances in Financial Technologies, pp. 183–198 (2019)
Aguilera, M. K., Toueg, S.: Randomization and failure detection: a hybrid approach to solve consensus. Technical report (1996)
Al-Mamun, A., Zhao, D.: BAASH: enabling blockchain-as-a-service on high-performance computing systems. arXiv preprint arXiv:2001.07022 (2020)
Lamport, L., Shostak, R.E., Pease, M.C.: The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3), 382–401 (1982)
Duan, S.: Building reliable and practical byzantine fault tolerance. PhD dissertation, University of California Davis (2016)
Fan, X., Chai, Q.: Roll-dpos: A randomized delegated proof of stake scheme for scalable blockchain-based internet of things systems. In: Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. MobiQuitous ’18, pp. 482–484. New York (2018). https://doi.org/10.1145/3286978.3287023
Snir, M., Wisniewski, R.W., Abraham, J.A., Adve, S.V., Bagchi, S., Balaji, P., Belak, J., Bose, P., Cappello, F., Carlson, B., Chien, A.A., Coteus, P., Debardeleben, N.A., Diniz, P.C., Engelmann, C., Erez, M., Fazzari, S., Geist, A., Gupta, R., Johnson, F., Krishnamoorthy, S., Leyffer, S., Liberty, D., Mitra, S., Munson, T., Schreiber, R., Stearley, J., Hensbergen, E.V.: Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl. 28(2), 129–173 (2014)
Buntinas, D.: Scalable distributed consensus to support MPI fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249. IEEE (2012)
Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. IEEE Trans. Depend. Secur. Comput. 7(4), 337–350 (2009)
Leesatapornwongsa, T., Lukman, J.F., Lu, S., Gunawi, H.S.: TaxDC: a taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. SIGPLAN Not. 51(4), 517–530 (2016). https://doi.org/10.1145/2954679.2872374
Omwenga, M., Otim, J., Lumala, A.: Robust mobile cloud services through offline support, pp. 90–93 (2012). https://doi.org/10.1109/ACSEAC.2012.27
Stone, J., Partridge, C.: When the CRC and TCP checksum disagree. SIGCOMM Comput. Commun. Rev. 30(4), 309–319 (2000). https://doi.org/10.1145/347057.347561
Huang, S.-T.: Detecting termination of distributed computations by external agents. In: Proceedings of the 9th International Conference on Distributed Computing Systems, pp. 79–84 (1989). https://doi.org/10.1109/ICDCS.1989.37933
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016
Losada, N., González, P., Martín, M.J., Bosilca, G., Bouteiller, A., Teranishi, K.: Fault tolerance of MPI applications in exascale systems: the ULFM solution. Future Gener. Comput. Syst. 106, 467–481 (2020). https://doi.org/10.1016/j.future.2020.01.026
Hassani, A., Skjellum, A., Bangalore, P. V., Brightwell, R.: Practical resilient cases for fa-mpi, a transactional fault-tolerant mpi. In: Proceedings of the 3rd Workshop on Exascale MPI. Association for Computing Machinery, ExaMPI ’15, New York (2015). https://doi.org/10.1145/2831129.2831130
Hursey, J., Naughton, T., Vallee, G., Graham, R.L.: A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) Recent Advances in the Message Passing Interface, pp. 255–263. Springer, Berlin Heidelberg (2011)
Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35(2), 288–323 (1988). https://doi.org/10.1145/42282.42283
García-Pérez, Á., Gotsman, A., Meshman, Y., Sergey, I.: Paxos consensus, deconstructed and abstracted. In: Ahmed, A. (ed.) Programming Languages and Systems, pp. 912–939. Springer International Publishing, Cham (2018)
Castro, M., Liskov, B.: Practical byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst. 20(4), 398–461 (2002). https://doi.org/10.1145/571637.571640
Acknowledgements
This work was performed with partial support from the National Science Foundation under Grants Nos. CCF-1562659, CCF-1562306, CCF-1617690, CCF-1822191, and CCF-1821431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Author information
Contributions
The first draft of the manuscript was written by all authors and we all commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors confirm that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nansamba, G., Altarawneh, A. & Skjellum, A. A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC. Int J Parallel Prog 51, 128–149 (2023). https://doi.org/10.1007/s10766-022-00749-y