Abstract
Large-scale HPC systems experience failures arising from faults in hardware, software, and/or networking, and failure rates continue to grow as systems scale up and out. Crash fault tolerance has so far been the focus of efforts to augment the Message Passing Interface (MPI) for fault-tolerant operation. This narrow fault model (usually restricted to process or node failures) is insufficient: without a more general model for consensus, gaps will arise in the ability to detect, isolate, mitigate, and recover HPC applications efficiently. Focusing on crash failures alone is inadequate because a chain of faults in underlying components can lead to system-level failures visible to MPI. Moreover, clusters and leadership-class machines alike often provide Reliability, Availability, and Serviceability (RAS) systems that convey predictive and real-time fault and error information, which does not map strictly onto process and node crashes. A broader study of failures beyond crashes in MPI, together with consensus mechanisms, will therefore be useful to developers as they continue to design, develop, and implement fault-tolerant HPC systems that reflect the faults observable in real systems. We describe key factors that must be considered during consensus-mechanism design and illustrate current MPI fault tolerance models through use cases. We then offer a novel classification of common consensus mechanisms based on these factors, including the network model, the failure types tolerated, and the use cases of consensus in the computation (e.g., fault detection, synchronization), with crash fault tolerance as one category.
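To make the fault-detection and consensus use case concrete, the following minimal sketch (an illustration assuming the ULFM extensions to Open MPI, not code from this paper) uses MPIX_Comm_agree, the crash-fault-tolerant agreement primitive of ULFM, to let surviving ranks decide uniformly whether to enter a common recovery path:

/* Minimal sketch (assumption: ULFM extensions available via <mpi-ext.h>).
 * Each rank contributes a local "healthy" flag; the agreement returns the
 * bitwise AND of the flags of the surviving ranks, so all survivors decide
 * the same value even if some ranks crash mid-call. Error-handler details
 * and recovery (e.g., MPIX_Comm_shrink) are omitted for brevity. */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int local_ok = 1;               /* e.g., result of a local health check */
    int flag = local_ok;
    int rc = MPIX_Comm_agree(comm, &flag);

    if (rc != MPI_SUCCESS || flag == 0) {
        /* A rank failed or reported trouble: every survivor sees this
         * uniformly and can enter the same recovery path. */
        fprintf(stderr, "agreement detected a fault; initiating recovery\n");
    }

    MPI_Finalize();
    return 0;
}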
Notes
Consensus under fault-free operations is also an inherent property of typical bulk-synchronous parallel (BSP) and data-parallel programs; see the sketch following these notes.
Byzantine failures include crash failures.
Reaching agreement is never guaranteed in theory, but is often achievable heuristically in practice (cf. FLP [34]).
There can be security concerns about enabling a parallel program to receive fault information from the exterior of the parallel system. Coping with any possible covert channels through translation and vetting of such information appears tractable in practice.
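As a concrete illustration of the first note above (a minimal sketch, not taken from the paper): each superstep of a bulk-synchronous MPI program typically ends in a collective such as MPI_Allreduce, and under fault-free operation that collective already forces every rank to agree on a single decision value.

/* Minimal sketch: implicit consensus in a fault-free BSP program.
 * The collective ending each superstep leaves all ranks holding the
 * same decision value (here, a global convergence flag). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int converged_local = 0, converged_global = 0;
    while (!converged_global) {
        /* ... local computation and halo exchange for this superstep ... */
        converged_local = 1;   /* placeholder local convergence test */

        /* Every rank leaves the superstep with the same decision value:
         * consensus as a by-product of the collective. */
        MPI_Allreduce(&converged_local, &converged_global, 1,
                      MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}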
References
Fromentin, E., Raynal, M., Tronel, F.: On classes of problems in asynchronous distributed systems with process crashes. In: Proceedings of the 19th IEEE International Conference on Distributed Computing Systems, pp. 470–477 (1999). https://doi.org/10.1109/ICDCS.1999.776549
Hassani, A., Skjellum, A., Brightwell, R.: Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 750–755, (2014). https://doi.org/10.1109/DSN.2014.78
Sultana, N., Rüfenacht, M., Skjellum, A., Laguna, I., Mohror, K.: Failure recovery for bulk synchronous applications with MPI stages. Parallel Comput. 84, 1–14 (2019). https://doi.org/10.1016/j.parco.2019.02.007
Hassani, A.: Toward a scalable, transactional, fault-tolerant message passing interface for petascale and exascale machines. PhD dissertation, The University of Alabama at Birmingham (2014)
Altarawneh, A., Herschberg, T., Medury, S., Kandah, F., Skjellum, A.: Buterin’s scalability trilemma viewed through a state-change-based classification for common consensus algorithms. In: 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0727–0736 (2020). https://doi.org/10.1109/CCWC47524.2020.9031204
Dolev, D., Reischuk, R.: Bounds on information exchange for byzantine agreement. J. ACM (JACM) 32(1), 191–204 (1985)
Giménez, A., Gamblin, T., Bhatele, A., Wood, C., Shoga, K., Marathe, A., Bremer, P.-T., Hamann, B., Schulz, M.: ScrubJay: Deriving knowledge from the disarray of HPC performance data. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, SC ’17, New York (2017). https://doi.org/10.1145/3126908.3126935
Guo, H., Di, S., Gupta, R., Peterka, T., Cappello, F.: La VALSE: scalable log visualization for fault characterization in supercomputers. In: Childs, H., Cucchietti, F. (eds.) Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association (2018)
Martino, C. D., Jha, S., Kramer, W., Kalbarczyk, Z., Iyer, R. K.: LogDiver: A tool for measuring resilience of extreme-scale systems and applications. In: Proceedings of the 5th Workshop on Fault Tolerance for HPC at EXtreme Scale, pp. 11–18. Association for Computing Machinery, FTXS ’15, New York (2015). https://doi.org/10.1145/2751504.2751511
Buntinas, D.: Scalable distributed consensus to support MPI fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249 (2012). https://doi.org/10.1109/IPDPS.2012.113
Nowakowski, W.: Network management software for redundant ethernet ring. Theor. Appl. Sci. 48, 24–29 (2017)
Libby, R.: Effective HPC hardware management and failure prediction strategy using IPMI. In: Proceedings of the Linux Symposium. Citeseer, (2003)
Baudet, M., Ching, A., Chursin, A., Danezis, G., Garillot, F., Li, Z., Malkhi, D., Naor, O., Perelman, D., Sonnino, A.: State machine replication in the libra blockchain (2019)
Driscoll, K., Hall, B., Paulitsch, M., Zumsteg, P., Sivencrona, H.: The real byzantine generals. In: The 23rd Digital Avionics Systems Conference (IEEE Cat. No.04CH37576), vol. 2, pp. 6.D.4–61 (2004). https://doi.org/10.1109/DASC.2004.1390734
Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Version 3.1. High-Performance Computing Center Stuttgart, University of Stuttgart (2015). https://books.google.com/books?id=Fbv7jwEACAAJ
Bar-Noy, A., Dolev, D.: Consensus algorithms with one-bit messages. Distrib. Comput. 4(3), 105–110 (1991)
Castro, M., Liskov, B.: Practical byzantine fault tolerance. In: Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, Louisiana, USA, pp. 173–186, (1999). URL https://dl.acm.org/citation.cfm?id=296824
El-Sayed, N., Schroeder, B.: Reading between the lines of failure logs: Understanding how HPC systems fail. In: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1–12. IEEE (2013)
King, S., Nadal, S.: PPCoin: Peer-to-peer crypto-currency with proof-of-stake. Self-published paper (2012)
Ismail, L., Materwala, H.: A review of blockchain architecture and consensus protocols: use cases, challenges, and solutions. Symmetry (2019). https://doi.org/10.3390/sym11101198
Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. USENIX Association, USENIX ATC’14, pp. 305–320, USA (2014)
Ferreira, K., Stearley, J., Laros, J. H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P. G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, SC ’11, New York (2011a). https://doi.org/10.1145/2063384.2063443
Chang, T.-H., Hong, M., Liao, W.-C., Wang, X.: Asynchronous distributed admm for large-scale optimization-part i: algorithm and convergence analysis. IEEE Trans. Signal Process. 64(12), 3118–3130 (2016). https://doi.org/10.1109/TSP.2016.2537271
Yin, M., Malkhi, D., Reiter, M. K., Gueta, G. G., Abraham, I.: HotStuff: BFT consensus with linearity and responsiveness. In: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pp. 347–356. ACM (2019)
Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Perform. Comput. Appl. 27(3), 244–254 (2013). https://doi.org/10.1177/1094342013488238
Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users’ Group Meeting. Association for Computing Machinery, EuroMPI ’15, New York, (2015) https://doi.org/10.1145/2802658.2802660
Popov, S.: The tangle. White Paper 1(3) (2018)
Altarawneh, A., Skjellum, A.: The security ingredients for correct and byzantine fault-tolerant blockchain consensus algorithms. In: 2020 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1–9, (2020). https://doi.org/10.1109/ISNCC49221.2020.9297326
Al-Mamun, A., Li, T., Sadoghi, M., Jiang, L., Shen, H.-T., Zhao, D.: HPChain: an MPI-based blockchain framework for data fidelity in high-performance computing systems (2019)
Cachin, C., Vukolić, M.: Blockchain consensus protocols in the wild. arXiv preprint arXiv:1707.01873, (2017)
De Angelis, S.: Assessing security and performances of consensus algorithms for permissioned blockchains. arXiv preprint arXiv:1805.03490, (2018)
Dwork, C., Naor, M.: Pricing via processing or combatting junk mail. In: Brickell, E.F. (ed.) Advances in Cryptology – CRYPTO’ 92, pp. 139–147. Springer, Berlin Heidelberg (1993)
Lamport, L.: The weak byzantine generals problem. J. ACM 30(3), 668–676 (1983). https://doi.org/10.1145/2402.322398
Bosilca, G., Bouteiller, A., Herault, T., Le Fèvre, V., Robert, Y., Dongarra, J.: Revisiting credit distribution algorithms for distributed termination detection. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 611–620 (2021). https://doi.org/10.1109/IPDPSW52791.2021.00095
Moise, I.: Efficient agreement protocols in asynchronous distributed systems. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 2022–2025. IEEE, (2011)
Ropars, T., Lefray, A., Kim, D., Schiper, A.: Efficient process replication for MPI applications: Sharing work between replicas. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 645–654, (2015). https://doi.org/10.1109/IPDPS.2015.29
Fischer, M.J.: The consensus problem in unreliable distributed systems (a brief survey). In: Karpinski, M. (ed.) Foundations of Computation Theory, pp. 127–140. Springer, Berlin Heidelberg (1983)
Sankaran, S., Squyres, J.M., Barrett, B., Sahay, V., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005). https://doi.org/10.1177/1094342005056139
Ferreira, K., Stearley, J., Laros, J. H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P. G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: SC ’11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. (2011b). https://doi.org/10.1145/2063384.2063443
Woo, S., Lang, S., Latham, R., Ross, R., Thakur, R.: Reliable MPI-IO through layout-aware replication (2011)
Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998). https://doi.org/10.1145/279227.279229
Borowsky, E., Gafni, E.: Generalized FLP impossibility result for t-resilient asynchronous computations. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’93, pp. 91–100. Association for Computing Machinery, New York (1993). ISBN 0897915917. https://doi.org/10.1145/167088.167119
Brokaw, T., Koziuk, G.: The intelligent platform management interface (IPMI) and enclosure management. Electron. Eng. (Lond.) 72, 19 (2000)
Costa, C. H. A., Park, Y., Rosenburg, B. S., Cher, C.-Y., Ryu, K. D.: A system software approach to proactive memory-error avoidance. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pp. 707–718. IEEE Press, (2014). https://doi.org/10.1109/SC.2014.63
Di, S., Gupta, R., Snir, M., Pershey, E., Cappello, F.: LogAider: A tool for mining potential correlations of HPC log events. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 442–451 (2017). https://doi.org/10.1109/CCGRID.2017.18
Leners, J.B., Wu, H., Hung, W.-L., Aguilera, M.K, Walfish, M.: Detecting failures in distributed systems with the falcon spy network, In: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 279–294. New York, NY, Association for Computing Machinery (2011). https://doi.org/10.1145/2043556.2043583
Moses, Y., Raynal, M.: Revisiting simultaneous consensus with crash failures. J. Parallel Distrib. Comput. 69(4), 400–409 (2009). https://doi.org/10.1016/j.jpdc.2009.01.001
Bano, S., Sonnino, A., Al-Bassam, M., Azouvi, S., McCorry, P., Meiklejohn, S., Danezis, G.: Sok: Consensus in the age of blockchains. In: Proceedings of the 1st ACM Conference on Advances in Financial Technologies, pp. 183–198 (2019)
Aguilera, M. K., Toueg, S.: Randomization and failure detection: a hybrid approach to solve consensus. Technical report (1996)
Al-Mamun, A., Zhao, D.: BAASH: enabling blockchain-as-a-service on high-performance computing systems. arXiv preprint arXiv:2001.07022 (2020)
Lamport, L., Shostak, R.E., Pease, M.C.: The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3), 382–401 (1982)
Duan, S.: Building reliable and practical byzantine fault tolerance. PhD dissertation, University of California Davis (2016)
Fan, X., Chai, Q.: Roll-dpos: A randomized delegated proof of stake scheme for scalable blockchain-based internet of things systems. In: Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. MobiQuitous ’18, pp. 482–484. New York (2018). https://doi.org/10.1145/3286978.3287023
Snir, M., Wisniewski, R.W., Abraham, J.A., Adve, S.V., Bagchi, S., Balaji, P., Belak, J., Bose, P., Cappello, F., Carlson, B., Chien, A.A., Coteus, P., Debardeleben, N.A., Diniz, P.C., Engelmann, C., Erez, M., Fazzari, S., Geist, A., Gupta, R., Johnson, F., Krishnamoorthy, S., Leyffer, S., Liberty, D., Mitra, S., Munson, T., Schreiber, R., Stearley, J., Hensbergen, E.V.: Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl. 28(2), 129–173 (2014)
Buntinas, D.: Scalable distributed consensus to support MPI fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249. IEEE (2012)
Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. IEEE Trans. Depend. Secur. Comput. 7(4), 337–350 (2009)
Leesatapornwongsa, T., Lukman, J.F., Lu, S., Gunawi, H.S.: TaxDC: a taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. SIGPLAN Not. 51(4), 517–530 (2016). https://doi.org/10.1145/2954679.2872374
Omwenga, M., Otim, J., Lumala, A.: Robust mobile cloud services through offline support, pp. 90–93 (2012). https://doi.org/10.1109/ACSEAC.2012.27
Stone, J., Partridge, C.: When the CRC and TCP checksum disagree. SIGCOMM Comput. Commun. Rev. 30(4), 309–319 (2000). https://doi.org/10.1145/347057.347561
Huang, S.-T.: Detecting termination of distributed computations by external agents. In: Proceedings of the 9th International Conference on Distributed Computing Systems, pp. 79–84 (1989). https://doi.org/10.1109/ICDCS.1989.37933
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016
Losada, N., González, P., Martín, M.J., Bosilca, G., Bouteiller, A., Teranishi, K.: Fault tolerance of MPI applications in exascale systems: the ULFM solution. Future Gener. Comput. Syst. 106, 467–481 (2020). https://doi.org/10.1016/j.future.2020.01.026
Hassani, A., Skjellum, A., Bangalore, P. V., Brightwell, R.: Practical resilient cases for fa-mpi, a transactional fault-tolerant mpi. In: Proceedings of the 3rd Workshop on Exascale MPI. Association for Computing Machinery, ExaMPI ’15, New York (2015). https://doi.org/10.1145/2831129.2831130
Hursey, J., Naughton, T., Vallee, G., Graham, R.L.: A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) Recent Advances in the Message Passing Interface, pp. 255–263. Springer, Berlin Heidelberg (2011)
Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35(2), 288–323 (1988). https://doi.org/10.1145/42282.42283
García-Pérez, Á., Gotsman, A., Meshman, Y., Sergey, I.: Paxos consensus, deconstructed and abstracted. In: Ahmed, A. (ed.) Programming Languages and Systems, pp. 912–939. Springer International Publishing, Cham (2018)
Castro, M., Liskov, B.: Practical byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst. 20(4), 398–461 (2002). https://doi.org/10.1145/571637.571640
Acknowledgements
This work was performed with partial support from the National Science Foundation under Grants Nos. CCF-1562659, CCF-1562306, CCF-1617690, CCF-1822191, and CCF-1821431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Author information
Contributions
The first draft of the manuscript was written by all authors and we all commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors confirm that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nansamba, G., Altarawneh, A. & Skjellum, A. A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC. Int J Parallel Prog 51, 128–149 (2023). https://doi.org/10.1007/s10766-022-00749-y