Abstract.
Consider a distributed system with \(n\) nodes. A protocol running on this system is \(k\) resilient if it could tolerate up to \(k\) failures and operate correctly. The reliability of such a protocol is defined as the probability that no more than \(k\) nodes have failed. In the first part of the paper, we study the scalability of systems running such protocols. We show the existence of a threshold time of operation for these protocols which we call the scalable mission time (SMT). This scalable mission time is the maximum time until which an asymptotic increase in the system size leads to an asymptotic increase in the reliability of the protocol. We show that beyond this scalable mission time, an asymptotic increase in system size leads to an asymptotic decrease in reliability. We also show techniques to compute the scalable mission time. In the second part of the paper, we show that the scalable mission time for a \(k\) resilient protocol can be used as a good approximation to the mean-time to failure (MTTF) of the protocol, even when the failure distributions are non-exponential and the nodes fail at different rates (a heterogeneous system). We also show that the MTTF asymptotically approaches the SMT with an increase in system size \(n\). Computation of the MTTF is quite difficult when the system is heterogeneous even if the failure distribution of the nodes is exponential. Using experimental results, we show that the SMT approximation to the MTTF gives values very close to the real MTTF. Further, we consider the maintenance interval of systems running \(k\) resilient protocols and show that if the maintenance interval is larger than the scalable mission time, then there is a maximum scalability value beyond which it is undesirable to scale up the size of the system.
Similar content being viewed by others
Author information
Authors and Affiliations
Additional information
Received: 4 October 1994 / 30 May 1996
Rights and permissions
About this article
Cite this article
Rangarajan, S., Huang, Y. & Tripathi, S. On the scalability and mean-time to failure of \(k\) resilient protocols. Acta Informatica 34, 543–556 (1997). https://doi.org/10.1007/s002360050096
Issue Date:
DOI: https://doi.org/10.1007/s002360050096