In this thesis we investigate simulation algorithms and numerical approximations for estimating performance measures for some large classes of highly reliable systems. We start with the study of analytical approximations for a fundamental problem in reliability theory that is known to be computationally intractable (NP-hard): the standard network reliability problem with highly reliable independent components. We extrapolate these ideas to Markovian systems with highly reliable interacting components. The special structure of the generator matrix is used to derive limit theorems and efficient approximations for some special classes of such systems. For more general systems with complex interdependencies among components, we have to resort to simulation techniques. In particular, we investigate simulation techniques for a class of systems that are considered by the "SAVE" package. The SAVE ("Systems Availability Estimator") package is a state of the art software package being developed at the IBM T. J. Watson Research Center for the performance analysis of highly reliable systems.
In the SAVE package the component failure times and component repair times are assumed to be exponentially distributed so that the systems may be modelled as continuous time Markov chains. Despite the Markovian structure, naive simulation is very inefficient due to the rarity of the failure events. An importance sampling technique called failure biasing has been known empirically to produce orders of magnitude of variance reduction in the simulation of some such systems. We modify this technique to make it both more robust and applicable to a broader class of systems. We develop a mathematical framework within which we can prove that our modified failure biasing technique yields a rate of convergence that is insensitive to the component failure rates.
Another way of making a system highly reliable is to have large degrees of component redundancy. The components are permitted to have generally distributed repair times. We derive exact analytical formulas for the performance measures of some systems with components in parallel. We also investigate importance sampling techniques, based on large deviation theory, that can be used to efficiently simulate such systems.
Cited By
- Carrasco J Adapted Importance Sampling Schemes for the Simulation of Dependability Models of Fault-Tolerant Systems with Deferred Repair Proceedings of the 39th annual Symposium on Simulation, (107-117)
- Glynn P, Heidelberger P, Nicola V and Shahabuddin P Efficient estimation of the mean time between failures in non-regenerative dependability models Proceedings of the 25th conference on Winter simulation, (311-316)
- Shahabuddin P and Nakayama M Estimation of reliability and its derivatives for large time horizons in Markovian systems Proceedings of the 25th conference on Winter simulation, (422-429)
- Heidelberger P, Nicola V and Shahabuddin P Simultaneous and efficient simulation of highly dependable systems with different underlying distributions Proceedings of the 24th conference on Winter simulation, (458-465)
Recommendations
Fast Simulation of Highly Dependable Systems with General Failure and Repair Processes
An approach for simulating models of highly dependable systems with general failure and repair time distribution is described. The approach combines importance sampling with event rescheduling in order to obtain variance reductions in such rare event ...
Highly reliable message-passing mechanism for cluster file system
With the increase in personal computer clusters in popularity and quantity, message passing between nodes has been an important issue for high failure rate in the network. File access in a cluster file system often contains several sub-operations; each ...
Quick Simulation Methods For Estimating The Unreliability Of Regenerative Models Of Large, Highly Reliable Systems
We investigate fast simulation techniques for estimating the unreliability in large Markovian models of highly reliable systems for which analytical/numerical techniques are difficult to apply. We first show mathematically that for “small” time horizons,...