Estimating Availability Through Simulation
Availability is a metric that combines the concepts of reliability and maintainability. It gives the probability that a unit is operational (neither failed nor undergoing repair) when called upon for use. Industries that rely on certain key pieces of equipment have a powerful interest in being able to model and track the availability of these machines. So do the manufacturers of repairable systems, whose customers will have a keen interest in the availability of the products they are buying. Availability estimation is most frequently done through simulation. In this article, we look at the way ReliaSoft's BlockSim software conducts availability simulations.
BlockSim employs the simulation method to estimate a system's availability and associated measures. These include the number of expected failures, number of expected maintenance actions, etc. The estimation process involves synthesizing system performance over a given number of simulation runs or loops. Each loop emulates how the system might perform in real life based on the specified failure and downtime properties of the system. These properties consist of the interrelationships among the components, as defined in the reliability block diagram (RBD), and the corresponding quantitative failure, repair, and maintenance properties for each component. The reliability block diagram determines how component failures can interact to cause system failures. The failure, repair, and maintenance properties determine how often components are likely to fail, how quickly they will be restored to service, the associated logistic time for each repair, when to perform preventive maintenance, etc. By performing many simulation loops and recording a success or failure for each loop, a statistical picture of the system performance can be obtained.
For example, suppose we wish to determine the availability of a complex system over a period of one year. A simulation model of the system could be developed that emulates the random failures and repair times of the components in the system, thus creating an overall picture of the up and down states for the system, similar to the following figure.
For a given operation time, random times-to-failure and, if necessary, times-to-repair are generated. If the component or components that fail in that time period are vital to the operation of the system, the system is said to have failed. This process is repeated for a specified number of iterations and the results are averaged to develop an overall model of system availability.
The process will be illustrated with a simple example of a one-component repairable system. The failure distribution of the component is known to be a two-parameter Weibull distribution with a shape parameter of 1.5 and a scale parameter of 150 hours. The repair distribution is a normal distribution with a mean of 5 hours and a standard deviation of 2 hours. We will not deal with "advanced" concepts such as preventive maintenance or added logistic time.
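As a rough illustration of how random failure and repair times for this example could be drawn, the sketch below uses Python's standard `random` module. It is not BlockSim's implementation. The function names, the seed, and the choice to clip negative normal draws to zero are assumptions made here; the article does not say how negative repair times are handled.

```python
import random

# Parameters taken from the example in the text.
WEIBULL_SHAPE = 1.5      # dimensionless
WEIBULL_SCALE = 150.0    # hours
REPAIR_MEAN = 5.0        # hours
REPAIR_SD = 2.0          # hours


def draw_failure_time(rng):
    """Draw one random time-to-failure (hours) from the Weibull distribution.

    Note: random.weibullvariate(alpha, beta) takes the SCALE first (alpha)
    and the SHAPE second (beta).
    """
    return rng.weibullvariate(WEIBULL_SCALE, WEIBULL_SHAPE)


def draw_repair_time(rng):
    """Draw one random time-to-repair (hours) from the normal distribution.

    Assumption (not from the article): a normal distribution can produce
    negative values, so draws below zero are clipped to zero here.
    """
    return max(0.0, rng.normalvariate(REPAIR_MEAN, REPAIR_SD))


if __name__ == "__main__":
    rng = random.Random(2001)  # arbitrary seed, only for repeatability
    print("time to failure (h):", draw_failure_time(rng))
    print("time to repair  (h):", draw_repair_time(rng))
```

Passing an explicit `random.Random` instance keeps the draws repeatable for a given seed, which makes a simulation easier to check by hand.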
In BlockSim, the user must input the number of loops for the simulation on the Setup page of the Reliability/Maintainability Simulation window. (Note that the number of inner and outer loops is irrelevant for the maintainability simulation; the maintainability simulation is only concerned with the total number of loops.) For this example, we will choose just ten loops for the sake of simplicity. In practical applications, it is always best to have a high number of loops in order to obtain an accurate simulation. The user is also required to enter a mission end time on the Maintainability page. For this example, we will choose 100 hours as our mission end time.
With this information entered, the simulation may begin. At each loop, the simulator generates a random failure time for each component using Monte Carlo simulation, based on the failure distribution for the component. There will only be one failure time for this system as it consists of only one component. This failure time is compared to the mission end time. If the failure time is greater than the mission end time, the loop is considered to be over and no downtime is logged for that loop. If the random failure time is less than the mission end time, a failure is logged against the component and, if the failure of the component would result in a system failure, against the system as well. In our example, a component failure is equivalent to a system failure. At this point, a repair time is generated based on the component's repair distribution. This is logged as component and/or system downtime. The failed component has now accumulated life equivalent to the sum of the failure time and the repair time. If this sum, or elapsed time, is less than the mission end time, another random failure time is generated. If this new failure time is less than the remaining time (mission end time less elapsed time), another repair time is logged, and so on. This process repeats until enough failure and repair times have elapsed to meet or exceed the system mission end time, and the total downtime and number of failures for the loop are logged.
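The bookkeeping for a single loop of the one-component example can be sketched as follows. This is a minimal illustration of the steps described above, not BlockSim's code. The function name `simulate_loop` and its return layout are hypothetical, and two details are assumptions: a repair that runs past the mission end time is clipped at the mission end, and negative normal draws are clipped to zero.

```python
import random


def simulate_loop(mission_end, draw_failure, draw_repair):
    """Simulate one loop for a one-component repairable system.

    draw_failure and draw_repair are zero-argument callables that return
    a random time-to-failure and time-to-repair (hours).
    Returns (number_of_failures, downtime, up_at_mission_end).
    """
    elapsed = 0.0    # failure times plus repair times accumulated so far
    downtime = 0.0
    failures = 0
    while True:
        time_to_failure = draw_failure()
        # No failure before the mission ends: the loop is over, system is up.
        if time_to_failure >= mission_end - elapsed:
            return failures, downtime, True
        failures += 1
        elapsed += time_to_failure

        repair_time = draw_repair()
        remaining = mission_end - elapsed
        if repair_time >= remaining:
            # Assumption: only downtime inside the mission window is logged,
            # and the system counts as down at the mission end time.
            downtime += remaining
            return failures, downtime, False
        downtime += repair_time
        elapsed += repair_time


if __name__ == "__main__":
    rng = random.Random(42)  # arbitrary seed, only for repeatability
    result = simulate_loop(
        100.0,
        lambda: rng.weibullvariate(150.0, 1.5),         # scale, then shape
        lambda: max(0.0, rng.normalvariate(5.0, 2.0)),  # clipped at zero
    )
    print("failures, downtime, up at end:", result)
```

Passing the two distributions in as callables keeps the loop logic separate from the choice of distributions, so the same function can be checked with fixed, hand-picked times.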
This process is repeated for each loop, and the uptime for each loop (mission end time minus downtime) is calculated. At the end of all of the simulation loops, the uptime is averaged and divided by the mission end time to determine the average availability. The point availability at the mission end time is determined by dividing the number of loops in which the system was operational at the mission end time by the total number of loops.
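The two summary measures can be computed from the per-loop results as in the sketch below, where average availability is the average uptime divided by the mission end time. The function name `summarize_loops` is hypothetical, and the input values in the usage lines are invented solely to exercise the function; they are not the values from the article's table.

```python
def summarize_loops(downtimes, up_at_end, mission_end):
    """Summarize per-loop results of an availability simulation.

    downtimes   : downtime logged in each loop (hours)
    up_at_end   : True if the system was operating at the mission end time
    mission_end : mission end time (hours)
    Returns (average_uptime, average_availability, point_availability).
    """
    loops = len(downtimes)
    if loops == 0 or loops != len(up_at_end):
        raise ValueError("need one downtime and one up/down flag per loop")

    # Uptime for each loop is the mission end time minus that loop's downtime.
    uptimes = [mission_end - d for d in downtimes]
    average_uptime = sum(uptimes) / loops
    average_availability = average_uptime / mission_end

    # Fraction of loops in which the system was up at the mission end time.
    point_availability = sum(1 for up in up_at_end if up) / loops
    return average_uptime, average_availability, point_availability


if __name__ == "__main__":
    # Illustrative inputs only (four made-up loops, 100-hour mission).
    example_downtimes = [0.0, 4.0, 0.0, 6.0]
    example_flags = [True, True, True, False]
    print(summarize_loops(example_downtimes, example_flags, 100.0))
```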
While this description may seem a little complex, the process can be clarified by going through this example loop by loop and calculating the results.
The following table summarizes the activity for each of the loops:
Thus, the average uptime at the end of the simulation is 97.1 hours. The average availability is calculated by dividing the average uptime by the mission end time, or 97.1/100 = 97.1%. The point availability at 100 hours is 90%, since the system was operating at the mission end time in 9 of the 10 loops in the simulation. This is unusual in that the point availability is usually higher than the average availability. This illustrates the hazards of performing a simulation with a small number of loops. If we had used a larger number of loops, the point availability would most likely have been higher than the average availability.
This gives a simple example of how the availability simulation in BlockSim works. The process becomes more involved for larger systems. No distinction was made between component failure and system failure in this example, but simulations of more complex systems must take into account whether a component failure resulted in a system failure, whether some components continue to accumulate time when the system is down, whether there is scheduled preventive maintenance, etc. The process can be viewed as a matter of bookkeeping: keeping track of uptime and downtime for the individual components as well as for the entire system.
Copyright © 2001 ReliaSoft Corporation, ALL RIGHTS RESERVED