Reliability HotWire

Issue 8, October 2001

Hot Topics

# Estimating Availability Through Simulation

Availability is a metric that combines the concepts of reliability and maintainability. It gives the probability that a unit is available (not broken and not undergoing repair) when called upon for use. Industries that rely on certain key pieces of equipment have a strong interest in being able to model and track the availability of these machines, as do the manufacturers of repairable systems, whose customers have a keen interest in the availability of the products they buy. Availability estimation is most frequently done through simulation. In this article, we look at the way ReliaSoft's BlockSim software conducts availability simulations.

BlockSim employs the simulation method to estimate a system's availability and associated measures. These include the number of expected failures, number of expected maintenance actions, etc. The estimation process involves synthesizing system performance over a given number of simulation runs or loops. Each loop emulates how the system might perform in real life based on the specified failure and downtime properties of the system. These properties consist of the interrelationships among the components, as defined in the reliability block diagram (RBD), and the corresponding quantitative failure, repair, and maintenance properties for each component. The reliability block diagram determines how component failures can interact to cause system failures. The failure, repair, and maintenance properties determine how often components are likely to fail, how quickly they will be restored to service, the associated logistic time for each repair, when to perform preventive maintenance, etc. By performing many simulation loops and recording a success or failure for each loop, a statistical picture of the system performance can be obtained.

For example, suppose we wish to determine the availability of a complex system over a period of one year. A simulation model of the system could be developed that emulates the random failures and repair times of the components in the system, thus creating an overall picture of the up and down states for the system, similar to the following figure.

For a given operation time, random times-to-failure and, if necessary, times-to-repair are generated. If the component or components that fail in that time period are vital to the operation of the system, the system is said to have failed. This process is repeated for a specified number of iterations and the results are averaged to develop an overall model of system availability.

The process will be illustrated with a simple example of a one-component repairable system. The failure distribution of the component is known to be a two-parameter Weibull distribution with shape parameter β = 1.5 and scale parameter η = 150 hours. The repair distribution is a normal distribution with mean μ = 5 hours and standard deviation σ = 2 hours. We will not deal with "advanced" concepts such as preventive maintenance or added logistic time.
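Drawing these random times is straightforward with Python's standard library; the sketch below is illustrative, not BlockSim's code. Note that `random.weibullvariate(alpha, beta)` takes the scale (η) first and the shape (β) second, and that a normal repair draw can come out negative, so clamping it at zero is our assumption (the article does not say how BlockSim treats that tail).

```python
import random

# Time-to-failure: Weibull with scale eta = 150 h, shape beta = 1.5.
# random.weibullvariate(alpha, beta) expects alpha = scale, beta = shape.
def draw_failure_time(rng: random.Random) -> float:
    return rng.weibullvariate(150.0, 1.5)

# Time-to-repair: Normal(mu = 5 h, sigma = 2 h). A normal draw can be
# negative, so clamp at zero (our assumption, not stated in the article).
def draw_repair_time(rng: random.Random) -> float:
    return max(0.0, rng.normalvariate(5.0, 2.0))

rng = random.Random(2001)  # seeded for repeatability
print(draw_failure_time(rng), draw_repair_time(rng))
```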

In BlockSim, the user must input the number of loops for the simulation on the Setup page of the Reliability/Maintainability Simulation window. (Note that the number of inner and outer loops is irrelevant for the maintainability simulation; the maintainability simulation is only concerned with the total number of loops.) For this example, we will choose just ten loops for the sake of simplicity. In practical applications, it is always best to have a high number of loops in order to obtain an accurate simulation. The user is also required to enter a mission end time on the Maintainability page. For this example, we will choose 100 hours as our mission end time.

With this information entered, the simulation may begin. At each loop, the simulator generates a random failure time for each component using Monte Carlo simulation, based on the failure distribution for the component. There will be only one failure time for this system, as it consists of only one component. This failure time is compared to the mission end time. If the failure time is greater than the mission end time, the loop is considered to be over and no downtime is logged for that loop. If the random failure time is less than the mission end time, a failure is logged against the component and, if the failure of the component would result in a system failure, against the system as well. In our example, a component failure is equivalent to a system failure. At this point, a repair time is generated based on the component's repair distribution. This is logged as component and/or system downtime. The failed component has now accumulated life equivalent to the sum of the failure time and the repair time. If this sum, or elapsed time, is less than the mission end time, another random failure time is generated. If this new failure time is less than the remaining time (mission end time less elapsed time), another failure is logged and another repair time is generated, and so on. This process repeats until enough failure and repair times have elapsed to meet or exceed the system mission end time, at which point the total downtime and number of failures for the loop are logged.
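The bookkeeping for a single loop can be sketched as a short Python function. This is an illustrative sketch rather than BlockSim's actual implementation; the clamp on negative normal repair draws is our assumption, and the distribution parameters and mission end time come from the example above.

```python
import random

MISSION_END = 100.0  # hours, from the example setup

def simulate_loop(rng: random.Random) -> tuple[int, float, bool]:
    """One simulation loop for the single-component system.

    Returns (number of failures, total downtime, operating at end)."""
    elapsed = 0.0     # accumulated operating + repair time
    n_failures = 0
    downtime = 0.0
    while True:
        # Random time-to-failure: Weibull, shape 1.5, scale 150 h.
        ttf = rng.weibullvariate(150.0, 1.5)
        if elapsed + ttf >= MISSION_END:
            return n_failures, downtime, True  # survives to mission end
        elapsed += ttf
        n_failures += 1
        # Random time-to-repair: Normal(5 h, 2 h), clamped at zero.
        ttr = max(0.0, rng.normalvariate(5.0, 2.0))
        if elapsed + ttr >= MISSION_END:
            # Repair still in progress at the mission end time: only the
            # portion of the repair inside the mission counts as downtime.
            return n_failures, downtime + (MISSION_END - elapsed), False
        elapsed += ttr
        downtime += ttr
```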

This process is repeated for each loop, and the uptime for each loop (mission end time minus downtime) is calculated. At the end of all of the simulation loops, the uptime is averaged and divided by the mission end time to determine the average availability. The point availability is determined by dividing the number of loops in which the system was operational at the mission end time by the total number of loops.
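Putting the per-loop bookkeeping and this averaging together, the whole estimation can be sketched as follows. This is an illustrative sketch, not BlockSim's implementation: the loop count, the random seed, and the clamping of negative repair draws to zero are our assumptions, while the distribution parameters and mission end time come from the example.

```python
import random

MISSION_END = 100.0  # hours
N_LOOPS = 20_000     # many loops for a stable estimate

rng = random.Random(42)
total_uptime = 0.0
up_at_end = 0

for _ in range(N_LOOPS):
    elapsed = downtime = 0.0
    operating = True
    while True:
        ttf = rng.weibullvariate(150.0, 1.5)         # time to failure
        if elapsed + ttf >= MISSION_END:
            break                                    # up at mission end
        elapsed += ttf
        ttr = max(0.0, rng.normalvariate(5.0, 2.0))  # time to repair
        if elapsed + ttr >= MISSION_END:
            downtime += MISSION_END - elapsed        # truncate the repair
            operating = False
            break
        elapsed += ttr
        downtime += ttr
    total_uptime += MISSION_END - downtime
    up_at_end += operating

average_availability = total_uptime / (N_LOOPS * MISSION_END)
point_availability = up_at_end / N_LOOPS
print(f"average availability ~ {average_availability:.4f}")
print(f"point availability at {MISSION_END:.0f} h ~ {point_availability:.4f}")
```

With the component's mean time to failure (about 135 hours) well above the mean repair time, both estimates come out well above 90% for this system.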

While this description may seem a little complex, the process can be clarified by going through this example loop by loop and calculating the results.

• LOOP 1 - The first generated failure time is 16.1 hours, well below the mission end time of 100 hours, so a failure is logged. Then a repair time of 8.6 hours is generated. This means that the system is up and running again at an elapsed time of 24.7 hours, with 75.3 hours left until the mission end time. At this point, a second failure time of 62.3 hours is generated. Since this is less than the remaining time of 75.3 hours, another failure is logged and another repair time of 6.4 hours is generated. At this point, the elapsed time on the loop is 93.4 hours (16.1 + 8.6 + 62.3 + 6.4 = 93.4). This is still less than the mission end time of 100 hours, leaving 6.6 hours of remaining time, so another failure time is generated. This failure time is 30.5 hours, which exceeds the remaining 6.6 hours, so the loop ends with two failures and a total downtime of 15 hours (8.6 + 6.4 = 15) logged. The system was operating at the end of the loop.

• LOOP 2 - The first generated failure time is 141.2 hours. Since this exceeds the mission end time, no failures or downtime are logged. The system was operating at the end of the loop.

• LOOP 3 - The first generated failure time is 167.4 hours. No failures or downtime. The system was operating at the end of the loop.

• LOOP 4 - The first generated failure time is 199.4 hours. No failures or downtime. The system was operating at the end of the loop.

• LOOP 5 - The first generated failure time is 74.8 hours. This is less than the mission end time, so a failure is logged and a repair time of 6.1 hours is generated. At this point the elapsed time is 80.9 hours, with 19.1 hours remaining until the mission end time. The second generated failure time is 202.8 hours. This exceeds the remaining time, so the loop ends with the system operating. There was one failure and 6.1 hours of downtime logged.

• LOOP 6 - The first generated failure time is 370.8 hours. No failures or downtime. The system was operating at the end of the loop.

• LOOP 7 - The first generated failure time is 101.4 hours. No failures or downtime. The system was operating at the end of the loop.

• LOOP 8 - The first generated failure time is 179.4 hours. No failures or downtime. The system was operating at the end of the loop.

• LOOP 9 - The first generated failure time is 205.7 hours. No failures or downtime. The system was operating at the end of the loop.

• LOOP 10 - The first generated failure time is 92.4 hours, which is less than the mission end time, so a failure is logged. A repair time of 7.8 hours is generated, giving an elapsed time of 100.2 hours. Since this exceeds the mission end time by 0.2 hours, the system was logged as not operating at the end of the loop; in other words, the system was still considered to be undergoing repair at the mission end time. Note that the total downtime logged is 7.6 hours rather than the 7.8 hours generated as a repair time, because downtime is only accumulated up to the mission end time (100 - 92.4 = 7.6).

The following table summarizes the activity for each of the loops:

| Loop # | # of Failures | Downtime | Uptime | Operating at End of Loop |
|--------|---------------|----------|--------|--------------------------|
| 1      | 2             | 15.0     | 85.0   | 1                        |
| 2      | 0             | 0.0      | 100.0  | 1                        |
| 3      | 0             | 0.0      | 100.0  | 1                        |
| 4      | 0             | 0.0      | 100.0  | 1                        |
| 5      | 1             | 6.1      | 93.9   | 1                        |
| 6      | 0             | 0.0      | 100.0  | 1                        |
| 7      | 0             | 0.0      | 100.0  | 1                        |
| 8      | 0             | 0.0      | 100.0  | 1                        |
| 9      | 0             | 0.0      | 100.0  | 1                        |
| 10     | 1             | 7.6      | 92.4   | 0                        |
| AVERAGE: |             |          | 97.1   | 0.9                      |
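As a sanity check on the bookkeeping, the drawn failure and repair times from the walkthrough can be replayed in a few lines of Python. The `replay` helper is ours, not BlockSim's; the drawn times are taken verbatim from the ten loops above.

```python
MISSION_END = 100.0  # hours

# Alternating failure/repair draws from the walkthrough, one list per loop.
draws = [
    [16.1, 8.6, 62.3, 6.4, 30.5],  # loop 1
    [141.2],                       # loop 2
    [167.4],                       # loop 3
    [199.4],                       # loop 4
    [74.8, 6.1, 202.8],            # loop 5
    [370.8],                       # loop 6
    [101.4],                       # loop 7
    [179.4],                       # loop 8
    [205.7],                       # loop 9
    [92.4, 7.8],                   # loop 10
]

def replay(times):
    """Replay one loop; return (failures, downtime, operating at end)."""
    it = iter(times)
    elapsed = downtime = 0.0
    failures = 0
    for ttf in it:
        if elapsed + ttf >= MISSION_END:
            return failures, downtime, True   # survives to mission end
        elapsed += ttf
        failures += 1
        ttr = next(it)
        if elapsed + ttr >= MISSION_END:
            # Repair runs past the mission end: count only the portion of
            # the repair that falls inside the mission as downtime.
            return failures, downtime + (MISSION_END - elapsed), False
        elapsed += ttr
        downtime += ttr

results = [replay(t) for t in draws]
avg_uptime = sum(MISSION_END - d for _, d, _ in results) / len(results)
point_availability = sum(up for _, _, up in results) / len(results)
print(round(avg_uptime, 1), point_availability)  # matches the table: 97.1 0.9
```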

Thus, the average uptime at the end of the simulation is 97.1 hours. The average availability is calculated by dividing the average uptime by the mission end time, or 97.1/100 = 97.1%. The point availability at 100 hours is 90%, since the system was considered to be operating at the end of 9 of the 10 loops. Note how coarse this estimate is: with only ten loops, a single loop shifts the point availability by ten percentage points. This illustrates the hazards of performing a simulation with a small number of loops; a larger number of loops would yield far more stable estimates of both measures.

This gives a simple example of how the availability simulation in BlockSim works. Obviously, the process is more complex when dealing with complicated systems. No distinction was made between component and system failure in this example, but simulations of more complex systems must account for whether a component failure results in a system failure, whether some components continue to accumulate operating time while the system is down, whether there is scheduled preventive maintenance, etc. The process can be viewed as a matter of bookkeeping: keeping track of uptime and downtime for the individual components as well as for the entire system.