Quote of the Day
A theory has to be simpler than the data it explains.
— Leibniz
Introduction
I was in an interminable meeting the other day where we were discussing the MTBF and availability of a system. My issue with the discussion was that each person in the room preferred to think about these terms in a different way. In this post, I will show that the four people in the meeting were actually in violent agreement and simply did not realize that their statements were mathematically equivalent.
I wish I could say that this was the first time in my career that this had happened, but that would not be true. It happens all the time.
Background
The Argument
I will try to summarize the argument as simply as I can:
| Person | Position |
|--------|----------|
| Person 1 | The system must conform to GR-909, a telecommunications specification that sets requirements on system availability. |
| Person 2 | The system must have an availability of at least 99.999%. |
| Person 3 | The system must have a downtime (i.e. unavailability) of less than 5 minutes per year. |
| Person 4 | The system must have a Mean Time Between Failures (MTBF) of 68.4 years. |
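As a quick check that Person 2 and Person 3 are making the same statement (using 365.25 days per year, which is why the number is usually rounded to "about 5 minutes"):

$$ (1 - 0.99999) \times 365.25 \times 24 \times 60\ \text{min} \approx 5.26\ \text{min of downtime per year} $$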
Definitions
- Availability
- The ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval. For example, a unit that is capable of being used 100 hours per week (168 hours) would have an availability of 100/168. In high availability applications, a metric known as "nines", corresponding to the number of nines following the decimal point, is used. With this convention, "five nines" equals 0.99999 (or 99.999%) availability (Source).
- Mean Time Between Failures (MTBF)
- MTBF describes the expected time between two failures for a repairable system (Source).
- Mean Time To Repair (MTTR)
- MTTR represents the average time required to repair a failed component or device (Source).
- Mean Time to Failure (MTTF)
- MTTF denotes the expected time to failure for a repairable system. For our purposes here, $\text{MTBF} = \text{MTTF} + \text{MTTR} \approx \text{MTTF}$, since the repair time is negligible compared to the time between failures.
- Failure Rate (FR)
- Failure rate is the frequency with which an engineered system or component fails, expressed in failures per unit of time (e.g. failures per 10^9 hours, the FIT unit commonly used by hardware engineers).
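The quantities defined above are linked by a few standard reliability relations; these are the ones I read the analysis below (and the comment thread about eq1) as relying on:

$$ A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \approx \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}, \qquad \text{FR} = \frac{1}{\text{MTBF}}, \qquad \text{annual downtime} = (1 - A)\cdot 1\ \text{year} $$

Rearranging the first relation gives $\text{MTBF} \approx \text{MTTR}\cdot\frac{A}{1-A}$, which I take to be the eq1 discussed in the comments (note the $1-A$ in the denominator).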
Analysis
Figure 2 summarizes my demonstration that each person's statement is equivalent to the others.
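Since the figure itself is not reproduced here, the short Python sketch below redoes the same arithmetic. The 6-hour MTTR is my assumption (it is the repair time that makes a 68.4-year MTBF line up with five-nines availability); the post does not state the value actually used.

```python
# Sketch of the equivalence argued in the meeting.
# NOTE: the 6-hour MTTR is an assumed value, chosen because it makes a
# 68.4-year MTBF consistent with five-nines availability.

HOURS_PER_YEAR = 365.25 * 24          # 8766 hours
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60

availability = 0.99999                # Person 2: five nines
mttr_hours = 6.0                      # assumed mean time to repair

# Person 3: downtime per year implied by the availability
downtime_min_per_year = (1 - availability) * MINUTES_PER_YEAR

# Person 4: MTBF implied by the availability and the assumed MTTR
# A = MTBF / (MTBF + MTTR)  =>  MTBF = MTTR * A / (1 - A)
mtbf_hours = mttr_hours * availability / (1 - availability)
mtbf_years = mtbf_hours / HOURS_PER_YEAR

print(f"Downtime: {downtime_min_per_year:.2f} minutes per year")  # ~5.26
print(f"MTBF:     {mtbf_years:.1f} years")                        # ~68.4
```

Running it gives roughly 5.26 minutes of downtime per year and an MTBF of about 68.4 years, i.e. Persons 2, 3, and 4 are all stating the same requirement.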
Conclusion
It took about 30 minutes to get everyone in the meeting to understand that they were all stating the same requirement. The problem arises because different departments work in different units: systems engineers and industry specifications speak in terms of availability, hardware engineers speak in terms of MTBF, and customer service people speak in terms of downtime per year.
The "elephant in the room" was that fact that most systems fail because of software bugs and these reliability calculations ignore software bugs.
Comments
Thanks for explaining this. I believe you have a minor typo in eq1; the denominator should be "1-A".
Hi Ronan,
This is an odd feature of many computer algebra systems. If you look closely – I often miss it myself – you will see a minus sign on the front of eq1. Nearly every computer algebra system I have used likes to put a minus sign on the front of the expressions it simplifies. Thus, the expression shown is actually equal to 1-A, as you state it should be.
mathscinotes
I missed that little negative sign.
Have a Merry Christmas (or best wishes for the holiday season) and a Happy New Year.
You have a nice Christmas too. Hopefully I will have a bit more time to write up some math. I have had quite a bit going on, but no time to write.
mathscinotes