# MTBF and Annualized Failure Rates

Quote of the Day

If I had to give credit to the instruments and machines that won us the war in the Pacific, I would rate them in this order: submarines first, radar second, planes third, bulldozers fourth.

— Admiral Bull Halsey. I have always found the performance of US submarines during WW2 amazing considering the challenges that they faced with faulty torpedoes.

## Introduction

Figure 1: Bathtub Curve Model of System Reliability. (Source)

One of the more distasteful tasks I need to do is make estimates of annual product failure rates using  MTBF predictions based on part count methods. I find this task distasteful because I have never seen any indication that MTBF predictions are correlated in any way with field failure rates. This is not solely my observation – the US Army has cancelled its use of part count method MTBF predictions (i.e.  based on MIL-HDBK-217). However, the telecommunications industry has continued to use these predictions through their use of Telcordia SR-332, which is similar to MIL-HDBK-217. If you want a simple example of an SR-332-based reliability prediction, see this very clear example from Avago. The parts count method assumes that components fail at a constant rate (green line in Figure 1).

While this calculation is simple, it is useful to discuss why the results generated are so useless – in fact, I would argue that they drive incorrect business decisions for things like required spare parts inventories.

## Background

### Theory

The basic math here is shown in Equation 1.

 Eq. 1 $\displaystyle {{\lambda }_{{Annualized}}}=\underbrace{\lambda }_{{=\frac{1}{{MTBF}}}}\cdot {{T}_{{Year}}}=\frac{{{{T}_{{Year}}}}}{{MTBF}}$

where

• λAnnualized is the failure rate per year.
• λ is the failure rate (usually expressed per billion hours).
• TYear is the number of hours in a year (8760)
• MTBF is the Mean Time Between Failures.

### Shortcomings

The shortcomings of the part count method are many:

• It assumes a constant failure rate, memory-less failure rate
• A new part fails at the same rate as an old one.
• Total operating hours is all that is important.
• This means 1000 parts operating for one hour fail is the same as one part operating for 1000 hours.
• It assumes that a part's reliability is predictable based on some simple mathematical function.
• I see wide variations in part failure rates that depend on the part's application and how the vendor build it.
• I frequently see lot-dependent component failures.
• Most part failures are not random.
• They are caused by manufacturing issues, misapplication, environmental issues (e.g. lightning), etc.
• In some cases, they are caused by wear-out (e.g. I just dealt with a rash of dried-out, ten-year old electrolytic capacitor failures)
• It assumes that all vendors have the same quality level.
• It assumes that system's failure rate is the sum of all the individual component failure rates.
• Many issues are related to interaction problems.
• Ignores the fact that how you hook up the parts matters.
• Installation issues are a major source of equipment problems.
• I frequently see installations where there is contamination or wind-generated motion that causes device failure.
• I have reported on this blog numerous cases of insect infestation.
• These issues drive field failure rates far more than random part failures.
• The "elephant in the reliability room" is that software failures tend to dominate over hardware failures.

## Analysis

Figure 2 shows my calculations for a made-up example.

Figure 2: Made-up Example Showing Annualized Failure Rate Calculation.

## Conclusion

In general, I find all formal procedures distasteful. In this case, people want a calculation done in a specific manner – and I dutifully comply. However, I know the answer does not reflect reality. In general, these computed annualized failure rates are ~10x what I would consider acceptable annual failure rates for actual products.

I recently had a conversation with an Australian service provider who was having trouble predicting the number of spare parts he needed to have in inventory. The problem he was having traced directly back to this calculation.

Save

Save

Save

This entry was posted in Statistics. Bookmark the permalink.

### 6 Responses to MTBF and Annualized Failure Rates

1. mark tucker says:

Good timing for this article. I just completed a Telcordia SR-332 calculation for one of our products. Of course, I don't believe in the number I calculated. The customer - if he's even remotely aware how these things are calculated - won't believe the number either. For that matter, I've never met anyone that does believe these numbers.

However, the process has been dutifully followed and we've all played our parts correctly. I too wish there was a better way to do this. Right now, we're just all living with (and perpetuating) the same lie.

• mathscinotes says:

You would be amazed at the number of folks I talk to who are budgeting people and spares based on these numbers. Not just in the US – all over the world.

Thanks for the note.

mathscinotes

2. ANAND says:

Has anyone tried to use demonstrated MTBF numbers for the parts based upon true reliability testing data as opposed to using the estimated numbers from Telcordia data? Curios to know if that is any closer to the AFRs?

• mathscinotes says:

People definitely do component accelerated life testing to established measured values. I am designing a laser life test right now! If you want to see an example of the results of this testing, see this post. The testing generally involves running ~150 lasers at high temperature for 2000 hours. In general, the predicted AFRs are less than the actual AFRs. I only use Telcordia predictions for comparisons between assemblies – they do an EXTREMELY poor job of estimating real AFR because AFR in telcom applications is often driven by environmental issues (e.g. lightning, insect damage, etc.) and not random failures or wearout.

mark

3. Earles L McCaul says:

What "name" should we assign to software MTBF? How long does it 'seem' to work (as intended) before the inevitable coding 'bug' occurs? (Hint: code size/lines of code, etc.).