Estimating Disk Failure Rates.

February 14, 1996

The Rensselaer Computing System is built on top of the Andrew File System (AFS). AFS at RPI runs on 15 fileservers which together provide access to over 100 gigabytes of data. AFS is robust in that the failure of a single drive or fileserver does not bring down the entire system. Files can be restored to an active system, and key files can be replicated over several fileservers.

With over 70 individual disk drives in AFS, however, frequent disk failures are to be expected. Even if an individual disk had only a .01 probability of failing in a given year, with 70 disks the probability of at least one drive failing is $1 - .99^{70}$, or about .51. That is, even with high reliability, we should expect to see frequent disk failures.
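This is just the complement rule: assuming the drives fail independently, the probability that none of the 70 drives fails in a given year is $.99^{70}$, so

\[
  P(\text{at least one failure}) = 1 - (1 - .01)^{70} = 1 - .99^{70} \approx .51 .
\]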

This document is an attempt to evaluate disk reliability based on manufacturer estimates of Mean Time Between Failures. It is instructive to see whether our observed failure rates are what one would expect given manufacturer reliability estimates. After defining Mean Time Between Failures and calculating the observed failure rate in RCS, some recommendations for disk replacement are made.

Mean Time Between Failures.

Disk reliability is usually reported by manufacturers as the MTBF, or Mean Time Between Failures. Understanding what this metric means, and how it is measured, is important for estimating disk failure rates. IBM's definition1 amounts to the following: run a defined group of drives for a given amount of time, and divide the total hours of operation by the number of failures:

\[
  \text{MTBF} = \frac{\text{hours of operation}}{\text{number of failures}}
\]
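For illustration (the numbers here are hypothetical, not IBM's): if 100 drives are each run for 5,000 hours and exactly one fails, the reported MTBF would be

\[
  \text{MTBF} = \frac{100 \times 5{,}000\ \text{hours}}{1\ \text{failure}} = 500{,}000\ \text{hours}.
\]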

There are three important considerations left out of this equation: what constitutes a ``defined group'' of drives, what counts as a failure, and what the underlying distribution of failures looks like.

IBM defines the ``defined group'' as drives that:

  1. have not reached end-of-life (typically five to seven years),
  2. are operated within a specified reliability temperature range, under specified normal usage conditions, and
  3. have not been damaged or abused.

Most important to note is that if a drive was manufactured for a lifetime of five years, it is no longer included in the MTBF calculation once it is five years old.2

A failure is ``[a]ny event that prevents a drive from performing its specified operation, given the drive meets the group definition [described above].'' This does include drives that fail during shipment or in early life. It does not include drives that are mis-installed or mishandled.

Finally, MTBF does not provide any estimate of variance. That is, does an MTBF of 100,000 hours mean a single drive run for over 11 years, or 10 drives each run for a little over a year? Knowing the number of drives would allow one to estimate the variability of the MTBF. However, since MTBF is usually reported in increments of 100,000 hours, there is a bit of rounding in the reported figure.3
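The arithmetic behind the two readings is simply

\[
  \frac{100{,}000\ \text{hours}}{8{,}760\ \text{hours/year}} \approx 11.4\ \text{years for one drive},
  \qquad
  \frac{100{,}000\ \text{hours}}{10 \times 8{,}760\ \text{hours/year}} \approx 1.1\ \text{years per drive for ten}.
\]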

You may ask how the MTBF of a drive with a five year lifetime can exceed 43,800 hours (the number of hours in five years). The answer is that, if the drives were replaced every five years with new drives of identical MTBF, you would have a good probability of reaching the MTBF before seeing a failure. How good a probability depends on the distribution, which we do not know, but IBM suggests it is greater than .30 for their 1,000,000 hour MTBF product line.
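As a rough illustration (the exponential failure model is assumed here; IBM does not state the distribution): if failures arrive at a constant rate of one per MTBF hours, the probability of accumulating MTBF hours of operation without seeing a failure is

\[
  P(\text{no failure in MTBF hours}) = e^{-\text{MTBF}/\text{MTBF}} = e^{-1} \approx .37,
\]

which is consistent with the greater-than-.30 figure quoted above.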

Estimating the Number of Failures From MTBF.

One important thing to remember is that we have more than one drive running at any given time (see the tables at the end of this document). This means that the hours of runtime contributing to MTBF add up in parallel. IBM suggests the following equation to estimate the number of failures to expect over the lifetime of a ``drive group'':

\[
  r \approx \frac{n\ \text{drives} \times h\ \left(\frac{\text{hours}}{\text{drive}}\right)}
                 {\text{MTBF}\ \left(\frac{\text{hours}}{\text{failure}}\right)}
\]

For example, if you run 1,000 drives for five years, and each drive has an MTBF of 1,000,000 hours, we have:

\[
  44 \approx \frac{1{,}000\ \text{drives} \times 43{,}800\ \left(\frac{\text{hours}}{\text{drive}}\right)}
                  {1{,}000{,}000\ \left(\frac{\text{hours}}{\text{failure}}\right)}
\]

Lacking any information on the MTBF variance, this seems to be a reasonable formula.
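The same calculation can be scripted; the following Python sketch (the function and constant names are mine, not IBM's) reproduces the 44-failure example above:

    # Expected number of failures over the lifetime of a drive group:
    #   r ~ (n drives * h hours per drive) / MTBF
    HOURS_PER_YEAR = 8760

    def expected_failures(n_drives, hours_per_drive, mtbf_hours):
        """Approximate failures expected while the drive group is in service."""
        return n_drives * hours_per_drive / mtbf_hours

    # 1,000 drives run for five years at a 1,000,000 hour MTBF: about 44 failures.
    print(expected_failures(1000, 5 * HOURS_PER_YEAR, 1000000.0))  # 43.8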

Disk Loss in the rpi.edu Cell.

We currently have 71 disks in AFS space. Assuming each disk has an MTBF of 500,000 hours, over a five-year period we would expect:

\[
  \text{failures} = 6 \approx \frac{71\ \text{drives} \times 43{,}800\ \left(\frac{\text{hours}}{\text{drive}}\right)}
                                   {500{,}000\ \left(\frac{\text{hours}}{\text{failure}}\right)}
\]

In reality, we have seen six failures since the end of July. Our observed MTBF, therefore, is less than 500,000 hours.

A rough estimate of observed MTBF can be calculated by solving the above formula for MTBF. In the past year we have seen about eight disk failures.4 This gives an MTBF of 388,725 hours, assuming a five year lifetime for each drive and assuming all 71 drives have been running for five years. The second assumption is obviously false, so the result is an upper bound on observed MTBF (fewer drive-hours for the same number of failures means a lower MTBF). The low MTBF value is most likely due to a number of ``real life'' variables, such as the recent power problems and the use of the drives in fileservers, which increases the seek rate and reduces the manufacturer's MTBF.
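Explicitly, solving the failure-count formula for MTBF with the eight observed failures gives

\[
  \text{MTBF} \approx \frac{71\ \text{drives} \times 43{,}800\ \left(\frac{\text{hours}}{\text{drive}}\right)}{8\ \text{failures}} = 388{,}725\ \text{hours}.
\]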

We can also use the observed probability of failure to calculate a 95% confidence interval for the number of disks expected to fail next year. Scaling the usual confidence interval for an observed proportion by the number of drives gives:

\[
  np \pm n z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}
\]

Where $p = r/n$, $n$ is the number of drives, and $z_{\alpha/2}$ is the z-score at the desired confidence level. Using our observed failure rate of 8 out of 71 drives, we get a 95% confidence interval of $8 \pm 5.37$, or roughly 3 to 13 disks.
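The interval can also be computed with a short Python sketch (the variable names are mine; the counts come from above, and 1.96 is the standard z-score for a 95% level):

    # 95% confidence interval for the number of drive failures next year,
    # using the normal approximation to the observed failure proportion.
    from math import sqrt

    n = 71             # drives currently in AFS space
    r = 8              # failures observed over the past year
    p = float(r) / n   # observed probability that a drive fails in a year
    z = 1.96           # z-score for a 95% confidence level

    half_width = n * z * sqrt(p * (1 - p) / n)
    print(n * p - half_width, n * p + half_width)   # roughly 3 to 13 disks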

Recommendations.

First, we should have a replacement disk in stock for all of the rootvg (root volume group) disks on the fileservers. The rootvg contains the operating system and the applications (AFS) code. Losing the rootvg on a fileserver will bring down the server and prevent access to all of the AFS files on that server until it is replaced.

The replacement needs to be compatible, but not identical. For example, if one of the two 670 Meg disks used for aaron's rootvg were to fail, it could be replaced with a 1 gig disk. A complete listing of the fileservers and their rootvg disks is at the end of this document.

Second, about 10% of total AFS space should be available as unclaimed disks at any one time. This allows for day-to-day growth and provides a pool of partitions to which files can be restored in case of a disk failure. It is best if this reserve is divided into partitions of varying sizes. A single 2 gig disk may be divided into 4 partitions; if this disk fails, we would ideally need 4 partitions of equal or larger size for the restore. A large reserve drive, such as a 4.3 gig SSA disk, could be repartitioned into the required sizes, but this would require at least one restart of AFS on the fileserver.

At this time, we have less than 5% of AFS space in reserve, and half of this is in small, miscellaneous partitions that would not allow a full restore of any lost partition. There are, however, about 10 gig worth of disks coming into service soon.

Third, any estimate of future disk use should also take into account end-of-lifetime replacements. A simple formula for this is the number of disks divided by the disk lifetime. Assuming an average lifetime of five years, we should expect to replace $\lfloor 71/5\rfloor = 14$ disks per year. This is a pro-active replacement strategy that removes old disks before they fail. The disks can be cycled into less critical applications, such as individual workstations or hot spares, or they can be sold as used disks. This is in addition to the half-dozen (or more) disks we can expect to die in a given year based on observed MTBF.
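Putting the two numbers together (treating the eight failures observed over the past year as the yearly failure estimate), a rough yearly replacement budget in Python:

    # Rough yearly disk-replacement budget: proactive end-of-life swaps
    # plus the failures expected from the observed failure rate.
    n_disks = 71
    lifetime_years = 5
    proactive = n_disks // lifetime_years           # 14 disks retired per year
    observed_failures_per_year = 8                  # from the past year
    print(proactive + observed_failures_per_year)   # about 22 disks per year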

Disks Installed in AFS fileservers.

The following tables are a summary of the disk types installed in the AFS fileservers. The actual drives are, of course, subject to change as time progresses. An updated listing in Xess format can be found in sofkam/public/disks.x3. The current pool of AFS partitions can be found with the vspace program, which can be accessed by setting up afstools.

Root volume group disks by fileserver.

\vbox{\halign{#\hfil\tabskip1em &#\hfil &\hfil#\tabskip2pt&#\hfil\tabskip1em &#\hfil &#\hfil &#\hfil &\hfil #\tabskip0pt\cr
\bf Server &\bf\hfil CPU &\bf\span Size &\bf\hfil Disk &\bf\hfil Type/Model &\bf\hfil Serial no. &\bf\hfil Part no.\cr
\noalign{\vskip2pt}
aaron          &950  &670&MB &hdisk11 &8760S     &12311702\cr
aaron          &950  &670&MB &hdisk12 &8760S     &12314057\cr
abraham        &520  &670&MB &hdisk4  &8760S     &12914718\cr
adam           &220  &400&MB &hdisk0  &0661467   &05193408 &73F8955\cr
asher          &370  &540&MB &hdisk0  &MXT-540SL &003B1NGE &74G8675\cr
azariah        &250  &1.0&GB &hdisk0  &0663L12   &00130560 &45G9512\cr
david          &230  &1.0&GB &hdisk0  &0663L12   &00002083 &55F9838\cr
hannah         &530  &670&MB &hdisk0  &8760S     &12912472\cr
jonah          &530  &670&MB &hdisk0  &8760S     &12832601\cr
levi           &230  &1.0&GB &hdisk0  &0663L12   &00016356 &45G9464\cr
mishael        &550  &400&MB &hdisk0  &0661467   &05051197 &73F8955\cr
mishael        &550  &400&MB &hdisk1  &0661467   &05061971 &73F8955\cr
moses          &520  &670&MB &hdisk0  &8760-S    &12885641\cr
nebuchadnezzar &530  &670&MB &hdisk0  &8760S     &12957668\cr
noah           &320H &400&MB &hdisk0  &0661-467  &05198791\cr
samson         &530  &670&MB &hdisk2  &8760S     &12980090\cr
seth           &320H &400&MB &hdisk0  &0661-467  &05053116\cr
}}

AFS disks by type.

\vbox{\halign{\hfil#\tabskip1em& \hfil#\tabskip2pt&#\hfil\tabskip1em &#\hfil\tabskip0pt\cr
\bf No. &\bf\span Size &\bf Description\cr
\noalign{\vskip2pt}
 2 &1.0&GB &SCSI Disk Drive\cr
 4 &1.3&GB &Hitachi SCSI Disk Drive\cr
 4 &1.3&GB &IBM SCSI Disk Drive\cr
 1 &1.6&GB &Microp SCSI Disk Drive\cr
 4 &2.0&GB &HP SCSI Disk Drive\cr
 1 &2.0&GB &IBM OEM SCSI Disk Drive\cr
11 &2.0&GB &SCSI Disk Drive\cr
 7 &2.8&GB &Hitachi SCSI Disk Drive\cr
 6 &2.8&GB &Seagate SCSI Disk Drive\cr
 4 &355&MB &SCSI Disk Drive\cr
 6 &4.3&GB &SSA Logical Disk Drive\cr
 3 &400&MB &SCSI Disk Drive\cr
 3 &628&MB &HP SCSI Disk Drive\cr
 4 &640&MB &Microp SCSI Disk Drive\cr
11 &670&MB &SCSI Disk Drive\cr
\noalign{\hrule\vskip4pt}
71 &123&GB &Total\cr
}}


1 This and subsequent quotes are from: MTBF---A measure of OEM disk drive reliability, http://eagle.almaden.ibm.com/storage/oem/tech/mtbf.htm, August 21, 1995.

2 In other words, a catastrophic failure of the bearings at five years, 1 month is not a failure according to MTBF.

3 The reality is usually that no drives are tested before shipping, in which case the MTBF is an estimate based on the performance of similar drives in the field.

4 The actual number may be higher. Eight is based on memory, but a more accurate count based on invoices is in the works.