Large IT infrastructure operators are rarely upfront about their hardware failure rates, so when a company releases detailed figures, it’s worth paying attention.
Hard drive attrition rates affect several aspects of the web hosting industry. The most obvious is hardware cost: hard drives die, so companies have to replace them. Equally important is planning for redundancy to maintain performance and stability. How much redundancy is needed depends on the likely rate of hardware failure. There’s no point in having hundreds of hard drives sitting idle in a warehouse if the failure rate means only a few will be required at any one time. On the other hand, a couple of spares may be nowhere near enough for a data center that runs thousands of servers.
Accurate failure rates help us steer a safe path between the two extremes.
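To make the trade-off concrete, here is a rough sketch of how an annual failure rate might be turned into a spares count. It is not a method described by any vendor: the fleet size, the 5% failure rate, the 14-day restocking lead time, and the 99% confidence target are all illustrative assumptions, and so is the choice to model failures as a Poisson process.

```python
import math

def expected_failures(fleet_size, annual_failure_rate, lead_time_days):
    """Mean number of drive failures expected during one restocking window."""
    return fleet_size * annual_failure_rate * lead_time_days / 365.0

def spares_needed(mean_failures, confidence=0.99):
    """Smallest stock level k with P(failures <= k) >= confidence.

    Assumes failures in the window are Poisson-distributed, which is a
    modelling choice for illustration, not something measured.
    """
    if mean_failures > 700:
        # math.exp(-mean) would underflow to 0.0; this sketch only
        # covers modest means.
        raise ValueError("mean_failures too large for this simple sketch")
    term = math.exp(-mean_failures)  # P(exactly 0 failures)
    cumulative = term
    k = 0
    while cumulative < confidence:
        k += 1
        term *= mean_failures / k    # P(exactly k failures)
        cumulative += term
    return k

# Hypothetical inputs: 1,000 drives, 5% annual failure rate, 14-day lead time.
mean = expected_failures(1000, 0.05, 14)
print(f"expected failures per lead time: {mean:.2f}")
print(f"spares to hold: {spares_needed(mean)}")
```

With a higher failure rate or a longer lead time the same calculation calls for more spares, which is why an accurate failure rate matters more than a guess.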
Backblaze is a cloud-based backup service. At any given time, it stores 75 petabytes of data on 25,000 standard consumer-grade hard drives.
Hard drive failure rates usually follow a well-known pattern. A certain percentage of drives fail soon after being put into production, reflecting manufacturing defects or other damage. Failure rates then drop considerably before gradually rising again as parts wear out. Reliability engineers call this the bathtub curve.
Backblaze’s hard drive failure rates follow exactly the expected curve: a high rate of initial failure declines to a low but constant rate as drives die at random, and then the rate climbs again.
The figures released by Backblaze show that many hard drives will fail during their first year of use, with a low and stable failure rate for the following two years, followed by a very sharp rise in failures after the three-year point. If a hard drive lasts through the first year, it’s fairly unlikely that it will fail in the next two years. Despite the increasing rates of failure as time goes by, 80% of drives are in working order after four years.
The actual failure rates are the most interesting part of the study. During the first year of production use, Backblaze’s drives experience a failure rate of 5.1%, which falls to 1.4% for years 2 and 3, and then shoots up to 11.8% in the fourth year.
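As a quick sanity check, the annual rates can be chained into a cumulative survival figure. The sketch below assumes that each rate applies to the drives still alive at the start of that year, and that the 1.4% figure applies to year two and year three separately; both are interpretations of the numbers above rather than something the study states in this form.

```python
# Annual failure rates for years 1 to 4, as quoted above.
annual_failure_rates = [0.051, 0.014, 0.014, 0.118]

def survival_by_year(rates):
    """Fraction of the original drives still working at the end of each year.

    Assumes each annual rate is conditional on surviving the previous years.
    """
    surviving = 1.0
    result = []
    for rate in rates:
        surviving *= 1.0 - rate
        result.append(surviving)
    return result

for year, fraction in enumerate(survival_by_year(annual_failure_rates), start=1):
    print(f"end of year {year}: {fraction:.1%} still working")
```

Under those assumptions roughly 81% of drives survive four years, which is in line with the 80% figure quoted for the study.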
Of course, failure rates are affected by many factors, including how the hard drives are used, where they were sourced, and how they are looked after during their life, so other sources should also be considered. Even so, this study, drawn from 25,000 drives, provides a statistically meaningful guide to hard drive attrition rates that will be useful for planning redundancy levels and likely replacement costs over time.
Photo Credits: Numinosity