Good news, bad news, and more bad news: The good news is that there have been a couple of exhaustive studies on hard drive failure rates. The bad news is that the studies are old. More bad news: with the growing acceptance of solid state drives, we may not see another study.
Google’s Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso wrote a great paper, “Failure Trends in a Large Disk Drive Population”, published at the USENIX FAST conference in ’07. Reading this paper in today’s context, the takeaways are:
- The study covered consumer-grade drives (serial and parallel ATA).
- Operating temperature is not a good predictor of failure (i.e., it is not conclusive that drives operating in higher-temperature environments fail more frequently).
- Disk utilization (wear) is not a good predictor of failure.
- Plotted over time, failure rates look more like a check mark or fish hook than a bathtub curve.
Carnegie Mellon’s Bianca Schroeder and Garth A. Gibson also published a paper at USENIX’s FAST 2007 conference, “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?”. Their study covered FC, SCSI, and SATA disks. The takeaway is that in the first 5 years it is reasonable to expect a failure rate of about 3% per year, again following a check mark pattern. After 5 years, the failure rates are, of course, more significant.
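To put the paper's title question in numbers, here is a minimal Python sketch of the arithmetic. It assumes a constant failure rate, independent drives, and that the 3% figure is an annual rate; those are simplifying assumptions of mine, not claims from the paper, and the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mttf_to_annual_failure_rate(mttf_hours):
    """Nominal annual failure rate implied by a datasheet MTTF.

    Uses the simple ratio hours-per-year / MTTF, which is a reasonable
    approximation when the MTTF is much longer than a year.
    """
    return HOURS_PER_YEAR / mttf_hours


def cumulative_failure_probability(annual_rate, years):
    """Chance that one drive fails within `years`.

    Assumes (hypothetically) the same independent failure rate every year.
    """
    return 1.0 - (1.0 - annual_rate) ** years


datasheet_rate = mttf_to_annual_failure_rate(1_000_000)
observed_rate = 0.03  # assumption: the 3% figure treated as an annual rate

print(f"Datasheet MTTF of 1,000,000 hours implies {datasheet_rate:.2%} per year")
print(f"Observed rate is about {observed_rate / datasheet_rate:.1f}x the datasheet figure")
print(f"Chance a drive fails within 5 years at 3%/year: "
      f"{cumulative_failure_probability(observed_rate, 5):.1%}")
```

Under these assumptions, a 1,000,000 hour MTTF works out to under 1% per year, so a field rate around 3% is several times what the datasheet suggests.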
Here’s the problem: in 2007, they were analyzing disks, some of which were over 5 years old. So, is there any expectation that these failure rates are good proxies for the disks we have bought in the last 6 years? (The CMU paper lists the disks that were sampled, if the reader cares to look.) It would be natural to assume that disk drive vendors have improved the durability of their products. But the introduction of SAS (Serial Attached SCSI) drives into the technology mix might bring similar failure rates, since the technology is new(er).
Another study of note comes from the University of Illinois: “Are Disks the Dominant Contributor for Storage Failures?” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky, published at the ’08 FAST conference. Today’s takeaway is that when there is a single disk failure, another failure is likely to follow sooner than the odds would dictate (i.e., the events can be correlated).
This still seems to hold today: the research presented here was prompted by cascading failures that initially stemmed from a disk failure. So, the question is: when can we expect another disk failure?
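As a rough baseline for that question, here is a small Python sketch that estimates the wait until the next failure in a fleet. It assumes independent drives with a constant failure rate (a Poisson-style model), and the fleet size of 100 and the 3% annual rate are hypothetical inputs chosen for illustration. Since the Illinois study suggests failures are correlated, treat the result as an optimistic estimate: the real wait after a first failure may well be shorter.

```python
import math

DAYS_PER_YEAR = 365


def fleet_failures_per_year(num_disks, annual_failure_rate):
    """Expected failures per year across the fleet, assuming independent drives."""
    return num_disks * annual_failure_rate


def expected_days_to_next_failure(num_disks, annual_failure_rate):
    """Mean wait until the next failure under a constant-rate model."""
    return DAYS_PER_YEAR / fleet_failures_per_year(num_disks, annual_failure_rate)


def probability_of_failure_within(num_disks, annual_failure_rate, days):
    """Chance of at least one failure in the next `days` days (Poisson model)."""
    expected_failures = (
        fleet_failures_per_year(num_disks, annual_failure_rate) * days / DAYS_PER_YEAR
    )
    return 1.0 - math.exp(-expected_failures)


# Hypothetical fleet: 100 disks at an assumed 3% annual failure rate.
print(f"Expected days to next failure: {expected_days_to_next_failure(100, 0.03):.0f}")
print(f"Chance of a failure within 7 days: "
      f"{probability_of_failure_within(100, 0.03, 7):.1%}")
```

If observed failures cluster much more tightly than this baseline predicts, that is a hint the independence assumption does not hold for your environment.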
Authoritative, vendor-neutral research like this would be more helpful if the data were current. Until then, this data has to serve as a proxy for predictions, or one can fall back on anecdotal information. Sigh.