Author Topic: Researching practical HDD reliability/solutions...  (Read 2917 times)


Offline MyHeadHzTopic starter

  • Regular Contributor
  • *
  • Posts: 165
  • Country: us
Researching practical HDD reliability/solutions...
« on: February 24, 2019, 08:10:47 am »
I am trying to consolidate relevant data on hard disk drives (HDDs) with the purpose of making practical recommendations for any end user who has an interest in hard disk storage and longevity.  I am focusing only on the home user or small NAS user, not enterprise solutions, though most of the sources are aimed at just the opposite.  Being an idiot, I would greatly appreciate some input.

Background: I've had several hard drives fail recently, including several models known to be quite reliable, so I must be doing something wrong.  Current and previous strategies include:

1.  arbitrary vibration-dampened single-bay off-the-shelf sealed USB drives (no fans/accessibility)
2.  dual-bay enclosures with fans (you add your own drives), non-dampened
3.  non-dampened  external single-bay USB enclosures with fans
4.  non-dampened external single-bay enclosures, without fans
5.  standard SATA mounted drives screwed directly to computer tower frame (no dampening).
(It is worth noting that I've never tried RAID, a dedicated NAS box, or internally-mounted drives that were dampened.)

I've had more failures (dead drives and SMART failures) with solutions 1 and 2, which seemed counter-intuitive to me- especially considering I used high-quality enterprise drives with those.  My initial intent was to find a scalable method to decouple the HDD's from vibration to solve the problem.  Several people strongly asserted that vibration/sound dampening decreases hard drive longevity and write/speed reliability.  This seemed counter-intuitive, but it did happen to agree with my anecdotal drive failure experiences.  There were a lot of differing opinions on various forums, so I began researching journal sources.  It turned out to be quite the rabbit hole. 

Chan (2012) differentiates between vibration (predictable, consistent frequencies and amplitudes) and externally produced shock.  These are different problems with different solutions.  For the home user, it is practical to decouple fans and other external vibration sources from the HDD, and I will do that where I can.  Chan also notes that vibration at frequencies as low as 2 Hz can affect HDD performance.

Park (2012) focuses on dampening solutions for 2.5" (laptop) HDDs with regard to shock and vibration tolerance.  Many papers seemed to focus on one or the other, but Park relates the two.  Park includes a chart of transmissibility for various rubber-based decoupling solutions, and another chart relating frequency to the position error signal (PES), which ties read/write errors to amplitude and frequency.  As with the other papers, there was no discussion of which frequencies/factors were most problematic to overall long-term hard drive life.  However, as this was a scientific paper intended as a reference for people designing relevant devices, the relevant parameters would vary greatly.  As much as I would like to assume that the PES and long-term failure rate are related, I cannot necessarily support that assumption with this data.  Also, all the data in Park was for 2.5" drives, so I suspect the frequency charts will be significantly different for 3.5" drives.
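For my own intuition, here is a quick sketch of the standard single-degree-of-freedom transmissibility formula that charts like Park's are built on.  This is not from the paper; the mount numbers below are placeholders I made up just to see the shape of the curve:

Code: [Select]
import math

def transmissibility(f, f_n, zeta):
    """Transmissibility of a single-degree-of-freedom mass-spring-damper isolator.
    f: excitation frequency (Hz), f_n: mount natural frequency (Hz), zeta: damping ratio."""
    r = f / f_n
    num = 1 + (2 * zeta * r) ** 2
    den = (1 - r ** 2) ** 2 + (2 * zeta * r) ** 2
    return math.sqrt(num / den)

# Placeholder guesses: soft rubber grommets with a ~40 Hz natural frequency, 10% damping.
f_n, zeta = 40.0, 0.1
for f in (2, 20, 40, 56, 120, 240):  # Hz; 120 Hz is the rotation frequency of a 7200 rpm spindle
    print(f"{f:4d} Hz -> T = {transmissibility(f, f_n, zeta):.2f}")

The takeaway from the formula is that an isolator only attenuates above roughly 1.4x its natural frequency; at and below resonance it transmits or even amplifies, which may be part of why soft mounts can make steady spindle vibration worse rather than better.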

Again, for my purposes, the external forces can be understood and mitigated through isolation.  That leaves issues originating from the drive itself.  The primary sources for these seem to be sinusoidal vibration (platter balance), vibrations from the mechanical head movement, and any resonance issues.

I had a hard time finding information about overall long-term drive reliability.  I found several sources that reference "ideal mounting" of hard drives as attachment to large stationary objects, such as granite slabs (Kelly, 2016) or ~20 kg metal blocks (Suwa, 1999).  However, I was unable to find why that is used as a standard.  I saw reports referencing it going back into the 90's whose original sources I couldn't find online.  Would conclusions drawn from such old technology even still be relevant?

My main concern is that although rigid mounting reduces measurable vibration, does that necessarily mean the drives will be more reliable?  Could it be that the energy is dissipated into the disk/actuator/head/etc. itself, causing stress or strain that eventually reduces lifespan, instead of being safely emitted elsewhere?  I suppose this is where my lack of knowledge of the subject comes in.  Any input would be greatly appreciated!

In the meantime, I'll be trying to figure out some way to reliably mount HDDs in landscaping bricks.

There is still other useful knowledge to apply from what I've learned:
1.  Don't use dual enclosures.
2.  Don't use rubber/soft mounts- they are good for drops, but reduce lifespan otherwise.
3.  Physically isolate fans and other sources of vibration.
4.  I forgot to mention it above, but a major factor in reliability is start/stop cycles, so I will probably set up a FreeNAS box to address that (a quick sketch of how I plan to keep an eye on those counters is below).
All of that means I will need to redo my entire storage system.  Oh, well.
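Here is the quick sketch mentioned above for keeping an eye on those counters.  It assumes smartctl (from smartmontools) is installed and that the drive reports its attributes under the usual names, which isn't guaranteed for every vendor:

Code: [Select]
import subprocess

# Wear counters related to spin-up/spin-down and head parking.
# Names are as smartctl usually prints them; some vendors label them differently.
WATCH = ("Start_Stop_Count", "Power_Cycle_Count", "Load_Cycle_Count")

def smart_counts(device):
    """Return the raw values of a few wear-related SMART attributes."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    counts = {}
    for line in out.splitlines():
        fields = line.split()
        # smartctl -A rows have 10 columns: ID#, ATTRIBUTE_NAME, ..., RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCH:
            counts[fields[1]] = fields[9]
    return counts

print(smart_counts("/dev/sda"))  # placeholder device node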

I read through the Backblaze reliability statistics as well.  A lot of their methodology may not apply to normal users.  Those racks are big and heavy, and probably do well to dampen most resonances, or at least particular resonances, so their numbers may or may not translate well to desktop system use, which they explicitly state.  There also aren't any "control" drives outside of their normal 45-drive enclosures to compare against.
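For reference, the headline number in those reports is just failures per accumulated drive-year, which is easy to reproduce.  The figures below are made up; the real drive-day and failure counts are in Backblaze's published tables:

Code: [Select]
def annualized_failure_rate(failures, drive_days):
    """Backblaze-style AFR: failures per drive-year of accumulated runtime."""
    return failures / (drive_days / 365.0)

# Made-up example: 1200 drives of one model running a ~90-day quarter, with 6 failures.
print(f"AFR = {annualized_failure_rate(6, 1200 * 90):.2%}")  # ~2.03%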




Chan, 2012 - http://seelab.ucsd.edu/papers/cschan_gm13.pdf
Park, 2012 - https://sci-hub.tw/10.1007/s00542-012-1592-z
Kelly, 2016 - https://45drives.blogspot.com/2016/09/everything-you-need-to-know-about-hard.html
Suwa, 1999 - https://sci-hub.tw/10.1109/20.753800

edit:typos
« Last Edit: February 24, 2019, 08:21:18 am by MyHeadHz »
 

Offline magic

  • Super Contributor
  • ***
  • Posts: 7453
  • Country: pl
Re: Researching practical HDD reliability/solutions...
« Reply #1 on: February 24, 2019, 10:08:39 am »
I've had more failures (dead drives and SMART failures) with solutions 1 and 2, which seemed counter-intuitive to me- especially considering I used high-quality enterprise drives with those.  My initial intent was to find a scalable method to decouple the HDD's from vibration to solve the problem.  Several people strongly asserted that vibration/sound dampening decreases hard drive longevity and write/speed reliability.  This seemed counter-intuitive, but it did happen to agree with my anecdotal drive failure experiences.  There were a lot of differing opinions on various forums, so I began researching journal sources.  It turned out to be quite the rabbit hole.
Interesting.
I think a possible problem with soft suspension is that vibration generated by any mass imbalance in the motor or platters shakes the whole disk and hammers on the bearings of the head arm.  It likely increases the effort of keeping the heads on track too, so if the disk is already starting to fall apart it could increase the rate of read retries and write errors.  Pure speculation.
I'm not sure how forces acting on spindle bearings are affected. Probably it makes no difference whether the disk chassis is stationary and the center of mass of the motor/platter system orbits the rotation axis or vice versa.
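A back-of-the-envelope number, with a completely guessed imbalance figure, just to show the magnitude involved:

Code: [Select]
import math

# Rotating imbalance force: F = (m*e) * w^2, where m*e is the imbalance (kg*m)
# and w is the angular speed in rad/s.
rpm = 7200
imbalance_g_cm = 0.1             # guessed residual imbalance of 0.1 g*cm
m_e = imbalance_g_cm * 1e-5      # convert g*cm to kg*m
w = 2 * math.pi * rpm / 60.0
print(f"~{m_e * w**2:.2f} N of rotating force at {rpm} rpm")  # ~0.57 N

On a rigid mount that force is reacted by the chassis; on a soft one it goes into shaking the drive itself, which is roughly the effect I was speculating about.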
Anyway, I have one disk suspended on rubbers and it seems to be doing fine so far. But I did it for noise suppression, not out of concerns about external vibrations affecting the disk. Maybe I should consider mounting it normally.
 

Offline jopapeca

  • Newbie
  • Posts: 3
Re: Researching practical HDD reliability/solutions...
« Reply #2 on: February 24, 2019, 12:41:24 pm »
Hi,

I have some FreeNAS/NAS4Free boxes running 24/7 using standard drives, usually Western Digital, in RAID configuration.  The boxes are mounted in a rack with other servers.  The only issue is drive failure after many hours of operation, but since the drives are consumer grade I consider that normal.  So we just implemented a scheduled swap after some time (1 year), before failure, since the drive cost is much, much less than enterprise-grade drives and performance is almost the same for our application.
I had one mechanical failure after a very short time, due to someone unplugging the wrong plug on the rack power while the NAS4Free box was performing a large copy.

Sent from my SM-G935F via Tapatalk
« Last Edit: February 24, 2019, 12:44:07 pm by jopapeca »
 

Offline texaspyro

  • Super Contributor
  • ***
  • Posts: 1407
Re: Researching practical HDD reliability/solutions...
« Reply #3 on: February 25, 2019, 04:37:06 am »
The first thing to do is check out Backblaze's hard drive reliability reports.  They use zillions of consumer grade HDDs in their data centers and regularly post their drive failure stats.
 

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4209
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
Re: Researching practical HDD reliability/solutions...
« Reply #4 on: February 25, 2019, 08:03:37 am »
Don't be fooled by Backblaze's numbers.
They use a low sample count for certain drive models.  They might spot a manufacturing issue if half of the units fail, but other than that those numbers don't tell you much.
For example, you might say "Seagate bad" since those are the only ones that failed a lot.  But those are also the ones they bought in >20k lots.
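To put a number on that: an exact Poisson confidence interval on the failure rate shows how little a small population tells you.  The sketch assumes scipy is available, and the drive counts are invented:

Code: [Select]
from scipy.stats import chi2

def afr_confidence_interval(failures, drive_days, conf=0.95):
    """Exact Poisson confidence interval on an annualized failure rate."""
    drive_years = drive_days / 365.0
    alpha = 1.0 - conf
    lo = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures else 0.0
    hi = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return lo / drive_years, hi / drive_years

# Invented example: 45 drives of one model running for a year, 2 failures observed.
lo, hi = afr_confidence_interval(2, 45 * 365)
print(f"AFR 95% CI: {lo:.1%} .. {hi:.1%}")  # roughly 0.5% .. 16%

When the interval spans more than an order of magnitude, the small-population rows in those tables are basically noise.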
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15795
  • Country: fr
Re: Researching practical HDD reliability/solutions...
« Reply #5 on: February 25, 2019, 03:56:07 pm »
I have some FreeNAS/NAS4Free boxes running 24/7 using standard drives, usually Western Digital, in RAID configuration.  The boxes are mounted in a rack with other servers.  The only issue is drive failure after many hours of operation, but since the drives are consumer grade I consider that normal.

I have a custom-made NAS as well (running CentOS) using RAID-5.  The first "version" of this NAS had 3 WD drives from the "Green" series.  Two of the three were showing signs of imminent failure after about 3 years of 24/7 operation with a pretty moderate data volume overall (fortunately, the data was still 100% readable so I could swap in new drives).  The Green drives were actually known to be prone to "early" failures (I still consider 3 years of very moderate operation "early", but many people were not as lucky and had this happen within 1 to 2 years) due to a firmware design decision: head parking way too often (I don't remember the exact period but it was insanely short).  I think WD has abandoned the Green series and has increased the head parking timeouts significantly in all of their drives.  So anyone who has had the great opportunity of owning "Green" WD drives has probably run into this.

Anyway, I switched to the "Red" series and haven't had any issues since.  The NAS used to be active 24/7, but I have more recently enabled WOL and I put the NAS to sleep when it is expected to be unused for at least a few hours.  This saves electricity and increases MTBF a lot (as long as it's not done too often, hence the few-hours threshold).
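In case it's useful to anyone, waking the NAS is just a broadcast UDP "magic packet".  A minimal sketch (the MAC and broadcast addresses are placeholders for your own network):

Code: [Select]
import socket

def wake_on_lan(mac, broadcast="192.168.1.255", port=9):
    """Send a WOL magic packet: 6 bytes of 0xFF followed by the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, (broadcast, port))

wake_on_lan("00:11:22:33:44:55")  # placeholder: the NAS's MAC address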
 

Online wraper

  • Supporter
  • ****
  • Posts: 17952
  • Country: lv
Re: Researching practical HDD reliability/solutions...
« Reply #6 on: February 25, 2019, 03:59:03 pm »
Don't be fooled by Backblaze's numbers.
They use a low sample count for certain drive models.  They might spot a manufacturing issue if half of the units fail, but other than that those numbers don't tell you much.
For example, you might say "Seagate bad" since those are the only ones that failed a lot.  But those are also the ones they bought in >20k lots.
FWIW Seagate made a shitload of crappy drives.  They were failing right and left, with whole series being complete trash.  There is even a "CC Fly" meme among Russians (google it).
« Last Edit: February 25, 2019, 11:17:45 pm by wraper »
 

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 17427
  • Country: us
  • DavidH
Re: Researching practical HDD reliability/solutions...
« Reply #7 on: February 25, 2019, 10:32:34 pm »
I have often wondered the same things.  Use shock mounts or not?  Use dedicated forced air cooling or not?

I have not noticed any difference in reliability between:

1. A bunch of loose but cushioned drives sitting on a pad in the bottom of the case.
2. 4 drives solidly screwed into a single set of 3.5" brackets with gaps for air flow and a separately mounted large cooling fan.
3. 2 or 3 drives mounted into a 5.25" to 3.5" drive cage with integrated cooling fan with or without individual shock mounts.
4. Very heavy 2, 3, or 4 drive 5.25" to 3.5" steel RAID cages with removable steel drive caddies and transverse forced air cooling.

On the other hand, my experiences with Seagate drives of about 1TB and larger have been universally bad.  Western Digital became jerks at about the same time by breaking SATA standards and removing features in their desktop drives so I have avoided them since then as well.

The first thing to do is check out Backblaze's hard drive reliability reports.

That is where I start also.  Their data may not be as good as it could be but it is the best available.  In practice, for me that has meant HGST drives since my last batches of Seagate and Western Digital.

« Last Edit: February 25, 2019, 10:50:10 pm by David Hess »
 

Offline Silver_Pharaoh

  • Contributor
  • Posts: 32
  • Country: ca
Re: Researching practical HDD reliability/solutions...
« Reply #8 on: February 26, 2019, 12:02:27 am »
I've only ever used WD and, more recently, HGST since WD now owns them.
The Red series is nice.  I had a few "Green" drives and one started to fail after 2 years.  I bought a 1TB "Blue" and it's been fine for about 2 years now.

I don't have a NAS, but that just proves the "Green" series isn't all that good.
On the performance computing forum I'm a part of, a lot of members (some of whom work at data centers) like HGST.  Most don't really like Seagate anymore.

I've never had any rubber mounts or anything like that for my drives so I'm interested in what you guys find :)
 

Offline texaspyro

  • Super Contributor
  • ***
  • Posts: 1407
Re: Researching practical HDD reliability/solutions...
« Reply #9 on: February 26, 2019, 12:17:05 am »
The issue of spinning down drives to increase longevity is full of controversy.  Spinning a drive up/down tends to be a major point of stress and failure.  I try to leave my spinning rust spinning all the time.

I am familiar with a video editing / archive system where one could configure the idle drives to spin down or to keep the drives spinning at all times.   One video farm had around half the (1000+) drives configured each way.  The only drive failures were in the ones configured to spin down.
 
 

Offline rdl

  • Super Contributor
  • ***
  • Posts: 3667
  • Country: us
Re: Researching practical HDD reliability/solutions...
« Reply #10 on: February 26, 2019, 12:39:06 am »
The issue of spinning down drives to increase longevity is full of controversy.  Spinning a drive up/down tends to be a major point of stress and failure.  I try to leave my spinning rust spinning all the time.

I am familiar with a video editing / archive system where one could configure the idle drives to spin down or to keep the drives spinning at all times.   One video farm had around half the (1000+) drives configured each way.  The only drive failures were in the ones configured to spin down.

It depends. If the spin down time is too short and the drives are used frequently, then it probably could lead to early failure. If the drives are not used that much and set to spin down in 30 minutes or more then they could last for years. Many of my drives are used primarily as archives and it's entirely possible that some may spin up only once a week, others only a few times a day. The ones that see the most use are set to spin down in 1 hour and that may happen only overnight. I can't say that I've noticed any difference in longevity, but I have only a few drives.
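On Linux the spindown timeout can be set with hdparm -S.  The value encoding is a bit odd, so here is a small helper (the encoding is from memory of the man page, so double-check it; /dev/sdb is a placeholder):

Code: [Select]
import subprocess

def hdparm_spindown_value(minutes):
    """Convert a spindown timeout in minutes to an hdparm -S value.
    As I recall: values 1-240 are multiples of 5 seconds (up to 20 min),
    241-251 are multiples of 30 minutes (up to 5.5 h)."""
    if minutes <= 20:
        return max(1, round(minutes * 60 / 5))
    if minutes <= 330:
        return 240 + round(minutes / 30)
    raise ValueError("timeout too long for the -S encoding")

def set_spindown(device, minutes):
    subprocess.run(["hdparm", "-S", str(hdparm_spindown_value(minutes)), device],
                   check=True)

set_spindown("/dev/sdb", 60)  # placeholder device, with the 1-hour timeout mentioned above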
 

Offline MyHeadHzTopic starter

  • Regular Contributor
  • *
  • Posts: 165
  • Country: us
Re: Researching practical HDD reliability/solutions...
« Reply #11 on: March 10, 2019, 10:32:26 am »
(TL;DR: It worked for recovering data off a failed drive, and (IMO) will likely greatly increase HDD longevity in general.)


No L-brackets were practical for this application (for a few reasons), and neither were the basic drive expansion trays I bought to try.  Ultimately, I took the modular HDD bracket out of an old computer case.  This particular bracket allowed me to drill holes and mount it to the brick, AND (crucially) lets me mount/unmount drives in the bay while the bracket is still attached to the brick.  The brick is a standard landscaping brick and weighs just over 20 lb (~9 kg) by itself.  The research papers suggest heavier masses, but those are a lot less practical.  I guessed that 20 lb would be far enough into the diminishing-returns part of the curve to be fine, especially considering that the heaviest standalone commercial enclosure I saw (a 5-bay NAS) was only about 7 lbs total with drives installed.





FYI, I added the old IDE drive in the top slot just for dampening mass- this should reduce the amplitude of vibrations from the top bracket.

Though my goal is to increase drive longevity, data recovery seems like a great way to test the setup.  One of my recently failed external drives, with a rubber-mounted HDD, had irreplaceable data on it.  That drive would sometimes mount for a few seconds, but would quickly fail and disappear.  This happened independent of factors such as OS.  The research papers claim that vibration eventually leads to off-track read/write errors due to wear and tear on the drive (it loses the ability to properly position the r/w head).  With that in mind, I hypothesized that solidly coupling that failing drive to a large mass would allow it to function.  A failure would not necessarily mean anything, though.

I mounted that drive as shown above and it has performed flawlessly through about 100GB transferred so far.  All the irreplaceable data was recoverable (yay!).  I will continue to transfer the rest of the drive contents to see how it holds up (which will take about 2-3 days).  If that works well, I will try to find some "torture tests" to run on the drive to see how well it handles them.  If anyone has any suggestions, please let me know.
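For anyone curious, this is the kind of simple read-only "torture test" I had in mind: sequentially read the whole raw device in chunks and log anything slow or unreadable.  The device path and threshold are placeholders, and it needs root to open the block device:

Code: [Select]
import time

CHUNK = 8 * 1024 * 1024  # read in 8 MiB chunks

def read_scan(device, slow_threshold_s=1.0):
    """Sequentially read an entire block device, logging slow or failed chunks."""
    offset, slow, errors = 0, 0, 0
    with open(device, "rb", buffering=0) as f:
        while True:
            t0 = time.monotonic()
            try:
                data = f.read(CHUNK)
            except OSError:
                errors += 1
                offset += CHUNK
                f.seek(offset)      # skip past the unreadable region and keep going
                continue
            if not data:
                break               # end of device
            dt = time.monotonic() - t0
            if dt > slow_threshold_s:
                slow += 1
                print(f"slow read at {offset / 1e9:.1f} GB: {dt:.1f} s")
            offset += len(data)
    print(f"done: {offset / 1e9:.1f} GB read, {slow} slow chunks, {errors} read errors")

read_scan("/dev/sdX")  # placeholder: device node of the drive under test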

Adding the mass seems to help a lot, and would likely add a great deal (logarithmic increase?) to the lifespan of any HDD.  It could also be a cheap and accessible tool for data recovery from failed drives (at least for this most common failure mode), instead of riskier and more involved head/platter swaps.

It is also cheap- only a few dollars.  I used a standard drill (not a hammer drill) with a 5/32x4-1/2 carbide masonry bit and 3/16 x 1/4  masonry screws (both Tapcon "red").

I have a few more failed and failing drives, including SMART failures and drives that just won't mount.  With one SMART-failed drive (which was in a dual-bay enclosure), I plan to mount it to the brick, then do a full scan of the drive surface.  I suspect that the drive will not add any more remapped sectors, and that many of the previously remapped sectors will be returned to normal status.  That would strongly support the idea that the vibration from the dual-bay enclosure was the cause of the SMART errors, and that the drive itself is actually fine.  If anyone is interested, I can post those results once I have them.
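When I run that test, I'll snapshot the defect-related SMART attributes before and after the surface scan with something like the sketch below (same smartctl assumption as before; attribute names can differ by vendor).  I'll pay particular attention to Current_Pending_Sector, since from what I've read that is the attribute that can actually clear, while Reallocated_Sector_Ct normally only goes up.

Code: [Select]
import subprocess, sys

# Defect-related attributes for the before/after comparison.
ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def snapshot(device):
    """Grab the raw values of a few defect-related SMART attributes."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    values = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in ATTRS:
            values[fields[1]] = fields[9]  # RAW_VALUE column
    return values

# Run once before the full surface scan and once after, then compare the output.
print(snapshot(sys.argv[1] if len(sys.argv) > 1 else "/dev/sdX"))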

PS: I am aware of the tape mod, but decided not to do it to reduce handling the drive.
« Last Edit: March 10, 2019, 10:41:48 am by MyHeadHz »
 
The following users thanked this post: magic

Offline magic

  • Super Contributor
  • ***
  • Posts: 7453
  • Country: pl
Re: Researching practical HDD reliability/solutions...
« Reply #12 on: March 10, 2019, 02:09:16 pm »
Harsh and nasty concrete block on a wooden shelf... :scared:

Anyway, that's a great experiment and a damn impressive result.
After you are done with recovery, it would be nice to try it again a few times in the same setup with and without the brick, to see if it really is the brick causing it and not some random fluke, or the particular USB adapter, PSU, whatever.

And by the way, you could reduce the time to half a day by connecting it to a native SATA port.  On my machine, I have a connected SATA cable and SATA power cable hidden behind one of the 5.25" bay panels; I just remove the panel and hotplug a disk when I need it.
 
The following users thanked this post: MyHeadHz

Offline coromonadalix

  • Super Contributor
  • ***
  • Posts: 6998
  • Country: ca
Re: Researching practical HDD reliability/solutions...
« Reply #13 on: March 10, 2019, 03:18:39 pm »
I have 12 drives in my home server.  I use the SMART function with software that reads the disks' state each week; when a drive is near 75% of its life expectancy, I simply swap it for a new one.
I don't even use RAID functions.  My only real problems were some drive failures related to firmware: the Western Digital time-out issue in the past, which was corrected with a DOS utility, and the infamous Seagate ones too.  Luckily they were backed up just in time, since I saw read/write error rates spiking.  I updated the firmware and ran some tests; after the firmware update you could not read back the drive contents, but they became reliable again outside the server once reformatted/repartitioned.

For now I use Western Digital Green drives; I need the capacity, not the access speed.  Room temperature is always at 22-23 degrees.

I don't use any drive dampening stuff; I just have semi-hard rubber feet on my case (an Antec Twelve Hundred, and it's heavy, around 75-90 pounds).  I use hot-swap cages for all the drives; each cage came with a fan for its 4 drives (true dual ball bearing).  The fans are cleaned each month, and when they fail (RPM decreasing or stalling) I have RPM monitoring so I will hear a loud beep warning.

For the case footing dampening, guess what... lol

Hockey pucks.

Oh, I did find out with an older server that powering it up and down very often was more damaging than leaving it always on.
« Last Edit: March 10, 2019, 03:24:42 pm by coromonadalix »
 

Online wraper

  • Supporter
  • ****
  • Posts: 17952
  • Country: lv
Re: Researching practical HDD reliability/solutions...
« Reply #14 on: March 10, 2019, 05:06:45 pm »
when a drive is near 75% of its life expectancy, I simply swap it for a new one.
What's that?  There is no such SMART parameter for HDDs.  Either a drive works fine, or reallocated sectors start to occur, which means it is no longer reliable.  I guess it's some sort of voodoo figure some stupid app shows for noobs.
Quote
they became reliable again outside the server once reformatted/repartitioned.
:palm:
« Last Edit: March 10, 2019, 05:09:23 pm by wraper »
 

Offline radar_macgyver

  • Frequent Contributor
  • **
  • Posts: 748
  • Country: us
Re: Researching practical HDD reliability/solutions...
« Reply #15 on: March 10, 2019, 05:26:46 pm »
While a lot of attention is given to vibration and isolation thereof, temperature can have an adverse effect on drives too.  Home enclosures often sacrifice airflow (and hence allow drives to get warmer) to reduce fan noise.  I have seen drives that were exposed to higher temperatures fail faster (sample size was ~8 drives).  I expect the reason is that the bearings dry out faster at higher temps.  The failure mode for all the drives was that they simply stopped responding on the SATA bus.  These were run 24/7, and are 'enterprise grade' nearline SATA models (ST2000NM0033).
 

