Author Topic: Another SSD bites the dust  (Read 2905 times)


Online magic (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7045
  • Country: pl
Another SSD bites the dust
« on: October 05, 2020, 07:08:42 am »
I woke up today and saw a bunch of disk write failures in the kernel log on one machine. Linux ended up resetting the drive and it came back, but everything reads as zeros.

Not even some junk brand, but a pretty pricey Intel from 8 years ago. SMART no longer works either (of course, why not :palm:), but the last run that I have saved showed some 4.5TB total writes at 600GB capacity and everything at 100/100.
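
For scale, the write figures above can be turned into a count of full drive writes. A minimal sketch, assuming decimal units (1 TB = 1000 GB); the variable names are made up for illustration:

```python
# Figures quoted in the post; decimal units (1 TB = 1000 GB) are an assumption.
total_writes_gb = 4.5 * 1000   # "some 4.5TB total writes"
capacity_gb = 600              # "600GB capacity"

# How many times the whole drive has been written over, on average.
full_drive_writes = total_writes_gb / capacity_gb
print(f"{full_drive_writes:.1f} full drive writes")  # 7.5 full drive writes
```

That is only a handful of complete overwrites, which fits with the wear indicators still reading 100/100.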

I think I'm starting to appreciate the failure modes of spinning rust. I have had quite a bunch of problems with those, but none has ever gone out completely. Will have to see if secure erase brings this crap back to operation.

Maybe the proper way is to buy two SSDs and run them in RAID1 ::)
« Last Edit: October 05, 2020, 10:50:07 am by magic »
 

Online magic (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7045
  • Country: pl
Re: Another SSD bites the dust
« Reply #1 on: October 05, 2020, 10:47:54 am »
Secure erase seems to have reset it to a working state. Half of SMART data are at zero, though.
Maybe I should put it up on an auction site as "almost unused" :-DD

I'm losing a lot of respect for those things today.
 

Offline Wuerstchenhund

  • Super Contributor
  • ***
  • Posts: 3088
  • Country: gb
  • Able to drop by occasionally only
Re: Another SSD bites the dust
« Reply #2 on: October 05, 2020, 11:11:03 am »
That mirrors my experience with consumer grade SSDs (and that includes the widely hyped Samsung EVO drives).

For my own use I have moved to enterprise class SSDs, which have higher endurance and in general seem to be a lot more reliable. Haven't regretted it since.

Having said that, have you checked if your drive has the latest firmware?
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23059
  • Country: gb
Re: Another SSD bites the dust
« Reply #3 on: October 05, 2020, 11:18:19 am »
Nothing wrong with consumer SSDs. Just make sure you have backups!

Incidentally out of a sample size of about 450, we haven't had a single Samsung Evo or Pro failure.
 

Offline BradC

  • Super Contributor
  • ***
  • Posts: 2109
  • Country: au
Re: Another SSD bites the dust
« Reply #4 on: October 05, 2020, 12:07:08 pm »
Quote
Incidentally out of a sample size of about 450, we haven't had a single Samsung Evo or Pro failure.

You must have skipped the 840 Evo "amnesiac edition". Don't get me wrong, I love Samsung SSDs, but that was an absolute clanger.
 
The following users thanked this post: Ed.Kloonk

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23059
  • Country: gb
Re: Another SSD bites the dust
« Reply #5 on: October 05, 2020, 12:08:38 pm »
Ah yeah. These are all 850's or later. The 830s were problematic.

My main one is a 970 Evo Plus. My laptop has a random SK hynix one in it (Lenovo's default now apparently :( )
« Last Edit: October 05, 2020, 12:11:05 pm by bd139 »
 

Offline BradC

  • Super Contributor
  • ***
  • Posts: 2109
  • Country: au
Re: Another SSD bites the dust
« Reply #6 on: October 05, 2020, 12:27:06 pm »
Quote
Ah yeah. These are all 850's or later. The 830s were problematic.

I have 3 830s in service. They've lost a yard or two of pace over the years, but they're still cranking along.

I bought the 3 Samsung 830 and 3 Intel 330 together. The Intels are still in production. The 830's have been moved to non-critical machines.
Code:
/dev/sdd - INTEL SSDSC2CT240A3 - 6 years 217 days 3 hours
/dev/sdq - INTEL SSDSC2CT240A3 - 6 years 218 days 12 hours
/dev/sdc - INTEL SSDSC2CT240A3 - 6 years 244 days 1 hours

This is my favourite spinner though :
Code:
/dev/sda - Hitachi HTS542580K9SA00 - 11 years 197 days 18 hours

Model Family:     Hitachi Travelstar 5K250
  9 Power_On_Hours          0x0012   001   001   000    Old_age   Always       -       101106


Just keeps trucking.
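
The age string in the listing above lines up with the raw Power_On_Hours value. A small sketch of the conversion, assuming 365-day years (which is what reproduces the figure shown); `format_power_on_hours` is a hypothetical helper, not part of any SMART tool:

```python
def format_power_on_hours(hours: int) -> str:
    """Split a raw hour count into years, days and hours (365-day years)."""
    days, rem_hours = divmod(hours, 24)
    years, rem_days = divmod(days, 365)
    return f"{years} years {rem_days} days {rem_hours} hours"

# Raw value of SMART attribute 9 from the Hitachi listing
print(format_power_on_hours(101106))  # 11 years 197 days 18 hours
```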
 

Online magic (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7045
  • Country: pl
Re: Another SSD bites the dust
« Reply #7 on: October 05, 2020, 12:31:16 pm »
Yeah, latest (i.e. the last :)) firmware revision. Earlier firmwares were known to lose data on unclean shutdowns so it was the first thing I checked after receiving the drive.

It showed the same "BAD_CTX 00000115" serial number that these units report after an unclean shutdown, and I assume it just means corruption of internal data structures (flash translation layer, whatever). What disappointed me is that it apparently happened due to uncorrectable errors during normal operation. Either I was unlucky enough to lose the root of the FTL tree, or the drive simply detected some problem somewhere and stopped working altogether instead of allowing me to recover whatever was still recoverable. I suspect the latter and I'm not happy about it in the slightest.

I have copies of most stuff, but the OS will probably need to be installed from scratch |O
 

Offline george.b

  • Frequent Contributor
  • **
  • Posts: 383
  • Country: br
Re: Another SSD bites the dust
« Reply #8 on: October 05, 2020, 12:45:01 pm »
A single data point doesn't mean much, but my OCZ Agility 4 has been happily chugging along for over five and a half years of power-on time and 49.24TB lifetime writes. SMART says it is at 86% remaining drive life, whatever that means; reallocated sector count is zero.
 

Offline Ranayna

  • Frequent Contributor
  • **
  • Posts: 896
  • Country: de
Re: Another SSD bites the dust
« Reply #9 on: October 05, 2020, 04:31:02 pm »
Careful with "enterprise class" SSDs ;)
At least HP and DELL both had cases where some internal counter of the SSD would overflow at something like 30,000 power-on hours. Poof, that SSD is just a brick now, good luck recovering your data. I do not recall the OEM of the drives; at least one was made by LiteOn.

As with any storage medium: backup is king.
 
The following users thanked this post: bd139

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 5859
  • Country: au
Re: Another SSD bites the dust
« Reply #10 on: October 06, 2020, 03:04:39 am »
I do remember a firmware update for Intel SSDs a while ago. There was an issue with some of the consumer drives bricking.

Quote
Careful with "enterprise class" SSDs ;)
At least HP and DELL both had cases where some internal counter of the SSD would overflow at something like 30,000 power-on hours. Poof, that SSD is just a brick now, good luck recovering your data. I do not recall the OEM of the drives; at least one was made by LiteOn.

Probably Intel. They released a firmware update for these drives as well: https://downloadcenter.intel.com/download/28673/SSD-S4510-S4610-2-5-non-searchable-firmware-links/
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11701
  • Country: my
  • reassessing directives...
Re: Another SSD bites the dust
« Reply #11 on: October 06, 2020, 03:44:48 am »
Never put important data on an SSD. SSDs are only for the operating system and applications; when corrupted or damaged, they are easily reformatted. Data goes on a second drive, a multi-TB HDD. Make a backup every 1 or 2 years... my 2 cents.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline Monkeh

  • Super Contributor
  • ***
  • Posts: 8042
  • Country: gb
Re: Another SSD bites the dust
« Reply #12 on: October 06, 2020, 03:56:42 am »
It's so cute that you think an HDD is a better place to keep your data.
 
The following users thanked this post: kripton2035, Wuerstchenhund

Offline BravoV

  • Super Contributor
  • ***
  • Posts: 7549
  • Country: 00
  • +++ ATH1
Re: Another SSD bites the dust
« Reply #13 on: October 06, 2020, 04:02:36 am »
On my main desktop, the primary (soon to be secondary ::)) boot drive is an NVMe SSD running W10, the second boot drive is an ordinary SATA SSD loaded with Linux Mint, and local storage is spinning rust, 2 x 8TB in RAID 1.

Important data is never stored on an SSD, only on the local RAID 1 volume, and this also gets duplicated incrementally to my NAS (RAID 6). For backup, another set of HDs (a duplicate set) serves as offline/air-gapped backup to cover the NAS and also for archival.

The only exception is data that is not considered super important, like VMs, which is stored on a third SATA SSD for speed, as spinning disks are too slow; this also gets backed up incrementally to the local RAID 1 volume.

The primary NVMe OS drive gets imaged routinely to the RAID 1 volume, and should the NVMe get toasted, all I need to do is buy a new one and restore the saved drive image. That takes just 10 minutes and restores gracefully, and this is tested annually.

Offline james_s

  • Super Contributor
  • ***
  • Posts: 21611
  • Country: us
Re: Another SSD bites the dust
« Reply #14 on: October 06, 2020, 04:39:16 am »
I had someone bring me a PC that just stopped booting one day. I discovered I could access the SSD and read it, but it had become read-only. Format, delete partition, anything I tried would appear to succeed, but then the contents remained unchanged. I guess as far as failures go that's not a bad way to go.
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11701
  • Country: my
  • reassessing directives...
Re: Another SSD bites the dust
« Reply #15 on: October 06, 2020, 05:02:10 am »
Quote
It's so cute that you think an HDD is a better place to keep your data.
the key is proven and established statistics over many decades... but still, as said, you need to back up/change the drive every few years (statistics say 5-10 years if you don't knock it for no reason). my intel SSD is past 7 years i think; i haven't checked, but the kids are not complaining so i guess it's still working. so SSDs are gaining trust; if i can have a 1TB SSD or NVMe at $100+ i will buy one anytime and store my data on it. don't worry, i'll buy another next year for backup. another more proven way, for thousands of years of storage, is animal skin, suit yourself. another unproven way, but claimed to store beyond human extinction, is that archival place at the north pole storing bad-joke programming code. ymmv and have a nice day.
« Last Edit: October 06, 2020, 05:07:20 am by Mechatrommer »
 

Offline Monkeh

  • Super Contributor
  • ***
  • Posts: 8042
  • Country: gb
Re: Another SSD bites the dust
« Reply #16 on: October 06, 2020, 05:06:19 am »
Quote
It's so cute that you think an HDD is a better place to keep your data.
the key is proven and established statistics for many decades... but still as said, need backup/change drive every few years (statistics says 5-10 years if you dont knock it for no reason) another more proven way for thousands of years storage is animal skin, suit yourself. another unproven way but claimed can store beyond human extinction is that archival place in north pole storing bad jokes programming codes. ymmv and have a nice days.

I've clearly run a few more drives than you - one drive can never be trusted, no matter what type it is. HDD failures occur far more than you think, you just don't notice them.
 

Offline BravoV

  • Super Contributor
  • ***
  • Posts: 7549
  • Country: 00
  • +++ ATH1
Re: Another SSD bites the dust
« Reply #17 on: October 06, 2020, 05:13:53 am »
Quote
I've clearly run a few more drives than you - one drive can never be trusted, no matter what type it is. HDD failures occur far more than you think, you just don't notice them.

+1, I've experienced this myself: my wife's HD suffered bit rot without any error message or warning. Several of her "highly sentimental JPG files" ::) got hit last time, hence she now has RAID 10 on her desktop, as she has tons of photos & videos from her cellphone. :-[

On my archival (not backup) drives, which are mirrored too, I add another layer by PAR-ing the data at 50%.

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 5859
  • Country: au
Re: Another SSD bites the dust
« Reply #18 on: October 06, 2020, 05:17:38 am »
Quote
+1 , I've experienced this my self, wife's HD drive experienced bit rot without any error message nor warning. Several of her "highly sentimental JPG files"  ::) got hits last time, hence she is now have raid 10 at her desktop as she has tons of photos & videos from her cellphone.  :-[

At my archival (not backup) drives, which are mirrored too, I do another layer by PAR-ring the data at 50%.

Most RAID set ups won't protect you from bit rot. If a file becomes corrupt, that corrupt file will just be replicated across the array.

To prevent that, you'll want to use a more robust file system like ZFS that detects and corrects that kind of corruption.
 

Offline BravoV

  • Super Contributor
  • ***
  • Posts: 7549
  • Country: 00
  • +++ ATH1
Re: Another SSD bites the dust
« Reply #19 on: October 06, 2020, 05:20:03 am »
Quote
Most RAID set ups won't protect you from bit rot. If a file becomes corrupt, that corrupt file will just be replicated across the array.

To prevent that, you'll want to use a more robust file system like ZFS that detects and corrects that kind of corruption.

Yeah, I'm aware of that, was backing Monkeh's post, that single HD as backup (even offlined), is just too risky.

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11701
  • Country: my
  • reassessing directives...
Re: Another SSD bites the dust
« Reply #20 on: October 06, 2020, 05:22:11 am »
Quote
I've clearly run a few more drives than you..
good for you.. practically speaking, it has worked for me over my few (2 or 3 maybe) HDD generations. if it doesn't for you, i'm sorry; you can use RAID or mirrored drives or whatever fancy options there are, there are always better options.. or maybe just a simple Seagate drive (my trusted brand). hunglow brands are highly unrecommended, some of them have a paperweight inside, big no no.

btw i could write a program for data matching, or maybe even build magnetic strength detector HW to see field differences or collapse over a few years, or an automated checksum maker, but... i don't feel like it because it's just practically working for me. i get corrupted pictures or videos from time to time, sometimes even ones i downloaded a few weeks ago (bad H264 encoding i guess), but it's not that i can't see them at all, and i can always redownload the same or better entertainment. but so far i haven't seen my source code come out as jumbled ascii/hex garbage, or corrupted hobby project files, and those are the #1 priority data... so that's good enough for me. ymmv and be safe... your data ;)

Quote
+1, I've experienced this myself: my wife's HD suffered bit rot without any error message or warning. Several of her "highly sentimental JPG files" ::)
typical, top of the list of storage problems: the SWMBO. when asked "what did you do?" "i didn't do anything.." "did you knock it?" "yeah, but just a little bit, it should be no problem!" :palm: now she has the biggest affordable SP storage in town, not many complaints so far except the super slow Win10 i formatted a few weeks ago.
« Last Edit: October 06, 2020, 05:29:27 am by Mechatrommer »
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 5859
  • Country: au
Re: Another SSD bites the dust
« Reply #21 on: October 06, 2020, 05:23:21 am »
Quote
Yeah, I'm aware of that, was backing Monkeh's post, that single HD as backup (even offlined), is just too risky.

Oh yes, absolutely.

A friend of mine once asked me if I could have a look at a drive for him and attempt to recover data. He returned home one day and a metal-on-metal scraping noise was coming out of his (still spinning) disk. I popped the lid off and the head had made contact with the platter and cut some nice grooves into it. The entire inside of the drive was covered in what used to be the magnetic surface of the disk platters. Needless to say, I threw it in the bin and gave him the "backup talk".
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23059
  • Country: gb
Re: Another SSD bites the dust
« Reply #22 on: October 06, 2020, 07:09:00 am »
Quote
Most RAID set ups won't protect you from bit rot. If a file becomes corrupt, that corrupt file will just be replicated across the array.

To prevent that, you'll want to use a more robust file system like ZFS that detects and corrects that kind of corruption.

Just a heads up on that. ZFS won’t protect you from corruption. It’ll try to and tell you when it can’t. It’s far from perfect. 3PAR got the job in the end after that evaluation cycle.

But as mentioned planning for corruption is what you really need to do.

The best strategy I’ve found for personal data is offline validation. I’m using a semi-manual tool for that, “Beyond Compare”: periodically, before performing a backup, I let it checksum both source and destination directories. I keep two completely segregated offline snapshots of everything I have, one encrypted and on my person and one unencrypted and stored securely at home. The rest floats around between my desktop and laptop on OneDrive, which has file history. That covers point-in-time recovery, complete disaster and carnage too.
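
The same kind of offline validation can be scripted. A minimal sketch, assuming SHA-256 digests and two directory trees that are meant to be identical; this is not how Beyond Compare works internally, and `tree_digests` / `compare_trees` are hypothetical helper names:

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large files do not fill RAM.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def compare_trees(source: Path, backup: Path) -> dict:
    """Report files that are missing, extra, or different in the backup."""
    src, dst = tree_digests(source), tree_digests(backup)
    return {
        "missing_in_backup": sorted(src.keys() - dst.keys()),
        "extra_in_backup": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

Calling `compare_trees(Path("/data"), Path("/mnt/backup/data"))` (both paths are placeholders) returns three lists; anything under `mismatched` is a file whose content differs between source and backup and deserves a look before the older copy is overwritten.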
 

Offline Wuerstchenhund

  • Super Contributor
  • ***
  • Posts: 3088
  • Country: gb
  • Able to drop by occasionally only
Re: Another SSD bites the dust
« Reply #23 on: October 06, 2020, 07:11:12 am »
Quote
Careful with "enterprise class" SSDs ;)
At least HP and DELL both had cases where some internal counter of the SSD would overflow at something like 30,000 power-on hours. Poof, that SSD is just a brick now, good luck recovering your data.

Yes, this one:

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us

I agree that this was a bummer, but it made the headlines because it's also exceptionally rare (Intel had a similar issue once, but as far as I remember that was in their lower tier drives). In addition, these are server drives and usually operate in some kind of redundancy setup (RAID), so unless they all reach 32768 hrs at exactly the same time this is unlikely to bring down your storage.
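
The 32768 hour figure is exactly 2^15, the point where a signed 16-bit counter runs out of range. A minimal sketch of that kind of wraparound, assuming the power-on hours were held in a signed 16-bit integer (an inference from the number itself, not something confirmed in this thread); `next_hour` is a made-up helper:

```python
import ctypes


def next_hour(counter: int) -> int:
    """Increment an hour counter that is stored as a signed 16-bit integer.

    ctypes.c_int16 does no overflow checking, so the value wraps like a
    16-bit two's-complement field would.
    """
    return ctypes.c_int16(counter + 1).value


print(next_hour(32766))  # 32767, the largest value that still fits
print(next_hour(32767))  # -32768: hour 32768 wraps to a negative number

# 32768 hours expressed as calendar time (365-day years)
days, hours = divmod(32768, 24)
years, days = divmod(days, 365)
print(years, "years", days, "days", hours, "hours")  # 3 years 270 days 8 hours
```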

What it does highlight, though, is how important firmware updates are for SSDs (much more so than for spinning rust), and that it's important to keep your SSD firmware at the latest level.

Which, of course, requires that there actually are firmware updates.

Quote
I do not recall the OEM of the drives, at least one was made by LiteOn.

LiteOn doesn't make enterprise drives (thank god!). The drives were all made by SanDisk; in fact these were all SanDisk Lightning Ascend SAS drives. Which even without the bug weren't great, but when you buy OEM drives for your server it doesn't really matter who makes them, as the drives are covered by the server's support contract anyways.

For individual purchase (i.e. to use at home) OEM drives, especially those from Dell, are a pretty bad choice as the firmware can only be updated with Dell tools on a suitable Dell system (server). It's better to stick with the original brands (Seagate, WDC, Micron, Samsung) which provide firmware updates and tools which work on every system.

Quote
As with any storage medium: backup is king.

Indeed.
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23059
  • Country: gb
Re: Another SSD bites the dust
« Reply #24 on: October 06, 2020, 07:23:43 am »
I find it’s best to just buy everything from HPE. Then you only have one vendor to blame who generally can’t wriggle out of it by blaming another one. Supermicro were good at that when we had storage problems.  They blamed it on the drives. So we bought new ones and still had the issue. That was an old 1U SATA RAID box so it put me off their higher end offerings.
 

Offline MarkF

  • Super Contributor
  • ***
  • Posts: 2627
  • Country: us
Re: Another SSD bites the dust
« Reply #25 on: October 06, 2020, 07:38:37 am »
  Back in the early '80s, we were told that our memory failures were caused by the software execution loops being too short.

  Maybe you are accessing your files too often?   :-//     :-DD

  Or your files are just too small?
 

Offline Wuerstchenhund

  • Super Contributor
  • ***
  • Posts: 3088
  • Country: gb
  • Able to drop by occasionally only
Re: Another SSD bites the dust
« Reply #26 on: October 06, 2020, 08:34:08 am »
Quote
Most RAID set ups won't protect you from bit rot. If a file becomes corrupt, that corrupt file will just be replicated across the array.

To prevent that, you'll want to use a more robust file system like ZFS that detects and corrects that kind of corruption.

It's not so simple.

First of all, "bit rot" (soft errors, a flipped data bit) in hard drives has been *vastly* overhyped (mostly out of ignorance).

And ZFS, too, has been overhyped, again usually because of ignorance. Yes, it's a robust file system, and it has built-in checksumming so it can detect (and correct, if there's a second copy of the data) flipped data bits.
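
The detect-and-correct behaviour described above can be sketched in a few lines. This is a toy model, not ZFS code: it assumes SHA-256 as the checksum, keeps the "disks" as in-memory byte arrays, and `read_with_repair` is a made-up name:

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum stored separately from the data block itself."""
    return hashlib.sha256(block).hexdigest()


def read_with_repair(copies: list, expected: str) -> bytes:
    """Return a copy that matches the stored checksum and heal the others.

    Raises OSError when no copy matches: the corruption is detected,
    but without a good copy it cannot be corrected.
    """
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        raise OSError("all copies fail the checksum; cannot repair")
    for c in copies:
        if bytes(c) != good:
            c[:] = good  # overwrite the bad copy with the verified one
    return good


data = b"highly sentimental JPG bytes"
stored = checksum(data)
mirror = [bytearray(data), bytearray(data)]
mirror[0][3] ^= 0x01  # flip a single bit in the first copy

print(read_with_repair(mirror, stored) == data)  # True
print(bytes(mirror[0]) == data)                  # True, bad copy was rewritten
```

With only one copy the same read would raise instead of repairing, which is the "detect but not correct" case.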

The thing that's usually ignored is that in a modern PC almost everything is protected by some form of ECC, and this includes data that goes across the SATA cable and also the data that is on your cheap SATA hard drive. If there's a flipped bit, it will be detected, corrected and reported by the hard drive's ECC correction (shown in SMART as 'recoverable error'). That's the first layer.

In a common server, there's also a hardware RAID controller, which, if configured in a redundancy setup, performs regular patrol scans ("scrubbing") which will find and correct "flipped" data bits if there are any. That's the second layer.

In reality, both layers do such a great job of maintaining the integrity of the logical data presented to the file system that file system checksumming isn't needed, as real "bit rot" would be caught at a lower level already.

However, there's a caveat:

In a regular PC, the RAM usually isn't ECC protected. And while soft errors because of cosmic radiation are usually rare, there also are other factors (like overclocking, voluntary or involuntary) which can cause sporadic memory errors. Considering that all modern OSes use unused RAM as data cache, it's much more likely that reports of "bit rot" have actually been caused by memory soft errors. The fact that there seem to be no (reliable!) reports of "bit rot" occurrence in systems with ECC RAM supports that.

Now as to ZFS, it's worth remembering why it exists, and to understand that it's necessary to go back in history. Back in the old days, when there was still a Sun Microsystems building SPARC-based servers running Solaris, Sun of course also offered disk storage based on large arrays full of SCSI disks, which normally contained some form of hardware RAID controller. Multiple arrays were then combined with an expensive 3rd party product (Veritas Volume Manager, which was also used by other UNIX vendors like HP) to create RAID setups across storage racks. Eventually the shared SCSI bus became a bottleneck, and Sun was looking for a standard with more room to grow, so they moved to Fibre Channel disks. However, unlike SCSI, for which there were several standard RAID solutions, there were none for FC. In addition, Sun wanted to get rid of Veritas VM, so they developed their own product: a file system which can also manage large arrays of FC disks. This became ZFS. Since ZFS had to incorporate functionality which was normally provided by a RAID controller, it also had to have some provision to check data integrity (which a RAID controller does). Which is why ZFS has checksumming.

In short, ZFS was developed for one purpose, which is to build large scale RAID setups across spinning rust without the need for a hardware RAID controller. However, ZFS can only ensure data integrity if the underlying hardware itself is ECC protected - and that includes RAM (which ZFS uses a lot of!).

That means ZFS is great for a storage system running multiple hard drives in a RAID setup without a RAID controller, as long as the system has plenty of ECC RAM. If not, the checksum function becomes worthless, as data integrity can no longer be guaranteed.

For everything else, ZFS sucks. It's comparatively slow and inflexible (RAIDZ expansion is still in alpha state). If there's a hardware RAID controller or the system uses SSDs instead of spinning rust then ZFS is mostly inferior to pretty much any other modern file system out there.
« Last Edit: October 06, 2020, 12:40:59 pm by Wuerstchenhund »
 
The following users thanked this post: nugglix, bd139

