Author Topic: SSD behaviour - fail then unfail then normal?  (Read 8049 times)


Offline daqqTopic starter

  • Super Contributor
  • ***
  • Posts: 2302
  • Country: sk
    • My site
SSD behaviour - fail then unfail then normal?
« on: October 14, 2014, 07:54:53 pm »
Hi guys,

After a year of not particularly heavy service (standard home PC), a 240GB Intel SSD failed on me - Windows 8 froze up, restarted and tried repairing disk errors, which didn't work (it never booted up again and kept trying to repair itself, without success). After that I booted up Linux, extracted the data and ran some tests - the SMART self-tests reported "Completed with read failure". I then shredded the remaining data (overwrite with random data), readying it for warranty replacement. Out of curiosity, I did one final test with badblocks. It reported no errors. So I ran the SMART self-tests again, and they reported no errors either. So... is the disk magically repaired? Or should I return it anyway?
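For reference, the checks were roughly along these lines, run from a Linux live environment (a sketch only - /dev/sdX is a placeholder for the SSD's device node, you need root, and the shred pass destroys everything on the drive):

    # kick off a long SMART self-test, wait for it to finish, then read the log (smartmontools)
    smartctl -t long /dev/sdX
    smartctl -l selftest /dev/sdX

    # one pass of random data over the whole drive before the RMA (destructive!)
    shred -v -n 1 /dev/sdX

    # read-only surface scan of every block
    badblocks -sv /dev/sdX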

This is the second Intel SSD in two years I've had fail on me... what am I doing wrong? Does anyone else have this kind of experience?

Thanks,

David
Believe it or not, pointy haired people do exist!
+++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1641
  • Country: nl
Re: SSD behaviour - fail then unfail then normal?
« Reply #1 on: October 14, 2014, 08:33:58 pm »
It may be that one sector has gone bad (intermittently) but by completely wiping the drive you have also reset those indications for the OS.

I've had a Samsung 840 drive "fail" on me after 1 month of service with a very similar problem to the one you've described. I was downloading a file on the OS drive with Firefox, and suddenly it would lock up the whole system. No blue screen.
After the fourth time, and two other internet browsers later, I was fed up and looked at the SMART data. It showed that the drive had a non-zero uncorrectable error count.
I understand that to mean the drive had written or read some data that the ECC could no longer fix (per SMART), and for some reason the Samsung firmware would just give up and stop doing anything. HDTune also showed 2 bad blocks, so I RMA'd the drive.
Personally I don't understand why the drive can't mark those sectors as bad and shrink the usable capacity. Also, this drive is 250GB, so I'm pretty sure the other 6GB is over-provisioning. It's more likely these errors only developed when reading back the data, after the written data was out of cache memory (which is plausible, because the Xilinx ISE installer I was downloading is a few GB in size).

Before sending it out I also wiped the drive. Out of curiosity I reran HDTune. The bad sectors were gone, which I can only understand to mean the OS no longer knows those sectors have (had) corrupted data on them. I was afraid that might spoil my RMA, but the SMART data is not reset, and it shows the drive developed an uncorrectable error very quickly.

The drive I got back (a new one) has been going strong for 1.5 years now, luckily.
« Last Edit: October 14, 2014, 08:38:23 pm by hans »
 

Offline suicidaleggroll

  • Super Contributor
  • ***
  • Posts: 1453
  • Country: us
Re: SSD behaviour - fail then unfail then normal?
« Reply #2 on: October 14, 2014, 08:41:41 pm »
Over the last 5 years, I've built approximately 25-30 machines, all of which use an SSD for the OS (usually there are one or more spinning disks for bulk storage, but sometimes the SSD is the only drive in the system, i.e. laptops).  Of those 25-30, I'd say somewhere around 15-20 of them use Intel SSDs, about 5 Crucial, plus a Mushkin, an OCZ, and a SanDisk.  The majority of them run Linux, but there are 3-4 Windows systems in there too.  The vast majority run 24/7.  There is the occasional reboot for updates, but other than that they're always running.  Some are big powerful rackmount servers, most are workstations, and a handful are laptops or embedded systems (fit-PC).

In all of that time with all of those machines, I have experienced precisely zero SSD failures (zero failures of any kind for that matter, apart from one LSI RAID card that started complaining, so I RMA'd it "just in case").  The one time I thought a problem might be due to a bad SSD, it turned out to be an incompatibility with the BIOS used on an old Dell laptop.  The worst I've seen is that sometimes the BIOS posts before the SSD is available and you get the "no disk found" error, but a quick Ctrl+Alt+Del brings it right up.

So I'm afraid I can't really help, but I can say that two Intel SSD failures in as many years is far from normal in my experience.
« Last Edit: October 14, 2014, 08:48:24 pm by suicidaleggroll »
 

Offline SirNick

  • Frequent Contributor
  • **
  • Posts: 589
Re: SSD behaviour - fail then unfail then normal?
« Reply #3 on: October 14, 2014, 09:01:31 pm »
I bought one of the Intel G-series (if I remember this correctly) SSD drives back when ~32GB became affordable for casual use.  I had it in a rackmount mobile audio computer.  Used it a few times, then got on-site to a gig, and it wouldn't boot.  Brought it back home and reinstalled the OS, and it worked again for a while, then randomly showed up as an 8MB (that's MEGAbyte) drive to the BIOS.  I replaced it with a spinning disk (which I really hated to do in a mobile rig), and later toyed with it again in a non-critical app.  Worked fine for a bit, then suddenly started showing up as an 8MB volume again.  At that point, I gave up.  I was just outside the warranty period, having sat on it for a long while.  My own fault.

A cheap Patriot 30GB drive worked pretty well for a while, but would randomly pause for maybe 10 seconds on write ops.  Got sick of it and replaced it with a Samsung that is working wonderfully now.  (I've also had really good luck with their 1-2TB platter disks -- sadly they're no longer in the biz.)

So, IMO, SSDs are (or, perhaps, were) rather hit and miss.  I treat them with kid gloves, and don't trust them with any really critical data.  Basically, only as an OS disk.  There's a lot of incentive to shove in as much tech as possible, at the lowest possible price point, due to massive demand and steep competition.  That will lend itself to the expected casualties until it becomes trivial to manufacture these things.
 

Offline gxti

  • Frequent Contributor
  • **
  • Posts: 507
  • Country: us
Re: SSD behaviour - fail then unfail then normal?
« Reply #4 on: October 14, 2014, 09:07:46 pm »
There is a fluid mapping between logical blocks and the underlying physical flash. If a block became unreadable, then overwriting it completely with new data could allow the controller to let go of the bad physical block and remap the logical block to a different one, thus "repairing" it. Magnetic drives have similar behaviour, but there remapping is the exception rather than the norm: everything starts out 1:1, and it's only when something goes wrong that the drive uses spare area to remap.

SSDs have a considerable amount of extra area hidden away that the controller uses while shuffling blocks around. As blocks go bad, more of the spare area gets consumed until eventually it runs out and the drive is dead. In theory it would then just become permanently read-only, not a total loss, but that's just the theory...

Also, you should use the 'secure erase' feature of the drive when preparing to return it, not just overwrite it with random data. Because you can't see the spare blocks, they might still have some of your old data on them. Doing a secure erase might also help with the bad block problem because, like the total overwrite, it tells the flash controller that you don't care about the contents of the bad blocks anymore.
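On Linux that's typically done with hdparm's ATA Security commands - a rough sketch, where /dev/sdX and the throwaway password "p" are placeholders, and the drive must not show up as "frozen" (suspending and resuming the machine sometimes unfreezes it):

    # check the Security section of the identify data ("not frozen", "not locked")
    hdparm -I /dev/sdX

    # set a temporary user password, then issue the secure erase (destroys all data)
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX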

SSD problems tend to be controller bugs or flakiness, not the "decay" people are used to with spinning drives. My oldest SSD sometimes causes random bus resets, and a different one I thought was completely busted until I did a secure erase - now it's been fine for years. Definitely still not a fully mature technology; you have to be an industry expert to know which brands are trustworthy.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: SSD behaviour - fail then unfail then normal?
« Reply #5 on: October 14, 2014, 09:39:53 pm »
... you have to be an industry expert to know which brands are trustworthy.

Reliability is not so much an issue of brand - the same brand can have good models and flaky models on the market at the same time. Every now and then even a top brand releases a disk with bad parts or bad firmware.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline RobertHolcombe

  • Frequent Contributor
  • **
  • Posts: 275
  • Country: au
Re: SSD behaviour - fail then unfail then normal?
« Reply #6 on: October 14, 2014, 10:28:46 pm »
My experience is only a very small sample size. I own three SSDs: 2x OCZ Vertex(?) 60GB that I have had in RAID 0 for 5 years, and an OCZ Vector 240(?)GB for ~3 years, and I have not had any issues whatsoever.
 

Offline SirNick

  • Frequent Contributor
  • **
  • Posts: 589
Re: SSD behaviour - fail then unfail then normal?
« Reply #7 on: October 14, 2014, 11:29:59 pm »
Reliability is not so much an issue of brand - the same brand can have good models and flaky models on the market at the same time. Every now and then even a top brand releases a disk with bad parts or bad firmware.

I wonder if he means brand of controller?  There are (were) a few well-known controllers that several vendors used.  This became public knowledge partly because enthusiasts got interested in the feature sets of certain ones, so it became a point of pride that your design used one of those.  Double-edged sword though, as it also became a point of shame when a given controller was found to have flaws.
 

n45048

  • Guest
Re: SSD behaviour - fail then unfail then normal?
« Reply #8 on: October 15, 2014, 02:48:19 am »
As per the previous comments, SSDs will remap failing or bad blocks to a spare area of the drive hidden from the host OS. The drive's controller takes care of this silently.

I use Intel SSDs regularly and can say I've never had one fail. I have had two Corsair ones die on me though.

Firstly, I would check that you're running the latest firmware (install the Intel SSD Toolbox utility). There may be some incompatibility or something funny going on between the SSD controller and your SATA controller. Also make sure your system BIOS is up to date.

To be honest though, I only ever use Supermicro motherboards, so comparing the reliability and stability of my overall system to consumer-grade gear isn't really fair. I've seen some motherboards just refuse to play nice with some SSDs (Gigabyte from memory caused me some issues; I can't recall the brand of SSD now).
« Last Edit: October 15, 2014, 08:09:54 am by Halon »
 

Offline SeanB

  • Super Contributor
  • ***
  • Posts: 16284
  • Country: za
Re: SSD behaviour - fail then unfail then normal?
« Reply #9 on: October 15, 2014, 04:56:32 am »
Check for drive firmware updates, and apply them after backing up (and check the backup is working, as the update invariably wipes the drive); that will likely fix the problems you have. The Intel 8MB drive bug was a known issue, caused by an internal counter overflow in the drive controller. IIRC it was triggered by leaving the drive on for a period combined with certain numbers of writes.

Erasing the drive with a secure erase will clear all blocks, and the drive will then remap itself, so yes, it clears the failing blocks and uses spare blocks internally. Likely you had a block go bad and the drive had unrecoverable errors on it; the reset put this block in the unusable area. These drives just look perfect on the surface - underneath they are always fixing failures, and it depends on the manufacturer how bad the underlying tech is and what they consider good enough. MLC is worse than SLC, simply because of poorer noise margin on reading and writing.
 

n45048

  • Guest
Re: SSD behaviour - fail then unfail then normal?
« Reply #10 on: October 15, 2014, 08:12:36 am »
...as the update invariably wipes the drive

I've never known a firmware update on an SSD to erase the drive. Most manufacturers, including Intel, advise backing up all data just in case the update goes horribly wrong and bricks the drive, leaving it inaccessible. I've always performed updates on "live" drives without any problems.

There is no reason why a controller firmware update needs to reinitialise a disk.
 

n45048

  • Guest
Re: SSD behaviour - fail then unfail then normal?
« Reply #11 on: October 15, 2014, 08:23:33 am »
In either case it needs replacing. However, since the bad blocks are remapped you can continue to use the drive. It's near death though, and can't be trusted.

Some very good advice. A drive with bad sectors or failing blocks should be replaced as soon as practicable. It is almost always a downward spiral from there.

In my experience, drives with manufacturing faults usually fail at the beginning of their life. After that, it's usually old age or physical damage that causes failure. I make it a point to keep mechanical disk drives running continuously rather than cycling them on and off. The theory is that this minimises the expansion and contraction of the platters and other components caused by temperature fluctuations.

I also make a point of using high-end disk drives, especially in hardware RAID arrays (currently I'm using Hitachi Ultrastars in my server).

« Last Edit: October 15, 2014, 08:30:13 am by Halon »
 

Offline bktemp

  • Super Contributor
  • ***
  • Posts: 1616
  • Country: de
Re: SSD behaviour - fail then unfail then normal?
« Reply #12 on: October 15, 2014, 09:00:34 am »
There are some interesting articles about the bit error rates of hard drives and flash memories. They are about equal. Modern storage devices do not really store data and read it back; rather, they guess what data may have been stored. Together with the ECC this works quite well, but a small error rate still remains.

Usually the drive manages bad blocks and automatically remaps them internally, so the operating system never has to deal with them. If there are any bad blocks visible to the operating system, it means the drive firmware was not able to handle them. From my experience with SD cards, any further write can then destroy all the data: if the firmware does not detect bad blocks, it will store data to those blocks, and if that data is crucial it will affect the whole drive.
Marking bad sectors in the file system is useless with SSDs because of wear leveling: the drive remaps the blocks, so a sector marked as bad now points to a different physical block.
 

Offline Strada916

  • Frequent Contributor
  • **
  • Posts: 252
  • Country: au
Re: SSD behaviour - fail then unfail then normal?
« Reply #13 on: October 15, 2014, 09:52:40 am »
Turn off search indexing. It can prematurely reduce the life of your SSD. I read that somewhere.
The Bone, the Off-White, the Ivory or the Beige?
 

Online dexters_lab

  • Supporter
  • ****
  • Posts: 1890
  • Country: gb
Re: SSD behaviour - fail then unfail then normal?
« Reply #14 on: October 15, 2014, 10:49:05 am »
If I found errors like that on one of my SSDs it would go back for warranty without question; the way I see it, if it's failing badly enough to throw errors at the user then it's time to bin it.

There is usually a SMART value that shows the estimated lifetime remaining (or similar) as a percentage. It would be interesting to know that, as it is an indicator of how many spare blocks remain.

TRIM also has a part to play in this: it is a way for the OS to notify the disk controller that an area of the disk is now free to be erased, which allows the wear levelling to make best use of the free blocks.
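For what it's worth, both can be poked at from Linux with something like this (a sketch only - attribute names vary a lot between vendors, and /dev/sdX is a placeholder):

    # dump the vendor SMART attributes and pick out the wear/life related ones
    smartctl -A /dev/sdX
    smartctl -A /dev/sdX | grep -Ei 'wear|life|media'

    # ask the OS to TRIM all free space on a mounted filesystem
    fstrim -v /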

Offline amyk

  • Super Contributor
  • ***
  • Posts: 8276
Re: SSD behaviour - fail then unfail then normal?
« Reply #15 on: October 15, 2014, 12:52:12 pm »
MLC is worse than SLC, simply because of poorer noise margin on reading and writing.
2-bit MLC is considered "high end" in the consumer SSD space now - 3-bit MLC ("TLC") is becoming the norm for SSDs, and large SD cards/USB drives are using 4-bit MLC. Marketing makes it seem like size is everything, so instead of more durable SSDs we're only getting bigger and faster. Data retention has also dropped significantly - I remember 10 years/100k P/E cycles for SLC when it first came out, now it's 1 year/<1k P/E cycles for the latest 4-bit MLCs.

Do not rely on modern high-density NAND flash for any form of long-term storage...
 

Offline daqqTopic starter

  • Super Contributor
  • ***
  • Posts: 2302
  • Country: sk
    • My site
Re: SSD behaviour - fail then unfail then normal?
« Reply #16 on: October 15, 2014, 01:54:11 pm »
Thanks guys for the info!

I was asking mainly because the last time, the warranty exchange took about a month, during which I was left with a temporary system that was effectively an internet browsing terminal. This was NOT an experience I wanted to enjoy again.

Damn, I guess I'll try to have the thing replaced on grounds of nonfunctionality, which might get funny, seeing as it reports itself as OK and shows no errors whatsoever.
Believe it or not, pointy haired people do exist!
+++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++
 

Offline SeanB

  • Super Contributor
  • ***
  • Posts: 16284
  • Country: za
Re: SSD behaviour - fail then unfail then normal?
« Reply #17 on: October 15, 2014, 05:37:40 pm »
Indexing involves a series of small writes to the database, which equates to a lot of erase cycles on a large block as the bits in a small section change over and over. That uses up cycles for that part of the flash, so it gets reallocated, and then the next load of indexing wears the new block in turn. Run for a while, it will cycle through the entire spare pool in no time. Big writes, conversely, cause only minimal wear on any single block.
 

Offline SirNick

  • Frequent Contributor
  • **
  • Posts: 589
Re: SSD behaviour - fail then unfail then normal?
« Reply #18 on: October 15, 2014, 06:31:01 pm »
Similarly, on Linux systems, set up a RAM drive for /var/log ...
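Something like this /etc/fstab entry does it (a sketch - the 64MB size is an arbitrary choice, and anything in /var/log is lost at every reboot):

    # keep /var/log in RAM so routine logging never touches the flash
    tmpfs  /var/log  tmpfs  defaults,noatime,mode=0755,size=64m  0  0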
 

Offline tom66

  • Super Contributor
  • ***
  • Posts: 6709
  • Country: gb
  • Electronics Hobbyist & FPGA/Embedded Systems EE
Re: SSD behaviour - fail then unfail then normal?
« Reply #19 on: October 15, 2014, 08:06:29 pm »
MLC is worse than SLC, simply because of poorer noise margin on reading and writing.
2-bit MLC is considered "high end" in the consumer SSD space now - 3-bit MLC ("TLC") is becoming the norm for SSDs, and large SD cards/USB drives are using 4-bit MLC. Marketing makes it seem like size is everything, so instead of more durable SSDs we're only getting bigger and faster. Data retention has also dropped significantly - I remember 10 years/100k P/E cycles for SLC when it first came out, now it's 1 year/<1k P/E cycles for the latest 4-bit MLCs.

Do not rely on modern high-density NAND flash for any form of long-term storage...

Retention is especially temperature dependent. We had a box at work which would lose its software (fail to boot, or suffer video corruption/random crashes) after about 6 months of accelerated life testing at 65C.

It turned out the flash had a junction temperature of 105C, as it was right next to a hot processor... so its data retention figure was well under one year.

When we did a checksum test on 12 of the boxes, 8 of them failed after 6 months.
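A checksum test of that sort boils down to comparing a reference digest against a fresh read of the flash - a minimal sketch of the idea, not the actual test script, with a hypothetical device path:

    # record a reference digest of the read-only software partition once
    sha256sum /dev/mmcblk0p1 > /root/golden.sha256

    # after the bake, re-read the partition and compare against the reference
    sha256sum -c /root/golden.sha256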
 

Offline SeanB

  • Super Contributor
  • ***
  • Posts: 16284
  • Country: za
Re: SSD behaviour - fail then unfail then normal?
« Reply #20 on: October 15, 2014, 08:12:54 pm »
Similarly, on Linux systems, set up a RAM drive for /var/log ...

I killed a flash drive in about 3 days just leaving the logging on.  The regular writes of the standard Ubuntu install I had on it sitting idle with only the regular system logging did that. Woke up on day 4 and it had kernel panicked, very unhappy that the only drive it had was no longer responding. Just wanted to see how long it would last.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: SSD behaviour - fail then unfail then normal?
« Reply #21 on: October 15, 2014, 10:49:42 pm »
There are some interesting articles about the bit error rates of hard drives and flash memories. They are about equal. Modern storage devices do not really store data and read it back; rather, they guess what data may have been stored. Together with the ECC this works quite well, but a small error rate still remains.

I'm wildly off-topic, but this is no different to a lot of other areas of digital technology, where everything normally works fine but every now and then random stuff happens.

- Non-repeatable single-bit errors in RAM
- Bit corruption in packets on a SAS bus
- Bit corruption in packets on the network
- Bit errors on Fibre Channel
- Parity errors in CPU caches
- Metastability in flip-flops
- Background-radiation-induced single-bit errors in chips (search for "low alpha packaging" for some interesting reading)

Technology is now approaching the limits where regular random physical events need to be taken into account and engineered around.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SirNick

  • Frequent Contributor
  • **
  • Posts: 589
Re: SSD behaviour - fail then unfail then normal?
« Reply #22 on: October 16, 2014, 12:29:56 am »
I killed a flash drive in about 3 days just leaving the logging on.  The regular writes of the standard Ubuntu install I had on it sitting idle with only the regular system logging did that. Woke up on day 4 and it had kernel panicked, very unhappy that the only drive it had was no longer responding. Just wanted to see how long it would last.

Wow... that's pretty bad.  I learned my lesson with an Internet-accessible host (a router) running from a Compact Flash card.  It took two years to die though, even with a constant barrage of failed SSH login attempts going to the logs.

I looked into the 8MB bug again, just curious about the post-mortem.  It looks like the firmware updates never actually fixed anything (which was my experience at the time too).  But it gets even better: not only did it plague the model I had, but also the ones 10 to 20 times its size, which came out later.  There are even a few reports as recent as 2012 and 2013.  Some poor guy talked Intel into trying to update his drive for him.  They tried, failed, and because it turned out someone had sold him an engineering sample, Intel just sent the dead one back to him.  Classy.
 

Offline justanothercanuck

  • Frequent Contributor
  • **
  • Posts: 391
  • Country: ca
  • Doing retro repairs...
Re: SSD behaviour - fail then unfail then normal?
« Reply #23 on: October 17, 2014, 09:50:04 am »
Some SATA cables suck, and are prone to quirky errors, like disks that stop working at random.  They need to beef up that connector somehow.  :-//
Maintain your old electronics!  If you don't preserve it, it could be lost forever!
 

Online wraper

  • Supporter
  • ****
  • Posts: 16866
  • Country: lv
Re: SSD behaviour - fail then unfail then normal?
« Reply #24 on: October 17, 2014, 10:18:55 am »
I killed a flash drive in about 3 days just leaving the logging on.  The regular writes of the standard Ubuntu install I had on it sitting idle with only the regular system logging did that. Woke up on day 4 and it had kernel panicked, very unhappy that the only drive it had was no longer responding. Just wanted to see how long it would last.

Wow... that's pretty bad.  I learned my lesson with an Internet-accessible host (a router) running from a Compact Flash card.  It took two years to die though, even with a constant barrage of failed SSH login attempts going to the logs.
Compact Flash does not have wear leveling, so it is relatively easy to kill when writing to the same place frequently. That's why I used old 64MB SLC CF cards when replacing old HDDs in some mid-90's devices. Personally I have a 240GB OCZ Vertex 3 SSD. It developed one bad block about 2 years ago, and no further degradation has occurred. The OCZ Toolbox still shows 100% life left. As for the OP's SSD, I would do a firmware update and a secure erase. SSDs usually fail because of firmware bugs, not the flash ICs failing. SSDs are completely different from HDDs and usually shouldn't start developing a huge number of bad blocks unless you completely rewrite them a few thousand times. BTW, what does the reallocated sector count parameter show in SMART?
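For reference, that attribute (and the current firmware revision) can be read under Linux with something like this (a sketch; /dev/sdX is a placeholder for the SSD):

    # firmware revision, to see whether an update is available
    smartctl -i /dev/sdX | grep -i firmware

    # reallocated sector / event counts from the SMART attribute table
    smartctl -A /dev/sdX | grep -i realloc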
 

