Author Topic: how can I voluntarily damage an SSD to test a testing program?  (Read 3928 times)


Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
how can I voluntarily damage an SSD to test a testing program?
« on: February 12, 2021, 07:58:34 pm »
I am writing a "storage" device testing program.

Yesterday I voluntarily damaged an old 1.6 Gbyte electro-mechanical Hard Disk Drive in order to see how my testing program reacts to real physical damage, and whether it's able to find and report it.

I opened the cover and made some scratches on the disk, and this is the result:

Code:
opening /dev/hdc ... done
disk_test
Checking for bad blocks in read-write mode
From 0 to 1.610.735.615, by 8.192 bytes at time
   [  0   ] BBBBBBBBB..BBBBB.BBBBBBBBBBBB........BBBBBBBBBBBBBBBBBBBBBBBBBBB
   [ 16.6 ] BBBBBBBBBBBBBBBBBBBBBBB.BBBBBBBBBBBBBBBBBB.BBBBBBBBBBBB....BBBBB
   [ 33.3 ] BBBBBBBBBBB...............BBBBBBBBBBB.BBBBBBBBBBBBB.BBBBBBBBBBBB
   [ 49.9 ] BB...BBBBBBBBBBBBBBB.BBBBBBBBBBBBBBB......BBBBBBBBBBBBBBBBBBBBBB
   [ 66.6 ] BBBBBBBBBBBBBBBBB....BBBBBBBBBB.BBBBBBBBBBBBBBB....BBBBBBBBB.BBB
   [ 83.3 ] BBBBBBBBBBBBBBBBBB.BBB.BBBBBB.BBBBB..BB..BB.BBBBB.BBB...........
   [ 99.9 ] .

"." means no problem
"B" means error during I/O

Excellent!!! The damage is there, perfectly detected and reported! :D
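
For readers who want to reproduce this kind of map, here is a minimal sketch in C of how per-chunk results could be collapsed into one character per cell. It is not the program from the post: `build_block_map` and its parameters are hypothetical names, and the grouping of 512 chunks per cell is only an inference from the output above (196,623 chunks of 8,192 bytes would give 384 full cells plus one short one, which matches six rows of 64 characters and a single trailing cell).

```c
#include <stddef.h>

/* Hypothetical helper, not the author's code: collapse per-chunk I/O
 * results into one map character per cell. A cell is 'B' if any chunk
 * inside it failed, '.' otherwise. Returns the number of cells written. */
size_t build_block_map(const unsigned char *chunk_failed, size_t n_chunks,
                       size_t chunks_per_cell, char *map, size_t map_cap)
{
    size_t n_cells = 0;

    if (chunks_per_cell == 0)
        return 0;

    for (size_t start = 0; start < n_chunks && n_cells < map_cap;
         start += chunks_per_cell) {
        size_t end = start + chunks_per_cell;
        if (end > n_chunks)
            end = n_chunks;            /* last cell may be shorter */

        char c = '.';
        for (size_t i = start; i < end; i++) {
            if (chunk_failed[i]) {     /* one failed chunk taints the cell */
                c = 'B';
                break;
            }
        }
        map[n_cells++] = c;
    }
    return n_cells;
}
```

Calling it with `chunks_per_cell = 512` and printing 64 characters per row would give a layout like the one shown; the read/write loop that fills `chunk_failed` is left out.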

Now I need a damaged old SSD in order to run similar tests. I can buy a cheap second-hand SSD on Bonanza, but how can I voluntarily damage it?


(it sounds crazy, I know, but ... that's done for scientific reasons, so ... )
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Johnny10

  • Frequent Contributor
  • **
  • Posts: 900
  • Country: us
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #1 on: February 12, 2021, 08:26:56 pm »
Why not ask EEV Forum members if they will send you one?
Tektronix TDS7104, DMM4050, HP 3561A, HP 35665, Tek 2465A, HP8903B, DSA602A, Tek 7854, 7834, HP3457A, Tek 575, 576, 577 Curve Tracers, Datron 4000, Datron 4000A, DOS4EVER uTracer, HP5335A, EIP534B 20GHz Frequency Counter, TrueTime Rubidium, Sencore LC102, Tek TG506, TG501, SG503, HP 8568B
 

Offline Jwalling

  • Supporter
  • ****
  • Posts: 1517
  • Country: us
  • This is work?
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #2 on: February 12, 2021, 08:45:25 pm »
Fill your SSD with data, but leave 50KB or so free. Then make a .cmd file that will keep writing a (just shy of) 50KB file to it.
Use two files, one with all 00s and one with all FFs. Alternate between the two.
Hopefully, it will wear out the flash in that area. Since the disk is almost full, that will subvert the wearing algorithm for the drive...
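
The alternating patterns described here could be generated along these lines. This is only a sketch in C (the post suggests a .cmd file, so this is a swapped-in illustration); `fill_pass_pattern` and `count_flipped_bits` are hypothetical helpers, and actually writing and flushing the buffer to the drive is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Fill buf with the byte pattern for a given pass:
 * even passes write 0x00, odd passes write 0xFF. */
void fill_pass_pattern(unsigned char *buf, size_t len, unsigned pass)
{
    memset(buf, (pass & 1u) ? 0xFF : 0x00, len);
}

/* Count how many bits differ between two buffers of equal length. */
size_t count_flipped_bits(const unsigned char *a, const unsigned char *b,
                          size_t len)
{
    size_t flips = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char x = a[i] ^ b[i];   /* set bits mark differences */
        while (x) {
            flips += x & 1u;
            x >>= 1;
        }
    }
    return flips;
}
```

Between an all-00 pass and an all-FF pass every bit of the buffer changes, which is the point of alternating the two files. Whether those writes land on the same physical cells is up to the drive's controller.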
Jay

System error. Strike any user to continue.
 
The following users thanked this post: DiTBho

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #3 on: February 12, 2021, 08:49:56 pm »
Quote from: Jwalling
Fill your SSD with data, but leave 50KB or so free. Then make a .cmd file that will keep writing a (just shy of) 50KB file to it.
Use two files, one with all 00s and one with all FFs. Alternate between the two.
Hopefully, it will wear out the flash in that area. Since the disk is almost full, that will subvert the wearing algorithm for the drive...

Thanks! Great idea!  :D
 

Online Ian.M

  • Super Contributor
  • ***
  • Posts: 13026
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #4 on: February 13, 2021, 07:48:43 am »
AFAIK, you are in the UK. You could try asking your nearest branch of CeX (second hand media & electronics) if they'd be willing to sell you SSDs that fail testing, but are still recognized by the controller. You'd probably need to check back with them regularly, as I'd bet at the moment their warranty returns get scrapped.
« Last Edit: February 13, 2021, 07:50:25 am by Ian.M »
 
The following users thanked this post: DiTBho

Online MK14

  • Super Contributor
  • ***
  • Posts: 4843
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #5 on: February 13, 2021, 07:55:19 am »
Quote from: Jwalling
Fill your SSD with data, but leave 50KB or so free. Then make a .cmd file that will keep writing a (just shy of) 50KB file to it.
Use two files, one with all 00s and one with all FFs. Alternate between the two.
Hopefully, it will wear out the flash in that area. Since the disk is almost full, that will subvert the wearing algorithm for the drive...

Disclaimer: I'm absolutely NOT an expert on SSDs. So if anyone claims differently, listen to them.

I thought SSDs use over-provisioning, i.e. they have significantly more storage area than claimed/reported, specifically to counter the very situation you are describing (among other reasons).
I.e. if it is advertised as being exactly 100GB, and reports being exactly 100GB to the OS, it really is something like 105GB+ in size. So it would merely use the remaining 5GB (which you CAN'T fill up with your data) as a wear-leveling area and as spare space to replace unusable 'bad blocks'.
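
As a rough illustration of the arithmetic, over-provisioning is often quoted as the spare capacity relative to the user-visible capacity. The sketch below assumes that definition; the function name is made up for this example.

```c
/* Over-provisioning as a percentage of user-visible capacity:
 * OP% = (physical - user) / user * 100.
 * Returns 0 for a non-positive user capacity. */
double overprovision_percent(double physical_gb, double user_gb)
{
    if (user_gb <= 0.0)
        return 0.0;
    return (physical_gb - user_gb) / user_gb * 100.0;
}
```

With the figures above, 105GB of flash behind a 100GB drive works out to 5% over-provisioning.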

Also note that SSDs try to use their RAM to store rapidly/regularly changing sectors (specifically the types with supercapacitors, which hold the supply rails up after power-off; I'm not sure how non-supercapacitor types use RAM). That is probably where the 50KB test file will really reside (until a power-off cycle forces the drive to write the RAM-buffered sectors onto the actual flash).

Additionally, SSDs very extensively remap the sectors. So writing a 'bad block' testing/detection program is probably extremely challenging, because the drive won't readily tell you where sectors really are and/or let you access the 'removed from use' (bad-block) ones etc.

It sounds to me like your 'test program' will just see what the OS sees, which is the remapped and fault-corrected sectors. That is, until the drive gets too many 'bad blocks', runs out of over-provisioning (and maybe other) areas, and then does something like insisting on working only in read-only mode (for data recovery), refusing any further writes or changes to data on the SSD.
I.e. the SSD would be what some call 'bricked'.

« Last Edit: February 13, 2021, 08:06:47 am by MK14 »
 
The following users thanked this post: hans, Fraser, janoc, newbrain

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #6 on: February 13, 2021, 08:20:20 am »
Quote from: MK14
It sounds to me like your 'test program' will just see what the OS sees, which is the remapped and fault-corrected sectors. That is, until the drive gets too many 'bad blocks', runs out of over-provisioning (and maybe other) areas, and then does something like insisting on working only in read-only mode (for data recovery), refusing any further writes or changes to data on the SSD.
I.e. the SSD would be what some call 'bricked'.

Right. A brick is the most likely outcome with a modern drive. Better to get an old "mechanical" 3.5" HDD known to have bad blocks. Hardware hoarders could have plenty, not to mention eBay.
 
The following users thanked this post: MK14

Online Ian.M

  • Super Contributor
  • ***
  • Posts: 13026
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #7 on: February 13, 2021, 08:35:38 am »
Some SSDs even deliberately brick themselves (i.e. not recognized) on the next power cycle once they've detected a critical failure.

« Last Edit: February 13, 2021, 08:37:32 am by Ian.M »
 
The following users thanked this post: MK14

Online Halcyon

  • Global Moderator
  • *****
  • Posts: 5862
  • Country: au
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #8 on: February 13, 2021, 08:37:21 am »
Exactly what MK14 said. Plus a good test program will look at SMART data as well. What better source of data is there than the drive itself? If it's reporting problems, your program should be looking at that and reporting it too, not just blindly accepting what the OS tells you, particularly when it comes to SSDs.
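
One way to picture such a cross-check is a small decision function that combines what the surface scan saw with what the drive reports about itself. The sketch below is in C and purely illustrative: the structs, field names and verdict levels are hypothetical, no real SMART library is used, and the raw counters (reallocated and pending sectors) are assumed to have been read elsewhere.

```c
/* Hypothetical inputs: none of these names come from a real SMART library. */
struct scan_result {
    unsigned long io_errors;     /* blocks that failed during the surface scan */
};

struct smart_snapshot {
    int available;               /* 0 if the device does not report SMART */
    unsigned long reallocated;   /* reallocated sector count (raw value) */
    unsigned long pending;       /* current pending sector count (raw value) */
};

enum verdict { VERDICT_OK, VERDICT_SUSPECT, VERDICT_FAILING };

enum verdict cross_check(const struct scan_result *scan,
                         const struct smart_snapshot *smart)
{
    if (scan->io_errors > 0)
        return VERDICT_FAILING;  /* the host itself saw I/O errors */
    if (smart->available && (smart->reallocated > 0 || smart->pending > 0))
        return VERDICT_SUSPECT;  /* clean scan, but the drive admits remapping */
    return VERDICT_OK;
}
```

The middle case is the interesting one for SSDs: the scan can come back clean while the drive's own counters already show that sectors were remapped.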

 
The following users thanked this post: MK14

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #9 on: February 13, 2021, 09:59:48 am »
I have added the SMART check as a "cross-check", thanks for the tips  :D

I need "guinea pigs": SSDs to be used as subjects for experiments in order to detect false positives, false negatives, etc.
 

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #10 on: February 13, 2021, 10:02:27 am »
Quote from: Ian.M
AFAIK, you are in the UK. You could try asking your nearest branch of CeX (second hand media & electronics) if they'd be willing to sell you SSDs that fail testing, but are still recognized by the controller. You'd probably need to check back with them regularly, as I'd bet at the moment their warranty returns get scrapped.

Great idea! Thanks! That's why I opened the topic in the Chat area, just to catch tips like this!
It seems the best idea ever  :D
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1671
  • Country: nl
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #11 on: February 13, 2021, 10:02:47 am »
Quote from: Ian.M
Some SSDs even deliberately brick themselves (i.e. not recognized) on the next power cycle once they've detected a critical failure.

I believe some SSD firmware/controllers can also go to a read-only state on extensive failures. That way easy data recovery is still possible.

HDDs also have reserve sectors by the way, so that damaged sectors may be discarded after an erase, AFAIK. If you run badblocks on a drive, never let a single pass convince you of a failure. Sometimes the drive can correct the damaged sector, update the SMART data (which has pre-fail indicators like reallocated sector count), and go from there. If it still fails after the drive has been completely write cycled, then for sure it's broken (e.g. by r/w head damage).
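
The "don't trust a single pass" advice could be encoded roughly like this. It is a sketch in C under the assumption that the tester records, per block, whether each full write pass failed; the enum and function names are hypothetical.

```c
#include <stddef.h>

enum block_state { BLOCK_GOOD, BLOCK_RECOVERED, BLOCK_BAD };

/* fail_per_pass[p] is nonzero if the block failed on pass p.
 * A block counts as BAD only if it still fails on the final pass;
 * a block that failed earlier but passes at the end is RECOVERED
 * (for example because the drive remapped it in between). */
enum block_state classify_block(const unsigned char *fail_per_pass,
                                size_t n_passes)
{
    int failed_before = 0;

    if (n_passes == 0)
        return BLOCK_GOOD;

    for (size_t p = 0; p + 1 < n_passes; p++) {
        if (fail_per_pass[p])
            failed_before = 1;
    }
    if (fail_per_pass[n_passes - 1])
        return BLOCK_BAD;
    return failed_before ? BLOCK_RECOVERED : BLOCK_GOOD;
}
```

A block in the RECOVERED state is still worth reporting, since it should show up in the drive's reallocation counters.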
 
I've used a pre-fail disk for some time as a NAS scratch disk for media downloads that were non-critical and reproducible. At the time I wasn't in a position to buy a bunch of HDDs for a RAID array. That drive, a notorious Seagate 2TB model, completely failed after about 1 year or so... so yes, it's a risk to keep running storage devices like that. But I had no regrets; I only had a bunch of movies and IPTV recordings on that drive.

I have seen the tech site hardware.info trying to wear out a Samsung 840 250GB TLC SSD in 2013. They went for the straightforward approach: fill the drive with 160GB of static data, then keep refilling the remaining area 24/7 till it fails. The drive was rated for 1000 P/E cycles. They saw their first reallocated sector at 2945 P/E cycles on 1 drive (707TBW). At 3187 P/E cycles (764TBW) they saw an uncorrectable error (512kB lost). The drive gave up at 3706 P/E cycles, with 888TBW. This test took over 3 months to complete, running 24/7.
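
The figures quoted here can be sanity-checked with two small helpers. This is a sketch in C with made-up function names; it only divides the numbers given in the post and makes no claim about how the article computed them.

```c
/* Terabytes written per P/E cycle (0 if no cycles are given). */
double tb_per_cycle(double tbw, double pe_cycles)
{
    return pe_cycles > 0.0 ? tbw / pe_cycles : 0.0;
}

/* How many times the rated cycle count was actually reached. */
double endurance_ratio(double observed_cycles, double rated_cycles)
{
    return rated_cycles > 0.0 ? observed_cycles / rated_cycles : 0.0;
}
```

Plugging the numbers in gives about 0.24 TB (240GB) written per P/E cycle at each of the three milestones, a first reallocated sector at roughly 2.9 times the rated 1000 cycles, and final failure at roughly 3.7 times.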

They noted that the drive would reshuffle data in the background to even out wear. I also think that 250GB is an odd-ball value: the drive probably has 256GB worth of TLC flash cells, but uses a portion of the cells as SLC cache (which has an order of magnitude higher wear tolerance) or reserve sectors. For the latter you probably only need a few hundred MB or several GB, depending on how precisely you can specify the P/E expectancy and variance. I.e. if you can predict that only a few dozen cells have worn out at 1k P/E, you don't need to take up much reserve space. 1k P/E cycles was quite conservative for this particular drive: they got almost 3x the lifetime out of it.

Source (in Dutch, unfortunately): https://nl.hardware.info/artikel/4177/10/hardwareinfo-test-levensduur-samsung-ssd-840-250gb-tlc-ssd-eind-update-20-6-2013-update-9-eindconclusie-20-6-2013
« Last Edit: February 13, 2021, 03:31:19 pm by hans »
 
The following users thanked this post: MK14, DiTBho

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #12 on: February 13, 2021, 11:05:59 am »
Quote from: MK14
Additionally, SSDs very extensively remap the sectors. So writing a 'bad block' testing/detection program is probably extremely challenging, because the drive won't readily tell you where sectors really are and/or let you access the 'removed from use' (bad-block) ones etc.

Yup, this is a kind of "legacy" requirement, needed to also test electro-mechanical HDDs.

The testing program needs to support both technologies, and to offer the user the possibility to set up the testing mode.

Quote from: MK14
[..] which is the remapped and fault-corrected sectors. Until it gets too many 'bad blocks', runs out of over-provisioning (and maybe other) areas

This is what I am trying to study. I have zero experience with SSDs and flash-based storage devices.


ram-disks (volatile, not permanent, used by industrial machines with high IOp)
I also have to support and test the ram-disk, and in this case I have to measure the number of IOps that go through an I/O system request. If there is a delay on a ram-disk made with ECC RAM, it means that the DRAM controller encountered a problem: the CRC failed and a parity error was caught, and the controller tried to repeat the operation. If this happens too many times ... well, you don't see it happening from user space, and even the kernel does not know anything about it. Only the controller inside the ram-disk knows what's happening, and it won't tell you anything unless there is a solid failure, which is too late.

The failure can be detected before it becomes serious: disable caching, issue a large number of IOps, and measure the time. If you notice a selective delay in a certain area, there is probably something about to go wrong with the decoupling capacitors or something else on the physical PCB in the specific area where the DRAM chips are mounted! Or, worse still, part of the ECC RAM is about to become faulty and unreliable and needs to be replaced.

My ram-disk has several RAM sticks installed. Something like 32 banks, 4Gbyte each. I did an experiment with a voluntarily damaged ECC RAM stick in one of the 32 banks (I removed a capacitor from an old RAM stick), and my testing program caught it: bank#4 looks suspicious, slow IOp, please check it!
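
The timing idea can be sketched as a simple outlier test over per-bank latencies. This is a minimal illustration in C, not the program from the post: it assumes the mean I/O latency per bank has already been measured with caching disabled, and `flag_slow_banks`, the microsecond unit and the median-times-factor threshold are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sets suspicious[i] = 1 for every bank whose mean latency exceeds
 * factor * median, 0 otherwise. Returns the number of flagged banks,
 * or -1 if the input is empty or memory cannot be allocated. */
int flag_slow_banks(const double *latency_us, size_t n_banks,
                    double factor, unsigned char *suspicious)
{
    if (n_banks == 0)
        return -1;

    double *sorted = malloc(n_banks * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, latency_us, n_banks * sizeof *sorted);
    qsort(sorted, n_banks, sizeof *sorted, cmp_double);

    double median = (n_banks % 2 == 1)
        ? sorted[n_banks / 2]
        : (sorted[n_banks / 2 - 1] + sorted[n_banks / 2]) / 2.0;
    free(sorted);

    int flagged = 0;
    for (size_t i = 0; i < n_banks; i++) {
        suspicious[i] = latency_us[i] > factor * median;
        flagged += suspicious[i];
    }
    return flagged;
}
```

Using the median rather than the mean keeps one very slow bank from dragging the threshold up with it; the factor would have to be tuned on real hardware.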

Bingo! :D

SMART does somehow support this; unfortunately my ram-disk devices do not have these diagnostic features implemented. If I issue a check, it always returns "not supported", or worse still, "all OK".

That's why I have added SMART as an additional layer to run cross-tests: this way I can write only one C program, organized in C sub-modules, and use it to test three kinds of storage technology: HDDs, SSDs, and RAM-disks!
« Last Edit: February 13, 2021, 11:12:35 am by DiTBho »
 
The following users thanked this post: MK14

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 741
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #13 on: February 13, 2021, 11:12:04 am »
Quote from: Jwalling
Fill your SSD with data, but leave 50KB or so free. Then make a .cmd file that will keep writing a (just shy of) 50KB file to it.
Use two files, one with all 00s and one with all FFs. Alternate between the two.
Hopefully, it will wear out the flash in that area. Since the disk is almost full, that will subvert the wearing algorithm for the drive...

That probably does not work. The SSD will detect that some parts of the SSD are never written while others are written very often. My assumption is that the SSD controller will detect such a thing and swap some of the static content into the often-written cells. The controller most likely doesn't care that much about "empty" space, but rather about the write-cycle counter of a section, and acts on that information.

@DiTBho good luck wearing down an SSD. A German newspaper (others have probably done that too) did this in 2017 and the result for the write cycles was surprisingly good. After five months nine out of twelve SSDs were dead. The 'worst' one was able to handle 188 terabytes written (manufacturer datasheet: 72TBW); the best in the test died at 4623TBW (manufacturer datasheet: 150TBW). Of course modern drives will behave differently. And the data will probably get corrupted within a short time, as the flash cell insulation is heavily damaged after such treatment. They also noticed that the dead drives did not allow any access any more (neither read nor write). Some were not even recognized by the BIOS/UEFI. The German article is behind a paywall, but in case of interest: So lange halten SSDs ("This is how long SSDs last").
 
The following users thanked this post: MK14

Online tom66

  • Super Contributor
  • ***
  • Posts: 6890
  • Country: gb
  • Electronics Hobbyist & FPGA/Embedded Systems EE
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #14 on: February 13, 2021, 12:13:12 pm »
At a prior employer I was responsible for testing eMMC devices used in set top boxes for the 'trick play' function (records the last 30min of programming so you can pause and resume.)

The bottom line is that the easiest way to kill flash memory is to get it hot. At 45C ambient the lifespan of the memory was a third of what it was at room temperature: only 1,000 cycles vs 3,000 cycles. And from the documentation I could find, this is not unusual. Write endurance drops by about 30-40% for every 10C rise.
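
The rule of thumb can be turned into a quick estimate. The sketch below is in C and assumes the drop compounds once per full 10C step; the function name is hypothetical and the model is only as good as the "30-40% per 10C" figure from the post.

```c
/* Endurance after `steps_10c` rises of 10C each, losing `drop_fraction`
 * of the remaining cycles per step (0.30 to 0.40 in the post). */
double derated_cycles(double base_cycles, double drop_fraction,
                      unsigned steps_10c)
{
    double cycles = base_cycles;

    for (unsigned i = 0; i < steps_10c; i++)
        cycles *= (1.0 - drop_fraction);
    return cycles;
}
```

With a 40% drop per step, two steps (about 20C above room temperature, assuming room temperature is near 25C) take 3,000 cycles down to 1,080, which is in the same range as the 1,000 cycles measured at 45C. A 30% drop would leave 1,470.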

And, when the devices failed, the device would basically not acknowledge reads to any parts that were damaged, and it would stall the OS.

So: write a lot of data to the memory while it is hot. Preferably, heat the drive up while doing so, provided it remains within thermal limits and does not stop writing above a certain temperature (some drives do). Data patterns should cycle between 0 and 1; ideally, toggle every other bit or write random data to each page, so that the average bit gets used for an erase at least every two operations.

If you keep writing a lot of data to a drive in one file, the drive will move that file around, even if the drive has used sectors ... but it's going to be more 'productive' in terms of wear rate if you write data to every sector, so use large files.
 
The following users thanked this post: MK14, DiTBho

Online MK14

  • Super Contributor
  • ***
  • Posts: 4843
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #15 on: February 13, 2021, 12:32:55 pm »
On eBay, you can get second-hand (possibly sometimes new) lower-capacity SSDs for around (from memory) the £3 mark, including delivery. Various types are available, and some sellers have quite a few to sell.
You could also make them a bulk, low-ball offer.

E.g. £3.75 Delivered, here:

https://www.ebay.co.uk/itm/Ramaxel-8GB-Mini-SSD-S800-S-SATA-DOM-Disk-on-Module-drive/114641308772

If you shop around and are patient, you can probably get them for less, and/or bigger capacity ones. E.g. 16GB, etc.

Quote
Ramaxel 8GB Mini SSD S800-S SATA DOM Disk on Module drive
Condition: New
Quantity: 7 available, 24 sold
Price: £3.75
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4843
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #16 on: February 13, 2021, 01:05:19 pm »
Quote from: DiTBho
[..]
That's why I have added SMART as an additional layer to run cross-tests: this way I can write only one C program, organized in C sub-modules, and use it to test three kinds of storage technology: HDDs, SSDs, and RAM-disks!

If your testing program has to handle so many (three) VERY significantly different storage media, it could be spreading your time and other resources too thinly to do a decent job.

My understanding is that it is usually recommended NOT to use RAM disks as the real (actual/written) main/sole data store. But if the data is properly stored elsewhere, so the RAM disk is just there to speed things up or something, that is fine. It is a complicated subject area, so there are many other exceptions.
This is because a RAM disk is too susceptible to mechanisms which might corrupt or even fully wipe all the data, suddenly and without much/any warning.

E.g. general DRAM memory bit flips, not all of which are handled by ECC systems. E.g. multiple-bit flips in the same location in memory (64 bits), or even the address being used by the DRAM itself being corrupted (bit-flipped), so that the wrong memory address is read from or written to. Some systems CRC check the address information as well.

Power failures of various kinds, not all of which will be prevented/stopped by UPS and other protective systems.

Various hardware failures, e.g. a single power supply failing, or possibly a shorted device if dual power supplies are fitted. I.e. if the data had been on an HDD, the hardware failure shouldn't have damaged the data, but the volatile RAM would lose all of it.

Various system crashes MIGHT damage the RAM disk's data.

The above are just some examples of how it can go wrong, and why it is not recommended.

But RAM disks open up a can of worms as regards possible differences of opinion. So I will end by saying that some systems' requirements, especially where very high performance is needed, may somewhat force the use of very fast RAM disks in order to economically reach the desired performance.
Ideally such a solution should, as quickly as practicable, copy the data onto non-volatile and more reliable media.
SSDs and big SSD arrays are already so fast that they might be able to handle things quickly enough.

But developing software/techniques to detect when ECC RAM has gone (or is going) bad, and identifying which slot/bank is failing or has failed, sounds like a good and interesting thing to research.

If you or the people you are helping want data reliability, what about things like ZFS file systems, decent backup solutions and RAID (and similar) disk arrays?
« Last Edit: February 13, 2021, 01:08:48 pm by MK14 »
 

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #17 on: February 13, 2021, 02:52:43 pm »
Quote from: MK14
E.g. general DRAM memory bit flips, not all of which are handled by ECC systems. E.g. multiple-bit flips in the same location in memory (64 bits), or even the address being used by the DRAM itself being corrupted (bit-flipped), so that the wrong memory address is read from or written to. Some systems CRC check the address information as well.

The dram-controller should use 36 bits for 32 bits of data, which means 4 bits for the check: isn't that enough?
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4843
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #18 on: February 13, 2021, 03:12:57 pm »
Quote from: DiTBho
The dram-controller should use 36 bits for 32 bits of data, which means 4 bits for the check: isn't that enough?

That is enough to detect and correct only a 1-bit error (flip). It can detect more, especially 2-bit errors at the same address, but hasn't got enough information to correct them, so the system is likely to fail/crash/reboot or whatever happens in that circumstance.
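
As a cross-check on the bit counts, the classic Hamming bound can be computed directly. The sketch below is in C with made-up function names. By this bound, 32 data bits need 6 check bits for single-error correction and 7 with double-error detection added (64 data bits need 8, as on 72-bit ECC modules), so a 36-bit word with only 4 extra bits would more plausibly carry one parity bit per byte, which detects single-bit errors but cannot correct them. That reading of the 36-bit layout is an assumption, not something confirmed in the thread.

```c
/* Smallest r with 2^r >= data_bits + r + 1: the number of check bits
 * a single-error-correcting (SEC) Hamming code needs. */
unsigned hamming_sec_check_bits(unsigned data_bits)
{
    unsigned r = 0;

    while ((1ul << r) < (unsigned long)data_bits + r + 1)
        r++;
    return r;
}

/* SECDED (correct one error, detect two) adds one overall parity bit. */
unsigned secded_check_bits(unsigned data_bits)
{
    return hamming_sec_check_bits(data_bits) + 1;
}
```

For a tester that only needs to notice that something went wrong, detection without correction may still be sufficient.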
 
The following users thanked this post: DiTBho

Offline Jwalling

  • Supporter
  • ****
  • Posts: 1517
  • Country: us
  • This is work?
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #19 on: February 13, 2021, 07:52:13 pm »
Quote from: Twoflower
  Quote from: Jwalling
  Fill your SSD with data, but leave 50KB or so free. Then make a .cmd file that will keep writing a (just shy of) 50KB file to it.
  Use two files, one with all 00s and one with all FFs. Alternate between the two.
  Hopefully, it will wear out the flash in that area. Since the disk is almost full, that will subvert the wearing algorithm for the drive...
That probably does not work. The SSD will detect that some parts of the SSD are never written while others are written very often. My assumption is that the SSD controller will detect such a thing and swap some of the static content into the often-written cells. The controller most likely doesn't care that much about "empty" space, but rather about the write-cycle counter of a section, and acts on that information.


I believe that you're wrong, but I'm guessing as well ;) I don't think the controller will move data from allocated sectors.
Jay

 

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 741
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #20 on: February 13, 2021, 08:06:14 pm »
Quote from: Jwalling
I believe that you're wrong, but I'm guessing as well ;) I don't think the controller will move data from allocated sectors.

According to Static wear leveling - Wikipedia or Wear Leveling - Transcend, the re-allocation is done in the static and global wear leveling modes. And I think that is very common to all modern SSDs. They have to do a lot to provide usable reliability on the tiny multi-level cells. And doing these types of wear leveling comes more or less for free. The over-provisioning and error correction methods are expensive, as they need additional memory area.
 

Offline Jwalling

  • Supporter
  • ****
  • Posts: 1517
  • Country: us
  • This is work?
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #21 on: February 13, 2021, 09:23:41 pm »
Quote from: Twoflower
  Quote from: Jwalling
  I believe that you're wrong, but I'm guessing as well ;) I don't think the controller will move data from allocated sectors.
According to Static wear leveling - Wikipedia or Wear Leveling - Transcend, the re-allocation is done in the static and global wear leveling modes. And I think that is very common to all modern SSDs. They have to do a lot to provide usable reliability on the tiny multi-level cells. And doing these types of wear leveling comes more or less for free. The over-provisioning and error correction methods are expensive, as they need additional memory area.

Fair enough. But in the scenario I first described, will the controller have the time to do this? Just theorizing (guessing) again, but some of the wear leveling may normally occur when the drive is idle.
Jay

 

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 741
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #22 on: February 13, 2021, 09:58:03 pm »
That's a good question. Whether a single write request (modern SSDs can handle many requests in parallel) can saturate the write speed of the drive depends on many things, especially with the different caching mechanisms in place (RAM and the SLC area, over-provisioning). Also, the drive controller can delay the write request to do the housekeeping. For example, drives do that if the SLC cache area is full and the data has to be written directly to the MLC region of the memory array.

A guess I can't find documentation for: an SSD controller will probably slow down the write requests from the computer to manage such an internal copy process, rather than risk data loss by allowing the kind of attack you describe. A drive-internal copy task will be so fast you won't notice it unless you run specific benchmarks, especially as the over-provisioning always keeps some free area available to swap data in and out.

Probably the fastest way to damage the SSD would be to remove one flash chip, burn it to death with a microcontroller by writing to the same region over and over again, and solder it back. A few hundred times will probably be enough for modern MLC flash. But the SSD controller might notice that at the first write attempt to that region, as controllers also do a write-verify procedure. So most likely the controller will mark the bad cells and write the data somewhere else before you see bad content on read-back.
 
The following users thanked this post: DiTBho

Offline james_s

  • Super Contributor
  • ***
  • Posts: 21611
  • Country: us
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #23 on: February 13, 2021, 10:27:48 pm »
I've encountered several failed SSDs and so far the failure mode I've seen was the drives turned read-only and for some reason this stopped Windows from booting. I was able to clone those to new drives and everything was recovered.
 
The following users thanked this post: MK14

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 17051
  • Country: us
  • DavidH
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #24 on: February 14, 2021, 03:20:33 am »
Fill your SSD with data, but leave 50KB or so free. then make a .cmd file that will keep writing a (just shy of) 50KB file to it.
Use two files, one with all 00s and one with all FFs. Alternate between the two.
Hopefully, it will wear out the flash in that area. Since the disk is almost full, that will subvert the wearing algorithm for the drive...

At best, that will rotate through the over-provisioned space. However, my understanding is that good SSD wear-leveling algorithms will also rotate written data through used areas by swapping data around, so there is no pattern of writing which will not wear out all areas evenly.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #25 on: February 14, 2021, 01:14:10 pm »
If your testing program has to handle so many (three) VERY significantly different storage media, it could be spreading your time and other resources too thinly to do a decent job.

  • electro-mechanical HDDs: already supported, errors are detected with fine granularity, at block level (512 bytes)
  • ram-disks: already supported, errors are not detected at block level but, more coarsely, at bank level (one defective IOp in a 4 Gbyte DRAM stick)
  • SSDs: ... not yet supported, but the SMART module is taking its first steps ... waiting for laboratory guinea pigs

It can be done, and it will be done :D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: MK14

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #26 on: February 14, 2021, 01:20:03 pm »
The DRAM controller should use 36 bits for 32 bits of data, which means 4 bits for the check: isn't that enough?

That is enough to detect and correct only a 1-bit error (flip). It can detect more, especially 2-bit errors at the same address, but it hasn't got enough information to correct them, so the system is likely to fail/crash/reboot or whatever happens in that circumstance.

It doesn't need to fix it, it needs to detect it. So it's enough for what I need  :D

Yesterday, I verified a couple of voluntarily damaged RAM sticks, moving them to random positions: always detected as defective, no false positives and no false negatives! So this method seems to work!

The ram-disk allocates banks linearly and always in the same way. It helps a lot! SSDs are not so generous and not so simple.

I agree SSDs need a different and dedicated new strategy!
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: MK14

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 741
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #27 on: February 14, 2021, 01:30:46 pm »
About the spinning rust: are you sure the device tells you the truth? HDDs have done defect replacement for ages. The question is: do you get direct media access (before or after the drive's error correction?) or do you trust the blue-sky information from the drive? The SMART values show how many sectors have been replaced (https://kb.acronis.com/node/19836 hopefully you'll see the page in your language).

As I understand it, the defects created by your approach exceed the number of spare sectors. You should take an error-free drive and use a very thin needle to create a tiny spot (just a touch, no lateral movement). You should then see the reallocated sector count increase as soon as you try to access (read or write) the data of that sector, and probably no data errors (the drive will correct them, as most of the defective sectors are still fine).
 
 
The following users thanked this post: MK14, DiTBho

Online MK14

  • Super Contributor
  • ***
  • Posts: 4843
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #28 on: February 14, 2021, 02:14:43 pm »
It doesn't need to fix it, it needs to detect it. So it's enough for what I need  :D

Yesterday, I verified a couple of voluntarily damaged RAM sticks, moving them to random positions: always detected as defective, no false positives and no false negatives! So this method seems to work!

The ram-disk allocates banks linearly and always in the same way. It helps a lot! SSDs are not so generous and not so simple.

I agree SSDs need a different and dedicated new strategy!

It's good that you have made progress with your RAM checking strategies/software. People sometimes sell faulty RAM sticks fairly cheaply on eBay (especially older-generation ones that are sitting in a pile, gathering dust on their workbench), possibly even piles of them. That would be another way of getting 'bad' RAM sticks, and it might also get you more genuine and realistic fault conditions.

You're right, SSDs are tricky!
The thing is (going by rumors I've heard), even when they are working 100% perfectly, they can be problematic. For example, there can be big performance losses when too little room is left on the SSD, so that it falls back on slower and slower strategies to read from, and especially write to, the disk.

If you buy actual raw flash chips, the bigger types (smaller types can be guaranteed to be 'perfect') can actually come with defective 'sectors' on the flash chip. I.e. they only guarantee that you will get at least a certain number of 'marked as good' sectors, but the rest can be marked on the flash chip as 'bad', 'please do not use, failed testing'.
I'm a bit green as regards these chips, but understanding the differences between NAND and NOR flash types seems to be a good starting point.

Reading those datasheets makes me wonder how, at the factory, they are able to rapidly check each flash chip and its sectors (or whatever its memory sub-units are called) for reliability, without taking ages (long test times are usually highly undesirable on the production line) and, especially, without actually life-testing ('burn out, wear out') the flash cells.

I suppose they could use high temperatures (but that makes the testing more expensive and time-consuming as well) and measure leakage currents or something. Yes, I do wonder how they do such advanced testing, and so quickly, on the production line.

I do know that they can do extensive batch testing on a small number of chips from each batch, which could involve testing them to destruction (e.g. wear-out), but I can't see how that would help them know which sectors on a particular flash chip are bad.

Wild speculation:
Maybe it is all to do with the quality of the insulation layer(s) between adjacent flash memory cells, which they could carefully measure with expensive/accurate/calibrated/sensitive current/voltage measurements, using secret test patterns and pins/connect-points, possibly at raised temperatures.

The reason I'm saying the above is that an SSD really boils down to the flash chip(s) inside it, plus some controller/interface stuff built into it.
 
The following users thanked this post: DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #29 on: February 14, 2021, 04:47:17 pm »
About the spinning rust: are you sure the device tells you the truth? HDDs have done defect replacement for ages. The question is: do you get direct media access (before or after the drive's error correction?) or do you trust the blue-sky information from the drive? The SMART values

There are units that do not have any SMART implemented! E.g. certain Micro-Drives, certain FC-disks, and Disk-Rams

If SMART is available, my testing program considers its answers before running tests.
If SMART is not available, my testing program scans all the disk blocks. If it finds a defect, it iterates around it and checks whether it gets corrected after multiple accesses. If it's not corrected after a selective write and read-back, then it's a permanent bad block.
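
A minimal sketch of that fallback path, in Python for brevity. It is not the actual testing program: the 512-byte block size, the retry count, the 0xAA test pattern and the names `read_block`, `write_readback_ok` and `scan` are all assumptions made for illustration, and the "iterate around the defect" step is left out. The write test is destructive, as in read-write mode.

```python
BLOCK = 512      # assumed block size in bytes
RETRIES = 3      # assumed number of re-reads before trying the write test

def read_block(dev, n):
    """Return block n, or None if the read fails or comes back short."""
    try:
        dev.seek(n * BLOCK)
        data = dev.read(BLOCK)
    except OSError:
        return None
    return data if len(data) == BLOCK else None

def write_readback_ok(dev, n, pattern):
    """Selective write of a test pattern, then read back and compare.
    Destructive: the previous content of block n is lost."""
    try:
        dev.seek(n * BLOCK)
        dev.write(pattern)
        dev.flush()
    except OSError:
        return False
    return read_block(dev, n) == pattern

def scan(dev, nblocks):
    """Map each block to '.', 'T' (transient error) or 'B' (permanent)."""
    result = []
    for n in range(nblocks):
        if read_block(dev, n) is not None:
            result.append(".")                       # clean first read
        elif any(read_block(dev, n) is not None for _ in range(RETRIES)):
            result.append("T")                       # recovered on re-read
        elif write_readback_ok(dev, n, b"\xAA" * BLOCK):
            result.append("T")                       # recovered after rewrite
        else:
            result.append("B")                       # permanent bad block
    return "".join(result)
```

`dev` can be any seekable binary file object opened for reading and writing. On a real device the read-back would also have to bypass the operating system's cache, otherwise the compare may never touch the medium; that part is not shown here.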

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #30 on: February 15, 2021, 06:16:20 am »
The DRAM controller should use 36 bits for 32 bits of data, which means 4 bits for the check: isn't that enough?

That is enough to detect and correct only a 1-bit error (flip). It can detect more, especially 2-bit errors at the same address, but it hasn't got enough information to correct them, so the system is likely to fail/crash/reboot or whatever happens in that circumstance.

Four bits isn't enough to detect and correct single bit errors on 36 bits - it only gives 16 'check values' for 36 possible bit flips....
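
The counting argument can be checked with the Hamming bound: r check bits can correct any single flip in m data bits only if 2^r >= m + r + 1 (one syndrome per possible flip position, plus one for "no error"). A small sketch in Python; the function name is made up for illustration.

```python
def min_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

for m in (8, 32, 64):
    sec = min_check_bits(m)
    # One extra overall parity bit upgrades SEC to SEC-DED.
    print(f"{m} data bits: {sec} check bits for SEC, {sec + 1} for SEC-DED")
```

For 32 data bits this gives 6 check bits (7 with double-error detection), and for 64 data bits 7 (8 with double-error detection, i.e. a 72-bit word).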
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #31 on: February 15, 2021, 11:32:00 am »
4 bit to check (and only to check, it can't be fixed) a 32 bit word, 32+4 = 36bit

A single-bit error is when one bit { "0", "1"} of a 4-byte data word (32 bits) is changed to the opposite value. There is 1 bit to check each byte (8 bits). 4x8 = 32 bits of data, 4x1 = 4 check bits -> 36-bit ECC module!

Code: [Select]
byte-0  bc0  byte-1  bc1  byte-2  bc2  byte-3  bc3
76543210  0  76543210  0  76543210  0  76543210  0

check          32bit data word
 bc   byte-0   byte-1   byte-2   byte-3
3210 76543210 76543210 76543210 76543210
36bit ram module

Quote
It is the most likely error to corrupt data, as it is so small that the computer may not automatically recognize it as incorrect data.  Multiple bit errors, more than 1 bit being simultaneously affected, are more likely to occur, but less likely to be accepted by the computer as valid input.  Multiple bit errors can be detected by single-bit ECC, but may not be corrected by it in all instances.  Instead the system ignores it and reloads the data.

Bingo! And here we go! When the ram-disk controller needs to reload the data, it takes more time to complete the IOp. If this happens too often during a bank scan, it means that bank has some physical problem and is going to become faulty due to a hardware error.
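
One way to turn that timing idea into a check, sketched in Python under stated assumptions: the per-IOp latencies have already been measured and grouped by bank, an IOp counts as "slow" when it takes more than `slow_factor` times the median, and a bank becomes suspect when more than `max_slow_ratio` of its IOps are slow. The function name and both thresholds are hypothetical and would need tuning on real hardware.

```python
from statistics import median

def suspect_banks(latencies_by_bank, slow_factor=3.0, max_slow_ratio=0.01):
    """Return the sorted bank ids whose share of slow IOps exceeds max_slow_ratio.

    latencies_by_bank maps a bank id to a list of IOp times (any unit);
    it must contain at least one measurement.
    """
    all_latencies = [t for ts in latencies_by_bank.values() for t in ts]
    baseline = median(all_latencies)      # typical IOp time across the device
    suspects = []
    for bank, ts in latencies_by_bank.items():
        slow = sum(1 for t in ts if t > slow_factor * baseline)
        if ts and slow / len(ts) > max_slow_ratio:
            suspects.append(bank)
    return sorted(suspects)
```

Using the median as the baseline keeps a handful of reloads from shifting the notion of a "normal" IOp time.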

I assume there are two types of single-bit memory errors:
  • hard errors
    caused by hardware damage, mishandled hardware, or simply by stress or a damaged capacitor over time.
  • soft errors
    in a ram-disk, where the storage medium is an ECC array of DRAM, these are caused by data being written or read differently than originally intended
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4843
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #32 on: February 15, 2021, 06:15:42 pm »
Four bits isn't enough to detect and correct single bit errors on 36 bits - it only gives 16 'check values' for 36 possible bit flips....

The thread has got complicated. What I was talking about and what the OP is talking about are very similar, but COMPLETELY different, systems.

I was talking about 64-bit memory + ECC, which is specified as fully 1-bit error correctable and 2-bit error detectable, while errors of more bits MAY be detected (or NOT).

I.e. a continuation of this bit:
E.g. general DRAM memory bit flips, not all of which are handled by ECC systems. E.g. multiple-bit flips in the same location in memory (64 bits), or even the address being used by the DRAM itself being corrupted (bit-flipped), so that the wrong memory address is read from or written to. Some systems CRC-check the address information as well.

BUT the OP has been talking about 32-bit systems, whereas I was talking about 64-bit systems + enough ECC protection for all 1-bit errors, etc.

I can see why my post(s) can be misunderstood/confusing.

I don't know which 32-bit + 4-bit checking system the OP is talking about, i.e. what system, OS, etc. So I will leave your question for them to answer/address.

In summary, I was really referring to something which handles 64-bit data + full 1-bit error-correcting ECC.

I should NOT have posted like that. I should have been much clearer and stated that I was talking about the 64-bit + ECC system I was originally talking about.


« Last Edit: February 15, 2021, 06:18:42 pm by MK14 »
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #33 on: February 15, 2021, 07:44:29 pm »
Yup, my fault. Parity memory and ECC are "similar" (both detect single-bit errors) but not the same: parity is error detection, ECC is detection and correction. My posts may look confusing.
  • Parity memory
    provides for the detection of, but not the correction of, single-bit errors. Parity cannot reliably detect multi-bit errors (an even number of flips within the same byte goes unnoticed)
  • ECC memory
    provides for the detection of, and the correction of, single-bit errors. ECC memory can detect but not correct multi-bit errors.

What I meant, and what's important for me: ECC memory can detect multi-bit errors  :D
« Last Edit: February 15, 2021, 07:48:29 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: MK14

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #34 on: February 15, 2021, 07:46:07 pm »
32 data bits -> 4 bytes -> 4 parity bits
64 data bits -> 8 bytes -> 8 parity bits

That's perfectly consistent with my previous post: 1 parity bit for each byte!
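
A small Python sketch of that one-parity-bit-per-byte layout for a 32-bit word. Even parity and the function names are assumptions made for illustration. It shows what this scheme can and cannot see: a single flip is detected, and the affected byte is known, while two flips in the same byte cancel out.

```python
def parity_bits(word):
    """Return 4 even-parity bits for a 32-bit word; bit i covers byte i."""
    bits = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        bits |= (bin(byte).count("1") & 1) << i
    return bits

def check(word, stored):
    """Return the indices of the bytes whose parity no longer matches."""
    diff = parity_bits(word) ^ stored
    return [i for i in range(4) if diff >> i & 1]

word = 0xDEADBEEF
stored = parity_bits(word)
print(check(word ^ 0x00000100, stored))  # one flip in byte 1        -> [1]
print(check(word ^ 0x00000300, stored))  # two flips in byte 1       -> []
print(check(word ^ 0x01000100, stored))  # one flip in bytes 1 and 3 -> [1, 3]
```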
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: MK14

Offline Haenk

  • Super Contributor
  • ***
  • Posts: 1213
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #35 on: February 18, 2021, 12:54:15 pm »
Btw. "Hard Disk Sentinel" is a very similar tool; I have been using it for a decade or so. It seems there is a Linux version now, too.
It works on SSDs as well, though I'm not sure what to make of the results, as the storage is not per se block-addressable (automatic remap on bit errors).

The automatic remapping to spare blocks makes SSD analysis rather pointless: if you access a bad block, it will be remapped and the bad block will appear as "good". So watching the SMART data while scanning the blocks might be an idea to "beat the system".
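
A sketch of that idea in Python, assuming the raw SMART values have already been read into dictionaries keyed by attribute ID before and after a scan pass (for example with a tool such as smartctl; the reading step is not shown). The attribute IDs 5, 197 and 198 are the commonly used ones on ATA drives, but SSD vendors differ, so treat the table and the function name as assumptions.

```python
# Commonly used ATA attribute IDs; vendor-specific on many SSDs (assumption).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def smart_delta(before, after, watched=WATCHED):
    """Return {attribute name: increase} for watched raw values that grew."""
    delta = {}
    for attr_id, name in watched.items():
        grown = after.get(attr_id, 0) - before.get(attr_id, 0)
        if grown > 0:
            delta[name] = grown
    return delta

# Example: two sectors were remapped while the blocks were being scanned.
print(smart_delta({5: 0, 197: 1}, {5: 2, 197: 1}))
```

A non-empty result after a scan that reported no I/O errors would be a hint that the drive silently remapped something along the way.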
 
The following users thanked this post: DiTBho

