Author Topic: how can I voluntarily damage an SSD to test a testing program?  (Read 4048 times)

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4230
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #25 on: February 14, 2021, 01:14:10 pm »
If your testing program has to handle so many (three) VERY significantly different storage media, it could be spreading your time and other resources too thinly to do a decent job.

  • electro-mechanical HDDs: already supported; errors are detected at fine granularity, at block level (512 bytes)
  • ram-disks: already supported; errors are detected not at block level but at the coarser bank level (one defective IOp in a 4 GByte DRAM stick)
  • SSDs: ... not yet supported, but the SMART module is taking its first steps ... waiting for laboratory guinea pigs

It can be done, and it will be done :D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: MK14

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4230
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #26 on: February 14, 2021, 01:20:03 pm »
The DRAM controller should use 36 bits for 32 bits of data, which means 4 bits for the check: isn't that enough?

That is enough to detect and correct, only 1 bit error (flip). It can detect more, especially 2 bit errors at the same address, but hasn't got enough information to correct them, so is likely to fail/crash/reboot or whatever happens in that circumstance.

It doesn't need to fix it, it needs to detect it. So it's enough for what I need  :D

Yesterday, I verified a couple of deliberately damaged RAM sticks, moving them to random positions: always detected as defective, no false positives and no false negatives! So this method seems to work!

The ram-disk allocates banks linearly and always in the same way. It helps a lot! SSDs are not so generous and not so simple.

I agree SSDs need a different and dedicated new strategy!
 
The following users thanked this post: MK14

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 742
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #27 on: February 14, 2021, 01:30:46 pm »
About the spinning rust: are you sure the device tells you the truth? HDDs have done defect replacement for ages. The question is: do you get direct media access (before or after the drive's error correction?), or do you trust the blue-sky information from the drive? The SMART values show you how many sectors have been replaced (https://kb.acronis.com/node/19836 hopefully you'll see the page in your language).

To my understanding, your approach creates more defects than there are spare sectors. You should use an error-free drive and a very thin needle to create a tiny spot (just a touch, no lateral movements). You should then see the reallocated sector count increase, and probably no data errors (the drive will correct them, as most of the defective sectors are still fine), as soon as you try to access (read or write) the data of that sector.
 
 
The following users thanked this post: MK14, DiTBho

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4923
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #28 on: February 14, 2021, 02:14:43 pm »
It doesn't need to fix it, it needs to detect it. So it's enough for what I need  :D

Yesterday, I verified a couple of deliberately damaged RAM sticks, moving them to random positions: always detected as defective, no false positives and no false negatives! So this method seems to work!

The ram-disk allocates banks linearly and always in the same way. It helps a lot! SSDs are not so generous and not so simple.

I agree SSDs need a different and dedicated new strategy!

It's good that you have made progress with your RAM checking strategies/software. People sometimes sell faulty RAM sticks (especially older-generation ones gathering dust in a pile on their workbench), possibly even piles of them, fairly cheaply on eBay. That would be another way of getting 'bad' RAM sticks, and it might also give you more genuine and realistic fault conditions.

You're right, SSDs are tricky!
The thing is (going by rumors I've heard), even when they are working 100% perfectly, they can be problematic: for example, big performance losses because too little free space was left on the SSD, so it fell back to slower and slower strategies to read and especially write to the disk.

When/if you buy actual raw flash chips, the bigger types (smaller types can be guaranteed to be 'perfect') can actually come with defective 'sectors' on the flash chip. I.e. they only guarantee that you will get at least a certain number of 'marked as good' sectors; the rest can be marked on the flash chip as 'bad', 'please do not use, failed testing'.
I'm a bit green as regards these chips, but understanding the differences between NAND and NOR flash types seems to be a good starting point.

When reading those datasheets, it makes me wonder how, at the factory, they are able to rapidly check each flash chip and its sectors (or whatever its memory sub-units are called) for reliability, without needing long test times (usually highly undesirable on the production line) and especially without actually life testing ('burn out, wear out') the flash cells/memory.

I suppose they could use high temperatures (but that makes the testing more expensive and time-consuming as well) and measure leakage currents or something. Yes, I do wonder how they do such advanced testing, and so quickly, on the production line?

I do know that they can do extensive batch testing on a small number of chips from each batch, which could involve testing them to destruction (e.g. wear-out), but I can't see how that would help them know which sectors on a flash chip are bad.

Wild speculation:
Maybe it is all to do with the quality of the insulation layer(s), between adjacent flash memory cells. Which they could carefully measure, using expensive/accurate/calibrated/sensitive current/voltage measurements, using secret test patterns and pins/connect-points, possibly at raised temperatures.

The reason I'm saying the above is that an SSD really boils down to the flash chip(s) inside it, plus some controller/interface stuff built into it.
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4230
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #29 on: February 14, 2021, 04:47:17 pm »
About the spinning rust: are you sure the device tells you the truth? HDDs have done defect replacement for ages. The question is: do you get direct media access (before or after the drive's error correction?), or do you trust the blue-sky information from the drive? The SMART values

There are units that do not have any SMART implemented! E.g. certain Micro-Drives, certain FC-disks, and Disk-Rams

If SMART is available, my testing program considers its answers before running tests.
If SMART is not available, my testing program scans all the disk blocks. If it finds a defect, it iterates around it and checks whether it gets corrected after multiple accesses. If it's not corrected after a selective write and read-back, then it's a permanent bad block.
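
The no-SMART path described above could be sketched roughly like this. It is only an illustration under assumptions: `FakeDisk`, `classify_block`, the retry count and the test pattern are all hypothetical names and values, not the actual testing program, and the write step is destructive for the block it touches.

```python
from enum import Enum


class BlockState(Enum):
    GOOD = "good"
    TRANSIENT = "transient"   # failed at first, recovered on retry or after a rewrite
    PERMANENT = "permanent"   # still failing after a selective write and read-back


class FakeDisk:
    """Hypothetical in-memory stand-in for a block device (not a real API)."""

    def __init__(self, n_blocks, flaky=(), dead=()):
        self.n_blocks = n_blocks
        # flaky blocks fail a fixed number of reads (2 here) before recovering
        self.flaky = {lba: 2 for lba in flaky}
        self.dead = set(dead)
        self.data = {}

    def read(self, lba):
        if lba in self.dead:
            raise OSError(f"unrecoverable read error at block {lba}")
        if self.flaky.get(lba, 0) > 0:
            self.flaky[lba] -= 1
            raise OSError(f"transient read error at block {lba}")
        return self.data.get(lba, bytes(512))

    def write(self, lba, payload):
        if lba in self.dead:
            raise OSError(f"write error at block {lba}")
        self.data[lba] = payload


def classify_block(disk, lba, retries=3, pattern=b"\xa5" * 512):
    """Read once; on failure retry, then try a write and read-back."""
    try:
        disk.read(lba)
        return BlockState.GOOD
    except OSError:
        pass
    for _ in range(retries):          # "corrected after multiple accesses?"
        try:
            disk.read(lba)
            return BlockState.TRANSIENT
        except OSError:
            continue
    try:                              # selective write and read-back (destructive)
        disk.write(lba, pattern)
        if disk.read(lba) == pattern:
            return BlockState.TRANSIENT
    except OSError:
        pass
    return BlockState.PERMANENT


def scan(disk):
    """Return {block number: state} for every block that is not GOOD."""
    report = {}
    for lba in range(disk.n_blocks):
        state = classify_block(disk, lba)
        if state is not BlockState.GOOD:
            report[lba] = state
    return report
```

For example, `scan(FakeDisk(8, flaky=[2], dead=[5]))` reports block 2 as transient and block 5 as permanent. On a real drive the read and write calls would go to the raw device with caching bypassed, which this sketch does not attempt.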

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #30 on: February 15, 2021, 06:16:20 am »
The DRAM controller should use 36 bits for 32 bits of data, which means 4 bits for the check: isn't that enough?

That is enough to detect and correct, only 1 bit error (flip). It can detect more, especially 2 bit errors at the same address, but hasn't got enough information to correct them, so is likely to fail/crash/reboot or whatever happens in that circumstance.

Four bits isn't enough to detect and correct single bit errors on 36 bits - it only gives 16 'check values' for 36 possible bit flips....
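
The counting argument above is the usual Hamming bound: to correct any single flipped bit, r check bits must give a distinct non-zero syndrome for each of the k + r bit positions, plus one value for "no error", so 2^r >= k + r + 1. A small sketch (the function name `min_check_bits` is just for illustration) that computes the smallest r:

```python
def min_check_bits(data_bits, secded=False):
    """Smallest number of check bits for single-error correction.

    Uses the Hamming bound 2**r >= data_bits + r + 1. With secded=True one
    extra overall parity bit is added so double-bit errors are also detected.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1 if secded else r


print(min_check_bits(32))               # 6
print(min_check_bits(32, secded=True))  # 7
print(min_check_bits(64, secded=True))  # 8, as on 72-bit ECC modules
```

So 4 extra bits on a 32-bit word cannot be a correcting code; they are enough for one parity bit per byte, which only detects.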
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4230
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #31 on: February 15, 2021, 11:32:00 am »
4 bits to check (and only to check, it can't be fixed) a 32-bit word: 32+4 = 36 bits

A single-bit error is when one bit ("0" or "1") of a 4-byte data word (32 bits) is changed to the opposite value. There is 1 check bit for each byte (8 bits): 4x8 = 32 data bits, 4x1 = 4 check bits -> 36-bit ECC module!

Code:
byte-0  bc0  byte-1  bc1  byte-2  bc2  byte-3  bc3
76543210  0  76543210  0  76543210  0  76543210  0

check          32bit data word
 bc   byte-0   byte-1   byte-2   byte-3
3210 76543210 76543210 76543210 76543210
36bit ram module
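
A minimal sketch of the one-check-bit-per-byte layout shown above, assuming even parity and byte-0 as the least significant byte (both are assumptions for illustration; the real controller may differ). It shows what this scheme catches and what it misses:

```python
def byte_parity(b):
    """Even-parity bit for one byte: 1 if the byte has an odd number of ones."""
    return bin(b & 0xFF).count("1") & 1


def parity_nibble(word):
    """Four check bits bc3..bc0 for a 32-bit word, one per byte."""
    bits = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        bits |= byte_parity(byte) << i
    return bits


def check(word, stored_bits):
    """True if the word still matches the check bits stored with it."""
    return parity_nibble(word) == stored_bits


word = 0x12345678
stored = parity_nibble(word)

print(check(word ^ (1 << 5), stored))   # False: one flipped bit is detected
print(check(word ^ 0b11, stored))       # True: two flips in the same byte go unnoticed
print(check(word ^ 0x0101, stored))     # False: two flips in different bytes are detected
```

Nothing here can say which bit flipped, so the scheme can only detect, never correct.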

Quote
It is the most likely error to corrupt data, as it is so small that the computer may not automatically recognize it as incorrect data.  Multiple bit errors, more than 1 bit being simultaneously affected, are more likely to occur, but less likely to be accepted by the computer as valid input.  Multiple bit errors can be detected by single-bit ECC, but may not be corrected by it in all instances.  Instead the system ignores it and reloads the data.

Bingo! And here we go! When the ram-disk controller needs to reload the data, it takes more time to complete the IOp, and if this happens too often during a bank scan, it means that the bank has some physical problem and is going to become faulty due to a hardware error.
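
The "too many slow IOps in one bank" idea could look something like the sketch below. Everything in it is an assumption for illustration: the function name `suspect_banks`, the use of the overall median as the baseline, and the two thresholds are invented here and would need tuning against real hardware.

```python
from statistics import median


def suspect_banks(latencies_by_bank, slow_factor=3.0, max_slow_ratio=0.01):
    """Flag banks where too many IOps took much longer than usual.

    latencies_by_bank maps a bank id to the list of per-IOp times (seconds)
    measured during a scan. An IOp counts as slow when it takes more than
    slow_factor times the median of all measurements. A bank is suspect when
    its share of slow IOps exceeds max_slow_ratio.
    """
    all_times = [t for times in latencies_by_bank.values() for t in times]
    if not all_times:
        return {}
    baseline = median(all_times)
    suspects = {}
    for bank, times in latencies_by_bank.items():
        if not times:
            continue
        slow = sum(1 for t in times if t > slow_factor * baseline)
        ratio = slow / len(times)
        if ratio > max_slow_ratio:
            suspects[bank] = ratio
    return suspects
```

With a healthy bank of 100 uniform IOps and a second bank where 10 out of 100 IOps take five times longer, only the second bank is returned, with a slow ratio of 0.1.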

I assume there are two types of single-bit memory errors:
  • hard errors
    hardware damage or mishandled hardware; it can also simply be caused by stress, or a damaged capacitor, over time.
  • soft errors
    in a ram-disk, where the storage-medium is an ECC array of dram, it's caused by data being written or read differently than originally intended
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4923
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #32 on: February 15, 2021, 06:15:42 pm »
Four bits isn't enough to detect and correct single bit errors on 36 bits - it only gives 16 'check values' for 36 possible bit flips....

The thread has got complicated, and what I was talking about and what the OP is talking about are very similar, but COMPLETELY different systems.

I was talking about 64 bit memory + ECC, where they specify it is fully one bit ECC correctable, 2 bits ECC error detectable, and more bit errors MAY be detected (or NOT).

I.e. A continuation of this bit:
E.g. General DRAM memory bit flips, not all of which are handled by ECC systems. E.g. multiple-bit flips in the same location in memory (64 bits), or even the address being used by the DRAM itself being corrupted (bit-flipped), so that the wrong memory address is read from or written to. Some systems CRC check the address information as well.

BUT the OP, has been talking about 32 bit systems, I was talking about 64 bit systems + enough ECC protection for all 1 bit errors, etc etc.

I can see why my post(s), can be misunderstood/confusing.

I don't know which 32-bit + 4-bit check-information system the OP is talking about, i.e. what system, OS, etc. So I will leave your question for them to answer/address.

In summary, I was really referring to something which handles 64 bit data + Full 1 bit error correcting, ECC.

I should NOT have posted like that. I should have been much clearer and stated that I was talking about the 64 bit + ECC system I was originally talking about.


« Last Edit: February 15, 2021, 06:18:42 pm by MK14 »
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4230
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #33 on: February 15, 2021, 07:44:29 pm »
Yup, my fault: parity memory and ECC are "similar" (both detect single-bit errors) but not the same. Parity is error detection only, ECC is detection and correction, and my posts may look confusing.
  • Parity memory
    provides for the detection, but not the correction, of single-bit errors. Parity cannot reliably detect multi-bit errors: an even number of flipped bits under the same parity bit goes unnoticed
  • ECC memory
    provides for the detection and the correction of single-bit errors. ECC memory can detect, but not correct, double-bit errors; errors of more bits may or may not be detected.

What I meant, and what's important for me: ECC memory can detect multi-bit errors  :D
« Last Edit: February 15, 2021, 07:48:29 pm by DiTBho »
 
The following users thanked this post: MK14

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4230
  • Country: gb
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #34 on: February 15, 2021, 07:46:07 pm »
32 data bits -> 4 bytes -> 4 parity bits
64 data bits -> 8 bytes -> 8 parity bits

That's perfectly coherent to my previous post: 1 parity bit for each byte!
 
The following users thanked this post: MK14

Online Haenk

  • Super Contributor
  • ***
  • Posts: 1231
  • Country: de
Re: how can I voluntarily damage an SSD to test a testing program?
« Reply #35 on: February 18, 2021, 12:54:15 pm »
Btw. "Hard Disk Sentinel" is a very similar tool; I have been using it for a decade or so. It seems there is a Linux version now, too.
It works on SSDs as well, though I'm not sure what to make of the results, as the storage is not per se block addressable (automatic remap on bit errors).

The automatic remapping to spare blocks makes SSD analysis rather pointless: if you access a bad block, it will be remapped and will then appear as "good". So watching the SMART data while scanning the blocks might be a way to "beat the system".
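
One way to try the "watch SMART while scanning" idea is to compare the reallocated sector count before and after a scan. The sketch below assumes smartmontools 7.x, where `smartctl -A -j` prints the attributes as JSON, and assumes the ATA layout `ata_smart_attributes.table[]` with `id` and `raw.value` fields; drives that do not report attribute 5 (many NVMe SSDs, for instance) simply yield `None`. The helper names are made up for this example.

```python
import json
import subprocess

REALLOCATED_SECTOR_CT = 5  # ATA SMART attribute ID for reallocated sectors


def reallocated_count(report):
    """Pull the raw reallocated-sector count out of a smartctl JSON report."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr.get("id") == REALLOCATED_SECTOR_CT:
            return attr.get("raw", {}).get("value")
    return None  # attribute not reported by this drive


def read_smart(device):
    """Run smartctl on the device and parse its JSON output (needs privileges)."""
    # smartctl uses a bitmask exit status, so a non-zero code is not fatal here
    out = subprocess.run(
        ["smartctl", "-A", "-j", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(out.stdout)


def remapped_during(scan, device, read=read_smart):
    """Run scan() and return how many sectors the drive remapped meanwhile."""
    before = reallocated_count(read(device))
    scan()
    after = reallocated_count(read(device))
    if before is None or after is None:
        return None
    return after - before
```

Passing a different `read` function makes it possible to test the logic without a real drive. Whether a given SSD updates the counter promptly enough for this to be useful is something only an experiment on that drive can tell.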
 
The following users thanked this post: DiTBho

