Additionally, SSDs remap sectors very extensively, so writing a 'bad block' testing/detection program is probably extremely challenging: the drive won't readily tell you where sectors really are, and/or won't let you access the 'removed from use' (bad-block) ones, etc.
Yup, this is a kind of "legacy" requirement, needed to also test electro-mechanical HDDs.
The testing program needs to support both technologies, and offer the user the possibility to set up the testing mode.
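Something along these lines, as a minimal sketch (the mode names and the enum are illustrative assumptions, not a real program's interface):

```c
/* Minimal sketch of a user-selectable test mode; all names here
 * (enum values, option strings) are illustrative assumptions. */
#include <stdio.h>
#include <string.h>

enum test_mode { MODE_HDD, MODE_SSD, MODE_RAMDISK };

static int parse_mode(const char *arg, enum test_mode *mode)
{
    if (strcmp(arg, "hdd") == 0)      *mode = MODE_HDD;
    else if (strcmp(arg, "ssd") == 0) *mode = MODE_SSD;
    else if (strcmp(arg, "ram") == 0) *mode = MODE_RAMDISK;
    else return -1;
    return 0;
}

int main(int argc, char **argv)
{
    enum test_mode mode;
    if (argc < 3 || parse_mode(argv[1], &mode) != 0) {
        fprintf(stderr, "usage: %s hdd|ssd|ram /dev/device\n", argv[0]);
        return 1;
    }
    printf("testing %s in mode %d\n", argv[2], (int)mode);
    /* ... dispatch to the technology-specific test sub-module ... */
    return 0;
}
```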
[..] which is the remapped and fault-corrected sectors, until it gets too many 'bad blocks' and runs out of over-provisioning (and maybe other) areas
This is what I am trying to study. I have zero experience with SSDs and flash-based storage devices.
ram-disks (volatile, not permanent, used by industrial machines with high IOp)
I also have to support and test the ram-disk, and in this case I have to test the amount of IOp that goes through an I/O system request. If there is a delay on a ram-disk built with ECC RAM, it means something in the DRAM controller encountered a problem: the CRC failed and caught a parity error, and the controller tried to repeat the operation. If this happens too many times... well, you don't see it happening from user-space, and even the kernel doesn't know anything about it; only the controller inside the ram-disk knows what's happening, and it won't tell you anything unless there is a solid failure, which is too late.
The failure can be detected before it becomes serious: if you disable caching, issue a large amount of IOp, measure the time, and notice a selective delay in a certain area, it probably means something is about to go wrong with the decoupling capacitors, or with something else on the physical PCB in the specific area where the DRAM chips are mounted! Or, worse still, part of the ECC RAM is becoming faulty and unreliable, and needs to be replaced.
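In practice the scan looks something like this minimal Linux sketch, where O_DIRECT bypasses the page cache so the timings reflect the device itself (the block size, the 3x threshold, and the device path are illustrative assumptions, not my real program's values):

```c
/* Latency scan sketch: O_DIRECT sequential reads, flagging regions that
 * are noticeably slower than the running average. Thresholds are arbitrary. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE (1 << 20)   /* 1 MiB per read; illustrative */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s /dev/device\n", argv[0]); return 1; }

    /* O_DIRECT bypasses the page cache, so we time the device itself. */
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK_SIZE)) { perror("posix_memalign"); return 1; }

    double avg = 0.0;
    uint64_t n = 0;
    for (;;) {
        double t0 = now_sec();
        ssize_t r = read(fd, buf, BLOCK_SIZE);
        double dt = now_sec() - t0;
        if (r <= 0) break;              /* EOF or error: stop the scan */
        n++;
        avg += (dt - avg) / n;          /* running average of read latency */
        if (n > 16 && dt > 3.0 * avg)   /* "selective delay": 3x slower than usual */
            printf("suspicious region near block %llu: %.3f ms (avg %.3f ms)\n",
                   (unsigned long long)(n - 1), dt * 1e3, avg * 1e3);
    }
    free(buf);
    close(fd);
    return 0;
}
```

Mapping a suspicious block range back to a physical bank then depends on how the ram-disk controller interleaves addresses across the sticks, which is device-specific.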
My ram-disk has several RAM sticks installed, something like 32 banks of 4 GByte each. I did an experiment with a deliberately damaged ECC RAM stick in one of the 32 banks (I removed a capacitor from an old stick), and my testing program caught it: bank#4 looks suspicious, slow IOp, please check it!
Bingo!
SMART does somehow support this; unfortunately, my ram-disk devices do not have these diagnostic features implemented. If I issue a check, it always returns "not supported" or, worse still, "all OK".
That's why I have added SMART as an additional layer to run cross-tests: this way I can write a single C program, organized in C sub-modules, and use it to test three kinds of storage technology: HDDs, SSDs, and RAM-disks!
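For illustration, the SMART health query of the cross-test layer can be issued by shelling out to smartctl from smartmontools (just a sketch; /dev/sda is a placeholder, and a real program may prefer to talk to the device via ioctls):

```c
/* Sketch: ask smartctl (smartmontools) for the overall health verdict.
 * On my ram-disks this is where "not supported" / "all OK" comes back. */
#include <stdio.h>

int main(void)
{
    /* "-H" prints the overall SMART health self-assessment. */
    FILE *p = popen("smartctl -H /dev/sda", "r");
    if (!p) { perror("popen"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, p))
        fputs(line, stdout);    /* e.g. "SMART overall-health ... PASSED" */
    return pclose(p) == 0 ? 0 : 1;
}
```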
If your testing program has to handle so many (three) VERY significantly different storage mediums, it could be spreading your time and other resources too thinly to do a decent job.
My understanding is that it is usually recommended NOT to use ram disks for the real (actual/written) main/sole data store. But if the data is properly stored elsewhere, so the ram disk is just there to speed things up or something, that is fine. It is a complicated subject area, so there are many other exceptions.
This is because a ram disk is too susceptible to mechanisms which might corrupt, or even fully wipe, all the data, suddenly and without much/any warning.
E.g. general DRAM memory bit flips, not all of which are handled by ECC systems: multiple-bit flips in the same memory location (64 bits), or even the address being used by the DRAM itself being corrupted (bit-flipped), so that the wrong memory address is read from or written to. Some systems CRC-check the address information as well. (There is a small parity sketch after this list, illustrating the multi-bit-flip point.)
Power failures of various kinds, not all of which will be prevented/stopped by UPS and other protective systems.
Various hardware failures, e.g. a single power supply failing, or possibly a shorted device if dual power supplies are fitted. I.e. if the data had been on an HDD, the hardware failure shouldn't have damaged the data, but the volatile RAM would lose all of it.
Various system crashes MIGHT damage the ram disk's data.
The above are just some examples of how it can go wrong, and why it is not recommended.
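To illustrate the bit-flip point from the list above: plain even parity is weaker than real ECC (which is typically SECDED), but this small C sketch shows the principle, namely that an even number of flips in the same word cancels out and goes undetected:

```c
/* Sketch: why multi-bit flips can evade simple error detection.
 * Plain even parity over a 64-bit word is weaker than real ECC
 * (typically SECDED), but the principle is visible: an even number
 * of bit flips leaves the parity unchanged, so the corruption is
 * invisible to the check. */
#include <stdint.h>
#include <stdio.h>

static int parity64(uint64_t w)   /* 1 if an odd number of bits are set */
{
    int p = 0;
    while (w) { p ^= 1; w &= w - 1; }
    return p;
}

int main(void)
{
    uint64_t word = 0xDEADBEEFCAFEF00Dull;   /* arbitrary stored data */
    int stored_parity = parity64(word);

    uint64_t flipped_once  = word ^ (1ull << 7);                 /* single flip */
    uint64_t flipped_twice = word ^ (1ull << 7) ^ (1ull << 42);  /* double flip */

    printf("single flip detected: %s\n",
           parity64(flipped_once)  != stored_parity ? "yes" : "NO");
    printf("double flip detected: %s\n",
           parity64(flipped_twice) != stored_parity ? "yes" : "NO");
    return 0;
}
```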
But ram disks open up a can of worms as regards possible differences of opinion. So I will end by saying that some system requirements, especially where very high performance is needed, may somewhat force the use of very fast ram disks in order to economically reach the desired performance.
Ideally such a solution will/should, as quickly as practicable, copy the data onto non-volatile and more reliable mediums.
SSDs and big SSD arrays are already so fast that they might be able to handle things quickly enough.
But developing software/techniques to detect when ECC RAM has gone (or is going) bad, and to identify which slot/bank has failed (or is failing), sounds like a good and interesting thing to research.
If you or the people you are helping want data reliability, what about things like ZFS file systems, decent backup solutions, and RAID (and similar) disk arrays?