That's wrong. RAID controllers don't see disk defects. Sector re-allocation is completely transparent on any modern (i.e. since the days of ATA) drive.
For anything somewhat modern (i.e. SAS/SATA) the RAID controller only sees a bunch of blocks.
The patrol read only serves to trigger the drive's internal defect management (data is read, which triggers an ECC check; if there's an error the data is rewritten or, if that fails, the block is moved to a spare). The controller itself has no part in handling disk defects, and if the controller starts seeing bad blocks then the drive is flagged as defective.
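To make that concrete, here's a toy sketch (purely illustrative, the Drive class and all numbers are made up) of what a patrol read boils down to from the controller's side: it just reads every block. Correctable errors get fixed inside the drive and only show up as a growing SMART-style counter; the controller only ever learns about a block once the read itself fails.

```python
class Drive:
    """Toy model of a drive with transparent internal defect management."""
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.weak_blocks = set()      # blocks with correctable ECC errors
        self.dead_blocks = set()      # blocks the drive can no longer recover
        self.reallocated = 0          # only visible via SMART, not via reads

    def read(self, lba):
        if lba in self.dead_blocks:
            return None               # uncorrectable: the host finally sees an error
        if lba in self.weak_blocks:
            # correctable error: the drive fixes it via ECC and rewrites/reallocates
            self.weak_blocks.discard(lba)
            self.reallocated += 1     # host never notices; only a SMART counter grows
        return b"\x00" * 512          # good data, as always


def patrol_read(drive):
    """What the RAID controller actually does: read every block, report failures."""
    return [lba for lba in range(drive.num_blocks) if drive.read(lba) is None]


d = Drive(1000)
d.weak_blocks.update({10, 20})
d.dead_blocks.add(500)
print(patrol_read(d))    # -> [500]; blocks 10 and 20 were silently repaired
print(d.reallocated)     # -> 2, visible only through SMART-style counters
```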
You are going to have a hard time convincing me of that since I have watched it in real time.
You have seen what, exactly?
Maybe you have only used crap RAID controllers, of which there are many.
I'm not sure I'd call HPE's Smart Array range (or Dell's PERC, or Microsemi's Adaptec) hardware RAID controllers "crap", as they represent the upper end of the market.
But I'm certainly not talking about fake RAID stuff if that's what you have in mind.
Hard drives do *not* perform a scrub-on-read operation for uncorrectable errors; instead they attempt to read the data until a timeout is reached, then return the bad data and indicate an error. "RAID" drives allow a shorter timeout to be set so they do not waste time trying to recover data which can be regenerated anyway.
That is correct when it comes to *uncorrectable* errors.
But as far as *correctable* errors are concerned, they (as the name implies) are corrected inside the drive when found during a read. The typical "bit rot" scenario is exactly that: a "flipped bit" on an otherwise healthy sector, which, when detected, results in a rewrite of that sector.
Now, as to *uncorrectable* errors, it is correct that the drive eventually returns corrupted data. However, when a hard drive reports an uncorrectable error, that's because the defect extends into the ECC area, so there is not enough redundancy left to reconstruct the original data. Which commonly means the drive is defective and should be replaced. It's as simple as that.
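A toy illustration of that distinction (the correction limit below is an arbitrary stand-in for whatever the drive's real ECC can handle):

```python
# Few bit errors -> corrected on read and the sector gets rewritten (the
# "bit rot gets healed" case); too many -> the drive must report an error.

T_CORRECTABLE = 8        # assumption: max bit errors the ECC can correct

def read_sector(bit_errors):
    if bit_errors == 0:
        return "clean read"
    if bit_errors <= T_CORRECTABLE:
        # correctable: ECC restores the data, the drive rewrites (or reallocates)
        # the sector; the host just gets good data and never sees the repair
        return "corrected and rewritten, host sees good data"
    # uncorrectable: the defect reaches into the ECC itself, so there is
    # nothing left to reconstruct from; the drive returns an error
    return "UNCORRECTABLE error reported to host"

for e in (0, 3, 50):
    print(e, "->", read_sector(e))
```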
When the RAID controller receives the bad data and an error, good ones regenerate the data and write it back to the drive, which then performs a scrub-on-write. If this did not occur, then there would be no reason for the RAID controller to perform idle time scrubbing.
Nope. What you wrote would be (mostly) correct if we were talking about MFM/RLL/ESDI or early IDE (or SCSI-1 drives) from some 30 years ago. But we're not.
Modern (i.e. made in 2000 or later) drives don't expose their physical layout to the host. For backward compatibility they report an artificial CHS layout which has nothing to do with the actual physical layout (so that antique systems still using CHS addressing can boot from these drives). Even the sector size is often fake, as most modern hard drives use 4k sectors internally while reporting 512-byte sectors on the interface.
But for the most part the CHS layout isn't even used. LBA (Logical Block Addressing) has been a thing since before the year 2000 (it was first used with SCSI drives long before then), and it's been the standard way of addressing disks for many years. With LBA, the host only sees a device with a certain number of blocks. There's no CHS involved. LBA has been supported at least since Windows 98 and NT 4.0, and became the standard with Windows 2000. And while LBA was an extension for IDE and ATA, it is the defined addressing standard for SATA, SAS and NVMe storage.
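To show how little the "geometry" matters, here's the classic CHS-to-LBA conversion using the usual fake 255/63 geometry that drives report; nothing in it corresponds to physical platters or heads.

```python
HEADS = 255               # reported heads per cylinder (fake)
SPT = 63                  # reported sectors per track (fake)

def chs_to_lba(c, h, s):
    """Standard CHS -> LBA mapping; note that sectors are 1-based in CHS."""
    return (c * HEADS + h) * SPT + (s - 1)

print(chs_to_lba(0, 0, 1))     # -> 0, the first block
print(chs_to_lba(1, 1, 1))     # -> 16128

# With LBA the host skips all of this and simply asks for "block N" out of
# however many blocks the drive says it has.
```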
Your RAID controller, whatever type/model that is, would have to be really old not to use LBA (and if the controller is that old then I guess the disks are, too). And even then it would only see the fake geometry reported by the drive, not the real one.
If a modern RAID controller encounters a bad block (unrecoverable error), it will try to reconstruct the data from the redundancy on the other disks and then may attempt to rewrite the block on the affected disk. If that disk has sufficient spare sectors, it can redirect the write to a spare block, after which the block will be fine and the integrity of the data is restored. If that was a one-off, the drive may well be fine for years. However, if it happens more often (as is the case on a dying drive), the disk will eventually run out of spare sectors, after which the RAID controller's attempt to rewrite the block will fail; the disk is then failed and the array goes into contingency mode.
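Here's a rough sketch of that recovery path using RAID-5 style XOR parity (disk contents and stripe size are made up, and a real controller obviously does this in firmware):

```python
def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# A stripe across four member disks: three data blocks plus their XOR parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# The disk holding d1 returns an unrecoverable read error for this block...
reconstructed = xor_blocks([d0, d2, parity])
assert reconstructed == d1

# ...so the controller writes `reconstructed` back to the affected disk.
# If that drive still has spare sectors it remaps the block transparently and
# the write succeeds; if the write fails, the drive gets failed out.
```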
It would be dumb to discard an entire drive because of a single bad read when the data can be recovered and written back to force the drive to reallocate that sector; of course firmware based RAID controllers often are this dumb.
In a professional environment, if a drive shows bad sectors at the interface it's scrapped, period. Any decent RAID controller will immediately flag a drive as soon as unrecoverable errors start to appear, because for a modern drive unrecoverable errors are usually a clear sign that the drive is defective, and the only thing that would be stupid is not to scrap the drive and risk the integrity of the host data. It's as simple as that. At the end of the day the host data (and the hourly rate for the admin who has to deal with it, and the potential fall-out should the drive remain in service) is worth a lot more than that stupid hard drive.
Now I accept that for hobbyist use this may well be different, and if you can't afford to replace a drive it's certainly tempting to work around the problem of a defective sector. But that is only a viable option if your data (and your time) isn't worth much (because if it were, you wouldn't try to cheap out on backups). And it doesn't change the fact that the drive is telling you that you can no longer rely on it.
The internal defect management could operate on a correctable error, but I have never seen it happen.
Well, yes, that's because it's supposed to be *transparent* to the host. Which is the whole point of a "defect-free interface".
I would have noticed if the defect list grew without errors being reported. Many times I have done extended SMART surface scans and watched for this very thing and it never happened. Most recently I have done it multiple times in the past couple of weeks on a pair of 1TB WD Greens, but doing an external surface scan which includes writing had some results.
Because you don't seem to fully understand what you are seeing. Any high-level surface scan tool only scans the area that the drive reports as good (it has no access to the whole drive area - "defect-free interface", remember?). Defects are hidden because defect management is completely *transparent*. The only way to check what's actually going on in the drive is through SMART data.
When you're at the point where your tool can "see" defects then that means the drive has developed unrecoverable errors and is defective and should be discarded, but at the very least should not be used to store anything of importance.
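If you want to watch the defect management instead of guessing from surface scans, pull the relevant SMART counters. Something like this works as a quick-and-dirty sketch around smartctl (from smartmontools); output formatting varies per drive and the device path is an example, so don't treat it as a robust parser:

```python
import subprocess

INTERESTING = ("Reallocated_Sector_Ct",    # ID 5: the grown defect list
               "Current_Pending_Sector",   # ID 197: sectors waiting for a rewrite
               "Offline_Uncorrectable")    # ID 198: sectors that could not be fixed

def smart_defect_counters(device="/dev/sda"):
    """Return the raw values of the defect-related SMART attributes."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in INTERESTING:
            counters[fields[1]] = int(fields[9])   # RAW_VALUE column
    return counters

print(smart_defect_counters())
```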
SSDs are a different matter, as they should not have any patrol reads done on them, much less any writes. SSDs also have internal defect management (which is part of their Garbage Collection), and unlike with hard drives this normally works without having to be triggered by reads (for SSDs, the trigger is the TRIM command).
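For completeness, the host-side half of that on Linux is just periodic TRIM, e.g. via fstrim from util-linux; a minimal sketch (the mountpoint is an example):

```python
import subprocess

def trim(mountpoint="/"):
    """Ask the filesystem to report unused blocks to the SSD via TRIM/discard."""
    # "fstrim -v" prints how many bytes were reported as discardable
    result = subprocess.run(["fstrim", "-v", mountpoint],
                            capture_output=True, text=True, check=False)
    return result.stdout.strip() or result.stderr.strip()

print(trim("/"))
```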
The difference is that SSDs must perform verification after every write as part of the write process, and they also scrub on read any sectors which are still correctable but have too many soft errors. Many perform idle time scrubbing because retention of modern Flash memory is abysmal, or at least their specifications say they do. I know for sure that many do not and will happily suffer bit rot while powered.
There's nothing in SSDs that matches "scrubbing". On flash, that would be completely counterproductive, because every read of a flash cell reduces the amount of charge stored in that cell; if an SSD did regular "read-scrubs", the charge level would quickly become so low that the sector would have to be rewritten to maintain its information, a process which causes wear on the flash cell.
In reality, SSDs employ ECC (although slightly more sophisticated than on spinning rust) for error correction, Garbage Collection (which erases blocks that are marked for deletion) and Wear Leveling (where the SSD keeps track of flash use and tries to balance the load evenly across all cells).
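A minimal sketch of the Wear Leveling part, just to show the "spread writes across physical blocks" idea (the TinyFTL class and block counts are invented for illustration, nothing here reflects any real controller):

```python
class TinyFTL:
    """Maps logical blocks to physical flash blocks, preferring the least-worn block."""
    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks   # wear per physical block
        self.mapping = {}                           # logical block -> physical block
        self.free = set(range(physical_blocks))

    def write(self, logical_block):
        # pick the least-worn free physical block for the new data
        target = min(self.free, key=lambda b: self.erase_counts[b])
        self.free.discard(target)
        old = self.mapping.get(logical_block)
        if old is not None:
            # the old copy becomes garbage: count its (eventual) erase by GC
            # and return the block to the free pool
            self.erase_counts[old] += 1
            self.free.add(old)
        self.mapping[logical_block] = target


ftl = TinyFTL(8)
for _ in range(20):
    ftl.write(0)                 # rewriting the same logical block over and over...
print(ftl.erase_counts)          # ...still spreads the wear across physical blocks
```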
(2) SSDs which perform idle time scrubbing *must* also support power loss protection because program and erase cycles can occur at any time, which implies that SSDs which do not support power loss protection do not perform idle time scrubbing.
That's wrong. PLP on SSDs is required to make sure that data in the drive's cache is written to the flash memory when a power loss occurs, so that it doesn't get lost. If it does get lost, that can affect the integrity of the OS-level data.
No, but there are two levels of power loss protection.
What you are describing is protection of data in transit which includes buffered and data in write cache. Not all drives have this nor is it required if the filesystem and drive handle synchronous writes properly. In the worst case, file contents are not updated but no metadata or other contents should be lost.
That is correct. PLP exists to protect host data.
However, power loss protection is also required for *any* write operation, and possibly any erase operation. The reason for this is that interrupted writes can corrupt not only the data which is being written, but also data in other sectors, including the flash translation table, which can result in a non-recoverable situation.
There are two reasons data can be so easily corrupted by an incomplete write. With multi-level flash, the existing data in a sector is at risk while it is being updated, and it is easy to see why. It seems to me that this is avoidable by only storing the data from one sector across the multiple levels, but apparently Flash chips are not organized that way.
First of all, SSDs are internally organized in blocks, not sectors. Remember the LBA I mentioned above? LBA is *exactly* how SSDs are structured internally. On SSDs, sectors are an artificial construct which has zero relation to the flash cells where the information is physically located.
Garbage Collection and Wear Levelling also don't need PLP. If power is interrupted during the deletion of a block, the block simply remains marked for deletion and the deletion is repeated after power comes back up. For Wear Levelling, when data is moved from one block to another, the data from the old block is copied to a new block and only then is the old block marked for deletion. If that process is interrupted, the old data is still there and the move is simply repeated.
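Sketched out, the ordering looks like this (all structures hypothetical); the point is that the old block is only queued for erasure after the copy and the remap have completed, so a power cut at any earlier point still leaves the data reachable through the old block:

```python
def relocate(flash_blocks, mapping, erase_queue, logical, new_phys,
             power_cut_at=None):
    """Move a logical block's data to a new physical block, copy-then-invalidate."""
    old_phys = mapping[logical]
    # step 1: copy the data to the new physical block
    if power_cut_at == 1:
        return "power lost during copy: mapping still points at the old block"
    flash_blocks[new_phys] = flash_blocks[old_phys]
    # step 2: switch the logical->physical mapping over
    if power_cut_at == 2:
        return "power lost before remap: mapping still points at the old block"
    mapping[logical] = new_phys
    # step 3: only now is the old block queued for erasure by GC
    erase_queue.append(old_phys)
    return "move complete, old block will be erased by GC"


blocks, mapping, erase_q = {0: b"data", 1: None}, {"L7": 0}, []
print(relocate(blocks, mapping, erase_q, "L7", 1, power_cut_at=2))
print(mapping["L7"])   # -> 0: the data was never at risk
```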
The other reason is more insidious; the state machine which controls the write operation can glitch during power loss,
The part in an SSD which controls write operations (and everything else) is the SSD controller, and that is not a state machine; it's a micro-controller running specific software (the drive's firmware) to perform its duties.
which is how drives which lack power loss protection got bricked so easily in those published tests several years ago. Some drives lost no data, not even data in transit, some drives lost only data in transit, and most suffered either unrelated data loss or complete failure, but that was before the problem was understood well.
I know that early consumer SSDs killed themselves (particularly those made by OCZ) for a number of reasons, most of which were related to the lack of automatic GC (the drive needed GC triggered by the OS via TRIM, which the then still common Windows XP didn't support) and a wide range of firmware errors.
But the point remains that the only "data in transit" in a SSD is host data, and that SSDs do not work the way you think they do.