Do not all manufacturers have automatic relocation and isolation of defective sectors?
Yes, they do.
How it works in real life is that when the drive detects it failed to correctly write a sector, it internally marks that sector as bad and uses a reserved spare sector for it instead. Drives have a fixed pool of internally reserved sectors that are impossible to access via the SATA/IDE/SCSI interfaces (except possibly on some specialized industrial-use SCSI/SAS drives only those with more money than sense can buy), and they do this mapping entirely internally. You may be able to detect such relocated sectors by timing reads and writes, but the fact that these drives have their own RAM caches makes it quite complicated.
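If you want to try anyway, here is a minimal sketch: time single-sector reads with the kernel page cache bypassed and look for latency outliers. The device name and the LBA list are placeholders, and the drive's internal RAM cache can still hide the extra seek, so treat any result as a hint at best.

#!/bin/bash
# Rough sketch: time O_DIRECT single-sector reads at a few LBAs.
# iflag=direct bypasses the kernel page cache but NOT the drive's own
# RAM cache; a remapped sector may show up as a latency outlier.
for lba in 0 1000000 2000000 3000000; do
    start=$(date +%s%N)
    dd if=/dev/sdX of=/dev/null bs=512 skip=$lba count=1 iflag=direct 2>/dev/null
    end=$(date +%s%N)
    echo "LBA $lba: $(( (end - start) / 1000 )) microseconds"
done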
For standard IDE, SATA, and SCSI drives, both spinny and SSD variants, SMART is the interface drives provide for monitoring things like temperature, reallocated sector counts, and so on.
On servers with spinny disks, you often have a SMART service/daemon (smartd on Linux machines) running. Its purpose is to read and rewrite sectors on the disk during I/O idle time, without much impact on other use, so that media failures are detected early. Remember, current drives only detect a failure when they try to write a sector; they do not usually notice errors on read. These daemons can also schedule online and offline media checks (internal self-tests the disk runs itself when not burdened by reads and writes; also exposed via SMART).
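For example, a minimal /etc/smartd.conf entry might look like the following; the device name and the test schedule here are just illustrative, in the style of the smartd.conf man page examples:

# Monitor /dev/sda with all default checks (-a), enable the drive's
# automatic offline testing (-o on) and attribute autosave (-S on),
# and schedule a short self-test every day at 02:00 plus a long
# self-test every Saturday at 03:00.
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)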
For example, here is the output of smartctl -a /dev/nvme0 on this particular HP EliteBook 840 G4 with a Samsung 512 GB SSD:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-45-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLW512HMJP-000H1
Serial Number: [omitted]
Firmware Version: CXY73H1Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 163,689,857,024 [163 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sat Oct 3 19:52:57 2020 EEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning Comp. Temp. Threshold: 68 Celsius
Critical Comp. Temp. Threshold: 71 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.60W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 5.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 1%
Data Units Read: 3,296,452 [1.68 TB]
Data Units Written: 19,111,259 [9.78 TB]
Host Read Commands: 41,091,531
Host Write Commands: 182,926,857
Controller Busy Time: 809
Power Cycles: 1,009
Power On Hours: 2,058
Unsafe Shutdowns: 36
Media and Data Integrity Errors: 0
Error Information Log Entries: 64
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 32 Celsius
Temperature Sensor 2: 34 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 64 0 0x0018 0x4004 0x02c 0 0 -
1 63 0 0x0017 0x4004 0x02c 0 0 -
2 62 0 0x0018 0x4004 0x02c 0 0 -
3 61 0 0x0017 0x4004 0x02c 0 0 -
4 60 0 0x0018 0x4004 0x02c 0 0 -
5 59 0 0x0017 0x4004 0x02c 0 0 -
6 58 0 0x0018 0x4004 0x02c 0 0 -
7 57 0 0x0017 0x4004 0x02c 0 0 -
8 56 0 0x0018 0x4004 0x02c 0 0 -
9 55 0 0x0017 0x4004 0x02c 0 0 -
10 54 0 0x0018 0x4004 0x02c 0 0 -
11 53 0 0x0017 0x4004 0x02c 0 0 -
12 52 0 0x0018 0x4004 0x02c 0 0 -
13 51 0 0x0017 0x4004 0x02c 0 0 -
14 50 0 0x0018 0x4004 0x02c 0 0 -
15 49 0 0x0017 0x4004 0x02c 0 0 -
... (48 entries not shown)
For spinny disks, I replace drives whenever there are any reallocated sectors, i.e. whenever the raw value of SMART attribute 5 (Reallocated_Sector_Ct) is nonzero. This includes brand-new drives. (And I don't trust Seagate drives, even as paperweights.)
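A quick way to pull out just that raw value, assuming smartmontools is installed and the drive sits at /dev/sda:

# Print the raw Reallocated_Sector_Ct (SMART attribute 5);
# anything nonzero means the drive has already remapped sectors.
smartctl -A /dev/sda | awk '$2 == "Reallocated_Sector_Ct" { print $10 }'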
For SSDs, SMART is not really the best interface, but the fact that 100% of spares are available and there are zero media and data integrity errors tells me I can trust this one. (Otherwise I'd replace the SSD.)
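The corresponding quick check on NVMe (again assuming smartmontools; smartctl -A prints the NVMe health log) would be something like:

# Overall verdict, then the spare and media-error lines from the health log.
smartctl -H /dev/nvme0
smartctl -A /dev/nvme0 | grep -E 'Available Spare|Media and Data Integrity Errors'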
If you use only better-quality spinny disks, like Hitachi and most Western Digitals, anecdotal evidence (and the now-inaccessible Google statistics) shows that spinny-disk HDDs work just fine up to about +40°C without degradation in service life, and that there is no reliable way to detect when a drive is about to die. In about half the cases you see the reallocated sector count increasing a few hours (of use) before the drive dies completely; then again, most drives do perfectly fine with a small number (say, up to a dozen) of reallocated sectors, and many ship with a few reallocated sectors straight from the factory.
If you maintain a cluster (like I have in the past; a couple of clusters from a few dozen to a few hundred computing nodes), you'll find out that using "known good" off-the-shelf HDD models gives more bang for the buck in the long term than any enterprise models et cetera. (There was a point in time when I used 15kRPM SCSI drives, but that was, oh, two decades ago. It has been bog-standard SATA for the last decade or so for bulk storage, and SSDs for fast working storage, replaced when they wear out.)
All of which means that no RAID or disk monitoring is as good as backups and keeping your important data on multiple physically separate media.