Author Topic: badblocks seems drunk (Read 5899 times)

legacy · « **on:** April 23, 2019, 10:48:36 pm »

NAME
       badblocks - search a device for bad blocks

SYNOPSIS
       badblocks  [  -svwnf  ]  [  -b  block-size  ]  [  -c  blocks_at_once  ]  [ -e
       max_bad_blocks ] [ -d read_delay_factor ] [ -i input_file ] [ -o  output_file
       ] [ -p num_passes ] [ -t test_pattern ] device [ last-block ] [ first-block ]

DESCRIPTION
       badblocks is used to search for bad blocks on a device (usually a disk parti-
       tion).   device  is  the  special  file  corresponding  to  the  device  (e.g
       /dev/hdc1).   last-block is the last block to be checked; if it is not speci-
       fied, the last block on the device is used as a default.  first-block  is  an
       optional  parameter  specifying the starting block number for the test, which
       allows the testing to start in the middle of the disk.  If it is  not  speci-
       fied the first block on the disk is used as a default.

       Important  note:  If the output of badblocks is going to be fed to the e2fsck
       or mke2fs programs, it is important that the block size  is  properly  speci-
       fied,  since  the block numbers which are generated are very dependent on the
       block size in use by the filesystem.  For this reason, it is strongly  recom-
       mended that users not run badblocks directly, but rather use the -c option of
       the e2fsck and mke2fs programs.
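
To make the block-size caveat concrete, here is a minimal Python sketch of how a block number printed by badblocks maps to a byte offset and to 512-byte sectors. It assumes the badblocks default block size of 1024 bytes (no -b given) and 512-byte logical sectors; `block_to_offset` and `device_size_bytes` are hypothetical helper names, and all positions are relative to the start of the tested partition, not the whole disk.

```python
BLOCK_SIZE = 1024   # assumption: badblocks default when -b is not given
SECTOR_SIZE = 512   # assumption: 512-byte logical sectors


def block_to_offset(block, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return (byte_offset, first_sector, last_sector) for one badblocks block."""
    byte_offset = block * block_size
    first_sector = byte_offset // sector_size
    last_sector = (byte_offset + block_size - 1) // sector_size
    return byte_offset, first_sector, last_sector


def device_size_bytes(last_block, block_size=BLOCK_SIZE):
    """Bytes covered by a scan that runs from block 0 to last_block inclusive."""
    return (last_block + 1) * block_size


print(block_to_offset(350549))       # (358962176, 701098, 701099)
print(device_size_bytes(62500850))   # 64000871424
```

Read with a 4096-byte block size, the same block number 350549 would point at byte offset 1435848704 instead, which is why the list is only meaningful together with the block size it was produced with.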

Code: [Select]

badblocks -nv /dev/sdd1 -p1
Checking for bad blocks in non-destructive read-write mode
From block 0 to 62500850
Testing with random pattern: 
350549

Code: [Select]

badblocks -nv /dev/sdd1 -p1
Checking for bad blocks in non-destructive read-write mode
From block 0 to 62500850
Testing with random pattern: 
121049
1168665

wtf?!?

why does it say that block #350549 is a bad-block in the first try, but it doesn't mention it in the second try, reporting blocks #121049 and #1168665 as bad-blocks?
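
One way to put this in numbers: a stable media defect would be expected to show up at the same block on every pass, so counting how many runs report each block shows whether anything repeats. A minimal Python sketch using the two result lists above (`repeat_counts` is a hypothetical helper name):

```python
from collections import Counter

# Block numbers reported by the two runs quoted above.
run_1 = [350549]
run_2 = [121049, 1168665]


def repeat_counts(*runs):
    """Map each reported block to the number of runs that reported it."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))
    return counts


counts = repeat_counts(run_1, run_2)
persistent = sorted(block for block, n in counts.items() if n > 1)
print(persistent)  # [] : no block was reported by both runs
```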

this is confusing: the hard drive I am testing is a common electro-mechanical 2.5 inch SATA unit, removed from a laptop. It's not flash, so that can't be the explanation for the weird behavior I am observing with badblocks

besides, and even harder to understand, four hours ago I got this:

Code: [Select]

badblocks -nv /dev/sdd1 -p1
Checking for bad blocks in non-destructive read-write mode
From block 0 to 62500850
Testing with random pattern: 
1015205
2062821
3110437
4158049
5205665
6253281
7300897
8348509
9396125
10443737
11491349
12538965
13586577
14634193
15681813
16729429
17777041
18824653
20919881
21967497
22229437
24062729
25110349
26157965
27205581
28253193
29300809
29562749
31396041
31657981
33491273
34538889
34800829
35586497
35848433
36896049
37943665
38991281
40038893
41086505
42134121
43181733
44229345
45276961
46324573
47372177
48419801
49467417
50515033
51562649
52610261
53657873
54705489
55753101
56800713
57848329
58895945
59157885
60205501
61253101
62300725
Pass completed, 61 bad blocks found.

It said 61 bad blocks

it seems random-something, is it drunk?
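
The 61-block list may be less random than it looks. A quick Python sketch that computes the gaps between consecutive reported blocks, here on the first eight entries of the list above (`gaps` is a hypothetical helper name; the 1024-byte block size is the badblocks default and is assumed, since no -b was given):

```python
from collections import Counter

# First eight block numbers from the 61-block run above.
reported = [1015205, 2062821, 3110437, 4158049,
            5205665, 6253281, 7300897, 8348509]


def gaps(blocks):
    """Differences between consecutive block numbers, in ascending order."""
    ordered = sorted(blocks)
    return [b - a for a, b in zip(ordered, ordered[1:])]


gap_counts = Counter(gaps(reported))
print(gap_counts.most_common())  # [(1047616, 5), (1047612, 2)]
```

A spacing of about 1047616 blocks is just under 1 GiB at 1024 bytes per block. Evenly spaced errors like this hint at something periodic rather than scattered media defects, but the sketch only shows the pattern, not its cause.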

legacy · « **Reply #1 on:** April 23, 2019, 10:57:00 pm »

just in case someone is wondering: no, neither the partition under test, nor anything else related to this hard drive is mounted, so badblocks is free and happy to do its tasks without any external influence.

amyk · « **Reply #2 on:** April 24, 2019, 12:21:42 am »

HDDs do silently attempt to remap/recover unusable sectors. What you're seeing is indicative of that behaviour.

What are the SMART stats on that drive? In particular, the number of reallocated sectors should be checked, and if it is increasing visibly after each run of that utility (which I presume simply reads sectors consecutively), the drive is going to be dead very very very soon --- in any case, back up everything on that drive ASAP.

hamster_nz · « **Reply #3 on:** April 24, 2019, 02:40:08 am »

The whole "badblocks" thing is an anachronism from ages past, when HDDs were stupid and you had to key in the list of bad blocks that were printed on the drive label (see attached image).

Disks are smarter than most people expect. They try to keep working as much as possible. If they get a read error they will retry quite a few times, and if the ECC checks out they will give you data and carry on like nothing happened, and might even remap to a spare block kept for this purpose.

(Well, actually the spare blocks are there so HDD manufacturers don't have to make 100% perfect drives - the drive can ship with quite a few known-bad blocks)

Things really turn to custard when the drive runs out of spare blocks for remapping.

The best place to find out what is going on is the S.M.A.R.T. information for the drive. It will tell you exactly what errors are happening, when they happened and hopefully why they happened:

https://www.linuxtechi.com/smartctl-monitoring-analysis-tool-hard-drive/

ejeffrey · « **Reply #4 on:** April 24, 2019, 04:31:27 am »

Yeah, badblocks doesn't do much that is useful any more due to sector remapping. A drive with persistent IO errors is basically dying anyway. That appears to be what is happening here.

mfro · « **Reply #5 on:** April 24, 2019, 05:11:26 am »

Quote from: ejeffrey on April 24, 2019, 04:31:27 am

Yeah, badblocks doesn't do much that is useful any more due to sector remapping. A drive with persistent IO errors is basically dying anyway. That appears to be what is happening here.

It's not completely useless.

Hard drives do sector remapping on failed write attempts only, i.e. if a sector fails on read, it is not replaced. So, with a few runs, you might manage to get a failing drive back to a somewhat usable state again. I would not recommend using such a drive for anything important, however, as I would expect it to fail completely any time soon.

ogden · « **Reply #6 on:** April 24, 2019, 05:24:39 am »

Quote from: hamster_nz on April 24, 2019, 02:40:08 am

The best place to find out what is going on is the S.M.A.R.T. information for the drive. It will tell you exactly what errors are happening, when they happened and hopefully why they happened:

Right. The cause can be interface errors due to an inferior SATA cable. When you are getting strange HDD errors, always check for cable problems as well.

Berni · « **Reply #7 on:** April 24, 2019, 06:27:26 am »

Yeah, modern hard drives remap bad sectors when they see them.

But the fact that new bad sectors pop up on every run of it is good reason to be worried. The drive might have run out of spare sectors to swap in, or some part of its reading or writing process is going wonky.

And yes, first try using a different SATA cable and plug it into a different port on the motherboard. I bought some shitty Akasa brand SATA cables for my NAS build and was pulling my hair out for a month wondering why I was occasionally getting errors while other times it would work fine.

hamster_nz · « **Reply #8 on:** April 24, 2019, 10:14:04 am »

Quote from: mfro on April 24, 2019, 05:11:26 am

Quote from: ejeffrey on April 24, 2019, 04:31:27 am
Yeah, badblocks doesn't do much that is useful any more due to sector remapping. A drive with persistent IO errors is basically dying anyway. That appears to be what is happening here.

It's not completely useless.

Hard drives do sector remapping on failed write attempts only, i.e. if a sector fails on read, it is not replaced. So, with a few runs, you might manage to get a failing drive back to a somewhat usable state again. I would not recommend using such a drive for anything important, however, as I would expect it to fail completely any time soon.

How does the hard disk drive (not badblocks) know it has a failed write attempt, (aside from not finding the sector's header when it is looking for where it starts)? Does each block get read back after it is written and verify that the data can be read?

hamster_nz · « **Reply #9 on:** April 24, 2019, 10:56:47 am »

Long tale cut short, pathologically bad disks sometimes happen.

When I was a Customer Engineer for HP, we had a large K-class server (think two and a half racks of server+storage) running Oracle that occasionally reported memory corruption and would core-dump the production databases when heavily loaded causing a nationwide outage.

The server's memory ECC logs were clean - these boxes scrub memory when idle, and no memory corruption was seen. Vendor support cases were raised, outages arranged, diagnostics were run, and cases were escalated to highest levels in both HP and Oracle and no cause was found.

After a few months of ongoing problems it was replaced with a V-class (a server weighing about 220 kg, with a multi-million $ price tag, and a very deep discount), and the old K-class was re-purposed as test/dev box.

After the K-class was reinstalled, the customer started getting file system errors, and not memory errors. I wrote data to each of the disks and checksummed it, and found out that a single Enterprise class SCSI disk was writing OK, then would silently corrupt the data on read.

Looking back through records I identified that disk had been configured as a swap device when it was in production. So when heavily loaded it would swap some of the database's data to swap, and a while later a corrupted copy would be swapped back in, crashing the database...

We sent it away for failure analysis, and the result came back that part of data path in the SCSI interface didn't have parity, CRC or ECC, and was flaky.
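
The "write data, checksum it, read it back" test described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used on that server: the function names are hypothetical, it works on an ordinary file rather than a raw device, and a read-back right after writing may be served from the OS page cache, so a real test has to drop or bypass the cache before verifying.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # work in 1 MiB pieces


def write_pattern(path, size_bytes, seed=b"pattern"):
    """Write deterministic pseudo-random data; return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    written = 0
    counter = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            unit = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
            block = (unit * (CHUNK // len(unit)))[: size_bytes - written]
            f.write(block)
            digest.update(block)
            written += len(block)
            counter += 1
        f.flush()
        os.fsync(f.fileno())  # push the data out of userspace buffers
    return digest.hexdigest()


def read_digest(path):
    """Read the file back and return the SHA-256 hex digest of what came back."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected_digest):
    """True if the data read back matches what was written."""
    return read_digest(path) == expected_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "testfile.bin")
        expected = write_pattern(target, 5 * CHUNK)
        print("read-back matches:", verify(target, expected))
```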

hamster_nz · « **Reply #10 on:** April 24, 2019, 11:37:05 am »

Quote from: legacy on April 24, 2019, 11:01:19 am

Quote from: hamster_nz on April 24, 2019, 10:14:04 am
Does each block get read back after it is written and verify that the data can be read?

In the IBM AS/400 Enterprise stuff there are hard drives with a custom firmware, and they do - write - read back - and compare, but common hard drives do not. The minimal data block is also uncommon since it's not 512 byte but rather 700-800 byte because they write a sort of check sum, so they have two protections:
write and read back, to check if a write is defective
read and compare the check-sum, to see if the read is defective

How did I learn this? Well ... by wasting my money. When you see a super cheap SCSI hard drive for sale on Amazon/eBay ... e.g. a brand new 146GB UW SCSI 12Krpm for just 20 Euro (rather than 130-200 Euro), and there is the warning "AS/400" (what? pufff never mind ... *BIG MISTAKE*) ... if you buy them, they won't work as common hard drives since they have a customized firmware.

Oops.

As well as working with HP-UX systems, I've had the pleasure of using IBM P-series & I-series systems.

Using the IBM diagnostics you can reformat the same disk between 528 bytes per sector (to make pdisks) and 512 bytes per sector (for hdisks). The extra bytes in the longer sectors are used to make additional data tagging and CRC info visible to the RAID controller.

Lord knows how P-series systems get on with SSDs... Oh, it seems they have SSDs with 528 byte sectors, that you can't use diags to reformat to 512 byte sectors, so you can't boot an OS directly from them without first using diags to make a RAID array. How nice - typical IBM.

drescherjm · « **Reply #11 on:** April 24, 2019, 01:06:02 pm »

Quote

64000871424 bytes (64 GB) copied, 1090.25 s, 58.7 MB/s

Seems pretty slow for a modern drive. I assume this is a 5400 RPM or slower.

Quote

I will install all the S.M.A.R.T.tool stuff in the afternoon.

That should tell us a lot.

Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, UDMA_CRC_Error_Count, and Power_On_Hours are usually very helpful for diagnosing problems.

I wrote a script years ago that I use on my disks (at home and work I have several hundred drives)
https://raw.githubusercontent.com/drescherjm/jmdgentoooverlay/master/Other/shell-scripts/examine_smart.sh

drescherjm · « **Reply #12 on:** April 24, 2019, 03:22:55 pm »

Quote

5400 RPM laptop hard drive. Anyway, the speed is irrelevant.

I mentioned it because the performance is so bad that it could be an additional indicator of drive failure.

Berni · « **Reply #13 on:** April 24, 2019, 03:36:48 pm »

Yeah, indeed, drops in speed or inconsistent performance are a common sign of slowly failing drives. But it's normal for low RPM laptop drives to be pretty slow in general, even modern ones. But yeah, SMART will tell you the most about a drive's health.

Since you are using Linux by the looks of it, I guess you can also format it with ZFS to help protect against corruption if you really want to keep using it.

magic · « **Reply #14 on:** April 24, 2019, 04:20:19 pm »

Consumer disks don't do write-read-verify. I believe it only happens if the disk writes to a sector that was previously known to be corrupted, and only then, if verification fails, the disk may consider permanently remapping the sector.
It appears that data corruption can also happen due to transient faults and writing new data to the same sector clears the problem. I have seen several disks which reported read errors and re-writing them made "pending sector count" go to zero without increasing "reallocated sector count".
Of course a disk which writes corrupt sectors every now and then is likely to keep doing so in the future and keep getting worse.
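
The distinction above (pending count dropping with or without a matching rise in reallocated count) can be written down as a small check on two SMART snapshots, one taken before and one after re-writing the suspect sectors. A minimal Python sketch; the function name, the return strings and the sample numbers are all made up for illustration, and it assumes the raw values of both attributes are plain sector counts on the drive in question.

```python
def interpret_rewrite(before, after):
    """Classify what happened between two SMART snapshots.

    before/after: dicts holding the raw values of
    'Current_Pending_Sector' and 'Reallocated_Sector_Ct'.
    """
    pending_delta = after["Current_Pending_Sector"] - before["Current_Pending_Sector"]
    realloc_delta = after["Reallocated_Sector_Ct"] - before["Reallocated_Sector_Ct"]

    if realloc_delta > 0:
        return "sectors were remapped"
    if pending_delta < 0:
        return "pending sectors cleared without remapping"
    if pending_delta > 0:
        return "new pending sectors appeared"
    return "no change"


# Hypothetical snapshots, not taken from the drive in this thread.
before = {"Current_Pending_Sector": 8, "Reallocated_Sector_Ct": 0}
after = {"Current_Pending_Sector": 0, "Reallocated_Sector_Ct": 0}
print(interpret_rewrite(before, after))  # pending sectors cleared without remapping
```

Running the same comparison after every badblocks pass also covers the earlier advice in this thread to watch whether the reallocated count keeps climbing.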

legacy · « **Reply #15 on:** April 24, 2019, 04:41:50 pm »

Code: [Select]

sdd User Capacity: 160,041,885,696 bytes [160 GB]       Serial Number:      5VCHZ1AJ
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       14005
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065668
195 Hardware_ECC_Recovered  0x001a   059   053   000    Old_age   Always       -       81578831
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

this is what smartctl says
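
For anyone who wants to pull the interesting numbers out of a table like this programmatically, here is a minimal Python sketch that parses the attribute lines posted above. `parse_attributes` and `split_raw16` are hypothetical helper names. The parser only handles raw values that are plain integers, as in this output. Splitting the raw value into 16-bit words is an assumption: some drives are reported to pack several small counters into one raw field, which would explain the huge Command_Timeout number, but that is not confirmed for this drive.

```python
SMART_OUTPUT = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       14005
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065668
195 Hardware_ECC_Recovered  0x001a   059   053   000    Old_age   Always       -       81578831
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Parse attribute table lines into {name: {id, value, worst, thresh, raw}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": int(fields[9]),
        }
    return attrs


def split_raw16(raw):
    """Split a 48-bit raw value into three 16-bit words, high word first."""
    return [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]


attrs = parse_attributes(SMART_OUTPUT)
for name in ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable", "UDMA_CRC_Error_Count"):
    print(name, attrs[name]["raw"])
print(split_raw16(attrs["Command_Timeout"]["raw"]))  # [2, 2, 4]
```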

drescherjm · « **Reply #16 on:** April 24, 2019, 07:06:06 pm »

Your SMART status looks fine to me except for possibly Hardware_ECC_Recovered but that is not always consistent between manufacturers.

legacy · « **Reply #17 on:** April 24, 2019, 07:51:29 pm »

yes, the report looks "fine", so it looks really weird to me: I expected something bad reported, and there is no clue.

Maybe the problem is in the electronics on the back of the hard drive, or in the cable, but the cable is brand new; I tested several cables and all gave the same results.

Maybe the problem is a firmware incompatibility with the SATA controller, but I tested a couple of controllers from different brands and all gave the same results.

Now I am stressing the hard drive with "badblocks" in destructive mode, running in a loop. It's a long-running task and I will check tomorrow morning.

drescherjm · « **Reply #18 on:** April 24, 2019, 07:58:35 pm »

Cable problems usually show up in UDMA_CRC_Error_Count.

Quote

. it's a long-time running task and I will check tomorrow morning

I know. I have done this 100s of times. Even with 8+TB drives (that take 4+ days for the 4 pass destructive mode).
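
A rough Python estimate of where those run times come from. In destructive mode badblocks writes each of its four test patterns to the whole device and reads it back, so the device is transferred eight times in total. The function name is hypothetical and the sustained throughput is an assumption (real drives slow down towards the inner tracks), so treat the result as a ballpark figure only.

```python
def destructive_runtime_hours(capacity_bytes, throughput_bytes_per_s, patterns=4):
    """Estimate the duration of a destructive badblocks run in hours."""
    transfers = patterns * 2  # each pattern is written once and read back once
    return capacity_bytes * transfers / throughput_bytes_per_s / 3600


# 8 TB drive at an assumed average of 185 MB/s: about 96 hours, i.e. 4 days.
print(round(destructive_runtime_hours(8e12, 185e6), 1))

# The 160 GB drive from this thread at the 58.7 MB/s quoted earlier: about 6 hours.
print(round(destructive_runtime_hours(160_041_885_696, 58.7e6), 1))
```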

2N3055 · « **Reply #19 on:** April 24, 2019, 08:13:52 pm »

One of the failure modes of classic HDDs is that they develop problems with head positioning. It can be tolerance in the head bearings, or something else. Seeking starts to become unstable. The servo algorithms do eventually position the head over the cylinder well enough for ECC to do its job. You will notice it as a drive that slows down a bit, and as it progresses, it will start to be audible when the head is struggling. But at the beginning you won't hear much.
It will throw an occasional ECC error or bad sector.
Bottom line: that drive is dead. Back up the data ASAP and smash it with a hammer.

Today's drives have built-in space reserved for relocation clusters. Relocation happens completely in the background, and if your OS reports bad sectors that means the drive is DEAD. It means that the drive has used up all its reallocation space, so like 5-10% of your drive is already dead.
And manufacturers are NOT reporting that stuff in SMART most of the time.

magic · « **Reply #20 on:** April 25, 2019, 10:01:19 pm »

Quote from: legacy on April 24, 2019, 04:41:50 pm

Code: [Select]
sdd User Capacity: 160,041,885,696 bytes [160 GB]       Serial Number:      5VCHZ1AJ
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

This looks like a good disk. Is the error log empty (printed by smartctl -a)?
Frankly, I see no mention of the OS reporting bad sectors here, just some problems with badblocks. Any errors in dmesg?
Alternatively, the badblocks failures could possibly be caused by data corruption: bad RAM, a crappy USB SATA bridge, etc.

amyk · « **Reply #21 on:** April 26, 2019, 12:08:19 am »

Quote from: 2N3055 on April 24, 2019, 08:13:52 pm

One of the failure modes of classic HDDs is that they develop problems with head positioning. It can be tolerance in the head bearings, or something else. Seeking starts to become unstable. The servo algorithms do eventually position the head over the cylinder well enough for ECC to do its job. You will notice it as a drive that slows down a bit, and as it progresses, it will start to be audible when the head is struggling. But at the beginning you won't hear much.

This is why I recommend doing a scan with a program like MHDD, which is sensitive enough to identify "weak sectors". (It's sensitive enough that ambient vibration and even softly touching the drive while it's being scanned will be visible, so don't do that...)

legacy · « **Reply #22 on:** April 26, 2019, 09:54:03 am »

Quote from: magic on April 25, 2019, 10:01:19 pm

Any errors in dmesg?

nothing

Quote from: magic on April 25, 2019, 10:01:19 pm

Alternatively, the badblocks failures could possibly be caused by data corruption: bad RAM, a crappy USB SATA bridge, etc.

RAM check: all ok
SATA controller diagnostic: all ok
and it happens ONLY with that goddamn harddrive

I have two Hitachi harddrives, I tried both on the same machine with the same configuration, with the same badblocks' flags: all ok

so WTF, really

ogden · « **Reply #23 on:** April 26, 2019, 10:56:35 am »

The errors could be caused by supply ripple from faulty capacitor(s). Supply noise can leak into the HDD read amplifier, so the errors depend not on what is actually recorded on the disk but on the HDD's power consumption pattern. The list of non-obvious reasons for intermittent errors can be quite long. You can't trust that HDD anymore anyway. Let it go.

GeorgeOfTheJungle · « **Reply #24 on:** April 26, 2019, 11:21:07 am »

If it's a Fujitsu just throw it in the trash.

