Author Topic: tools to understand when a hard drive is close to death (Read 4147 times)

DiTBho · « **Reply #25 on:** May 28, 2022, 06:10:13 pm »

Quote from: bd139 on May 28, 2022, 05:19:46 pm

(SSD) [...] better failure modes

Faster, larger, cheaper ... all true, but "better failure modes"?

nahhhh, That's false with SSD!

When an SSD fails it simply pulls the plug, and it's gone with all data vanished; I said something similar in my topic about how unreliable flash is.

It happened several times, the last one with an SSD on a MIPS laptop. In the end I gave up on resurrecting it. Luckily, I had a backup on hand, so I bought a new SSD from Amazon and restored the image.

When a SCSI drive fails ... well my Fujitsu-2GB has half of its list full of bad-blocks, but you can still load a kernel from it (yes, I am doing it ... well, it's the only 50pin SCSI disk I have here, at the moment), and its electronic circuits are still alive, it's just a problem with its platters. There are some bad blocks, but until the disk gets completely dead read/write heads, some bad blocks on platters don't affect the entire disk.

With SSD I have always had bad luck, and on a couple of occasions all the data on the disk just vanished without any warning, and since then the disk has ZERO bytes of capacity, simply because some flash cells must be so damaged that the firmware has decided to pull the plug.

Also, SSD sucks for RAID! For a real modern RAID, you have to buy NAS-SATA disks, specifically made for NAS. I opened a discussion here on the forum about that.

bd139 · « **Reply #26 on:** May 28, 2022, 06:15:02 pm »

I disagree. I have data from over 2000 M2 SSDs used in production datacentres. They are at least two orders of magnitude more reliable than any mechanical disk. To the point we didn't actually impact storage reliability by not bothering with RAID. All redundancy was moved to logical machine level.

As for the data disappearing I have never seen this. Not once. The only failure mode I have seen is it dropping into read only mode and that was after an insane TBW in a database server. But that's not an issue as there was a hot replica node.

It of course depends what vendor you go for. Samsung 850 series and later Evo and Pro are fine as are Hitachi enterprise at least. Others, not so much.

When you get to the TBW limit, replace it. I still have some of the exceeded TBW limit drives sitting around. They are fine for low reliability desktops (admin etc) and will probably last 3-4 years fine anyway.

wraper · « **Reply #27 on:** May 28, 2022, 06:20:03 pm »

I had NVMe SSD failure in my computer. It dropped into read only mode which allowed to slowly read all of the data with no corruption as far as I'm aware.

Monkeh · « **Reply #28 on:** May 28, 2022, 06:24:19 pm »

I've seen one or two SSDs just up and vanish. It's not common these days, but it happens. I've had several fail with data corruption, most are after 2-3x their minimum TBW spec on very low cost drives, but I've seen a few lemons which are just plain faulty. Ah yes, this Micron 1100 is one of those. I'd plug it in for the relevant stats but my USB adapter mysteriously karked it since I last needed it.

I've also got some junk 'industrial' SSDs which just corrupt data in the background, and one which at a random point around a month of uptime, will just cease operating and require a power cycle - that was fun to pin.

They seem mostly reliable, but the lower cost TLC and QLC units unsurprisingly have the higher failure rates IME, and I feel like a lot of them are firmware bugs rather than actual failure of the flash.

wraper · « **Reply #29 on:** May 28, 2022, 06:28:19 pm »

Quote from: Monkeh on May 28, 2022, 06:24:19 pm

They seem mostly reliable, but the lower cost TLC and QLC units unsurprisingly have the higher failure rates IME, and I feel like a lot of them are firmware bugs rather than actual failure of the flash.

Yes AFAIK most of the faults are SSD locking out due to firmware freaking out, no actual hardware failure. Was especially prominent with early SSDs. If you have the right tools they can be restored to working condition.

Monkeh · « **Reply #30 on:** May 28, 2022, 06:32:28 pm »

Quote from: wraper on May 28, 2022, 06:28:19 pm

Quote from: Monkeh on May 28, 2022, 06:24:19 pm
They seem mostly reliable, but the lower cost TLC and QLC units unsurprisingly have the higher failure rates IME, and I feel like a lot of them are firmware bugs rather than actual failure of the flash.
Yes AFAIK most of the faults are SSD locking out due to firmware freaking out, no actual hardware failure. Was especially prominent with early SSDs. If you have the right tools they can be restored to working condition.

I find the right tool is the warranty and backups, myself. If it's already out of warranty it wasn't something I planned on lasting anyway.

DiTBho · « **Reply #31 on:** May 28, 2022, 07:08:54 pm »

Synology does not recommend SSDs for storage on its NAS; it uses up to eight SATA disks.
It can optionally also use two SSDs, but only as cache.

DiTBho · « **Reply #32 on:** May 28, 2022, 10:24:15 pm »

So, thanks to the smartctl -t short trick, I automated it with badblocks and completed the tests, and these are the results:

Defective disks purchased from Apress24
Model FUJITSU MAW3147NC, SCSI SCA, 147 GB

Code: [Select]

Disk , Serial Number , Version , LifeTime    , note
#1   , DAA0P76047MH  , "0104"  , 53439 hours
#2   , DAA0P7704G9V  , "0104"  , 58667 hours , Badblocks blocks after 40 min
#3   , DAF4P7400M8S  , "3701"  , 52228 hours , 2K blocks are Defective
#4   , DN00P820195T  , "0104"  , 39292 hours
#5   , DAA0P64013V3  , "0104"  , 37093 hours
#6   , na            , na      , na          , the disk is dead, doesn't respond
#7   , DAA0P6300NEG  , "0104"  , 57251 hours
#8   , DAA0P6801RCS  , "0104"  , 31692 hours

50,000 hours means almost 6 years of continuous operation ....

DiTBho · « **Reply #33 on:** May 28, 2022, 10:36:14 pm »

Quote

Service life
The service life is depending on the environment temperature. Therefore, the user must design the
system cabinet so that the average DE surface temperature is as low as possible.
+DE surface temperature: 40°C or less 5 years
+DE surface temperature: 41°C to 45°C 4.5 years

50,000 hours -> close to end of service life -> close to death.
Indeed, three of the eight units already show failures.
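
As a rough cross-check of the arithmetic above, here is a minimal C sketch that converts power-on hours into years and compares them with the 5-year service life quoted from the manual. It assumes the drives ran 24/7 (8760 hours per year), and the function names are made up for illustration.

```c
#include <stdio.h>

/* Assumption: the drive runs 24/7, so one year is 365 * 24 = 8760 hours. */
#define HOURS_PER_YEAR 8760.0

/* Convert power-on hours to years of continuous operation. */
double hours_to_years(unsigned long hours)
{
    return (double) hours / HOURS_PER_YEAR;
}

/* Return 1 when the power-on time exceeds the given service life in years. */
int past_service_life(unsigned long hours, double service_life_years)
{
    return hours_to_years(hours) > service_life_years;
}

/* Print one report line per drive; hypothetical helper for illustration. */
void report_drive(unsigned int number, unsigned long hours, double service_life_years)
{
    printf("#%u: %lu h = %.1f years%s\n", number, hours, hours_to_years(hours),
           past_service_life(hours, service_life_years) ? " (past service life)" : "");
}
```

Under that assumption, 50,000 hours works out to about 5.7 years, and disk #1 (53439 hours) to about 6.1 years, so it is past the 5-year figure.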

james_s · « **Reply #34 on:** May 29, 2022, 07:59:24 pm »

Quote from: wraper on May 28, 2022, 06:20:03 pm

I had NVMe SSD failure in my computer. It dropped into read only mode which allowed to slowly read all of the data with no corruption as far as I'm aware.

I have seen this several times with workstations used at a local business. For whatever reason that prevented Windows 10 from booting but I was able to copy it to a new drive and it was back up and running. It confused me the first time I encountered it, I could boot off a rescue image and see the data, I could run error checks and everything looked fine, but whenever I tried to change something the change wouldn't stick. After that it was easy to identify the symptom.

DiTBho · « **Reply #35 on:** May 30, 2022, 10:28:33 am »

No one in that company(1) has yet apologized for the issue they caused, nor have they replied to emails. PayPal is handling everything, and their email today looks like good news.

Quote

You’ve received a full refund. To receive the refund, you need to send the item back to the seller.

I am preparing a return package for UPS; however, due to this public feedback on eBay

Quote

This item was described "as new in original packaging" on delivery, it was found to be a "pull" from an old computer. The item was returned to the seller, but he did not co-operate with the shipping company. It's been in "limbo" now for over a month. And as of today, I have not received a refund. My worst transaction on eBay since I have traded for over 18 years on eBay. Avoid this seller.
IBM DORS-32160 2.1GB 5.4k SCSI 46H6135 3.5''

I have also already informed my lawyer, just in case these dudes will not co-operate with the shipping company.

We will see. I will not publish more on this, until the conclusion of this sad story.

(1) aPress Pawel Sinkiewicz, Damian Odulinski
Located in Poland, Gospodarcza 9 Lubuskie Żary 68-200 PL (public address of the company)
They also sell on eBay with the nickname apress24

DavidAlfa · « **Reply #36 on:** June 02, 2022, 08:38:27 am »

Whenever you suspect something is wrong with the HDD, especially if you hear any strange noise pattern coming from it, don't leave it all to SMART; it doesn't always report a bad status.
Download hdd scan and run a full verification pass (Won't destroy data).
If you see zones where it stalls for a bit (>500ms), think about migrating your data to a new disk.
If it marks some as bad, hurry! Hopefully the damage will affect only OS files, not critical.
Used this method for years, it was one of the first tests when a customer reported random hangs, stalling, freezing... It did the job great.
Avoid tools like HDD regenerator, I've tried it several times, seems to work, so you store 500GB of data in it thinking it's fixed, but when trying reading it 2 months later, Samuel L. Jackson appears, surprise MF!! Cyclic error, your data is screwed up!
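
For anyone who wants to script the same idea, here is a minimal sketch of a read-only verification pass in C that counts reads slower than a threshold (500 ms in the advice above). It is not how HDDScan itself works; the names `scan_reads` and `scan_result_t` are invented here, and it assumes a POSIX system with `clock_gettime()`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() under strict -std= modes */
#endif
#include <errno.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Counters filled in by scan_reads(); names are illustrative only. */
typedef struct {
    unsigned long blocks_read;  /* reads that returned data */
    unsigned long slow_reads;   /* reads slower than the threshold */
    unsigned long failed_reads; /* reads that returned an error */
} scan_result_t;

/* Elapsed milliseconds between two monotonic timestamps. */
static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

/*
 * Read-only pass over an already opened file descriptor, from its current
 * offset to end of file. Nothing is written, so data is not destroyed.
 * Returns 0 when the pass ran, -1 if block_size is zero or malloc fails.
 */
int scan_reads(int fd, size_t block_size, double slow_ms, scan_result_t *res)
{
    unsigned char *buf;
    struct timespec t0, t1;
    ssize_t n;

    if (block_size == 0) return -1;
    buf = malloc(block_size);
    if (buf == NULL) return -1;
    res->blocks_read = res->slow_reads = res->failed_reads = 0;

    for (;;) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        n = read(fd, buf, block_size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (n == 0) break; /* end of device or file */
        if (n < 0) {
            /* unreadable block: count it and, for I/O errors, step over it */
            res->failed_reads++;
            if (errno != EIO || lseek(fd, (off_t) block_size, SEEK_CUR) < 0) break;
            continue;
        }
        res->blocks_read++;
        if (elapsed_ms(&t0, &t1) > slow_ms) res->slow_reads++;
    }
    free(buf);
    return 0;
}
```

On a real disk the descriptor would come from opening the block device read-only. Reads served from the page cache will hide slow sectors, so a serious tool would bypass the cache; that part is left out of this sketch.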

Jeroen3 · « **Reply #37 on:** June 02, 2022, 09:16:57 am »

I have used HD Sentinel in the past, it keeps track of changes in smart data and warns you when numbers start to look bad.
On today's high density disks one or two bad sectors is not something to panic about. You panic when it increases.
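
The rule of thumb above (a few bad sectors are fine, growth is the alarm) can be sketched in a few lines of C. This is not HD Sentinel's actual logic, which is not described here; the enum and function names are hypothetical.

```c
/* Verdicts for the bad-sector trend; names are illustrative. */
typedef enum { TREND_CLEAN, TREND_STABLE, TREND_GROWING } trend_t;

/*
 * Compare the bad-sector count from the previous check with the current one.
 * A non-zero but unchanged count is tolerated; any increase is the alarm.
 */
trend_t classify_bad_sectors(unsigned long previous, unsigned long current)
{
    if (current > previous) return TREND_GROWING; /* time to panic */
    if (current > 0) return TREND_STABLE;         /* some, but not increasing */
    return TREND_CLEAN;
}
```

The two counts would come from successive SMART readings stored between runs, for example once a day.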

DiTBho · « **Reply #38 on:** June 02, 2022, 06:05:02 pm »

Quote from: DavidAlfa on June 02, 2022, 08:38:27 am

Download hdd scan and run a full verification pass

Thanks for the link

I need to compile it for non-x86 servers. Yes, I can move the disks, but I'd rather avoid that.
If that's not possible, I'll use hdd-scan, as you suggest.

DiTBho · « **Reply #39 on:** June 02, 2022, 06:09:45 pm »

Quote from: Jeroen3 on June 02, 2022, 09:16:57 am

I have used HD Sentinel in the past, it keeps track of changes in smart data and warns you when numbers start to look bad.

hdsentinel

looks great! Is there anything similar but OpenSource? So I can compile it for non-x86 computers.

Jeroen3 · « **Reply #40 on:** June 02, 2022, 07:37:14 pm »

It is available for a few arm architectures on linux.

DiTBho · « **Reply #41 on:** June 03, 2022, 09:36:24 am »

Quote from: Jeroen3 on June 02, 2022, 07:37:14 pm

It is available for a few arm architectures on linux.

I need it for POWER10

BradC · « **Reply #42 on:** June 03, 2022, 10:07:46 am »

smartmontools

DiTBho · « **Reply #43 on:** June 03, 2022, 10:12:20 am »

Looking for other sellers, and I am willing to guide (teach?) them how to use a Linux computer with a SCSI HBA to test disks.

My typical question

Quote

are your hard-drives really brand new, opened only for testing?
How many PowerOn_hours do they have?
(just to avoid misunderstanding)

Typical answer

Quote

sorry we can not test the HDDs

(so how can they be "used only for a few hours, only for testing"?!? HOW?!?)

Quote

sorry we don't have any equipment to test the HDDs

(so how can they be "used only for a few hours, only for testing"?!? HOW?!?)

I am massively using Amazon now
- order
- test(1)
- keep|return

(1) I am working on a C program that integrates
- SCSI query, to directly get access to the disk_serial_number
- SMART queries (smartmontools is written in C++, my version is pure C code)
- "badblocks" (Linux tool) functionalities

All in one program, written in portable C/89, able to be compiled on Linux k2.6.19 ... k5.19

Code: [Select]

boolean_t is_ok_level1
(
    p_disk_t p_disk
)
{
    boolean_t ans;
    boolean_t is_ok;

    smart_short_test(p_disk); /* it does smartctl -t short, and waits for test completion */
    smart_lifetime_get(p_disk);

    is_ok = True;
    is_ok = ((is_ok) AND (p_disk->lifetime < 100)); /* less than 100 hours */
    /* other checks?!? for the health conditions, hence of acceptability, of the hard disk? */
    /* is_ok = ... */
    ans = is_ok;

    return ans;
}

boolean_t is_ok_level2
(
    p_disk_t p_disk
)
{
    boolean_t ans;

    ans = disk_blocks_is_ok(p_disk); /* it does the same as badblocks -w ... -p1 */
    return ans;
}

...
    is_disk_ok = False;
    if (is_ok_level1(p_disk))
    {
        if (is_ok_level2(p_disk))
        {
            is_disk_ok = True;
        }    
    }

    if (is_disk_ok)
    {
        hinv_keep_it(p_disk);
    }
    else
    {
        hinv_return_it(p_disk);
    }
...

Disks are marked with a progressive number { #001, #002, ... }, which is automatically associated with the disk serial_number and its log. hinv_return_it() does nothing but prepare a text file with the list of all the "#disk_number" entries that I have to drop off at the Amazon hub.

hinv_keep_it() prepares a similar file (disk_inventory.txt) listing the disks that are good enough for hobby projects, not only for me, but also for my three friends.
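
The bookkeeping described above could look roughly like this. It is only a guess at the shape of it: `disk_tag_format` and `disk_list_append` are hypothetical names, the one-line-per-disk file format is an assumption, and `snprintf` needs C99 rather than C89.

```c
#include <stdio.h>

/* Format a progressive disk number as "#001", "#002", ...; returns 0 on success. */
int disk_tag_format(unsigned int number, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "#%03u", number);
    return (n > 0 && (size_t) n < out_size) ? 0 : -1;
}

/* Append one "tag serial" line to a list file such as disk_inventory.txt. */
int disk_list_append(const char *path, unsigned int number, const char *serial)
{
    char tag[16];
    FILE *f;

    if (disk_tag_format(number, tag, sizeof tag) != 0) return -1;
    f = fopen(path, "a");
    if (f == NULL) return -1;
    fprintf(f, "%s %s\n", tag, serial);
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling `disk_list_append("disk_inventory.txt", 1, "DAA0P76047MH")` would add the line `#001 DAA0P76047MH` to the keep list; a second file would hold the returns.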

Ed.Kloonk · « **Reply #44 on:** June 03, 2022, 10:51:18 am »

Which SCSI interface/adapter are you running?

Back in the mid '90s, I had SCSI II gear. Big HDD and Tape backup. Started out with a ISA -> SCSI adapter (adaptec?). Whilst it worked fine on 16-bit winders, the win 95 drivers or implementation was cancer. It was hogging the DMA and causing all sorts of mischief until I got the fancy PCI -> SCSI (1540?) adapter. Happy days.

DiTBho · « **Reply #45 on:** June 03, 2022, 12:11:27 pm »

Quote from: Ed.Kloonk on June 03, 2022, 10:51:18 am

Which SCSI interface/adapter are you running?

- LSI PCI-X SCSI U320 HBA (brand new)
- Amphenol LVD 2m cable (brand new)
- Amphenol U320 LVD terminator (brand new)
- Rax 3xSCA disks with temperature control (brand new)

Why do you ask this? Assuming the kernel driver is working fine(2), if { HBA, cable, terminator } is the problem, then you should see the "non-medium error" value increasing.

It didn't happen with the new setup, hence the LVD-setup is fine; indeed, there are Fujitsu 10K rpm and Seagate 15K rpm disks that have perfectly passed all the "badblocks" tests (8 hours burn-in)

Good point, however

I will add a function to check the "non-medium error" value, in order to stop the test program if the value is seen to increase due to a bad physical SCSI configuration rather than due to a physical problem(1) with the disk under testing.

(1) physical problems that I cannot test directly appear on the SCSI interface (hence to my testing program) like a communication/service disruption

worn out bearings -> read/write delayed correction or I/O abort
worn out read/write heads -> read/write delayed corrections or I/O abort
worn out brushless motor -> read/write delayed corrections or I/O abort
worn out SCA connector -> wrong "tag phase reported" by the Linux kernel, disk not recognized, channel too noisy with too many retries (you have exactly this symptom if you use a bad cable, or a bad SCSI terminator)
worn out electronic board with semi fried chips -> read/write delayed correction or I/O abort

edit:
Yup, I also have to write a function to monitor what the kernel complains about. Not yet done.

(2) there are certain SCSI devices { CDROM, DVDRAM, MO, CD-Jbox, Tape{DDS, LTO, ...}, Scanner } that have SCSI-quirks, but I have never observed anything similar with SCSI disks.

When a quirk arises, you see the kernel complain about phases and tags, with a lot of verbosity.
If you don't see it, it's mostly 99.97% ok.
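
The planned "non-medium error" check could be sketched like this, assuming the counter is taken from captured smartctl text output. For SCSI drives smartctl prints, as far as I know, a line starting with "Non-medium error count:", but check the exact wording against your smartmontools version. The function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Find "Non-medium error count:" in captured smartctl text and parse the number.
 * Returns 0 and stores the value in *count, or -1 if the line or number is missing.
 */
int non_medium_errors_parse(const char *text, unsigned long *count)
{
    static const char key[] = "Non-medium error count:";
    const char *p = strstr(text, key);
    char *end;

    if (p == NULL) return -1;
    p += sizeof key - 1;
    *count = strtoul(p, &end, 10); /* strtoul skips the leading blanks */
    return (end == p) ? -1 : 0;
}

/* Return 1 when the counter has grown since the baseline taken before the test. */
int non_medium_errors_increased(unsigned long baseline, unsigned long current)
{
    return current > baseline;
}
```

The test program would take a baseline before starting badblocks, sample again during the run, and stop with a "check cable/terminator" message as soon as the counter grows.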

DiTBho · « **Reply #46 on:** June 03, 2022, 12:23:24 pm »

(
What I really find annoying ... sATA/SAS and SCSI disks have a different way to respond to queries.

When you simply want to read the serial number, you have to write two different pieces of code, one for SCSI, one for sATA/SAS; the same applies to every other variable whose value you may want to read.

smartmontools, somehow, hides this under the same user interface
)
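
To illustrate the two code paths for the serial number, here is a minimal sketch that only decodes the raw reply buffers. It assumes the SCSI side uses the Unit Serial Number VPD page (INQUIRY, EVPD=1, page 0x80) and the ATA side uses IDENTIFY DEVICE data, words 10 to 19. Actually issuing the commands (for example through the Linux SG_IO ioctl) is not shown, and the function names are made up.

```c
#include <stddef.h>

/* Copy src[0..len) into out without surrounding blanks; out_size must be >= 1. */
static void copy_trimmed(const unsigned char *src, size_t len, char *out, size_t out_size)
{
    size_t start = 0, i, n = 0;

    while (start < len && src[start] == ' ') start++;
    while (len > start && (src[len - 1] == ' ' || src[len - 1] == '\0')) len--;
    for (i = start; i < len && n + 1 < out_size; i++) out[n++] = (char) src[i];
    out[n] = '\0';
}

/*
 * SCSI: Unit Serial Number VPD page. Byte 1 is the page code (0x80),
 * byte 3 the (low byte of the) page length, the serial starts at byte 4.
 */
int serial_from_scsi_vpd80(const unsigned char *page, size_t page_size,
                           char *out, size_t out_size)
{
    size_t len;

    if (page_size < 4 || page[1] != 0x80 || out_size == 0) return -1;
    len = page[3];
    if (len > page_size - 4) len = page_size - 4; /* never read past the buffer */
    copy_trimmed(page + 4, len, out, out_size);
    return 0;
}

/*
 * ATA: IDENTIFY DEVICE data, words 10..19 (bytes 20..39) hold the serial.
 * Each 16-bit word stores two characters with the first one in the high byte,
 * so the bytes of every pair are swapped in the little-endian buffer.
 */
int serial_from_ata_identify(const unsigned char *id, size_t id_size,
                             char *out, size_t out_size)
{
    unsigned char tmp[20];
    size_t i;

    if (id_size < 40 || out_size == 0) return -1;
    for (i = 0; i < 20; i += 2) {
        tmp[i] = id[20 + i + 1];
        tmp[i + 1] = id[20 + i];
    }
    copy_trimmed(tmp, sizeof tmp, out, out_size);
    return 0;
}
```

A common front end would pick one decoder or the other depending on how the disk is attached, which is roughly the kind of hiding smartmontools does behind its single interface.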

DiTBho · « **Reply #47 on:** June 06, 2022, 09:58:32 am »

Code: [Select]

    is_ok = True;
    is_ok = ((is_ok) AND (p_disk->lifetime < 100)); /* less than 100 hours */
    /* other checks?!? for the health conditions, hence of acceptability, of the hard disk? */
    /* is_ok = ... */

If you think this is a pretty weak acceptance test, please let me know other S.M.A.R.T. things to look at
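
One possible way to widen the check, offered only as a sketch: besides power-on hours, also require that the short self-test passed and that the grown defect list is within a limit. The choice of fields and the limits are assumptions, not recommendations, and the type and function names are hypothetical.

```c
/* Values collected from the drive beforehand; the field selection is an assumption. */
typedef struct {
    unsigned long power_on_hours;
    unsigned long grown_defects; /* entries in the grown defect list */
    int short_test_passed;       /* 1 if the SMART short self-test completed OK */
} disk_health_t;

/* Limits are placeholders to tune, not recommendations. */
typedef struct {
    unsigned long max_power_on_hours; /* exclusive upper bound, e.g. 100 */
    unsigned long max_grown_defects;  /* inclusive upper bound, e.g. 0 */
} accept_limits_t;

/* Return 1 only when every individual check passes. */
int is_ok_level1_ext(const disk_health_t *h, const accept_limits_t *lim)
{
    int is_ok = 1;

    is_ok = is_ok && h->short_test_passed;
    is_ok = is_ok && (h->power_on_hours < lim->max_power_on_hours);
    is_ok = is_ok && (h->grown_defects <= lim->max_grown_defects);
    return is_ok;
}
```

With limits of 100 hours and 0 grown defects, a disk like #3 in the earlier table (52228 hours, about 2K defective blocks) fails on both counts.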

bd139 · « **Reply #48 on:** June 06, 2022, 10:02:42 am »

Code: [Select]

is_ok &= bought_it_new_from_respectable_distributor && didnt_turn_up_loose_in_jiffy_bag;

Karel · « **Reply #49 on:** June 06, 2022, 10:14:50 am »

Quote from: james_s on May 28, 2022, 06:08:39 pm

Quote from: bd139 on May 28, 2022, 05:52:39 pm
I'd have noped that idea away super quick as I am recently allergic to anything which involves friction or risk

Is there anything in life that doesn't involve friction or risk?

Marriage?

