Author Topic: Intel Atom C2000 Failures  (Read 34392 times)


Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #25 on: April 09, 2022, 10:44:39 pm »
Do you mind me asking which Gigabyte board you are / were troubleshooting? I have a bunch of J1900-D2H boards in various gear I've built and just had a failure (in a pfSense box).
I did know of the issue but hadn't realised it affected the J1900 SOCs.
That'll teach me for buying boards with soldered CPUs because 'Intel CPUs almost never fail' (which seemed to be a fact before now).
Mine has a 'debug header'; one of the pins is a clock, but is this the LPC clock?
Failure mode on mine was an NMI error followed by a panic. The machine rebooted but locked up halfway through, then never POSTed again.
These boards are getting hard to find now and I can't find a modern replacement either. Will switch to socketed CPUs for these systems as needed.

Thanks for any help you could give.
 

Offline wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #26 on: April 10, 2022, 07:51:55 am »
so is this a silicon failure?
Yes, a silicon failure that has workarounds.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #27 on: April 16, 2022, 11:07:48 pm »
2nd one now. Another pfSense router. The exact same errors:
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic
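For anyone wanting to check their own logs for the same signature, here is a minimal sketch in Python. The two patterns are copied from the log lines above; the function name and the idea of matching on these two lines only are my assumptions, not an official diagnostic for this failure.

```python
import re

# Signatures copied from the log excerpt above. Treating these two lines
# as "the" NMI signature is an assumption made for illustration.
NMI_PATTERNS = [
    re.compile(r"kernel: NMI ISA [0-9a-fA-F]+, EISA [0-9a-fA-F]+"),
    re.compile(r"kernel: panic: NMI indicates hardware failure"),
]


def find_nmi_events(lines):
    """Return the log lines (newline stripped) that match any NMI signature."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in NMI_PATTERNS)
    ]
```

Pass it any iterable of lines, for example an open file handle on whatever log file your system writes kernel messages to (the path depends on the install, so none is hard-coded here).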

This time on a no-name board little box that's been running perfect for a few years now.

I think there is no proper fix for this but to replace the CPU / board. Seems all cases / parts are designed for gaming these days, so I will buy >10 year old stuff to do it.
The industry is going weird and not in a good way (kind of like the world in general).

Any ideas / thoughts?
 

Offline darkspr1te

  • Frequent Contributor
  • **
  • Posts: 290
  • Country: zm
Re: Intel Atom C2000 Failures
« Reply #28 on: April 17, 2022, 06:22:49 am »
2nd one now. Another pfSense router. The exact same errors:
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic

This time on a no-name board little box that's been running perfect for a few years now.

I think there is no proper fix for this but to replace the CPU / board. Seems all cases / parts are designed for gaming these days, so I will buy >10 year old stuff to do it.
The industry is going weird and not in a good way (kind of like the world in general).

Any ideas / thoughts?
I have also had units fail: my SG-2440, then its bigger-ported sibling.
One unit, which had been in storage for years after being upgraded, was pulled out and put back into service, only to last a few days before hitting the same kernel error. Now it's dead.
All had the updated BIOS meant to mitigate the issue, but alas it seems it did not.

darkspr1te
 

Offline wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #29 on: April 17, 2022, 11:24:41 am »
2nd one now. Another pfSense router. The exact same errors:
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic
AFAIK with this particular CPU failure it won't get this far.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #30 on: April 17, 2022, 05:25:39 pm »
AFAIK with particular CPU failure it won't get this far.

I'm not sure what else the failure could be though. The first board that failed (a Gigabyte J1900-D2H) threw those errors a couple of times, then wouldn't POST about a day later.
Checked all voltages and anything else I could think of. Still no POST. Replaced board (hard to find) and system works.
Kicking myself for trusting Intel so much. I have these boards / CPUs all over the place. Routers, storage servers and Asterisk PBXs. And it's so hard to find a drop-in replacement although I have managed to get a couple of boards.
It's annoying as these were perfect for the applications I'm using them for and (were) super stable.
One of those tech gotchas I suppose. At least the issue doesn't corrupt the SATA drives or anything.
 

Online amyk

  • Super Contributor
  • ***
  • Posts: 8275
Re: Intel Atom C2000 Failures
« Reply #31 on: April 17, 2022, 07:49:54 pm »
That might be something else, I'm not sure what else they put on the LPC bus on these mobos. It's usually TPM, and possibly BIOS (hence the boot problems).
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #32 on: April 17, 2022, 10:34:31 pm »
That might be something else, I'm not sure what else they put on the LPC bus on these mobos. It's usually TPM, and possibly BIOS (hence the boot problems).

Could be, don't know. It's hard to find documentation or schematics of these boards / chips.
There's not much on these boards which isn't on the CPU - there is Power, BIOS, RAM, PCIE slots, all of this goes to the CPU / SOC directly. I'm sure the BIOS is on the LPC on these.
Will have a butchers about with the broken boards, but I don't trust them anymore and they will all be getting replaced. Asus P8B-M with Xeon E3-1220L V2 seems to be the best bet (ATX though, not ITX).
Incredible that there isn't a modern equivalent.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #33 on: April 24, 2022, 12:59:18 am »
To add to this: the 2nd system above failed shortly after I enabled PowerD (which enables SpeedStep), because I thought it would help. Once it cooled down a bit, it failed.
Both failures I have had were on pfSense systems, which by default run the CPU clock at the maximum. I don't know if that's related.
Disabled PowerD again and it's been stable for a week. Tomorrow I'm replacing it with a 'new' Atom E3845 box. I know that is still vulnerable to this kind of failure (D0 stepping).
Just thought I'd throw out there that it seems to be temperature-related somehow.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #34 on: April 25, 2022, 12:04:35 am »
It is temperature-related. I installed my new pfSense box with the config copied over from the old one. It didn't work for whatever reason (interface names etc.).
Tried to go back to the old one, which had cooled down for an hour or so. It didn't boot. Tried heating it up. That didn't work either.
This crap is so annoying.
 

Offline jesse1329

  • Newbie
  • Posts: 4
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #35 on: May 02, 2022, 02:08:06 am »
I had a GA-J1900N-D3V. They probably have the same basic design. It's probably that the BIOS, connected to the LPC bus, killed the I/Os on the CPU due to the silicon issue. And it's sad I didn't realize it earlier. I remember reading the news when it first broke years ago on NAS systems, saying just don't use USB! I didn't use USB on my boards, so I thought that with that "work-around" it wouldn't happen. That might be true on some boards. But frankly, at this point, we don't know whether the errata notice covers all the I/Os that are bugged on this design. I can't trust that Intel fully explained the problem. I think people should just junk these SoC boards and not chance it if they are running them as critical servers.

And the very telling thing is our boards are failing at the same time. I've had other people contact me saying the same thing happened to them.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #36 on: May 08, 2022, 11:23:01 pm »
The D3V is the one with the PCI slot, right?
D2H is essentially the same but with a 1x PCIe slot as far as I know.
Interestingly, it has been my pfSense machines which have died. One possibility is that by default pfSense disables C-states / speedstep.
But yeah, I no longer trust these little integrated-CPU boards. Will go back to doing it the proper way.
Did any socketed Intel CPUs have these problems? It just seems a bit weird.
 

Offline bson (Topic starter)

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #37 on: May 09, 2022, 01:08:35 am »
My Synology DS1515+ died, and I replaced it last week with a DS1821+.  (Just moved the drives over, powered on, and followed the in-browser migration instructions.  Mostly consisted of clicking 'next'.)  The 1515 replaced an earlier one with the CPU failure (this one had a PSU failure).  Opened it up, and sure enough the 1515 I scrapped had the resistor hack.
 

Online Monkeh

  • Super Contributor
  • ***
  • Posts: 7992
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #38 on: May 09, 2022, 02:36:30 am »
Did any socketed Intel CPUs have these problems? It just seems a bit weird.

Intel CPUs and chipsets have suffered several similar failure modes over the last decade. Whether it's socketed or not has no bearing on the failure or how 'proper' it is.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #39 on: May 11, 2022, 12:36:53 pm »
Did any socketed Intel CPUs have these problems? It just seems a bit weird.

Intel CPUs and chipsets have suffered several similar failure modes over the last decade. Whether it's socketed or not has no bearing on the failure or how 'proper' it is.

Which ones? When I say the proper way I mean with a socketed CPU which can at least be easily replaced.
 

Online Monkeh

  • Super Contributor
  • ***
  • Posts: 7992
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #40 on: May 11, 2022, 01:31:13 pm »
Did any socketed Intel CPUs have these problems? It just seems a bit weird.

Intel CPUs and chipsets have suffered several similar failure modes over the last decade. Whether it's socketed or not has no bearing on the failure or how 'proper' it is.

Which ones? When I say the proper way I mean with a socketed CPU which can at least be easily replaced.

Cougar Point chipsets, for example.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #41 on: May 14, 2022, 10:26:55 pm »
Thanks. Didn't know about that one either.
Just built a server based on Cougar bastard Point!
Losing the 3Gb SATAs will not be the end of the world in this application fortunately.
Grrr.
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #42 on: September 18, 2022, 11:55:29 pm »
Couple of notes on this:

Supermicro is said to still be doing free (customer ships to Supermicro at customer's expense) RMAs even out of warranty. Worth a shot for anyone with affected Supermicro motherboards: https://www.truenas.com/community/threads/fyi-intel-c2000-family-of-processors-system-fault-may-lead-to-dead-system.50314/post-707843

The workaround that was/is applied by some vendors (those that don't replace the SoC wholesale) involves pulling up not just the LPC clock, but also the LPC data lines. I believe that this is because the degradation is not limited to the clock pins: there have been cases of people with boards that still boot, but where the host cannot communicate with the BMC over LPC (Note 1), and also cases (mine included) where a dead board is brought back to life with the external pull-up, but the BMC is unreachable over LPC.

I'm still trying to wrap my head around why exactly systems fail to boot (see also link above). It turns out you can't boot from LPC, only from SPI - but this behavior is configurable by setting a pin of the SoC (FLEX_CLK_SE0 / AH59). At boot, the SoC pulls it up with 20k and reads the pin - high is SPI boot, low is LPC boot.
I've come to suspect that this clock, and not the LPC clock proper, is being used by most/all vendors to run the LPC bus, and thus being run out to the TPM header. This would probably be needed to enable a 33 MHz LPC bus, because the LPC clock pins are fixed at 25 MHz.
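To make the strap read described above concrete, here is a rough Python sketch of the resistor divider formed by the 20k internal pull-up and a hypothetical external resistor to ground. Only the 20k figure and the "high is SPI, low is LPC" rule come from the text; the 3.3 V rail and the 70% / 30% logic thresholds are my assumptions for illustration and are not taken from any datasheet.

```python
R_PULLUP_OHM = 20_000  # from the post: the SoC pulls the strap pin up with 20k
V_RAIL = 3.3           # ASSUMPTION: I/O rail voltage, not confirmed by the post


def strap_voltage(r_ext_to_gnd_ohm=None, v_rail=V_RAIL, r_pullup_ohm=R_PULLUP_OHM):
    """Voltage on the strap pin given an optional external resistor to ground."""
    if r_ext_to_gnd_ohm is None:
        # Nothing loads the pin, so the internal pull-up takes it to the rail.
        return v_rail
    # Plain resistor divider: pull-up on top, external resistor on the bottom.
    return v_rail * r_ext_to_gnd_ohm / (r_pullup_ohm + r_ext_to_gnd_ohm)


def boot_source(v_pin, v_rail=V_RAIL, high_frac=0.7, low_frac=0.3):
    """Map a pin voltage to a boot source using ASSUMED logic thresholds."""
    if v_pin >= high_frac * v_rail:
        return "SPI"        # pin reads high
    if v_pin <= low_frac * v_rail:
        return "LPC"        # pin reads low
    return "undefined"      # between thresholds, the read is not guaranteed
```

Under these assumptions, an unloaded pin reads high (SPI boot), while a 1k resistor to ground pulls it to roughly 0.16 V and reads low (LPC boot). A marginal external load, or a degraded pin that can no longer hold a clean level, would land in the undefined band.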

Note 1: The BMC has a bunch of tentacles into the host system. PCIe is used for graphics and USB is used to provide remote I/O devices. LPC is used to provide SuperIO functionality and in-band management of the BMC.
 

Offline nasjinx

  • Newbie
  • Posts: 1
  • Country: ca
Re: Intel Atom C2000 Failures
« Reply #43 on: December 28, 2022, 07:38:55 pm »
Does anyone know a permanent fix for this issue other than replacing CPU? As per Intel's errata, a platform level change was suggested as a workaround.

I guess Intel provided a new CPU where the issue was fixed at the silicon level.

Thanks
 

Offline wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #44 on: December 28, 2022, 07:44:16 pm »
Does anyone know a permanent fix for this issue other than replacing CPU? As per Intel's errata, a platform level change was suggested as a workaround.

I guess Intel provided a new CPU where the issue was fixed at the silicon level.

Thanks
For a faulty system there is no fix other than the already mentioned resistor or a CPU replacement. "Platform level change" means a circuit redesign to avoid the issue completely.
 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #45 on: February 10, 2023, 12:34:55 pm »
Hi everyone! Greetings from Spain.
I have a DELL S4048-ON switch that crashed after a shutdown and reboot. I think I'm affected by this "Clockgate" from Intel: I cannot get any data via the serial console and the switch tries to boot in a loop. I searched the web and saw the same problem reported on this model (2015 factory date).
I was wondering if anyone has the S4048-ON's schematics so I could find LPC_CLKOUT on the mainboard. I see a connector pinout near the microprocessor (pic attached); probably one of those pins is LPC_CLKOUT.

Any info is welcome!

thank you all for reading :-)

 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #46 on: February 20, 2023, 12:07:13 pm »
Anyone?  :)
 

Offline bson (Topic starter)

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #47 on: February 23, 2023, 11:17:47 pm »
Stick a scope probe to it?
 

Offline wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #48 on: February 24, 2023, 08:02:53 am »
Stick a scope probe to it?
It's pointless, as you will see no signal once the LPC clock is dead.
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #49 on: February 25, 2023, 01:27:58 am »
Actually, you're likely to see a tiny, mangled clock almost down in the noise. That's what I got on my C2758 before the RMA - it was easily missed, but it was there if you looked closely. Stupidly, I did not save a screenshot of the waveform.
 

