Author Topic: Intel Atom C2000 Failures  (Read 34374 times)

Offline bson (Topic starter)

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Intel Atom C2000 Failures
« on: September 27, 2017, 09:12:48 pm »
Does anyone know more about the Atom C2000 family failures?  In the errata sheet, Intel states:
Quote
AVR54. System May Experience Inability to Boot or May Cease Operation
Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock
outputs) may stop functioning.
Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.
Workaround: A platform level change has been identified and may be implemented as a workaround
for this erratum.
Status: For the steppings affected, see Table 1, “Errata Summary Table” on page 9.

The LPC is a PCI-to-ISA bridge controller and is one of the two supported BIOS boot locations; the C2000 can either boot from SPI (default) or LPC/ISA (set via external sense pins at powerup).  This is fixed in a later stepping, and curiously the "fix" consists of eliminating the ability to mux the LPC bus pins with GPIO - they are no longer software selectable.  This is pretty much ALL I've been able to find on the subject.  There's a workaround which consists of adding an external 100 ohm resistor, but it's not clear which pins it is added to.  It's added across two pads on a connector on some Synology NAS units, so it's not an output current limiter but almost certainly a stiff pullup or pulldown.  This leads me to suspect it really goes on a configuration sense pin.  Intel hasn't made their "platform level change" public.  Tracing it out on a board is kind of hard since the SoC is a large BGA package that would need to be desoldered.

Does anyone know more about this?  Like, for example, where the resistor is added - in particular is it added to the LPC clock outputs, or to the sense inputs? 

It's also not clear if the clock output actually fails, or this is merely a convenient symptom any engineer with a scope can identify.  (The two LPC clocks are only 25MHz.)  Some possible root causes I can think of are:

1. The sense input pullup is underdimensioned and fails, resulting in the CPU trying to fetch boot firmware from SPI.
2. The sense configures it for LPC boot while the pins are reset to GPIO, resulting in duplicate pin drivers that short out internally.
3. 1+2 -  multiple sense inputs with slightly different thresholds result in inconsistent pin configuration with both pin drivers enabled.
4. The clock pin output driver actually dies.

#4 sounds simple and straightforward, but somewhat implausible to me.  This isn't Intel's first rodeo, and besides how would an external resistor help with this?

Here's the C2000 family datasheet:
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-c2000-microserver-datasheet.pdf

Errata:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-c2000-family-spec-update.pdf
« Last Edit: September 27, 2017, 09:15:02 pm by bson »
 
The following users thanked this post: rthorntn

Offline amyk

  • Super Contributor
  • ***
  • Posts: 8275
Re: Intel Atom C2000 Failures
« Reply #1 on: September 28, 2017, 02:23:14 am »
 
The following users thanked this post: rthorntn

Online Monkeh

  • Super Contributor
  • ***
  • Posts: 7992
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #2 on: September 28, 2017, 04:31:04 am »
Quote from: bson on September 27, 2017, 09:12:48 pm
#4 sounds simple and straightforward, but somewhat implausible to me.  This isn't Intel's first rodeo, and besides how would an external resistor help with this?

From what I've been able to gather, this is exactly what's happening - the resistor is connecting a different clock pin to the LPC bus. This is not the first time Intel has had consistent aging failures like this.
 
The following users thanked this post: rthorntn

Offline bernroth

  • Regular Contributor
  • *
  • Posts: 126
  • Country: de
Re: Intel Atom C2000 Failures
« Reply #3 on: November 28, 2017, 03:40:02 pm »
I made a post some time ago:

https://supportforums.cisco.com/t5/firewalling/clock-signal-repair-pictures-isr4300-asa-isr4400/m-p/3088505/highlight/false?attachment-id=107384

The fix applied by Cisco is a 110 ohm pull-up resistor from either LPC_CLKOUT0 or LPC_CLKOUT1 to +3.3V.

I don't want to talk about their repair quality  |O

Currently I am trying to fix another Cisco router. I found the signals LPC_CLKOUT0 and LPC_CLKOUT1.

Does anyone know which one of these pins requires the pull-up?
Maybe I'll just put two pull-up resistors :)

 
The following users thanked this post: rthorntn

Offline hfiennes

  • Newbie
  • Posts: 2
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #4 on: October 07, 2019, 09:01:13 pm »
Ok, a very late reply but my Synology DS415+ box died, and the 100 ohm resistor (across pins 1 & 6 of a 12 pin, 2mm header) made it work again.

It's all back together now so I can't really verify this theory, but do we know *how* the clock output dies? The Register article quotes Intel as saying "a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service."

https://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/

Could the issue be the PFET in the clock driver dying? If that was the case, a strong pull-up on the clock line - meaning only the NFET needs to be functional to get a clock out of the pin - would indeed solve the issue. This seems to square with the Cisco fix too - it's just a strong pull-up.

(it also means that there's 33mA being sunk by the driver 50% of the time, which makes me fear for the longevity of the fix - and whether something like the smart pull-up on an I2C FM+ bus would stress the C2000 less)
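
The 33mA figure is just Ohm's law for a 100 ohm pull-up to +3.3V while the low-side FET holds the line low. A minimal sketch of that arithmetic, assuming an ideal low-side switch (0 V when on) and a 50% duty clock; the function names are made up for illustration:

```python
def pullup_current_ma(v_rail: float, r_ohms: float) -> float:
    """Current sunk by the low-side driver while it holds the line at 0 V.

    Assumption: the driver is an ideal switch with no on-resistance.
    """
    return v_rail / r_ohms * 1000.0


def avg_resistor_power_mw(v_rail: float, r_ohms: float, duty_low: float = 0.5) -> float:
    """Average power burned in the pull-up, assuming the line is low for
    `duty_low` of each clock period and at the rail for the rest."""
    return (v_rail ** 2) / r_ohms * duty_low * 1000.0


if __name__ == "__main__":
    # 100 ohm (Synology) and 110 ohm (Cisco) are the values reported above.
    for r in (100, 110):
        print(f"{r} ohm: {pullup_current_ma(3.3, r):.1f} mA while low, "
              f"{avg_resistor_power_mw(3.3, r):.2f} mW average in the resistor")
```

A real driver has some on-resistance, so the actual current would be somewhat lower than this ideal-switch estimate.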
« Last Edit: October 07, 2019, 09:06:04 pm by hfiennes »
 
The following users thanked this post: rthorntn

Offline rthorntn

  • Frequent Contributor
  • **
  • Posts: 400
  • Country: au
Re: Intel Atom C2000 Failures
« Reply #5 on: November 18, 2019, 05:42:45 am »
I have five c2xxx Supermicro motherboards, one is a "2013" A1SAi-2750F and the rest are "2015" A1SAi-2550F's.

Basically I'm wondering out loud if I should preemptively mod these, or just wait for them to die and fix them, I don't run them 24/7 atm and I wouldn't put anything business critical on them now, how would one go about figuring out where to stick the resistor?

Thanks.
« Last Edit: November 18, 2019, 06:12:29 am by rthorntn »
 

Offline EEVblog

  • Administrator
  • *****
  • Posts: 37740
  • Country: au
    • EEVblog
Re: Intel Atom C2000 Failures
« Reply #6 on: February 26, 2020, 06:24:23 am »
Quote from: hfiennes on October 07, 2019, 09:01:13 pm
Ok, a very late reply but my Synology DS415+ box died, and the 100 ohm resistor (across pins 1 & 6 of a 12 pin, 2mm header) made it work again.

It's all back together now so I can't really verify this theory, but do we know *how* the clock output dies? The Register article quotes Intel as saying "a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service."

https://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/

Could the issue be the PFET in the clock driver dying? If that was the case, a strong pull-up on the clock line - meaning only the NFET needs to be functional to get a clock out of the pin - would indeed solve the issue. This seems to square with the Cisco fix too - it's just a strong pull-up.

(it also means that there's 33mA being sunk by the driver 50% of the time, which makes me fear for the longevity of the fix - and whether something like the smart pull-up on an I2C FM+ bus would stress the C2000 less)

I just shot a video on this after finding a DS415+ in the dumpster!
Yes, the resistor fix works, and I assumed it was bypassing a clock somehow but couldn't trace exact details.
 
The following users thanked this post: rthorntn, hsn93, Marco1971

Offline Mazian

  • Newbie
  • Posts: 1
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #7 on: May 14, 2020, 06:46:35 pm »
Quote from: rthorntn on November 18, 2019, 05:42:45 am
I have five c2xxx Supermicro motherboards, one is a "2013" A1SAi-2750F and the rest are "2015" A1SAi-2550F's.

Basically I'm wondering out loud if I should preemptively mod these, or just wait for them to die and fix them, I don't run them 24/7 atm and I wouldn't put anything business critical on them now, how would one go about figuring out where to stick the resistor?

Bit of a late followup, but... I'm also running an A1SAi-2750F, and a friend pointed me at the very helpful DS415+ video.  A user on another forum found the pins for a similar board, and the manual for the A1SAi boards has the same header shown on page 43, the JTPM1 header.  I made a 100 ohm jumper and popped it onto the board across pins 1 (LPC clock) and 9 (+3.3V):



Can't be 100% sure it's doing anything, since my board hadn't died yet, but at least it didn't make it worse!
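
For anyone keeping track, the resistor placements reported so far in this thread can be collected in a small lookup table. This is only a sketch: the `ReportedFix` class and its field names are invented for illustration, and the entries simply restate what posters reported, not verified pinouts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReportedFix:
    """One workaround as reported by a poster (hypothetical record layout)."""
    platform: str
    r_ohms: int
    connection: str
    header: Optional[str] = None  # None where no header was named


# Entries restate reports from this thread; none are independently verified.
REPORTED_FIXES: List[ReportedFix] = [
    ReportedFix("Synology DS415+", 100, "pins 1 and 6", "12-pin 2mm header"),
    ReportedFix("Cisco router", 110, "LPC_CLKOUT0 or LPC_CLKOUT1 to +3.3V"),
    ReportedFix("Supermicro A1SAi-2750F", 100,
                "pin 1 (LPC clock) to pin 9 (+3.3V)", "JTPM1"),
]


def fixes_for(keyword: str) -> List[ReportedFix]:
    """Case-insensitive search of the platform name."""
    return [f for f in REPORTED_FIXES if keyword.lower() in f.platform.lower()]


if __name__ == "__main__":
    for fix in fixes_for("supermicro"):
        print(f"{fix.platform}: {fix.r_ohms} ohm, {fix.connection} ({fix.header})")
```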
 
The following users thanked this post: rthorntn, awallin, Per Hansson, Grunchy, Foxxz, charly

Offline awallin

  • Frequent Contributor
  • **
  • Posts: 694
Re: Intel Atom C2000 Failures
« Reply #8 on: May 15, 2020, 10:44:35 am »
Thanks for posting this!  :-+

We tried this on our SuperMicro C2000s, and it does work! One of these was run 24/7 since 2015 and died last week - now back from the dead  8)
IIRC these are SuperMicro 5018A-MLTN4 with A1SAM-2550F mb.
 
The following users thanked this post: rthorntn, Grunchy

Offline Grunchy

  • Newbie
  • Posts: 5
  • Country: ca
Re: Intel Atom C2000 Failures
« Reply #9 on: July 26, 2020, 12:14:30 am »
Supposedly the Western Digital DL2100, DL4100, and DX4200 are also susceptible to these catastrophic disasters.
However Western Digital has apparently taken the attitude that they are not going to do anything other than let the chips fall where they may and people can use their warranty protection (or not).
https://translate.google.com/translate?sl=nl&tl=en&u=https%3A%2F%2Fnl.hardware.info%2Fnieuws%2F51213%2Fwestern-digital-geeft-statement-over-c2000-defecten-onderneemt-geen-actie

However, I do have a dead DL4100 and I'm game to try wiring in a 110 ohm resistor between +3.3V and LPC_CLKOUT0/1. Why not, the box is dead anyway.
However, some smart wag said he heard there is no pin connection from anywhere on the motherboard to either of the LPC_CLKOUT pins. Would we DX4200/DL2100/DL4100 owners therefore all be SOL?
Rats.
 
The following users thanked this post: rthorntn

Offline nigelwright7557

  • Frequent Contributor
  • **
  • Posts: 690
  • Country: gb
    • Electronic controls
Re: Intel Atom C2000 Failures
« Reply #10 on: July 27, 2020, 09:29:21 pm »
I do like a good bodge to get things working.
I recently messed up a PCB by accidentally swapping the power pins to an IC.
Luckily it was a cut and strap to fix it.
With the onset of cheap Chinese PCBs it isn't always an expensive problem.
 
The following users thanked this post: rthorntn

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #11 on: July 27, 2020, 10:01:13 pm »
Quote from: hfiennes on October 07, 2019, 09:01:13 pm
Ok, a very late reply but my Synology DS415+ box died, and the 100 ohm resistor (across pins 1 & 6 of a 12 pin, 2mm header) made it work again.

It's all back together now so I can't really verify this theory, but do we know *how* the clock output dies? The Register article quotes Intel as saying "a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service."

https://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/

Could the issue be the PFET in the clock driver dying? If that was the case, a strong pull-up on the clock line - meaning only the NFET needs to be functional to get a clock out of the pin - would indeed solve the issue. This seems to square with the Cisco fix too - it's just a strong pull-up.

(it also means that there's 33mA being sunk by the driver 50% of the time, which makes me fear for the longevity of the fix - and whether something like the smart pull-up on an I2C FM+ bus would stress the C2000 less)

Quote from: EEVblog on February 26, 2020, 06:24:23 am
I just shot a video on this after finding a DS415+ in the dumpster!
Yes, the resistor fix works, and I assumed it was bypassing a clock somehow but couldn't trace exact details.

As I understand it, it's simply a strong pull-up which mitigates/replaces a dead high side switch in CPU.
 
The following users thanked this post: rthorntn

Offline pf

  • Newbie
  • Posts: 1
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #12 on: January 10, 2021, 04:49:44 am »
On my DS415+ I tried various values of the resistor between pins 1 and 6. Starting with 100 ohms and increasing the value in roughly 100-ohm steps, I found that 750 ohms was sufficient to restore the LPC clock signal. 1000 ohms did not restore the LPC clock signal.
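
One way to think about why there is an upper limit on the resistor: if the high-side driver is effectively dead, the rising edge is just the pull-up charging the line capacitance, and it has to cross the input-high threshold well within half a clock period. A rough sketch of that estimate, under loud assumptions: the PFET contributes nothing, the NFET is an ideal switch, the threshold is taken as 2.1 V, the clock is 25MHz, and the load capacitance values are guesses, not measurements.

```python
import math


def rise_time_ns(r_ohms: float, c_farads: float,
                 v_rail: float = 3.3, v_ih: float = 2.1) -> float:
    """Time for an RC pull-up to charge the line from 0 V to v_ih, in ns.

    Uses v(t) = v_rail * (1 - exp(-t / RC)) solved for v(t) = v_ih.
    """
    return r_ohms * c_farads * math.log(v_rail / (v_rail - v_ih)) * 1e9


def half_period_ns(f_hz: float) -> float:
    """Half of one clock period, in ns."""
    return 0.5 / f_hz * 1e9


if __name__ == "__main__":
    budget = half_period_ns(25e6)
    for c_pf in (10, 20, 30):          # assumed load capacitance, pF
        for r in (100, 750, 1000):     # resistor values tried above
            t = rise_time_ns(r, c_pf * 1e-12)
            verdict = "within" if t < budget else "exceeds"
            print(f"C={c_pf} pF, R={r} ohm: {t:.1f} ns ({verdict} {budget:.0f} ns half period)")
```

This is only a back-of-envelope model. A partly degraded driver may still source some current, and the receiver also needs the clock to stay high for a while after crossing the threshold, so the real cutoff depends on details this sketch does not capture.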
 
The following users thanked this post: rthorntn

Offline nevusZ

  • Newbie
  • Posts: 2
  • Country: at
Re: Intel Atom C2000 Failures
« Reply #13 on: January 16, 2021, 09:52:04 am »
https://www.eevblog.com/forum/repair/barracuda-networks-f80-repair-(atom-c2000-bug)/

hello,
how can i find out if i can fix my board the same way?
 
The following users thanked this post: rthorntn

Offline hindenbugbite

  • Newbie
  • Posts: 8
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #14 on: May 13, 2021, 03:11:34 am »
I was recently hit by this bug on a QNAP TS-251+ with an Intel Celeron J1900. The resistor mod did work. In my case it was a 180 ohm resistor to ground.

The scope traces I took before and after show the strong pull down resistor loading down both the high and low level of the clock, in effect pulling it back into CMOS input threshold range:
(Attachments: scope traces before and after the mod)

While I'm glad it works again, I don't believe this to be a reliable long-term fix. Intel's errata noted that this issue doesn't occur if LPC is 1.8V which suggests the lower power reduces degradation. The strong pull down here only continues to stress the push-pull FETs, of which the pull part has already degraded.

If more time permits, I might try to disconnect the LPC clock from the CPU and insert an inline buffer to regenerate the clock. A buffer should help reduce the load current that the output has to provide, which hopefully will provide a longer term fix. The QNAP has an impedance matching resistor between the clock output and the rest of the board so it is possible to buffer the clock as long as latency is kept in check (maybe < 10ns). The issue is that the clock now has a 1V offset which won't register with most CMOS logic. One option is to go analog and use an op-amp to offset the clock back down near 0V and even add gain to make it full swing again.
 
The following users thanked this post: rthorntn, edavid

Offline Whales

  • Super Contributor
  • ***
  • Posts: 1899
  • Country: au
    • Halestrom
Re: Intel Atom C2000 Failures
« Reply #15 on: May 13, 2021, 03:38:58 am »
Woah, I have a bad Intel board with soldered processor.  Last I checked the LPC wouldn't get to a state where it output the right signals for things to continue.  Going to get it out from storage and scope the clocks now!  Thank you for this topic; even if it doesn't end up being the cause I'm sure to have some fun.

Offline jesse1329

  • Newbie
  • Posts: 4
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #16 on: January 05, 2022, 12:16:55 am »
Hi!

I was troubleshooting a Gigabyte motherboard with the Celeron J1900. I couldn't figure out what the problem was keeping it from booting. I pretty much decided the SoC was not enabling the bus to talk to the BIOS chip. I had the strange idea to probe the LPC clock pin and found the scope image I've attached. That's one sick puppy. So I searched for LPC clock on J1900, and realize now this is an old issue I've heard about.

I tried a 150 ohm pull-down thinking that might get some devices switching but as you can see, it doesn't even swing rail to rail. Pull-up and pull-down? Or is the only way to inject a new 25 MHz clock?
 

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 123
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #17 on: January 05, 2022, 02:11:30 am »
Quote from: awallin on May 15, 2020, 10:44:35 am
Thanks for posting this!  :-+

We tried this on our SuperMicro C2000s, and it does work! One of these was run 24/7 since 2015 and died last week - now back from the dead  8)
IIRC these are SuperMicro 5018A-MLTN4 with A1SAM-2550F mb.
(Attachment Link)

Wow! I have been running one 24/7 since 2015 as well and just before Christmas it became completely unresponsive and dead as a doornail. Nothing on the screen. Reset/powercycle did nothing.

I replaced the system with something more modern but I wouldn't mind reviving this system.
 

Offline hindenbugbite

  • Newbie
  • Posts: 8
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #18 on: January 05, 2022, 04:10:01 am »
Quote from: jesse1329 on January 05, 2022, 12:16:55 am
I tried a 150 ohm pull-down thinking that might get some devices switching but as you can see, it doesn't even swing rail to rail. Pull-up and pull-down? Or is the only way to inject a new 25 MHz clock?

I am skeptical that injecting a standalone 25MHz clock would lead to a solution here because this is a bus clock and needs to be in phase with the data line. Your scope trace might have both a voltage and a timing problem if it really is a saw tooth. 3V CMOS needs below 0.8V for low and above 2.1V for high, so I would have guessed a pull down would have done something too. But if the timing has shifted from the edge quality then it would be hard to recover.
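
To make the voltage side of that concrete, here is a small sketch that checks measured clock levels against the two thresholds quoted above (below 0.8V for low, above 2.1V for high). The function name and the example readings are illustrative only; the timing side of the problem is not modeled at all.

```python
from typing import List

V_IL_MAX = 0.8  # volts; low level must be below this (figure quoted above)
V_IH_MIN = 2.1  # volts; high level must be above this (figure quoted above)


def level_problems(v_low: float, v_high: float) -> List[str]:
    """Return a list of threshold violations for the measured clock levels."""
    problems = []
    if v_low >= V_IL_MAX:
        problems.append(f"low level {v_low:.2f} V is not below {V_IL_MAX} V")
    if v_high <= V_IH_MIN:
        problems.append(f"high level {v_high:.2f} V is not above {V_IH_MIN} V")
    return problems


if __name__ == "__main__":
    # Illustrative readings, not measurements from any board in this thread.
    examples = {
        "full swing": (0.0, 3.3),
        "1 V offset on the low level": (1.0, 3.3),
        "weak high level": (0.2, 1.9),
    }
    for name, (lo, hi) in examples.items():
        issues = level_problems(lo, hi)
        print(f"{name}: {'OK' if not issues else '; '.join(issues)}")
```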

Just curious, could the saw tooth trace be due to your scope/probe bandwidth?
 
The following users thanked this post: jesse1329

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #19 on: January 05, 2022, 11:29:26 am »
Quote from: jesse1329 on January 05, 2022, 12:16:55 am
I tried a 150 ohm pull-down thinking that might get some devices switching but as you can see, it doesn't even swing rail to rail. Pull-up and pull-down? Or is the only way to inject a new 25 MHz clock?

It should be a pull-up, not pull-down.
 

Offline jesse1329

  • Newbie
  • Posts: 4
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #20 on: January 05, 2022, 05:38:51 pm »
Quote from: hindenbugbite on January 05, 2022, 04:10:01 am
Just curious, could the saw tooth trace be due to your scope/probe bandwidth?

Yes. I checked and I needed a 10X probe to view this one correctly with my scope. So this is what a healthy clock looks like, I suppose. I'm going to check the BIOS chip more closely. LPC bus comes out of reset on power on, but no data observed, but the BIOS chip is on SPI up to 100MHz. Still wary about the SoC issue as there is no other explanation, so I'll keep checking.
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #21 on: January 05, 2022, 05:41:47 pm »
Quote from: jesse1329 on January 05, 2022, 05:38:51 pm
Yes. I checked and I needed a 10X probe to view this one correctly with my scope.

Probe in 1X mode is way worse than a simple coaxial cable. It's good up to a few MHz (sine) at most.
 

Offline hindenbugbite

  • Newbie
  • Posts: 8
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #22 on: January 05, 2022, 08:26:38 pm »
Quote from: jesse1329 on January 05, 2022, 05:38:51 pm
Yes. I checked and I needed a 10X probe to view this one correctly with my scope. So this is what a healthy clock looks like, I suppose. I'm going to check the BIOS chip more closely. LPC bus comes out of reset on power on, but no data observed, but the BIOS chip is on SPI up to 100MHz. Still wary about the SoC issue as there is no other explanation, so I'll keep checking.

That's definitely a better looking clock for 3.3V CMOS. I don't recall what the initial traffic on the LPC bus looks like but it would seem logical that the BIOS is waiting for the CPU to query it. So there should be some data packet to get things started.

While this bug is mainly attributed to the LPC CLK line, I believe it can affect any of the LPC lines. The CLK is the first to fail because it is continuous and sees the most transitions.
 

Offline Foxxz

  • Regular Contributor
  • *
  • Posts: 123
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #23 on: January 20, 2022, 11:16:48 pm »
Quote from: rthorntn on November 18, 2019, 05:42:45 am
I have five c2xxx Supermicro motherboards, one is a "2013" A1SAi-2750F and the rest are "2015" A1SAi-2550F's.

Basically I'm wondering out loud if I should preemptively mod these, or just wait for them to die and fix them, I don't run them 24/7 atm and I wouldn't put anything business critical on them now, how would one go about figuring out where to stick the resistor?

Bit of a late followup, but... I'm also running an A1SAi-2750F, and a friend pointed me at the very helpful DS415+ video.  A user on another forum found the pins for a similar board, and the manual for the A1SAi boards has the same header shown on page 43, the JTPM1 header.  I made a 100 ohm jumper and popped it onto the board across pins 1 (LPC clock) and 9 (+3.3V):



Can't be 100% sure it's doing anything, since my board hadn't died yet, but at least it didn't make it worse!


Can confirm this fixed the dead system I had
 

Offline jesse1329

  • Newbie
  • Posts: 4
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #24 on: January 26, 2022, 04:26:05 am »
Quote from: hindenbugbite on January 05, 2022, 08:26:38 pm
That's definitely a better looking clock for 3.3V CMOS. I don't recall what the initial traffic on the LPC bus looks like but it would seem logical that the BIOS is waiting for the CPU to query it. So there should be some data packet to get things started.

While this bug is mainly attributed to the LPC CLK line, I believe it can affect any of the LPC lines. The CLK is the first to fail because it is continuous and sees the most transitions.

I am able to see some activity on the SPI bus for the BIOS chip just before it reboots. The BIOS seems ok on my scope from what I saw; I haven't bothered putting my logic analyzer on there, since I don't think it'll matter. So there isn't really much left to check. I'm curious what other I/Os are affected on the SoC. I thought I saw the RTC might be affected. Who knows what else could be? I don't think Intel really explained what fully is affected. I am thinking these chips are ticking time bombs, depending on the board implementation and use case. I had someone contact me saying their identical board died too in the last few weeks, some time after I put out my questions on what has gone wrong.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #25 on: April 09, 2022, 10:44:39 pm »
Do you mind me asking which Gigabyte board you are / were troubleshooting? I have a bunch of J1900-D2H boards in various gear I've built and just had a failure (in a pfSense box).
I did know of the issue but hadn't realised it affected the J1900 SOCs.
That'll teach me for buying boards with soldered CPUs because 'Intel CPUs almost never fail' (which seemed to be a fact before now).
Mine has a 'debug header', one of the pins is a clock but is this the LPC clock?
Failure mode on mine was an NMI error followed by panic, machine rebooted but locked up halfway through. Then no POST ever again.
These are getting hard to find now and can't find a modern replacement either. Will switch to socketed CPUs for these systems as needed.

Thanks for any help you could give.
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #26 on: April 10, 2022, 07:51:55 am »
so is this a silicon failure?
Yes, a silicon failure that has workarounds.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #27 on: April 16, 2022, 11:07:48 pm »
2nd one now. Another pfSense router. The exact same errors:
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic
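
For checking saved logs from other boxes for the same signature, a short scan like the following could help. It is a sketch: the regular expressions are written only against the five lines pasted above, and a match flags the log signature, not the root cause.

```python
import re
from typing import List

# Patterns derived solely from the log excerpt above.
NMI_PATTERNS = (
    re.compile(r"NMI ISA [0-9a-fA-F]+, EISA [0-9a-fA-F]+"),
    re.compile(r"panic: NMI indicates hardware failure"),
)


def find_nmi_lines(log_text: str) -> List[str]:
    """Return the log lines that match any of the NMI signature patterns."""
    return [line for line in log_text.splitlines()
            if any(p.search(line) for p in NMI_PATTERNS)]


SAMPLE = """\
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic
"""

if __name__ == "__main__":
    for hit in find_nmi_lines(SAMPLE):
        print(hit)
```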

This time on a no-name little box that's been running perfectly for a few years now.

I think there is no proper fix for this but to replace the CPU / board. Seems all cases / parts are designed for gaming these days, so I will buy >10 year old stuff to do it.
The industry is going weird and not in a good way (kind of like the world in general).

Any ideas / thoughts?
 

Offline darkspr1te

  • Frequent Contributor
  • **
  • Posts: 290
  • Country: zm
Re: Intel Atom C2000 Failures
« Reply #28 on: April 17, 2022, 06:22:49 am »
Quote from: dtsystems on April 16, 2022, 11:07:48 pm
2nd one now. Another pfSense router. The exact same errors:
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic

This time on a no-name little box that's been running perfectly for a few years now.

I think there is no proper fix for this but to replace the CPU / board. Seems all cases / parts are designed for gaming these days, so I will buy >10 year old stuff to do it.
The industry is going weird and not in a good way (kind of like the world in general).

Any ideas / thoughts?

I also have had my units fail: my SG-2440, then its bigger-ported sibling.
One unit which had been in stores for years as it was upgraded got pulled out and put into service, only to last a few days before the same kernel error, and it's dead.
All had the updated BIOS meant to mitigate the issue but alas it seems it did not.

darkspr1te
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #29 on: April 17, 2022, 11:24:41 am »
Quote from: dtsystems on April 16, 2022, 11:07:48 pm
2nd one now. Another pfSense router. The exact same errors:
Apr 16 20:15:37 pfSense kernel: NMI ISA 60, EISA 0
Apr 16 20:15:37 pfSense kernel: I/O channel check, likely hardware failure.
Apr 16 20:15:37 pfSense kernel: panic: NMI indicates hardware failure
Apr 16 20:15:37 pfSense kernel: cpuid = 3
Apr 16 20:15:37 pfSense kernel: KDB: enter: panic

AFAIK with this particular CPU failure it won't get this far.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #30 on: April 17, 2022, 05:25:39 pm »
Quote from: wraper on April 17, 2022, 11:24:41 am
AFAIK with this particular CPU failure it won't get this far.

I'm not sure what else the failure could be though. The first board (a Gigabyte J1900-D2H) which failed did those errors a couple of times then wouldn't POST about a day later.
Checked all voltages and anything else I could think of. Still no POST. Replaced board (hard to find) and system works.
Kicking myself for trusting Intel so much. I have these boards / CPUs all over the place. Routers, storage servers and Asterisk PBXs. And it's so hard to find a drop-in replacement although I have managed to get a couple of boards.
It's annoying as these were perfect for the applications I'm using them for and (were) super stable.
One of those tech gotchas I suppose. At least the issue doesn't corrupt the SATA drives or anything.
 

Offline amyk

  • Super Contributor
  • ***
  • Posts: 8275
Re: Intel Atom C2000 Failures
« Reply #31 on: April 17, 2022, 07:49:54 pm »
That might be something else, I'm not sure what else they put on the LPC bus on these mobos. It's usually TPM, and possibly BIOS (hence the boot problems).
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #32 on: April 17, 2022, 10:34:31 pm »
Quote from: amyk on April 17, 2022, 07:49:54 pm
That might be something else, I'm not sure what else they put on the LPC bus on these mobos. It's usually TPM, and possibly BIOS (hence the boot problems).

Could be, don't know. It's hard to find documentation or schematics of these boards / chips.
There's not much on these boards which isn't on the CPU - there is Power, BIOS, RAM, PCIE slots, all of this goes to the CPU / SOC directly. I'm sure the BIOS is on the LPC on these.
Will have a butchers about with the broken boards, but I don't trust them anymore and they will all be getting replaced. Asus P8B-M with Xeon E-1220L V2 seems to be the best bet (ATX though not ITX).
Incredible that there isn't a modern equivalent.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #33 on: April 24, 2022, 12:59:18 am »
To add to this:- 2nd system as above failed shortly after I enabled PowerD (enable speedstep). Because I thought it would help. Once it cooled down a bit, it failed.
Both failures I have had have been on pfSense systems which by default run the CPU clock at the maximum, I don't know if that's related.
Disabled PowerD again and it's been stable for a week. Tomorrow I'm replacing it with a 'new' Atom E3845 box. I know that is still vulnerable to this kind of failure (D0 stepping)
Just thought I'd throw out there that it seems to be temp related somehow.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #34 on: April 25, 2022, 12:04:35 am »
It is temperature-related, I installed my new pfSense box with the config copied over from the old one. Didn't work for whatever reason (interface names etc)
Tried to go back to the old one which had cooled down for an hour or so. Didn't boot. Tried heating it up. Didn't work.
This crap is so annoying.
 

Offline jesse1329

  • Newbie
  • Posts: 4
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #35 on: May 02, 2022, 02:08:06 am »
I had a GA-J1900N-D3V. They probably have the same basic design. It's probably that the BIOS connected to the LPC bus killed the IOs on the CPU due to the silicon issue. And it's sad I didn't realize it earlier. I remember reading the news when it first broke years ago on NAS systems, saying just don't use USB! I didn't use USB on my boards, so I thought with that "work-around" it wouldn't happen. That might be true on some boards. But frankly, at this point, we don't know all the IOs that are bugged on this design in the errata notice. Can't trust that Intel fully explained the problem. I think people should just junk these SoC boards and not chance it if they are running them as critical servers.

And the very telling thing is our boards are failing at the same time. I've had other people contact me saying the same thing happened to them.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #36 on: May 08, 2022, 11:23:01 pm »
The D3V is the one with the PCI slot, right?
D2H is essentially the same but with a 1x PCIe slot as far as I know.
Interestingly, it has been my pfSense machines which have died. One possibility is that by default pfSense disables C-states / speedstep.
But yeah, I no longer trust these little integrated-CPU boards. Will go back to doing it the proper way.
Did any socketed Intel CPUs have these problems? It just seems a bit weird.
 

Offline bson (Topic starter)

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #37 on: May 09, 2022, 01:08:35 am »
My Synology DS1515+ died, and I replaced it last week with a DS1821+.  (Just moved the drives over, powered on, and followed the in-browser migration instructions.  Mostly consisted of clicking 'next'.)  The 1515 replaced an earlier one with the CPU failure (this one had a PSU failure).  Opened it up, and sure enough the 1515 I scrapped had the resistor hack.
 

Online Monkeh

  • Super Contributor
  • ***
  • Posts: 7992
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #38 on: May 09, 2022, 02:36:30 am »
Quote from: dtsystems on May 08, 2022, 11:23:01 pm
Did any socketed Intel CPUs have these problems? It just seems a bit weird.

Intel CPUs and chipsets have suffered several similar failure modes over the last decade. Whether it's socketed or not has no bearing on the failure or how 'proper' it is.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #39 on: May 11, 2022, 12:36:53 pm »
Did any socketed Intel CPUs have these problems? It just seems a bit weird.

Intel CPUs and chipsets have suffered several similar failure modes over the last decade. Whether it's socketed or not has no bearing on the failure or how 'proper' it is.

Which ones? When I say the proper way I mean with a socketed CPU which can at least be easily replaced.
 

Online Monkeh

  • Super Contributor
  • ***
  • Posts: 7992
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #40 on: May 11, 2022, 01:31:13 pm »
Did any socketed Intel CPUs have these problems? It just seems a bit weird.

Intel CPUs and chipsets have suffered several similar failure modes over the last decade. Whether it's socketed or not has no bearing on the failure or how 'proper' it is.

Which ones? When I say the proper way I mean with a socketed CPU which can at least be easily replaced.

Cougar Point chipsets, for example.
 

Offline dtsystems

  • Contributor
  • Posts: 10
  • Country: gb
Re: Intel Atom C2000 Failures
« Reply #41 on: May 14, 2022, 10:26:55 pm »
Thanks. Didn't know about that one either.
Just built a server based on Cougar bastard Point!
Losing the 3Gb SATAs will not be the end of the world in this application fortunately.
Grrr.
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #42 on: September 18, 2022, 11:55:29 pm »
Couple of notes on this:

Supermicro is said to still be doing free (customer ships to Supermicro at customer's expense) RMAs even out of warranty. Worth a shot for anyone with affected Supermicro motherboards: https://www.truenas.com/community/threads/fyi-intel-c2000-family-of-processors-system-fault-may-lead-to-dead-system.50314/post-707843

The workaround that was/is applied by some vendors (those that don't replace the SoC wholesale) involves pulling up not just the LPC clock, but also the LPC data lines. I believe that this is because the degradation is not limited to the clock pins - there have been cases of people with boards that still boot, but where the host cannot communicate with the BMC over LPC (Note 1), and also cases (mine included) where a dead board is brought back to life with the external pull-up, but the BMC is unreachable over LPC.

I'm still trying to wrap my head around why exactly systems fail to boot (see also link above). It turns out you can't boot from LPC, only from SPI - but this behavior is configurable by setting a pin of the SoC (FLEX_CLK_SE0 / AH59). At boot, the SoC pulls it up with 20k and reads the pin - high is SPI boot, low is LPC boot.
I've come to suspect that this clock, and not the LPC clock proper, is being used by most/all vendors to run the LPC bus, and thus being run out to the TPM header. This would probably be needed to enable a 33 MHz LPC bus, because the LPC clock pins are fixed at 25 MHz.
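
To make the strap read described above concrete, here is a minimal Python sketch. It assumes a 3.3 V I/O rail and generic LVTTL-style thresholds (2.0 V high, 0.8 V low); neither figure comes from the C2000 datasheet, and `strap_voltage` / `boot_source` are illustrative names only. The 20k pull-up is the value quoted above.

```python
V_RAIL = 3.3           # volts; assumed I/O rail
R_INTERNAL = 20_000    # ohms; internal pull-up value quoted in the post
V_IH, V_IL = 2.0, 0.8  # volts; assumed LVTTL-style thresholds, not datasheet values


def strap_voltage(r_to_ground):
    """Pin voltage with the internal pull-up and an external resistance to ground."""
    if r_to_ground is None:  # nothing pulling down: the pin sits at the rail
        return V_RAIL
    return V_RAIL * r_to_ground / (R_INTERNAL + r_to_ground)


def boot_source(r_to_ground):
    """Decode the strap: high selects SPI, low selects LPC."""
    v = strap_voltage(r_to_ground)
    if v >= V_IH:
        return "SPI"
    if v <= V_IL:
        return "LPC"
    return "undefined"  # between thresholds: the read is not guaranteed


for r in (None, 100_000, 20_000, 1_000):
    print(r, round(strap_voltage(r), 2), boot_source(r))
```

Under these assumptions anything loading the pin with roughly the same resistance as the internal pull-up lands between the thresholds, which is one way a marginal pin could produce an unreliable strap read.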

Note 1: The BMC has a bunch of tentacles into the host system. PCIe is used for graphics and USB is used to provide remote I/O devices. LPC is used to provide SuperIO functionality and in-band management of the BMC.
 

Offline nasjinx

  • Newbie
  • Posts: 1
  • Country: ca
Re: Intel Atom C2000 Failures
« Reply #43 on: December 28, 2022, 07:38:55 pm »
Does anyone know a permanent fix for this issue other than replacing CPU? As per Intel's errata, a platform level change was suggested as a workaround.

I guess Intel provided a new CPU where the issue was fixed at the silicon level.

Thanks
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #44 on: December 28, 2022, 07:44:16 pm »
Does anyone know a permanent fix for this issue other than replacing CPU? As per Intel's errata, a platform level change was suggested as a workaround.

I guess Intel provided a new CPU where the issue was fixed at the silicon level.

Thanks
For a faulty system there is no fix other than the already mentioned resistor or a CPU replacement. "Platform level change" means a circuit redesign to completely avoid the issue.
 
The following users thanked this post: nasjinx

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #45 on: February 10, 2023, 12:34:55 pm »
Hi everyone! greetings from Spain.
I have a DELL S4048-ON switch that crashed after a shutdown and reboot. I think I'm affected by this Clockgate from Intel: I cannot get any data via the serial console and the switch tries to boot in a loop. I searched all over the web and saw the same problem on this model (2015 factory date).
I was wondering if anyone has the S4048-ON's schematics so I could find LPC_CLKOUT on the mainboard. I see a connector near the microprocessor (pic attached); probably one of those pins is LPC_CLKOUT.

Any info is welcome!

thank you all for reading :-)

 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #46 on: February 20, 2023, 12:07:13 pm »
Anyone?  :)
 

Offline bsonTopic starter

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Re: Intel Atom C2000 Failures
« Reply #47 on: February 23, 2023, 11:17:47 pm »
Stick a scope probe to it?
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #48 on: February 24, 2023, 08:02:53 am »
Stick a scope probe to it?
It's pointless, as you will see no signal once the LPC clock is dead.
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #49 on: February 25, 2023, 01:27:58 am »
Actually, you're likely to see a tiny, mangled clock almost down in the noise. That's what I got on my C2758 before the RMA - it was easily missed, but it was there if you looked closely. Stupidly, I did not save a screenshot of the waveform.
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #50 on: February 25, 2023, 01:39:57 am »
Hi everyone! greetings from Spain.
I have a DELL S4048-ON switch that crashed after a shutdown and reboot. I think I'm affected by this Clockgate from Intel: I cannot get any data via the serial console and the switch tries to boot in a loop. I searched all over the web and saw the same problem on this model (2015 factory date).
I was wondering if anyone has the S4048-ON's schematics so I could find LPC_CLKOUT on the mainboard. I see a connector near the microprocessor (pic attached); probably one of those pins is LPC_CLKOUT.

These switches have an ASPeed BMC, don't they? Look for a daughterboard with an American Megatrands [sic] MegaRAC sticker. The BMC connects via LPC (and USB and PCIe) to the host, so that would at least narrow down your search a bit.
 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #51 on: February 28, 2023, 12:05:18 pm »
Stick a scope probe to it?

Unfortunately, I don't have a scope.
 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #52 on: February 28, 2023, 12:06:42 pm »
Hi everyone! greetings from Spain.
I have a DELL S4048-ON switch that crashed after a shutdown and reboot. I think I'm affected by this Clockgate from Intel: I cannot get any data via the serial console and the switch tries to boot in a loop. I searched all over the web and saw the same problem on this model (2015 factory date).
I was wondering if anyone has the S4048-ON's schematics so I could find LPC_CLKOUT on the mainboard. I see a connector near the microprocessor (pic attached); probably one of those pins is LPC_CLKOUT.

These switches have an ASPeed BMC, don't they? Look for a daughterboard with an American Megatrands [sic] MegaRAC sticker. The BMC connects via LPC (and USB and PCIe) to the host, so that would at least narrow down your search a bit.

I think this switch has no BMC, but I'm going to take a look. Thanks for your suggestion.
 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #53 on: March 02, 2023, 05:25:44 pm »
Hi everyone! greetings from Spain.
I have a DELL S4048-ON switch that crashed after a shutdown and reboot. I think I'm affected by this Clockgate from Intel: I cannot get any data via the serial console and the switch tries to boot in a loop. I searched all over the web and saw the same problem on this model (2015 factory date).
I was wondering if anyone has the S4048-ON's schematics so I could find LPC_CLKOUT on the mainboard. I see a connector near the microprocessor (pic attached); probably one of those pins is LPC_CLKOUT.

These switches have an ASPeed BMC, don't they? Look for a daughterboard with an American Megatrands [sic] MegaRAC sticker. The BMC connects via LPC (and USB and PCIe) to the host, so that would at least narrow down your search a bit.

Zero chips or stickers found with American Megatrends / MegaRAC markings :(
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #54 on: March 04, 2023, 06:58:27 pm »
Ok, further digging and a series of fortuitously timed posts by a few people led to further useful info about Supermicro A1SAi/A1SRi boards, and likely other vendors' boards too:

Well, the gist of it is that there are two clock lines, one feeds the TPM header and is involved in the boot ROM selection; the other runs to the BMC. This is why fixing the boot failures often still left the BMC unresponsive to the LPC bus, it just wasn't getting any clocks.

Supermicro's fix for this is a pair of 150 Ohm bodge resistors, one to pull up each of these two clock lines to 3.3 V. The fix is actually pretty neat on these boards and could pass casual inspection, depending on how well the soldering turns out.
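
As a rough sanity check on what such a resistor asks of the pin, the sketch below applies Ohm's law to the worst case where the output is driving the line fully low against the pull-up. The 150 Ohm and 3.3 V figures are the ones given above; the 100 ohm case is the value reported for the Synology bodge earlier in the thread, and treating it as a pull-up to 3.3 V is an assumption. `pullup_load` is an illustrative helper name.

```python
V_RAIL = 3.3  # volts; pull-up rail named in the post


def pullup_load(v_rail, r_pullup):
    """Return (current in A, power in W) while the pin holds the line at 0 V."""
    current = v_rail / r_pullup     # I = V / R
    power = v_rail ** 2 / r_pullup  # P = V^2 / R, dissipated in the resistor
    return current, power


for r in (100.0, 150.0):  # 100 ohm case assumes the same 3.3 V rail
    i, p = pullup_load(V_RAIL, r)
    print(f"{r:.0f} ohm: {i * 1e3:.1f} mA, {p * 1e3:.1f} mW worst case")
```

That works out to 22 mA and about 73 mW for 150 Ohm, so this is a very stiff pull-up compared with the usual kilohm-range values, and the resistor's power rating is worth a glance before soldering the smallest part in the drawer.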
« Last Edit: March 04, 2023, 07:07:04 pm by ericloewe »
 

Online Sacodepatatas

  • Regular Contributor
  • *
  • Posts: 80
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #55 on: March 05, 2023, 02:10:19 pm »
Quote
Unfortunately, I don't have a scope.

You can buy a DIY 50MHz Frequency counter for less than 5€ from AliExpress/eBay. Or you can build a temporary one by yourself if you've got a PIC16F628A and some led displays in your spare box.

Edit: Or one of those EZ-USB FX2LP boards that can be programmed as a logic tracer and costs about the same price.
« Last Edit: March 05, 2023, 02:14:41 pm by Sacodepatatas »
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #56 on: March 05, 2023, 02:28:42 pm »
Quote
Unfortunately, I don't have a scope.

You can buy a DIY 50MHz Frequency counter for less than 5€ from AliExpress/eBay. Or you can build a temporary one by yourself if you've got a PIC16F628A and some led displays in your spare box.

Edit: Or one of those EZ-USB FX2LP boards that can be programmed as a logic tracer and costs about the same price.
And they will be totally useless for the job.
 

Online Sacodepatatas

  • Regular Contributor
  • *
  • Posts: 80
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #57 on: March 06, 2023, 01:49:41 am »
Why? I have used such a frequency meter for monitoring the stability of 32MHz clock signals; I don't see why a 25MHz signal can't be counted with these. Maybe it's related to the logic voltage levels, but that's nothing that couldn't be solved quickly. The simplest way I can think of to check whether there is a clock signal at any port, without any equipment at all but a multimeter, is a high-pass filter followed by a peak detector (a Schottky, a capacitor and a high-value resistor), and then measuring the voltage level at the output.
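
A back-of-the-envelope sketch of that detector, in Python. Every component value here is an assumption picked for illustration (10k and 100 pF for the high-pass, 0.3 V Schottky drop, a 50 mV degraded clock); none of them come from the thread or a datasheet, and the function names are hypothetical.

```python
import math

V_F = 0.3  # volts; assumed Schottky forward drop at small current


def highpass_cutoff_hz(r_ohms, c_farads):
    """Corner frequency of a first-order RC high-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


def detector_reading(v_pp, v_f=V_F):
    """Approximate DC reading for a 50% duty square wave of v_pp volts peak-to-peak."""
    peak = v_pp / 2.0            # AC coupling centres the waveform on 0 V
    return max(0.0, peak - v_f)  # the diode never conducts below its forward drop


print(f"cutoff: {highpass_cutoff_hz(10e3, 100e-12) / 1e3:.0f} kHz")
print(f"healthy 3.3 V clock: {detector_reading(3.3):.2f} V")
print(f"degraded 50 mV clock: {detector_reading(0.05):.2f} V")
```

With these numbers a healthy 25MHz clock reads about 1.35 V on the multimeter, while a clock that has collapsed to tens of millivolts never gets past the diode and reads as 0 V, so the detector can say "clock present" but cannot distinguish a dead pin from a badly degraded one.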
 

Online wraper

  • Supporter
  • ****
  • Posts: 16865
  • Country: lv
Re: Intel Atom C2000 Failures
« Reply #58 on: March 06, 2023, 10:53:34 am »
Why? I have used such a frequency meter for monitoring the stability of 32MHz clock signals; I don't see why a 25MHz signal can't be counted with these. Maybe it's related to the logic voltage levels, but that's nothing that couldn't be solved quickly. The simplest way I can think of to check whether there is a clock signal at any port, without any equipment at all but a multimeter, is a high-pass filter followed by a peak detector (a Schottky, a capacitor and a high-value resistor), and then measuring the voltage level at the output.
Because there is no signal at all or it's extremely weak once it fails.
 

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #59 on: March 06, 2023, 07:46:11 pm »
Ok, further digging and a series of fortuitously timed posts by a few people led to further useful info about Supermicro A1SAi/A1SRi boards, and likely other vendors' boards too:

Well, the gist of it is that there are two clock lines, one feeds the TPM header and is involved in the boot ROM selection; the other runs to the BMC. This is why fixing the boot failures often still left the BMC unresponsive to the LPC bus, it just wasn't getting any clocks.

Supermicro's fix for this is a pair of 150 Ohm bodge resistors, one to pull up each of these two clock lines to 3.3 V. The fix is actually pretty neat on these boards and could pass casual inspection, depending on how well the soldering turns out.
I saw this repair on the Internet, but the Dell S4048 motherboard is very different. There is a TPM header near the microprocessor, but with only one row of pins.

Video for Supermicros:
 

Offline ericloewe

  • Supporter
  • ****
  • Posts: 85
  • Country: pt
Re: Intel Atom C2000 Failures
« Reply #60 on: March 08, 2023, 07:25:52 pm »
Ok, further digging and a series of fortuitously timed posts by a few people led to further useful info about Supermicro A1SAi/A1SRi boards, and likely other vendors' boards too:

Well, the gist of it is that there are two clock lines, one feeds the TPM header and is involved in the boot ROM selection; the other runs to the BMC. This is why fixing the boot failures often still left the BMC unresponsive to the LPC bus, it just wasn't getting any clocks.

Supermicro's fix for this is a pair of 150 Ohm bodge resistors, one to pull up each of these two clock lines to 3.3 V. The fix is actually pretty neat on these boards and could pass casual inspection, depending on how well the soldering turns out.
I saw this repair on the Internet, but the Dell S4048 motherboard is very different. There is a TPM header near the microprocessor, but with only one row of pins.
Are you sure it's for a TPM? Do you have any documents saying that?

If so, my suggestion is that you figure out which of the pins are not power pins. From there, try pulling up each of them to 3.3 V (one at a time!) until the thing boots. Once it boots, you'll have identified your clock pin and can do a more permanent fix.

Pictures are also welcome for the next person to come along with such a dead switch.
« Last Edit: March 08, 2023, 07:28:02 pm by ericloewe »
 
The following users thanked this post: wraper

Offline pat0te

  • Newbie
  • Posts: 7
  • Country: es
Re: Intel Atom C2000 Failures
« Reply #61 on: March 13, 2023, 09:28:11 am »
Ok, further digging and a series of fortuitously timed posts by a few people led to further useful info about Supermicro A1SAi/A1SRi boards, and likely other vendors' boards too:

Well, the gist of it is that there are two clock lines, one feeds the TPM header and is involved in the boot ROM selection; the other runs to the BMC. This is why fixing the boot failures often still left the BMC unresponsive to the LPC bus, it just wasn't getting any clocks.

Supermicro's fix for this is a pair of 150 Ohm bodge resistors, one to pull up each of these two clock lines to 3.3 V. The fix is actually pretty neat on these boards and could pass casual inspection, depending on how well the soldering turns out.
I saw this repair on the Internet, but the Dell S4048 motherboard is very different. There is a TPM header near the microprocessor, but with only one row of pins.
Are you sure it's for a TPM? Do you have any documents saying that?

If so, my suggestion is that you figure out which of the pins are not power pins. From there, try pulling up each of them to 3.3 V (one at a time!) until the thing boots. Once it boots, you'll have identified your clock pin and can do a more permanent fix.

Pictures are also welcome for the next person to come along with such a dead switch.

Well, no, I'm not sure it's for a TPM. I don't have any documentation, that's the problem. I posted this pic before of the TPM-style connector:
https://www.eevblog.com/forum/index.php?action=dlattach;topic=95943.0;attach=1713377;image
 

Offline charly

  • Newbie
  • Posts: 1
  • Country: ar
Re: Intel Atom C2000 Failures
« Reply #62 on: October 03, 2023, 05:33:49 pm »
Just registered (a programmer here, my electronics skills are below zero :D ) to say thanks for this thread and keep it going for anybody still around (this hardware is way more than what many NAS boxes offer today).
I've had a 2550F since 2014 and it died on me about a year ago (I was using it with FreeNAS at home). Tried everything; it just turned on some lights on the motherboard, but no video, no nothing. I even suspected the power supply. Now this bridge, which I made with help from audio friends, brought it back to life! I didn't want to throw away this beauty even if it only works for a few extra months.
Thanks a lot!!
 

Offline Gil

  • Newbie
  • Posts: 1
  • Country: au
Re: Intel Atom C2000 Failures
« Reply #63 on: April 13, 2024, 12:40:55 am »
Where is the pin located on the 251+ ??  many thanks
 

