Author Topic: Debugging low temperature crashes of a 400 MHz ARM Microcontroller (Read 11134 times)

gperoni · « **on:** December 25, 2016, 05:38:57 pm »

Hello, I hope this to be a good place to ask for suggestions on this topic.

I recently manufactured a batch of boards with a 400 MHz ARM microcontroller + external DDR in it. Around 30% of the boards exhibit a problem when started from a temperature below room temperature (20C), hanging while booting (u-boot is used at boot and it hangs mostly during memory intensive operations such as copying the kernel from the external device to DDR or computing the checksum, however this is simply most of the work done by the bootloader and other crashes were also seen, so I wouldn't point the finger to the DDR exclusively). Of the remaining 70% of boards, half also hang when left outside and booted at ~5C. Some of the remaining can be left in a fridge for minutes at -25C and will boot just fine.

This is not the first time I have manufactured this board; all of the boards from the last batch can be left in a fridge for tens of minutes at a time and will boot just fine. The schematic differences between those two batches are very minimal and I checked them something like 20 times so far, there is nothing that could explain this behaviour. The PCB manufacturer was the same, however I changed the SMD assembler. This last assembler provided some of the components, like decoupling caps. However, all of the supplies and noise tested on them is the same as the old boards, with the biggest ripple being a 100-150mVpp.

So, list of facts known so far:

- It's a temperature-dependent problem. Around 30% of the boards cannot boot when cold. Others can be left in the freezer and will work just fine. Once the board has booted, the heat generated during normal operation is enough to keep it alive. If a board refuses to boot, blowing hot air on it is a guaranteed fix.
- I excluded a bad batch of chips by replacing all of the BGAs on a board with new ones. The board was a particularly bad one and didn't change its behaviour after having its chips replaced.
- When running a long (20s) memory test before booting, chances of a bad board booting increase. If the test is run a couple of times, it's basically certain the board heated up enough to pass the boot process. However, this memory test never crashes, while the subsequent boot steps sometimes do.
- The ICs on the bad boards work every time when started at a lower frequency (200 MHz), even after the boards are left in the freezer for a while.
- Even if the SMD assembler provided many of the passives for the power supplies, the voltages are looking perfectly alike the old revision, with a Vpp of around 100-150mV. It's very difficult to find noise differences between the two revisions.

The boards working all the time at lower frequencies should rule out a solder problem. What about PCB impedances changing at lower temperatures (even if the PCB manufacturer was the same between the two batches)? This and bad passives are my two most likely causes, so, even if the voltages are fine, I started the tedious process of replacing the passives (mostly for decoupling) provided by the SMD assembler with quality ones (even if the SMD assembler should have used the temperature rating indicated in the BOM and there is not a lot of ripple, so I'm mostly working in vain right now).

Any idea on what to look at next?

Thank you a lot and happy holidays!

Howardlong · « **Reply #1 on:** December 25, 2016, 07:07:22 pm »

When the boot fails, does it always fail at the same place, or is it at random places during the boot process?

Are all your oscillators starting?

dgtl · « **Reply #2 on:** December 25, 2016, 07:10:38 pm »

DDR calibration issues? For example, for iMX6 the calibration is run once and the values hardcoded to FW image. Even when copypasting DDR layout from evalkit and using same components, the calibration had to be re-run for custom boards and the constants updated. Memory timings change with temperature and when the settings are not valid, it will fail. Perhaps just the memory timings are invalid and you are running the chips a little too fast or some other parameter is not set correctly?
Perhaps you can narrow down to one chip with cold spray or heatgun aimed at one specific chip.
On one design, we had one power rail get unstable in low temperatures and causing stability issues. Wasn't fun to debug at all.

gperoni · « **Reply #3 on:** December 25, 2016, 08:53:32 pm »

I couldn't see any problem with my crystals, I even tried replacing them before with no luck. The bootloader fails in a very large number of ways. They are mostly in order, depending on temperature the first failure mode will show up, then after a reboot the second will, etc.

- Sometimes the board starts, but without showing any output. Maybe here the bootloader is failing to load in ram.
- "reading uEnv.txt" (reading a small file in the SD card configuring the bootloader)
- Verifying the checksum (this is the step that is most likely to fail)
- Moving the kernel image from one memory location to the other
- A kernel panic early in the kernel init, before output is initialised

Sometimes (not in any order), the board will decide to reboot itself after going through a few of the steps above. Just like that, out of the blue, either the board reboots or the program counter is reset to the beginning. I tried checking the reset line of the processor, but I couldn't see anything there.

Unfortunately there is no calibration process for the DDR, even if the datasheet says that calibration is part of the normal boot process and I assume this is done automatically by the bootloader. I changed DDR chips between the two batches, so what I did to rule the DDR out was replacing the old board's DDR with one of the new ones, but I couldn't get it to crash. I'm gonna check the datasheets for the part numbers, just in case the values are a bit off...

eb4fbz · « **Reply #4 on:** December 25, 2016, 10:01:42 pm »

From my experience, these issues are due to signal integrity problems. When cold, the drivers' current increases, and so do reflections due to impedance mismatches and crosstalk.
I've seen this behaviour on high-speed SPI, LVDS and DDR buses. Layout rework, snubbers, series resistors and terminations could help.

Are you sure that the PCB manufacturer has not changed the stackup or materials? Prepreg type, resin content, FR4 cores, thickness, etc.?

gperoni · « **Reply #5 on:** December 25, 2016, 10:42:47 pm »

It looks like it might be fixed! I started comparing DDR datasheet and bootloader initialisation code and saw some differences, so I tried taking the worst board I had and changed the DDR on it. Result, it's now an above-the-average board. I believe I have a very permissive batch of DDRs that I'm using outside of their characterisation, I just need to change the timings in the bootloader and it should be fine. Just writing this even if untested to make sure people don't use their time to answer this. I will let you know if this doesn't end up being the problem. Thanks a lot everyone!

dgtl · « **Reply #6 on:** December 25, 2016, 10:50:48 pm »

Just as a reference, could you also tell us, what uc it was. Perhaps others are having same issues with the same thing...
Getting SDRAM timings right is a tricky thing and it is very easy to make a small mistake, that almost works. Some timings are given in time units, some in clocks, some in frequency; then you have to convert to the units the controller is using (usually clocks) and then round the correct way.
Some manufacturers have made things easier with helpful tools that calculate the settings, some have made things even harder with nonexistent documentation and a lot of magic constants in source code. At least the JEDEC standards standardize things a little, but always expect surprises. And things get harder with every generation: DDR3 is more difficult than DDR2, DDR or SDR.
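
To make the unit conversion concrete, here is a minimal host-side sketch of turning a datasheet minimum in nanoseconds into clock cycles and then into a register field. The 15 ns and 7.5 ns figures are example values, not taken from any datasheet, and the "field stores cycles minus one" encoding is an assumption that only some controllers use; check the controller's register description.

```python
import math


def ns_to_clocks(t_ns: float, clock_mhz: float) -> int:
    """Smallest whole number of clock cycles lasting at least t_ns."""
    cycles = t_ns * clock_mhz / 1000.0  # 1 cycle = 1000 / clock_mhz ns
    # Round UP: rounding down would give a delay shorter than the
    # datasheet minimum. The tiny epsilon absorbs float noise when the
    # result is an exact whole number of cycles.
    return math.ceil(cycles - 1e-9)


def register_field(t_ns: float, clock_mhz: float, offset: int = 1) -> int:
    """Hypothetical encoding: the register holds (cycles - offset)."""
    return ns_to_clocks(t_ns, clock_mhz) - offset


if __name__ == "__main__":
    for name, t_ns in (("example timing A", 15.0), ("example timing B", 7.5)):
        print(name, ns_to_clocks(t_ns, 340.0), register_field(t_ns, 340.0))
```

Writing the cycle count where the controller expects cycles minus one (or the reverse) is exactly the kind of small mistake that almost works.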

gperoni · « **Reply #7 on:** December 25, 2016, 11:18:12 pm »

Sure, it's a TI DaVinci, DM36x. The DDR is a MT47H64M16HR-25:H.

lujji · « **Reply #8 on:** December 25, 2016, 11:52:33 pm »

Very interesting problem.

Quote

the voltages are looking perfectly alike the old revision, with a Vpp of around 100-150mV. It's very difficult to find noise differences between the two revisions.

Just to make sure, were you monitoring DDR voltages while the boards were actually in freezer?

mac.6 · « **Reply #9 on:** December 26, 2016, 09:03:45 am »

100-150mV ripple? on 3.3v or other core voltage? that can be too much ripple if you are close to vdd limits. Don't forget to probe near mcu power pin while reproducing the reset/lockup issues.
My first suspect would be the power supply, especially the decoupling caps, since you said they come from a different source (Occam's razor).

diyaudio · « **Reply #10 on:** December 26, 2016, 09:55:15 am »

Have you checked the load capacitors on the crystal? I had a problem with a SHARC DSP board where I forgot to add load capacitors (as was clearly mentioned in the data sheet). I almost wrote off a batch of boards as a result of a silly mistake.

mikeselectricstuff · « **Reply #11 on:** December 26, 2016, 03:41:23 pm »

Those symptoms are screaming "marginal timings" to me.
You ideally want to replace the OS+bootloader with something that just does memory tests and outputs the result on a LED, to make it easier to narrow things down.

RoGeorge · « **Reply #12 on:** December 26, 2016, 07:49:20 pm »

Quote from: gperoni on December 25, 2016, 05:38:57 pm

however I changed the SMD assembler. This last assembler provided some of the components, like decoupling caps. However, all of the supplies and noise tested on them is the same as the old boards, with the biggest ripple being a 100-150mVpp.

Did you try to swap all the decoupling capacitors between a working board from the old batch and a non-working one from the new batch?

gperoni · « **Reply #13 on:** December 26, 2016, 08:43:56 pm »

I really appreciate all the nice answers! Let's see if I can do a recap of what happened in the last day:

1) I solved a boot voltage sequencing problem where keeping the UART RX connected to a UART device (to read the boot log) was causing spikes in the 1.8V rail; the power-up sequence was wrong when the spikes coincided with power-on. This solved 90% of the cases where the board is powered but there is no output whatsoever.
2) I noticed that the RESET pin of the microprocessor was at 3.9V instead of 3.3V, this is due to how the PMIC is configured (open drain 100k ohm pull up at 5V), so I reworked a few of the boards to have it providing the correct 3.3V. I couldn't notice any difference.
3) The board whose DDR I replaced yesterday is still behaving more or less ok. However I don't feel like I have a statistically reliable basis to say changing the DDR decreased the occurrence of problems. I wish I had better ways to test those boards.
4) Somehow, the success rate at boot seems to be higher during the evening (third day in a row now). Tried room temperature and turning lights on/off. Still unexplained. Maybe this last point is completely made up.
5) I have two power supply boards I connect this board I'm debugging to. One of them has a better layout and is less noisy, so the noise for the 1.35V rail goes down from ~100-150mV to ~70-80mV. However this doesn't seem to be helping with boot.
6) I tried removing some of the bigger capacitors (4.7uF) from a board to see if it increased problems, but I couldn't see any difference. I added back 6uF 50V NPO, and again, I couldn't see any difference.
7) I ordered a couple of slightly faster DDR chips, those have a better CAS Latency (5 vs 6). They should be here in a week. The problem I was talking about yesterday is that I'm using DDR chips with a CAS latency of 6, configured from the bootloader to work at 5, because the microprocessor doesn't support the higher CAS latency option. The only reason chips with CAS of 6 work at 5 is that they are downclocked (340 vs 400 MHz). However yes, it's asking for problems. This makes me think that maybe the last batch was working fine just out of pure luck. There is a thread on SE about this.
8) I also noticed that increasing CPU/DDR frequency just a tiny bit (DDR goes up 2% to 340 MHz) increases the occurrence of the problem a lot.
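
For points 7 and 8, the CAS latency arithmetic can be sketched on the host as below. This uses a deliberately simplified model: it assumes the CAS access time is just CL multiplied by the clock period, and that a part rated CL=6 at 400 MHz needs that same absolute time at any clock. The datasheet speed-bin table is the real authority and may differ.

```python
def cas_time_ns(cas_latency: int, clock_mhz: float) -> float:
    """CAS access time under the simplified model: CL * clock period."""
    return cas_latency * 1000.0 / clock_mhz


rated_ns = cas_time_ns(6, 400.0)       # what the CL=6 part is rated for
configured_ns = cas_time_ns(5, 340.0)  # what the bootloader actually programs
margin_ns = configured_ns - rated_ns   # negative means too fast for the part

# Highest clock at which CL=5 still gives the rated access time.
max_clock_mhz = 5 * 1000.0 / rated_ns

if __name__ == "__main__":
    print(f"rated      : {rated_ns:.2f} ns")
    print(f"configured : {configured_ns:.2f} ns")
    print(f"margin     : {margin_ns:+.2f} ns")
    print(f"max clock at CL=5: {max_clock_mhz:.1f} MHz")
```

Under this model the break-even clock is about 333 MHz, so CL=5 at 340 MHz comes out slightly short of the rated time. That would at least be consistent with a small clock increase making things much worse.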

So, left to do:

1) Wait for the new DDRs.
2) Replace the remaining capacitors, the 40 or so 0.1uF.
3) (optional) Buy a bench power supply to power the IC from it.

Uhm, it looks like writing everything down at the end of the day helps making order in the mess that is my mind right now!

Quote from: lujji on December 25, 2016, 11:52:33 pm

Just to make sure, were you monitoring DDR voltages while the boards were actually in freezer?

No, I take it out from the freezer and spend about 30 seconds powering it up. However the freezer is at -10C while it's 5C outside, so I prefer to just use outside to cool the boards down. Nothing in the board is rated for -10C and leaving the boards in the freezer for too long breaks even the most resistant boards, while others still fail to boot from 5C.

Quote from: mac.6 on December 26, 2016, 09:03:45 am

100-150mV ripple? on 3.3v or other core voltage? that can be too much ripple if you are close to vdd limits. Don't forget to probe near mcu power pin while reproducing the reset/lockup issues. My first suspect would be the power supply, especially the decoupling caps, since you said they come from a different source (Occam's razor).

This is basically the only reason why I'm considering buying a bench power supply, to see if a perfectly clean power supply fixes the problem all the time. Yes, 100-150mV is too high, but it is also very close to what I'm measuring in the old board... It's tricky. I should check voltages again tomorrow.

lujji · « **Reply #14 on:** December 27, 2016, 03:41:54 am »

Check DDR voltages when the board is at 5C. Perhaps the feedback of your supply circuitry changes with temperature which makes the voltage drop enough for RAM to misbehave. Also see if increasing DDR voltage makes any difference on 'faulty' boards.

MatteoX · « **Reply #15 on:** December 27, 2016, 05:06:57 am »

Quote from: mikeselectricstuff on December 26, 2016, 03:41:23 pm

Those symptoms are screaming "marginal timings" to me.
You ideally want to replace the OS+bootloader with something that just does memory tests and outputs the result on a LED, to make it easier to narrow things down.

I totally agree. From your description everything points to a marginal timing problem. When cold, the propagation delays in CMOS ICs are shorter.

You can easily try to replicate your problem using a freeze spray to selectively cool down particular component(s). I have successfully used this technique (freeze spray and heat gun) to vary delays and detect marginal timings on FPGA prototypes.

Brutte · « **Reply #16 on:** December 27, 2016, 01:14:01 pm »

That is ARM926 core.
On error in data transfer from/to memory this CPU cannot even detect it as that would require ECC at least. It can only detect an unaligned access and some permission faults.

On error in op-code transfer there is a chance the core senses the problem, for example as invalid op-code. If the core faces such event, it can act accordingly (log "I do not understand how to execute 0xDEADBEEF loaded from 0x01234567" or reboot, halt, whatever). Not sure what option was selected in your OS but if the core offers such hardware mechanism then I suggest to use it for debugging.

The register is called FSR, IFSR for instruction and DFSR for data.

Quote from: ARM926 TRM

Register c5 accesses the Fault Status Registers (FSRs). The FSRs contain the source of the last instruction or data fault. The instruction-side FSR is intended for debug purposes only.

There is also accompanying FAR (fault address register that indicates at what address the fault was detected). And there is a bunch of bits that configure the cache behaviour.

I do not know the details of your setup and the root of your problem but even when you improve the layout and won't face the bsod's that often then that does not mean the memory interface works flawlessly as not every memory error ends up in "total crash".
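
If a fault handler manages to print the raw FSR value over the UART, decoding it on the host could look like the sketch below. The status encodings are as I recall them from the ARM926EJ-S TRM (status in bits [3:0], domain in bits [7:4]) and should be verified against the TRM before relying on them. `decode_fsr` is a made-up helper name.

```python
# Fault status encodings (bits [3:0]) as recalled from the ARM926EJ-S TRM.
# Treat this table as an assumption and check it against the TRM.
FSR_STATUS = {
    0b0001: "alignment fault",
    0b0011: "alignment fault",
    0b1100: "external abort on translation, first level",
    0b1110: "external abort on translation, second level",
    0b0101: "translation fault, section",
    0b0111: "translation fault, page",
    0b1001: "domain fault, section",
    0b1011: "domain fault, page",
    0b1101: "permission fault, section",
    0b1111: "permission fault, page",
    0b1000: "external abort, section",
    0b1010: "external abort, page",
}


def decode_fsr(fsr: int) -> dict:
    """Split a raw FSR value into its domain and a readable status."""
    status = fsr & 0xF
    domain = (fsr >> 4) & 0xF
    return {
        "domain": domain,
        "status": FSR_STATUS.get(status, "unknown or reserved"),
    }


if __name__ == "__main__":
    # 0x2D is an invented example value, not a real dump from this board.
    print(decode_fsr(0x2D))
```

Keep in mind the point above: a flipped data bit that still decodes as a valid access produces no fault at all, so a clean FSR does not prove the memory interface is healthy.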

free_electron · « **Reply #17 on:** December 27, 2016, 04:11:46 pm »

condensation ?

water is conductive. ice is not ... you say they work after being frozen ...

how good is your ddr layout ? did you respect the length-matching ?

diyaudio · « **Reply #18 on:** December 27, 2016, 06:55:43 pm »

Interesting how people talk of possible layout problems and don't post pictures of the ACTUAL layout.

gperoni · « **Reply #19 on:** January 05, 2017, 08:32:03 pm »

A small update. I got the faster DDRs, the ones with the lower CAS latency (MT47H64M16HR-25E:H vs MT47H64M16HR-25:H), but they are not helping. I also discovered what spring probes are and used them to test voltages but I couldn't find anything with more than 50mVpp noise. The PCB manufacturer is also saying they used the same stackup and materials.

A few thoughts:

1) I bought the new DDRs from China, maybe I got relabelled, slower DDRs. Unlikely. However the EMIF controller is configured with the CAS latency of the faster chips, so the older ones shouldn't be working. Buying from China was faster.
2) I got the freeze spray, nice tool. It looks like I can freeze the side of the PCB opposite to the ICs, but when the ground plane on the side of the ICs gets below 15-17 degrees C the software running on the board freezes. Kernel panic sometimes, segfault at other times. By freezing the IC and the DDR I can also get them to crash, but that is not very useful information; unfortunately I can't selectively crash them, as they are very close together and the freeze spray sends out a -50C liquid.
3) I would love to take a look at the registers during a crash, but I don't have the JTAG pinout.

Even if tedious I think replacing every capacitor, inductance and filter is worth a try, even if I'm not sure of how those can help fix what looks like a DDR problem when there is a clean DDR power supply. Other than that, maybe I can try changing some DDR timing setting. However I'm out of ideas after those.

Thank you everyone for the help!

Brutte · « **Reply #20 on:** January 05, 2017, 09:25:00 pm »

Quote from: gperoni on January 05, 2017, 08:32:03 pm

3) I would love to take a look at the registers during a crash, but I don't have the JTAG pinout.

Oopsie.
No JTAG means that it is going to be much harder.

When the core faces some unexpected event (like opcode with flipped bit), it automagically suspends current execution and jumps to (raises) a fault handler (I think there are three fault handlers in ARM926 for three categories of nastiness). So one could provide a custom fault handler that is kept in designated on-board SRAM section (you do not want to load that fault handler to/from faulty DDR) that would dump all the core registers (including FSR, FAR, part of the stack, etc) in a human-readable format. From that data you could infer where the actual problem lies. I suspect your OS should support such debugging options and various test setups that stretch hardware beyond standard requirements, etc.

Of course with JTAG all that interrogation can be made with a click of a button.

thm_w · « **Reply #21 on:** January 05, 2017, 11:55:26 pm »

Quote from: free_electron on December 27, 2016, 04:11:46 pm

condensation ?
water is conductive. ice is not ... you say they work after being frozen ...
how good is your ddr layout ? did you respect the length-matching ?

We had a vendor product that would die at cold temps due to condensation. It was hard because they never saw the problem initially (presumably their environmental chamber would not allow condensation to form), then wouldn't admit it was a problem when we demonstrated it. The problem was a floating boot pin, so the condensation would pull it to the wrong state.

Likely not OP's issue though.

mikeselectricstuff · « **Reply #22 on:** January 07, 2017, 08:44:51 am »

You are really making life very hard for yourself using "the OS crashes" as the only diagnostic to track down a memory fault, and how will you know whether you've actually fixed it? When the OS runs for an hour, a day, a month?
A few hours spent writing (or finding) some bare-metal code that just sits in internal RAM (so RAM faults don't crash the system) and continuously does external RAM tests, outputting status to a LED or UART, will give you far more insight into the effects of any fixes, and understanding of what the actual issue is, so you can be sure that any fix actually solves the issue.
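
As a rough illustration of the test patterns such code might run, here is a host-side Python model. It is only a sketch of the algorithms: the real thing would be C or assembly running from internal RAM, and the `Ram` classes and their faults are invented stand-ins for the external DDR.

```python
class Ram:
    """Stand-in for external DDR: a flat array of 32-bit words."""

    def __init__(self, words: int):
        self.cells = [0] * words

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value & 0xFFFFFFFF

    def read(self, addr: int) -> int:
        return self.cells[addr]


class StuckDataBitRam(Ram):
    """Hypothetical fault: one data bit always reads back as 0."""

    def __init__(self, words: int, bit: int):
        super().__init__(words)
        self.mask = ~(1 << bit) & 0xFFFFFFFF

    def read(self, addr: int) -> int:
        return super().read(addr) & self.mask


class StuckAddressBitRam(Ram):
    """Hypothetical fault: one address line stuck low, so cells alias."""

    def __init__(self, words: int, bit: int):
        super().__init__(words)
        self.addr_mask = ~(1 << bit)

    def write(self, addr: int, value: int) -> None:
        super().write(addr & self.addr_mask, value)

    def read(self, addr: int) -> int:
        return super().read(addr & self.addr_mask)


def walking_ones(ram: Ram, addr: int = 0) -> list:
    """Data-bus test: return the bits that did not read back correctly."""
    bad = []
    for bit in range(32):
        pattern = 1 << bit
        ram.write(addr, pattern)
        if ram.read(addr) != pattern:
            bad.append(bit)
    return bad


def address_in_address(ram: Ram, words: int) -> list:
    """Address-bus test: store each address in its own cell, then verify."""
    for addr in range(words):
        ram.write(addr, addr)
    return [addr for addr in range(words) if ram.read(addr) != addr]


if __name__ == "__main__":
    print(walking_ones(StuckDataBitRam(16, bit=7)))
    print(address_in_address(StuckAddressBitRam(16, bit=2), 16))
```

Patterns like these find hard faults such as stuck or shorted lines. Marginal timings tend to be intermittent, so on the target the loop would need to run many passes, ideally with back-to-back burst accesses, while counting errors rather than stopping at the first one.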

gperoni · « **Reply #23 on:** January 07, 2017, 10:59:39 pm »

Absolutely right, I was very close to writing those routines but decided to replace the DDR yet another time instead, with a 2-years-old Samsung (K4T1G164QF-BCF7) I found while searching for other components. It turns out I can now freeze the board to -50 celsius and see it working.

So, yeah, DDR timings, as everyone here was saying! The small detail is that the DDR timings between the two DDRs are absolutely the same.

Anyway, by looking at the difference between registers and uboot initialisation code I found some big and small errors (like x instead of x-1) and it looks like the Micron DDR is now 10-100 times more stable, even when cold. The Samsung however is still behaving much better at low temperatures (-10), where it doesn't crash, unlike the Micron.

Not sure what to do next, replace the Micron DDRs with Samsung (it's out of production but still available and cheap) and have an overall better product for reasons I don't understand (this troubles me) or just stick with the Micron DDR that seems to be stable enough.

I know, I shouldn't be basing decisions on feelings. What I'm doing for testing is just running the "production software" with a fan forcing -10 degree air towards the product. It doesn't crash with the Samsung, it does with the Micron, even though the temperature rating of the chips is the same. Maybe I just got lucky Samsung chips, as it looks like I did with my old Micron batch. Maybe it's a hardware problem (some of the PCBs don't have perfect solderable pads, the Samsung DDR has smaller nonconductive overmolds and falls into place much more, thus bigger solder contacts (flatter balls) make a better contact)?

So, yeah, sort of solved, not sure why.

mikeselectricstuff · « **Reply #24 on:** January 07, 2017, 11:38:16 pm »

Quote from: gperoni on January 07, 2017, 10:59:39 pm

So, yeah, sort of solved, not sure why.

If you don't know why, how do you know it won't crop up again? or under some circumstances you've not tested yet?
My guess is something is out of spec and the Micron parts have more margin.
At least this batch do, maybe the next will also, or maybe not...
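
One way to put a number on "how do you know" is to ask how many consecutive clean cold boots are needed before a given failure rate can be ruled out. The sketch below assumes every boot is an independent trial with the same failure probability, which is optimistic when temperature drifts between runs. `boots_needed` is a made-up helper name.

```python
import math


def boots_needed(max_fail_rate: float, confidence: float = 0.95) -> int:
    """Consecutive clean boots needed to claim failure rate < max_fail_rate.

    If the true failure rate were max_fail_rate, the chance of seeing n
    clean boots in a row would be (1 - max_fail_rate) ** n. Pick n so that
    this chance drops to (1 - confidence) or below.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_fail_rate))


if __name__ == "__main__":
    for rate in (0.30, 0.05, 0.01):
        print(f"rule out a {rate:.0%} failure rate: {boots_needed(rate)} clean boots")
```

Ruling out a 30% failure rate takes only a handful of clean boots, but ruling out 1% takes about three hundred, which is why a few good evenings on the bench say little about a whole batch.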

