Author Topic: Debugging low temperature crashes of a 400 MHz ARM Microcontroller (Read 11178 times)

gperoni · « **on:** December 25, 2016, 05:38:57 pm »

Hello, I hope this is a good place to ask for suggestions on this topic.

I recently manufactured a batch of boards with a 400 MHz ARM microcontroller and external DDR. Around 30% of the boards have a problem when started below room temperature (20C): they hang while booting. U-Boot is used at boot and it hangs mostly during memory-intensive operations such as copying the kernel from the external device to DDR or computing the checksum. However, that is simply most of the work done by the bootloader and other crashes were also seen, so I wouldn't point the finger at the DDR exclusively. Of the remaining 70% of boards, half also hang when left outside and booted at ~5C. Some of the rest can be left in a fridge for minutes at -25C and will boot just fine.

This is not the first time I have manufactured this board. All of the boards from the previous batch can be left in a fridge for tens of minutes at a time and will boot just fine. The schematic differences between the two batches are minimal and I have checked them something like 20 times so far; there is nothing that could explain this behaviour. The PCB manufacturer was the same, but I changed the SMD assembler. The new assembler provided some of the components, like decoupling caps. However, the supply voltages and the noise measured on them are the same as on the old boards, with the biggest ripple being 100-150mVpp.

So, list of facts known so far:

- It's a temperature-dependent problem. Around 30% of the boards cannot boot when cold. Others can be left in the freezer and will work just fine. Once the board is booted, the heat generated during normal operation is enough to keep it alive. If a board refuses to boot, blowing hot air on it is a guaranteed fix.
- I excluded a bad batch of chips by replacing all of the BGAs on a board with new ones. The board was a particularly bad one and its behaviour didn't change after having its chips replaced.
- When running a long (20s) memory test before booting, the chances of a bad board booting increase. If the test is run a couple of times, it's basically certain the board has heated up enough to get through the boot process. However, this memory test never crashes, while the subsequent boot steps sometimes do.
- The ICs of bad boards work all the time when started at a lower frequency (200 MHz), even after the boards are left in the freezer for a while.
- Even though the SMD assembler provided many of the passives for the power supplies, the voltages look just like those of the old revision, with a Vpp of around 100-150mV. It's very difficult to find noise differences between the two revisions.

The boards working all the time at lower frequencies should rule out a solder problem. What about PCB impedances changing at lower temperatures (even if the PCB manufacturer was the same between the two batches)? This and bad passives are my two most likely causes, so, even if the voltages are fine, I started the tedious process of replacing the passives (mostly for decoupling) provided by the SMD assembler with quality ones (even if the SMD assembler should have used the temperature rating indicated in the BOM and there is not a lot of ripple, so I'm mostly working in vain right now).

Any idea on what to look at next?

Thank you a lot and happy holidays!

Howardlong · « **Reply #1 on:** December 25, 2016, 07:07:22 pm »

When the boot fails, does it always fail at the same place, or is it at random places during the boot process?

Are all your oscillators starting?

dgtl · « **Reply #2 on:** December 25, 2016, 07:10:38 pm »

DDR calibration issues? For example, for the iMX6 the calibration is run once and the values are hardcoded into the FW image. Even when copy-pasting the DDR layout from the eval kit and using the same components, the calibration had to be re-run for custom boards and the constants updated. Memory timings change with temperature, and when the settings are not valid, it will fail. Perhaps just the memory timings are invalid and you are running the chips a little too fast, or some other parameter is not set correctly?
Perhaps you can narrow it down to one chip with cold spray or a heat gun aimed at one specific chip.
On one design, we had one power rail become unstable at low temperatures, causing stability issues. Wasn't fun to debug at all.

gperoni · « **Reply #3 on:** December 25, 2016, 08:53:32 pm »

I couldn't see any problem with my crystals, I even tried replacing them before with no luck. The bootloader fails in a very large number of ways. They are mostly in order, depending on temperature the first failure mode will show up, then after a reboot the second will, etc.

- Sometimes the board starts, but without showing any output. Maybe here the bootloader is failing to load into RAM.
- "reading uEnv.txt" (reading a small file in the SD card configuring the bootloader)
- Verifying the checksum (this is the step that is most likely to fail)
- Moving the kernel image from one memory location to the other
- A kernel panic early in the kernel init, before output is initialised

Sometimes (not in any order), the board will decide to reboot itself after going through a few of the steps above. Just like that, out of the blue, either the board reboots or the program counter is reset to the beginning. I tried checking the reset line of the processor, but I couldn't see anything there.

Unfortunately there is no calibration process for the DDR, even if the datasheet says that calibration is part of the normal boot process and I assume this is done automatically by the bootloader. I changed DDR chips between the two batches, so what I did to rule the DDR out was replacing the old board's DDR with one of the new ones, but I couldn't get it to crash. I'm gonna check the datasheets for the part numbers, just in case the values are a bit off...

eb4fbz · « **Reply #4 on:** December 25, 2016, 10:01:42 pm »

In my experience, these issues are due to signal integrity problems. When cold, the drivers' current increases, and so do reflections due to impedance mismatches and crosstalk.
I've seen this behaviour in high-speed SPI, LVDS and DDR buses: layout rework, snubbers, series resistors and terminations could help.

Are you sure that the PCB manufacturer has not changed the stackup or materials? Prepreg type, resin content, FR4 cores, thickness, etc.?

gperoni · « **Reply #5 on:** December 25, 2016, 10:42:47 pm »

It looks like it might be fixed! I started comparing DDR datasheet and bootloader initialisation code and saw some differences, so I tried taking the worst board I had and changed the DDR on it. Result, it's now an above-the-average board. I believe I have a very permissive batch of DDRs that I'm using outside of their characterisation, I just need to change the timings in the bootloader and it should be fine. Just writing this even if untested to make sure people don't use their time to answer this. I will let you know if this doesn't end up being the problem. Thanks a lot everyone!

dgtl · « **Reply #6 on:** December 25, 2016, 10:50:48 pm »

Just as a reference, could you also tell us which uC it was? Perhaps others are having the same issues with the same part...
Getting SDRAM timings right is a tricky thing and it is very easy to make a small mistake that almost works. Some timings are given in time units, some in clocks, some in frequency; then you have to convert to the units the controller is using (usually clocks) and then round the correct way.
Some manufacturers have made things easier with helpful tools that calculate the settings, some have made things even harder with nonexistent documentation and a lot of magic constants in source code. At least the JEDEC standards standardize things a little, but always expect surprises. And things get harder with every generation: DDR3 is more difficult than DDR2, DDR or SDR.
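
To illustrate the convert-then-round pitfall described above, here is a minimal Python sketch. It assumes a minimum timing given in picoseconds, a DDR clock given in kHz, and a hypothetical controller field that stores the cycle count minus one. The function names and the 15 ns example value are made up for illustration; the real minimums and the real field encoding have to come from the DDR datasheet and the controller's register description.

```python
def min_clocks(t_min_ps: int, f_clk_khz: int) -> int:
    """Smallest whole number of clock cycles lasting at least t_min_ps.

    cycles = ceil(t_min_ps * f_clk_khz / 1e9), done in integer arithmetic
    so that exact boundaries (e.g. 15 ns at 200 MHz) are not disturbed
    by floating-point rounding.
    """
    scaled = t_min_ps * f_clk_khz
    return (scaled + 10**9 - 1) // 10**9


def encode_minus_one(cycles: int) -> int:
    """Hypothetical register encoding where the field holds (cycles - 1)."""
    if cycles < 1:
        raise ValueError("a timing needs at least one clock cycle")
    return cycles - 1


if __name__ == "__main__":
    # Illustrative values only: a 15 ns minimum at three clock rates.
    for f_khz in (200_000, 340_000, 400_000):
        cycles = min_clocks(15_000, f_khz)
        print(f_khz // 1000, "MHz:", cycles, "cycles, field =",
              encode_minus_one(cycles))
```

Rounding down instead of up at 340 MHz would give 5 cycles, which is only about 14.7 ns, and forgetting the minus-one encoding shifts the programmed value by a whole cycle. Both are the kind of small mistake that "almost works".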

gperoni · « **Reply #7 on:** December 25, 2016, 11:18:12 pm »

Sure, it's a TI DaVinci, DM36x. The DDR is a MT47H64M16HR-25:H.

lujji · « **Reply #8 on:** December 25, 2016, 11:52:33 pm »

Very interesting problem.

Quote

the voltages are looking perfectly alike the old revision, with a Vpp of around 100-150mV. It's very difficult to find noise differences between the two revisions.

Just to make sure, were you monitoring DDR voltages while the boards were actually in freezer?

mac.6 · « **Reply #9 on:** December 26, 2016, 09:03:45 am »

100-150mV ripple? On 3.3V or another core voltage? That can be too much ripple if you are close to the VDD limits. Don't forget to probe near the MCU power pin while reproducing the reset/lockup issues.
My first suspect would be the power supply, especially the decoupling caps, as you said they are from a different source (Occam's razor principle).

diyaudio · « **Reply #10 on:** December 26, 2016, 09:55:15 am »

Have you checked the load capacitors on the crystal? I had a problem with a SHARC DSP board where I forgot to add load capacitors (as was clearly mentioned in the data sheet). I almost wrote off a batch of boards as a result of a silly mistake.

mikeselectricstuff · « **Reply #11 on:** December 26, 2016, 03:41:23 pm »

Those symptoms are screaming "marginal timings" to me.
You ideally want to replace the OS+bootloader with something that just does memory tests and outputs the result on a LED, to make it easier to narrow things down.

RoGeorge · « **Reply #12 on:** December 26, 2016, 07:49:20 pm »

Quote from: gperoni on December 25, 2016, 05:38:57 pm

however I changed the SMD assembler. This last assembler provided some of the components, like decoupling caps. However, all of the supplies and noise tested on them is the same as the old boards, with the biggest ripple being a 100-150mVpp.

Did you try swapping all the decoupling capacitors between a working board from the old batch and a non-working one from the new batch?

gperoni · « **Reply #13 on:** December 26, 2016, 08:43:56 pm »

I really appreciate all the nice answers! Let's see if I can do a recap of what happened in the last day:

1) I solved a boot voltage sequencing problem where keeping the UART RX connected to a UART device (to read the boot log) was causing spikes in the 1.8V rail; the power-up sequence was wrong when the spikes coincided with power-on. This solved 90% of the cases where the board is powered but there is no output whatsoever.
2) I noticed that the RESET pin of the microprocessor was at 3.9V instead of 3.3V, this is due to how the PMIC is configured (open drain 100k ohm pull up at 5V), so I reworked a few of the boards to have it providing the correct 3.3V. I couldn't notice any difference.
3) The board whose DDR I replaced yesterday is still behaving more or less ok. However I don't feel like I have a statistically reliable basis to say changing the DDR decreased the occurrence of problems. I wish I had better ways to test those boards.
4) Somehow, the success rate at boot seems to be higher during the evening (third day in a row now). Tried room temperature and turning lights on/off. Still unexplained. Maybe this last point is completely made up.
5) I have two power supply boards I connect this board I'm debugging to. One of them has a better layout and is less noisy, so the noise on the 1.35V rail goes down from ~100-150mV to ~70-80mV. However this doesn't seem to help with boot.
6) I tried removing some of the bigger capacitors (4.7uF) from a board to see if it increased problems, but I couldn't see any difference. I added back 6uF 50V NPO, and again, I couldn't see any difference.
7) I ordered a couple of slightly faster DDR chips with a better CAS latency (5 vs 6). They should be here in a week. The problem I was talking about yesterday is that I'm using DDR chips with a CAS latency of 6, configured from the bootloader to work at 5, because the microprocessor doesn't support the higher CAS latency option. The only reason chips with a CAS latency of 6 work at 5 is that they are downclocked (340 vs 400 MHz). However, yes, it's asking for problems. This makes me think that maybe the last batch was working fine out of pure luck. There is a thread on SE about this.
8) I also noticed that increasing CPU/DDR frequency just a tiny bit (DDR goes up 2% to 340 MHz) increases the occurrence of the problem a lot.
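
The CAS latency arithmetic in points 7 and 8 can be sketched as follows. This is only a rough model: it assumes the chip needs a fixed absolute latency equal to its rated CL times its rated clock period (CL6 at 400 MHz, as stated above), which is a simplification. The speed-bin table in the DDR datasheet is the real authority, and the function names are made up for illustration.

```python
def cas_latency_ns(cl_cycles: int, f_clk_mhz: float) -> float:
    """Absolute CAS latency in nanoseconds for a given CL and clock."""
    return cl_cycles * 1000.0 / f_clk_mhz


def max_clock_mhz(cl_cycles: int, required_ns: float) -> float:
    """Highest clock at which cl_cycles still covers required_ns."""
    return cl_cycles * 1000.0 / required_ns


if __name__ == "__main__":
    # Rated operating point taken from the post: CL6 at 400 MHz.
    rated = cas_latency_ns(6, 400.0)
    print(f"rated: CL6 @ 400 MHz = {rated:.2f} ns")
    # Programmed CL5 at two nearby clocks (assumed values for comparison).
    for f in (333.0, 340.0):
        actual = cas_latency_ns(5, f)
        print(f"CL5 @ {f:.0f} MHz = {actual:.2f} ns, "
              f"margin {actual - rated:+.2f} ns")
    print(f"CL5 covers {rated:.0f} ns up to {max_clock_mhz(5, rated):.1f} MHz")
```

Under that assumption, CL5 at 340 MHz comes out roughly 0.3 ns short of the 15 ns that CL6 at 400 MHz provides, and the break-even clock is about 333 MHz. That would be consistent with a small clock increase making things noticeably worse, but it does not prove that CAS latency is the cause.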

So, left to do:

1) Wait for the new DDRs.
2) Replace the remaining capacitors, the 40 or so 0.1uF.
3) (optional) Buy a bench power supply to power the IC from it.

Uhm, it looks like writing everything down at the end of the day helps bring some order to the mess that is my mind right now!

Quote from: lujji on December 25, 2016, 11:52:33 pm

Just to make sure, were you monitoring DDR voltages while the boards were actually in freezer?

No, I take it out from the freezer and spend about 30 seconds powering it up. However the freezer is at -10C while it's 5C outside, so I prefer to just use outside to cool the boards down. Nothing in the board is rated for -10C and leaving the boards in the freezer for too long breaks even the most resistant boards, while others still fail to boot from 5C.

Quote from: mac.6 on December 26, 2016, 09:03:45 am

100-150mV ripple? on 3.3v or other core voltage? that can be too much ripple if you are close to vdd limits. Don't forget to probe near mcu power pin while reproducing the reset/lockup issues. My first suspect would be power supply, especially decoupling caps as you said they are from different source (occam razor principle).

This is basically the only reason why I'm considering buying a bench power supply, to see if a perfectly clean power supply fixes the problem all the time. Yes, 100-150mV is too high, but it is also very close to what I'm measuring in the old board... It's tricky. I should check voltages again tomorrow.

lujji · « **Reply #14 on:** December 27, 2016, 03:41:54 am »

Check DDR voltages when the board is at 5C. Perhaps the feedback of your supply circuitry changes with temperature which makes the voltage drop enough for RAM to misbehave. Also see if increasing DDR voltage makes any difference on 'faulty' boards.

MatteoX · « **Reply #15 on:** December 27, 2016, 05:06:57 am »

Quote from: mikeselectricstuff on December 26, 2016, 03:41:23 pm

Those symptoms are screaming "marginal timings" to me.
You ideally want to replace the OS+bootloader with something that just does memory tests and outputs the result on a LED, to make it easier to narrow things down.

I totally agree. From your description everything points to a marginal timing problem. When cold, the propagation delays of CMOS ICs are shorter.

You can easily try to replicate your problem using a freeze spray to selectively cool down particular component(s). I have successfully used this technique (freeze spray and heat gun) to vary delays and detect marginal timings on FPGA prototypes.

Brutte · « **Reply #16 on:** December 27, 2016, 01:14:01 pm »

That is an ARM926 core.
If an error occurs in a data transfer from/to memory, this CPU cannot even detect it, as that would require at least ECC. It can only detect an unaligned access and some permission faults.

On error in op-code transfer there is a chance the core senses the problem, for example as invalid op-code. If the core faces such event, it can act accordingly (log "I do not understand how to execute 0xDEADBEEF loaded from 0x01234567" or reboot, halt, whatever). Not sure what option was selected in your OS but if the core offers such hardware mechanism then I suggest to use it for debugging.

The register is called FSR, IFSR for instruction and DFSR for data.

Quote from: ARM926 TRM

Register c5 accesses the Fault Status Registers (FSRs). The FSRs contain the source of the last instruction or data fault. The instruction-side FSR is intended for debug purposes only.

There is also accompanying FAR (fault address register that indicates at what address the fault was detected). And there is a bunch of bits that configure the cache behaviour.
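
As a sketch of how a dumped FSR/FAR pair could be interpreted on the host side, the Python snippet below splits an FSR value into its status and domain fields. The field positions (status in bits [3:0], domain in bits [7:4]) and the status table are what I believe the ARM926EJ-S TRM specifies, but treat them as assumptions and check them against the manual before relying on them. The function names are made up for illustration.

```python
# Status encodings for FSR[3:0]. Believed to match the ARM926EJ-S TRM,
# but this table is an assumption: verify it against the manual.
FAULT_STATUS = {
    0b0001: "alignment",
    0b0011: "alignment",
    0b1100: "external abort on translation (first level)",
    0b1110: "external abort on translation (second level)",
    0b0101: "translation (section)",
    0b0111: "translation (page)",
    0b1001: "domain (section)",
    0b1011: "domain (page)",
    0b1101: "permission (section)",
    0b1111: "permission (page)",
    0b1000: "external abort (section)",
    0b1010: "external abort (page)",
}


def decode_fsr(fsr: int) -> dict:
    """Split a raw FSR value into status, domain and a readable meaning."""
    status = fsr & 0xF
    domain = (fsr >> 4) & 0xF
    meaning = FAULT_STATUS.get(status, "unknown/reserved")
    return {"status": status, "domain": domain, "meaning": meaning}


def format_fault(fsr: int, far: int) -> str:
    """One-line summary of a fault, suitable for a UART log."""
    d = decode_fsr(fsr)
    return (f"fault: {d['meaning']} (status=0x{d['status']:X}, "
            f"domain={d['domain']}) at address 0x{far:08X}")


if __name__ == "__main__":
    # Made-up example values, not taken from a real crash.
    print(format_fault(0x2D, 0x01234567))
```

Logging a line like this from the abort handler on every crash would show whether the faults cluster around one kind of access or one address range.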

I do not know the details of your setup and the root of your problem but even when you improve the layout and won't face the bsod's that often then that does not mean the memory interface works flawlessly as not every memory error ends up in "total crash".

free_electron · « **Reply #17 on:** December 27, 2016, 04:11:46 pm »

condensation ?

water is conductive. ice is not ... you say they work after being frozen ...

how good is your ddr layout ? did you respect the length-matching ?

diyaudio · « **Reply #18 on:** December 27, 2016, 06:55:43 pm »

Interesting how people talk of possible layout problems and don't post pictures of the ACTUAL layout.

gperoni · « **Reply #19 on:** January 05, 2017, 08:32:03 pm »

A small update. I got the faster DDRs, the ones with the lower CAS latency (MT47H64M16HR-25E:H vs MT47H64M16HR-25:H), but they are not helping. I also discovered what spring probes are and used them to test voltages but I couldn't find anything with more than 50mVpp noise. The PCB manufacturer is also saying they used the same stackup and materials.

A few thoughts:

1) I bought the new DDRs from China, maybe I got relabelled, slower DDRs. Unlikely. However the EMIF controller is configured with the CAS latency of the faster chips, so the older ones shouldn't be working. Buying from China was faster.
2) I got the freeze spray, nice tool. I can freeze the side of the PCB opposite to the ICs, and it looks like when the ground plane on the side of the ICs gets below 15-17 degrees C the running software freezes: a kernel panic sometimes, a segfault other times. By freezing the IC and the DDR I can also get them to crash, but that is not very useful information; unfortunately I can't selectively crash them, as they are very close together and the freeze spray sends out a -50C liquid.
3) I would love to take a look at the registers during a crash, but I don't have the JTAG pinout.

Even if tedious, I think replacing every capacitor, inductor and filter is worth a try, even if I'm not sure how those can help fix what looks like a DDR problem when the DDR power supply is clean. Other than that, maybe I can try changing some DDR timing settings. However, I'm out of ideas after those.

Thank you everyone for the help!

Brutte · « **Reply #20 on:** January 05, 2017, 09:25:00 pm »

Quote from: gperoni on January 05, 2017, 08:32:03 pm

3) I would love to take a look at the registers during a crash, but I don't have the JTAG pinout.

Oopsie.
No JTAG means that it is going to be much harder.

When the core faces some unexpected event (like opcode with flipped bit), it automagically suspends current execution and jumps to (raises) a fault handler (I think there are three fault handlers in ARM926 for three categories of nastiness). So one could provide a custom fault handler that is kept in designated on-board SRAM section (you do not want to load that fault handler to/from faulty DDR) that would dump all the core registers (including FSR, FAR, part of the stack, etc) in a human-readable format. From that data you could infer where the actual problem lies. I suspect your OS should support such debugging options and various test setups that stretch hardware beyond standard requirements, etc.

Of course with JTAG all that interrogation can be made with a click of a button.

thm_w · « **Reply #21 on:** January 05, 2017, 11:55:26 pm »

Quote from: free_electron on December 27, 2016, 04:11:46 pm

condensation ?
water is conductive. ice is not ... you say they work after being frozen ...
how good is your ddr layout ? did you respect the length-matching ?

We had a vendor product that would die at cold temps due to condensation. It was hard because they never saw the problem initially (presumably their environmental chamber would not allow condensation to form), then wouldn't admit it was a problem when we demonstrated it. The problem was that a boot pin was floating, so the condensation would pull it to the wrong state.

Likely not OP's issue though.

mikeselectricstuff · « **Reply #22 on:** January 07, 2017, 08:44:51 am »

You are really making life very hard for yourself using "the OS crashes" as the only diagnostic to track down a memory fault, and how will you know whether you've actually fixed it? When the OS runs for an hour, a day, a month?
A few hours spent writing (or finding) some bare-metal code that just sits in internal RAM (so RAM faults don't crash the system) and continuously does external RAM tests, outputting status to an LED or UART, will give you far more insight into the effects of any fixes and an understanding of what the actual issue is, so you can be sure that any fix actually solves the issue.
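
The kind of test meant here can be sketched on a host machine before porting it to bare metal. The Python below is only a model: `SimRam` is a made-up simulated 16-bit memory with optional injected faults, standing in for the external DDR, and the two tests are the classic walking-ones data bus check and a power-of-two address aliasing check. A real version would be C or assembly running from internal RAM, as described above.

```python
class SimRam:
    """Simulated 16-bit-wide RAM with optional injected faults (model only)."""

    def __init__(self, words, stuck_low_bit=None, dead_addr_bit=None):
        self.cells = [0] * words
        self.stuck_low_bit = stuck_low_bit    # data bit that always reads 0
        self.dead_addr_bit = dead_addr_bit    # address bit that is ignored

    def _map(self, addr):
        if self.dead_addr_bit is not None:
            addr &= ~(1 << self.dead_addr_bit)
        return addr

    def write(self, addr, value):
        self.cells[self._map(addr)] = value & 0xFFFF

    def read(self, addr):
        value = self.cells[self._map(addr)]
        if self.stuck_low_bit is not None:
            value &= ~(1 << self.stuck_low_bit)
        return value


def data_bus_test(ram, addr=0, width=16):
    """Walking ones: return the list of data bits that fail to read back."""
    bad = []
    for bit in range(width):
        pattern = 1 << bit
        ram.write(addr, pattern)
        if ram.read(addr) != pattern:
            bad.append(bit)
    return bad


def address_bus_test(ram, words):
    """Write a distinct value at offset 0 and at each power-of-two offset,
    then read them all back. A non-empty result means some offsets alias."""
    offsets = [0] + [1 << b for b in range(words.bit_length() - 1)]
    for off in offsets:
        ram.write(off, off ^ 0xA5A5)
    return [off for off in offsets if ram.read(off) != (off ^ 0xA5A5)]


if __name__ == "__main__":
    print("good RAM:", data_bus_test(SimRam(1024)), address_bus_test(SimRam(1024), 1024))
    print("stuck data bit 3:", data_bus_test(SimRam(1024, stuck_low_bit=3)))
    print("dead address bit 4:", address_bus_test(SimRam(1024, dead_addr_bit=4), 1024))
```

Because each test returns a list of failing bits or offsets rather than just crashing, the result can be counted over many cold runs and compared before and after a change.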

gperoni · « **Reply #23 on:** January 07, 2017, 10:59:39 pm »

Absolutely right, I was very close to writing those routines but decided to replace the DDR yet another time instead, with a 2-year-old Samsung (K4T1G164QF-BCF7) I found while searching for other components. It turns out I can now freeze the board to -50 Celsius and see it working.

So, yeah, DDR timings, as everyone here was saying! The small detail is that the DDR timings between the two DDRs are absolutely the same.

Anyway, by looking at the differences between the registers and the U-Boot initialisation code I found some big and small errors (like x instead of x-1), and it looks like the Micron DDR is now 10-100 times more stable, even when cold. The Samsung however still behaves much better at low temperatures (-10), where it doesn't crash, unlike the Micron.

Not sure what to do next, replace the Micron DDRs with Samsung (it's out of production but still available and cheap) and have an overall better product for reasons I don't understand (this troubles me) or just stick with the Micron DDR that seems to be stable enough.

I know, I shouldn't be basing decisions on feelings. What I'm doing for testing is just running the "production software" with a fan forcing -10 degree air towards the product. It doesn't crash with the Samsung, it does with the Micron, even though the temperature rating of the chips is the same. Maybe I just got lucky Samsung chips, as it looks like I did with my old Micron batch. Maybe it's a hardware problem: some of the PCBs don't have perfectly solderable pads, and the Samsung DDR has a smaller nonconductive overmold and falls into place much more, so bigger solder contacts (flatter balls) make a better contact?
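
One way to put a number on "it doesn't crash" is to ask how many consecutive clean cold boots are needed before a given failure rate can be ruled out. The Python sketch below assumes every boot is an independent trial with the same failure probability under the same temperature conditions, which is a strong simplification for hardware that warms up as it runs. The function name and the example rates are made up for illustration.

```python
import math


def clean_boots_needed(max_fail_rate: float, confidence: float) -> int:
    """Consecutive passing boots needed to claim, at the given confidence,
    that the per-boot failure rate is below max_fail_rate.

    If the true failure rate were max_fail_rate, the chance of n passes
    in a row would be (1 - max_fail_rate) ** n. We want that chance to be
    at most (1 - confidence).
    """
    if not (0.0 < max_fail_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("rates must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_fail_rate))


if __name__ == "__main__":
    # Example rates chosen for illustration only.
    for rate in (0.30, 0.05, 0.01):
        n = clean_boots_needed(rate, 0.95)
        print(f"to rule out a {rate:.0%} failure rate: {n} clean boots in a row")
```

A handful of clean boots is enough to rule out a board that fails a third of the time, but ruling out a 1% failure rate takes hundreds of cold boots, which is an argument for automating the test.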

So, yeah, sort of solved, not sure why.

mikeselectricstuff · « **Reply #24 on:** January 07, 2017, 11:38:16 pm »

Quote from: gperoni on January 07, 2017, 10:59:39 pm

So, yeah, sort of solved, not sure why.

If you don't know why, how do you know it won't crop up again? or under some circumstances you've not tested yet?
My guess is something is out of spec and the Micron parts have more margin.
At least this batch does, maybe the next will also, or maybe not...

nctnico · « **Reply #25 on:** January 07, 2017, 11:49:53 pm »

I wouldn't rule out other parts though. I had a similar problem with a design and it turned out a new batch of decoupling capacitors was much worse than before (tested with DC bias + LCR meter). Fortunately it was fixable in software, but on the next board revision I specced different capacitors and added some extra.

gperoni · « **Reply #26 on:** January 08, 2017, 01:02:36 am »

Quote from: mikeselectricstuff on January 07, 2017, 11:38:16 pm

If you don't know why, how do you know it won't crop up again?

And this is why after having replaced the DDR yesterday I spent the whole of today looking at timing parameters and improved the situation with the Micron chips. I wish my understanding of those problems was deeper.

Saying that bad capacitors might be causing the Micron DDR to behave badly while Samsung is less susceptible is a very good point, I should probably just try new chips in the old board or new capacitors in the new one.

gperoni · « **Reply #27 on:** July 21, 2017, 10:09:04 am »

So, a 6-month update. The problem discussed in the thread was solved by changing the DDR, replacing it with one with the same timing parameters but, I guess, less demanding requirements. Since it's now time to respin the board, I'm wondering if I can get your opinion on whether or not the DDR layout of the board is decent, or if it should be improved.

This is the DDR layout of a reference design board, a Leopardboard DM368:
http://imgur.com/a/YymoU

This is our DDR layout:
http://imgur.com/a/LVveI

Thanks a lot!

nctnico · « **Reply #28 on:** July 21, 2017, 04:23:28 pm »

Can't you route all the data traces on the top layer? The address lines are way less critical than the data lines. I'd also put the data signals from the same lane on the same layer.

gperoni · « **Reply #29 on:** July 21, 2017, 07:41:05 pm »

I think the constraint on routing all of the data lines on the top layer is space. This is quite a small board and that's all the space there is to route the DDR; we can't move the DDR further from the IC. I asked the designer about keeping data/address in different planes, and he says that yes, that would be an improvement, but he isn't sold on it, as he considers it mostly a waste of time since the board was working fine in the past, for two (prototype) revisions, before that batch.

I sort of agree with him. Please don't take this the wrong way, I extremely appreciate people making suggestions here for free (!!!) but ultimately he is the guy in charge. If there is nothing blatantly wrong for a 400 MHz DDR design, I think we shouldn't move the data/address lines to different planes, as this would take a lot of time.

However, what we just saw as a possible improvement is moving the DDR VREF trace further back from the signal lines (not depicted, so here is a screenshot, the fat one on the right: [image not preserved])

Thank you!

nctnico · « **Reply #30 on:** July 21, 2017, 08:52:18 pm »

The problem with using many layers is that you need to have ground layers or very good decoupling between power and signal layers. In a DDR400 design I did, I routed all data lines on the top layer using 0.1mm traces / 0.15mm spacing. Yes, this is a lot of work but I think it will help signal integrity. This probably needs swapping data lines (but keep lanes together) to avoid needing vias. Perhaps reducing the layers used to top and bottom with a solid ground plane in between will already help. It is important that each trace has the same impedance/capacitance, and I don't see that happening when using several inner layers.
The fact that the board didn't work with some chips shows the design is marginal.

gperoni · « **Reply #31 on:** July 22, 2017, 12:29:24 am »

Ok, fine. It's after 2AM here and I can't sleep. It's because I looked at another design I was going to manufacture, which has the same DDR/IC, and I got super scared when I realised the DDR there is routed using another 4 signal layers, but those are all stacked together with no ground/power plane in between, just adjacent to the first and fourth signal layers. I saw another post of yours, nctnico, where you say impedance starts becoming a factor only when traces are multiple cm long and running at hundreds of MHz, and you suggested focusing on trace lengths. Do you think I can get away with that impedance-control-free, 4-stacked-signal-layers design, given that the maximum trace length is around 1cm? (Still DDR400.)

The stackup of the board discussed in this topic is better: ground, signal, signal, ground (repeat). There isn't half a mm of separation between the two signal layers as I saw in some application notes from Micron, unfortunately, it's just ~100um. By plugging the numbers in a calculator I get 73 Ohm for a microstrip and 58 Ohm by using the asymmetric stripline model, basically ignoring the second signal layer stacked between the two ground/power planes, that I guess makes sense and should say the impedance in there is not that wrong?
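
For anyone wanting to repeat the microstrip estimate, a common closed-form approximation (the one usually attributed to IPC-2141) can be sketched in Python as below. The dimensions in the example are hypothetical, not the ones from this board, so the result is not expected to match the 73 Ohm figure above. The formula is only a rough guide and is usually quoted as valid for width-to-height ratios of roughly 0.1 to 2; a field solver or the PCB manufacturer's numbers should take precedence.

```python
import math


def microstrip_z0(h_mm: float, w_mm: float, t_mm: float, er: float) -> float:
    """Approximate surface microstrip impedance in ohms.

    h_mm: dielectric height to the reference plane
    w_mm: trace width
    t_mm: copper thickness
    er:   relative permittivity of the dielectric
    """
    return 87.0 / math.sqrt(er + 1.41) * math.log(5.98 * h_mm / (0.8 * w_mm + t_mm))


if __name__ == "__main__":
    # Hypothetical stackup: 0.1 mm prepreg, 35 um copper, FR4 with er = 4.3.
    for w in (0.10, 0.125, 0.15):
        z = microstrip_z0(h_mm=0.1, w_mm=w, t_mm=0.035, er=4.3)
        print(f"w = {w:.3f} mm -> Z0 = {z:.1f} ohm")
```

Sweeping the width like this shows how strongly the impedance depends on trace geometry for a thin dielectric, which is useful when deciding how much a routing change actually matters.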

If that's the case, for this board, I believe the only improvements possible are the ones you suggested.

(I feel like a baby with being unable to sleep for this, but the idea of spending all that time again debugging this DDR scares me, and you convinced me there is work to do in here, nctnico).

Btw, Thank you for your time, again.

Rerouter · « **Reply #32 on:** July 24, 2017, 02:39:47 am »

Your layout does look lacking in return paths for signals, with a high risk of crosstalk. But yes, for these distances you only need to be within an order of magnitude for impedance, so focus on it last.

Your stackup should be OK, likely with the data running between the 2 planes, and remember to via-connect the planes near the source and the DDR.

You will not always be able to, but larger-radius meanders have less velocity factor shift, so your matching will be closer to what you expect.

A ground reference for your signals will remove your current crosstalk issues. I would say you can definitely do it on 4 layers, you may just have to get creative.

VALTERLED · « **Reply #33 on:** January 20, 2021, 09:25:11 pm »

Hi all,
I am working on a similar issue at the moment, but it's a fault that developed on an existing board from a Wandel Goltermann machine, the SPM-19.

The CPU board is built around an Intel 8085.
The problem is that at a room temperature of 16-18C the board does not boot. If I warm ANY IC, and I mean ANY, whether any EPROM, any RAM or the CPU, the circuit boots correctly. 20C is enough to make it work.
Once it boots, the board works fine forever, and the small amount of heat produced after 1 minute or so of operation keeps the problem away until it cools again.
I tried removing every single EPROM and the CPU from their sockets, one at a time, warming each one outside the board (20C is enough) and then re-plugging it, and that makes the board boot perfectly. So I must rule out the PCB as the faulty part, since any of the components, if removed and heated, makes the board work. What's strange is that it's enough to heat just one of the components sharing the address bus and data bus and the board will boot.

I will focus on the Addr and Data bus and see what happens there. I will check the pull-ups and everything is involved and is common to the CPU and associated components.
Do you have any clue of what to look for?
Thank you for reply,
Valter

srb1954 · « **Reply #34 on:** January 21, 2021, 06:43:40 am »

Did you try cooling the chips with a freezer spray while it is running correctly to see if it stops again?

VALTERLED · « **Reply #35 on:** January 21, 2021, 07:43:05 am »

Hi,
thank you for reply,
Not yet systematically. I could do it by blowing cool air with a hair dryer, as I have done until now to cool the board (it's around 13-15C in my lab), but I would prefer to purchase a can of freezer spray and do it this coming Saturday.

Thanks for the advice, I'll keep you posted on the result.
Ciao Valter

srb1954 · « **Reply #36 on:** January 21, 2021, 08:24:41 am »

Easier to use a can of freezer spray or if you don't have this you can use a compressed air duster can held upside down (with the nozzle to the bottom).

These can quickly cool chips down to much lower temperatures than a fan can. It is also possible to pinpoint a faulty chip more precisely because of the small area cooled by the spray compared to that of a fan.

VALTERLED · « **Reply #37 on:** January 21, 2021, 11:18:00 am »

I'll certainly do.

What's really strange is that, from a "non-working state", heating ANY of the components on the data/address bus to around 20C cures the booting. There is no specific component responsible for the problem; any one of them (and I replaced almost all of them, as they are socketed) can make the board work.
I am thinking of some physical effect that must have appeared after many years (the machine is 30 years old), like the temperature dependence of the reverse saturation current of diodes, something that involves all the components sitting on the bus... I don't know, I will investigate more during the weekend.
In over 40 years of electronics I have never met such an issue.

Thank you for support,
Valter

