I really appreciate all the nice answers! Let's see if I can do a recap of what happened in the last day:
1) I solved a boot voltage sequencing problem: keeping the UART RX connected to a UART device (to read the boot log) was causing spikes on the 1.8V rail, and the power-up sequence went wrong whenever those spikes coincided with power-on. This fixed 90% of the cases where the board is powered but produces no output whatsoever.
2) I noticed that the RESET pin of the microprocessor was at 3.9V instead of 3.3V. This is due to how the PMIC is configured (open drain, 100k ohm pull-up to 5V), so I reworked a few of the boards so that it provides the correct 3.3V. I couldn't notice any difference.
3) The board whose DDR I replaced yesterday is still behaving more or less ok. However I don't feel like I have a statistically reliable basis to say changing the DDR decreased the occurrence of problems. I wish I had better ways to test those boards.
4) Somehow, the success rate at boot seems to be higher during the evening (third day in a row now). Tried room temperature and turning lights on/off. Still unexplained. Maybe this last point is completely made up.
5) I have two power supply boards that I can connect the board I'm debugging to. One of them has a better layout and is less noisy, so the noise on the 1.35V rail goes down from ~100-150mV to ~70-80mV. However, this doesn't seem to help with boot.
6) I tried removing some of the bigger capacitors (4.7uF) from a board to see if it increased problems, but I couldn't see any difference. I added back 6uF 50V NPO, and again, I couldn't see any difference.
7) I ordered a couple of slightly faster DDR chips, those have a better CAS Latency (5 vs 6). They should be here in a week. The problem I was talking about yesterday is that I'm using DDR chips with a CAS latency of 6, configured from the bootloader to work at 5, because the microprocessor doesn't support the higher CAS latency option. The only reason chips with CAS of 6 work at 5 is that they are downclocked (340 vs 400 MHz). However yes, it's asking for problems. This makes me think that maybe the last batch was working fine just out of pure luck.
There is a thread on SE about this.
8) I also noticed that increasing CPU/DDR frequency just a tiny bit (DDR goes up 2% to 340 MHz) increases the occurrence of the problem a lot.
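To put rough numbers on points 7 and 8: the absolute CAS delay is just the CAS latency in cycles times the clock period. Below is a quick sketch in Python. It assumes that the chip's CL6 rating at 400 MHz translates to a 15 ns minimum delay (the actual datasheet value may differ), and that the clock before the 2% bump was about 333.3 MHz (inferred from "up 2% to 340 MHz", not measured). The function name is made up for illustration.

```python
def cas_delay_ns(cas_cycles: int, clock_mhz: float) -> float:
    """Absolute CAS delay in ns: number of cycles times the clock period."""
    clock_period_ns = 1000.0 / clock_mhz
    return cas_cycles * clock_period_ns

# Assumption: the chip's rating (CL6 at 400 MHz) defines the minimum delay it needs.
rated_ns = cas_delay_ns(6, 400.0)

# Assumption: 333.33 MHz is the clock before the 2% increase to 340 MHz.
for clock_mhz in (333.33, 340.0):
    actual_ns = cas_delay_ns(5, clock_mhz)
    margin_ns = actual_ns - rated_ns
    print(f"CL5 at {clock_mhz:.2f} MHz: {actual_ns:.3f} ns "
          f"(margin vs rated: {margin_ns:+.3f} ns)")
```

If those assumptions hold, CL5 at about 333 MHz sits right at the rated delay with essentially no margin, and at 340 MHz it falls slightly below it. That would be consistent with the small frequency bump making things worse, but it is only arithmetic, not a measurement.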
So, left to do:
1) Wait for the new DDRs.
2) Replace the remaining capacitors, the 40 or so 0.1uF.
3) (optional) Buy a bench power supply to power the IC from it.
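For the statistics worry in point 3 of the recap (and for judging the new DDRs once they arrive), one option is to count failed boots over a fixed number of power cycles before and after a change and run Fisher's exact test on the two counts. This is only a sketch using the standard library. The function name and the boot counts in the example are hypothetical, not my measurements, and it assumes every power cycle is an independent trial, which may not be true given the evening effect.

```python
from math import comb

def fisher_one_sided(fail_old: int, n_old: int, fail_new: int, n_new: int) -> float:
    """One-sided Fisher exact p-value.

    Probability of seeing fail_new or fewer failures in the new group
    if the change made no difference at all (hypergeometric tail).
    """
    total = n_old + n_new
    total_fail = fail_old + fail_new
    denom = comb(total, n_new)
    # Smallest possible number of failures that can land in the new group.
    k_min = max(0, n_new - (total - total_fail))
    p = 0.0
    for k in range(k_min, fail_new + 1):
        p += comb(total_fail, k) * comb(total - total_fail, n_new - k) / denom
    return p

# Hypothetical counts: 6 failed boots out of 20 before the DDR swap, 2 out of 20 after.
p_value = fisher_one_sided(fail_old=6, n_old=20, fail_new=2, n_new=20)
print(f"one-sided p-value: {p_value:.3f}")
```

A small p-value means the improvement is unlikely to be luck. With only a handful of boots per board the test will rarely reach that, so automating a few hundred power cycles per configuration is probably needed before drawing conclusions.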
Uhm, it looks like writing everything down at the end of the day helps bring some order to the mess that is my mind right now!
Just to make sure, were you monitoring DDR voltages while the boards were actually in the freezer?
No, I take it out from the freezer and spend about 30 seconds powering it up. However the freezer is at -10C while it's 5C outside, so I prefer to just use outside to cool the boards down. Nothing in the board is rated for -10C and leaving the boards in the freezer for too long breaks even the most resistant boards, while others still fail to boot from 5C.
100-150mV ripple? On 3.3V or another core voltage? That can be too much ripple if you are close to the VDD limits. Don't forget to probe near the MCU power pins while reproducing the reset/lockup issues. My first suspect would be the power supply, especially the decoupling caps, since you said they come from a different source (Occam's razor).
This is basically the only reason why I'm considering buying a bench power supply, to see if a perfectly clean power supply fixes the problem all the time. Yes, 100-150mV is too high, but it is also very close to what I'm measuring in the old board... It's tricky. I should check voltages again tomorrow.