Is the MCLR pin so ultra-sensitive to ESD that everywhere from the PCB assembly house to our factory it is going to get its MCLR pin killed because we forgot to add the 10k resistor from MCLR to Vdd?
Highly doubtful, IMO.
First off, you keep saying "we." Did you write the code? Did you design the board? If not, why are you stuck with this problem?
Are you 100% sure about all of these assertions? In particular:
Some of the PCBs work, most don’t, and some work the first time they are powered up but never thereafter. By “non-working”, I mean they light up but don’t respond to DALI dimming commands sent to the PIC18F26K20.
Never is a really long time. How many tries did you make? And did you let the power drain completely each time? For the ones that work, how many power cycles have you done to verify they "always" work, again letting the power drain completely? Also, did you try reflashing the ones that didn't work or stopped working? Did the PICkit 3 verify that they flashed? This would be one way to rule out ESD damage to MCLR; they won't reflash if you fried this pin, lol.
and
We then had an external modification applied to the PCBs which involves wiring a 10k resistor from MCLR pin to Vdd, and a 10n capacitor from MCLR pin to Vss. I undid this modification on one of the few working boards and then found that stopped it working. –But when I re-did this modification the PCB still did not work.
Sample of 1 is pretty thin. Also, you did not explicitly state whether this modification actually makes your boards work. You imply it but never say so. If this fixes your boards 100%, you should probably be done with your testing and be reworking your boards, lol.
In case you are not 100% sure of these assertions, I would try this:
First and foremost, power cycle and test a board at least 20 or 30 times, making sure to disconnect power long enough for the caps to drain completely. It may seem like some boards are good and some are bad, but it could be a random factor that isn't specific to the boards, until you positively rule it out. For instance, the contents of data memory are random on startup, and so are some of the special function registers. But this randomness is only pseudo-random: even if the power is drained completely, there is still a good chance that a given chip "prefers" to start up with a 1 in a given spot where another chip prefers a 0. If you do enough cycles, this should show up.
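To put a rough number on why 20 or 30 cycles and not 3: here is a minimal sketch in C (runs on a PC, not on the PIC) of the chance of seeing an intermittent fault at least once in n power cycles. The function name `prob_fault_seen` is made up for illustration, and it assumes every power-up is independent with the same failure probability, which is a simplification given that individual chips can have startup "preferences".

```c
#include <assert.h>

/* Chance of seeing at least one failure in n power cycles, assuming each
 * cycle fails independently with probability p (an idealization). */
double prob_fault_seen(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 0);

    double all_pass = 1.0;              /* probability that every cycle passes */
    for (int i = 0; i < n; i++) {
        all_pass *= (1.0 - p);
    }
    return 1.0 - all_pass;
}
```

Under that assumption, a fault that hits on 10% of power-ups has only about a 27% chance of showing itself in 3 cycles, but about a 96% chance in 30.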
1. If you reflash the ones that don't work (or have stopped working), do they sometimes work, again, at least on the first powerup?
yes, no, random
2. Try reflashing the board while the board is powered up. And do not disconnect the power until you have tested the board. Does it work, then?
yes, no, random
3. If 2 is a no: If you can enable MCLR, do so, and try 2, again, this time doing MCLR reset... again, not removing power from the board. Does the board work, now?
yes, no, random
Other things to check:
Power-up timer enabled? The PWRTEN bit in the config word should be set so that the timer is on.
Brownout detection settings?
Did you scope the supply rail for noise?
Is the clock running? Is there any physical way to check with an oscilloscope if the chip is running?
There should be a minimum delay after power is good before attempting a write to EEPROM, on some devices.
This brings up another point...
If the EEPROM is readable, check whether there is any difference between the good boards and the bad ones. Perhaps some bug related to EEPROM, or a spurious/erroneous EEPROM write, is leaving an invalid parameter that bricks the boards.
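If you can read the EEPROM out of a good board and a bad board into two byte buffers, the comparison itself is trivial. A minimal host-side sketch in C, assuming you already have the two dumps in memory (how you get them out of your programmer is up to you; `eeprom_diff` is a made-up name, not part of any Microchip tool):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two EEPROM dumps of equal length.
 * Stores up to max_offsets differing byte offsets in offsets[] (may be NULL)
 * and returns the total number of bytes that differ. */
size_t eeprom_diff(const uint8_t *good, const uint8_t *bad, size_t len,
                   size_t *offsets, size_t max_offsets)
{
    assert(good != NULL && bad != NULL);

    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (good[i] != bad[i]) {
            if (offsets != NULL && count < max_offsets) {
                offsets[count] = i;     /* remember where the dumps disagree */
            }
            count++;
        }
    }
    return count;
}
```

Any offset that differs between a working board and a dead one is worth looking up in the firmware to see which parameter lives there.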
only two of porta pins are actually used as analog, the rest are digital output and i hear this can result in RMW problem as per pg136 of pic18f26k20 datasheet, but i am not sure.
This wouldn't have any relationship to MCLR, so I'm not sure why this is on your radar. MCLR is on its own port E, and a read-modify-write error to an unused input pin shouldn't cause a problem. Why would you even be reading or writing to this pin, at all?
There are LAT registers for all the other pins... no reason for there to be read-modify-write error unless whoever coded it ported the code over from some other device and didn't tick all the boxes. And even then, it would reset on power cycle, unless it causes some erroneous write to EEPROM which is out of bounds.
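For anyone unsure what the read-modify-write issue actually is, here is a toy model in C that runs on a PC. It is not PIC code: `sim_port` and the two setter functions are invented for illustration. It models a port where some pins read back 0 no matter what the latch holds (for example pins configured as analog, or a heavily loaded output), and compares setting a bit through the pin read-back against setting it through the latch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one 8-bit I/O port. */
typedef struct {
    uint8_t lat;        /* output latch: what firmware last wrote            */
    uint8_t reads_low;  /* mask of pins that read back 0 whatever the latch  */
} sim_port;

/* Reading the PORT register returns pin levels, not the latch contents. */
static uint8_t port_read(const sim_port *p)
{
    return p->lat & (uint8_t)~p->reads_low;
}

/* PORT-style bit set: read the pins, OR in the bit, write back to the latch. */
void set_bit_via_port(sim_port *p, uint8_t bit)
{
    assert(bit < 8);
    p->lat = port_read(p) | (uint8_t)(1u << bit);
}

/* LAT-style bit set: read the latch itself, OR in the bit, write it back. */
void set_bit_via_lat(sim_port *p, uint8_t bit)
{
    assert(bit < 8);
    p->lat = p->lat | (uint8_t)(1u << bit);
}
```

With the latch at 0x04 and pin 2 reading low, setting bit 0 through the PORT path leaves the latch at 0x01 (bit 2 silently lost), while the LAT path leaves it at 0x05. Either way this lives in the output latch, so it is gone after a power cycle.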
If you are positive that your assertions are correct, and there is no change to EEPROM, I would be scoping the power rail and the MCLR pin with a digital oscilloscope while you cycle the boards on and off. Until you have seen a physical problem with your own eyes and verified some physical damage, you should be suspecting human error, not a unicorn. ESD damage to a floating pin while the rest of the board is powered/grounded would be pretty miraculous, IMO, especially if you can replicate it on demand. Does your board even produce any high voltage/EMF (in the 10kV+ range)?
Is the MCLR pin so ultra-sensitive to ESD that everywhere from the PCB assembly house to our factory it is going to get its MCLR pin killed because we forgot to add the 10k resistor from MCLR to Vdd?
Where did you get the idea that the MCLR pin is super sensitive to ESD? This makes zero sense to me. If you intentionally broke every ESD precaution you could think of, I still don't think you would kill most of your boards. Unless your unused MCLR pin has an ICSP trace attached to an antenna for some reason, I don't understand where your concern is coming from.
This is CMOS technology. At WORST, ESD sensitivity is as bad as a signal FET. Even if it were as sensitive as a laser diode, you are still not going to unintentionally break MOST of them. And if this were the case, your problem would be that you can't flash your boards, not that they work once and then stop.