Quote: And once the PCBs arrived, none of them would allow me to burn the boot loader onto the ATMEGA via the programming header. Had to pull all the ATMEGAs off and flash them in a socket. Put them back on and 90% of the boards I made just don't work as expected. This is after P&P and hand soldering over £300 worth of components >.< I just don't know what to do any more.
First of all, you brute-forced a workaround instead of solving the actual problem. You should find the root cause of the problem and then take measures to fix it. The same goes for boards that were delivered DOA. Get some of them back from the customers and find out exactly what stops them working.
Note: I am adding to this list as I think of things, so you may want to re-read it in case I have added more since you last read it.
Questions
- Have you got any units back that a client says don't work? If so, do they work for you or are they dead?
- Is your QC test automated? If so, are you sure it is not faulty and letting dead units out the door? Maybe a manual test is needed so you know for sure all units leaving you are 100% working.
- Is there anything unusual about your location or the places you ship to? Some places irradiate all their mail with high-energy X-rays and this can destroy electronics.
- Does the MCU/software system interface with or talk to something that differs in different parts of the world? Here are examples of what I mean: maybe Bluetooth to a phone, or comms to a desktop PC app. Maybe some people have their phone/desktop set to a different timezone/country/language and maybe your app is incompatible with this.
- Do you have the source code for your ATMega, or does the freelancer hold that?
A few possibilities I can think of.
- Are users connecting the battery the wrong way around and damaging the device? (9V the wrong way around for a split second, that sort of thing.)
- Do you have any floating input pins on the MCU? Maybe the code only works if an input is read as either low or high, but it keeps changing with ambient noise. When built it might stay in one state, but in noisy environments maybe it floats high and stops the code running. (Floating inputs should have MCU pull-ups enabled in software, but maybe they are not set in your code?)
- Have you tried powering the device from 4.5V with, let's say, a 50mA current limit? Not all USB ports are created equal. Maybe your product is quite critical on power and not all USB ports can power it.
- Where are you getting your parts from? Maybe you are getting lots of fake ICs.
- Could be a PCB track routing issue where tracks run too close to a hole or board edge and sometimes get cut by the drill/router etc. Some PCBs work, some don't, some are intermittent.
- Are you sure you have the ATmega fuse bits set correctly? Maybe the startup delay, brownout detector or crystal settings are wrong and this is making it run intermittently.
- Does the product have protection from ESD or PSU spikes, like a TVS? Does the product get used in a location where it might need this, e.g. automotive/industrial?
- How are you programming the MCUs? I once had a crappy USBASP programmer that would brick 2 out of 5 AVRs it flashed. Not sure why, maybe the clock was out of spec and it kept erasing fuse bits.
- Does your MCU programming system include a verify check?
- There is one AVR MCU, can't remember which, that comes with a fuse bit set to put it into a compatibility mode where it pretends to be a different AVR chip. Some of the IO/peripherals don't work until you get it out of that mode. :palm: (My guess is they have a supply agreement to sell a compatible chip for 25 years for MIL/MED/AERO.) EDIT: All ATMega128 pretend to be an ATMega103 until you change the M103C fuse bit.
Quote: And once the PCBs arrived, none of them would allow me to burn the boot loader onto the ATMEGA via the programming header. Had to pull all the ATMEGAs off and flash them in a socket. Put them back on and 90% of the boards I made just don't work as expected.
This makes me lean towards a PCB/SCH issue.
Are you using the MISO, MOSI and SCK pins for anything other than programming?
You can use them for other things too, but you need to make sure you don't load the lines so much that programming is affected.
Programming can become intermittent if you load them or have caps from the lines to GND.
Also, grab one of those boards that doesn't program and use a DMM to check the tracks between the programming header pins (GND, VCC, MOSI, MISO, SCK, RESET) and the ATmega pads for those pins. Also check that none are shorted together.
Help
- Are your PCB files in Altium? If so, I'm happy to take a look at your SCH/PCB/CODE and see if I can spot any potential problems.
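On the fuse-bit question above, one quick sanity check is to read the three fuse bytes back from a chip and decode them. The sketch below assumes an ATmega328P (the part mentioned elsewhere in this thread) and uses the bit positions from its datasheet. The function name and the example byte values are only for illustration, they are not taken from the OP's boards. The bytes themselves can be read with something like avrdude's `-U lfuse:r:-:h` (and likewise for hfuse and efuse).

```python
# Decode ATmega328P fuse bytes (assumed part; bit layout per its datasheet).
# A fuse bit that reads 0 is "programmed" (active); 1 means unprogrammed.

BODLEVEL = {0b111: "disabled", 0b110: "1.8 V", 0b101: "2.7 V", 0b100: "4.3 V"}


def decode_fuses(lfuse, hfuse, efuse):
    """Return the fuse settings most relevant to startup and programming."""

    def programmed(byte, bit):
        return (byte >> bit) & 1 == 0

    return {
        "CKDIV8": programmed(lfuse, 7),    # clock divided by 8 at startup
        "CKOUT": programmed(lfuse, 6),     # clock output on a pin
        "SUT": (lfuse >> 4) & 0b11,        # startup time selection
        "CKSEL": lfuse & 0b1111,           # clock source selection
        "RSTDISBL": programmed(hfuse, 7),  # if True, RESET pin is disabled (no ISP)
        "SPIEN": programmed(hfuse, 5),     # serial programming enabled
        "BOOTRST": programmed(hfuse, 0),   # reset vector points at boot loader
        "BODLEVEL": BODLEVEL.get(efuse & 0b111, "reserved"),
    }


# Example input: the datasheet factory defaults (internal 8 MHz RC, divide by 8).
settings = decode_fuses(0x62, 0xD9, 0xFF)
for name, value in settings.items():
    print(f"{name:9s} {value}")
```

If CKSEL selects a crystal or resonator range that does not match what is fitted, or the brown-out level is disabled on a noisy supply, that would fit the "runs intermittently" symptom. That is a guess to check, not a diagnosis.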
What does your power input stage look like? Ceramic cap and LDO by any chance?
If so, try plugging in the power with the supply already switched on, and with reasonably long power leads, you might be blowing the regulator due to ringing in the LC circuit formed by the power cable and the input cap (Cure is a jellybean electrolytic about 10 times the value of the input ceramic in parallel with it, the ESR damps the ringing).
Are ALL of your external IO lines fitted with some form of ESD protection?
No floating inputs on the micro?
Is everything run well within datasheet ratings?
I had an issue with a production board once where we suddenly started getting a very high failure rate. It turned out the SPI bus was being driven on the wrong clock edge, and the first batch of the peripheral chips just happened to work with zero hold time, the next ones not so much.
You need to get a few duds back and investigate.
Regards, Dan.
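To make the "wrong clock edge" point concrete: which edge the data is sampled on follows from the SPI mode, i.e. the CPOL/CPHA pair. Here is a minimal sketch of that mapping. The function name is made up for illustration. The mode a given peripheral expects has to come from its datasheet.

```python
def spi_edges(cpol, cpha):
    """Return (sample_edge, shift_edge) for an SPI mode given CPOL and CPHA."""
    # CPOL sets the idle level of the clock, which decides whether the
    # leading edge of each clock pulse is rising (idle low) or falling (idle high).
    leading = "rising" if cpol == 0 else "falling"
    trailing = "falling" if cpol == 0 else "rising"
    # CPHA=0: data is sampled on the leading edge and changed on the trailing edge.
    # CPHA=1: data is changed on the leading edge and sampled on the trailing edge.
    return (leading, trailing) if cpha == 0 else (trailing, leading)


for mode in range(4):
    cpol, cpha = mode >> 1, mode & 1
    sample, shift = spi_edges(cpol, cpha)
    print(f"mode {mode}: CPOL={cpol} CPHA={cpha} sample on {sample}, shift on {shift}")
```

If the master changes its data on the same edge the slave samples on, the slave only works as long as it tolerates zero hold time, which matches the batch-to-batch behaviour described above.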
Maybe the flux you're using needs to be cleaned off the boards, otherwise it is causing some resistance or short circuits?
Maybe you're accidentally bridging two pins of your microcontrollers during soldering?
Do you have screw holes near traces? Maybe you're shorting traces with screws or breaking them with friction over traces?
If USB powered... are you assuming you're getting clean 5V? Maybe the guys at the other end have USB leads that are too long, causing voltage drop, or maybe they have stupid unregulated phone-charger style USB things pumping 5.5-6V into your boards?
Inductance on the long USB cable causing voltage spikes? Not enough capacitance on input and output of regulators that could damage the regulators or cause them to reset/glitch? Bad output capacitors on regulators?
Where do they install these products? Are there powerful magnets or some induction things or something that could be picked up by your circuit and affect it?
Maybe share at the very least a picture of the assembled board... if it's too much of a secret to show a schematic or something more complex.
Hard to help without details. Anyhow, the fact the 5-10% that get shipped end up not working is an indication there is something wrong with the product or processes around and the manufacturer should have stopped shipping such a product.
Quote: I have the boards that have failed from when clients used them. More than half worked as expected. Some did fail testing.
Action items right now: debug the boards that do fail testing, try to find the exact root cause, record all observations. Dig out all failure reports with devices that subsequently DID pass tests, record them somewhere and look for patterns. Ask customers for more info if necessary.
The first thing you really have to do, before considering anything else, is to triage the failures and get to the bottom of them.
Quote: Action items right now: debug the boards that do fail testing, try to find the exact root cause, record all observations.
Exactly. Investigating the dead boards is essential. There are too many possible causes to find the problem without proper debugging. Analyze where those boards failed and report your findings, possibly together with a schematic.
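For the "record them somewhere and look for patterns" part, even a flat list of records and a tally per attribute is a start. A minimal sketch follows. The field names and every value in it are placeholders invented for illustration, not data from this thread.

```python
from collections import Counter

# Placeholder records only: field names and values are made up for illustration.
reports = [
    {"unit": "u1", "pcb_batch": "batch 1", "ftdi_source": "supplier A", "bench_retest": "fail"},
    {"unit": "u2", "pcb_batch": "batch 2", "ftdi_source": "supplier B", "bench_retest": "pass"},
    {"unit": "u3", "pcb_batch": "batch 2", "ftdi_source": "supplier A", "bench_retest": "fail"},
    {"unit": "u4", "pcb_batch": "batch 1", "ftdi_source": "supplier B", "bench_retest": "pass"},
]


def tally_failures(records, field):
    """Count bench failures grouped by one recorded attribute."""
    return Counter(r[field] for r in records if r["bench_retest"] == "fail")


for field in ("pcb_batch", "ftdi_source"):
    print(field, dict(tally_failures(reports, field)))
```

An attribute where the failures pile up on one value is worth a closer look. One where they spread evenly probably is not the cause.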
I have the boards that have failed from when clients used them. More than half worked as expected. Some did fail testing.
QC is done by hand. I run the boards on a test program for 24 hours. We also test after this that the PWM input, switches and firmware all work.
...
The boards are encased in an aluminium block with >1.5mm wall around with a few cutouts for buttons and display.
Do you by any chance put screws into this aluminum case? If so are they pre-threaded?
Thread forming screws (or just tight screws) into aluminium creates lots of metal filings!
The metal flakes may not cause a problem initially but after being shaken around in transport maybe they get all over the place and short IC pins.
Test: Put down a clean sheet of copy paper on a desk. Grab a finished unit and carefully open it up on the paper. Tap out the board & case and see if any metal flakes come out. The paper will give a good contrast to make them easy to see.
Quote: This product gets used all over, but most boards that have failed, arrived to the user bad. My hunch was xray but I have nothing to back that up with.
Normal airport X-ray is totally fine, that won't do anything. Only the high-power X-rays used to sterilize mail are a concern. Those are usually found at government buildings.
Quote: I use an Arduino to load the boot loader onto the ATMEGA and then a FTDI for flashing firmware. AVRdude checks and we make sure it confirms all is good. Not sure about the ATMega328p doing this or not.
FYI - There's a new chip out, the ATMega328PB, which is not the same as an ATMega328P. It's easy to think 'oh that's just the lead version' but no, it's a different chip with some different pinouts.
Quote: I do use those programming pins for the NRF24L01.
Right, so the ATMega328 SPI pins are used for flash programming of the MCU and also for talking to the NRF24L01 chip over SPI?
How are you handling the reset line on the ATMega328? Is it pulled high externally? Is it connected to an external button or something?
I just wonder if it's possible for the ATmega to go into the reset state for some reason while comms to the NRF24L01 are active, and somehow get garbage sent to the ATmega while it's in the reset-low state (program mode).
I'm not sure this is actually possible, because there should be no SPI clock once the MCU goes into reset.
I'm just thinking out loud. Maybe someone else will have a thought reading this.
How are you handling merging of the USB Vbus power onto the 5V from the voltage regulator output?
Normally you would diode-OR the two sources, but from the PCB layout it looks more like the 5V from USB is connected straight to the regulator output?
Voltage regulators do not like a higher voltage on their output than on their input. They tend to die.
That can happen if you connect 5V from USB onto the output of a 5V reg and then remove the battery that's powering the input!
I could see you doing all QC tests with a battery always connected, but a user connecting USB first because they have a shiny new toy and can't wait to plug it in before they can source a battery.
Have a look at these 2 areas on dead PCBs.
There may be issues where the track has broken or shorted etc.
Quote: The text on the LCSC one is barely visible, like this is better than what I can see with my eyes.
Text is not visible because IC is dirty.
LCSC part looks counterfeit.
:palm:
I think it's correct, debugging is essential.
No triaging work has been done so far
Quote from Jackster: I have the boards that have failed from when clients used them. More than half worked as expected. Some did fail testing.
So you had perfectly working boards and now some of them aren't working.
So why don't you trace what has failed on those boards? Is it the MCU? One of the power rails? The crystal? Something else? That is essential! Knowing that would be a huge help.
Please do it and report it here.
Quote: I am not sure what was wrong with them other than the firmware stops running as it should be. Power is fine, sensor input is fine. The only thing it could be is the MCU or WiFi board.
As I (and the wise Mike) said: check the crystal/oscillator. Put an oscilloscope on it to see if it's working. Keep in mind the correct probe range/capacitance while probing.
Also, "power is fine": did you check it with a DMM or with an oscilloscope?
I bet that with an oscilloscope you could find the issue in minutes. Since you're doing circuits for work, consider purchasing one.
Don't forget that the scope probe itself can start it oscillating. If possible use a x100 probe. Or better, look at an output pin that is being toggled by software after startup.
Quote: I don't have an oscilloscope. Just multi meter.
Buy one.
Will a £50 USB one be enough?
Quote: I don't have an oscilloscope. Just multi meter.
You need at least double the bandwidth of your clock.
If I may, I would invest time in understanding WHAT has failed instead of trying to fix something that you don't know exactly.
The Hantek DSO5102P is fine; you could also choose a Siglent SDS1052DL or a used classic Rigol DS1052E.
I know that you will have to learn how to use it properly, but it's mandatory and not so difficult (if I can use it, anyone can). You can't live without it.
Once you have an oscilloscope, you can't (and wouldn't) go back!
Imagine this: without that instrument you are blind.
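As a rough feel for the bandwidth question: a common rule of thumb for a scope with a single-pole response is rise time ≈ 0.35 / bandwidth, and a square-wave clock only looks square if a few of its odd harmonics fit inside the bandwidth. The sketch below just works those two numbers out. The 16 MHz clock is an assumption (the clock frequency is not stated in the thread), and the function names are made up.

```python
def scope_rise_time_ns(bandwidth_mhz):
    """Approximate 10-90% rise time of the scope itself (0.35 / BW rule of thumb)."""
    return 0.35 / (bandwidth_mhz * 1e6) * 1e9


def highest_odd_harmonic(clock_mhz, bandwidth_mhz):
    """Highest odd harmonic of a square-wave clock that still fits in the bandwidth."""
    n = int(bandwidth_mhz // clock_mhz)
    if n < 1:
        return 0
    return n if n % 2 == 1 else n - 1


CLOCK_MHZ = 16  # assumed MCU clock, not taken from the thread

for bw in (50, 100):
    print(
        f"{bw} MHz scope: rise time about {scope_rise_time_ns(bw):.1f} ns, "
        f"shows up to harmonic {highest_odd_harmonic(CLOCK_MHZ, bw)} of a {CLOCK_MHZ} MHz clock"
    )
```

For checking whether an oscillator runs at all, or whether a supply rail dips, far less bandwidth is needed than for judging edge quality.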
So I managed to get some time in the office today and got one of the bad boards and started removing components and testing.
After a few, I removed the FTDI and it burned the boot loader.
Put all the other components back other than the FTDI and again it burned.
I then put the same FTDI chip back and it did not burn. Removed it and it burned.
I then got a known working FTDI chip from RS and it burnt the boot loader.
I need to double check all of this in the morning so this is more of a note for myself.
The LCSC ones are dated year 17 and the RS ones are dated 18.
Will take a working board and remove the working FTDI chip and replace with one of the new FTDI chips and see if it burns the boot loader in the morning.
Quote: If I may, I would invest time in understanding WHAT has failed instead of trying to fix something that you don't know exactly. The Hantek DSO5102P is fine; you could also choose a Siglent SDS1052DL or a used classic Rigol DS1052E. I know that you will have to learn how to use it properly, but it's mandatory and not so difficult (if I can use it, anyone can). You can't live without it. Once you have an oscilloscope, you can't (and wouldn't) go back! Imagine this: without that instrument you are blind.
IDK what I am looking for though with it.
Have you gotten boards back from customer? i.e. customer returns.
If you get a customer return board back on your bench does it work?
Possible fake FTDI chips? Hack a bad board to loop the FTDI's RX and TX pins, open the USB serial port with a terminal program and see if it responds NON GENUINE DEVICE FOUND! character by character as you type, instead of the loopback echoing what you type. (see: https://www.eevblog.com/forum/microcontrollers/ftdi-gate-2-0/ )
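If typing into a terminal gets tedious across a pile of boards, the same loopback check can be scripted. This is only a sketch: it assumes the pyserial package is installed, and the device name and baud rate you pass in are whatever your system uses. The comparison itself works on any object with `write`, `read` and `reset_input_buffer`, so it can be tried without hardware.

```python
def loopback_ok(port, payload=b"loopback test 123"):
    """Write payload and check that the same bytes come back (RX tied to TX)."""
    port.reset_input_buffer()          # drop anything stale
    port.write(payload)
    echoed = port.read(len(payload))   # returns fewer bytes if the read times out
    return echoed == payload


def check_device(device, baud=115200):
    """Open a real serial port with pyserial (assumed installed) and run the check."""
    import serial  # imported here so loopback_ok has no third-party dependency

    with serial.Serial(device, baud, timeout=1) as port:
        return loopback_ok(port)
```

Usage would be along the lines of `check_device("/dev/ttyUSB0")` or `check_device("COM5")`, with the port name adjusted for your machine. A result of False only says the echo did not match. Looking at what came back is still needed to tell a counterfeit-detection string from a wiring fault.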
If you think the FTDI chip might be involved, the next step is to work out possible ways in which it could cause the symptoms you're seeing.
Are there any physical pins on the FTDI chip that are also shared with pins which are needed to burn your boot loader?
Without knowing the details of your design, I'd suggest two possibilities - either:
a) there's a logic signal (or signals) in common. Are your programming (SPI / reset) pins connected to the FTDI, or are they separate? If they're completely separate, then it really shouldn't be able to interfere with boot loading via that route.
b) they share a common power supply, and something bad is happening which is causing the voltage at the MCU to go out of spec during programming. Does anything get warm?
Another option (c) is that the FTDI chip is a complete red herring, and the difference is caused by heating, cooling and flux contamination of your PCB when you remove and replace components. Be sure to thoroughly clean the board after every rework operation, especially in and around the MCU crystal if it has one.
Be careful to not conflate problems... looks like the earlier revision boards have a reliability issue but that needs proper post-mortem analysis.
Rev3 boards...
Quote: And once the PCBs arrived, none of them would allow me to burn the boot loader onto the ATMEGA via the programming header.
This is probably a different problem. Consider your design/manufacturing to be flawed and work from first principles on one board. Does your PCB software have a rules check? Check the uP pins for things that are GND that shouldn't be.
Populate one board with the bare minimum to program the uP onboard and work backwards.
Good Luck!
Quote: Have you gotten boards back from customer? i.e. customer returns. If you get a customer return board back on your bench does it work?
I second that question, and don't think the OP has answered it (unless I overlooked some comment). This is an important step in finding the root cause of the failures in the field.
@Jackster: Do you actually know that the boards somehow "broke" in transit? Or are they still in the same shape they were in when you sent them out, but don't work at the customers while they still work when you (re-)test them at home?
Quote: I remember many years ago, a separate State branch of the organisation I worked for was tasked with making boards that would automatically ring various phone numbers when required. We duly received our portion of those devices, but unfortunately they didn't work. When we complained to the other State, they protested: "But we tested them & they all rang up who they were supposed to!" Yup! They dutifully programmed the whole number needed to call those sites from their State into the PROMs. Those additional numbers, of course, weren't needed in the State they were intended for, & would "freak the exchange out".
I'd argue that's an error on the exchange end, but you'll still have to deal with it.
Why have you got reset connected to DTR via a capacitor?
Quote: It will go through the code 3-6 times and then hang.
THIS. You must understand why it hangs. If it runs at least 3-6 times, the hardware is probably not "broken" (as in burnt or physically damaged). Maybe something on the hardware is changing, or running at its limit, and affecting what the software reads.
I'm puzzled.
We don't know what the device in question does exactly. We know only it uses Wi-Fi and USB.
We have no data on what is wrong on the board or what does not work (hardware? software?). We know only that some boards don't work anymore.
We don't know in which mode the boards fail (what should they do? Do they work partially? Do they not turn on? Does the microcontroller not start up? Does the Wi-Fi not connect? Is there no output? What?)
There is no diagnostic data except "Power is fine, sensor input is fine".
I don't think we can guess the problem because there are so many possibilities, from fake chips to ESD to user fault... the list goes on and on.
I think this is not the right way to proceed. We need data.
Can you post here (or privately via PM) a full schematic and at least a working diagram of your sketch and principle of operation, together with detailed symptoms?
There are many people here who want to help you, but try to provide something more to work on.
The device takes a PWM input from a sensor and displays the result on 7 segment displays.
It can transmit this info over the WiFi interface to another that will take the data from the WiFi and display it on its seven segment display.
Quote: I'm puzzled.
The lack of an oscilloscope is also a bit puzzling.
Quote: The device takes a PWM input from a sensor and displays the result on 7 segment displays. It can transmit this info over the WiFi interface to another that will take the data from the WiFi and display it on its seven segment display.
Is that the unit conversion device for sonar measurements which you had posted about earlier, by any chance? These are used on boats, I would assume. Are you sure they handle the vibrations and humidity well?
(Is this maybe also the "three boards connected via FFC connectors" design you have also asked questions about in earlier threads? If so, are you sure the connectors are robust enough, and are you sure the signals make it across the connections in good shape?)
May I add a personal comment: It would not hurt if you had the courtesy to let us know what your product is and does, and give us a link to the website you presumably have. You are selling these for profit, it seems, and are asking for free advice here. At least satisfy our curiosity in return; and the information may help with the troubleshooting as well.
Quote: ... my guess would be oscillator startup issues - assuming you use a crystal or ceramic resonator. This would be consistent with inability to program and some units working and some not. ... If you have a dead board in front of you, poke the oscillator pins and see if it starts.
Good thinking!
I am not saying the FTDI chip is the cause btw. Just that when it is removed and replaced with a known working one, there are no boot loader issues.
I am pretty much on board with the issue being the board design. Probably just coincidence that the old FTDI is working and the new ones are not?
Only tested a handful.
Quote: Possible fake FTDI chips? Hack a bad board to loop the FTDI's RX and TX pins, open the USB serial port with a terminal program and see if it responds NON GENUINE DEVICE FOUND! character by character as you type, instead of the loopback echoing what you type. (see: https://www.eevblog.com/forum/microcontrollers/ftdi-gate-2-0/ )
I don't think so. They all have different serial numbers. They were cheap though at $2.78 per chip.
Doing as you said, it just echoes back what I type.
This is a fun thread... it has something for everyone and not enough information for any proper conclusions...
.. The schematic showed an ICSP header... I presume that is what is being used to program the uP? The requirements for ICSP are minimal... if it isn't working on Rev3 boards you either have some extra shorting to ground (due to errant ground pour) or dodgy chips.
Ignore any talk about Flux, ESD etc... until all the obvious has been eliminated. For now just use a multimeter on continuity to determine if any ground shorts on Rev3 boards.
Quote: it has something for everyone and not enough information for any proper conclusions...
In my opinion it is lacking something. Not enough data.
I have had hundreds of boards from PCBway and this is the first issue I have had.
Thank you for letting us know about the issue.
Quote: I have had hundreds of boards from PCBway and this is the first issue I have had.
As a suggestion for the future, remember to always order board testing for production boards.
Quote: I thought they did do that over x number of boards ordered?
Probably. But it's the "every X boards" that makes the difference.
Quote: Upon questioning, the operator doing the testing admitted that he'd removed the test of that particular net because it was failing too often. He had neither checked a board manually by himself, nor told anyone else that there was a problem.
This is insane.
A couple of other things (apart from the fouled-up PCB spacing):
- Not enough bulk capacitance in the design.
- Not enough local capacitance in the design.
- The crystal you use is a resonator. Your processor fuse bits may need to be tuned for that! Check what the load capacitance and bleed resistor are in those things. You can buy those in different variants and the tuning needs to be done.
- Aluminum case... do you connect that electrically to your system ground?
Arduino is not exactly a case of 'proper' design. They take too many shortcuts.
That aside , what else is on your board. Anything drawing pulsed currents like muxed displays , rf transmitters etc ?
The green PCB is correct. The Black one is one from the bad batch from PCBway.
The spacing between the holes is 1.27mm and the distance between the solder pads should be 0.375mm.
Quote: Arduino is not exactly a case of 'proper' design. They take too many shortcuts. That aside, what else is on your board? Anything drawing pulsed currents like muxed displays, RF transmitters etc?
Yea I have 4-8 seven segments displays and a nrf24l01.
Quote: I know there are a few other things that are not correct but we will see if they need fixing.
I have to admit that I am slightly furious after reading this sentence. If there is something wrong with your design and you know it, but decide to go "nah, it isn't that bad, people buying this won't notice!", you are in for a very bad surprise. Reputation is slowly getting more important again, after years of cheap electronics that fail just after the warranty ends because of obvious design flaws that are not corrected because of cost and a "well, it works NOW, why should I care if it works in 3 years?" mentality.
Thanks for your information, I checked that your order BATCH3 is with the same pad design
and the production file of it is with smaller pads as your design. The difference is that order BATCH3
and BATCH2 produced at different production line, and it need different way to prepare production file.
Quote: Arduino is not exactly a case of 'proper' design. They take too many shortcuts. That aside, what else is on your board? Anything drawing pulsed currents like muxed displays, RF transmitters etc?
Quote: Yea I have 4-8 seven segments displays and a nrf24l01.
That would be one possibility. Muxed displays draw peak currents. Any kind of noise on your power rail and the CPU may brown out. Same for RF transmitters. It looks like you have those mounted above the CPU...
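To put a rough number on the peak-current point: if the local capacitance has to carry a current pulse on its own until the regulator catches up, the rail sags by about ΔV = I · Δt / C. The sketch below only evaluates that formula. All the figures in it are assumed for illustration, none are measured from the OP's board, and it ignores capacitor ESR and the regulator's own response.

```python
def droop_volts(pulse_current_a, pulse_duration_s, capacitance_f):
    """Voltage sag when a capacitor alone supplies a constant current pulse."""
    return pulse_current_a * pulse_duration_s / capacitance_f


def min_capacitance_f(pulse_current_a, pulse_duration_s, max_droop_v):
    """Capacitance needed to keep the sag below max_droop_v for the same pulse."""
    return pulse_current_a * pulse_duration_s / max_droop_v


# Assumed example: a 100 mA step lasting 10 us (e.g. a display digit switching on).
PULSE_A = 0.100
PULSE_S = 10e-6

for cap_uf in (0.1, 10, 100):
    sag = droop_volts(PULSE_A, PULSE_S, cap_uf * 1e-6)
    print(f"{cap_uf:>5} uF -> about {sag:.3f} V sag")

needed = min_capacitance_f(PULSE_A, PULSE_S, 0.05)
print(f"to stay within 50 mV: about {needed * 1e6:.0f} uF")
```

A sag that comes out larger than the rail itself simply means the capacitor cannot carry the pulse at all, which is the "not enough bulk/local capacitance" situation raised earlier in the thread. A scope on the 5V rail would show whether it actually happens.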
Looking at the clues posted in this thread, I'm almost 99.9% positive there's more to your problems than just the PCB mishap. Although it's possible such "almost shorted" pads could give intermittent operation, it's unlikely it happens in multiple units. You have so many unexplained incidents of it failing, working again, then failing again.
If you want to become a professional design engineer, do yourself a favor and as soon as you have a moment of silence, don't go on to design more features, or a more advanced product, but instead, try to do a proper root cause analysis. Instead of just building products, try to build a process/a "factory" where you can robustly build these products without wasting a lot of time.
You seem to have many issues, some are likely correlated, some are not.
In a stressful situation, we tend to fall back into trying to just get things to work by whatever means. Like can't get the MCU flashed? It's not a total showstopper, swap the board and go on. But in the long run, solving the problem once and for all would pay back in time used, and, it could turn out it's connected to your other issues, so they would be solved as well.
When I was looking at this comment of yours:
"They just develop a fault where the software no longer cycles. This can happen on new boards too. It will go through the code 3-6 times and then hang."
I thought, you are very lucky. You have a lot of specimens that do fail, on your hands. And you have consistent failures. You don't need to operate a well-performing product for weeks to see a failure. If I understand correctly, you have at least one (1) unit in your hands with which you can demonstrate a failure within minutes or hours. That's great.
It doesn't matter what the fault is and what do you think it might be caused by. Given this particular failure you can demonstrate, go for full-blown root cause analysis and see what you find.
You just need to make your steps smaller, and lower level. Whenever you hit a wall of not knowing how to do it, Google it, learn it.
I don't personally use debuggers a lot, but this could be a case where you'd get a starting point. Failing to have one, just make your code turn an LED on/off at different points of code, after a few iterations of moving around where you turn the LED on/off you have found the exact place in code where it hangs.
If your MCU isn't flashing, look at the communication signals with an oscilloscope, decode the contents. It may take several hours, but then you know exactly where it hangs. Chances are, you find some analog signaling issue (stuck logic level, bad rise/fall time)... in two seconds after looking at the scope screen.
Get yourself the basic tools, a 50MHz 2-channel digital storage oscilloscope being a bare minimum to debug such a design. A $400 4-channel Rigol or similar is more than enough, but I'm sure you can get an older generation thing used for maybe $100.