What's the trickiest bug you've ever tracked down..?

I'm sure many of us have had some real hair-tearing moments debugging hardware/software... here's my worst...

I designed a product for a client, containing an early AVR (AT90S4414), which worked well and was very successful.
Atmel then discontinued the 90S4414 and replaced it with the 90S8515, which was just a version with more memory. The sales rep told my customer that it was a drop-in replacement, and I wasn't involved in the change. The new chips worked fine, so my customer was happy.

A few years later they discontinued the 90S8515, so we changed to the ATTiny8515. There were a few software tweaks due to moved IO registers, but after fixing this all seemed OK. Until we started getting calls from users.

We were getting symptoms that looked exactly like randomly occurring EEPROM corruption: units would test and calibrate fine at the factory, and still be fine the next day at outgoing inspection, but some time later they would appear to randomly lose their cal data. This was exacerbated by these customers being overseas and receiving what appeared to be DOA units.

For those who want to try working it out themselves, the only other relevant factor is that there was a 1000 µF cap across the 5V rail, from which the AVR's 3V rail was derived.

Here's the answer.

After many days' head-scratching (really not helped by focussing on memories of early AVRs losing EEPROM in brownout), I suddenly had a 'Eureka' moment....  

As the original code was written for a device with 256 bytes of EEPROM, there was no EE address high byte to write.

The 90S8515 has an EE address high byte register, but it was always cleared to 0 on reset, so no problem.

The ATTiny8515 does NOT initialise as many of its registers on reset as its predecessor did, and this includes the EE address registers.

What was happening was that the EE address high register was being initialised to a random value on power-up, so the cal data was being written to a random half of the EEPROM. The capacitor ensured that this random page selection lasted for several days without power. It was only when a unit was powered down for several days (like when shipped to an overseas customer) that it would then sometimes power up in the opposite state, and hence appear to have lost its EEPROM data.
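
The failure mode is easy to reproduce in a toy model. The sketch below is hypothetical Python, not AVR code: the `Eeprom` class, its `addr_high`/`addr_low` fields and the `legacy_*`/`fixed_*` helpers are made-up names standing in for the EE address registers and for the old and corrected firmware. It assumes a 512-byte EEPROM split into two 256-byte halves, selected by a single high-address bit that is undefined at power-up.

```python
EEPROM_SIZE = 512  # assumed: two 256-byte halves selected by one high-address bit


class Eeprom:
    """Toy EEPROM whose high address bit is NOT cleared by reset."""

    def __init__(self):
        self.cells = [0xFF] * EEPROM_SIZE  # erased state
        self.addr_high = 0
        self.addr_low = 0

    def power_up(self, leftover_high_bit):
        # Cell contents survive a power cycle; the high address bit comes up
        # in whatever state the register happens to be in.
        self.addr_high = leftover_high_bit & 0x01

    def write(self, value):
        self.cells[(self.addr_high << 8) | self.addr_low] = value & 0xFF

    def read(self):
        return self.cells[(self.addr_high << 8) | self.addr_low]


def legacy_write(ee, offset, value):
    # Code written for a 256-byte part: only the low address byte is set.
    ee.addr_low = offset & 0xFF
    ee.write(value)


def legacy_read(ee, offset):
    ee.addr_low = offset & 0xFF
    return ee.read()


def fixed_write(ee, offset, value):
    # The fix: always set the high address byte explicitly as well.
    ee.addr_high = 0
    ee.addr_low = offset & 0xFF
    ee.write(value)


def fixed_read(ee, offset):
    ee.addr_high = 0
    ee.addr_low = offset & 0xFF
    return ee.read()


ee = Eeprom()
ee.power_up(leftover_high_bit=1)    # factory calibration, bit happens to be 1
legacy_write(ee, 0x10, 0x5A)

ee.power_up(leftover_high_bit=1)    # next-day inspection, cap held the state
print(hex(legacy_read(ee, 0x10)))   # 0x5a

ee.power_up(leftover_high_bit=0)    # after shipping, opposite state
print(hex(legacy_read(ee, 0x10)))   # 0xff, cal data appears lost
```

The data is never actually corrupted; it just sits in the half of the EEPROM that the old code is no longer addressing. With `fixed_write`/`fixed_read` the power-up state of the high bit stops mattering.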

OK, the answer was in the datasheet, but how many people read _every_ bit of _every_ section of the datasheet when upgrading to a new part....?

I'm sure many folks here are fans of the "Air Crash Investigation" series - fans will know how often a crash is a result of a sequence of events, sometimes years apart....

Have an industrial controller, several years in production, and made in reasonable volume production (~1000/yr).  We very rarely had an unusual occurrence.  The CPU is a MOT 68337.  On chip is a CAN module (TOUCAN) that we implement to communicate DeviceNet.  For years we never heard about any systemic problems.  We had a customer describe an issue, that several times an hour the DeviceNet would stop communicating.  We looked at the obvious, like field wiring, power, noise, etc and found nothing.  It was then discovered that inputs, either on the discrete inputs, or certain RS232 comms or even analog inputs, occurring at a rate close to 100 Hz would kill the DeviceNet.

With no obvious cause, we spent a month or so on code analysis.  I was convinced that there was an interrupt sneaking in and overwhelming the CPU.  Tests and more tests confirmed that the CAN had among the highest interrupt priorities, and no unknown or "sneak" interrupts were present.  No matter what we changed, the sensitive frequency seemed to be around 96-103 Hz.  We thought that there was some strange code interaction when certain functions ran at 100 Hz, so the next step was to change the clock parameters to alter the main loop "scan" frequency, which is 500 Hz.  We changed the divider and readjusted the CAN baud rate so that we had a scan of 250 Hz.  Same problem, still at 100 Hz inputs.  Another change in clock rate still resulted in the 100 Hz sensitivity.

Next we looked at the +5 volt switcher on the board.  We looked at the frequency response of the power supply, swept it, injected test signals, etc. and found that 100 Hz injected into the power supply summing junction would replicate the problem.  It looked like a real possibility, but response tests of the power supply showed no sensitivity, peaks or notches at 100 Hz.  Seemed like another near dead end.

Finally, late into the evening one day, some circuit mods resulted in a breakthrough.  It turns out that the PLL on the CPU seems to have a sensitivity to 100 Hz.  The PLL is powered through a pin called VddSYN.  This pin was coupled to Vdd through a simple R-C filter network.  By increasing the C value and upping the R, I could reduce the 100 Hz sensitivity (but not eliminate it).  There was a limit to how high the R value could be, since the VddSYN current draw would reduce the voltage to unacceptable levels.

I performed a simple test using an FFT response analyzer.  The card has inputs that can measure frequency using the CPU's counter/timers.  These timers are clocked off of the CPU clock.  We applied a known constant frequency to a frequency input using a high precision signal generator.  This input was converted to an analog value and made available on a D/A channel, using the CPU PWMs.  Statically the analog output was a very constant value, corresponding to the input frequency.  I then applied a frequency sweep from the FFT analyzer to an analog input, and monitored the D/A output with the analyzer.  Sure enough, at around 100 Hz the input came rolling out of the D/A.

So what was happening was that the CPU PLL was sensitive to supply variations at around 100 Hz.  The analog input caused some CPU activity at 100 Hz, which caused some +5 volt power supply wiggle at 100 Hz (around 5 mV).  This power supply noise was also present on the VddSYN PLL power, resulting in the CPU clock frequency wobbling a bit at 100 Hz.  Normally that is not an issue of any great concern, but the CAN baud rate was also varying, since it is derived internally from the CPU clock.  Since baud rate accuracy is relatively critical with CAN (at 1 Mbaud), errors on the CAN messages would quickly kill the DeviceNet comms and result in the problem.
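
To put a rough number on "relatively critical": the commonly quoted CAN bit-timing conditions bound the allowed clock deviation df per node by df <= SJW / (20 * NBT) and df <= min(PS1, PS2) / (2 * (13 * NBT - PS2)), with all segments counted in time quanta. The Python sketch below evaluates both limits. The function name and the segment values in the example are hypothetical, and they are not the bit-timing settings used on this controller.

```python
def can_clock_tolerance(nbt, phase_seg1, phase_seg2, sjw):
    """Max fractional clock deviation per node; all arguments in time quanta."""
    if min(nbt, phase_seg1, phase_seg2, sjw) <= 0:
        raise ValueError("all bit-timing values must be positive")
    # Limit from resynchronisation over the longest run without an edge.
    resync_limit = sjw / (20 * nbt)
    # Limit from sampling correctly after an error frame.
    error_limit = min(phase_seg1, phase_seg2) / (2 * (13 * nbt - phase_seg2))
    return min(resync_limit, error_limit)


# Hypothetical example: 10 time quanta per bit, PS1 = PS2 = 3, SJW = 1.
tol = can_clock_tolerance(nbt=10, phase_seg1=3, phase_seg2=3, sjw=1)
print(f"{tol * 100:.2f} %")  # 0.50 %
```

With these example settings the budget is about half a percent per node, and it has to cover crystal tolerance as well as any PLL wobble, so a clock that moves with supply ripple eats into a small margin.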

Took a long time to find, with hardware and software all coming under the microscope.  The fix was relatively easy.  We powered the VddSYN pin (PLL power) from a 5 volt reference supply on the card that is used for the A/D reference power.  This completely resolved the issue with no ill effects on the A/D operation or accuracy.  Although Motorola has a circuit suggestion for powering the VddSYN for "use in noisy environments", it was nothing more than a 2 pole R-C and did not fully resolve the problem.

This was my trickiest one to find.


A-sic Enginerd:
I've got a few. My top 5:

- Had a bug in our new 10/100T MAC. We had prototyped it with a laser etched gate array and the bug didn't show up until we got back first silicon for the first chip it went into. Don't recall the nature of the bug - data corruption, MAC would stall, something of that sort. Second step to debugging something seen in actual silicon is to try and reproduce it in simulation. (First step is to do a crap load of datapoint collection). Finally nailed it when my peer created a clock signal in sim that was the most obscure and whacky thing you'd ever seen. It was a square wave like my kid would draw with crayon when they were in kindergarten. Anyway, it happened to hit things just right and we saw the same symptom. Once we drove it to root cause we could more easily reproduce the error with more realistic sims.

- More recently (about 3 or 4 yrs ago) I had just started work for a new startup company. They had a chip they'd already released into production. Some of the recent shipments were seeing a measurable number of different and odd errors. Some saw data corruption, some were seeing incomplete frames, and a few had other symptoms; none of us could find any similarities or a common area of the chip. After weeks and weeks of debug and hundreds if not thousands of man hours running various tests and trying out different theories, we arrived at one conclusion - cosmic rays. Go ahead and laugh if you want, but in today's smaller geometries, and if there are certain issues with a fab process (which it turns out there were with the fab we were using), designs can become susceptible to upsets when hit with stray alpha or beta particles. In this case it was causing random bit errors in a very large memory / RAM that was in the chip. (512 bits wide, forget how deep).

- One where the bug was found pretty quickly, but dealing with it was about the biggest PITA I've ever had. Brand spankin' new chip design with craploads of new IP. It had an embedded ARM. We got first silicon in and put it on a brand new board design. The board was comparatively large and complex (about the size of an AT form factor with all kinds of PCI buses, 10/100 spigots, IR, USB, etc etc), and to make it all worse it was done by a guy who, although he was an experienced engineer in other respects, was doing his first board design. Well, it turns out someone screwed up and had tied the pin that enables the JTAG port to the ARM to the disabled state. Now of course we still needed to do as much testing on the chip as possible and time was of the essence (not only product schedules, but time and money involved on a chip spin...all that sort of business). The chip is pretty brain dead without the ARM, so we spent a lot of very long days trying to wake the thing up through an untested AMBA bus, that went out an unproven flash interface to an untested board designed by a rookie to a socketed flash part. Yeah....that was fun.

- Then there was that whole thing when doing senior project in college. We were due to perform the first semester's final demo. Project involved (among other things) one main board that had a microcontroller talking to some daughter boards we connected via headers and ribbon cables. Let's just say we burned our Xmas vacation discovering we were victims of an issue a few of us on the team had just learned about a couple weeks prior in a completely different class - ground bounce. We had very inadequate grounding between the daughter boards and things were noisy enough (whole lot of nothing but digital going on here) that the data going across the cables was generating its own clock. Symptoms were some data skipped while other data was double clocked.

- Oh yeah, almost forgot. Hit my first real brain twister while in college. Was during advanced logic design class. FPGAs were still fairly new on the scene, but our department chair always had the view of a pretty practical curriculum, as well as trying to keep up with latest trends seen in industry. This translates into some of the instructors learning at the same time as the students. Lab partner and I had this flash of brilliance: to solve a part of the project, we dropped in a set of transparent latches. Take note for the unknowing: Xilinx doesn't like latches.  ;D

The trickiest bug? Plenty...

The funniest few:

Lots of colored pencils stacked in a VCR.

Lots of almonds in an inkjet printer.

And many more "causes of damage", created by the unlimited imagination of the young devils.   ;D

We sent our products (sensors) to an automotive manufacturer. The product was giving false detections; a medium percentage of units did this, and the customer, F**d Motor Company, was giving us big complaints.

It was really difficult to open the sensor because it was potted in hard epoxy: 100% of the units you tried to open would be destroyed, and X-ray didn't give a clue.

So I suggested taking a good sensor off the production line before the epoxy was applied and trying to duplicate the failure (reverse engineering).

I started to isolate the failure. The sensor had a transmitter portion and a receiver portion, and the bad products were failing on reception.

So we started to play with the reception side and ended up discovering a problem with a component connection due to the solder paste used in the SMD process.

The big part of this is that I was working in a manufacturing plant, and we didn't have all the R&D tools that our R&D center had. We had an old analog oscilloscope that was difficult to set up for our purposes. As soon as the R&D guys knew we had found the problem, they traveled to the factory so we could give them a demonstration of how we did it. They saw us having a hard time operating the oscilloscope and told our manager to give us the green light to buy the best oscilloscope and any other instrumentation we wanted.

Also, I received a very good salary increase that year.


