Author Topic: What's the trickiest bug you've ever tracked down..? (Read 4641 times)

mikeselectricstuff · « **on:** May 20, 2010, 04:53:31 pm »

I'm sure many of us have had some real hair-tearing moments debugging hardware/software... here's my worst...

I designed a product for a client, containing an early AVR (AT90S4414), which worked well and was very successful.
Atmel discontinued the 90S4144 (or was it 4414?) and this was replaced with the 8515, which was just an increased memory version - the sales rep told my customer that it was a drop-in replacement and I wasn't involved in the change - the new chips worked fine so my customer was happy.

A few years later they discontinued the 90S8515, so we changed to the ATTiny8515. There were a few software tweaks due to moved IO registers, but after fixing this all seemed OK. Until we started getting calls from users.

We were getting symptoms that looked exactly like randomly occurring eeprom corruption - units would test and calibrate fine at the factory, and still be fine the next day at outgoing inspection, but some time later they would appear to randomly lose their cal data. This was exacerbated by these customers being overseas and receiving what appeared to be DOA units.

For those who want to try working it out themselves, the only other relevant factor is that there was a 1000 µF cap across the 5V rail, from which the AVR's 3V rail was derived.

The answer (hidden as drag-to-reveal text in the original post):

After many days' head-scratching (really not helped by focussing on memories of early AVRs losing EEPROM in brownout), I suddenly had a 'Eureka' moment....

As the original code was written for a device with 256 bytes of EEPROM, there was no EE address high byte to write.

The 90S8515 has an EE address high byte register, but it was always cleared to 0 on reset, so no problem.

The ATTiny8515 does NOT initialise as many of its regs on reset as its predecessor did; this includes the EE address regs.

What was happening was that the EE address high register was being initialised to a random value on power-up, so the cal data was being written to a random half of the EEPROM. The capacitor ensured that this random page selection lasted for several days without power. It was only when a unit was powered down for several days (like when shipped to an overseas customer) that it would then sometimes power up in the opposite state, and hence appear to have lost its EEPROM data.
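The failure mode is easy to reproduce in a toy model. The Python sketch below is only an illustration, not the original firmware: the class `ToyEeprom`, the address `CAL_ADDR_LOW` and the calibration value `0x42` are all made up for the example. It models a 512-byte EEPROM whose address high bit keeps whatever value it had at power-up unless the code writes it explicitly.

```python
EEPROM_SIZE = 512     # two 256-byte halves, as on the larger part
ERASED = 0xFF         # value of a never-written cell
CAL_ADDR_LOW = 0x10   # hypothetical low-byte address of the cal data


class ToyEeprom:
    """Toy EEPROM with a split address register (high bit + low byte)."""

    def __init__(self):
        self.cells = [ERASED] * EEPROM_SIZE
        self.addr_high = 0

    def power_up(self, high_after_reset):
        # Older part: always 0 here. Newer part: effectively undefined.
        self.addr_high = high_after_reset & 1

    def _address(self, addr_low, addr_high):
        # Legacy code passes addr_high=None and never touches the high bit.
        if addr_high is not None:
            self.addr_high = addr_high & 1
        return (self.addr_high << 8) | (addr_low & 0xFF)

    def write(self, addr_low, value, addr_high=None):
        self.cells[self._address(addr_low, addr_high)] = value & 0xFF

    def read(self, addr_low, addr_high=None):
        return self.cells[self._address(addr_low, addr_high)]


def factory_then_field(high_at_factory, high_in_field, fixed_firmware):
    """Calibrate at the factory, power-cycle, then read back in the field."""
    high = 0 if fixed_firmware else None   # the fix: always set the high bit
    ee = ToyEeprom()
    ee.power_up(high_at_factory)
    ee.write(CAL_ADDR_LOW, 0x42, addr_high=high)
    ee.power_up(high_in_field)
    return ee.read(CAL_ADDR_LOW, addr_high=high)


if __name__ == "__main__":
    print(hex(factory_then_field(1, 1, fixed_firmware=False)))  # same state: data found
    print(hex(factory_then_field(1, 0, fixed_firmware=False)))  # flipped: looks erased
    print(hex(factory_then_field(1, 0, fixed_firmware=True)))   # fixed: data found
```

With the legacy behaviour the data is still in the EEPROM, just in the other half, which is why the units looked fine until the power-up state flipped.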

OK, the answer was in the datasheet, but how many people read _every_ bit of _every_ section of the datasheet when upgrading to a new part....?

I'm sure many folks here are fans of the "Air Crash Investigation" series - fans will know how often a crash is a result of a sequence of events, sometimes years apart....

tecman · « **Reply #1 on:** May 20, 2010, 05:55:35 pm »

Have an industrial controller, several years in production, and made in reasonable volume production (~1000/yr). We very rarely had an unusual occurrence. The CPU is a MOT 68337. On chip is a CAN module (TOUCAN) that we implement to communicate DeviceNet. For years we never heard about any systemic problems. We had a customer describe an issue, that several times an hour the DeviceNet would stop communicating. We looked at the obvious, like field wiring, power, noise, etc and found nothing. It was then discovered that inputs, either on the discrete inputs, or certain RS232 comms or even analog inputs, occurring at a rate close to 100 Hz would kill the DeviceNet.

No obvious cause led to a month or so of code analysis. I was convinced that there was an interrupt sneaking in and overwhelming the CPU. Tests and more tests confirmed that the CAN had among the highest interrupt priority, and no unknown or "sneak" interrupts were present. No matter what we changed, the sensitive frequency seemed to be around 96-103 Hz. We thought there might be some strange code interaction when certain functions ran at 100 Hz, so the next step was to change the clock parameters to alter the main loop "scan" frequency, which is 500 Hz. We changed the divider and readjusted the CAN baud rate so that we had a scan of 250 Hz. Same problem, still at 100 Hz inputs. Another change in clock rate still resulted in the 100 Hz sensitivity.

The next path took us to the +5 volt switcher on the board. We looked at the frequency response of the power supply, swept it, injected test signals, etc. and found that 100 Hz injected into the power supply summing junction would replicate the problem. It looked like a real possibility, but response tests of the power supply showed no sensitivity, peaks or notches at 100 Hz. It seemed like another near dead end.

Finally into the evening one day, some circuit mods resulted in a breakthrough. It turns out that the PLL on the CPU seems to have a sensitivity to 100 Hz. The PLL is powered through a pin called VddSYN. This pin was coupled to Vdd through a simple R-C filter network. By increasing the C value and upping the R, I could reduce the 100 Hz sensitivity (but not eliminate it). There was a limit to how high the R value could be, since the VddSYN current draw would reduce the voltage to unacceptable levels. I performed a simple test using an FFT response analyzer. The card has inputs that can measure frequency using the CPU's counter/timers. These timers are clocked off of the CPU clock. We applied a known constant frequency to a frequency input using a high precision signal generator. This input was converted to an analog value and available on a D/A channel, using the CPU PWMs. Statically the analog output was a very constant value, corresponding to the input frequency. I then applied a frequency sweep from the FFT analyzer to an analog input, and monitored the D/A output with the analyzer. Sure enough, around 100 Hz input came rolling out of the D/A.
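The R-C trade-off described above can be put into numbers. The Python sketch below assumes a single-pole R-C low-pass feeding the PLL supply pin. The component values and the 2 mA pin current are invented for illustration and are not from the actual board or the Motorola datasheet.

```python
import math


def rc_filter(r_ohms, c_farads, ripple_hz, load_amps):
    """Single-pole RC low-pass: corner, attenuation at ripple_hz, DC drop."""
    corner_hz = 1.0 / (2.0 * math.pi * r_ohms * c_farads)
    gain = 1.0 / math.sqrt(1.0 + (ripple_hz / corner_hz) ** 2)
    return {
        "corner_hz": corner_hz,
        "attenuation_db": -20.0 * math.log10(gain),
        "dc_drop_v": load_amps * r_ohms,   # voltage lost across R
    }


if __name__ == "__main__":
    # Both value sets are assumptions chosen to show the trend.
    for r, c in [(10.0, 0.1e-6), (100.0, 100e-6)]:
        result = rc_filter(r, c, ripple_hz=100.0, load_amps=0.002)
        print(f"R={r:g} ohm, C={c:g} F -> "
              f"corner {result['corner_hz']:.1f} Hz, "
              f"{result['attenuation_db']:.2f} dB at 100 Hz, "
              f"drop {result['dc_drop_v'] * 1000:.0f} mV")
```

A small decoupling-style filter does essentially nothing at 100 Hz, and getting even modest attenuation that low in frequency needs a large R·C product, with the DC drop across R growing as R goes up. That matches the limit described above.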

So what was happening was that the CPU PLL was sensitive to supply variations at around 100 Hz. The analog input caused some CPU activity at 100 Hz, which caused some +5 volt power supply wiggle at 100 Hz (around 5 mV). This power supply noise was also present on the VddSYN PLL power, resulting in the CPU clock frequency wobbling a bit at 100 Hz. Normally this is not an issue of any great concern, but the CAN baud rate was also varying, since it is derived internally from the CPU clock. Since the baud rate accuracy is relatively critical with CAN (at 1 Mbaud), errors on the CAN messages would quickly kill the DeviceNet comms and result in the problem.
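To get a feel for how little clock error CAN tolerates, the Python sketch below evaluates the two clock-tolerance bounds commonly quoted in CAN bit-timing application notes (one from resynchronisation, one from sampling after an error flag). The segment lengths and SJW values are assumptions for illustration, not the settings used on this controller. A 100 Hz wobble is slow compared with a frame at 1 Mbaud, so it is treated here as a plain frequency offset.

```python
def can_clock_tolerance(sync, prop, ps1, ps2, sjw):
    """Return the two usual per-node clock tolerance bounds and their minimum.

    All arguments are in time quanta; the values passed in are assumptions.
    """
    nbt = sync + prop + ps1 + ps2                  # quanta per nominal bit
    resync_limit = sjw / (2 * 10 * nbt)            # limited by resync jump width
    error_flag_limit = min(ps1, ps2) / (2 * (13 * nbt - ps2))
    return {
        "resync_limit": resync_limit,
        "error_flag_limit": error_flag_limit,
        "tolerance": min(resync_limit, error_flag_limit),
    }


if __name__ == "__main__":
    # Hypothetical 10-quanta bit: 1 sync + 3 prop + 3 phase1 + 3 phase2.
    for sjw in (1, 3):
        t = can_clock_tolerance(sync=1, prop=3, ps1=3, ps2=3, sjw=sjw)
        print(f"SJW={sjw}: tolerance about {t['tolerance'] * 100:.2f} %")
```

With these assumed settings the allowed deviation is on the order of one percent or less per node, which is why a clock that is fine for everything else on the CPU can still upset the bus.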

Took a long time to find, with hardware and software all coming under the microscope. The fix was relatively easy. We powered the VddSYN pin (PLL power) from a 5 volt reference supply on the card that is used for the A/D reference power. This completely resolved the issue with no ill effects on the A/D operation or accuracy. Although Motorola has a circuit suggestion for powering the VddSYN for "use in noisy environments", it was nothing more than a 2 pole R-C and did not fully resolve the problem.

This was my trickiest one to find.

Paul

A-sic Enginerd · « **Reply #2 on:** May 20, 2010, 06:17:15 pm »

I've got a few. My top 5:

- Had a bug in our new 10/100T MAC. We had prototyped it with a laser etched gate array and the bug didn't show up until we got back first silicon for the first chip it went into. Don't recall the nature of the bug - data corruption, MAC would stall, something of that sort. Second step to debugging something seen in actual silicon is to try and reproduce it in simulation. (First step is to do a crap load of datapoint collection). Finally nailed it when my peer created a clock signal in sim that was the most obscure and whacky thing you'd ever seen. It was a square wave like my kid would draw with crayon when they were in kindergarten. Anyway, it happened to hit things just right and we saw the same symptom. Once we drove it to root cause we could more easily reproduce the error with more realistic sims.

- More recently (about 3 or 4 yrs ago) I had just started work for a new startup company. They had a chip they'd already released into production. Some of the recent shipments were seeing a measurable number of different and odd errors. Some saw data corruption, some were seeing incomplete frames, plus a few other symptoms where none of us could find any similarities or common areas of the chip. After weeks and weeks of debug and hundreds if not thousands of man hours running various tests and trying out different theories, we arrived at one conclusion - cosmic rays. Go ahead and laugh if you want, but with today's smaller geometries, and if there are certain issues with a fab process (which it turns out there were with the fab we were using), designs can become susceptible to upsets when hit with stray alpha or beta particles. In this case it was causing random bit errors in a very large memory / RAM that was in the chip. (512 bits wide, forget how deep).

- One that the bug was found pretty quickly, but to deal with it was about the biggest PITA I've ever had to deal with. Brand spankin' new chip design with craploads of new IP. It had an embedded ARM. We got first silicon in and put it on a brand new board design. The board was comparatively large and complex (about the size of an AT form factor with all kinds of PCI buses, 10/100 spigots, IR, USB, etc etc), and to make it all worse was done by a guy that, although he was an experienced engineer in other aspects, this was his first board design. Well, it turns out someone screwed up and had tied the pin that enables the JTAG port to the ARM to the disabled state. Now of course we were needing to still do as much testing on the chip as possible and time was of the essence (not only product schedules, but time and money involved on a chip spin...all that sort of business). The chip is pretty brain dead without the ARM, so we spent a lot of very long days trying to wake the thing up through an untested AMBA bus, that went out an unproven flash interface to an untested board designed by a rookie to a socketed flash part. Yeah....that was fun.

- Then there was that whole thing when doing senior project in college. We were due to perform the first semester's final demo. Project involved (among other things) one main board that had a microcontroller talking to some daughter boards we connected via headers and ribbon cables. Let's just say we burned our Xmas vacation discovering we were victims of an issue a few of us on the team had just learned about a couple of weeks prior in a completely different class - ground bounce. We had very inadequate grounding between the daughter boards and things were noisy enough (whole lot of nothing but digital going on here) that the data going across the cables was generating its own clock. Symptoms were some data skipped while other data was double clocked.

- Oh yeah, almost forgot. Hit my first real brain twister while in college. Was during advanced logic design class. FPGAs were still fairly new on the scene, but our department chair always had the view of a pretty practical curriculum, as well as trying to keep up with latest trends seen in industry. This translates into some of the instructors learning at the same time as the students. Lab partner and I had this flash of brilliance that to solve a part of the project we dropped in a set of transparent latches. Take note for the unknowing: Xilinx doesn't like latches.

Kiriakos-GR · « **Reply #3 on:** May 20, 2010, 07:00:33 pm »

the trickiest bug ? Plenty ...

The funniest few ....

Lots of color pencils stacked in a VCR ..

Lots of almonds in an inkjet printer ...

And many more "causes of damage", created by the unlimited imagination of the young devils .

DavidDLC · « **Reply #4 on:** May 20, 2010, 08:11:03 pm »

We sent our products (sensors) to an automotive manufacturer. The product was giving false detections; a medium percentage of units were doing this, and the customer, F**d Motor Company, was giving us big complaints.

It was really difficult to open the sensor because it was covered in hard epoxy; 100% of the units you tried to open would get destroyed, and X-ray didn't give a clue.

So I suggested taking a good sensor off the production line without epoxy and trying to duplicate the failure (reverse engineering).

I started to isolate the fail, the sensor had a transmitter and receiver portion, the bad products were failing on reception.

So we started to play with the reception side and ended up discovering a problem with a component connection due to the solder paste used in the SMD process.

The big part of this is that I was working in a manufacturing plant, and we didn't have all the R&D tools that our R&D center had. We had an old analog oscilloscope that was difficult to set up for our purposes. As soon as the R&D guys knew we had found the problem, they traveled to the factory so we could give them a demonstration of how we did it. They saw us having a hard time operating the oscilloscope and told our manager to give us the green light to buy the best oscilloscope and any other instrumentation we wanted.

Also, I received a very good salary increase that year.

wd5gnr · « **Reply #5 on:** May 21, 2010, 02:07:41 pm »

I'm not sure this is trickiest, but it is probably the funniest. Some years ago I worked as a failure analyst for -- "ahem" -- a large semiconductor maker. They like to hire semiconductor physicists and not EEs so there were only a few of us I would consider real EEs. That led to a lot of comical things, but the one I'm thinking of came out of our technology assessment group.

If you aren't familiar with these terms, by the way, FA is where you take failed parts, dissect them, and figure out why they are bad. You don't fix them per se, but you try to prevent whatever broke them from happening again (customer action, bad mask set, whatever). TA is the opposite. They take good parts, stress them until they fail and draw conclusions. (Ever wonder how they know an EEPROM will last 100 years? That's how.) So the salient point is that these groups see nothing but failed parts. If you aren't very experienced, you could draw the wrong conclusions.

Because there were just a couple of "real EEs" we got a lot of ad hoc consulting work. So one day one of the TA guys comes by and says "I need some help. I need a power switching transistor other than a 2NXXX (whatever the part was)." I asked him why he didn't want to use that part. He replies, "Oh I bought 8 of them at Radio Shack and they were all bad."

I told him that was unlikely. He said they tried all 8 and none of them worked. He drew his schematic on the white board. Very typical CPU switched current sink. The load was a bunch of DUTs (device under test) running at 14V or something. This transistor was in the ground leg so the CPU could pulse the current flow. An emitter resistor set the collector current. It looked fine -- I've done that circuit a million times.

I got a databook out (this is pre common Internet). The transistor was way over spec'd for this service. I told him that was stupid. There was no way he got 8 different transistors that were all bad.

Finally I went to look for myself. Abstractions are powerful but sometimes they backfire. Like this time. What looked like a nice little resistor on my whiteboard was a big horking 10 or 20W wirewound resistor (there were a lot of DUTs although it was 5 or 6 times too large even so).

Here's what was happening. Any small change in the emitter current would cause several ka-zillion volts of back EMF to form in the inductor created by the wirewound resistor. We all know the maximum back voltage of the BE junction on a bipolar transistor can never exceed 0.5 ka-zillion volts, so there you go.

I did not know it at the time, but I started researching (hard with no Internet) and found that they construct special "non inductive" wire wound resistors just for this purpose. They wind a half wind one way and then reverse and go the other way so the inductance tends to cancel out. The next transistor worked.
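Joking units aside, the back EMF is just V = L·di/dt. The Python sketch below plugs in purely illustrative numbers: the 10 µH inductance, the 1 A current step, the 50 ns edge and the 5 V reverse base-emitter rating are all assumptions, not measurements from this circuit or values from the transistor's datasheet.

```python
def back_emf_volts(inductance_h, delta_i_amps, delta_t_s):
    """Magnitude of the voltage across an inductor for a linear current ramp."""
    return inductance_h * abs(delta_i_amps) / delta_t_s


if __name__ == "__main__":
    # Assumed values for a large wirewound resistor and a fast switching edge.
    spike = back_emf_volts(inductance_h=10e-6, delta_i_amps=1.0, delta_t_s=50e-9)
    assumed_vebo = 5.0   # hypothetical reverse base-emitter rating, in volts
    print(f"spike of roughly {spike:.0f} V against an assumed "
          f"{assumed_vebo:.0f} V junction rating")
```

Even a few microhenries in the emitter leg can produce a spike far above what a junction rated for a few volts will survive, and a non-inductive winding shrinks the L term directly.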

It was funny to me that because it was nothing for us to see a lot of 100 or 200 devices all bad, this guy assumed that was just normal and when you bought stuff off the shelf it was probably bad.

But that wasn't the trickiest. I'll have to tell the trickiest some other time.

Al W.
http://www.hotsolder.com

kc1980 · « **Reply #6 on:** May 21, 2010, 04:34:17 pm »

Sorry, this is OT. But real quick:

wd5gnr - I like your blog. The "how to control Rigol with Python" one is awesome.

wd5gnr · « **Reply #7 on:** May 21, 2010, 05:47:55 pm »

Quote from: kc1980 on May 21, 2010, 04:34:17 pm

Sorry, this is OT. But real quick:

wd5gnr - I like your blog. The "how to control Rigol with Python" one is awesome.

Thanks. Just to be fair, that one article is just a link to someone else's blog. Or is that you? ;-)

But the Rigol "tutorials" are mine. I should probably do some more of those this weekend.

