Series defect on agilent 167xx boards?
MateKrisz:
I'm back again. My digital microscope has arrived, and I inspected the 16720A PCB for the first time. I found physical damage on the board: one trace is broken. I've attached some pictures of it. I think this trace connects U87 pin 11 to U72 pin ??, but the clock hides the connection. I think I need to replace this trace with a wire soldered directly onto the PCB.
ahakman:
I have a couple of 16752A cards I'm trying to fix, as I have another project that requires a logic analyzer. I fixed one by removing the plastic runners and doing some careful track repair. That one passes all self tests. But the second one is being very difficult.

On the second card, the Memory Data Bus Test fails. The same single bit fails on both banks of each chip:
Chip 0 Bank 0 Port 1 Bits 0x00000002 U2
Chip 0 Bank 1 Port 1 Bits 0x00000002 U64
Chip 1 Bank 0 Port 3 Bits 0x10000000 U60
Chip 1 Bank 1 Port 3 Bits 0x10000000 U90
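
For anyone who wants to tabulate these reports, here is a minimal Python sketch that parses lines in the format above and lists which bit positions are set in each mask. The regular expression is only inferred from the four lines in this post, and `Failure`, `parse_failure` and `failing_bits` are made-up names, not anything from the analyzer software.

```python
import re
from typing import List, NamedTuple


class Failure(NamedTuple):
    chip: int
    bank: int
    port: int
    mask: int
    refdes: str


# Pattern inferred from the four report lines in this post (an assumption).
LINE_RE = re.compile(
    r"Chip (\d+) Bank (\d+) Port (\d+) Bits (0x[0-9A-Fa-f]+) (U\d+)"
)


def parse_failure(line: str) -> Failure:
    """Turn one report line into a Failure record."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    chip, bank, port, mask, refdes = m.groups()
    return Failure(int(chip), int(bank), int(port), int(mask, 16), refdes)


def failing_bits(mask: int) -> List[int]:
    """Return the positions of all set bits, least significant first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


REPORT = """\
Chip 0 Bank 0 Port 1 Bits 0x00000002 U2
Chip 0 Bank 1 Port 1 Bits 0x00000002 U64
Chip 1 Bank 0 Port 3 Bits 0x10000000 U60
Chip 1 Bank 1 Port 3 Bits 0x10000000 U90
"""

failures = [parse_failure(l) for l in REPORT.splitlines() if l.strip()]
for f in failures:
    print(f.refdes, failing_bits(f.mask))
# U2 [1]
# U64 [1]
# U60 [28]
# U90 [28]
```

So the two masks are bit 1 and bit 28 of a 32-bit word.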

U2 and U64 are on opposite sides of the board in the same location. Same with U60 and U90. I verified that their data bits are indeed connected together (in a way that's convenient for the layout, not necessarily D0 on one connects to D0 on the other).

I verified continuity between all data bits on U60 and U90 to the 33 ohm resistor packs, and verified that on the other side of that, I do indeed measure about 35+ ohms. From there, the signals go to the Virtex FPGA, and then the other side of the FPGA looks like it connects to the actual logic chip.

What I'm a little confused about is how it can be bit 28 on U60 and U90 when they're only 16 bits wide each. Or are they set up in a 32-bit arrangement with their companion chips (U89 and U59), so that a failure in the upper word calls out U60 and U90, while a failure in the lower word would call out U89 and U59?
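
The arithmetic behind that question can be sketched as follows, assuming (and this is only the hypothesis above, not something confirmed) that two 16-bit DRAMs sit side by side to form one 32-bit port. The function name `split_bit` is hypothetical.

```python
WORD_WIDTH = 16  # assumption: each DRAM contributes 16 data bits to a 32-bit port


def split_bit(mask: int, word_width: int = WORD_WIDTH):
    """Return (word_index, bit_in_word) for a mask with exactly one bit set."""
    if mask <= 0 or mask & (mask - 1):
        raise ValueError("expected exactly one bit set")
    bit = mask.bit_length() - 1
    return divmod(bit, word_width)


print(split_bit(0x10000000))  # (1, 12): upper word, bit 12
print(split_bit(0x00000002))  # (0, 1): lower word, bit 1
```

Under that assumption bit 28 lands in the upper 16-bit word. The bit-in-word number need not match the DRAM's own pin name, since the data lines are wired for layout convenience.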

How does the Chip / Bank / Port nomenclature work?
Chip 0 / 1 I get - the 2 main logic analyzer asics
Is Bank which FPGA memory controller it's talking to?
Is Port which set of DRAMs the FPGA is talking to?
Does U60 tell me the same information as Chip 1 Bank 0 Port 3?

Because it's a single bit error, and I've traced the signal from the DRAM through the 33 ohm resistor pack to the other side of that, and that goes directly to the FPGA, does that point at the BGA ball under the FPGA? I find it hard to believe that one ball is broken on each of the outermost FPGAs, but it could happen I suppose.
Could this also be an interconnect issue between the FPGA and the main logic analyzer chip?

Or is the U60 / U90 thing a complete red herring, and the problem is somewhere else entirely? If I run one of the later tests, I get the same bits failing (bit 1 and bit 28), but it calls out entirely different chip identifiers??? HUH??? I'd have to go stick the card back in the analyzer and run the tests again to check which test and which chips it was calling out, but the wrong bits are in the same positions, but it was on completely different chips (U37 seems to ring a bell). I also kind of read somewhere that if there are multiple self test failures, to basically ignore all the tests after the first one that failed, as they're all dependent on each other. Is that true?

I've traced out many of the lines, including through vias that were under or close to the runners and double sided sticky pads between the main logic analyzer chip and the FPGA that controls U60 / U90, and I can't find anything that looks or measures broken.

I'm a bit stumped on this one. The one card was relatively "easy" to fix (I guess if you count scraping solder mask with a pin under a microscope and repairing traces with a single strand from a 22 gauge wire 'easy'), and this one is the exact opposite.
MarkL:
The two-connector (4 pod) analyzer boards are pretty much two independent acquisition engines.  It's split down the center and there is no crossover for acquisition data flow (although I think there is some muxing between the pods before  acquisition occurs in ASICs U22 and U45).

The data R/W access from the backplane to all the memory chips, however, is common between the two sides.  I think this is what it's complaining about, or at least is where I would look first.

As you've noted, the bit assignments are done based on what's best for the layout, and not how the chip manufacturers label the pins.  It's possible you're looking at a single problem, but I think probably not since the chips are on opposite edges of the board.

The nomenclature is a bit confusing.  To be honest I have not seen errors reported with "Chip 0" and "Chip 1" before, but I think it is referring to the Virtex FPGAs as "0" and "1" within each side.  The big acquisition ASICs under the heatsinks are numbered like this:

  16750A/51A/52A: U45 (pods 1+2 "Chip 9"), U22 (pods 3+4 "Chip 8")

And then I'm guessing on the pod 1+2 side:

  Chip 0 is U41
  Chip 1 is U52

And then on the pod 3+4 side:

  Chip 0 is U10
  Chip 1 is U25

Repeating... This is only a guess.  Was there any Chip8/9 information printed with these errors?

(EDIT: So this guess was wrong.  The discrepancy is from units running HP-UX vs. units running Windows.  Chip 8/9 in HP-UX is Chip 0/1 in Windows.  Keep reading...)


I would closely examine the large clump of traces going down the center of the board and heading towards the two FPGAs near the backplane connector, mostly on the bottom since they are all directly under one of the runners.  I think the Altera MAX (U18) is responsible for backplane access to the acquisition memory, and on the 16750/1/2 boards this is done through the Virtex.

Do you have more detailed output from the failing pv test with debug turned on (d r=10 d=9) ?  Are there any other errors being reported on the Memory Data Bus Test, or are any other pv tests failing?  In general, the first test to fail is the one to focus on fixing, since other tests very often fail as a result of the first failure.  That's not to say you should disregard clues that might exist in later failed tests, so it's at least worth running all the tests once to be aware of what else pops up.

If unable to find a break anywhere, one troubleshooting method is to put the failing test into a loop and then start perturbing operation of the various signal lines with a low resistance to ground.  Try a 33R to start, but you might need to go as low as 10R.  The idea is to see if you can generate the same error report on other bits and then try to zero in on which physical data line is having trouble.  The data lines are usually (but not always) in bit order next to each other, so you can usually tell when you're getting close to the culprit.

This method will also reveal a lot about the nomenclature as various errors are reported.  Perturbing data, address, and control signals on various memory chips will help create an understanding of Chip/Bank/Port and bit ordering.

You can also do this on your working card to try to recreate the error message you're seeing on the bad card.

It's frustrating not having any documentation for this level of diagnostics.  The exact point of failure could be sitting right in front of us and we'd never know it.
ahakman:
So I did some fault injection of my own. This is what I discovered.

When testing a 16752A in a 16903A chassis (the 3 slot 16900 series Windows XP chassis), the DRAM chip identifiers given by the self test are COMPLETELY WRONG!

Chip 0 = the main logic analysis chip for pod 1/2
Chip 1 = the main logic analysis chip for pod 3/4

So, it was telling me there was a fault on bit 0x10000000 of U90/U60. At first glance, this makes sense - U90 and U60 are on opposite sides of the board from each other, and their data lines are connected together - U90's D15 connects to U60's D0 and so on.

Ok, I'll inject another fault on a different pin on U90 - let's inject a fault on D15... run the self test

new failure on U31 / U74 at bit 0x80000000. Ok, so the data bits on the bottom chip align with the numbering they're using here obviously, but the identifiers are completely wrong.
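
The bookkeeping for this kind of fault injection is just a mask comparison. A small Python sketch, assuming the baseline and post-injection failures are each collected into a single 32-bit mask (the combined value below is illustrative, not copied from a real report, and `new_failures` is a made-up name):

```python
def new_failures(baseline: int, after: int):
    """Bit positions that fail after injection but did not fail before."""
    added = after & ~baseline
    return [i for i in range(added.bit_length()) if (added >> i) & 1]


baseline = 0x10000000          # original single-bit failure (bit 28)
after = baseline | 0x80000000  # illustrative: original failure plus the injected one

print(new_failures(baseline, after))  # [31]
```

If the upper 16 bits of the mask belong to one DRAM, bit 31 is that chip's bit 15, which fits the D15 injection.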

I was also very confused as to whether this was talking about the data bus between the FPGAs and the DRAMs, or between the FPGAs and the acquisition ICs, so I injected a fault there as well and ran some tests.

No change to the "Memory Data Bus Test", but a new fault on the "Analyzer chip memory bus test"

Ok, so "Memory Data Bus" = between the FPGAs and the DRAMs, including all of the 33 ohm resistor packs. Those were a huge problem on my card: the corrosion on their solder joints was pretty bad. I took most of them off, cleaned the pads, repaired a couple of pads that were eaten away right where the pad transitions to the trace at the edge of the solder mask opening, and soldered them all back.

and "Analyzer chip memory bus" = between the acquisition ASICs and the FPGAs
Chip identifiers completely unreliable

Ok, now we're getting somewhere.

On the 16752A in 16903A Memory Data Bus test, it uses the nomenclature
Chip => bank => port

Chip = 0 / 1 - which acquisition ASIC or that general side of the board - pod 1/2 = chip 0, pod 3/4 = chip 1
Bank = 0 / 1 = Top / bottom side memories - not exactly sure which bank is which side of the board as the chips' data pins are wired together
Port = 0 to 3 => seems like each FPGA has 2 ports - and there's 2 FPGAs per ASIC. Each "port" is 4 chips (2 on each side of the board). At least for Chip 0, with the bottom of the PCB facing up, and the pod connectors towards you, the "ports" go from 0 in the middle of the board to 3 on the left side of the board. Port 0 = U76 U77 on the bottom and U36 and U37 on the top. Port 3 = U89 and U90 on the bottom and U59 and U60 on the top. The port numbering and the byte order in the ports follows no logical order, and is all over the place. Port 1 - the chips right by the central bus of traces that runs to the top section of the board - aka right next to where a runner with the double sided adhesive was. I finally found the right chip!

There's also a "BONUS" port on each Chip which seems to be the one extra DRAM that doesn't have a partner that's only on the top side.
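
As a rough aid, the part of this mapping that is spelled out above can be written as a lookup table. This Python sketch covers Chip 0 only and only ports 0 and 3; the other ports are left unmapped rather than guessed, and Bank is ignored because it isn't clear which bank is which side of the board. `CHIP0_PORTS` and `candidate_drams` are hypothetical names.

```python
# Chip 0 only, and only the ports named in the post; everything else is unknown.
CHIP0_PORTS = {
    0: {"bottom": ("U76", "U77"), "top": ("U36", "U37")},
    3: {"bottom": ("U89", "U90"), "top": ("U59", "U60")},
}


def candidate_drams(port: int):
    """Return the DRAMs to inspect for a Chip 0 port, or None if not mapped."""
    sides = CHIP0_PORTS.get(port)
    if sides is None:
        return None
    return sorted(sides["bottom"] + sides["top"])


print(candidate_drams(3))  # ['U59', 'U60', 'U89', 'U90']
print(candidate_drams(2))  # None (not mapped in this thread)
```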

On the Analyzer Chip Memory Bus test, it uses the same nomenclature, but drops "bank" and only talks about chip and port

So seeing as my failure on chip 0 is on port 1 bit 0x00000002, that would be U82 / U36, not U90 / U60 as the incorrect self test says.

Time to go poke around with the continuity tester now that I know where I'm actually looking for a fault!

I wonder how they managed to screw that up!
ahakman:
Here's the issue - hard to tell in the photo, but that's a nodule of corrosion and obviously the track is completely eaten away between the pad and the trace.

[ Specified attachment is not available ]

Don't mind the resistor pack being crooked - I re-flowed them all with hot air - obviously I need to remove them and clean and inspect all those pads properly too, not just reflow with some flux.

What a mess