
Rare (0.1%) failure of ULPI initialisation on embedded processor


We've got a board with a Xilinx Zynq SoC and a USB3320 transceiver.  I'm not posting this in the FPGA forum because, although the device is an FPGA, I think this is more of a general embedded bug: the USB peripheral is built into the hard ARM side of the Zynq.

Our board is a custom hardware platform running Linux 5.10 (petalinux) using Zynq 7010 and 256MB DDR3.  Boot is via SD card.

About 1 in 1000 boot cycles, the USB peripheral fails to initialise.  In the kernel log we see "udc-core: couldn't find an available UDC or it's busy" when our script attempts to initialise the USB peripheral.  Only a hard power cycle fixes the issue.
With extra debug turned on, I noticed that every failure like this is accompanied by this kernel log:

--- Code: ---
ULPI transceiver vendor/product ID 0x0424/0x0007
Found SMSC USB3320 ULPI transceiver.
ULPI integrity check: failed!
--- End code ---

A successful boot, instead, has "ULPI integrity check: passed."

The origin of this message is here: https://github.com/Xilinx/linux-xlnx/blob/master/drivers/usb/phy/phy-ulpi.c (in ulpi_check_integrity).

Now one thing that struck me about this is that all the routine does is two single-byte write/read-back checks.  It doesn't retry anything.  If there's a single-event upset (SEU), the check will fail and the device won't be enumerated.  But if our interface fails 1-in-1000 writes, then perhaps we're not in SEU territory but in unreliable-bus territory?

I did notice that when I manually write to this register using some low-level C code, the failure rate increases with SDIO activity on our SD card, which fits:  bootup involves a lot of SDIO read activity.  However, looking at the board layout, the traces are quite far from one another, and the supply voltage to the devices appears stable throughout, so the cause of the correlation was not obvious to me.

I was considering getting the kernel engineer to patch this function and just try a few more times to see if ULPI works.  Whilst running normally we don't see any particular issues with USB and can run for hours on end without fault.

Curious what others think is going on here.  I haven't yet probed the ULPI bus; that will be the next task.

Even if traces on the board are not close, they all come together inside the package. The interference may not even be observable outside.

I personally would add more information to that message: the written value and the value read back. And maybe extend the loop to write more values, and not bail on the first failure. That would show whether some specific bit fails consistently, which may guide the investigation. It may be something simple, like one of the data traces being longer or shorter than the rest and close to the setup/hold margins, so that minor interference affects it.
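
Something along these lines, as a rough userspace sketch rather than a kernel patch. The accessor types, `ulpi_scratch_sweep` and the simulated PHY are names I made up for illustration; only the scratch register address (0x16) comes from the ULPI register map. The simulated PHY is there purely so the logic can be tried off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ULPI_SCRATCH 0x16 /* scratch register address from the ULPI register map */

/* Hypothetical accessors: point these at the real low-level read/write routines. */
typedef int (*ulpi_write_fn)(void *ctx, uint8_t reg, uint8_t val); /* 0 or negative error */
typedef int (*ulpi_read_fn)(void *ctx, uint8_t reg);               /* 0..255 or negative error */

struct ulpi_sweep_result {
    unsigned mismatches;    /* number of patterns that read back wrong */
    unsigned bit_errors[8]; /* per data bit: number of times it differed */
};

/* Write each pattern to the scratch register and read it back. A mismatch is
 * logged with the written and read values and counted, but the loop continues. */
int ulpi_scratch_sweep(void *ctx, ulpi_write_fn wr, ulpi_read_fn rd,
                       const uint8_t *patterns, size_t count,
                       struct ulpi_sweep_result *res)
{
    assert(patterns != NULL && res != NULL);
    memset(res, 0, sizeof(*res));
    for (size_t i = 0; i < count; i++) {
        int ret = wr(ctx, ULPI_SCRATCH, patterns[i]);
        if (ret < 0)
            return ret; /* accessor error, not a data mismatch */
        ret = rd(ctx, ULPI_SCRATCH);
        if (ret < 0)
            return ret;
        uint8_t diff = (uint8_t)(ret ^ patterns[i]);
        if (diff == 0)
            continue;
        res->mismatches++;
        fprintf(stderr, "ULPI scratch: wrote 0x%02x, read 0x%02x, diff 0x%02x\n",
                (unsigned)patterns[i], (unsigned)ret, (unsigned)diff);
        for (int bit = 0; bit < 8; bit++)
            if (diff & (1u << bit))
                res->bit_errors[bit]++;
    }
    return 0;
}

/* Simulated PHY (illustration only): bits set in stuck_low are forced to 0 on write. */
struct sim_phy {
    uint8_t scratch;
    uint8_t stuck_low;
};

int sim_write(void *ctx, uint8_t reg, uint8_t val)
{
    struct sim_phy *phy = ctx;
    if (reg != ULPI_SCRATCH)
        return -1;
    phy->scratch = (uint8_t)(val & ~phy->stuck_low);
    return 0;
}

int sim_read(void *ctx, uint8_t reg)
{
    struct sim_phy *phy = ctx;
    return reg == ULPI_SCRATCH ? phy->scratch : -1;
}
```

With the simulated PHY set to `stuck_low = 0x04` and the patterns 0x55, 0xaa, 0xff, 0x00, the sweep counts two mismatches, both on bit 2. On real hardware the interesting question is whether the tally clusters on one data line like that or is spread evenly.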

Also, do a repeated read in case of failure to see if it is a read or a write error.
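
A minimal sketch of that repeated-read idea, again in plain C with made-up names (`ulpi_classify_fault`, the verdict enum and the simulated PHY are all hypothetical). The assumption is that a corrupted write leaves a wrong but stable value in the register, whereas a corrupted read is transient and goes away on the next read.

```c
#include <assert.h>
#include <stdint.h>

#define ULPI_SCRATCH 0x16 /* scratch register address from the ULPI register map */

/* Hypothetical accessors: point these at the real low-level read/write routines. */
typedef int (*ulpi_write_fn)(void *ctx, uint8_t reg, uint8_t val); /* 0 or negative error */
typedef int (*ulpi_read_fn)(void *ctx, uint8_t reg);               /* 0..255 or negative error */

enum ulpi_verdict {
    ULPI_VERDICT_OK,          /* first read-back matched the written value */
    ULPI_VERDICT_READ_FAULT,  /* every re-read matched the written value */
    ULPI_VERDICT_WRITE_FAULT, /* every re-read repeated the same wrong value */
    ULPI_VERDICT_UNSTABLE,    /* re-reads disagreed with each other */
    ULPI_VERDICT_IO_ERROR     /* an accessor returned a negative error */
};

/* Write val once, read it back, and on a mismatch re-read without rewriting. */
enum ulpi_verdict ulpi_classify_fault(void *ctx, ulpi_write_fn wr, ulpi_read_fn rd,
                                      uint8_t val, unsigned rereads)
{
    assert(rereads > 0);
    if (wr(ctx, ULPI_SCRATCH, val) < 0)
        return ULPI_VERDICT_IO_ERROR;
    int first = rd(ctx, ULPI_SCRATCH);
    if (first < 0)
        return ULPI_VERDICT_IO_ERROR;
    if (first == val)
        return ULPI_VERDICT_OK;

    unsigned good = 0, same_bad = 0;
    for (unsigned i = 0; i < rereads; i++) {
        int again = rd(ctx, ULPI_SCRATCH);
        if (again < 0)
            return ULPI_VERDICT_IO_ERROR;
        if (again == val)
            good++;
        else if (again == first)
            same_bad++;
    }
    if (good == rereads)
        return ULPI_VERDICT_READ_FAULT;  /* register was right; first read was not */
    if (same_bad == rereads)
        return ULPI_VERDICT_WRITE_FAULT; /* register holds a stable wrong value */
    return ULPI_VERDICT_UNSTABLE;
}

/* Simulated PHY (illustration only): the next bad_writes writes store the value
 * with bit 0 flipped; the next bad_reads reads return it with bit 0 flipped. */
struct sim_phy {
    uint8_t scratch;
    unsigned bad_writes;
    unsigned bad_reads;
};

int sim_write(void *ctx, uint8_t reg, uint8_t val)
{
    struct sim_phy *phy = ctx;
    if (reg != ULPI_SCRATCH)
        return -1;
    if (phy->bad_writes > 0) {
        phy->bad_writes--;
        val ^= 0x01;
    }
    phy->scratch = val;
    return 0;
}

int sim_read(void *ctx, uint8_t reg)
{
    struct sim_phy *phy = ctx;
    if (reg != ULPI_SCRATCH)
        return -1;
    if (phy->bad_reads > 0) {
        phy->bad_reads--;
        return phy->scratch ^ 0x01;
    }
    return phy->scratch;
}
```

The verdict only separates the two cases if the re-reads themselves are mostly reliable, so a handful of re-reads is more convincing than one.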

The fact that it works later on may not be a good indicator that there is no issue. USB and ULPI are very fault tolerant; they will hide minor errors.

Thanks.  I think it's worth trying a patch.

Looking at our test log, what I find particularly interesting is that there are no occurrences where the VendorID/ProductID is read wrong.

Instead, it is always the scratch register test that fails.   To me, this suggests a failure on write:  the read direction is okay, but any write (from SoC to ULPI) has a small chance of being corrupted, which leads to the value being read back wrong.

Indeed, in the C program I noticed the same anomaly.  If the register reads back wrong, it reads back wrong every time after that, until it is re-written.

Retry on write is easy to test.
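
For a bench test, the retry could look roughly like this. It is plain C with hypothetical names (`ulpi_write_verified`, the accessor types and the simulated PHY are not from the driver), and it assumes a read-back of the same register is a valid check, which holds for the scratch register but not for every register.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define ULPI_SCRATCH 0x16 /* scratch register address from the ULPI register map */

/* Hypothetical accessors: point these at the real low-level read/write routines. */
typedef int (*ulpi_write_fn)(void *ctx, uint8_t reg, uint8_t val); /* 0 or negative error */
typedef int (*ulpi_read_fn)(void *ctx, uint8_t reg);               /* 0..255 or negative error */

/* Write val to reg and read it back; rewrite on mismatch, up to max_attempts.
 * Returns the number of attempts used (>= 1) on success, -EIO if every attempt
 * read back wrong, or the accessor's own negative error code. */
int ulpi_write_verified(void *ctx, ulpi_write_fn wr, ulpi_read_fn rd,
                        uint8_t reg, uint8_t val, int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        int ret = wr(ctx, reg, val);
        if (ret < 0)
            return ret;
        ret = rd(ctx, reg);
        if (ret < 0)
            return ret;
        if (ret == val)
            return attempt;
    }
    return -EIO;
}

/* Simulated PHY (illustration only): the next bad_writes writes store the
 * value with bit 0 flipped, after which writes behave normally. */
struct sim_phy {
    uint8_t scratch;
    unsigned bad_writes;
};

int sim_write(void *ctx, uint8_t reg, uint8_t val)
{
    struct sim_phy *phy = ctx;
    if (reg != ULPI_SCRATCH)
        return -EINVAL;
    if (phy->bad_writes > 0) {
        phy->bad_writes--;
        val ^= 0x01;
    }
    phy->scratch = val;
    return 0;
}

int sim_read(void *ctx, uint8_t reg)
{
    struct sim_phy *phy = ctx;
    return reg == ULPI_SCRATCH ? phy->scratch : -EINVAL;
}
```

Returning the attempt count rather than a plain pass/fail means the caller can log every time more than one attempt was needed, which gives a number for how often writes go wrong instead of hiding it.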
If that does not fix it, perhaps the power-on reset (POR) is bad? A non-monotonic Vcc rise can do that on some parts.
Of course, if retry on write does fix it, you have to wonder how fragile normal operation is.

