I am using 480 Mbit/s USB High Speed (HS) here.
Yes, I understood; I just meant that some USB host implementations seem to behave erratically when physically separate HS and FS devices are mixed on the same bus. Similarly, some FS devices (like ATmega32u4-based keyboards and other HID devices often used in DIY controllers) don't work at all with USB 3.x ports unless you put a USB 2.0 hub in between.
I have seen various delays and latencies between URBs in the millisecond range; 100ms is an order of magnitude higher than anything I've encountered.
I observed that 99% of the packets arrive together, but one or two bulk packets arrive after an unpredictable delay; I have observed up to 100ms.
How annoying! Have you verified this with another OS or host machine, to ensure it is not just a coincidence, an unfortunate combination of hardware and driver on the host side?
In my blog post, when I said 100ms delay, I meant the delay between individual USB bulk packets, not the delay between display frames.
Yes, that's how I understood it. I just haven't observed such long delays between bulk packets. I mostly use USB serial endpoints for this, and although I have a couple of Teensy 4.0s with HS interfaces and a Cypress FX2LP, I am most familiar with FS. (The aforementioned ATmega32u4, with its native FS interface, can do about a megabyte per second over USB serial, for example; that is pretty close to the theoretical 12 Mbit/s limit once the protocol overheads are taken into account.)
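For context, a rough back-of-envelope calculation, assuming the commonly cited USB 2.0 full-speed limit of 19 bulk transactions of 64 bytes per 1 ms frame (MB here meaning 10^6 bytes):

```latex
\begin{aligned}
\frac{12\ \text{Mbit/s}}{8\ \text{bit/B}} &= 1.5\ \text{MB/s} && \text{(raw signalling rate)}\\
\frac{19 \times 64\ \text{B}}{1\ \text{ms}} &= 1216\ \text{B/ms} \approx 1.22\ \text{MB/s} && \text{(bulk payload ceiling)}\\
\frac{1\ \text{MB/s}}{1.22\ \text{MB/s}} &\approx 0.82
\end{aligned}
```

So about 1 MB/s is roughly 82% of what full-speed bulk transfers can carry at best.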
If I may suggest, see if you can get better bandwidth and lower latencies using USB serial. Because it is ubiquitous, the OS stack is more likely to handle it without gross issues.
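A minimal sketch for comparing the two approaches, assuming the device streams a known number of bytes on an already opened file descriptor (measure_stream is a hypothetical helper, not part of any library). It reports the average throughput and the worst gap between consecutive read() returns, which is where a 100ms stall would show up. Note that it measures what userspace sees, so host-side buffering is included.

```python
import os
import time


def measure_stream(fd, total_bytes, chunk=4096):
    """Read total_bytes from fd; return (bytes_per_second, worst_gap_seconds).

    worst_gap is the longest wait between consecutive read() returns,
    including the wait for the very first data.
    """
    received = 0
    start = last = time.monotonic()
    worst_gap = 0.0
    while received < total_bytes:
        data = os.read(fd, min(chunk, total_bytes - received))
        if not data:
            raise EOFError("stream ended after %d bytes" % received)
        now = time.monotonic()
        worst_gap = max(worst_gap, now - last)
        last = now
        received += len(data)
    elapsed = last - start
    rate = received / elapsed if elapsed > 0 else float("inf")
    return rate, worst_gap
```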
It may be completely normal for a bulk endpoint; in that case, I have to think of a better way to detect frame sync.
If I were you, I'd see whether a USB Serial + HID device endpoint combination would work. I've done this before, albeit FS only, using Teensyduino (the Arduino environment for Teensy microcontrollers), and it has worked well for me. (Teensyduino support for such configurations isn't complete yet for Teensy 4.0, so I haven't tested this on HS. Basically, the Teensyduino core provides all the USB endpoint support magic.)
The HID device, having a 64-byte packet reserved on the bus every millisecond, should allow a practical response time of 1-3ms to frame sync, measured between the microcontroller and a userspace application. For best results in Linux, I'd use a dedicated thread that blocks on the event device, to get the minimum latency. An ordinary 10-byte custom HID report should work fine for this: USB stacks need to handle these correctly, or game controllers would be glitchy.
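A minimal sketch of such a dedicated reader thread in Python, assuming the HID report is mapped to input events (for example a joystick-style report) so that it appears as a /dev/input/event* device; a vendor-defined report would instead show up on /dev/hidraw*. The struct layout assumes a native 64-bit Linux ABI, and event_reader and start_event_thread are hypothetical names.

```python
import os
import struct
import threading
import time

# struct input_event from <linux/input.h>: struct timeval, __u16 type, __u16 code, __s32 value.
# ASSUMPTION: native 64-bit Linux ABI, where struct timeval is two C longs (24 bytes total).
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)


def event_reader(fd, on_event):
    """Thread body: block in read() so the thread wakes as soon as an event arrives."""
    while True:
        try:
            data = os.read(fd, EVENT_SIZE)
        except OSError:              # e.g. ENODEV when the device is unplugged
            break
        if len(data) < EVENT_SIZE:   # EOF
            break
        _sec, _usec, ev_type, code, value = struct.unpack(EVENT_FORMAT, data)
        # Timestamp on arrival in userspace; note that on_event runs in this reader thread.
        on_event(time.monotonic(), ev_type, code, value)


def start_event_thread(path, on_event):
    """Open an event device (e.g. via a udev symlink) and start the blocking reader thread."""
    fd = os.open(path, os.O_RDONLY)
    thread = threading.Thread(target=event_reader, args=(fd, on_event), daemon=True)
    thread.start()
    return thread
```

Keeping the callback short (for instance, just pushing the timestamp into a queue) keeps the thread ready for the next event.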
If there is only one hardware device in use at any time, then in Linux a simple udev rule can be used to create persistent device symlinks (a separate one for the USB serial endpoint and the HID event device), and optionally to limit access to the devices per user and/or group. It's pretty easy to set up. Then, if the symlinks do not exist, the application can ask the user to plug the device in. (For multi-device support, you need to glob() for the symlinks, and have them include a serial number or similar to make them unique.) It makes things especially easy, in my opinion.
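A minimal Python sketch of the multi-device lookup, assuming the udev rule creates one symlink per device with the serial number embedded in the name. The mydisplay-<serial>-serial naming and both function names are hypothetical; they should match whatever SYMLINK+= assignment the actual rule uses.

```python
import glob
import os


# HYPOTHETICAL naming: /dev/mydisplay-<serial>-serial for the USB serial endpoint.
def find_devices(directory="/dev", prefix="mydisplay-", suffix="-serial"):
    """Return {serial_number: resolved device node} for every matching symlink."""
    pattern = os.path.join(glob.escape(directory),
                           glob.escape(prefix) + "*" + glob.escape(suffix))
    found = {}
    for link in sorted(glob.glob(pattern)):
        name = os.path.basename(link)
        serial = name[len(prefix):len(name) - len(suffix)]
        found[serial] = os.path.realpath(link)
    return found


def require_device(**kwargs):
    """Ask the user to plug the device in if no matching symlink exists yet."""
    devices = find_devices(**kwargs)
    if not devices:
        raise SystemExit("Device not found: please plug it in and try again.")
    return devices
```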
In my experience, USB serial, even full duplex, tends to be quite efficient and to work fine. At least in Linux, you won't even need libusb for the communications, just basic termios support (included in the standard C library). You might need to drain the tty layer via tcdrain() after a full update, though, as the stack usually waits a short while for more data before sending the final URB. tcdrain() essentially blocks until the data written to the character device has actually been sent by the kernel.
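A minimal sketch of the termios side in Python (its termios and tty modules wrap the same C library calls), assuming the device node path comes from a udev symlink; open_serial_raw and send_update are hypothetical helper names.

```python
import os
import termios
import tty


def open_serial_raw(path):
    """Open a USB serial (e.g. CDC ACM) character device in raw mode."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    # Raw mode: no echo, no line editing, no CR/LF translation in either direction.
    tty.setraw(fd, termios.TCSANOW)
    return fd


def send_update(fd, payload):
    """Write one full update, then block until the kernel has transmitted it."""
    view = memoryview(payload)
    while view:                      # os.write() may accept only part of the buffer
        written = os.write(fd, view)
        view = view[written:]
    termios.tcdrain(fd)
```

Raw mode matters for binary data: without it, the default output processing would, for example, turn every 0x0A byte into 0x0D 0x0A.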
In particular, if implementing a Python Qt interface for such a device, I like to use a dedicated thread for the communication, and use a Queue to pass info between the main thread doing Qt, and the communications thread. Full duplex via USB serial seems to work very well even using separate send and receive threads, although for any command-response protocol you'll want to use a single thread. Interesting stuff!
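A minimal sketch of the single-thread command-response variant, with the Qt parts left out. Here transact is a hypothetical callable that writes one command to the device and returns its reply; the GUI side would call drain_replies periodically (for example from a QTimer), so the Qt main thread never blocks on the device. All function names are hypothetical.

```python
import queue
import threading


def comm_worker(transact, requests, replies):
    """Communication thread: owns the device and runs one command-response at a time."""
    while True:
        command = requests.get()
        if command is None:              # sentinel from the GUI thread: shut down
            break
        try:
            replies.put(("ok", command, transact(command)))
        except OSError as err:           # report I/O errors instead of killing the thread
            replies.put(("error", command, err))


def drain_replies(replies):
    """Non-blocking: collect every reply queued so far (safe to call from the GUI thread)."""
    collected = []
    while True:
        try:
            collected.append(replies.get_nowait())
        except queue.Empty:
            return collected


def start_comm_thread(transact):
    """Create the request/reply queues and start the communication thread."""
    requests, replies = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=comm_worker, args=(transact, requests, replies), daemon=True)
    thread.start()
    return thread, requests, replies
```

For the full-duplex case, the same pattern splits into one thread that only writes from a queue and another that only reads into one.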