Currently I have an 8-slot ring buffer for the incoming requests. The USB interrupt causes the request to be copied into the ring buffer, then the main loop, now released from WFI, goes in and executes it, and finally another USB interrupt causes the results to be picked up. Maybe I need to get rid of this ring buffer and just respond to things in the USB interrupt? Or at least not rely on the ring buffer and USB interrupt for outgoing data?
CMSIS-DAP is a request-response protocol. After the first packet, nothing else will arrive in the buffer until you respond to it. CMSIS-DAP has the capability to buffer multiple requests, but you need to advertise that capability explicitly through the DAP_INFO_PACKET_COUNT response.
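As a rough sketch of what advertising the packet count looks like, here is a DAP_Info handler for just the two buffer-related IDs. The command and info ID values are the ones from the CMSIS-DAP specification; `MY_PACKET_COUNT`, `MY_PACKET_SIZE` and `dap_info_response` are made-up names for illustration, not code from any particular firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the CMSIS-DAP specification. */
#define ID_DAP_INFO            0x00
#define DAP_INFO_PACKET_COUNT  0xFE  /* response payload: 1 byte */
#define DAP_INFO_PACKET_SIZE   0xFF  /* response payload: 2 bytes, little-endian */

/* Hypothetical firmware configuration (assumed values). */
#define MY_PACKET_COUNT  1   /* 1 = strict request/response, host will not queue */
#define MY_PACKET_SIZE   64  /* e.g. a full-speed HID report */

/* Builds the DAP_Info response for req[] into resp[]; returns bytes written. */
size_t dap_info_response(const uint8_t *req, uint8_t *resp, size_t resp_cap)
{
    assert(resp_cap >= 4);
    resp[0] = ID_DAP_INFO;              /* response echoes the command ID */
    switch (req[1]) {                   /* req[1] is the requested info ID */
    case DAP_INFO_PACKET_COUNT:
        resp[1] = 1;                    /* payload length */
        resp[2] = MY_PACKET_COUNT;
        return 3;
    case DAP_INFO_PACKET_SIZE:
        resp[1] = 2;
        resp[2] = MY_PACKET_SIZE & 0xFF;
        resp[3] = (MY_PACKET_SIZE >> 8) & 0xFF;
        return 4;
    default:
        resp[1] = 0;                    /* length 0: no information available */
        return 2;
    }
}
```

With a packet count of 1, a well-behaved host has no reason to send a second request before it reads the response, so a single request slot is enough.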
Most of the time tools just send one request and wait for the response, since this is the most compatible configuration.
It is hard to tell what may be wrong without debugging the code. Some APIs expect the first byte (command ID) to be stripped from the input buffer, and some expect it to be kept. So this is one thing to check.
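To illustrate the command ID issue, here is a minimal sketch of one possible convention: the dispatcher consumes the command ID and the handler only ever sees the payload. `dap_dispatch` and `handle_connect` are hypothetical names; the DAP_Connect ID (0x02), its port values, and the 0xFF "invalid command" response follow the CMSIS-DAP specification. If your handler expects the other convention, every offset in it will be off by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_DAP_CONNECT  0x02
#define ID_DAP_INVALID  0xFF

/* Hypothetical handler: receives the payload only, never the command ID. */
static size_t handle_connect(const uint8_t *payload, uint8_t *out)
{
    uint8_t port = payload[0];     /* 0 = default, 1 = SWD, 2 = JTAG */
    out[0] = (port == 2) ? 0 : 1;  /* this sketch is SWD-only; 0 = failed */
    return 1;
}

/* Parses one request and builds the response; returns the response length. */
size_t dap_dispatch(const uint8_t *req, uint8_t *resp)
{
    assert(req != NULL && resp != NULL);
    switch (req[0]) {
    case ID_DAP_CONNECT:
        resp[0] = req[0];          /* response starts with the command ID */
        /* Note the +1 on both buffers: the ID byte is stripped here. */
        return 1 + handle_connect(req + 1, resp + 1);
    default:
        resp[0] = ID_DAP_INVALID;  /* unknown command */
        return 1;
    }
}
```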
Again, run my tool and see if it can at least identify your debugger and print information about it, since OpenOCD does not seem to do that.
I wonder if using USB HS/SS + CMSIS-DAP v2 Bulk mode + FPGA with hard MCU cores + implement dap_swd_transfer and dap_jtag_transfer using HDL = some really REALLY fast JTAG and SWD action?
Not really. In practice SWD ports on many parts are limited to some pretty low clock, like 16 MHz or so. And this you can pretty well bit-bang from a fast MCU without FPGA.
SS is definitely overkill. HS does improve things a lot, mostly due to the ability to send 512 or 1024 byte packets.
Also, there are MCUs with hardware SWD/JTAG drivers, although they are not well documented. Nuvoton M480 series devices have USB HS and an SWD peripheral, but Nuvoton does not publish any documents about it. I'm sure it is possible to figure it out with some effort. I never really cared to do that, since optimized bit-banging is plenty fast.
Significant optimization should come from the tools. The CMSIS-DAP protocol lets you pack multiple transfers into the same packet. But tools rarely take advantage of this, since they all just use generic abstractions that do one packet per transfer.
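For a sense of what packing looks like on the host side, here is a minimal sketch that builds one DAP_Transfer packet from several transfers. The packet layout (command 0x05, DAP index, transfer count, then a request byte plus 4 data bytes for each write) and the request bits follow the CMSIS-DAP specification. `struct xfer` and `pack_transfers` are hypothetical names, and the sketch ignores the match and timestamp options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_DAP_TRANSFER  0x05
#define REQ_APnDP        (1u << 0)  /* 0 = DP register, 1 = AP register */
#define REQ_RnW          (1u << 1)  /* 0 = write, 1 = read */
/* Bits 2..3 of the request byte carry register address bits A[3:2]. */

struct xfer {
    uint8_t  request;  /* APnDP | RnW | (address & 0x0C) */
    uint32_t wdata;    /* used for writes only */
};

/* Packs count transfers into pkt; returns the packet length, or 0 if
 * they do not fit into cap bytes. */
size_t pack_transfers(uint8_t *pkt, size_t cap,
                      const struct xfer *x, uint8_t count)
{
    size_t n = 0;
    assert(cap >= 3);
    pkt[n++] = ID_DAP_TRANSFER;
    pkt[n++] = 0;       /* DAP index, ignored for SWD */
    pkt[n++] = count;
    for (uint8_t i = 0; i < count; i++) {
        int is_read = (x[i].request & REQ_RnW) != 0;
        size_t need = is_read ? 1 : 5;
        if (n + need > cap)
            return 0;
        pkt[n++] = x[i].request;
        if (!is_read) {  /* write data follows, little-endian */
            pkt[n++] = (uint8_t)(x[i].wdata);
            pkt[n++] = (uint8_t)(x[i].wdata >> 8);
            pkt[n++] = (uint8_t)(x[i].wdata >> 16);
            pkt[n++] = (uint8_t)(x[i].wdata >> 24);
        }
    }
    return n;
}
```

For example, a memory read that would normally take three packets (write AP CSW at 0x00, write AP TAR at 0x04, read AP DRW at 0x0C) fits into a single 14-byte packet this way. The CSW and TAR values you would pass in are target-specific.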