Is ULPI flexible enough to allow this? (I've just started studying it so I'll still have to figure this out.)
ULPI is a standard interface specification between the PHY and your FPGA. A ULPI PHY IC is a USB cable driver with a simple ser-des, clock recovery circuitry and simple USB arbitration, so your FPGA interface only runs at 1/2 or 1/4 of the 480 Mbit/s line rate (3 pin vs 6 pin mode). This means you do not need a 480 Mbit/s ser-des in your FPGA. These ULPI PHY ICs are dumb apart from the ser-des plus USB status. Everything else you need to implement in the FPGA, but with the knowledge that you are receiving and transmitting your packet structures and contents in bytes.
You should have access to 100% of the USB2.x capabilities with a ULPI PHY IC.
I've now read the ULPI spec fully and there should be no problem doing what I have in mind.
ULPI-compliant PHYs actually do a little more than what you wrote above: they also handle NRZI encoding, CRC, and form packets.
They also handle some additional functions that can be handy to have, like cable disconnection, VBUS detection (which could be used for something else), etc.
As I mentioned above, the only constraint you'll have is that the PHY will generate low-level USB packets. That's SYNC+packet ID+payload+CRC+EOP. You only need to provide the packet ID and payload, the rest is automatic. The SYNC part is 32 bytes long in HS, so to optimize throughput, you'd better use large payloads. The USB standard defines a 1024-byte maximum for data packets, but I don't know whether the PHYs actually check the length or not. The packet ID is 1 byte, and you can write anything you want if you don't have to comply with USB. The CRC is 16 bits. EOP is a few bits. All in all, the SYNC preamble is what eats the most extra bandwidth, and for 1024-byte data packets, the overhead is about 3%. Since it handles CRC, you have a reasonable check for data integrity for free.
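For anyone who wants to check the CRC field by hand (in a testbench, or on whichever side ends up computing it), here is a minimal Python sketch of the 16-bit CRC that USB uses on data packets. The parameters are the commonly published "CRC-16/USB" ones (polynomial 0x8005 processed LSB first, initial value 0xFFFF, result inverted). The helper names `append_crc` and `check_crc` are made up for this example, and sending the low CRC byte first is an assumption about the byte order seen on the byte-wide interface.

```python
def crc16_usb(data: bytes) -> int:
    """CRC-16/USB: poly 0x8005 (reflected form 0xA001), init 0xFFFF, final XOR 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the reflected polynomial if a 1 fell out.
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc ^ 0xFFFF


def append_crc(payload: bytes) -> bytes:
    """Hypothetical helper: payload followed by its CRC, low byte first (assumed order)."""
    return payload + crc16_usb(payload).to_bytes(2, "little")


def check_crc(frame: bytes) -> bool:
    """Hypothetical helper: recompute the CRC over the payload and compare."""
    payload, received = frame[:-2], frame[-2:]
    return crc16_usb(payload).to_bytes(2, "little") == received


if __name__ == "__main__":
    frame = append_crc(bytes(range(16)))
    print(check_crc(frame))                        # intact frame
    print(check_crc(b"\xff" + frame[1:]))          # first byte corrupted
```

A 16-bit CRC catches all single-bit errors and most burst errors in a packet, which is a reasonable integrity check for a point-to-point link, though not a substitute for retransmission logic.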
Just a note, ULPI supports various bus widths, but the most common is 8 data bits. So it's 8 bit @60MHz.
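As a quick sanity check on those numbers, a tiny Python sketch of the raw interface bandwidth. The function name and the `ddr` flag are just for illustration; the 4-bit double-data-rate case is included on the assumption that the narrower variant clocks data on both edges.

```python
def raw_throughput_bps(width_bits: int, clock_hz: float, ddr: bool = False) -> float:
    """Raw bits per second moved across a parallel interface (no protocol overhead)."""
    edges_per_cycle = 2 if ddr else 1
    return width_bits * clock_hz * edges_per_cycle


if __name__ == "__main__":
    # 8 data bits at 60 MHz, single data rate: matches the 480 Mbit/s HS line rate.
    print(raw_throughput_bps(8, 60e6))
    # 4 data bits at 60 MHz, both clock edges (assumed variant): same raw rate.
    print(raw_throughput_bps(4, 60e6, ddr=True))
```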
As to using MCUs with embedded USB 2 PHYs, doing the same will greatly depend on the MCU. On some the PHY isn't directly accessible, on others it is. But that was just another idea to investigate. It can't be as "portable" as doing this with a separate ULPI PHY.
For higher data rates, I guess doing something similar with a USB 3 PHY in SS mode should be doable as well. Now we're talking about 5 Gbit/s. The interface requires more (and faster) IOs though, since it's a PIPE interface, but that should still be doable with modest FPGAs.
Fully implementing links in FPGAs that have fast enough transceivers is of course also on the list, but I thought this idea might be worth investigating.
As to terms, yeah. Lattice calls their high-speed transceivers SERDES, and flags their FPGAs without them as not having SERDES. It's a matter of terms. Xilinx actually has a number of variants for their high-speed transceivers, from GTP to GTX to... And yes, their higher-end FPGAs such as the Virtex series have much faster, but also more sophisticated transceivers. Another term used is CDR, so let's also state what we are talking about. Even the GTP that the Artix (and I think Spartan 7?) have seems to have some kind of CDR, but as far as I can tell, it's not a full CDR as I think of it: they can't recover clock and data on their own (unless I missed something, I'm not an expert with the Xilinx 7-series), which is why the app note mentioned earlier shows a way to recover data (not the clock) using oversampling. That's not any kind of hardware CDR as I call it. But, as far as I've seen, the Virtex seem to have that. Now whether you'd get better, or at least as good, performance (data integrity) relative to jitter and clock drift for a ~500 Mbit/s link than with a dedicated USB PHY, I do not know. But having read some papers about clock/data recovery and oversampling, I'm not completely sure that is the case with 4x oversampling and a relatively simple scheme. Maybe it is though. I'd be curious to see eye patterns and compare them - problem being that you can't directly get an eye pattern with the oversampling method, as you don't recover the clock.
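To make "recovering data, not the clock" a bit more concrete, here is a toy Python model of 4x oversampling data recovery. This is not the scheme from the app note, just a minimal sketch of the general idea: find which of the 4 sample phases the transitions land on, then read the data half a bit away from that. The function names are made up, and the model has no jitter, no noise and no frequency drift, which is exactly what a real design would have to track.

```python
from collections import Counter


def oversample(bits, factor=4, offset=0):
    """Model an ideal line sampled `factor` times per bit, starting at an unknown phase."""
    samples = [b for b in bits for _ in range(factor)]
    return samples[offset:]


def recover(samples, factor=4):
    """Pick the sampling phase farthest from the observed edges and decimate."""
    # Histogram of the phase (index modulo factor) at which each transition occurs.
    edges = Counter(
        i % factor for i in range(1, len(samples)) if samples[i] != samples[i - 1]
    )
    if not edges:
        raise ValueError("no transitions seen, cannot choose a sampling phase")
    edge_phase = edges.most_common(1)[0][0]
    # Sample half a bit period after the edge, i.e. near the middle of the bit.
    sample_phase = (edge_phase + factor // 2) % factor
    return samples[sample_phase::factor]


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    for offset in range(3):
        print(offset, recover(oversample(bits, offset=offset)) == bits)
```

Note that the output is only a stream of decided bits: no recovered clock comes out of this, which is why there is nothing to trigger an eye diagram on.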
Small edit: The SYNC preamble in data packets is 32 *bits* for USB HS, not 32 bytes. So the overhead is much less severe.
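The difference can be worked out with a few lines of Python. This is a rough estimate only: it counts SYNC, packet ID, CRC and EOP against the payload and ignores bit stuffing and inter-packet gaps. The 8-bit EOP length is an assumption (the post above only says "a few bits"), and the function name is made up.

```python
def hs_overhead(payload_bytes, sync_bits=32, pid_bits=8, crc_bits=16, eop_bits=8):
    """Fraction of the bits on the wire that are not payload (eop_bits is assumed)."""
    extra = sync_bits + pid_bits + crc_bits + eop_bits
    total = extra + payload_bytes * 8
    return extra / total


if __name__ == "__main__":
    # 1024-byte payload with a 32-bit SYNC: 64 extra bits out of 8256.
    print(f"{hs_overhead(1024):.2%}")
    # Same payload with the mistaken 32-byte (256-bit) SYNC: 288 extra bits out of 8480.
    print(f"{hs_overhead(1024, sync_bits=256):.2%}")
    # Small payloads are where the fixed fields start to hurt.
    print(f"{hs_overhead(64):.2%}")
```

With these assumptions the 1024-byte case comes out below 1%, versus roughly 3% with the 32-byte figure, so large payloads still help but matter less than first thought.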