There was a short period in history when bit banging low-speed (1.5 Mbit/s) USB was a useful thing, and even bit banging 10 Mbit/s Ethernet has been done. But it puts severe limitations on the "other things" you want to do with such a system. These days there are plenty of uCs with built-in peripherals for USB or Ethernet. I still use my AVR programmers with the Thomas Fischl library based USBasp software. Analyzing that software may also be a good approach if you want to learn about low-level USB signaling. (I even got useful data for 1.5 Mbit/s USB from the USD 10 logic analyzer with Sigrok / PulseView.) But this is mostly a milestone we have already passed, because of the wide availability of uCs with USB or Ethernet.
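To give a feel for why bit banging leaves so little room for "other things", here is a minimal sketch in C of the cycle budget per bit. The helper name `cycles_per_bit` is hypothetical (not from any library), and the 12 MHz clock in the usage note is just an assumed example value for an AVR.

```c
#include <stdint.h>

/* Hypothetical helper: CPU clock cycles available per bit when
   bit banging a serial line at the given bit rate.
   Integer division is used, so any fractional cycle is dropped. */
uint32_t cycles_per_bit(uint32_t cpu_hz, uint32_t bit_rate)
{
    return cpu_hz / bit_rate;
}
```

With an assumed 12 MHz clock and 1.5 Mbit/s, `cycles_per_bit(12000000u, 1500000u)` gives 8. So while a packet is on the wire, every bit has to be sampled or driven, decoded and stored within 8 clock cycles, and nothing else may interrupt.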
Recently I put an STM32 on a logic analyser and measured the time between the end of serial communication and the release of the RS485 enable line. I was very surprised that it was some 500 ns or so (roughly 50 clock cycles) on a 100 MHz processor (WeAct Black Pill, STM32F411CEU6). There were no other ISRs or DMA, and in the assembly generated by GCC there were at most a few instructions in between. I guess it has something to do with caching, pipelining and synchronization between different clock domains. Such microcontrollers also use "flash accelerators", which for example read 128 bits from flash memory in parallel and then put the bytes in the order the processor wants them. There is some weird bit shuffling under the hood, and you don't get the timing you would expect if your experience is based on 20-year-old 8-bitters.
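To put that measurement in perspective, here is a minimal sketch in C that converts between a delay in nanoseconds and CPU clock cycles. The names `ns_to_cycles` and `cycles_to_ns` are hypothetical helpers of my own, and "3 cycles" in the usage note is only an assumed stand-in for "a few single-cycle instructions".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a latency in nanoseconds to CPU clock
   cycles, rounded to the nearest cycle. The 64-bit intermediate
   avoids overflow of latency_ns * cpu_hz. */
uint32_t ns_to_cycles(uint32_t latency_ns, uint32_t cpu_hz)
{
    return (uint32_t)(((uint64_t)latency_ns * cpu_hz + 500000000u) / 1000000000u);
}

/* Hypothetical helper: convert a cycle count to nanoseconds,
   rounded to the nearest nanosecond. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz != 0u); /* guard against division by zero */
    return (uint32_t)(((uint64_t)cycles * 1000000000u + cpu_hz / 2u) / cpu_hz);
}
```

For the numbers above, `ns_to_cycles(500u, 100000000u)` gives 50, while `cycles_to_ns(3u, 100000000u)` gives 30. So the measured delay is more than ten times what you would naively expect from counting instructions.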
Another trend is that multi-core processors are getting more common. All the bigger ARMs (used in Linux / Android devices) seem to have smaller processors for housekeeping. The BeagleBone (TI Sitara) has two built-in PRUs, which can do things like EtherCAT, which requires very low latency modification of Ethernet packets as they stream by. The Propeller chip is a classic. The ESP32 has variants with multiple processor cores, and the RP2040 also has two cores. With a multi-core uC, I guess you could dedicate one core to bit banging or other low-level, low-latency tasks, while the other does the more computationally based things that disrupt timing (ISRs, DMA, task switching, etc.).