Decoding is trickier than encoding, as you need to either phase-lock to the Manchester clock rate or oversample. Capture peripherals can be very useful though; 564k is probably doable on an 8-bit PIC at 32 MHz.
I think if you skip the logic cells and do it the traditional MCU way, using Input Capture (called CCP on PIC16/18), you can still do it on a PIC16.
If you only sample rising edges, you can reconstruct the data (provided you had some locking sequence before the packet) - the distance between edges will be 1 clock or 1.5 clocks, and that's all you need to know. Thus you have roughly 2 us to enter the interrupt, process the data, and leave the interrupt. All you need to figure out is whether the distance between pulses was 1 or 1.5 clocks; comparing it to 1.25 will do the trick. Then you append your data accordingly and transmit it out (say, by sending it through SPIBUF).
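A minimal sketch of the rising-edge approach described above, written as portable C so it can be tried on a PC before being squeezed into an ISR. The names (`manch_dec_t`, `manch_init`, `manch_rising_edge`) are made up for illustration. It assumes the IEEE 802.3 convention (a 1 is a rising edge in the middle of the bit), a bit period already known in timer ticks, and a locking sequence of 1s so that the first captured edge sits mid-bit. One caveat it covers beyond the 1.25 comparison: under that convention a 1-0-1 pattern puts two rising edges 2 clocks apart, so the sketch adds a second threshold at 1.75.

```c
#include <stdint.h>

/* Where the previous rising edge sat relative to the bit cells. */
typedef enum { EDGE_MID_BIT, EDGE_BOUNDARY } edge_pos_t;

typedef struct {
    edge_pos_t pos;       /* position of the last rising edge */
    uint16_t   last;      /* timer capture value at the last rising edge */
    uint16_t   bit_ticks; /* timer ticks per bit period (assumed known) */
} manch_dec_t;

/* Assumption: we locked on a run of 1s, so the first edge is mid-bit. */
void manch_init(manch_dec_t *d, uint16_t first_edge, uint16_t bit_ticks)
{
    d->pos = EDGE_MID_BIT;
    d->last = first_edge;
    d->bit_ticks = bit_ticks;
}

/* Feed one captured rising edge. Writes up to 2 decoded bits into out[]
 * and returns how many, or -1 if the interval cannot occur. */
int manch_rising_edge(manch_dec_t *d, uint16_t capture, uint8_t out[2])
{
    uint16_t q = d->bit_ticks / 4u;                 /* quarter of a bit */
    uint16_t delta = (uint16_t)(capture - d->last); /* survives timer wrap */
    d->last = capture;

    /* Reject anything outside 0.75 .. 2.25 bit periods. */
    if (delta < d->bit_ticks - q || delta >= 2u * d->bit_ticks + q)
        return -1;

    if (delta < d->bit_ticks + q) {
        /* About 1 clock: same bit as before, edge position unchanged. */
        out[0] = (d->pos == EDGE_MID_BIT) ? 1u : 0u;
        return 1;
    }

    if (delta < 2u * d->bit_ticks - q) {
        /* About 1.5 clocks: the edge position flips. */
        if (d->pos == EDGE_MID_BIT) {
            out[0] = 0u;            /* mid-bit -> boundary means 0, 0 */
            out[1] = 0u;
            d->pos = EDGE_BOUNDARY;
            return 2;
        }
        out[0] = 1u;                /* boundary -> mid-bit means a 1 */
        d->pos = EDGE_MID_BIT;
        return 1;
    }

    /* About 2 clocks: only possible mid-bit to mid-bit (pattern 1, 0, 1). */
    if (d->pos != EDGE_MID_BIT)
        return -1;
    out[0] = 0u;
    out[1] = 1u;
    return 2;
}
```

In an ISR you would call `manch_rising_edge()` with the capture register value and shift the returned bits into a byte. Note that a trailing 0 after a 1 is only reported once the next rising edge arrives, so the end of a packet needs a timeout or a known length.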
1 clock is roughly 2 us, which is 16 cycles on PIC16F1*, less 4 cycles for interrupt entry/exit. The remaining 12 cycles should be enough to subtract two numbers, compare the difference to 10 (1.25 clocks), record the data bits, and write SPIBUF if needed.
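The cycle budget above is simple arithmetic, sketched here as a small helper. The function names are hypothetical; the only hardware fact relied on is that these parts execute one instruction cycle per 4 oscillator clocks. The 4-cycle interrupt overhead is the figure from the post, not a measured value.

```c
#include <stdint.h>

/* Instruction cycles that fit into one bit period.
 * PIC16/PIC18 run one instruction cycle per 4 oscillator clocks. */
uint32_t cycles_per_bit(uint32_t fosc_hz, uint32_t bit_period_ns)
{
    uint64_t fcy = fosc_hz / 4u;  /* instruction cycles per second */
    return (uint32_t)((fcy * bit_period_ns) / 1000000000u);
}

/* Cycles left for real work after interrupt entry/exit overhead. */
uint32_t isr_budget(uint32_t fosc_hz, uint32_t bit_period_ns,
                    uint32_t overhead_cycles)
{
    uint32_t total = cycles_per_bit(fosc_hz, bit_period_ns);
    return (total > overhead_cycles) ? total - overhead_cycles : 0u;
}
```

With a 32 MHz oscillator and a 2000 ns bit period, `isr_budget(32000000, 2000, 4)` gives 12. Assuming 48 MHz and 64 MHz clocks gives 20 and 28, which matches the other figures quoted in this thread. If the bit rate really is 564k, the bit period is closer to 1773 ns and the 32 MHz budget drops to 10, so the margin is thinner than the round 2 us suggests.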
On the PIC16F1454 you would have 20 cycles, which is plenty to decode everything and store the byte in the USB buffer, leaving a few cycles to spare for the USB engine to send the data to the PC.
On PIC18 you get 28 cycles, but you may need to do some context saving/restoring. That is actually more than you need, so you can get another CCP running to detect falling edges. This will let you do auto-locking and error detection.
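A rough sketch of what capturing both edges buys you, again as host-testable C with made-up names (`manch2_dec_t`, `manch2_init`, `manch2_edge`). With both edges, every interval is either half a bit or a full bit. A full-bit interval can only run from mid-bit to mid-bit, so the first one seen gives the lock, and one that starts on a bit boundary is a code violation. The sketch assumes the IEEE 802.3 convention (rising edge mid-bit is a 1) and that the two capture channels are merged into one stream of timestamped edges, which is an assumption about how you would wire up the ISRs.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     locked;    /* true once the bit phase is known */
    bool     at_mid;    /* last edge was a mid-bit transition */
    uint16_t last;      /* timer capture value at the last edge */
    uint16_t bit_ticks; /* timer ticks per bit period (assumed known) */
} manch2_dec_t;

void manch2_init(manch2_dec_t *d, uint16_t first_edge, uint16_t bit_ticks)
{
    d->locked = false;
    d->at_mid = false;
    d->last = first_edge;
    d->bit_ticks = bit_ticks;
}

/* Feed one edge (rising or falling). Returns 1 and sets *bit when a bit
 * is decoded, 0 when there is nothing to report, -1 on a detected error
 * (the decoder then drops lock and waits for the next full-bit interval). */
int manch2_edge(manch2_dec_t *d, uint16_t capture, bool rising, uint8_t *bit)
{
    uint16_t q = d->bit_ticks / 4u;                 /* quarter of a bit */
    uint16_t delta = (uint16_t)(capture - d->last); /* survives timer wrap */
    d->last = capture;

    /* Valid intervals lie between 0.25 and 1.25 bit periods. */
    if (delta < q || delta >= d->bit_ticks + q) {
        d->locked = false;
        return -1;
    }

    if (delta >= d->bit_ticks - q) {
        /* Full-bit interval: must run mid-bit to mid-bit. */
        if (d->locked && !d->at_mid) {
            d->locked = false;   /* started on a boundary: violation */
            return -1;
        }
        d->locked = true;        /* this is the auto-lock */
        d->at_mid = true;
        *bit = rising ? 1u : 0u;
        return 1;
    }

    /* Half-bit interval: alternates between boundary and mid-bit. */
    if (!d->locked)
        return 0;                /* phase still unknown, keep waiting */
    d->at_mid = !d->at_mid;
    if (!d->at_mid)
        return 0;                /* boundary edge carries no data */
    *bit = rising ? 1u : 0u;
    return 1;
}
```

The price is roughly twice as many interrupts, each with about half the time budget, so whether this fits depends on how much of the 28 cycles the context save/restore really takes.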