I'm having trouble finding timing data in the RP2040 datasheet; it isn't nearly as thoroughly documented as microcontrollers from major suppliers.
RP2040 datasheet section 2.16.1 says that an external oscillator can drive the XIN pin at up to 50MHz, so you won't be able to clock the RP2040 off of the Northbridge 100MHz clock. Even if it could be clocked that way, I'd be concerned about the delay through the PIO state machine; even a single 315MHz discrete flip-flop like SN74AUC74RGYR has a propagation delay of 0.7ns typical, 2.5ns max at 1.5V.
If you want programmable logic, there are piles of options available, but I haven't used anything like that since college (although I've perused datasheets when contemplating personal projects).
That said, if you just want to listen to the signal, pass through everything with a 1 cycle delay, and fix the 32nd bit after the start bit, I feel like you could implement that with discrete logic more easily than programmable logic (as someone who is very rusty with programmable logic).
Your trace looks like the data transitions on the falling edge, so have one D-type flip flop read in the Northbridge data on the rising edge and a second D-type flip flop send the (possibly modified) data on the falling edge. The flip flop delay determines your clock skew; you could use something like SN74AUC74RGYR with 0.7ns typical, 2.5ns max propagation delay at 1.5V, which should let you meet your timing requirements as long as that leaves enough setup time for the CPU before it reads in the data on the rising edge.
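To put rough numbers on that setup margin, here's a quick back-of-the-envelope sketch. It assumes a 100MHz clock with a 50% duty cycle, and it ignores trace delay, clock skew, and the flip-flop's own setup/hold requirements, so treat the results as optimistic. The function name is just something I made up for illustration.

```python
def setup_margin_ns(clock_mhz, t_pd_ns):
    """Time between the output flip-flop's data becoming valid (falling edge
    plus propagation delay) and the next rising edge, where the CPU samples.

    Assumes a 50% duty cycle; ignores trace delay and clock skew.
    """
    half_period_ns = 1000.0 / clock_mhz / 2.0
    return half_period_ns - t_pd_ns


# SN74AUC74 propagation delays at 1.5V, as quoted above.
for label, t_pd in (("typical", 0.7), ("max", 2.5)):
    print(f"{label}: {setup_margin_ns(100.0, t_pd):.1f} ns of setup time left")
```

At 100MHz that works out to about 4.3ns typical and 2.5ns worst case, so the CPU input's required setup time has to fit inside that.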
Then you have a counter, two more state flip-flops, and some logic gates to implement your state machine. The first state flip-flop starts at 0 and is set to 1 when the start bit is read, which enables the counter. When the counter hits 32 (detected with an inverter on the top bit and a 6-input NOR gate), two things happen: a 1 is OR'ed into the signal between the two pass-through flip-flops, setting that bit to 1 regardless of what the Northbridge sent, and the second state flip-flop is set to 1. That second flip-flop disables the counter, so the pass-through flip-flops forward data unmodified until the next reset, which clears both state flip-flops.
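If it helps to sanity-check the logic before wiring anything up, here's a minimal behavioral sketch of that state machine in Python. It's a model of the bit-level behavior only, not of the electrical timing. The assumptions are mine, not from your trace: the line idles at 0 and the start bit is a 1, one bit arrives per clock, and all flip-flops reset to 0. The function names are made up for illustration.

```python
def counter_is_32(count):
    """Inverter on bit 5 plus a 6-input NOR: true only for 0b100000."""
    nor_inputs = [1 - ((count >> 5) & 1)] + [(count >> i) & 1 for i in range(5)]
    return int(not any(nor_inputs))


def patch_stream(bits):
    """Pass bits through with a one-cycle delay, forcing the 32nd bit
    after the start bit to 1. Assumes the start bit is a 1 on an idle-low line.
    """
    started = 0  # state flip-flop 1: set when the start bit is seen
    done = 0     # state flip-flop 2: set once the bit has been patched
    count = 0    # 6-bit counter, enabled while started and not done
    held = 0     # value sitting between the two pass-through flip-flops
    out = []
    for bit in bits:
        out.append(held)  # output flop sends what was captured last cycle
        force = 0
        if started and not done:
            count += 1
            if counter_is_32(count):
                force = 1  # OR a 1 into the pass-through path
                done = 1   # disables the counter until reset
        elif not started and bit == 1:
            started = 1
        held = bit | force
    return out


# Start bit at index 2, followed by all zeros.
stream = [0, 0, 1] + [0] * 40
result = patch_stream(stream)
print([i for i, b in enumerate(result) if b == 1])
```

With that input, the only 1s in the output should be the delayed start bit and the forced bit 32 positions after it; everything else passes through unchanged.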
One nice thing: that flip-flop runs on 0.8-2.7V VDD, so you could run it directly on Vcore. That's the most speed-critical part; you could probably find versions of all the other components that also run on Vcore.
That same basic state machine would be very easy on an FPGA in VHDL or Verilog; you'd just have to add in all the configuration of clocks and pins and delays to get everything wired up properly.
If those discrete components are too big, or the flip-flop's 0.7-2.5ns delay isn't fast enough (because the CPU input needs a setup time longer than what's left of the 5ns half-cycle at 100MHz: 4.3ns typical, 2.5ns worst case), then I would next investigate an FPGA that can take a 100-120MHz external clock and has IO that lets you control the phase between a clock and an output with sub-ns precision. I know they exist, but I'm not sure how low-end you can go while keeping that capability.
For example, a Xilinx Spartan-7 accepts a 19-800MHz input clock, has multiple orders of magnitude more logic capacity than you need, has GPIO timings that range from fractions of a ns up to 2.0ns at 1.5V (better at 1.8V), and has clock management modules that can add sub-ns phase shifts to let you tweak to exactly the right timing. However, it needs a 0.95V or 1.0V supply voltage (IO voltage can be Vcore), and the cheapest Spartan-7 in stock at DigiKey is $21.91 in a 225-ball 0.8mm-pitch BGA; if you want the 196-ball 1.0mm-pitch BGA that's their easiest-to-solder option, the cheapest is $26.74. I haven't looked into whether you can keep the clock features you need while getting to a cheaper, easier-to-solder part that might even run directly off of 1.5-1.8V Vcore.
Good luck!