OK, so with some effort I converted the design to use the PWM colour mixing approach from Mike's video.
It loads in the data, latches it, and holds it for 1, 2, 4, 8, 16, 32, 64 and 128 time periods.
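For anyone following along, here is a rough sketch of why those hold times give 8 bits per colour. It assumes the data loaded before each hold is one bit of the 8-bit level, least significant bit first; `hold_periods` and `lit_periods` are made-up names for illustration, not anything from the actual design.

```c
#include <assert.h>
#include <stdint.h>

/* Hold time, in time periods, for bit plane `plane` (0 = LSB): 1, 2, 4, ... 128. */
static unsigned hold_periods(unsigned plane)
{
    assert(plane < 8);
    return 1u << plane;
}

/* Total number of periods an LED is lit during one pass over all 8 planes,
   assuming the LED is on during a plane's hold only if that bit of `level` is set. */
static unsigned lit_periods(uint8_t level)
{
    unsigned total = 0;
    for (unsigned plane = 0; plane < 8; plane++) {
        if (level & (1u << plane)) {
            total += hold_periods(plane);
        }
    }
    return total;
}
```

Under that assumption the lit time always equals the level itself, out of 255 periods per pass, so only 8 loads are needed per pass rather than one per brightness step.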
Much better approach. I now have 8 bits for each colour and can convert that to a true 5/6-bit gradient, which is all I need for my goal of 16-bit colour.
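One possible way to do the conversion, sketched below, is to expand each 5- or 6-bit channel to an 8-bit level by bit replication. This assumes the usual RGB565 layout (red in the top 5 bits, green in the middle 6, blue in the bottom 5) and a linear mapping with no gamma correction; `expand_to_8` and `rgb565_to_levels` are hypothetical names, and the real code may well do it differently.

```c
#include <assert.h>
#include <stdint.h>

/* Expand a `bits`-wide channel value to 8 bits by replicating its top bits
   into the low bits, so 0 maps to 0 and the maximum maps to 255. */
static uint8_t expand_to_8(unsigned value, unsigned bits)
{
    assert(bits >= 4 && bits <= 8);
    assert(value < (1u << bits));
    return (uint8_t)((value << (8 - bits)) | (value >> (2 * bits - 8)));
}

/* Split an RGB565 pixel (assumed layout: RRRRRGGG GGGBBBBB) into 8-bit levels. */
static void rgb565_to_levels(uint16_t pixel, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = expand_to_8((pixel >> 11) & 0x1Fu, 5);
    *g = expand_to_8((pixel >> 5) & 0x3Fu, 6);
    *b = expand_to_8(pixel & 0x1Fu, 5);
}
```

Bit replication keeps the full 0 to 255 range, which a plain left shift would not (a 5-bit maximum of 31 shifted left by 3 only reaches 248).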
The update rate for the 32x128 RGB display is 291 Hz, which is overkill, but it leaves some room for additional processing without causing flicker, or for stretching out the updates with dead time to add a global brightness control.
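As a back-of-envelope sketch of the dead-time idea: if the outputs are blanked during the dead time and brightness scales linearly with the lit fraction of the frame (both assumptions), then the refresh rate drops in proportion to the brightness. The function names here are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_REFRESH_MILLIHZ 291000u  /* 291 Hz, the figure quoted above */

/* Refresh rate in millihertz after padding each frame with dead time so that
   the frame is active for only `percent` of the stretched frame time. */
static uint32_t refresh_millihz(uint32_t percent)
{
    assert(percent >= 1 && percent <= 100);
    return BASE_REFRESH_MILLIHZ * percent / 100u;
}

/* Dead time to add, as a percentage of the original (unstretched) frame time.
   Integer division rounds down. */
static uint32_t dead_time_percent(uint32_t percent)
{
    assert(percent >= 1 && percent <= 100);
    return (100u - percent) * 100u / percent;
}
```

So, under those assumptions, 50% brightness means adding dead time equal to one full frame, which gives about 145.5 Hz.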
It also uses a buttload less RAM to latch and hold, compared to how I had it working before.
So I now have enough room to double-buffer the video memory array.
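A minimal sketch of what double buffering could look like, assuming one 16-bit word per pixel and 32 rows by 128 columns (the orientation is a guess). All names are hypothetical; the idea is that drawing code writes to the back buffer while the refresh code reads the front one, and the swap only happens at a frame boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PANEL_ROWS 32u   /* assumed orientation: 32 rows by 128 columns */
#define PANEL_COLS 128u

/* Two full frames of 16-bit pixels: 2 * 32 * 128 * 2 bytes = 16 KiB. */
static uint16_t framebuf[2][PANEL_ROWS * PANEL_COLS];
static volatile uint8_t front = 0;         /* index of the buffer being displayed */
static volatile uint8_t swap_pending = 0;  /* set by drawing code, cleared at frame end */

static size_t pixel_index(unsigned row, unsigned col)
{
    assert(row < PANEL_ROWS && col < PANEL_COLS);
    return (size_t)row * PANEL_COLS + col;
}

/* Buffer the drawing code may write to freely. */
static uint16_t *back_buffer(void) { return framebuf[front ^ 1u]; }

/* Buffer the refresh code reads from. */
static const uint16_t *front_buffer(void) { return framebuf[front]; }

/* Called by drawing code once the back buffer holds a finished frame. */
static void request_swap(void) { swap_pending = 1; }

/* Called by the refresh code after the last row of a frame has been shown. */
static void on_frame_complete(void)
{
    if (swap_pending) {
        front ^= 1u;
        swap_pending = 0;
    }
}
```

Deferring the swap to the end of a frame avoids showing half of one image and half of the next.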

Might take another look at using gated timer mode tomorrow, see if that has any advantages.
Thanks everyone, especially Mike.