You are doing fine. I just have experience with fixed-point math & resizing graphics.
You can use different scale sizes for the X&Y to stretch fill the image on the LCD.
Now, bi-linear filtering is not that much more complicated, but it does require a bit of ingenuity.
(Consider this food for thought and experimentation. I'll offer more in a day or so if you wish, though you might be able to solve the problem for yourself...)
Your 'oCoord_X/Y[25:16]' contains an integer location of which pixel you wish to draw on the output display.
And if you think it through, your 'oCoord_X/Y[15:0]' contains a fractional location in between the current integer location and the next integer location of the pixel you wish to draw. (You may consider only using oCoord_X/Y[15:8], which gives 256 sub-locations, 0 to 255, between the current integer pixel and the adjacent one.)
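As a quick software check of that bit split (Python used purely as a calculator here, not HDL; the helper name split_coord is made up for illustration, and the 10.16 layout is assumed from the [25:16] and [15:0] ranges above):

```python
def split_coord(coord):
    """Split a 26-bit 10.16 fixed-point coordinate into (integer pixel, 8-bit fraction)."""
    integer = (coord >> 16) & 0x3FF   # bits [25:16]: which source pixel
    fraction = (coord >> 8) & 0xFF    # bits [15:8]: 0..255 position toward the next pixel
    return integer, fraction

# 5.75 in 10.16 fixed point: integer part 5, fraction 0xC000 / 0x10000 = 0.75
coord = (5 << 16) | 0xC000
print(split_coord(coord))  # → (5, 192)
```

So a fraction of 192 means the wanted sample sits three quarters of the way from pixel 5 to pixel 6.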
This means, when reading a pixel, you want to have simultaneous access to 4 pixels, already decoded to RGB color after your palette: the pixels at oCoord_X/Y, oCoord_X+1/Y, oCoord_X/Y+1 and oCoord_X+1/Y+1. The oCoord_X/Y[15:8] will act like a linear fader/selector, setting how much of X/Y, X+1/Y, X/Y+1 and X+1/Y+1 is blended together, with the sum being your final pixel.
E.g., everything marked (RGB) is done 3 times, once each for R, G and B. This is not proper code, it is an example guide (the math is for 24 bit color, you can shrink everything from 8 bits to 4 bits per channel for 12 bit color):
assign x_fraction = oCoord_X[15:8];
assign y_fraction = oCoord_Y[15:8];
// The two weights are (256-fraction) and fraction so they always sum to 256, matching the >> 8.
// With 255-fraction, a solid 255 input would come out as 254. (256-fraction) needs 9 bits.
out_y0(RGB) <= ( (oCoord_X/Y[25:16]_(RGB) * (256-x_fraction)) + (oCoord_X+1/Y[25:16]_(RGB) * (x_fraction)) ) >> 8 ;
out_y1(RGB) <= ( (oCoord_X/Y+1[25:16]_(RGB) * (256-x_fraction)) + (oCoord_X+1/Y+1[25:16]_(RGB) * (x_fraction)) ) >> 8 ;
out_final(RGB) <= ( (out_y0(RGB) * (256-y_fraction)) + (out_y1(RGB) * (y_fraction)) ) >> 8 ;
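If you want to check the blend math in software before committing it to HDL, here is a minimal reference model in Python. It is only a sketch: the names lerp8 and bilinear are made up for illustration, and it assumes weights of (256 - fraction) and fraction so that the two weights sum to 256 and line up with the >> 8.

```python
def lerp8(a, b, frac):
    """Blend two 8-bit channel values; frac is 0..255 (0 means all of a)."""
    return (a * (256 - frac) + b * frac) >> 8

def bilinear(p00, p10, p01, p11, x_frac, y_frac):
    """p00, p10, p01, p11 are (R, G, B) tuples at (X,Y), (X+1,Y), (X,Y+1), (X+1,Y+1)."""
    out_y0 = tuple(lerp8(a, b, x_frac) for a, b in zip(p00, p10))  # blend along line Y
    out_y1 = tuple(lerp8(a, b, x_frac) for a, b in zip(p01, p11))  # blend along line Y+1
    return tuple(lerp8(a, b, y_frac) for a, b in zip(out_y0, out_y1))  # blend the two lines

black, white = (0, 0, 0), (255, 255, 255)
print(bilinear(black, white, black, white, 128, 0))    # → (127, 127, 127)
print(bilinear(white, white, white, white, 200, 37))   # → (255, 255, 255)
```

The second call shows why the weights should sum to 256: a flat white area stays at 255 no matter what the fractions are.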
The trick is how to get the 4 correct pixels at every pixel clock cycle.
There are a few solutions.
For the Y coordinate, one trick is to break up your display ram buffer into 2 dual port ram blocks. 1 block will hold all the even Y lines of the picture while the second will store all the odd Y lines. On the read side, with a little smart steering logic based on the requested address and which data you read, you may now read an even and an odd Y line simultaneously. This solves simultaneous reading of Y and Y+1. The same can be done for X, splitting the picture into even and odd columns. This means 4 dual port rams, each with 1/2 the X size and 1/2 the Y size.
There are smarter ways to do the X without 4 rams, since you know you are always reading from left to right. For the Y, you may also run the read side at 2X speed, feeding the Y address and then the Y+1 address.
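The even/odd line steering can be modeled in a few lines of Python to check the address math. This is a sketch of one possible arrangement, not the only one: the function names are made up, each "bank" is just a list of lines, and the caller is assumed to keep Y+1 inside the picture.

```python
def split_into_banks(image):
    """image is a list of lines; returns (even_bank, odd_bank)."""
    return image[0::2], image[1::2]

def read_line_pair(even_bank, odd_bank, y):
    """Return lines (y, y+1) using exactly one read from each bank."""
    even_addr = (y + 1) >> 1   # even bank holds line y (y even) or line y+1 (y odd)
    odd_addr = y >> 1          # odd bank holds line y+1 (y even) or line y (y odd)
    even_line = even_bank[even_addr]
    odd_line = odd_bank[odd_addr]
    # y's lowest bit steers which bank's output is "line y" and which is "line y+1"
    if y & 1:
        return odd_line, even_line
    return even_line, odd_line

image = [[row] * 4 for row in range(6)]   # 6 lines, each filled with its own line number
even_bank, odd_bank = split_into_banks(image)
print(read_line_pair(even_bank, odd_bank, 2))  # → ([2, 2, 2, 2], [3, 3, 3, 3])
print(read_line_pair(even_bank, odd_bank, 3))  # → ([3, 3, 3, 3], [4, 4, 4, 4])
```

In hardware terms that is two address computations plus a 2:1 mux on the outputs, selected by the lowest bit of Y.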