You could waste half the space in your "frame buffer", so that the "row" bits are already interleaved. It'd be comparatively easy to spread the bits when drawing (just shift by 2 instead of 1...)
Of your 4 bytes, any of the 16 row bits can be set, but only one column bit, right?
So if your framebuffer has 16 words indexed by column
uint32_t fb[16] = { 0x1x2x3x4x...ExFx, /* the hex digits are the row bits and the x's are empty, then */
:
}
then the refresh code would be a zippy:
shiftword = fb[column] /* (spread bits) */ | ((uint32_t)1 << (2*column)); /* column bit; the cast keeps the shift 32 bits wide where int is only 16 */
/* shift it out */
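To make both sides of that concrete, here's a rough sketch of the drawing and refresh halves together. The names (set_pixel, clear_pixel, refresh_word) are made up for illustration, and I'm assuming row r lives at bit 2*r+1 and the column select at bit 2*column; which end row 0 actually sits at depends on how your shift registers are wired, so flip the arithmetic if needed.

```c
#include <stdint.h>

#define NUM_COLUMNS 16

/* One 32-bit word per column; row bits are stored pre-spread
   into the odd bit positions (assumed layout, see above). */
static uint32_t fb[NUM_COLUMNS];

/* Drawing, the less common path: shift by 2 per row instead of 1. */
static void set_pixel(unsigned row, unsigned column)
{
    fb[column] |= (uint32_t)1 << (2 * row + 1);
}

static void clear_pixel(unsigned row, unsigned column)
{
    fb[column] &= ~((uint32_t)1 << (2 * row + 1));
}

/* Refresh, the hot path: one OR to drop in the single column bit
   at the even position 2*column. The result is what gets shifted out. */
static uint32_t refresh_word(unsigned column)
{
    return fb[column] | ((uint32_t)1 << (2 * column));
}
```

With that layout, after set_pixel(0, 0) the word for column 0 comes out as 0x3: row bit 1 plus column bit 0.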
Classic "trade space for speed" and "make the thing you do a lot (refresh) quicker, while letting the less common thing (drawing) be a bit slower."
(I may have confused row/column in that description...)