Older consoles didn't actually have a direct framebuffer. Consider the SNES video modes. In most modes, you have one to four stacked fixed grids of 8x8 tiles, where each grid can be scrolled separately (with the different grids having different numbers of bits per pixel), plus 128 additional sprites (max. 32 on the same scan line). The tile grid is usually larger than the actual display, so that scrolling is achieved by just updating the offset.
Let's say you have a 128×64 tile map, each tile being 8×8 pixels, with 16 colors per tile. Each tile uses one of 16 palettes (each color chosen from 32768 possible colors), and can have 16 drawing modes (like mirrored and rotated). With 256 distinct tiles, the tile map takes 8192 bytes (one byte per cell), the tile definitions another 8192 bytes (32 bytes each), the palette and drawing mode selectors 256 bytes (one byte per tile), and the palettes 512 bytes, for a total of 17152 bytes.
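This layout can be written down as a C structure, which also shows how scrolling reduces to adding an offset before the tile lookup. It is only a sketch: the field order, the nibble packing, the meaning of the drawing mode bits, and the names `tile_layer` and `layer_pixel` are assumptions picked to match the byte counts, not any real console's format.

```c
#include <stdint.h>

#define MAP_W 128   /* tiles across */
#define MAP_H 64    /* tiles down */

/* Assumed layout, chosen so that the sizes match the byte counts in the text. */
struct tile_layer {
    uint8_t  map[MAP_H][MAP_W];  /* 8192 bytes: tile number for each cell */
    uint8_t  tiles[256][8][4];   /* 8192 bytes: 8 rows of 8 four-bit pixels */
    uint8_t  attr[256];          /* 256 bytes: palette (low nibble), mode (high nibble) */
    uint16_t palette[16][16];    /* 512 bytes: 15-bit colors */
};

/* Assumed drawing mode bits; the fourth mode bit is left unused here. */
enum { MODE_FLIP_X = 1, MODE_FLIP_Y = 2, MODE_SWAP_XY = 4 };

/* Color of display pixel (x, y) when the layer is scrolled by (sx, sy). */
static uint16_t layer_pixel(const struct tile_layer *l,
                            unsigned x, unsigned y, unsigned sx, unsigned sy)
{
    unsigned mx = (x + sx) % (MAP_W * 8);  /* wrap around the map edges */
    unsigned my = (y + sy) % (MAP_H * 8);
    unsigned tile = l->map[my / 8][mx / 8];
    unsigned attr = l->attr[tile];
    unsigned mode = attr >> 4;
    unsigned px = mx % 8, py = my % 8;

    if (mode & MODE_SWAP_XY) { unsigned t = px; px = py; py = t; }
    if (mode & MODE_FLIP_X) px = 7 - px;
    if (mode & MODE_FLIP_Y) py = 7 - py;

    /* Two pixels per byte: even pixel in the low nibble, odd in the high. */
    unsigned byte  = l->tiles[tile][py][px / 2];
    unsigned index = (px & 1) ? (byte >> 4) : (byte & 15);
    return l->palette[attr & 15][index];
}
```

Here `sizeof (struct tile_layer)` comes out at 17152 bytes, and scrolling only changes `sx` and `sy`; none of the tile data moves.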
Or, let's say you have three separate 128×64 tile maps, with 256 possible 8×8-pixel tiles, 16 colors per tile; and separate 24-bit palettes for each layer (6-bit RGBA, for transparency support). The tile maps take 3×8192 bytes, the tile definitions 8192 bytes, the palette and drawing mode selectors 3×256 bytes, and the palettes 3×256×24/8 bytes, for a total of 35840 bytes. (A fast renderer might internally use a buffer with 16 scan lines and 8 extra pixels before and after each scan line, so that each tile could be rendered at once to the buffer, without any need to check for partially shown tiles. That would be 7936 pixels for a 480-pixel wide display; or 31744 bytes for 32 bits per pixel, allowing for fast 6-bit color blending.)
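The strip buffer idea in the parenthesis can be sketched as follows. The 8-pixel margins mean a tile can start anywhere from 8 pixels left of the display to its right edge and still be copied whole. The names `strip` and `blit_tile`, the flat 64-entry tile format, and the choice of letting a tile start on strip rows 0 to 8 are assumptions for illustration.

```c
#include <stdint.h>

#define DISPLAY_W 480
#define MARGIN    8
#define STRIP_W   (MARGIN + DISPLAY_W + MARGIN)  /* 496 pixels */
#define STRIP_H   16

/* 496 x 16 = 7936 pixels, 31744 bytes at 32 bits per pixel. */
static uint32_t strip[STRIP_H][STRIP_W];

/* Copy one 8x8 tile (64 pixels, row-major) into the strip.
 * x is the display column of the tile's left edge, -8..480;
 * y is the strip row of its top edge, 0..8.
 * No clipping tests are needed: the margins absorb partial tiles. */
static void blit_tile(const uint32_t *tile, int x, int y)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            strip[y + r][MARGIN + x + c] = tile[r * 8 + c];
}
```

Only columns `MARGIN` to `MARGIN + DISPLAY_W - 1` of each strip row would then be sent to the display.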
To emulate existing games, the emulator part will do the blitting from somewhat similar structures in the emulated memory to an actual framebuffer, which is then sent (e.g. via DMA) to the ILI9341 display.
I have some nice 240×240 display modules with IPS panels and ST7789 controllers. They do support 18-bit color (262144 colors), but the 15/16-bit interface is much easier and nicer to use.
Let's say I want to use a Teensy 4 (about 36mm×18mm) as a blitter for this display for my own handheld games. Four planes of 256×256 tiles of 8×8 pixels would need 262144 bytes. At 256 colors per tile, the tile data takes 65536 bytes. If each plane has its own palette (with 32-bit color entries), the palettes take 4096 bytes. In total, 331776 bytes.
I also want sprites. Let's say 256 sprites of up to 32×32 pixels each, with 256 colors per sprite. Each sprite can be mirrored and rotated, and uses one of 32 possible palettes. The sprite data takes up to 262144 bytes, and the 32 palettes 32768 bytes. (Palette swapping is a common "trick" to add variety to the graphics, and palette modification to change the "mood".)
At this point we're at 626688 bytes, not including sprite coordinate tables, communication buffers to the master microcontroller and so on; but because Teensy 4 has 1024k of RAM, this is quite possible. In fact, there is ample room for a (32+240+32)×(32+240+32) = 304×304-pixel and a 240×240-pixel 16-bit framebuffer, allowing tear-free updates to the TFT, without blocking communications to the microcontroller during a TFT update.
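The budget can be checked at compile time. The constant names are made up for this sketch, and the tile data line assumes 256 tile definitions per plane at 64 bytes each, which is one way to arrive at the 65536 bytes mentioned.

```c
/* Memory budget for the blitter, in bytes. */
enum {
    TILE_MAPS       = 4 * 256 * 256,   /* 262144: four planes, one byte per cell */
    TILE_DATA       = 4 * 256 * 64,    /*  65536: assumed 256 tiles per plane */
    PLANE_PALETTES  = 4 * 256 * 4,     /*   4096: 32-bit entries */
    SPRITE_DATA     = 256 * 32 * 32,   /* 262144: 256 sprites, one byte per pixel */
    SPRITE_PALETTES = 32 * 256 * 4,    /*  32768 */
    GRAPHICS_TOTAL  = TILE_MAPS + TILE_DATA + PLANE_PALETTES
                    + SPRITE_DATA + SPRITE_PALETTES,

    RENDER_BUFFER   = 304 * 304 * 2,   /* 184832: 16-bit, with 32-pixel margins */
    DISPLAY_BUFFER  = 240 * 240 * 2,   /* 115200: 16-bit, sent to the TFT */
    GRAND_TOTAL     = GRAPHICS_TOTAL + RENDER_BUFFER + DISPLAY_BUFFER,

    TEENSY_RAM      = 1024 * 1024
};

_Static_assert(GRAPHICS_TOTAL == 626688, "graphics data matches the text");
_Static_assert(GRAND_TOTAL <= TEENSY_RAM, "everything fits in 1024k");
```

That leaves 121856 bytes for sprite tables, communication buffers, stack and so on. As far as I know, the Teensy 4.0 RAM is split into two 512k regions, so in practice the pieces would also have to be divided between them.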
The 304×304-pixel framebuffer means all tiles and sprites are rendered in their entirety, which simplifies the blitting functions significantly: because no tile or sprite is larger than the 32-pixel margin, any sprite or tile that does not have all four corners within this framebuffer is not visible at all.
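A minimal sketch of that rule, with hypothetical names (`sprite_fits`, `blit_sprite`) and plain 16-bit pixels without transparency, mirroring or palettes:

```c
#include <stdbool.h>
#include <stdint.h>

#define VISIBLE 240
#define MARGIN  32
#define BUF     (MARGIN + VISIBLE + MARGIN)  /* 304 */

static uint16_t buffer[BUF][BUF];

/* (x, y) is the top left corner in display coordinates (may be negative),
 * w and h are at most 32. True if all four corners are inside the buffer. */
static bool sprite_fits(int x, int y, int w, int h)
{
    return x >= -MARGIN && y >= -MARGIN &&
           x + w <= VISIBLE + MARGIN && y + h <= VISIBLE + MARGIN;
}

/* Draw the whole sprite or nothing; there is no per-pixel clipping. */
static bool blit_sprite(const uint16_t *pixels, int x, int y, int w, int h)
{
    if (!sprite_fits(x, y, w, h))
        return false;  /* cannot overlap the visible 240x240 area */
    for (int r = 0; r < h; r++)
        for (int c = 0; c < w; c++)
            buffer[MARGIN + y + r][MARGIN + x + c] = pixels[r * w + c];
    return true;
}
```

The rejection is safe because a sprite no wider than 32 pixels that sticks out of the left margin must end before display column 0, and one that sticks out of the right margin must start at column 241 or later; the same holds vertically.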
Since Teensy 4.0 runs at 600 MHz, it has ample power to do the blitting, even allowing for completely free rotation of the tile grids and sprites, although without antialiasing, it tends to look quite jaggy. (Such blitters tend to be different, traversing either the tile map or the screen coordinates along odd directions, in which case a 256×256-pixel framebuffer would work better, as then the screen coordinates are exactly 8 bits.)
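One way to do the free rotation is to walk the screen in normal raster order and step through the source along a slanted direction using fixed-point coordinates. The sketch below assumes 16.16 fixed point, a 2048×2048-pixel source (256 tiles of 8 pixels per side) that wraps around, and a hypothetical `source_fn` callback standing in for the real tile map lookup; `checker` is just a test pattern.

```c
#include <stdint.h>

#define SCREEN     240
#define MAP_PIXELS 2048u   /* 256 tiles of 8 pixels per side */

/* Hypothetical source lookup; a real blitter would go through the tile map. */
typedef uint16_t (*source_fn)(unsigned x, unsigned y);

static uint16_t screen[SCREEN][SCREEN];

/* Test pattern: 8x8-pixel checkerboard. */
static uint16_t checker(unsigned x, unsigned y)
{
    return (((x >> 3) ^ (y >> 3)) & 1) ? 0xFFFF : 0;
}

/* u0, v0: 16.16 source coordinates of screen pixel (0, 0).
 * (dux, dvx): source step per screen pixel to the right.
 * (duy, dvy): source step per screen line down.
 * Unsigned arithmetic wraps, so negative steps work as expected. */
static void blit_rotated(source_fn src, uint32_t u0, uint32_t v0,
                         int32_t dux, int32_t dvx, int32_t duy, int32_t dvy)
{
    for (unsigned y = 0; y < SCREEN; y++) {
        uint32_t u = u0, v = v0;
        for (unsigned x = 0; x < SCREEN; x++) {
            screen[y][x] = src((u >> 16) % MAP_PIXELS, (v >> 16) % MAP_PIXELS);
            u += (uint32_t)dux;
            v += (uint32_t)dvx;
        }
        u0 += (uint32_t)duy;
        v0 += (uint32_t)dvy;
    }
}
```

For a rotation by angle θ with zoom factor s, the steps would be dux = s·cos θ, dvx = s·sin θ, duy = −s·sin θ, dvy = s·cos θ, each scaled by 65536. Since this takes the nearest source pixel without any filtering, the result is as jaggy as described.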
Reminiscing time:
A quarter of a century ago I wrote an EGA/VGA "rotator" for 16-bit x86 (8086/80186/80286) that drew a scaled and rotated picture on-screen, casting a shadow in a fixed direction (up/down/left/right). The source was a 256×256 map, one byte per pixel. A 65536-byte, 32768-entry lookup table (15-bit index: 7 bits for the current height, 8 bits for the map data) determined the color of each displayed pixel and the current height for the next pixel; this avoided conditional expressions. Essentially, the low 8 (4) bits of the result determined the final color, and the high 7 bits the current height for the next pixel. To traverse the map, I used 16-bit fixed-point integers, with 8 integer bits and 8 fractional bits. So, calculating each pixel involved just two additions (for the coordinates), one 8-bit lookup (from the source map), one 16-bit lookup (from the lookup table), and some register moves, which were almost free since x86 splits its 16-bit registers into two 8-bit registers that can be modified separately. At the beginning of each scan line there was additional work (including setting various mask registers etc. for 16-color EGA/VGA), but the inner loop was fast enough to generate a smooth-ish rotation.
If you work out the code, everything just slots together nicely to powers of two, making it much simpler than one would otherwise guess.
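In C, the inner loop looks roughly like the sketch below. The loop itself follows the description above; the contents of the lookup table do not, because how the map byte encodes terrain height and color is not something I have spelled out here. `build_lut` therefore uses a made-up encoding (height = map value / 2, 16 shades from the top nibble, shadows at half brightness, shadow edge dropping one height unit per pixel) purely so that the example does something visible.

```c
#include <stdint.h>

static uint8_t  map[65536];   /* 256x256 source, one byte per pixel */
static uint16_t lut[32768];   /* index: (height << 8) | map value */

/* Hypothetical table contents; see the assumptions above.
 * Each entry: low 8 bits = color, bits 8..14 = height for the next pixel. */
static void build_lut(void)
{
    for (unsigned h = 0; h < 128; h++)
        for (unsigned m = 0; m < 256; m++) {
            unsigned terrain = m >> 1;
            unsigned shade   = m >> 4;
            unsigned carried = h ? h - 1 : 0;  /* shadow edge sinks per pixel */
            unsigned next, color;
            if (carried > terrain) { next = carried; color = shade >> 1; }
            else                   { next = terrain; color = shade; }
            lut[(h << 8) | m] = (uint16_t)((next << 8) | color);
        }
}

/* One scan line: (u, v) are 8.8 fixed-point map coordinates, (du, dv)
 * their per-pixel steps. No conditionals inside the loop. */
static void render_line(uint8_t *out, int count,
                        uint16_t u, uint16_t v, uint16_t du, uint16_t dv)
{
    uint16_t state = 0;
    for (int i = 0; i < count; i++) {
        uint8_t m = map[(v & 0xFF00u) | (u >> 8)];  /* 8-bit lookup */
        state = lut[(state & 0x7F00u) | m];         /* 16-bit lookup */
        out[i] = (uint8_t)state;                    /* low byte is the color */
        u = (uint16_t)(u + du);
        v = (uint16_t)(v + dv);
    }
}
```

Note how the height bits of one result are already in the right place to form the high byte of the next table index, and how the integer part of `v` is already the row offset into the map. That is what slotting together to powers of two means in practice.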
One peculiarity of it was that even though you could "zoom" in, it would not lengthen the shadows, as it only applied to the XY plane and not the heights. Or, alternatively, the zoom level changed the angle of the light, with the light at angle arctan(scale). (To fix that, you'd need to recalculate the lookup table for each different zoom level.)
To change the direction of the shadows, one must also traverse the framebuffer in fractional pixel steps. Unsigned 16-bit fixed point, with 8 integer and 8 fractional bits, suffices for 240×240. If the framebuffer is double buffered, this causes no issues; otherwise worse artefacts than tearing will be visible.
In other words, you have two fixed-point coordinates for the framebuffer pixel, their constant deltas, two for the source map "pixel", their constant deltas, and a 16-bit register for the current "height" used with the 8-bit source map value to find out the new height and pixel color; for a total of 9 registers. 16-bit x86 really only had six general purpose ones (ax, bx, cx, and dx, which split into ah/al, bh/bl, ch/cl, dh/dl; and di and si, 16-bit index registers), so I never implemented that (it would have been too slow). Now, even Cortex M0 and better have at least 11 32-bit ones that GCC extended inline assembly code, in e.g. the Arduino environment, can safely use: R0-R8, R10, R11.
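The nine values map directly onto the locals of a loop like this one. It is a sketch of the variant I never implemented, so treat it as untested in spirit: `trace_line` is a made-up name, the map and lookup table are passed in with the same layout as described earlier, and the caller is assumed to keep the framebuffer coordinates inside 240×240.

```c
#include <stdint.h>

#define FB_W 240

/* Walk both the framebuffer and the source map in 8.8 fixed-point steps.
 * fx, fy, dfx, dfy: framebuffer position and step.
 * u, v, du, dv:     source map position and step.
 * state:            current height (bits 8..14) carried to the next pixel. */
static void trace_line(uint8_t *fb, const uint8_t *map, const uint16_t *lut,
                       uint16_t fx, uint16_t fy, uint16_t dfx, uint16_t dfy,
                       uint16_t u, uint16_t v, uint16_t du, uint16_t dv,
                       unsigned count)
{
    uint16_t state = 0;
    while (count--) {
        uint8_t m = map[(v & 0xFF00u) | (u >> 8)];
        state = lut[(state & 0x7F00u) | m];
        fb[(fy >> 8) * FB_W + (fx >> 8)] = (uint8_t)state;
        fx = (uint16_t)(fx + dfx);
        fy = (uint16_t)(fy + dfy);
        u  = (uint16_t)(u + du);
        v  = (uint16_t)(v + dv);
    }
}
```

With fractional steps the same framebuffer pixel can be written more than once, so the line starts and steps have to be chosen so that neighbouring lines still cover every pixel.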
(This approach cannot really be used to implement radial lighting, because rays from the center of the screen to the edge overlap; you'd do a lot of extra calculations. A variant that uses a temporary buffer for a row of heights can do a cone with central angle at most 90°, though; there the trick is that the temporary buffer is either expanded to the size of the next row, or traversed using two fixed-point registers.)