Thanks for your pointers.
I might have glossed over all the setup that the memory<->core highway needs.
No, on a good 32-bit MCU peripherals will use 32-bit registers. And byte and half word access would be prohibited or results undefined.
I suppose that if we have:
logic [31:0] reg0x00 = { DATA[7:0], STATUS[7:0], CTRLA[7:0], CTRLB[7:0] };
With DATA[7:0] clearing an interrupt flag upon reading, we would have some trouble when reading the status flag or when writing to the CTRL registers: every 32-bit read of that word also touches DATA, and a read-modify-write of a CTRL field starts with such a read.
We would have to spend a bit of address space (which is not a scarce resource on 32-bit!) to space out the registers, with one 8-bit register at the beginning of each 32-bit word.
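To make the side-effect problem concrete, here is a small software model in C. It is only a sketch: `periph_t`, `packed_read`, `spaced_read` and the offsets 0x0/0x4/0x8/0xC are names and values I made up for illustration, not any real MCU's register map.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical peripheral model: reading DATA clears the interrupt flag. */
typedef struct {
    uint8_t data, status, ctrla, ctrlb;
    bool    irq_flag;
} periph_t;

/* Packed layout: one 32-bit word = { DATA, STATUS, CTRLA, CTRLB }.
 * With word-only access, every read includes DATA, so the flag is
 * cleared even if the caller only wanted STATUS. */
static uint32_t packed_read(periph_t *p)
{
    p->irq_flag = false; /* side effect of touching DATA */
    return ((uint32_t)p->data   << 24) |
           ((uint32_t)p->status << 16) |
           ((uint32_t)p->ctrla  <<  8) |
            (uint32_t)p->ctrlb;
}

/* Spaced layout (assumed offsets): one 8-bit register per 32-bit word. */
enum { REG_DATA = 0x0, REG_STATUS = 0x4, REG_CTRLA = 0x8, REG_CTRLB = 0xC };

static uint32_t spaced_read(periph_t *p, uint32_t offset)
{
    switch (offset) {
    case REG_DATA:   p->irq_flag = false; return p->data; /* only here */
    case REG_STATUS: return p->status;
    case REG_CTRLA:  return p->ctrla;
    case REG_CTRLB:  return p->ctrlb;
    default:         return 0; /* unmapped offset in this model */
    }
}
```

In the spaced layout, polling `REG_STATUS` leaves the flag alone, at the cost of 16 bytes of address space instead of 4.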
Now die size depends on many factors. If you have a small design relative to the CMOS process node, you are quickly pad-limited. Meaning that beyond a small number of pads, you will need a lot of extra die area mostly empty. Which is one reason those very cheap modern 8-bitters only have a few GPIOs. And which is also why the complexity of the design and the number of IOs are often linked, something you'll see in particular with FPGAs.
Mostly empty? I thought it would get filled with extra peripheral instances, but this would get in the way of having the same feature set across multiple pin counts.
This means that as soon as there are more than just a few pins, there is a rather large die size budget to fill. Interesting.
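A back-of-the-envelope way to see this, assuming a square die with a single ring of pads along its edges and ignoring corners: the side grows linearly with the pad count, so the enclosed area grows with its square. The function names and the pad pitch passed in are hypothetical and not taken from any real process.

```c
#include <stdint.h>

/* Minimum die side in micrometres when the pad ring sets the size:
 * each of the four edges carries about a quarter of the pads. */
static uint32_t pad_limited_side_um(uint32_t n_pads, uint32_t pad_pitch_um)
{
    uint32_t per_edge = (n_pads + 3) / 4; /* round up */
    return per_edge * pad_pitch_um;
}

/* Area enclosed by that pad ring, in square micrometres. */
static uint64_t pad_limited_area_um2(uint32_t n_pads, uint32_t pad_pitch_um)
{
    uint64_t side = pad_limited_side_um(n_pads, pad_pitch_um);
    return side * side;
}
```

With an assumed pitch of 100 um, going from 8 to 64 pads multiplies the pad count by 8 but the enclosed area by 64, which is the budget that either gets filled with logic or stays empty.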
[EDIT] fixed the formatting, sorry!