For input, you can use 8-bit port reads and small lookup tables. If the input pins are spread over P port bytes, the tables take 256P bytes for up to 8 parallel bits, 512P bytes for up to 16 parallel bits, and 1024P bytes for up to 32 parallel bits.
For example, if A, B and C correspond to some port data input bytes, with LA[], LB[], and LC[] the three 256-entry lookup tables in rom/flash, the actual value is
v = LA[A] | LB[B] | LC[C]
In each lookup table entry, a bit of v is set if the corresponding input pin bit is set in the index (the port byte value), and clear otherwise. If v is 8-bit, then each table takes 256 bytes, for a total of 768 bytes of lookup; if 16-bit, the total is 1536 bytes; if 32-bit, 3072 bytes.
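A minimal sketch of the input side in C, for the 8-bit case. The pin map structure, the function names, and building the tables at run time in RAM are my assumptions for illustration only; on real hardware you would more likely generate the tables offline and put them in rom/flash as const arrays.

```c
#include <stddef.h>
#include <stdint.h>

#define IN_PORT_BYTES 3   /* port input bytes A, B, C */

/* Hypothetical pin map entry: where one bit of v is wired. */
struct in_pin {
    uint8_t port_byte;   /* 0 = A, 1 = B, 2 = C */
    uint8_t port_bit;    /* bit within that port byte, 0..7 */
};

/* in_lut[0] is LA[], in_lut[1] is LB[], in_lut[2] is LC[].
   Built in RAM here only to keep the sketch self-contained. */
static uint8_t in_lut[IN_PORT_BYTES][256];

/* map[i] tells where bit i of v is wired; nbits must be at most 8. */
static void build_input_luts(const struct in_pin *map, size_t nbits)
{
    for (size_t p = 0; p < IN_PORT_BYTES; p++) {
        for (unsigned x = 0; x < 256; x++) {
            uint8_t v = 0;
            for (size_t i = 0; i < nbits; i++) {
                /* Bit i of v is set if its pin is in port byte p
                   and that pin is high in port byte value x. */
                if (map[i].port_byte == p && ((x >> map[i].port_bit) & 1u))
                    v |= (uint8_t)(1u << i);
            }
            in_lut[p][x] = v;
        }
    }
}

/* v = LA[A] | LB[B] | LC[C]; bits of unrelated pins are ignored. */
static uint8_t gather_input(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)(in_lut[0][a] | in_lut[1][b] | in_lut[2][c]);
}
```

For 16-bit or 32-bit v, only the table element type and the return type would change.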
For output, it gets slightly more complicated, but you can use a similar approach; you just need to shadow the PDOR (port data output register).
Essentially, you first split the output value v into bytes. Arbitrarily choosing little-endian byte order:
b0 = v & 255
b1 = (v >> 8) & 255
b2 = (v >> 16) & 255
Next, we look up each byte in its own table (these are the inverse of the above LA[] etc.), noting that the table values here need to be at least 24-bit, because the output is spread over 3 PDOR bytes:
o = L0[b0] | L1[b1] | L2[b2]
Then, we split that output into bytes again, once more arbitrarily picking little-endian byte order:
o0 = o & 255
o1 = (o >> 8) & 255
o2 = (o >> 16) & 255
Finally, we OR each output byte with the shadowed state of the corresponding PDOR byte, i.e. a copy that has bits set only for the unrelated output pins that are (or need to be) currently high, and the parallel output bits clear:
A = s0 | o0
B = s1 | o1
C = s2 | o2
Above, L0[], L1[], and L2[] are all 32-bit with 256 entries, 1024 bytes each, for a total of 3072 bytes of rom/flash needed for lookups.
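The output steps above could be sketched in C roughly like this. The pin map structure and the function names are made up for illustration, the tables are built at run time only to keep the example self-contained, and the result is written to a plain array rather than to real PDOR registers. It also assumes the shadow bytes have the parallel output bits clear.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pin map entry: where one bit of v is wired. */
struct out_pin {
    uint8_t port_byte;   /* PDOR byte index: 0 = A, 1 = B, 2 = C */
    uint8_t port_bit;    /* bit within that PDOR byte, 0..7 */
};

/* out_lut[0] is L0[], out_lut[1] is L1[], out_lut[2] is L2[]. */
static uint32_t out_lut[3][256];

/* map[i] tells where bit i of v is wired; nbits must be at most 24. */
static void build_output_luts(const struct out_pin *map, size_t nbits)
{
    for (size_t k = 0; k < 3; k++) {          /* k-th byte of v */
        for (unsigned x = 0; x < 256; x++) {  /* value of that byte */
            uint32_t o = 0;
            for (size_t i = 0; i < 8; i++) {
                size_t bit = 8 * k + i;       /* bit number within v */
                if (bit < nbits && ((x >> i) & 1u))
                    o |= (uint32_t)1 << (8u * map[bit].port_byte
                                         + map[bit].port_bit);
            }
            out_lut[k][x] = o;
        }
    }
}

/* Computes the three PDOR bytes for value v.
   shadow[] holds the unrelated pins' state, parallel bits clear. */
static void scatter_output(uint32_t v, const uint8_t shadow[3],
                           uint8_t pdor[3])
{
    /* o = L0[b0] | L1[b1] | L2[b2] */
    uint32_t o = out_lut[0][v & 255]
               | out_lut[1][(v >> 8) & 255]
               | out_lut[2][(v >> 16) & 255];

    pdor[0] = (uint8_t)(shadow[0] | (o & 255));          /* A = s0 | o0 */
    pdor[1] = (uint8_t)(shadow[1] | ((o >> 8) & 255));   /* B = s1 | o1 */
    pdor[2] = (uint8_t)(shadow[2] | ((o >> 16) & 255));  /* C = s2 | o2 */
}
```

On an actual microcontroller, the last three assignments would be the byte-wide writes to the port data output registers.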
I researched this when I was looking into 18-bit parallel output for ILI9341 and similar 240×320 pixel full-color display modules.
For example, if the output pins are spread over 5 PDOR register bytes (and each value is 18-bit, or 3 bytes), the output lookup tables need 5×3×256 = 3840 bytes. (The 5 bytes per entry can be efficiently split into two sub-tables, one with 32-bit values and the other with 8-bit values. Otherwise, you may need to expand to e.g. two 32-bit tables, taking up 2×3×1024 = 6144 bytes, as you'll want to minimize the number of rom/flash lookups per output word, since they tend to be slower than RAM accesses on 32-bit ARMs.)
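As a rough sketch of the 5-byte case with the 32-bit plus 8-bit split, assuming an 18-bit value: the names below and the pin_pos[] encoding (8 × PDOR byte index + bit within the byte, so 0..39) are hypothetical, and the shadow OR step is left out to keep it short.

```c
#include <stdint.h>

#define VALUE_BITS 18

/* Two sub-tables per byte of v:
   3 * 256 * 4 + 3 * 256 * 1 = 3840 bytes in total. */
static uint32_t lut_lo[3][256];   /* bits for PDOR bytes 0..3 */
static uint8_t  lut_hi[3][256];   /* bits for PDOR byte 4 */

struct pdor5 {
    uint32_t lo;   /* PDOR bytes 0..3, little-endian */
    uint8_t  hi;   /* PDOR byte 4 */
};

/* pin_pos[i] = 8 * (PDOR byte index) + (bit in that byte), for bit i of v. */
static void build_split_luts(const uint8_t pin_pos[VALUE_BITS])
{
    for (unsigned k = 0; k < 3; k++) {
        for (unsigned x = 0; x < 256; x++) {
            uint64_t o = 0;   /* 40-bit output pattern */
            for (unsigned i = 0; i < 8 && 8 * k + i < VALUE_BITS; i++) {
                if ((x >> i) & 1u)
                    o |= (uint64_t)1 << pin_pos[8 * k + i];
            }
            lut_lo[k][x] = (uint32_t)(o & 0xFFFFFFFFu);
            lut_hi[k][x] = (uint8_t)(o >> 32);
        }
    }
}

/* Six table lookups per output word: three 32-bit, three 8-bit. */
static struct pdor5 scatter5(uint32_t v)
{
    const unsigned b0 = v & 255, b1 = (v >> 8) & 255, b2 = (v >> 16) & 255;
    struct pdor5 r;

    r.lo = lut_lo[0][b0] | lut_lo[1][b1] | lut_lo[2][b2];
    r.hi = (uint8_t)(lut_hi[0][b0] | lut_hi[1][b1] | lut_hi[2][b2]);
    return r;
}
```

Whether this split or padded two-word entries end up faster depends on the particular chip and its flash access timing, so it is worth measuring both.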
If anyone is interested, I can show some example C code, including generating the tables given pin mapping etc. However, I suspect this should be obvious for many; it is by no means my own original invention. Just re-discovered on my own.