you want fast? use assembly.
If the compiler isn't very smart or doesn't optimize well, writing directly to the register won't be enough.
I know very little about the Atmel architecture, but on PICs, from the enhanced midrange family onward, you can access both RAM and linear data memory, and even program memory (the lower word, so the 8/16/32-bit constant you want to fetch). You also have instructions with pre- and post-increment of the address pointer.
so for example, on a pic16
I have already loaded FSR0H and FSR0L with the base address.
PORTB is all outputs.
The table is 200 elements for simplicity
Register _counter has already been loaded with 200 (DECFSZ runs the loop once per count, so 200 elements need a preload of 200, not 199)
loop:
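MOVIW FSR0++ ; Load the array data into W and post-increment the address
MOVWF LATB ; Write W to the port latch
DECFSZ _counter,F ; Decrement the counter, skip the GOTO when it reaches zero
GOTO loop ; Not done yet, go around again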
Four instructions, five instruction cycles per pass (GOTO takes two cycles).
Obviously, if the array is bigger than 256 elements the situation gets much worse, since you have to decrement and check a 16-bit number on an 8-bit MCU.
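If C is easier to read than PIC assembly, here is a rough host-side model of what that loop does and how the cycles add up. It is only a sketch: `latb_mock` and `stream_table_model` are made-up names, and the cycle costs assume the table sits in data RAM (one cycle each for MOVIW and MOVWF, two for a taken GOTO or for a DECFSZ that skips).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the LATB register so this runs on a PC. */
static volatile uint8_t latb_mock;

/* Models the PIC16 loop and returns the instruction cycles it would use.
 * count is the DECFSZ preload: 1..255 passes, and 0 means 256 passes. */
static uint32_t stream_table_model(const uint8_t *table, uint8_t count)
{
    uint32_t cycles = 0;
    for (;;) {
        latb_mock = *table++;   /* MOVIW FSR0++ (1) + MOVWF LATB (1) */
        cycles += 2;
        if (--count == 0) {     /* DECFSZ hits zero and skips the GOTO (2) */
            cycles += 2;
            break;
        }
        cycles += 1 + 2;        /* DECFSZ without skip (1) + GOTO (2) */
    }
    return cycles;
}
```

Under those assumptions a 200-element table costs 5 * 200 - 1 = 999 instruction cycles, because the last pass skips the GOTO.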
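One common way to handle the larger case (not necessarily what you would pick, just a sketch) is to split the 16-bit count into two 8-bit counters and nest two DECFSZ/GOTO pairs. The C below is a host-side model of that idea; `split_count`, `split_count_for` and `passes_for` are illustrative names, and the preload rule is my own working, so check it against your part before trusting it.

```c
#include <stdint.h>

/* Preloads for two 8-bit counters covering n passes (n = 1..65535). */
typedef struct {
    uint8_t lo;
    uint8_t hi;
} split_count;

static split_count split_count_for(uint16_t n)
{
    split_count c;
    c.lo = (uint8_t)(n & 0xFFu);
    /* A partial first block of lo passes needs one extra outer pass.
     * hi may wrap to 0, which the 8-bit decrement treats as 256. */
    c.hi = (uint8_t)((n >> 8) + (c.lo != 0 ? 1u : 0u));
    return c;
}

/* Models: DECFSZ lo,F / GOTO loop / DECFSZ hi,F / GOTO loop */
static uint32_t passes_for(split_count c)
{
    uint32_t passes = 0;
    do {
        do {
            passes++;               /* the loop body would run here */
        } while (--c.lo != 0);      /* inner DECFSZ + GOTO */
    } while (--c.hi != 0);          /* outer DECFSZ + GOTO */
    return passes;
}
```

In this arrangement the inner loop is the same as the 8-bit version, so the extra DECFSZ/GOTO pair is only paid once every 256 passes.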
On a dsPIC it is even easier, because you can use any of the working registers as a pointer and you get 16-bit arithmetic.
W0 is the pointer, W1 stores the data from memory.
loop:
DO #9999,loop_end ; runs the loop 10000 times. The count is any 14-bit number, plus 1, so at most 16384 times. The count can also come from a working register.
MOV [W0++],W1
loop_end:
MOV W1,LATB
Two instructions, with no loop overhead*. Neat, huh? The only prerequisite: the address pointer MUST be an even number, or an address error trap will be generated.
An address error trap will also be generated if the instruction tries to fetch data from an unimplemented location.
Working with bytes instead of words is only a matter of using the .B suffix on the instruction, like so:
MOV.B [W0++],W1
In that case W0 can also be an odd number. Only the lower half of the destination register will be modified.
Of course, if the number of repetitions is greater than 16k, or if you may need to exit the loop at any moment, you can and should use the check-for-condition method instead.
I am sure you can conjure up something similar with your MCU of choice.
*On a dsPIC33E, if the first instruction fetches data from anywhere other than the SFR area, it takes two clock cycles instead of one.
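The preconditions for the DO loop (pass count within range, even address for word access, any address for .B access) can be checked before you commit to it. This is a host-side C sketch of those rules as stated in this post; `do_loop_literal` and `DO_MAX_PASSES` are hypothetical names, not part of any Microchip header.

```c
#include <stdint.h>
#include <stdbool.h>

/* 14-bit literal plus 1, the limit quoted in this post. */
#define DO_MAX_PASSES 16384u

/* Returns true and stores the DO literal if the loop can be set up.
 * addr is the start address that would go into W0. */
static bool do_loop_literal(uintptr_t addr, uint32_t passes,
                            bool byte_mode, uint16_t *lit)
{
    if (passes == 0 || passes > DO_MAX_PASSES) {
        return false;               /* count does not fit the literal */
    }
    if (!byte_mode && (addr & 1u) != 0) {
        return false;               /* word access needs an even address */
    }
    *lit = (uint16_t)(passes - 1u); /* DO runs literal + 1 times */
    return true;
}
```

If the check fails, fall back to an ordinary loop with a compare and branch.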