Update. Setting the PLL to 16 MHz works.
I also got curious and went looking for ASM code that drives the WS281x. Found something that looks promising. Using Atmel Studio 7, I have it sort of working at 8 MHz (needs some more attention though). Along the way I was reminded why I dislike Atmel Studio 7. Intensely dislike. It seems that if you mix ASM and C++, the assembler I/O port addresses come out wrong. I found this out by debugging with an Atmel-ICE, and I had to wade through the include files to discover why. What a load of crap. Not sure how someone without a debugger could have figured it out. Grrrrr.
What I meant was you should be able to get it working at 8 MHz with assembler. The libraries are very inefficient. I recall from a long evening watching YouTube videos on the topic that the "FastLED" library is the better one, but it's still slow compared to toggling the data pin yourself in ASM.
Consider that, compiler optimisation (inlining) aside, a function call in C involves:
* Push all local variables and pointers currently in registers onto the stack "N instructions"
* Push the return instruction pointer (IP) onto the stack "1 instruction"
* Push all function parameters onto the stack "N instructions"
* Copy the function address into the IP (JMP) "1 instruction"
The function then:
* Pops the parameters off the stack
* Executes
* Pops the return address off the stack
* Pushes any return value onto the stack
* Copies the return address into the IP (JMP)
The calling code then:
* Pops the return value (if any) off the stack
* Pops all its local variables back off the stack
* Continues
If you have a dozen or so local variables in registers and a half dozen parameters this single function call could result in 30 instructions, not including the code within the function itself.
In C++, if you are using classes, it gets even worse: there are vtable lookups and whatnot to determine the correct implementation for a potentially polymorphic (virtual) method call.
In terms of efficiency you can see that, from a development best-practice point of view, dividing your code up into dozens and dozens of small nested functions to make it easier to maintain comes at a massive cost. In normal enterprise business applications it doesn't really matter, because they run on multi-core processors clocked at 2.5 GHz and above, with pipelining and whatnot that can retire multiple instructions per clock cycle. It does matter in low-latency applications such as stock exchange systems, where microseconds cost millions and the speed of light is inconvenient.
In an average enterprise Java application the ratio of Java code lines to assembler instructions could be one to a million. If written in assembler the code could be made 100,000 times faster. But in Java world things are measured in milliseconds, not microseconds, and if you need microsecond timing you have to jump through all manner of hoops and optimise the JVM, heap, garbage collector, etc. etc. etc.
C compilers (and Java compilers) will optimise a lot of this for you. If they determine that a full function call would be inefficient they will inline the function's code directly, avoiding the call overhead altogether. (In Java world the JIT compiler will inline/sequence large nested calls and cache the result.) You can also add the "inline" keyword yourself, though strictly it only requests inlining, at the cost of duplicating code in program address space. Obviously the compiler needs to determine that you aren't tinkering with the stack in any way, or bad stuff could happen. When running GCC at optimisation level 3, for example, it is incredibly paranoid about your code and will throw errors and warnings if it's even slightly unsure about what you are doing with memory addresses. Cast to (void*) or (char*), for example, and the compiler will not touch it, and it will throw frame-pointer errors if you attempt to optimise stack frames, as it can no longer determine exactly what is at that memory address and what you have done with it.
If you think you can get around this by using inline assembler, think again. The C compiler is not that easily fooled, and the optimisations occur after your code has been assembled. The compiler will also work around your ASM code, because it is using the registers itself to do stuff and needs to avoid collisions. It might also use the stack to save and restore registers so that you can use them in your ASM. Sometimes it can't work around your hard-coded assembler, and a register collision may well crash your program or give you rubbish data. An example is you setting a register that the subsequent compiled C code simply overwrites. The register keyword can help, but IIRC it is "advisory" and not enforced, and you don't get to choose which register is used.
GCC is a complicated beast best left to GCC aficionados. I'm sure I have some of it wrong. Inline assembler comes with risks.
As an aside, in multi-programming systems such as PCs the OS scheduler can and will suspend your code to handle an interrupt or to allow another process to execute. This involves copying the full CPU state and process state into memory, including the lookup buffers (TLBs) in the memory management unit, then copying the new process's full descriptor, CPU state and memory maps back into the CPU/MMU. This can cost a microsecond or more. If you need to get round it you can use processor core affinity to dedicate a core to your process.
I'm not sure how interrupts are handled in MCUs, but I doubt there is a full context switch as above.