If you want cycle accuracy, C is indeed not the right tool. Even when you go through the steps to get the binary that does the thing right, you need to keep using that binary (and not the C source), because any little change in the compiler or linker can/will throw the timing off.
Me, I like using GCC/IntelCC/Clang extended asm for such critical parts. It differs from external assembly sources in that the extended asm construct explicitly tells the C compiler which registers are inputs and outputs (and, if you want, even lets the C compiler choose the exact registers used), and which registers or memory get clobbered. When used in an inlined function, the compiler can adjust the register use in the assembly to best fit the surrounding code, and vice versa, because it knows exactly which registers etc. the extended asm uses.
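As a minimal sketch of what I mean by extended asm, here is a 32-bit rotate-left. The function name `rotl32` is just something I made up for illustration, and the assembly path is x86-only (AT&T syntax); other targets fall back to plain C. The point is the operand lists: the compiler picks the register for `value`, is told that the count must be in `cl`, and is told the flags get clobbered.

```c
#include <stdint.h>

/* Hypothetical example: rotate a 32-bit value left by 'count' bits. */
static inline uint32_t rotl32(uint32_t value, unsigned char count)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+r": 'value' is both input and output, in a register the compiler chooses.
       "c" : 'count' must be in the C register (cl), as the ROL instruction requires.
       "cc": the instruction modifies the flags register. */
    __asm__ ("roll %%cl, %0"
             : "+r" (value)
             : "c" (count)
             : "cc");
    return value;
#else
    /* Portable fallback for other architectures and compilers. */
    count &= 31;
    return (value << count) | (value >> ((32 - count) & 31));
#endif
}
```

Because the asm statement is not marked volatile and only touches its listed operands, the compiler is free to inline it, reorder it, or drop it entirely if the result is unused.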
Things like timing-critical interrupt handlers are better written as external assembly files. The C compiler really only needs the handler's address, which is available at link time, and can be exported to C using a simple extern declaration. (I like to rely on other ELF object file properties, like the section attribute, to make the build machinery smoother and more capable, but still robust wrt. source code changes.)
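To give an idea of the kind of thing the section attribute allows, here is one possible sketch; it is my illustration for this post, not a description of any particular project. Each handler registers itself by placing a function pointer in an ELF section named `handlers` (a name I picked), and the code walks the section using the `__start_handlers` and `__stop_handlers` symbols that GNU ld and LLVM lld generate for sections whose names are valid C identifiers. The handlers here are plain C stand-ins; a handler written in an external assembly file would only need an extern declaration before being registered the same way.

```c
#include <stddef.h>

typedef void (*handler_fn)(void);

/* Put a pointer to 'func' into the ELF section "handlers".
   'used' keeps the compiler from discarding the otherwise unreferenced variable. */
#define REGISTER_HANDLER(func) \
    static handler_fn const handler_ptr_##func \
        __attribute__((section("handlers"), used)) = func

/* Section bounds, provided by the linker (GNU ld / lld on ELF targets). */
extern handler_fn const __start_handlers[];
extern handler_fn const __stop_handlers[];

/* Stand-in handlers. An assembly-file handler would be declared like
       extern void timer_isr(void);
   and registered with REGISTER_HANDLER(timer_isr); */
static int handler_calls = 0;
static void on_tick(void) { handler_calls += 1; }
static void on_uart(void) { handler_calls += 10; }

REGISTER_HANDLER(on_tick);
REGISTER_HANDLER(on_uart);

/* Number of registered handlers, across all object files linked in. */
static size_t handler_count(void)
{
    return (size_t)(__stop_handlers - __start_handlers);
}

/* Call every registered handler; the order within the section is not guaranteed. */
static void run_all_handlers(void)
{
    for (const handler_fn *h = __start_handlers; h != __stop_handlers; h++)
        (*h)();
}
```

Adding a handler is then a one-line change in whichever source file defines it, with no central table to keep in sync.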
For single-instruction multiple-data or SIMD stuff, I like using GCC/IntelCC/Clang vector extensions via the vector_size attribute (on the variable type). Addition, subtraction, and multiplication on vector variables do the corresponding component-wise operation, and for the rest, the compiler provides built-in intrinsics (and a standardized set for x86-64 in <immintrin.h>). The basic vector extensions work regardless of hardware support, so such code is actually portable across hardware. For example, when the hardware only supports, say, two components, the compiler transparently and quite efficiently uses two registers for a vector with four components. For the same reasons as for extended asm, the compiler also generates quite efficient code for the intrinsics; typically much better than hand-written assembly, if we include the compiler-generated surrounding code in the consideration.
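A minimal sketch of the vector extensions, assuming GCC or Clang; the type name `vec4f` and the helper functions are made up for this example. Note that there are no intrinsics and no architecture-specific headers involved, only ordinary operators on a type with the vector_size attribute.

```c
/* Four floats in one 16-byte vector type (GCC/Clang vector extension). */
typedef float vec4f __attribute__((vector_size(16)));

/* Component-wise multiply-add: result[i] = a[i] * b[i] + c[i]. */
static inline vec4f madd4(vec4f a, vec4f b, vec4f c)
{
    return a * b + c;
}

/* Horizontal sum; individual components are accessed with array subscripts. */
static inline float hsum4(vec4f v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Usage looks like ordinary C: initialize with `vec4f a = { 1.0f, 2.0f, 3.0f, 4.0f };` and pass the vectors around by value. What the compiler does with it (one SIMD register, several, or scalar code) depends on the target and the compiler options.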
Arduino is a somewhat toy-like/silly/coddling environment, designed to make quick development easy for non-programmers and non-technical people.
As usual, a lot of Arduino stuff is quite crappy, but there are some very nice nuggets in there among the cores and support libraries. So, it's not all bad.
I happily use it for quick prototyping, although I am quite familiar with the cores (Arduino core code for the specific microcontroller chip) and the libraries I use, as I look through their source code before I rely on them even a tiny little bit. (That is the difference between "I think there is a library for that" and "I use this library for that" in my case; it involves several hours of source code examination.)
In all this, the ability to read and understand assembly code is quite useful and rather important. If you are like me, you rarely write any assembly from scratch, but you end up reading the important, strange, or critical parts of compiler-generated assembly at least once every week. I myself often end up examining the generated assembly for wildly different hardware (x86-64, MIPS, AVR, some ARM Cortex) to find out whether a particular C expression has issues on any of them when compiled with a small set of compilers and compiler versions.
That is, I am basically never interested in what code ends up being optimal; I am interested in what code performs acceptably well in all situations, and what code patterns have issues with specific compilers and/or specific hardware architectures – the latter being more important than the former. Even when I'm using the nonstandard C compiler and linker features above, I like to know what weaknesses the pattern I am applying has. That way, not only do I have that tool in my toolbox, but it also has a small note attached that lists the known risks/deficiencies/weaknesses in addition to its strengths, and I can rummage quickly through those notes to find a tool that suits a particular situation.
(For me, this also explains why I dislike language-lawyerism: from this point of view, the language-lawyers are claiming the text of the standard is more important than those notes based on real-life observations. The standard is better than nothing, but can never override the behaviour observed in the real world.)
Whether the same approach works for others depends on their personal strengths and what they get paid for, I believe.