a questionable analysis of a very particular case in a particular context with a particular compiler.
Yup. Microbenchmarking becomes more difficult the shorter the target code is and the fewer clock cycles it takes, especially so on architectures with caches, multiple execution units, or multiple arithmetic-logic units: the surrounding code, including the elapsed-time measurement itself, affects the target code too much for the measurements to be useful. At some point it becomes nonsensical, because the measurement uncertainty and biases exceed the duration being measured.
Much better – but still microbenchmarking – is to implement the function in more than one way, run each implementation in a loop over precomputed inputs, and measure the time taken to handle all of the inputs.
This is then repeated a few thousand times, and the durations recorded. The most useful measure is not the average (because the error in timing under normal operating systems is always positive: the task may be interrupted by other stuff), but the median (or some other percentile). The minimum is only of academic interest, in the sense that it is the time taken "when all the stars happen to align": an unrealistic minimum that may occur, but cannot be relied upon to occur. The median is also easy to explain: in half the cases, the time taken was at most the median.
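As a rough illustration of that setup, here is a minimal sketch in C, assuming a POSIX system with clock_gettime(); parse_v1() and parse_v2() are hypothetical stand-ins for the two implementations being compared, and the input and repeat counts are arbitrary:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define INPUTS  1000   /* precomputed inputs per run */
#define RUNS    5000   /* how many times the whole loop is repeated */

/* Hypothetical implementations under comparison. */
static int parse_v1(const char *s) { return (int)strtol(s, NULL, 10); }
static int parse_v2(const char *s)
{
    int v = 0;
    while (*s >= '0' && *s <= '9')
        v = v * 10 + (*s++ - '0');
    return v;
}

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Time one full pass over all precomputed inputs. */
static double time_one_run(int (*fn)(const char *), char inputs[][16])
{
    struct timespec t0, t1;
    volatile int sink = 0;          /* keep the results "used" */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < INPUTS; i++)
        sink += fn(inputs[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return elapsed_ns(t0, t1);
}

int main(void)
{
    static char inputs[INPUTS][16];
    static double v1[RUNS], v2[RUNS];

    /* Precompute the inputs so their generation is not measured. */
    srand(12345);
    for (size_t i = 0; i < INPUTS; i++)
        snprintf(inputs[i], sizeof inputs[i], "%d", rand());

    for (size_t r = 0; r < RUNS; r++) {
        v1[r] = time_one_run(parse_v1, inputs);
        v2[r] = time_one_run(parse_v2, inputs);
    }

    /* Sort the recorded durations; the middle one is the median. */
    qsort(v1, RUNS, sizeof v1[0], cmp_double);
    qsort(v2, RUNS, sizeof v2[0], cmp_double);
    printf("parse_v1: median %.0f ns per %d inputs (min %.0f ns)\n",
           v1[RUNS/2], INPUTS, v1[0]);
    printf("parse_v2: median %.0f ns per %d inputs (min %.0f ns)\n",
           v2[RUNS/2], INPUTS, v2[0]);
    return 0;
}
```

Sorting the recorded durations makes the median (and any other percentile) trivial to pick out, and also shows the minimum for comparison.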
Proper benchmarking involves taking a real-world task and data set, and processing it using different implementations. Then, of course, you aren't benchmarking a single operation, but the implementations as a whole.
Premature optimization, like trying to make the fastest atoi() you can before making sure it is actually a limiting bottleneck in the task at hand, is an extremely common mistake, especially among programmers without sufficiently wide experience: they spend a lot of time "optimizing" something that has no effect on the end result, essentially wasting valuable time. Most often, true optimization means avoiding having to do that thing altogether, and achieves an order of magnitude greater savings.
An example of that is reading lots of unsorted data from storage, when you need it in order, with a human waiting for the operation to complete. You can discuss sort algorithms as much as you want, but instead of reading all the data and then sorting it, you can get the task done faster (using slightly more computing resources) by sorting the data as it becomes available, for example by reading each data entry into a binary heap or a (balanced) tree. This is less important now with extremely fast SSD drives, but with e.g. SD cards (often used for removable storage on microcontrollers and embedded devices, even on phones) and other storage media with limited transfer rates, the insertion (online sorting of each new entry into the data structure) takes place during time that would otherwise be wasted waiting for new data to arrive. The end result on these slower media is that even though you end up using more CPU cycles, the data is sorted and ready essentially as soon as all of it has been read, whereas the fastest offline sorting algorithm is only just starting its work at that point.
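To make the idea concrete, here is a rough sketch of the "sort while reading" approach in C, using a binary min-heap; read_next_record() is a stand-in for whatever blocking read fetches one record at a time from the slow medium (here it simply reads integers from standard input):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { long key; /* plus whatever payload the record carries */ } record;

static record *heap = NULL;
static size_t  heap_size = 0, heap_cap = 0;

/* Append a record and sift it up to its place (min-heap on key). */
static void heap_push(record r)
{
    if (heap_size >= heap_cap) {
        heap_cap = heap_cap ? heap_cap * 2 : 1024;
        heap = realloc(heap, heap_cap * sizeof *heap);
        if (!heap) { perror("realloc"); exit(EXIT_FAILURE); }
    }
    size_t i = heap_size++;
    heap[i] = r;
    while (i > 0 && heap[(i - 1) / 2].key > heap[i].key) {
        record tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* Remove and return the smallest record, sifting the last one down. */
static record heap_pop(void)
{
    record top = heap[0];
    heap[0] = heap[--heap_size];
    size_t i = 0;
    for (;;) {
        size_t l = 2*i + 1, r = 2*i + 2, m = i;
        if (l < heap_size && heap[l].key < heap[m].key) m = l;
        if (r < heap_size && heap[r].key < heap[m].key) m = r;
        if (m == i) break;
        record tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
    return top;
}

/* Stand-in for the slow data source: reads one integer per line from stdin.
   On a real system this would be the blocking read from the SD card or
   other slow medium; returns 0 at end of data. */
static int read_next_record(record *out)
{
    return scanf("%ld", &out->key) == 1;
}

int main(void)
{
    record r;

    /* Each insertion happens while we would otherwise be idle, waiting
       for the next record to arrive. */
    while (read_next_record(&r))
        heap_push(r);

    /* By the time the last record has been read, extraction in sorted
       order can start immediately. */
    while (heap_size > 0) {
        record next = heap_pop();
        printf("%ld\n", next.key);
    }
    free(heap);
    return 0;
}
```

A balanced tree works just as well; the point is that the per-record work happens during time that would otherwise be spent waiting for the next record to arrive from the slow medium.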