Especially if the processor has a cache
Obviously, a real measurement will always differ from the prediction due to interrupts, pipeline effects, branch prediction, cache, wait states, DMA blocking the bus or RAM, etc.
Anyway, looking at the code makes it possible to estimate the number of cycles needed more reliably. As stated above, even looking at the C code for a "fast" tangent implementation is enough to say it will need more than 100 cycles. With the actual source code the prediction will be better, and with the ASM code it can be quite accurate, as long as you ignore the complex runtime effects listed above.
Even 20 years ago, measurements on an i486 with its tiny cache, doing nothing else, showed a 10:1 difference between mean and maximum times. Modern processors have much faster clocks, but DRAM latency has hardly changed. Processors now have much bigger caches and depend on them more heavily to reduce the average memory latency. Naturally, caches cannot change the maximum latency. Hence the maximum:mean ratio has increased significantly, and predicted execution times are even less valid than before.
Remember the truism: "cache is the new RAM, RAM is the new disk".