Or you can do what I did just a few days ago: turn off the dynamic branch prediction, copy the program into ITCM, and verify the timing on an oscilloscope. Yes, I needed cycle-accurate. Yes, it is cycle-accurate. Yes, it works exactly the same every time.
OK, you slugged the system. That's sufficient for the timing, but wasteful and inelegant.
Good guess, but just wrong.
The system has a small cycle-accurate bit, but before and after that it does normal processing, which may be written by others. Branch prediction is turned off for the cycle-accurate bit only, which is mostly unrolled anyway, so there is no significant performance loss. For this bit, only worst-case performance matters. I can't share the exact details, but let's say it needs to generate a 3.000MHz signal (with some other details). With branch prediction enabled, it initially generates a 3.000MHz signal for one cycle, then something like 3.050MHz. Nothing is "crippled" by forcing it to generate the desired cycle-accurate signal at the worst-case performance. (You could achieve the same result keeping the branch predictor enabled, but I feel it's more elegant to turn off the error source than to "trick" it, even if the "tricking" could be proven to work consistently.)
After the small cycle-accurate bit has run, all CPU features are back on. Now an M7 core is a lot more capable, performance-wise, than an XMOS core, thanks to strong pipelining, dual-issue, branch prediction, caches, and so on. Single-thread performance helps software development; the later processing can be easily written in C, without taking extra measures to parallelize the code (parallel data processing isn't always trivial!), which would be necessary on a multicore system where each core is comparatively inefficient.
Because most of the code is where M7 shines - general purpose (not constrained by worst-case timing requirements; average case is important) data processing and user interfaces - and only a tiny part is cycle-accurate, it makes sense to choose M7 and "force" it to do the cycle-accurate part as well, even though it's extra work. M7 is low-cost, as well.
You could do it otherwise: choose something better suited for cycle-accurate work, but on which the available developers are less able to write efficient UI and data processing code, because it's not what they are familiar with. Or you could build a multi-chip solution. All of these are more expensive, and the end result is not magically any better.
You still had to measure the timing. That is equivalent to inspection, and as engineers used to be taught, "you can't inspect quality into a product".
Nice try, but you are just plain wrong.
What the XMOS IDE does is simulate the internal operations of the CPU and report the result to the user.
In digital logic, simulating (with proper simulation models) and measuring the number of clock cycles provide equivalent results. With improper simulation models, the actual hardware is right.
Buying the Cortex-M7 simulation model does exactly the same thing; the only difference is that it's expensive, and likely has a more difficult user interface (like having to write your own SystemC wrapper).
Running the code on a Cortex-M7 and measuring it does the same, though; it is, after all, the actual digital logic. The operation's runtime can be measured, and it will match the cycle-accurate simulation model.
So, cycle-accurate simulation is only convenient. It's fundamentally not magically more "accurate" than running the actual digital logic. It's actually the opposite: the simulation model must contain zero bugs, while measuring the actual device gives the correct result even if the simulation model has a problem in it.
The error you make is that you seem to think measuring the number of clock cycles a digital system runs in is somehow equivalent to measuring the voltage of a single prototype voltage reference, or the length of a single piece of cut metal. This is not true.
You are not measuring a system with random error terms; you are measuring a circuit consisting of known digital logic, and if you can enumerate the possible sources of variation and show they do not exist, you have a guaranteed number of cycles.
I'm sure it's way easier to do on XMOS, though. I'm sure they have put a lot of thought into providing an easy-to-use yet accurate simulation model integrated into the IDE. And, most importantly, they have "crippled" the CPU performance by deliberately not including modern-day CPU optimizations that would increase the average processing performance at the cost of variability in timing.
If you do timing-accurate work on a modern complex processor like an M7 MCU, you need to know what the variable-timing elements are, and know how to either disable them (note: this does not necessarily cripple the performance, because they can be selectively disabled; like caches on certain memory areas only, or things turned on/off around the critical section), or how to calculate the worst-case timing. The latter is more challenging, but completely possible.
I'm sure that if you have a lot of cycle-accurate work to do, Xcore is likely orders of magnitude easier to work with than a Cortex-M7. And I would love to work with one, if someone would pay me to do it. For my own projects, I default to the most commonly available, most easily second-sourceable parts whenever possible, and only resort to specialized products with their special best-in-class features when I absolutely must.