The purpose is for a design that currently uses F334 M3s to be replaced with cheaper G071 M0+s, or with one H750 replacing all the M3s or M0+s. The single-chip option complicates the entire design, but computing power can then be distributed rather than fixed, plus other bonuses that a multi-M3 or multi-M0+ setup doesn't give. Those, on the other hand, give a much simpler board layout, etc.
From a previous M0 vs M3/M4 comparison I made recently, I had decided on the M3/M4 core. Reason: the main decoding algorithm that I wrote has a tight loop with if-else statements in it, and the ARM M3/M4 has something called conditional execution. This basically inlines an if-else statement with an "if-then-else" (IT) instruction, which can perform these control-flow tasks without program jumps. I put this small piece of code into the online Compiler Explorer and saw that the M0 code looked like a horrid mess with 4 jumps per iteration, while the M4 code was basically only 6 instructions per iteration.
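As a rough illustration (this is a made-up stand-in, not the actual decoder), here is the kind of tight if-else loop in question. On ARMv7-M (M3/M4) the compiler can turn the branch into an IT block or conditional instructions, while on ARMv6-M (M0/M0+) it has to emit real jumps each iteration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decode step: shift in one output bit per input element.
   The if-else in the loop body is exactly the pattern that ARMv7-M can
   handle with an IT (If-Then) block instead of branches. */
uint32_t decode(const uint8_t *bits, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++) {
        if (bits[i] & 1)
            acc = (acc << 1) | 1;   /* shift in a 1 */
        else
            acc = (acc << 1);       /* shift in a 0 */
    }
    return acc;
}
```

Pasting something like this into Compiler Explorer with `-mcpu=cortex-m0plus` vs `-mcpu=cortex-m4` (at `-O2`) shows the difference in generated branches directly.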
Another advantage is that the M3/M4 supports a bigger ISA. The M0 is limited to the original, mostly 16-bit Thumb instruction set, while the M3/M4 has the full Thumb-2 set, so the M0 may need to use multiple instructions to execute something that a single 32-bit opcode can handle. Note that the M0 is a different ARM architecture than the M3, and the M4 & M7 yet again (ARMv6-M, ARMv7-M, ARMv7E-M, respectively).
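A concrete example of that single-opcode win (my example, not from the posts above): a widening 32x32 -> 64-bit multiply. ARMv7-M has UMULL, so the M3/M4 does this in one instruction; ARMv6-M only has a 32x32 -> 32 MULS, so on the M0/M0+ the compiler has to synthesize it from several multiplies:

```c
#include <assert.h>
#include <stdint.h>

/* On an M3/M4 this compiles to a single UMULL instruction.
   On an M0/M0+ the same C line becomes a sequence of 32-bit
   multiplies and adds (or a runtime library call). */
uint64_t widening_mul(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```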
Nonetheless, these may be very theoretical reasons to explain the much higher CoreMark score for the M3/M4 cores. I would recommend comparing them for your particular application.
However, if your MCU firmware is mostly juggling I/O registers rather than number crunching or running complex protocol stacks, then I think you could get by with an M0 chip. (That's what I would do if I were to redesign the aforementioned design into an FPGA with an M0 softcore and a hardware accelerator for the encoder/decoder.)
M4, yes, sorry for that. Anyhow, as noted, ARM white papers have more detailed "fancy" numbers per MHz, but as mentioned, that tells me just that. Consider just the case of interrupts: blocking interrupts/DMA etc. combined with code structure, not to mention that interrupt handling takes a different number of clock cycles depending on the core, makes the whole task of estimating very problematic. I had hoped to find some code cases from ARM, ST etc., but found nothing. I doubt benchmarking code is comparable to a real application.
My imagined (lame) idea was that if one had a routine for the M0+ and then the same routine for, e.g., the M4 and M7, one could compare them. But that suggests a detailed study of the disassembled code and absolute control of the compilation, or thereabouts.
Short of testing that specifically with your OWN code and specific environment, benchmarks are pretty much all you can count on to get realistic figures. Of course they don't say it all, but CoreMark is rather balanced and not all that badly designed for this purpose. You were strictly talking about "computing capacity" (whatever you mean by that), so again a benchmark of this kind seems somewhat relevant. An industry-standard benchmark is certainly better than just theoretical and potentially uninformed hypotheses.
Now there are of course myriad particular cases that would make a general-purpose benchmark less relevant, but then it's completely up to you. Either you use a general-purpose tool, or your very own. What else do you want? Theoretical values are pretty much meaningless, as every implementation will define real-life performance, including the use of caches and other factors. The approximate 1.7x factor between STM32's M0 and M4 for "computing-intensive" tasks looks about right to me. Of course this is just a very general figure. If you're heavily using floating point, for instance, the difference would be much more drastic: the F334 has an FPU, the G071 doesn't, AFAIK... Another architectural point: the M4 has speculative branch fetching, the M0+ doesn't.
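To make the floating-point point concrete, here is an illustrative float-heavy kernel (my own toy example, not the actual firmware): a single-pole IIR low-pass step. With the F334's single-precision FPU this is a few VMUL/VADD instructions per sample; on the FPU-less G071 every float operation becomes a software float library call, typically tens of cycles each, so the gap goes far beyond the ~1.7x integer figure:

```c
#include <assert.h>

/* One-pole IIR low-pass: y += alpha * (x - y) per sample.
   Hardware FPU (M4F): ~3-4 cycles per float op.
   Software floats (M0+): each op is a library call. */
float iir_lowpass(const float *x, int n, float alpha)
{
    float y = 0.0f;
    for (int i = 0; i < n; i++)
        y = y + alpha * (x[i] - y);
    return y;
}
```

Note the F334's FPU is single-precision only, so this kind of comparison only holds if the code sticks to `float` (accidental promotion to `double` falls back to software routines even on the M4F).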
As to interrupts, I would expect the M0+ to have slightly less latency actually, because it's only a 2-stage pipeline core. That would have to be confirmed though, as stating this just from the architecture would typically be a case of what I said above: a theoretical and potentially uninformed hypothesis.