From a previous M0 vs M3/M4 comparison I made recently, I had decided on the M3/M4 core. Reason: the main decoding algorithm I wrote has a tight loop with if-else statements in it, and the M3/M4 supports conditional execution. The IT ("if-then-else") instruction lets the compiler predicate both arms of a short if-else, handling the control flow without program jumps. I put this small piece of code into the online Compiler Explorer and saw that the M0 code was a horrid mess with 4 jumps per iteration, while the M4 code came out to only about 6 instructions per iteration.
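The kind of loop in question can be sketched like this (a made-up decode step, not the actual algorithm): a short if-else in the hot path that an ARMv7-M compiler can turn into an IT block or conditional instructions, while an M0 (ARMv6-M) has to branch around each arm.

```c
#include <stdint.h>

/* Hypothetical decoder inner step: branchy code like this is where
   Cortex-M3/M4 IT blocks pay off.  On ARMv7-M both arms can be
   predicated; on ARMv6-M each arm costs a conditional branch. */
uint32_t decode_step(uint32_t acc, uint32_t bit)
{
    if (bit)
        acc = (acc << 1) | 1u;   /* shift in a one  */
    else
        acc = (acc << 1);        /* shift in a zero */
    return acc;
}
```

Pasting a routine like this into Compiler Explorer with `-mcpu=cortex-m0` vs `-mcpu=cortex-m4` makes the difference visible in the generated assembly.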
Another advantage is that the M3/M4 supports a bigger ISA. The M0 is limited to a mostly 16-bit Thumb subset, and as such it may need multiple instructions for something that a single 32-bit Thumb-2 opcode handles on the M3/M4. Note that the M0 is a different ARM architecture than the M3, and the M4 & M7 yet again (ARMv6-M, ARMv7-M, and ARMv7E-M, respectively).
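One concrete instance of that ISA gap is bit-field extraction (field position and width here are just examples): ARMv7-M has a single UBFX instruction for this, while ARMv6-M has to synthesize it from shifts or a shift plus AND.

```c
#include <stdint.h>

/* Extracting bits [11:4] of a register value.  On the M3/M4 the
   compiler can emit one UBFX; the M0 needs a two-instruction
   shift sequence instead. */
uint32_t extract_field(uint32_t word)
{
    return (word >> 4) & 0xFFu;
}
```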
Nonetheless, these may be rather theoretical explanations for the much higher CoreMark scores of the M3/M4 cores. I would recommend comparing them for your particular application.
However, if your MCU firmware is mostly juggling I/O registers rather than number crunching or running complex protocol stacks, then I think you could get by with an M0 chip. (I would, if I were to redesign the aforementioned design into an FPGA with an M0 soft core and a hardware accelerator for the encoder/decoder.)
Interesting you say that, because I have a tight routine (not computationally intense, but with 6 slightly different instances) that was first implemented on the F100 and migrated through the years (F103). It was originally going to be a chain of if-else, but at the time I read somewhere that switch-case was less jumpy, i.e. faster, so I went with a 5-stage switch-case. This finally ended up in the F334, so now I think I have to re-examine that routine either way, despite all my poking around in the disassembler at the time. Thanks for the link to the code examiner, had no idea about that one.
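For what it's worth, the "less jumpy" claim usually comes down to the jump table: with dense case labels the compiler can dispatch a switch in constant time. A hypothetical 5-stage dispatcher (the stage bodies are placeholders, not your routine) would look like:

```c
#include <stdint.h>

/* With dense case labels 0..4 the compiler typically emits a jump
   table (TBB/TBH on ARMv7-M): one indexed load plus one branch no
   matter which case fires, instead of up to four compare-and-branch
   pairs for the equivalent if-else chain. */
uint32_t run_stage(uint32_t stage, uint32_t x)
{
    switch (stage) {
    case 0: return x + 1u;
    case 1: return x << 1;
    case 2: return x ^ 0xFFu;
    case 3: return x >> 2;
    case 4: return x * 3u;
    default: return x;
    }
}
```

Whether the table actually beats the chain depends on how dense the labels are and which cases are hot, so checking the disassembly for your 6 instances is still the right call.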
Most of the computing is simple counting, compares, adds, and shifts, plus a large stack of 16x16 multiplies with a shift for each, several 32-bit add-and-accumulate routines, and lots of table lookups. There is also an input protocol handler/translator serviced by a DMA buffer operating on 8-bit data, all driven by 3-4 differently timed interrupts, and a multi-channel output handler for the processed data. Driving that output by interrupts could cause several conflicts and waits since it needs tight timing, so it has been moved over to a DMA sequencer that just shifts out timed data to its channel from an SRAM buffer to free up the CPU. And most certainly some unforeseen things I have not yet discovered/implemented that will cause all kinds of hassle!
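The 16x16 mul-and-shift with 32-bit accumulate part is exactly where the M4 shines. A minimal fixed-point sketch (the Q15 format, coefficients, and shift amount are assumptions, not your actual routine):

```c
#include <stdint.h>

/* Sketch of a 16x16 multiply-shift-accumulate loop.  On a Cortex-M4
   the compiler can map each 16x16 product to a single-cycle
   SMULBB/SMLABB; an M0 falls back to the plain 32x32 MULS. */
int32_t mac_q15(const int16_t *coef, const int16_t *sample, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += ((int32_t)coef[i] * sample[i]) >> 15;  /* Q15 * Q15 term */
    return acc;
}
```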
(I'm not a software developer, only forced by the evil bosses.)
The G071's increased RAM size vs. cost enables moving the firmware that ran from F334 flash into G071 SRAM, where it runs at full speed, while the heavy lifting can be done by a single H750; the original plan of an F446 for each sub-block simply doesn't cut it anymore, not even money-wise, compared to one H750.
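Running code from SRAM can be done per-function rather than wholesale. A minimal sketch, assuming the `.RamFunc` section name that the STM32Cube linker scripts and startup code provide (they copy the section from flash to RAM at boot); the function body here is a placeholder:

```c
#include <stdint.h>

/* Place a hot routine in SRAM so it executes without flash wait
   states.  The ".RamFunc" section name is an assumption from the
   STM32Cube linker scripts; adjust to match your own script. */
__attribute__((section(".RamFunc")))
uint32_t hot_decode(uint32_t x)
{
    return (x << 1) ^ 0xA5u;   /* placeholder body */
}
```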
I concur that the FPGA solution, if designed from scratch, would be very nice; unfortunately it's not viable now.