PIC32MX320: 3,378, C32 2.x, optimized (-O3)
PIC32MX440: 3,067, X32 1.21, optimized (-O3), @ 20 MHz
LM4F120: 2,914, MDK-ARM, optimized (-O3 + time)
STM32F4: 2,888, MDK-ARM, optimized (-O3 + time)
STM32F4: 2,525, gcc-arm, optimized (-O3)
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F: 2,403, XC16 pro, optimized (-O3)
PIC32MX440: 2,288, X32 1.21, optimized (-O3), @ 80 MHz
LPC1343: 2,087, MDK-ARM, optimized (-O3 + time)
STM32F3: 1,964, MDK-ARM, optimized (-O3 + time)
LM4F120: 1,911, IAR-ARM, optimized
STM32F4: 1,903, IAR-ARM, optimized
PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
LPC1227: 1,506, IAR-ARM, optimized
LPC1343: 1,410, gcc-arm, optimized (-O3)
STM32F3: 1,362, IAR-ARM, optimized
LM4F120: 1,297, MDK-ARM
LM4F120: 1,245, IAR-ARM
PIC24F: 1,237, C30 2.x
PIC24H: 1,195, XC16, optimized (-O3)
PIC24H: 1,170, XC16 pro, optimized (-O3)
STM32F4: 1,162, IAR-ARM, optimized (-O3)
PIC32MX320: 1,151, C32 2.x
PIC24F: 1,106, [compiler?], optimized (-O3)
STM32F4: 1,053, IAR-ARM
LPC1227: 1,050, IAR-ARM
STM32F4: 1,029, MDK-ARM
PIC32MX440: 1,004, X32 1.21, @ 20 MHz
PIC24F: 993, XC16 free
STM32F4: 955, MDK-ARM, optimized (-O3)
STM32F1: 921, IAR-ARM, optimized
LPC1343: 906, MDK-ARM
STM32F4: 902, gcc-arm
STM32F3: 858, MDK-ARM
STM32F3: 854, IAR-ARM
STM32F4: 806, MDK-ARM
STM32F3: 804, gcc-arm, optimized
STM32F3: 766, gcc-arm
PIC32MX440: 762, X32 1.21, @ 80 MHz
STM32F1: 736, gcc-arm, optimized
MSP430F2418: 734, IAR, optimized (3)
MSP430F2370: 667, IAR, optimized (3)
LPC1343: 664, gcc-arm
STM32F1: 653, IAR-ARM
MSP430F2418: 630, IAR
STM32F030F: 619, MDK-ARM, optimized (-O3)
LPC1114: 614, IAR-ARM, optimized
MSP430F2370: 573, IAR
PIC24F: 555, [unknown]
STM32F030F: 552, MDK-ARM, -O0
STM32F4: 489, IAR-ARM
PIC24H: 489, XC16 free
STM8S: 482, IAR-STM8, optimized
P87C51MC2: 470, Keil C51, optimized for speed
STM32F1: 453, gcc-arm
P87C51MC2: 439, Keil C51, optimized for size
STM8S: 434, IAR-STM8
LPC1114: 410, IAR-ARM
PIC18F26K20: 380, XC8 pro
PIC18F26K20: 323, XC8 free
PIC18F26K20: 322, PICC18 pro
AVR90USB1286: 237, gcc-avr
PIC18F26K20: 168, PICC18 lite
Simulation only:
PIC32MZ: 1,173
PIC32MZ: 3,413, -O3
PIC24F: 978, XC16
PIC24F: 2,404, XC16, -O3
PIC24F: 1,215, C30
PIC24F: 2,433, C30, -O3
I was surprised:
...
2) avr sucked wind.
Not surprising. GCC without any optimization produces absolutely crap code.
can't generate code for this expression
That wouldn't be logical,
Quote: No optimization of any kind, for any chip
When you say that, do you mean that you didn't do any chip-specific optimizations, or that you completely turned off compiler optimizations (-O0) as well? The latter is practically worthless; some compilers do a lot more in the "optimization" step than others. Pick a generic compiler optimization parameter ("-O3"?) and use the closest equivalent for each compiler.
Quote: I ran dhrystone 2.1 (a useless benchmark) on a few mcus that I have.
Unless you used the same C library, the same code didn't run on all chips. Dhrystone results are tainted by the efficiency of the C library.
I ran the benchmark 10,000 times and then flipped a pin. The time between pin flips gives the duration of the benchmark. The shorter the duration, the faster the execution.
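For anyone who wants to reproduce the numbers, here is a rough sketch of how a pin-to-pin time could be turned into a score. It assumes the table is in Dhrystones/sec/MHz and that the 10,000 runs are 10,000 Dhrystone iterations; the function name and the 0.5 s / 16 MHz reading are made up for illustration and are not from the original measurements.

```python
def dhrystones_per_sec_per_mhz(runs: int, duration_s: float, clock_hz: float) -> float:
    """Benchmark iterations per second, normalized per MHz of clock.

    runs       -- number of benchmark iterations between the two pin flips
    duration_s -- time measured between the pin flips, in seconds
    clock_hz   -- clock frequency used for the normalization, in Hz
    """
    if duration_s <= 0 or clock_hz <= 0:
        raise ValueError("duration and clock must be positive")
    per_second = runs / duration_s          # raw Dhrystones per second
    return per_second / (clock_hz / 1e6)    # divide by the clock in MHz


# Hypothetical reading: 10,000 runs take 0.5 s on a part clocked at 16 MHz.
score = dhrystones_per_sec_per_mhz(10_000, 0.5, 16e6)
print(round(score))  # 1250
```

Which clock goes into `clock_hz` (oscillator or instruction clock) is a separate choice and changes the result for parts that divide the oscillator.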
No optimization of any kind, for any chip - the exact same code ran on all chips.
right, but at least he has gone to the trouble to produce some results, and given a rough indication of how he did them.
Which is a start, and most of us are stuck with the efficiencies of those C libraries anyway.
Quote: Dhrystone results are tainted by the efficiency of the C library.
Of course they are. They're "tainted" by how good the C compiler is too. That's THE POINT. It's a benchmark of the chip+compiler+library SYSTEM. (and that's what SHOULD be meant by "applied no optimizations.") Benchmarks that only measure the chip are much MORE tainted.
Quote: They compile right away, with minimum changes.
Yeah, but we can't come up with numbers that compare directly to the ones in your table without seeing exactly how the wrapper code or your external calculations work. (It currently says "Dhrystones/sec/MHz", but that doesn't look right, nor does it match up very well with the data compared to published estimates.)
Quote: It's a benchmark of the chip+compiler+library SYSTEM.
Not if you are comparing compilers. IMHO there is not much use in including a slow and a fast C library in a benchmark test, since you can always replace a slow C library function with a faster one.
Compiler Options | Dhrystones/MHz/Second |
-o0 | 806 |
-o1 | 932 |
-o2 | 942 |
-o3 | 955 |
-o3 -otime | 3621 (likely not valid) |
Compiler Options | Dhrystones/MHz/Second |
-o0 (free) | 489 |
-o1 (free) | 865 |
-o2 (pro) | 1158 |
-o3 (pro) | 1170 |
I ran dhrystone on an LPC2106 @ 30 MHz (12 MHz x 5 / 2) and I got 2000 plus. No time to debug it, but it wouldn't surprise me if something is being optimized away.
running a keil vs iar vs gcc comparison would be interesting.
Quote: I combined jaxbird's results for F4 and 24H with mine in the first post. When I get some time, I will try XC16 on the 24F as well.
How about running the benchmark at a lower frequency with the wait states disabled?
On wait states, I think so as well, particularly at high clock speeds, so performance is unlikely to scale linearly with clock speed. But it would be difficult for me to quantify.
But, as the PIC24 executes 0.5 instructions per oscillator clock, I argue that the actual performance is half. So I actually think you get ~1187 Dhrystone/MHz.
Quote: PIC24 does 1 [instruction] every 2 [cycles]
Where did you get that idea? The PIC24 manuals say "up to 40 MIPS" (and 40 MHz clock) and:
"All instructions execute in a single cycle, with the exception of instructions that change the program flow,"
(Of course, the 8-bit PICs also say something like that, and elsewhere define a "cycle" as "four oscillator clocks", but I believe that the PIC24 is really 1 instruction per clock...)
The PIC24 is a minimum of two clocks per instruction, gospel...
Perhaps the better wording is that every instruction takes a multiple of 2 clock cycles.
Too bad that the PIC24 is really under-marketed by Microchip and under-appreciated by the masses.
pretty much double the performance
LM4F120: 2,914, MDK-ARM, optimized (-O3 + time)
AVR90USB1286: 237, gcc-avr
Added numbers for LM4F120 (CM4) ....
Just want to make sure we agree that we are calculating results per MHz and not per MIPS.
That's what I suspected. I used an 8 MHz crystal in my test and used 4 MHz for the calculation, as the CPU is actually running at 4 MHz. I understand if you used 8 MHz; both approaches have a rationale. Thus I kept the numbers the way they are, knowing that people may think one is more valid than the other.
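To make the per-MHz versus per-MIPS difference concrete, here is a small sketch that normalizes one raw result both ways. It assumes the instruction clock is the oscillator clock divided by 2, as discussed for the PIC24; the raw rate of 9,728 Dhrystones/s is back-calculated from the 2,432 table entry at 4 MHz purely for illustration, and the function name is hypothetical.

```python
def normalized_scores(dhrystones_per_sec: float, fosc_hz: float, fcy_divider: int = 2) -> dict:
    """Normalize a raw Dhrystones/s figure by both candidate clocks.

    fosc_hz     -- oscillator (crystal) frequency in Hz
    fcy_divider -- oscillator clocks per instruction cycle (assumed 2 here)
    """
    fosc_mhz = fosc_hz / 1e6
    fcy_mhz = fosc_mhz / fcy_divider
    return {
        "per_fosc_mhz": dhrystones_per_sec / fosc_mhz,  # per MHz of crystal
        "per_fcy_mhz": dhrystones_per_sec / fcy_mhz,    # per MHz of instruction clock
    }


# Illustrative only: 8 MHz crystal, CPU effectively running at 4 MHz.
result = normalized_scores(9_728, 8e6)
print(result)  # {'per_fosc_mhz': 1216.0, 'per_fcy_mhz': 2432.0}
```

The two figures always differ by exactly the divider, so a table that mixes both conventions will show some parts at twice (or half) the score of others for no architectural reason.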
Quote: Added LPC1114 (=CM0 running at 12 MHz). The unoptimized Dhrystone/MHz number is actually quite comparable to STM8S' (=good old 6502).
What you wrote above makes all my alarm bells ring. I very much doubt that comparing unoptimised results has any real value. Unoptimised code is mostly used for debugging, where each line of code maps directly to some assembly. It isn't meant to be production-grade code, and no one in their right mind would ship it in a product. A real test would be to optimise for size and for speed.
The old clunker isn't that bad, after all. :)
Or to put it another way, the OEMs are reasonably honest when they say that the CM0/1 chips are meant to compete with the 8-bitters.
Quote: STM8S' (=good old 6502).
You keep on saying that. I didn't know that; is there some info on it?
The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.
You are using a 24H part, right? Take a look at the clock tree. Fcy is Fosc/2, assuming doze is not set.
On 24F parts, you have to go through the datasheet to find that out.
Didn't run on 8051 but would expect it to hold its own reasonably well.
Quote: Fairly remarkable in that a chip from the 1980s is as fast as [a newer chip]
You're still measuring DMIPS/MHz, right? That's not "speed", that's just "architectural efficiency at running C code" or something like that. The RISC claim is not so much that their architectures are fundamentally faster, just that they permit building SIMPLER chips, which in turn allows the clock rate to be pushed up and gives you an overall faster chip.
I will see if I can run the test on that too, see how well MIPS4k compares. They claim 1.65DMIPS/MHz on that.
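For comparing against a vendor's DMIPS/MHz claim, the usual conversion is that 1 DMIPS corresponds to 1757 Dhrystones per second (the VAX 11/780 reference score). A quick sketch, assuming the figures in this thread really are Dhrystones/sec/MHz; the function names are hypothetical.

```python
VAX_DHRYSTONES_PER_SEC = 1757  # reference score that defines 1 DMIPS


def to_dmips_per_mhz(dhrystones_per_sec_per_mhz: float) -> float:
    """Convert Dhrystones/s/MHz into DMIPS/MHz."""
    return dhrystones_per_sec_per_mhz / VAX_DHRYSTONES_PER_SEC


def to_dhrystones_per_mhz(dmips_per_mhz: float) -> float:
    """Convert a DMIPS/MHz claim into Dhrystones/s/MHz."""
    return dmips_per_mhz * VAX_DHRYSTONES_PER_SEC


# The claimed 1.65 DMIPS/MHz, expressed in the units of the table.
print(round(to_dhrystones_per_mhz(1.65)))   # 2899

# Two PIC32MX440 entries from the list, expressed in DMIPS/MHz.
print(round(to_dmips_per_mhz(3067), 2))     # 1.75 (optimized, 20 MHz)
print(round(to_dmips_per_mhz(762), 2))      # 0.43 (unoptimized, 80 MHz)
```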
Quote: (In this case, we're saying STM8 is as fast as CM0 (in DMIPS/MHz), but STM8 tops out at 16MHz, while CM0 in the same price range run 48-72MHz...) (I count that as about 4x the DMIPS/Dollar...)
Small correction: STM8 tops out at 24 MHz, but then it needs an external oscillator.
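The "about 4x" estimate can be checked roughly by multiplying the per-MHz score by the maximum clock. This sketch uses the optimized IAR figures from the list (STM8S: 482, LPC1114: 614) and the 16 MHz and 48 MHz clocks mentioned in the post; it ignores flash wait states at the higher clock, which is an assumption, and the function name is hypothetical.

```python
def peak_score(score_per_mhz: float, max_clock_mhz: float) -> float:
    """Estimated Dhrystones/s at full clock, assuming linear scaling with MHz."""
    return score_per_mhz * max_clock_mhz


stm8s_16 = peak_score(482, 16)    # STM8S on its internal 16 MHz clock
stm8s_24 = peak_score(482, 24)    # STM8S at 24 MHz with an external oscillator
lpc1114 = peak_score(614, 48)     # CM0 at the low end of the 48-72 MHz range

print(stm8s_16, lpc1114)               # 7712 29472
print(round(lpc1114 / stm8s_16, 1))    # 3.8
print(round(lpc1114 / stm8s_24, 1))    # 2.5
```

So with these particular numbers the ratio comes out close to 4x against a 16 MHz STM8 and closer to 2.5x against a 24 MHz one.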
What does the C in 80C51 designate? (or 65C02)
Setup | Cycles @ 80MHz | Dhrystone @ 80MHz | Code Size @ 80MHz | Cycles @ 20MHz | Dhrystone @ 20MHz | Code Size @ 20MHz |
No optimizations | 1312 | 762 | 17436 | 996 | 1004 | 17436 |
GCC optimize level 1 | 631 | 1584 | 15260 | 455 | 2197 | 15256 |
GCC optimize level 2 | 481 | 2079 | 15156 | 345 | 2898 | 15152 |
GCC optimize level 3 | 445 | 2247 | 15216 | 339 | 2949 | 15212 |
GCC optimize level 3 + unroll loops | 437 | 2288 | 15308 | 326 | 3067 | 15300 |
GCC optimize level s(ize) | 680 | 1470 | 15400 | 483 | 2070 | 15396 |
See text / speed | 423 | 2364 | 15308 | 326 | 3067 | 15300 |
See text / size | 1407 | 710 | 10552 | 1232 | 811 | 10548 |
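The Dhrystone columns in this table appear to be derived from the cycle counts as 1,000,000 divided by cycles per run, rounded down. That relationship is inferred from the numbers, not stated in the post, so treat the sketch below as a consistency check; the function name is hypothetical.

```python
def score_from_cycles(cycles_per_run: int) -> int:
    """Dhrystones/s/MHz implied by a cycle count: one million cycles per MHz-second."""
    return 1_000_000 // cycles_per_run


# (cycles, reported score) pairs copied from the table above.
at_80mhz = [(1312, 762), (631, 1584), (481, 2079), (445, 2247),
            (437, 2288), (680, 1470), (423, 2364), (1407, 710)]
at_20mhz = [(996, 1004), (455, 2197), (345, 2898), (339, 2949),
            (326, 3067), (483, 2070), (326, 3067), (1232, 811)]

for cycles, reported in at_80mhz + at_20mhz:
    assert score_from_cycles(cycles) == reported

# Extra cycles per run at 80 MHz relative to 20 MHz, unoptimized build.
print(round(1312 / 996, 2))  # 1.32
```

The roughly 32% extra cycles at 80 MHz would be consistent with the flash wait state discussion earlier in the thread, though the table alone does not prove that is the cause.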
STM32F4: 1,053, IAR-ARM
STM32F4: 494, IAR-ARM
PIC32MX320: 3,378, C32 2.x, optimized (-O3)
PIC32MX440: 3,067, X32 1.21, optimized (-O3), @ 20 MHz
PIC32MX440: 2,288, X32 1.21, optimized (-O3), @ 80 MHz
PIC32MX320: 1,151, C32 2.x
PIC32MX440: 1,004, X32 1.21, @ 20 MHz
PIC32MX440: 762, X32 1.21, @ 80 MHz
Simulation only:
PIC32MZ: 1,173
PIC32MZ: 3,413, -O3
Measured:
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F: 2,403, XC16 pro, optimized (-O3)
PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
PIC24F: 1,237, C30 2.x
PIC24F: 1,106, [compiler?], optimized (-O3)
PIC24F: 993, XC16 free
Simulation only:
PIC24F: 2,433, C30, -O3
PIC24F: 2,404, XC16, -O3
PIC24F: 1,215, C30
PIC24F: 978, XC16
Optimized C30:
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F (sim): 2,433, C30, -O3
//PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
//PIC24F: 1,106, [compiler?], optimized (-O3)
Optimized XC16:
PIC24F: 2,403, XC16 pro, optimized (-O3)
PIC24F (sim): 2,404, XC16, -O3
Non-optimized C30:
PIC24F: 1,237, C30 2.x
PIC24F (sim): 1,215, C30
Non-optimized XC16:
PIC24F: 993, XC16 free
PIC24F (sim): 978, XC16
Here they use another benchmark, CoreMark, which might be interesting for comparison as well.
http://www.eembc.org/coremark/
STM8S: 482, IAR-STM8, optimized
STM8S: 434, IAR-STM8
PIC18F26K20: 380, XC8 pro
PIC18F26K20: 323, XC8 free
PIC18F26K20: 322, PICC18 pro
AVR90USB1286: 237, gcc-avr
PIC18F26K20: 168, PICC18 lite
Quote: Was curious when I discovered that 8051 (and 80C51?) cores are in a lot of uC. A quick digikey search shows 600-700 (plus have to subtract tape&reel, etc.). So let's say 500 (but still less if you subtract pkg types), but still a lot. Or is that the only way to get an 8051? i.e. they only come as a core?
The 8051 is a core anyone can use without paying royalties, and multiple reasonably good toolchains are available to support it. That means it's a no-brainer for many people who need a simple low-performance core in a chip to drop in an 8051 core. The original 8051 takes 12 clock cycles for one machine cycle. There are versions today which run at one clock cycle per machine cycle, so it can be reasonably fast. You will find an 8051 at the heart of a lot of devices you didn't even realise were processor based, as the user just sees them as a black box.
Quote: 8051 has been put in its grave for a long long time, its now a zombie created by some companies lacking innovation effort.
That is the exact opposite of reality. People being really innovative seldom give a damn about the core. If that's where your innovation lies, it's pretty weak. It would be great if a better free-to-use core with good tools were available, but the 8051 is good enough for a lot of devices.
Unfortunately zombies refuse to stay in their grave and act dead.
Quote: That is the exact opposite of reality. People being really innovative seldom give a damn about the core.
Which reality is that then? If you look around, almost all innovations start with the availability of new technology: either more processing power, which makes new technologies feasible, or lower power usage, which lets things run on different (mobile) platforms.
Almost every part of an MCU except the CPU is an interesting place for innovation. That's why the ARM is becoming so dominant. The core just doesn't matter, so people use the ARM as a default. It's something people are familiar and happy with, and it has good tool support. That's all that matters about the core for the vast majority of MCU applications. You might be surprised how few MCUs are actually running at their rated speed. People typically use a fraction of the potential core speed, to save a little current. The main reason for the growth of 32-bit MCUs has less to do with speed than with the ability to work smoothly with big memories.
Quote: Plus, for many applications, faster isn't as needed as other measurements, like current consumption.
And as we know from these processors' big brothers, for example Intel's Core ix, the two also go together: each new silicon process reduces the core voltage and thus the power usage, which enables higher frequencies (or more cores) for the same temperature/power footprint.