-
#50 Reply
Posted by
dannyf
on 07 Mar, 2014 00:49
-
Added numbers for LM4F120 (CM4) and LPC1343 (CM3).
-
#51 Reply
Posted by
BravoV
on 07 Mar, 2014 04:37
-
LM4F120: 2,914, MDK-ARM, optimized (-O3 + time)
AT90USB1286: 237, gcc-avr
Added numbers for LM4F120 (CM4) ....
Cool numbers. I never did a thorough comparison on my own, but this at least confirms my limited observation of the TI CM4: it "feels" so fast even with my bloated noob code. BTW, I migrated from AVR, which was my sole MCU in the past. Thanks.
The worst part is that I've become too spoiled and lazy, which totally screws up my effort to learn code optimization.
-
#52 Reply
Posted by
dannyf
on 07 Mar, 2014 12:24
-
Just want to make sure we agree that we are calculating results per MHz and not per MIPS.
That's what I suspected. I used an 8 MHz crystal in my test and used 4 MHz for the calculation, as the CPU is actually running at 4 MHz. I understand if you used 8 MHz; both approaches have a rationale. Thus I kept the numbers the way they are, knowing that people may think one is more valid than the other.
-
#53 Reply
Posted by
dannyf
on 08 Mar, 2014 13:59
-
STM32F4 added for gcc-arm (running at 16 MHz): in line with the F3's numbers.
-
#54 Reply
Posted by
dannyf
on 08 Mar, 2014 14:09
-
STM32F4 under iar-arm added. I think this is the first time in this test where gcc-arm is faster than iar-arm, with optimization turned on. The validity of the test with optimization turned on, however, is dubious without further investigation.
-
#55 Reply
Posted by
dannyf
on 08 Mar, 2014 15:22
-
Also added are the STM32F4 numbers under mdk.
IAR and Keil seem to be running neck and neck. GCC appears to be quite a bit slower than either IAR or Keil.
-
#56 Reply
Posted by
nctnico
on 08 Mar, 2014 20:49
-
Replace the C library functions with internal ones and test again to make sure it's not the C library making the difference between IAR and GCC.
-
#57 Reply
Posted by
dannyf
on 09 Mar, 2014 00:46
-
Added LPC1114 (=CM0 running at 12 MHz). The unoptimized Dhrystone/MHz number is actually quite comparable to STM8S' (=good old 6502).
The old clunker isn't that bad, after all.
Or to put it another way, the OEMs are reasonably honest when they say that the CM0/1 chips are meant to compete with the 8-bitters.
-
#58 Reply
Posted by
jaxbird
on 09 Mar, 2014 11:51
-
Just want to make sure we agree that we are calculating results per MHz and not per MIPS.
That's what I suspected. I used an 8 MHz crystal in my test and used 4 MHz for the calculation, as the CPU is actually running at 4 MHz. I understand if you used 8 MHz; both approaches have a rationale. Thus I kept the numbers the way they are, knowing that people may think one is more valid than the other.
In my tests, using the internal oscillator, the configuration was like this:
Fosc = ((7.37MHz * M) / N1) / N2
Where M, N1 and N2 are PLLFBD, PLLPOST and PLLPRE.
And the actual values used:
M = 65
N1 = 2
N2 = 3
Giving Fosc = 79.841 MHz. (+/- 2%)
The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.
-
#59 Reply
Posted by
nctnico
on 09 Mar, 2014 12:49
-
Added LPC1114 (=CM0 running at 12 MHz). The unoptimized Dhrystone/MHz number is actually quite comparable to STM8S' (=good old 6502).
The old clunker isn't that bad, after all.
Or to put it another way, the OEMs are reasonably honest when they say that the CM0/1 chips are meant to compete with the 8-bitters.
What you wrote above makes all my alarm bells ring. I very much doubt that comparing unoptimised results has any real value. Unoptimised code is mostly used for debugging, where each line of code maps directly onto some assembly. It is not meant to be production-grade code, as no-one in their right mind would use it in a product. A real test would be to optimise for size and for speed.
-
#60 Reply
Posted by
dannyf
on 09 Mar, 2014 13:10
-
The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.
You are using a 24H part, right? Take a look at the clock tree. Fcy is Fosc divided by at least 2; it is Fosc/2 assuming doze is not set.
On 24F parts, you have to go through the datasheet to find that out.
-
#61 Reply
Posted by
Kjelt
on 09 Mar, 2014 22:48
-
STM8S' (=good old 6502).
You keep on saying that, I didn't know that, is there some info on that?
I know the STM8 also has only the X and Y register (unfortunately, if they had added some extra registers that would have been nice).
But is that the only similarity?
-
#62 Reply
Posted by
jaxbird
on 23 Mar, 2014 13:24
-
The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.
You are using a 24H part, right? Take a look at the clock tree. Fcy is Fosc divided by at least 2; it is Fosc/2 assuming doze is not set.
On 24F parts, you have to go through the datasheet to find that out.
Yeah, it's defined as Fcy = Fosc / 2. So a minimum of 2 oscillator clocks is required per instruction.
Anyway, it's not important. My motivation was primarily to find the main reason for the large differences in measured performance. I believe we agree this is where our calculations differ, so I'm satisfied.
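The clock arithmetic in this exchange can be sketched as follows. This is only an illustration in Python: the function names and the raw score are made up, the PLL values are the ones quoted earlier in the thread, and Fcy = Fosc / 2 assumes doze is not active.

```python
# Sketch of the clock arithmetic discussed above (names are illustrative).

def fosc_mhz(fin_mhz, m, n1, n2):
    """Oscillator frequency as written in the post: ((Fin * M) / N1) / N2."""
    return fin_mhz * m / n1 / n2

def fcy_mhz(fosc, doze_divider=1):
    """Instruction clock: Fosc / 2, divided further if doze is active (assumption)."""
    return fosc / 2 / doze_divider

def score_per_mhz(dhrystones_per_second, clock_mhz):
    """Normalize a raw Dhrystone rate by whichever clock is chosen."""
    return dhrystones_per_second / clock_mhz

fosc = fosc_mhz(7.37, m=65, n1=2, n2=3)   # values quoted in the post
fcy = fcy_mhz(fosc)

raw_score = 40_000.0                      # hypothetical Dhrystones/s, not a measurement
by_fosc = score_per_mhz(raw_score, fosc)
by_fcy = score_per_mhz(raw_score, fcy)

print(f"Fosc = {fosc:.2f} MHz, Fcy = {fcy:.2f} MHz")
print(f"per-MHz score is {by_fcy / by_fosc:.1f}x higher when normalized by Fcy")
```

Whichever clock is used as the divisor, the same raw score differs by that factor between the two conventions, which matches the difference between the two sets of results.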
-
-
Didn't run on 8051 but would expect it to hold its own reasonably well.
Hold its own what? I would say that this all falls under the category of "Premature optimization", but maybe not.
Can you provide a few cases where this kind of performance is the keystone in a design?
I'm curious to know where this metric would be the top design decision.
I've been learning more about uC and I see that some uC have 8051 cores in them, don't know if any you tested do.
Do you know if any do?
-
#64 Reply
Posted by
dannyf
on 25 Mar, 2014 22:58
-
Added (simulated) scores for C51 (an NXP P87C51MC2, in order to hold the data). Scores are obtained in simulation under Keil C51, on a 24 MHz crystal, and calculated off a 2 MHz core frequency (the chip, I think, is a 12-cycle C51).
-
#65 Reply
Posted by
dannyf
on 26 Mar, 2014 12:44
-
STM8s are advertised by ST as 0.25 DMIPS/MHz chips. That translates into about 430 Dhrystones/MHz, consistent with our measurements here.
Unfortunately, for the CMx chips, we are getting about 50-75% of the numbers published by ARM / vendors.
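For reference, the conversion between DMIPS/MHz and Dhrystones/MHz can be written out as below. This is a minimal Python sketch, assuming the conventional figure of 1757 Dhrystones per second for 1 DMIPS (the VAX 11/780 score); the helper names are illustrative.

```python
# 1 DMIPS is conventionally defined as 1757 Dhrystones per second
# (the score of the VAX 11/780 reference machine).
DHRYSTONES_PER_DMIPS = 1757

def dmips_to_dhrystones(dmips_per_mhz):
    """Convert DMIPS/MHz to Dhrystones per second per MHz."""
    return dmips_per_mhz * DHRYSTONES_PER_DMIPS

def dhrystones_to_dmips(dhrystones_per_mhz):
    """Convert Dhrystones per second per MHz back to DMIPS/MHz."""
    return dhrystones_per_mhz / DHRYSTONES_PER_DMIPS

advertised = 0.25  # DMIPS/MHz figure quoted above for the STM8
print(f"{advertised} DMIPS/MHz = {dmips_to_dhrystones(advertised):.2f} Dhrystones/MHz")
```

With that constant, 0.25 DMIPS/MHz works out to a little under 440 Dhrystones/MHz, in the same range as the figure quoted above.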
-
#66 Reply
Posted by
dannyf
on 26 Mar, 2014 14:42
-
The DMIPS/MHz numbers for the 8051 vary a lot, from lows of <0.1 DMIPS/MHz to highs of 0.5 DMIPS/MHz, with 0.25 DMIPS/MHz (about 400+ Dhrystones per MHz) quite often quoted.
Fairly remarkable in that a chip from the 1980s is as fast as a chip introduced in the last 10 years (STM8).
Not sure what the 6500 has in terms of Dhrystone scores.
-
#67 Reply
Posted by
westfw
on 26 Mar, 2014 17:13
-
Fairly remarkable in that a chip from the 1980s is as fast as [a newer chip]
You're still measuring DMIPS/MHz, right? That's not "speed", that's just "architectural efficiency at running C code" or something like that. The RISC claim is not so much that their architectures are fundamentally faster, just that they permit building SIMPLER chips, which in turn allows the clock rate to be pushed up and gives you an overall faster chip.
-
#68 Reply
Posted by
hans
on 26 Mar, 2014 17:37
-
That's true. The Pentium 3 at 1GHz was faster than a Pentium 4 at 1GHz, but the Pentium 4 could clock far higher with its new pipeline design. At the end of its range it maxed out just under 4GHz or so, and we haven't seen much higher since (for example, my i5 3570K steps up to 3.9GHz under a 1-core load). The only things that keep pushing performance up have been multi-threading and more efficient CPUs, with larger/better caches, more instructions to play with (if programs are built to use them), etc.
An interesting dimension to add is power consumption per MHz. From that you could calculate a performance/energy figure: you have both Dhrystones/MHz and mA/MHz, and dividing the first by the second gives a Dhrystone/mA ratio, or simply put "computing efficiency". That would be interesting for low-power electronics such as battery-powered devices, where the main consumer is the MCU doing work on a regular basis.
I don't know if it's acceptable to take these figures from the datasheet; they can depend a lot on which peripherals are turned on (ARM) or on the supply voltage.
I think I have a board lying around with a PIC32 on it. I will see if I can run the test on that too, see how well MIPS4k compares. They claim 1.65DMIPS/MHz on that.
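The "computing efficiency" ratio described above is a single division. Here is a rough Python sketch; the function name and the sample numbers are placeholders for illustration, not measurements or datasheet values.

```python
def dhrystones_per_ma(dhrystones_per_mhz, ma_per_mhz):
    """Dhrystones/MHz divided by mA/MHz; the MHz cancels, leaving Dhrystones per mA.

    Strictly, the result is Dhrystones per second per mA of supply current.
    """
    if ma_per_mhz <= 0:
        raise ValueError("current per MHz must be positive")
    return dhrystones_per_mhz / ma_per_mhz

# Placeholder inputs: a chip scoring 400 Dhrystones/MHz while drawing 0.25 mA/MHz.
efficiency = dhrystones_per_ma(400, 0.25)
print(f"{efficiency:.0f} Dhrystones per second per mA")
```

Because the current figure depends on supply voltage and enabled peripherals, the inputs would need to be taken under the same conditions for the comparison to mean much.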
-
#69 Reply
Posted by
westfw
on 26 Mar, 2014 17:42
-
(In this case, we're saying STM8 is as fast as CM0 (in DMIPS/MHz), but STM8 tops out at 16MHz, while CM0 in the same price range run 48-72MHz...) (I count that as about 4x the DMIPS/Dollar...)
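The comparison above can be sketched in Python as below. It assumes, as the post does, equal DMIPS/MHz and equal price for both chips, and uses the clock limits quoted there; the function name and the per-MHz value are illustrative.

```python
def peak_rate(dmips_per_mhz, fmax_mhz):
    """Peak throughput in DMIPS at the chip's maximum clock."""
    return dmips_per_mhz * fmax_mhz

PER_MHZ = 0.25                     # assumed equal for both chips, per the post
stm8_peak = peak_rate(PER_MHZ, 16)  # STM8 at the 16 MHz quoted above

# CM0 at the low and high ends of the quoted 48-72 MHz range.
low_ratio = peak_rate(PER_MHZ, 48) / stm8_peak
high_ratio = peak_rate(PER_MHZ, 72) / stm8_peak

print(f"CM0 advantage at the same price: {low_ratio:.1f}x to {high_ratio:.1f}x")
```

The per-MHz figure cancels out of the ratio, so under these assumptions only the maximum clock rates matter.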
-
#70 Reply
Posted by
dannyf
on 26 Mar, 2014 21:08
-
Added the simulated results for PIC32MX320F128H, under an old C32 compiler.
The unoptimized figure translates to 0.75 DMIPS/MHz, and the optimized one to 2.0 DMIPS/MHz, which is not that believable, however.
-
#71 Reply
Posted by
dannyf
on 26 Mar, 2014 21:09
-
I will see if I can run the test on that too, see how well MIPS4k compares. They claim 1.65DMIPS/MHz on that.
Would love to see where the real thing comes out to be.
-
#72 Reply
Posted by
Kjelt
on 27 Mar, 2014 08:04
-
(In this case, we're saying STM8 is as fast as CM0 (in DMIPS/MHz), but STM8 tops out at 16MHz, while CM0 in the same price range run 48-72MHz...) (I count that as about 4x the DMIPS/Dollar...)
Small correction: STM8 tops out at 24MHz, but then needs an external oscillator.
The CM0 will give a great increase in speed, but at the cost of code size: CM0 code is 30-35% larger than STM8 code, and that also adds to the final cost.
-
#73 Reply
Posted by
dannyf
on 27 Mar, 2014 15:43
-
The Dhrystone benchmark for the 6502 (actually 65C02) that I can find suggests 0.022 DMIPS/MHz (not sure if it is scaled by 2 or not). That would translate into a Dhrystone score of 30 / MHz. Slower than a PIC,
-
-
I was curious when I discovered that 8051 (and 80C51?) cores are in a lot of uC. A quick Digi-Key search shows 600-700 parts (from which you have to subtract tape & reel duplicates, etc.), so let's say 500 (fewer still if you subtract package variants), but that is still a lot. Or is that the only way to get an 8051? i.e. they only come as a core?
What does the C in 80C51 designate? (or 65C02)