EEVblog Electronics Community Forum

Electronics => Microcontrollers => Topic started by: dannyf on March 01, 2014, 06:40:22 pm

Title: Dhrystone 2.1 on mcus
Post by: dannyf on March 01, 2014, 06:40:22 pm

I ran dhrystone 2.1 (a useless benchmark) on a few mcus that I have.

I ran the benchmark 10,000 times and then flipped a pin. The time between pin flips is the duration of the benchmark run: the shorter the duration, the faster the execution.

No optimization of any kind, for any chip - the exact same code ran on all chips.

[edit: data now given in Dhrystones per MHz per second, sorted from high to low]

Quote

PIC32MX320: 3,378, C32 2.x, optimized (-O3)
PIC32MX440: 3,067, X32 1.21, optimized (-O3) @ 20MHz
LM4F120: 2,914, MDK-ARM, optimized (-O3 + time)
STM32F4: 2,888, MDK-ARM, optimized (-O3 + time)
STM32F4: 2,525, gcc-arm, optimized (-O3)
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F: 2,403, XC16 pro, optimized (-O3)
PIC32MX440: 2,288, X32 1.21, optimized (-O3) @ 80MHz
LPC1343: 2,087, MDK-ARM, optimized (-O3 + time)
STM32F3: 1,964, MDK-ARM, optimized (-O3 + time)
LM4F120: 1,911, IAR-ARM, optimized
STM32F4: 1,903, IAR-ARM, optimized
PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
LPC1227: 1,506, IAR-ARM, optimized
LPC1343: 1,410, gcc-arm, optimized (-O3)
STM32F3: 1,362, IAR-ARM, optimized
LM4F120: 1,297, MDK-ARM
LM4F120: 1,245, IAR-ARM
PIC24F: 1,237, C30 2.x
PIC24H: 1,195, XC16, optimized (-O3)
PIC24H: 1,170, XC16 pro, optimized (-O3)
STM32F4: 1,162, IAR-ARM, optimized (-O3)
PIC32MX320: 1,151, C32 2.x
PIC24F: 1,106, [compiler?], optimized (-O3)
STM32F4: 1,053, IAR-ARM
LPC1227: 1,050, IAR-ARM
STM32F4: 1,029, MDK-ARM
PIC32MX440: 1,004, X32 1.21 @ 20MHz
PIC24F: 993, XC16 free
STM32F4: 955, MDK-ARM, optimized (-O3)
STM32F1: 921, IAR-ARM, optimized
LPC1343: 906, MDK-ARM
STM32F4: 902, gcc-arm
STM32F3: 858, MDK-ARM
STM32F3: 854, IAR-ARM
STM32F4: 806, MDK-ARM
STM32F3: 804, gcc-arm, optimized
STM32F3: 766, gcc-arm
PIC32MX440: 762, X32 1.21 @ 80MHz
STM32F1: 736, gcc-arm, optimized
MSP430F2418: 734, IAR, optimized (3)
MSP430F2370: 667, IAR, optimized (3)
LPC1343: 664, gcc-arm
STM32F1: 653, IAR-ARM
MSP430F2418: 630, IAR
STM32F030F: 619, MDK-ARM, O3 optimized
LPC1114: 614, IAR-ARM, optimized
MSP430F2370: 573, IAR
PIC24F: 555, [unknown]
STM32F030F: 552, MDK-ARM, O0
STM32F4: 489, IAR-ARM
PIC24H: 489, XC16 free
STM8S: 482, IAR-STM8, optimized
P87C51MC2: 470, Keil C51 optimized for speed
STM32F1: 453, gcc-arm
P87C51MC2: 439, Keil C51 optimized for size
STM8S: 434, IAR-STM8
LPC1114: 410, IAR-ARM
PIC18F26K20: 380, XC8 pro
PIC18F26K20: 323, XC8 free
PIC18F26K20: 322, PICC18 pro
AVR90USB1286: 237, gcc-avr
PIC18F26K20: 168, PICC18 lite

Simulation only:
PIC32MZ: 1,173
PIC32MZ: 3,413, O3
PIC24F: 978, XC16
PIC24F: 2,404, XC16, O3
PIC24F: 1,215, C30
PIC24F: 2,433, C30, O3

I was surprised:

1) pic24f was really fast, and stm32f1/3 sucked on a per-MHz basis.
2) avr sucked wind.
3) stm8s did OK. Evidence of the 6502's staying power.

Didn't run on 8051 but would expect it to hold its own reasonably well.

edit: 1) added stm32f100 numbers.
2) added stm32f100 + iar-arm, vs. gcc-arm.
3) added results from IAR-ARM and represented the data to make it easier for the eyes.
4) added results from pic18f
5) updated PIC24F results (fat fingers) and added -O3 optimization
6) added mdk-arm numbers for STM32F3. Pretty much identical unoptimized.
7) added jaxbird's results for STM32F4 and PIC24H.
8) added PIC24F results under XC16 free/pro. Still very high.
9) added jaxbird's and hans' results for pic24h and pic24f/stm32f4, respectively.
10) added LM4F120 under IAR-ARM and MDK-ARM.
11) added LPC1343 under gcc-arm and MDK-ARM.
12) added STM32F4 under gcc-arm and iar-arm, and mdk-arm too.
13) added LPC1114 under iar-arm. The first CM0 chip in the comparison.
14) added C51 (P87C51MC2) under Keil C51. Optimized for size and speed.
15) added PIC32MX320 under C32, 2.x. (simulated)
16) added PIC32MX440F512H results, for 80MHz and 20MHz
17) added msp430. Fairly respectable scores.
18) added LPC1227 / IAR scores.
19) added MSP430F2370 - similar to the MSP430 scores obtained earlier.
20) added simulated results for PIC32MZ, and PIC24F (XC16 and C30)
21) added the results for STM32F030F, from the ghetto thread.

Title: Re: Dhrystone 2.1 on mcus
Post by: Bored@Work on March 01, 2014, 07:03:23 pm

Quote from: dannyf on March 01, 2014, 06:40:22 pm

I was surprised:
...
2) avr sucked wind.

Not surprising. GCC without any optimization produces absolutely crap code.

Title: Re: Dhrystone 2.1 on mcus
Post by: tszaboo on March 01, 2014, 07:50:49 pm

Quote from: Bored@Work on March 01, 2014, 07:03:23 pm

Quote from: dannyf on March 01, 2014, 06:40:22 pm
I was surprised:
...
2) avr sucked wind.

Not surprising. GCC without any optimization produces absolutely crap code.

Yeah, kinda like ";" will be "NOP" after compiling just to be able to put a breakpoint on it. Not to mention, Cortex has a very well defined 1.25 DMIPS/MHz for M3 and 0.93 DMIPS/MHz for M0+, so running it again is kinda pointless. Unless you compare different compiler settings, etc...

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 01, 2014, 08:24:40 pm

I was doing CoreMarks last week (as it so happens), to see what compiler settings did for PIC24.
I chose CoreMarks because it focuses so much on Embedded Controllers and thus includes tests like pointer handling capability, branching etc. over pure integer speed.

I just did some more tests and with XC16 enabled for optimizations, I got:
-O 0 = 9.911 iterations / second (2700b RAM, 8015b FLASH)
-O 1 = 23.56 iterations/second (2630b RAM, 6486b FLASH)
-O 2 = 29.64 iterations/second (2630b RAM, 6634b FLASH)
-O 2 -unroll = 31.13 iterations/second (2630b RAM, 8083b FLASH)

-O 3 = 28.95 iterations/second (2630b RAM, 9454b FLASH)
-O 3 -unroll = 30.09 iterations/second (2630b RAM, 12348b FLASH)

-O s = 21.11 iterations/second (2630b RAM, 6484b FLASH)

Interesting compiler/benchmark, where optimization level 2 is faster than level 3 and s (size). :-// No optimizations being so slow doesn't surprise me, as it's for development and the debugger needs to be able to find the instructions in program memory.
Also funny to see how well -O1 actually compares for this benchmark. However, as always with optimizations, it depends on how stuff is written: YMMV.

The coremarks benchmark is really easy to port (initialize uart for printf, insert timebase with 32-bit timer, done). Except.. for PIC18, because the program has a matrix multiply test, and uses 32-bit integer array indexing. The XC8 didn't like that. It said:
../coremark_v1.0/core_matrix.c:244: error: can't generate code for this expression

Anyway, I see STM8 has 16-bit divide instructions, where the AVR only has multiply (8-bit). PIC24 has 32-bit by 16-bit divide, and 17-bit by 17-bit multiply. So I guess it wins out on a lot. Not sure why it would be faster than a Cortex-M3 though, because the M3 has 32-bit multiply/divide. That wouldn't be logical, unless the bus is being stalled all the time or something (highly doubt that though).

Title: Re: Dhrystone 2.1 on mcus
Post by: GiskardReventlov on March 01, 2014, 09:11:36 pm

Help me understand the numbers.

pic24fj64ga102 @ 4Mhz (8Mhz crystal), duration = 2.12 seconds -> 8.5 seconds @ 1Mhz

Is this 8.5 Dhrystones/second?

What compiler flags did you use and what versions of the compilers did you use?

As a casual, naive observer I conclude that you didn't reach any conclusion. But maybe I missed how the chips you tested are similar or dissimilar. The only commonality I saw was that you have them.

Correction:
avr90usb1286

should be

at90usb1286

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 01, 2014, 09:22:47 pm

Quote

can't generate code for this expression

You may try some earlier compilers from hi-tech.

Quote

That wouldn't be logical,

Agreed. I think we will know for sure once I have a look at the list file.

I also didn't keep track of code size but that's not that interesting to me.

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 02, 2014, 07:40:13 am

Quote

No optimization of any kind, for any chip

When you say that, do you mean that you didn't do any chip-specific optimizations, or that you completely turned off compiler optimizations (-O0) as well? The latter is practically worthless; some compilers do a lot more in the "optimization" step than others. Pick a generic compiler optimization parameter ("-O3"?) and use the closest equiv for each compiler.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 02, 2014, 12:02:03 pm

Finished running the code on 18f. The XC8 free compiler did quite well vs. the good old picc18.

Title: Re: Dhrystone 2.1 on mcus
Post by: nctnico on March 02, 2014, 12:35:48 pm

Quote from: dannyf on March 01, 2014, 06:40:22 pm

I ran dhrystone 2.1 (a useless benchmark) on a few mcus that I have.

I ran the benchmark 10,000 times and then flipped a pin. The time between pin flips is the duration of the benchmark run: the shorter the duration, the faster the execution.

No optimization of any kind, for any chip - the exact same code ran on all chips.

Unless you used the same C library, the same code didn't run on all chips. Dhrystone results are tainted by the efficiency of the C library.

edit: After a short peek I see several calls to strcpy and strcmp. These should be replaced with functions inside the Dhrystone test so the results are not tainted by differences in the C library.

Title: Re: Dhrystone 2.1 on mcus
Post by: HackedFridgeMagnet on March 02, 2014, 02:14:49 pm

Quote

Unless you used the same C library, the same code didn't run on all chips. Dhrystone results are tainted by the efficiency of the C library.

You're right, but at least he has gone to the trouble to produce some results, and given a rough indication of how he did them.

Which is a start, and most of us are stuck with the efficiencies of those C libraries anyway.

Title: Re: Dhrystone 2.1 on mcus
Post by: jaromir on March 02, 2014, 09:11:20 pm

dannyf: maybe I overlooked something, but could you, please, provide your sources, say for PIC24? I'd like to verify it and maybe do tests with different MCUs.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 02, 2014, 09:28:41 pm

It is the standard dhrystone package. It comes in one .h file and two .c files, dhry_1.c and dhry_2.c.

google it. They compile right away, with minimum changes.

Title: Re: Dhrystone 2.1 on mcus
Post by: GiskardReventlov on March 02, 2014, 09:40:24 pm

Quote from: HackedFridgeMagnet on March 02, 2014, 02:14:49 pm

right, but at least he has gone to the trouble to produce some results, and given a rough indication of how he did them.

Too rough to reproduce the numbers but it's all just for fun anyway.

avr-gcc -v
arm-gcc -v
etc.
What voltages?

The Dhrystone says that it's not meant to use any optimizations, unless you just want to test optimizations.
But that's not the goal here. So they're just some numbers.

Quote

Which is a start, and most of us are stuck with the efficiencies of those C libraries anyway.

Inefficiencies are a bigger problem.

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 03, 2014, 04:15:58 am

Quote

Dhrystone results are tainted by the efficiency of the C library.

Of course they are. They're "tainted" by how good the C compiler is too. That's THE POINT. It's a benchmark of the chip+compiler+library SYSTEM. (and that's what SHOULD be meant by "applied no optimizations.") Benchmarks that only measure the chip are much MORE tainted.

Quote

They compile right away, with minimum changes.

Yeah, but we can't come up with numbers that compare directly to the ones in your table without seeing exactly how the wrapper code or your external calculations work. (It currently says "Dhrystones/sec/MHz", but that doesn't look right, nor does it match up very well with the data (compared to published estimates).)

Title: Re: Dhrystone 2.1 on mcus
Post by: nctnico on March 03, 2014, 09:57:30 am

Quote from: westfw on March 03, 2014, 04:15:58 am

Quote
Dhrystone results are tainted by the efficiency of the C library.
Of course they are. They're "tainted" by how good the C compiler is too. That's THE POINT. It's a benchmark of the chip+compiler+library SYSTEM. (and that's what SHOULD be meant by "applied no optimizations.") Benchmarks that only measure the chip are much MORE tainted.

Not if you are comparing compilers. IMHO there is not much use to include a slow and fast C library in a benchmark test since you can always replace a slow C library function with a faster one.

Title: Re: Dhrystone 2.1 on mcus
Post by: jaxbird on March 03, 2014, 10:48:59 am

Interesting comparison of mcus and compilers :-+

Ran a couple for comparison and adding more data, same standard Dhrystone 2.1 code I assume.

Results:

STM32F4 (168 MHz) with Keil

Compiler Options	Dhrystones/MHz/Second
-o0	806
-o1	932
-o2	942
-o3	955
-o3 -otime	3621 (likely not valid)

PIC24HJ64GP502 (80 MHz) with XC16 (free/pro)

Compiler Options	Dhrystones/MHz/Second
-o0 (free)	489
-o1 (free)	865
-o2 (pro)	1158
-o3 (pro)	1170

Edit: Added more results for PIC24H

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 03, 2014, 11:30:51 am

Quote

Yeah, but we can't come up with numbers that compare directly to the ones in your table, without seeing exactly the wrapper code or your external calculations work.

As I said in the beginning, there is nothing in my code other than calling dhrystone and flipping a pin. You then measure the duration between pin flips and calculate dhrystones per MHz per second.

for example: say it takes 5 seconds to run 10,000 loops of dhrystone on a 10 MHz mcu. That's 10000 / 10 / 5 or 200 dhrystone per MHz per second.
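
That arithmetic, as a minimal Python sketch. The function name and its parameters are made up here for illustration; nothing like it appears in the actual benchmark code.

```python
def dhrystones_per_mhz_per_second(loops, clock_mhz, duration_s):
    """Score as used in this thread: loops / clock in MHz / seconds.

    Hypothetical helper; the inputs are the loop count, the clock used
    for normalization, and the measured time between pin flips.
    """
    if clock_mhz <= 0 or duration_s <= 0:
        raise ValueError("clock and duration must be positive")
    return loops / clock_mhz / duration_s


# The example from the post: 10,000 loops, 10 MHz mcu, 5 seconds.
print(dhrystones_per_mhz_per_second(10_000, 10, 5))  # 200.0
```

Note that the result depends on which clock you divide by (oscillator or instruction clock), so it is worth stating that alongside any score.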

it may be clearer if you take a look at the code.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 03, 2014, 12:06:06 pm

Quote

-o3 -otime 3621

I ran dhrystone on an lpc2106 @ 30 MHz (12 MHz x 5 / 2) and I got 2000 plus. No time to debug it, but it wouldn't surprise me if something is being optimized away.

running a keil vs iar vs gcc comparison would be interesting.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 03, 2014, 12:54:11 pm

On the last one: all the benchmarking I have seen, including Keil's own, suggests an upper end of 1000 Dhrystones per MHz for armv7 chips. The cortex chips are crippled or smaller, so lower numbers for them make sense.

Title: Re: Dhrystone 2.1 on mcus
Post by: jaxbird on March 03, 2014, 01:54:30 pm

Quote from: dannyf on March 03, 2014, 12:06:06 pm

Quote
-o3 -otime 3621

I ran dhrystone on an lpc2106 @ 30 MHz (12 MHz x 5 / 2) and I got 2000 plus. No time to debug it, but it wouldn't surprise me if something is being optimized away.

running a keil vs iar vs gcc comparison would be interesting.

Agree, the difference is so large, it's likely not a valid result. I have not added any asserts on the outputs of the Dhrystone runs. Will try to do a validation at some point.

Yeah, it would be interesting to get some comparable results between the most used compilers and their optimizations. Of course it wouldn't necessarily mean the best result is always the best compiler, but would still be interesting though.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 01:27:45 am

Ran PIC24F again (just to be sure). The earlier numbers actually were slightly lower due to key-in errors.

Also ran -O3 on PIC24F. Over 2400 Dhrystones / Mhz. That is unbelievable, :)

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 01:46:23 am

Also added STM32F3's numbers under mdk-arm (4.x). A virtual tie with IAR when unoptimized.

Title: Re: Dhrystone 2.1 on mcus
Post by: nuhamind2 on March 04, 2014, 01:56:33 am

Could flash memory speed affect performance? At higher clock speeds wait states need to be inserted, so the performance per MHz is worse at high frequency. I know that the STM32 has a flash accelerator or something, but I don't know how efficient it is, and I think a flash accelerator won't help with indirect jumps. I don't know whether Dhrystone generates indirect jumps, though.
Just my 2c.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 02:20:11 am

I combined jaxbird's results for F4 and 24H with mine in the first post. When I get some time, I will try XC16 on the 24F as well.

On wait state, I think so as well, particularly at high speed so performance is unlikely to increase with speed linearly. But it would be difficult for me to quantify.

Title: Re: Dhrystone 2.1 on mcus
Post by: nuhamind2 on March 04, 2014, 03:05:10 am

Quote from: dannyf on March 04, 2014, 02:20:11 am

I combined jaxbird's results for F4 and 24H with mine in the first post. When I get some time, I will try XC16 on the 24F as well.

On wait state, I think so as well, particularly at high speed so performance is unlikely to increase with speed linearly. But it would be difficult for me to quantify.

How about running the benchmark at a slower frequency and disabling wait states?

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 04, 2014, 03:47:31 am

So you essentially replaced the printf code in the original benchmarks with a pin toggle for measuring the timing? And these numbers should result in the nominal DMIPS/MHz if divided by the usual magic constant (1757 for a VAX780)? That puts the PIC24 at about 1.4, which seems to be in the usual range for modern microcontrollers. (1.25 to 1.89 is what ARM quotes for CM3.)
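
A quick way to check that conversion is a few lines of Python. This is only a sketch: the helper name is invented, and the scores are taken from the table in the first post.

```python
# Dhrystones/second of the VAX 11/780, the "magic constant" quoted above.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips_per_mhz(dhrystones_per_mhz_per_second):
    """Convert a thread-style score to nominal DMIPS/MHz (hypothetical helper)."""
    return dhrystones_per_mhz_per_second / VAX_DHRYSTONES_PER_SECOND


# A few entries from the first post's table.
for chip, score in [
    ("PIC24F, C30 -O3", 2432),
    ("STM32F3, gcc-arm, unoptimized", 766),
    ("AVR, gcc-avr, unoptimized", 237),
]:
    print(f"{chip}: {dmips_per_mhz(score):.2f} DMIPS/MHz")
```

The PIC24F entry comes out at roughly 1.38, which is the "about 1.4" mentioned above.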

The interesting question is why most of the other chips' numbers are so much lower than expected.

Title: Re: Dhrystone 2.1 on mcus
Post by: nuhamind2 on March 04, 2014, 11:50:57 am

I doubt that's the case for AVR. AVR flash is always as fast as the core when fetching instructions. The only slowdown happens when doing a literal load from flash, which takes 3 cycles, compared to 2 cycles from RAM. But you only load constants from flash if you explicitly place your data in flash using some kind of modifier. Otherwise your data will be copied to RAM and loaded from there.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 12:10:34 pm

Quote

How about running the benchmark at a slower frequency and disabling wait states?

Will look into that later.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 01:47:52 pm

Played with the flash wait state (aka latency) on gcc-arm. Without optimization, the default latency setting produced a Dhrystone/MHz score of 766 on STM32F3. Setting it to 0 wait states pushes the score up by about 100, and setting it to 2 wait states pushes the score down by about 100.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on March 04, 2014, 01:56:32 pm

Interesting results, thanks :-+

Title: Re: Dhrystone 2.1 on mcus
Post by: nuhamind2 on March 04, 2014, 02:32:15 pm

Just adding some info: the stm32f3 page on Mouser states 62 DMIPS at 72 MHz (2 wait states), or 0.861 DMIPS/MHz (pretty close to the benchmark posted here), and 94 DMIPS at 72 MHz when running from CCM-RAM (0 wait states), or 1.305 DMIPS/MHz.
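
For comparison with the table in the first post, those datasheet figures can be turned into the same Dhrystones/MHz/second units. A rough Python sketch, with invented function names and assuming the usual 1757 Dhrystones/second per DMIPS:

```python
VAX_DHRYSTONES_PER_SECOND = 1757  # assumed DMIPS reference constant


def dmips_per_mhz(dmips, clock_mhz):
    """Datasheet DMIPS divided by the clock it was quoted at."""
    return dmips / clock_mhz


def to_thread_units(dmips_mhz):
    """DMIPS/MHz -> Dhrystones per MHz per second, as used in this thread."""
    return dmips_mhz * VAX_DHRYSTONES_PER_SECOND


# (label, DMIPS, MHz) as quoted from the distributor page above.
for label, dmips, mhz in [("flash, 2 wait states", 62, 72),
                          ("CCM-RAM, 0 wait states", 94, 72)]:
    ratio = dmips_per_mhz(dmips, mhz)
    print(f"{label}: {ratio:.3f} DMIPS/MHz, "
          f"about {to_thread_units(ratio):.0f} Dhrystones/MHz/s")
```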

Title: Re: Dhrystone 2.1 on mcus
Post by: jaxbird on March 04, 2014, 02:40:09 pm

I'm surprised by the difference in results between the pic24f using C30 and pic24h using XC16. It's close to twice the score for the pic24f/C30 combination.

I had a go at a few different frequencies on the pic24h to see if that makes any difference. My original test was run at ~80MHz.

4 MHz - 1195
20 MHz - 1197
80 MHz - 1190
(All with pic24h, XC16, -O3 optimization)

Very similar, I'd put the differences down to not 100% accurate time keeping, plus using the internal 7.37MHz oscillator with feedback divisor and pre/post scaler results in frequencies slightly below or above the target.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 05:34:18 pm

Yeah. I am with you on that. The PIC24 numbers are simply too good to be true. My first reaction then was that the compiler may have been coded to recognize dhrystone code.

When I get some time, I am going to insert some pin flipping code in the dhrystone itself to see if all pieces of the code are actually executed.

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 04, 2014, 06:54:37 pm

It could also be pipeline differences between MCUs that cause Dhrystone to perform very well or very badly. PIC18 does 1 instruction every 4 clock ticks, PIC24 does 1 every 2, and ARM 1 every 1; however, they probably run a pipeline (others could too). If one instruction modifies the data of the next instruction, that next instruction can be stalled until the preceding one has completed.
I don't think this changes with clock speed, because the pipeline only runs at the instruction clock.

I've now become interested to see why PIC24 is so fast, and "forked" this version of the Dhrystone: https://github.com/rkrajnc/amber/tree/5f1fc912d06346cc3266a0ed0148f0b4272f1e43/sw/dhry ( used for testing Amber ARM FPGA softcore)

Interesting part is that the test lists no of cycles for other processors like Intel i3: it says that takes 389 cycles per Dhrystone. This means 2570 Dhrystone/MIPS.
For PIC24FJ64GA004 it took 1001 cycles(XC16 -O0), so that means I got 999 Dhrystones/MIPS with XC16 -O0.
With -O3 -unroll I get 483 cycles per Dhrystone, which means 2070 Dhrystones/MIPS .
With -O3 I get 452 cycles per Dhrystone -> 2212 Dhrystones/MIPS .
With -O3 and small code model, small data model, constants in RAM I get 421 cycles per Dhrystone -> 2375 Dhrystones/MIPS .

Very much in line with what dannyf tested. But, as the PIC24 does 0.5 instructions per clock cycle (0.5 MIPS per MHz), I argue that the actual per-MHz performance is half. So I actually think you get ~1187 Dhrystone/MHz.
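
The conversion used for the PIC24 figures above, as a minimal Python sketch. The function name is made up, and the factor of 2 clocks per instruction is the PIC24 assumption argued in this post.

```python
def dhrystones_per_million_cycles(cycles_per_dhrystone):
    """One million instruction cycles divided by the cycles one run takes."""
    return 1_000_000 / cycles_per_dhrystone


CLOCKS_PER_INSTRUCTION_CYCLE = 2  # PIC24: 2 oscillator clocks per instruction cycle

# Cycle counts quoted above, keyed by configuration.
pic24_cycles = {
    "XC16 -O0": 1001,
    "XC16 -O3 -unroll": 483,
    "XC16 -O3": 452,
    "XC16 -O3, small models, const in RAM": 421,
}

for config, cycles in pic24_cycles.items():
    per_mips = dhrystones_per_million_cycles(cycles)
    per_mhz = per_mips / CLOCKS_PER_INSTRUCTION_CYCLE
    # int() truncates, which matches how the figures above were rounded.
    print(f"{config}: {int(per_mips)} per MIPS, {int(per_mhz)} per oscillator MHz")
```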

Time for STM32F407. I set up timer TIM2 with prescaler 0 (1:1) and clock division 1. The input frequency of the timer is the APB1 RCC clock, 37.5MHz, while the CPU runs at 150MHz. So each timer tick corresponds to 4 CPU cycles, i.e. the resolution is 1:4.

With that setup, the figures for IAR are (FLASH / 150MHz):
2044 cycles/Dhrystone @ no optimisations -> 489 Dhrystone/MHz
1904 cycles/Dhrystone @ low -> 525 Dhrystone/MHz
1228 cycles/Dhrystone @ medium -> 814 Dhrystone/MHz
1182 cycles/Dhrystone @ high (size) -> 846 Dhrystone/MHz
1086 cycles/Dhrystone @ high (balanced) -> 920 Dhrystone/MHz
860 cycles/Dhrystone @ high (speed) -> 1162 Dhrystone/MHz
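
For anyone repeating this, the timer-to-score conversion can be sketched in Python as below. The helper names and the raw tick count of 511 are hypothetical (chosen so that the result matches the first row); only the 150 MHz / 37.5 MHz clocks and the cycle counts come from the figures above.

```python
CPU_HZ = 150_000_000
TIMER_HZ = 37_500_000  # APB1 timer clock, so one tick = 4 CPU cycles


def cpu_cycles(timer_ticks, cpu_hz=CPU_HZ, timer_hz=TIMER_HZ):
    """Scale raw timer ticks to CPU cycles (integer clock ratio assumed)."""
    return timer_ticks * (cpu_hz // timer_hz)


def dhrystones_per_mhz(cycles_per_dhrystone):
    """One CPU cycle per clock, so 1e6 cycles correspond to 1 MHz for 1 second."""
    return 1_000_000 / cycles_per_dhrystone


# Hypothetical raw reading: 511 ticks per Dhrystone -> 2044 CPU cycles.
print(cpu_cycles(511), int(dhrystones_per_mhz(cpu_cycles(511))))

# Re-deriving the IAR list above from its cycle counts.
for cycles in (2044, 1904, 1228, 1182, 1086, 860):
    print(cycles, "->", int(dhrystones_per_mhz(cycles)))
```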

I had a plan to run the code from RAM, but if I add __ramfunc to every function, it actually gets slower. With no optimisations I get 2026 cycles/Dhrystone, but with high speed optimisation it still takes 1396 cycles/Dhrystone :-//

So I tried lowering the clock speed, to 75MHz, and even 37.5MHz, but it makes no difference. So it certainly is not the FLASH wait state.

By the way, how did you trick the PIC18 into running Dhrystone? The version I got wants to allocate 5K bytes in 1 array, which is larger than the whole memory of the chip.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 04, 2014, 07:13:13 pm

I came across this document, produced by TI: http://www.ti.com/lit/an/slaa205c/slaa205c.pdf

The chart in the back does suggest that PIC24 has lower cycle counts for math operations and comparable cycle counts with ARM7TDMI (LPC2106 class chips) in thumb mode (which is similar to STM32F chips).

So maybe what we observed is indeed within the realm of possibility.

Title: Re: Dhrystone 2.1 on mcus
Post by: smashIt on March 04, 2014, 08:14:28 pm

Quote from: hans on March 04, 2014, 06:54:37 pm

But, as the PIC24 does 0.5 instruction per MHz, I argue that the actual performance is half. So I actually think you get ~1187 Dhrystone/MHz.

I think you went the wrong direction.

If the pic has 2000 dhrystone/MHz and 2 cycles per instruction, it should do 4000 dhrystone / million instructions / second.

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 04, 2014, 09:32:06 pm

Maybe I should have explained that with cycles I meant instructions.

As 2 Hz are 1 instruction. So 2MHz yields 1MIPS.
I understand how you could assume a cycle is 1 Hz.

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 05, 2014, 08:11:34 am

Quote

PIC24 does 1 [instruction] every 2 [cycles]

Where did you get that idea? The pic24 manuals say "up to 40 MIPS" (and 40MHz clock) and:

Quote

All instructions execute in a single cycle, with the exception of instructions that change the program flow,

(Of course, the 8-bit PICs also say something like that, and elsewhere define a "cycle" as "four oscillator clocks", but I believe that the PIC24 is really 1 instruction per clock...)

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 05, 2014, 09:12:13 am

Look up the datasheet of a PIC24FJ64GB004. Even the summary page says;
"Up to 16 MIPS Operation @ 32 MHz"

PIC24 and dsPIC practically all use the same core, but with different speeds, DSP instructions, and the E/H series have some instructions removed & added.
For example: a PIC24EP128GP202 can run at 60/70 MIPS (depending on temperature range), and the PLL has a maximum output of 120 or 140MHz.
A DSPIC33FJ128GP804 can clock up to 80MHz, which yields 40MIPS.

That's why I focused on the no. of instructions PIC24 uses to complete a Dhrystone, and calculate from there.

Title: Re: Dhrystone 2.1 on mcus
Post by: JTR on March 05, 2014, 10:28:59 am

Quote from: westfw on March 05, 2014, 08:11:34 am

(Of course, the 8-bit PICs also say something like that, and elsewhere define a "cycle" as "four oscillator clocks", but I believe that the PIC24 is really 1 instruction per clock...)

The PIC24 is a minimum of two clocks per instruction, gospel...

Title: Re: Dhrystone 2.1 on mcus
Post by: nuhamind2 on March 05, 2014, 10:31:11 am

Perhaps the better wording is every instruction take the multiple of 2 clock cycles

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 05, 2014, 11:01:52 am

Quote

Where did you get that idea?

The datasheet? It is fairly easy to pick it up there, actually.

Quote

The PIC24 is a minimum of two clocks per instruction, gospel...

Yeah. It has been well known for quite some time by now, :)

Quote

Perhaps the better wording is every instruction take the multiple of 2 clock cycles

Yeah.

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 05, 2014, 11:28:06 am

Huh. I guess the PIC24H parts can double the (up to) 40MHz external clock, yielding an 80MHz internal clock and 40 MIPS instruction rate... I hadn't realized that the internal clock rate had gotten so high!

Title: Re: Dhrystone 2.1 on mcus
Post by: legacy on March 05, 2014, 12:23:18 pm

can i see the C code of the test ? i'd like to test my board/toolchain

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 05, 2014, 01:58:04 pm

I did some testing. I put some pin-flipping patterns into the various Procx() called in dhrystone benchmark. Those patterns differ from each other so by observing them on the pins I get to confirm that those routines are called.

I did in fact observe those patterns so they are indeed called by the benchmark, both in debug and release modes. That would suggest that the dhrystone numbers for PIC24F are real -> fairly remarkable I think.

Two possible shortfalls here:

1) maybe the routines are only called with those patterns inserted: instead of inserting those patterns, I did an OR 0x00 on the output port and the time I got is similar to the original score.
2) maybe the routines are called but the results are faulty: I did not investigate that.

Too bad that PIC24 is really under-marketed by Microchip and under-appreciated by the masses.

Title: Re: Dhrystone 2.1 on mcus
Post by: diyaudio on March 05, 2014, 02:04:42 pm

Quote from: dannyf on March 05, 2014, 01:58:04 pm

Too bad that PIC24 is really under-marketed by Microchip and under-appreciated by the masses.

I just got on board with the PIC24 a month ago (after I received my samples) as I ran out of I/O using an 18F series. Thus far it's very interesting. It's true that it's under-marketed: there are few open source projects or material on the net.

Title: Re: Dhrystone 2.1 on mcus
Post by: jaxbird on March 06, 2014, 02:24:58 pm

No doubt, that Microchip designed a killer 16 bit series with the pic24/dsPic33 series. It's been my favorite in this class for the last couple of years. For me it started with a $25 microstick dev board including a couple of dip package mcus to try it out.

Just want to make sure we agree that we are calculating results per MHz and not per MIPS. The results posted by Hans and my results are very similar 1100+ from Hans with pic24f and XC16. Mine at 1190 using pic24h and XC16.

Dannyf: I know you used C30, did you have a chance to run your tests with XC16 for comparison?

Just for fun I had a go at overclocking. The pic24h I tested will run at 120MHz+, not bad at all. But of course pointless as it's not guaranteed to run stable within the temperature specs at that clock speed. But you could probably run it at 100MHz without any problems as long as it's not exposed to extreme temperatures.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 06, 2014, 06:52:51 pm

I did. The numbers are posted earlier and updated in the first post. The XC16 produced comparable but marginally slower performance vs. C30.

Title: Re: Dhrystone 2.1 on mcus
Post by: JTR on March 06, 2014, 10:18:21 pm

Anyway, for all that has been said and argued, the simple fact is that there is no difference between the PIC24F and the PIC24H in terms of instructions per MHz. Both of them scale linearly and in lock step to each other. Ergo, there cannot be a difference in Dhrystones per MHz. The fact that the PIC24F is listed here as having pretty much double the performance of the PIC24H (with same compiler and settings) itself has to be a red flag that something is wrong with the calculations.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 07, 2014, 12:48:32 am

Quote

pretty much double the performance

"double the performance" = 2x.

I wonder what might have caused that, :)

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 07, 2014, 12:49:37 am

Added numbers for LM4F120 (CM4) and LPC1343 (CM3).

Title: Re: Dhrystone 2.1 on mcus
Post by: BravoV on March 07, 2014, 04:37:45 am

Quote from: dannyf on March 01, 2014, 06:40:22 pm

LM4F120: 2,914, MDK-ARM, optimized (-O3 + time)

AVR90USB1286: 237, gcc-avr

Quote from: dannyf on March 07, 2014, 12:49:37 am

Added numbers for LM4F120 (CM4) ....

Cool numbers. I never did a thorough comparison on my own, but this at least confirms my limited observation on the TI CM4: it "feels" so fast even on my bloated noob code. Btw, I migrated from AVR, which was my sole mcu in the past. Thanks.

The worst part is, I've become too spoiled and lazy, and it totally screws my effort to learn code optimization. :palm:

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 07, 2014, 12:24:20 pm

Quote

Just want to make sure we agree that we are calculating results per MHz and not per MIPS.

That's what I suspected. I used an 8MHz crystal in my test and used 4MHz for the calculation, as the cpu is actually running at 4MHz. I understand if you used 8MHz - both approaches have a rationale. Thus I kept the numbers the way they are, knowing that people may think one is more valid than another.
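
To put the two conventions side by side, here is a small Python sketch that rescales a score from one clock basis to the other. The function name is hypothetical; the 2,432 and 2,403 figures are the PIC24F -O3 entries from the first post.

```python
def renormalize(score, mhz_used, mhz_wanted):
    """Rescale a per-MHz score computed with one clock to another clock basis."""
    return score * mhz_used / mhz_wanted


# PIC24F scores were computed with the 4 MHz instruction clock;
# re-express them per MHz of the 8 MHz crystal.
print(renormalize(2432, mhz_used=4, mhz_wanted=8))  # 1216.0
print(renormalize(2403, mhz_used=4, mhz_wanted=8))  # 1201.5
```

On the 8 MHz basis the PIC24F figures land close to the ~1,190 reported for the PIC24H, if those were divided by the oscillator frequency.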

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 08, 2014, 01:59:11 pm

STM32F4 added for gcc-arm (running at 16MHz): in line with the F3's numbers.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 08, 2014, 02:09:23 pm

STM32F4 under iar-arm added. I think this is the first time in this test where gcc-arm is faster than iar-arm, with optimization turned on. The validity of the test with optimization turned on, however, is dubious without further investigation.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 08, 2014, 03:22:32 pm

Also added are the STM32F4 numbers under mdk.

IAR and Keil seem to be running neck and neck. GCC appears to be quite a bit slower than either IAR or Keil.

Title: Re: Dhrystone 2.1 on mcus
Post by: nctnico on March 08, 2014, 08:49:55 pm

Replace the C library functions with internal ones and test again to make sure it's not the C library making the difference between IAR and GCC.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 09, 2014, 12:46:30 am

Added LPC1114 (=CM0 running at 12Mhz). The unoptimized Dhrystone/Mhz number is actually quite comparable to STM8S' (=good old 6502).

The old clunker isn't that bad, after all. :)

Or to put it another way, the OEMs are reasonably honest when they say that the CM0/1 chips are meant to compete with the 8-bitters.

Title: Re: Dhrystone 2.1 on mcus
Post by: jaxbird on March 09, 2014, 11:51:38 am

Quote from: dannyf on March 07, 2014, 12:24:20 pm

Quote
Just want to make sure we agree that we are calculating results per MHz and not per MIPS.

That's what I suspected. I used an 8Mhz crystal in my test and used 4Mhz for the calculation, as the cpu is actually running at 4Mhz. I understand if you used 8Mhz - both approaches have a rationale. So I kept the numbers the way they are, knowing that people may think one is more valid than the other.

In my tests, using the internal oscillator, the configuration was like this:

Fosc = ((7.37MHz * M) / N1) / N2

Where M, N1 and N2 are PLLFBD, PLLPOST and PLLPRE.

And the actual values used:

M = 65
N1 = 2
N2 = 3

Giving Fosc = 79.841 MHz. (+/- 2%)
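
As a quick check of the arithmetic above, here is a minimal Python sketch of the formula exactly as written in this post. The function and argument names are made up for illustration, and the divider naming follows the post rather than any datasheet.

```python
def fosc_mhz(f_in_mhz, m, n1, n2):
    """Fosc = ((Fin * M) / N1) / N2, as written in the post."""
    return f_in_mhz * m / n1 / n2


# Values quoted in the post: 7.37 MHz internal oscillator, M = 65, N1 = 2, N2 = 3.
fosc = fosc_mhz(7.37, 65, 2, 3)
print(f"Fosc = {fosc:.3f} MHz")
```

The result lands a hair under 80 MHz, in agreement with the figure quoted in the post.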

The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.

Title: Re: Dhrystone 2.1 on mcus
Post by: nctnico on March 09, 2014, 12:49:36 pm

Quote from: dannyf on March 09, 2014, 12:46:30 am

Added LPC1114 (=CM0 running at 12Mhz). The unoptimized Dhrystone/Mhz number is actually quite comparable to STM8S' (=good old 6502).

The old clunker isn't that bad, after all. :)

Or to put it another way, the OEMs are reasonably honest when they say that the CM0/1 chips are meant to compete with the 8-bitters.

What you wrote above makes all my alarm bells ring. I very much doubt that comparing unoptimised results has any real value. Unoptimised code is mostly used for debugging, where each line of code maps directly to some assembly. It is not meant to be production-grade code; no-one in their right mind would ship that in a product. A real test would be to optimise for size and for speed.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 09, 2014, 01:10:46 pm

Quote

The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.

You are using a 24H part, right? Take a look at the clock tree. Fcy is Fosc/2, assuming doze is not set.

On 24F parts, you have to go through the datasheet to find that out.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on March 09, 2014, 10:48:17 pm

Quote from: dannyf on March 09, 2014, 12:46:30 am

STM8S' (=good old 6502).

You keep on saying that, I didn't know that, is there some info on that?
I know the STM8 also has only the X and Y register (unfortunately, if they had added some extra registers that would have been nice).
But is that the only similarity?

Title: Re: Dhrystone 2.1 on mcus
Post by: jaxbird on March 23, 2014, 01:24:32 pm

Quote from: dannyf on March 09, 2014, 01:10:46 pm

Quote
The datasheet/reference does list most of the instructions as executing in a single cycle, but I do find that a bit questionable as it's clearly not oscillator clock cycles they are referring to.

You are using a 24H part, right? Take a look at the clock tree. Fcy is Fosc/2, assuming doze is not set.

On 24F parts, you have to go through the datasheet to find that out.

Yeah, it's defined as Fcy = Fosc / 2. So 2 clocks required minimum per instruction.

Anyway, not important, my motivation was primarily to find the main reason for the large differences in measured performance. I believe we agree this is where our calculations differ, so I'm satisfied :)
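
To make the factor-of-two disagreement concrete, a small Python sketch follows. It normalizes one and the same raw result by Fosc and by Fcy. The raw Dhrystones-per-second value is a made-up placeholder, not a measurement from this thread, and the function name is hypothetical.

```python
def per_mhz(dhrystones_per_sec, clock_mhz):
    """Normalize a raw Dhrystones/second result by a chosen clock."""
    return dhrystones_per_sec / clock_mhz


FOSC_MHZ = 79.841        # oscillator frequency from the PLL settings quoted in the thread
FCY_MHZ = FOSC_MHZ / 2   # instruction clock, Fcy = Fosc / 2

RAW = 50_000.0           # hypothetical Dhrystones per second, for illustration only

by_fosc = per_mhz(RAW, FOSC_MHZ)
by_fcy = per_mhz(RAW, FCY_MHZ)
print(f"per MHz of Fosc: {by_fosc:.0f}")
print(f"per MHz of Fcy:  {by_fcy:.0f}")
print(f"ratio: {by_fcy / by_fosc:.1f}")
```

Whatever the raw score, the two conventions differ by exactly the Fosc/Fcy ratio. That is the same factor of two as the 8Mhz crystal versus 4Mhz cpu clock choice discussed earlier.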

Title: Re: Dhrystone 2.1 on mcus
Post by: GiskardReventlov on March 25, 2014, 04:24:27 pm

Quote from: dannyf on March 01, 2014, 06:40:22 pm

Didn't run on 8051 but would expect it to hold its own reasonably well.

Hold its own what? I would say that this all falls under the category of "Premature optimization", but maybe not.
Can you provide a few cases where this kind of performance is the keystone in a design?
I'm curious to know where this metric would be the top design decision.

I've been learning more about uC and I see that some uC have 8051 cores in them, don't know if any you tested do.
Do you know if any do?

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 25, 2014, 10:58:57 pm

Added (simulated) scores for C51 (an NXP P87C51MC2, in order to hold the data). Scores were obtained in simulation under Keil C51, with a 24Mhz crystal, and calculated off a 2Mhz core frequency (I think the chip is a 12-cycle C51).

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 26, 2014, 12:44:22 pm

STM8s are advertised by ST as a 0.25DMIPS/Mhz chip. That translates into about 430 dhrystones/Mhz, consistent with our measurements here.

Unfortunately, for the CMx chips, we are getting about 50 - 75% of the numbers published by ARM / vendors.
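
The conversion used above can be sketched in a few lines of Python. It relies on the usual definition of 1 DMIPS as 1757 Dhrystones per second (the VAX 11/780 reference score). The function names are illustrative only.

```python
VAX_DHRYSTONES_PER_SEC = 1757  # reference machine score that defines 1 DMIPS


def dmips_to_dhrystones(dmips_per_mhz):
    """DMIPS/MHz -> Dhrystones per second per MHz."""
    return dmips_per_mhz * VAX_DHRYSTONES_PER_SEC


def dhrystones_to_dmips(dhrystones_per_mhz):
    """Dhrystones per second per MHz -> DMIPS/MHz."""
    return dhrystones_per_mhz / VAX_DHRYSTONES_PER_SEC


# ST's advertised STM8 figure, converted to the units used in this thread.
print(round(dmips_to_dhrystones(0.25)))

# The unoptimized STM8S score listed in this thread, converted back to DMIPS/MHz.
print(round(dhrystones_to_dmips(434), 3))
```

The same helper applies to any vendor DMIPS/MHz claim, so published figures can be compared directly against the per-MHz scores in the list.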

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 26, 2014, 02:42:38 pm

The DMIPS/Mhz numbers for the 8051 vary a lot, from lows of <0.1DMIPS/Mhz to highs of 0.5DMIPS/Mhz, with 0.25DMIPS/Mhz (about 400+ dhrystones per Mhz) being quoted quite often.

Fairly remarkable in that a chip from the 1980s is as fast as a chip introduced in the last 10 years (STM8).

Not sure what 6500 has in terms of dhrystones scores.

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 26, 2014, 05:13:09 pm

Quote

Fairly remarkable in that a chip from the 1980s is as fast as [a newer chip]

You're still measuring DMIPS/MHz, right? That's not "speed", that's just "architectural efficiency at running C code" or something like that. The RISC claim is not so much that their architectures are fundamentally faster, just that they permit building SIMPLER chips, which in turn allows the clock rate to be pushed up and give you an overall faster chip.

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 26, 2014, 05:37:57 pm

That's true. The Pentium 3 at 1GHz was faster than a Pentium 4 at 1GHz, but the Pentium 4 could clock far higher with its new pipeline design. At the end of its range it maxed out just under 4GHz or so, and we haven't seen much higher ever since (for example, my i5 3570K steps up to 3.9GHz under 1-core load). The only things that keep pushing performance up have been multi-threading and more efficient CPUs, with larger/better caches, more instructions to play with (if programs are enabled for them), etc.

An interesting dimension to add is power consumption per MHz. From that you could then calculate a performance/energy figure: you have both Dhrystones/MHz and mA/MHz, and dividing one by the other gives a Dhrystone/mA ratio, or simply put "computing efficiency". That would be interesting for low power electronics like battery powered stuff, whose main driver is the MCU doing stuff on a regular basis.
I don't know if it's acceptable to take these figures from the datasheet.. it can depend a lot on what peripherals are turned on (ARM) or on the supply voltage.
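
A rough Python sketch of that "computing efficiency" ratio is below. The chip names, scores and current figures are placeholders chosen for illustration, NOT datasheet or measured values, and the function name is hypothetical.

```python
def dhrystones_per_ma(dhrystones_per_mhz, ma_per_mhz):
    """(Dhrystones/s per MHz) / (mA per MHz) -> Dhrystones/s per mA."""
    return dhrystones_per_mhz / ma_per_mhz


# name: (Dhrystones/MHz, mA/MHz); all numbers are made-up placeholders.
chips = {
    "chip A": (1000.0, 0.30),
    "chip B": (450.0, 0.10),
}

for name, (score, current) in chips.items():
    print(f"{name}: {dhrystones_per_ma(score, current):.0f} Dhrystones/s per mA")
```

With these placeholder numbers the slower chip comes out ahead per mA. That is exactly the kind of reordering such a metric could reveal.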

I think I have a board lying around with a PIC32 on it. I will see if I can run the test on that too, see how well MIPS4k compares. They claim 1.65DMIPS/MHz on that.

Title: Re: Dhrystone 2.1 on mcus
Post by: westfw on March 26, 2014, 05:42:16 pm

(In this case, we're saying STM8 is as fast as CM0 (in DMIPS/MHz), but STM8 tops out at 16MHz, while CM0 in the same price range run 48-72MHz...) (I count that as about 4x the DMIPS/Dollar...)
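
The clock-ceiling argument works out as follows in a short Python sketch. It assumes, as the post does, the same per-MHz score for both cores; the 434 figure is the unoptimized STM8S score from this thread, reused for the CM0 purely for illustration.

```python
def throughput(per_mhz_score, clock_mhz):
    """Absolute Dhrystones/second = per-MHz score * clock in MHz."""
    return per_mhz_score * clock_mhz


SCORE = 434      # assumed identical for STM8 and CM0, per the post
STM8_MHZ = 16    # STM8 ceiling quoted in the post

stm8 = throughput(SCORE, STM8_MHZ)
for cm0_mhz in (48, 72):
    ratio = throughput(SCORE, cm0_mhz) / stm8
    print(f"CM0 @ {cm0_mhz} MHz: {ratio:.1f}x the STM8 @ {STM8_MHZ} MHz")
```

At equal per-MHz efficiency the advantage is simply the clock ratio, which brackets the "about 4x" estimate.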

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 26, 2014, 09:08:25 pm

Added the simulated results for PIC32MX320F128H, under an old C32 compiler.

The unoptimized figure translates to 0.75 DMIPS/Mhz, and 2.0 DMIPS/Mhz optimized - not that believable, however.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 26, 2014, 09:09:24 pm

Quote

I will see if I can run the test on that too, see how well MIPS4k compares. They claim 1.65DMIPS/MHz on that.

Would love to see where the real thing comes out to be.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on March 27, 2014, 08:04:14 am

Quote from: westfw on March 26, 2014, 05:42:16 pm

(In this case, we're saying STM8 is as fast as CM0 (in DMIPS/MHz), but STM8 tops out at 16MHz, while CM0 in the same price range run 48-72MHz...) (I count that as about 4x the DMIPS/Dollar...)

Small correction: STM8 tops out at 24MHz, but then needs an external oscillator.
The CM0 will give a great increase in speed, BUT at the cost of code size: CM0 code is 30-35% larger than STM8 code, and that also adds to the final cost.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 27, 2014, 03:43:49 pm

The dhrystone benchmark for 6502 (actually 65C02), that I can find, suggests a 0.022 DMIPS/Mhz (not sure if it is scaled by 2 or not). That would translate into a dhrystone score of 30 / Mhz. Slower than a PIC, :)

Title: Re: Dhrystone 2.1 on mcus
Post by: GiskardReventlov on March 27, 2014, 05:37:28 pm

Was curious when I discovered that 8051 (and 80C51?) cores are in a lot of uC. A quick digikey search shows 600-700 (plus have to subtract tape&reel, etc.). So let's say 500 (but still less if you subtract pkg types), but still a lot. Or is that the only way to get an 8051? i.e. they only come as a core?

What does the C in 80C51 designate? (or 65C02)

Title: Re: Dhrystone 2.1 on mcus
Post by: miguelvp on March 27, 2014, 06:07:18 pm

Quote from: GiskardReventlov on March 27, 2014, 05:37:28 pm

What does the C in 80C51 designate? (or 65C02)

CMOS

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 29, 2014, 05:12:33 pm

Okay, I've got a PIC32 target board (PIC32MX440F512H @ 80MHz) and ran the Dhrystone benchmark.

Setup	Cycles @ 80MHz	Dhrystone @ 80MHz	Code Size @ 80MHz	Cycles @ 20MHz	Dhrystone @ 20MHz	Code Size @ 20MHz
No optimizations	1312	762	17436	996	1004	17436
GCC optimize level 1	631	1584	15260	455	2197	15256
GCC optimize level 2	481	2079	15156	345	2898	15152
GCC optimize level 3	445	2247	15216	339	2949	15212
GCC optimize level 3 + unroll loops	437	2288	15308	326	3067	15300
GCC optimize level s(ize)	680	1470	15400	483	2070	15396
See text / speed	423	2364	15308	326	3067	15300
See text / size	1407	710	10552	1232	811	10548
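
The relationship between the "Cycles" and "Dhrystone" columns is not spelled out, but the numbers are consistent with 1,000,000 divided by the cycle count, truncated to an integer (that is, Dhrystones per second per MHz). The Python sketch below checks that inference against the 80MHz rows. The function name is made up, and the formula is inferred from the table rather than stated by the poster.

```python
def dhrystones_per_mhz(cycles_per_dhrystone):
    """1 MHz is 1,000,000 cycles per second; the table appears to truncate."""
    return 1_000_000 // cycles_per_dhrystone


# (cycles, reported score) pairs copied from the 80MHz columns of the table.
rows_80mhz = [
    (1312, 762), (631, 1584), (481, 2079), (445, 2247),
    (437, 2288), (680, 1470), (423, 2364), (1407, 710),
]

for cycles, reported in rows_80mhz:
    computed = dhrystones_per_mhz(cycles)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{cycles:5d} cycles -> {computed:5d} (table: {reported:5d}) {status}")
```

The 20MHz columns follow the same rule, for example 996 cycles gives 1004 and 326 cycles gives 3067.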

After the standard GCC 0/1/2/s, I tried the following:

"Remove unused sections" -> Code size dropped to 14880 bytes, execution time increased to 445 (?). Slower, so I turned it off.
"Optimization level stdlib" set to level 3 -> 436 cycles / Dhrystone! (-1 cycle) Code is 16176 bytes (+868 bytes).
"Isolate each function in its own section" -> 423 cycles / Dhrystone (-14 cycles / 2364 Dhrystone/MHz). Code size was 15308 bytes (+0 bytes).
"Use legacy stdlib" -> Back to 439 cycles, code 33400 bytes. Nope, not attractive feature at all :)
"Generate 16-bit code" -> 803 cycles / dhrystone, code dropped to 15144 bytes. Quite a performance hit.

Out of personal interest, the smallest code I could get was 10552 bytes, with GCC 16-bit code, GCC optimize s, no unroll loop, remove unused functions, allow section overlap, optimize s for stdlib, 16-bit in linker (for stdlib?). Execution took 1407 cycles, though.

Now, I first did the test only @ 80MHz, wrote the above and thought "something doesn't seem right", as the 1.65DMIPS/MHz claim should give about 2900 Dhrystone/MHz. I quickly thought about the FLASH accelerator, and I suspect it's not as good as on most modern ARM chips or the new PIC32MZ. So I reran the test at 20MHz (which seems to run at full speed, e.g. lowering the clock further doesn't yield faster results) and got these. They seem about right, or better than claimed.
Also glad to see the new PIC32MZ has a FLASH accelerator that should do the business @ full speed :)

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 29, 2014, 06:33:53 pm

Thanks. Your numbers are fairly consistent with mine (simulated). The poorer efficiency at higher frequency makes sense to me - maybe due to flash wait state.

The optimized numbers, not just for pic32 but for other mcus as well, are not that believable: you have to investigate each routine to make sure that it is not being optimized away. One way to do that is to write something to a port in those routines so they don't get cut by the compiler, and then watch the port to make sure that they are actually run. Too much work.

Title: Re: Dhrystone 2.1 on mcus
Post by: hans on March 29, 2014, 06:56:24 pm

Yes I suspect it's the FLASH accelerator, as the PIC32MZ datasheet goes into the prefetch module, which is only 16 bytes deep. So I am pretty certain any call will result in a performance drop.

Unless the compiler inlines all the test functions, of course. In that case maybe a "dummy" function is possible so it tricks the compiler into thinking those functions are used more often, and "can't" inline them anymore. However, as the compiler will try to do the same on my "real" programs, I take it as a feature :)
Also, too much effort, and would need to do the same for every other chip as they also use GCC.

Sorry I forgot to add the compiler; it's XC32 v1.21

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 29, 2014, 07:14:53 pm

Updated the list for your compiler.

I compared XC32 vs. C32 in simulation and the numbers I got are almost identical so I think they are (essentially?) the same, just different branding.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 30, 2014, 08:53:50 pm

Added MSP430 scores. Fairly respectable but considerably lower than the PIC24's.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on March 31, 2014, 12:01:35 pm

Danny what was the difference in compiler settings between these two measurements?

Quote

STM32F4: 1,053, IAR-ARM
STM32F4: 494, IAR-ARM

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 31, 2014, 10:57:31 pm

Kjelt, the 1053 score came from me, IAR-ARM, no optimization.

The 494 score came from one of the participants and I can look back and see exactly what it is.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on March 31, 2014, 11:01:06 pm

hans got a score of 489 for STM32F4 using IAR, no optimization. I may have mis-transcribed it.

Corrected now.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on April 01, 2014, 08:11:07 am

But a difference of a factor two with the same compiler and compiler settings does not compute?

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on April 01, 2014, 10:52:06 am

Sure.

But I am not in the business of validating someone's results. They are what they are, as reported here.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on May 26, 2014, 09:43:58 pm

Updated simulated benchmark for PIC32MZ (not a single chip available) and PIC24F (XC16 and C30).

First, PIC32MZ vs. PIC32MX:

Quote

PIC32MX320: 3,378 C32 2.x, optimized (-O3)
PIC32MX440: 3,067 XC32 1.21, optimized (-O3) - @ 20Mhz
PIC32MX440, 2,288 XC32 1.21, optimized (-O3) @ 80Mhz
PIC32MX320: 1,151, C32 2.x
PIC32MX440, 1,004, XC32 1.21 @ 20Mhz
PIC32MX440, 762, XC32 1.21, @ 80Mhz

Simulation only:
PIC32MZ: 1173
PIC32MZ: 3,413, O3

The results look to be roughly comparable: non-optimized at around 1200 and optimized around 3400.

The compiler for PIC32MZ is XC32.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on May 26, 2014, 09:46:48 pm

Secondly, PIC24F actual vs. simulation:

Quote

Measured:
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F: 2,403 XC16 pro, optimized (-O3)
PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
PIC24F: 1,237, C30 2.x,
PIC24F: 1,106 [compiler?] optimized (-O3)
PIC24F: 993, XC16 free,

Simulation only:
PIC24F: 2,433 C30, O3
PIC24F: 2,404, XC16, O3
PIC24F: 1,215 C30
PIC24F: 978 XC16

Very consistent results between the measured benchmarks vs. simulated benchmarks.

Not unexpected.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on May 26, 2014, 09:52:27 pm

Now, XC16 vs. C30:

Quote

Optimized C30:
PIC24F: 2,432, C30 2.x, optimized -O3 (speed)
PIC24F (sim): 2,433 C30, O3
//PIC24F: 1,901, C30 2.x, optimized -O2 (speed)
//PIC24F: 1,106 [compiler?] optimized (-O3)

Optimized XC16:
PIC24F: 2,403 XC16 pro, optimized (-O3)
PIC24F (sim): 2,404, XC16, O3

Slight edge to C30, about 1% higher scores.

Quote

Non-optimized C30:
PIC24F: 1,237, C30 2.x,
PIC24F (sim): 1,215 C30

Non-optimized XC16:
PIC24F: 993, XC16 free,
PIC24F (sim): 978 XC16

C30 still leads, with a 20% margin.

The old dog does have some tricks, :)
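
The margins quoted above depend on which compiler is taken as the baseline, so here is a small Python sketch that computes both views from the scores in this post. The function names are made up for illustration.

```python
def lead_pct(a, b):
    """How much higher a is than b, as a percentage of b."""
    return (a / b - 1) * 100


def deficit_pct(a, b):
    """How much lower b is than a, as a percentage of a."""
    return (1 - b / a) * 100


# (C30 score, XC16 score) pairs taken from the quoted lists.
pairs = {
    "optimized, measured": (2432, 2403),
    "optimized, simulated": (2433, 2404),
    "non-optimized, measured": (1237, 993),
    "non-optimized, simulated": (1215, 978),
}

for label, (c30, xc16) in pairs.items():
    print(f"{label}: C30 leads by {lead_pct(c30, xc16):.1f}%, "
          f"XC16 trails by {deficit_pct(c30, xc16):.1f}%")
```

For the non-optimized builds, XC16 trails C30 by roughly 20%, which matches the figure in the post. Measured the other way round, C30 leads XC16 by closer to 25%.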

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on August 14, 2014, 11:31:32 am

I was quite impressed with the dhrystone performance of PIC24 - they are surprisingly fast in the benchmarks we have done earlier.

It turns out that TI has also done a series of benchmarks of 16-bit / 8-bit mcus, as published in SLAA205.

Here is a screen shot of one of the tables in the appnote. Look at the cycle counts for pic24/dspic.

Wow!

For a job well done, Microchip. Wish they had done more to push that chip.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on August 14, 2014, 12:03:41 pm

Here they use another benchmark: coremark, might be interesting for comparison also.
http://www.eembc.org/coremark/

Title: Re: Dhrystone 2.1 on mcus
Post by: bwat on August 14, 2014, 12:29:25 pm

Quote from: Kjelt on August 14, 2014, 12:03:41 pm

Here they use another benchmark: coremark, might be interesting for comparison also.
http://www.eembc.org/coremark/

Coremark was mentioned on the very first page. :)

Title: Re: Dhrystone 2.1 on mcus
Post by: amyk on August 14, 2014, 12:30:39 pm

There's a list of DMIPS/MHz here that you may find helpful for comparison:
http://en.wikipedia.org/wiki/Instructions_per_second#Timeline_of_instructions_per_second

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on August 14, 2014, 05:32:51 pm

I have a workstation with dual quad-core Xeon (3.0Ghz+), equivalent to 250,000 dmips.

Pretty impressive vs. any Cortex M3 machines, :).

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on September 11, 2014, 11:26:59 am

Added STM32F030F - from the ghetto thread.

Not terribly impressive, more on the lower-end of the CMx chips and not that much faster than typical 8-bit chips. Its advantage I think is in its ability to run really fast - I have clocked the little guy at over 64Mhz (flash wait = 1) or 52Mhz (flash wait = 0). So its raw MIPS numbers are still impressive, in spite of its mediocre MIPS/Mhz numbers.

Among the 8-bit / 16-bit chips, PIC24F remains the champ in terms of the MIPS/Mhz race. It just has limited frequency range.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on September 24, 2014, 12:30:31 am

Dug up an old STM8 training presentation.

Notice the 0.29dmips figure? That translates to about 500 dhrystones / Mhz.

Very close to our test figures:

Code:

STM8S:         482,  IAR-STM8,  optimized
STM8S:         434,  IAR-STM8,
PIC18F26K20:   380,  XC8 pro
PIC18F26K20:   323,  XC8 free
PIC18F26K20:   322,  PICC18 pro
AVR90USB1286:  237,  gcc-avr
PIC18F26K20:   168,  PICC18 lite

The presentation is also right that that kind of speed is 1.5 - 3x of some PICs and AVR.
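
The "1.5 - 3x" range can be checked against the scores listed above with a short Python sketch. It takes the optimized STM8S score as the reference. The dictionary keys are just labels built from the list entries, and the function name is hypothetical.

```python
STM8S_OPTIMIZED = 482  # optimized IAR-STM8 score from the list above

# Dhrystones/MHz for the other 8-bit entries in the list.
others = {
    "PIC18F26K20, XC8 pro": 380,
    "PIC18F26K20, XC8 free": 323,
    "PIC18F26K20, PICC18 pro": 322,
    "AVR90USB1286, gcc-avr": 237,
    "PIC18F26K20, PICC18 lite": 168,
}


def speedup(reference, other):
    """How many times faster the reference score is."""
    return reference / other


for name, score in others.items():
    print(f"STM8S vs {name}: {speedup(STM8S_OPTIMIZED, score):.2f}x")
```

Most of the entries fall between about 1.5x and just under 3x. The XC8 pro result is the one exception, coming in below 1.5x.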

Title: Re: Dhrystone 2.1 on mcus
Post by: BravoV on September 24, 2014, 02:46:14 am

Any plan for TI's TM4C123x or TM4C129x series ?

Title: Re: Dhrystone 2.1 on mcus
Post by: coppice on September 24, 2014, 06:37:56 am

Quote from: GiskardReventlov on March 27, 2014, 05:37:28 pm

Was curious when I discovered that 8051 (and 80C51?) cores are in a lot of uC. A quick digikey search shows 600-700 (plus have to subtract tape&reel, etc.). So let's say 500 (but still less if you subtract pkg types), but still a lot. Or is that the only way to get an 8051? i.e. they only come as a core?

What does the C in 80C51 designate? (or 65C02)

The 8051 is a core anyone can use without paying royalties, and multiple reasonably good toolchains are available to support it. That makes dropping in an 8051 core a no-brainer for many people who need a simple low performance core in a chip. The original 8051 takes 12 clock cycles for one machine cycle. There are versions today which run at one clock cycle per machine cycle, so it can be reasonably fast. You will find an 8051 at the heart of a lot of devices you didn't even realise were processor based, as the user just sees them as a black box.

Every single engineer graduating in China has studied the 8051 in detail. This is a huge incentive for people to put it in their chips, as a huge number of the engineers developing with 8 bit MCUs today are in China. If you try to sell them a chip with an 8051 core you can talk about the things which make the chip interesting. If you have any other core the conversation ends up mostly about tools, and how much hassle it will be to get up to speed with the core. Even very popular cores, like the PICs, have this disadvantage in China. Many people will recognise a parallel with ARM cores in bigger MCUs on a global level. If a 32 bit MCU doesn't have an ARM core it will be a big struggle to sell it to a lot of people.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on September 24, 2014, 06:44:28 am

8051 has been put in its grave for a long long time, it's now a zombie created by some companies lacking innovation effort.
Unfortunately zombies refuse to stay in their grave and act dead.

Title: Re: Dhrystone 2.1 on mcus
Post by: coppice on September 24, 2014, 08:16:26 am

Quote from: Kjelt on September 24, 2014, 06:44:28 am

8051 has been put in its grave for a long long time, it's now a zombie created by some companies lacking innovation effort.
Unfortunately zombies refuse to stay in their grave and act dead.

That is the exact opposite of reality. People being really innovative seldom give a damn about the core. If that's where your innovation lies it's pretty weak. It would be great if a better free to use core with good tools were available, but the 8051 is good enough for a lot of devices.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on September 24, 2014, 03:35:23 pm

Quote from: coppice on September 24, 2014, 08:16:26 am

That is the exact opposite of reality. People being really innovative seldom give a damn about the core.

Which reality is that then? If you look around, almost all innovations start with the availability of new technology: either faster, with more processing power to establish and realize new technologies, or using less power so it can run on different (mobile) platforms.

Title: Re: Dhrystone 2.1 on mcus
Post by: mikerj on September 24, 2014, 05:01:20 pm

Quote from: Kjelt on September 24, 2014, 03:35:23 pm

Quote from: coppice on September 24, 2014, 08:16:26 am
That is the exact opposite of reality. People being really innovative seldom give a damn about the core.
Which reality is that then? If you look around, almost all innovations start with the availability of new technology: either faster, with more processing power to establish and realize new technologies, or using less power so it can run on different (mobile) platforms.

Not so. Simply using a faster microcontroller does not make a design innovative, it's what you do with that microcontroller that is important. Innovative new designs using old (and extremely cheap) 8 bit micros like the 8051 are regularly developed.

Title: Re: Dhrystone 2.1 on mcus
Post by: coppice on September 24, 2014, 05:52:31 pm

Quote from: Kjelt on September 24, 2014, 03:35:23 pm

Quote from: coppice on September 24, 2014, 08:16:26 am
That is the exact opposite of reality. People being really innovative seldom give a damn about the core.
Which reality is that then? If you look around, almost all innovations start with the availability of new technology: either faster, with more processing power to establish and realize new technologies, or using less power so it can run on different (mobile) platforms.

Almost every part of an MCU except the CPU is an interesting place for innovation. That's why the ARM is becoming so dominant. It just doesn't matter, so people use the ARM as a default. It's something people are familiar and happy with, and it has good tool support. That's all that matters about the core for the vast majority of MCU applications. You might be surprised how few MCUs are actually running at their rated speed. People typically use a fraction of the potential core speed, to save a little current. The main reason for the growth of 32 bit MCUs has less to do with speed than with the ability to work smoothly with big memories.

Most high volume users buy MCUs for their interesting peripherals. Mixed signal peripherals are a particularly strong area for competition. Special memory features, such as fancy weak cell checking, or EDMI error correction, are often critical features for flash. These are the areas where you compete in the MCU market.

There are clear high volume niche exceptions to the above - DSP and control loops often push the core pretty hard. However, these are not the bulk of the market.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on September 24, 2014, 06:20:40 pm

I know that, but if you look at the last two technology waves, namely the cellphone and the iPad, you can hardly say that those applications could have been done 10 years before they were. They could not, at least not with decent battery life and enough computing power to do something that the user "must" have. New tech waves arise when the application and the technology are both just mature enough to make it work. And yes, you can do innovative things with 30 year old microcontrollers, but you can also do them (and even better) with the current generation of uC's. At least that is my opinion. And if you talk about an 8051, you are probably talking about the modern 8051 based uC's with modern peripherals and state of the art silicon but just the old school core, and not about the 8051 with the parallel data and address bus that needs external memory to be able to do anything.

Title: Re: Dhrystone 2.1 on mcus
Post by: dannyf on September 25, 2014, 11:09:47 am

Advances in computing power have come more from the raw speed at which the cpu runs than from architectural advances.

Take the chart here for example: http://www.netlib.org/performance/html/dhrystone.data.col0.html

On a DMIPS/Mhz basis, the early chips (towards the bottom), like 6502/8086/186/286, are around 0.1DMIPS/Mhz. The newer chips, like Pentium, etc., are more like 0.5 - 1 or 1.5 DMIPS/Mhz, over the course of 20-30 years. Or 5 - 15x.

In the meantime, the speed at which they run has gone from 4Mhz to 200Mhz, or 50 times, and even more if you benchmark against today's Xeon / Core chips.

So there is not a whole lot to be gained going from one architecture to another - I think our own numbers here show that as well. The increase in performance comes more from running the chips faster and faster.
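
The comparison above boils down to two ratios, sketched here in Python with the round figures quoted in this post (0.1 versus 0.5 - 1.5 DMIPS/Mhz, and 4Mhz versus 200Mhz). The function name is made up for illustration.

```python
def gain(new, old):
    """Improvement factor going from old to new."""
    return new / old


arch_low = gain(0.5, 0.1)    # low end of the per-MHz (architectural) gain
arch_high = gain(1.5, 0.1)   # high end of the per-MHz (architectural) gain
clock = gain(200, 4)         # clock frequency gain

print(f"architecture: {arch_low:.0f}x to {arch_high:.0f}x")
print(f"clock:        {clock:.0f}x")
print(f"combined:     {arch_low * clock:.0f}x to {arch_high * clock:.0f}x")
```

Even at the high end of the per-MHz range, the clock contributes the larger factor. The product of the two is the overall gain in absolute Dhrystone throughput.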

Plus, for many applications, faster isn't as needed as other measurements, like current consumption.

Title: Re: Dhrystone 2.1 on mcus
Post by: Kjelt on September 25, 2014, 11:25:13 am

Quote from: dannyf on September 25, 2014, 11:09:47 am

Plus, for many applications, faster isn't as needed as other measurements, like current consumption.

And as we know from the big processors, for example the Core i_x from Intel, these also go together: each new silicon process reduces the core voltage, which decreases power usage, which in turn enables higher frequencies (or more cores) for the same temperature/power footprint.
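
One way to see why lower core voltage buys frequency headroom is the textbook first-order model of CMOS dynamic power, P = C * V^2 * f. The Python sketch below uses that model only. It ignores leakage, and the voltages and capacitance are hypothetical values, not figures for any real Intel part.

```python
def dynamic_power(c_eff, v_core, f_hz):
    """First-order CMOS dynamic power: P = C * V^2 * f (leakage ignored)."""
    return c_eff * v_core ** 2 * f_hz


def freq_headroom(v_old, v_new):
    """Frequency multiplier that keeps dynamic power constant after a voltage drop."""
    return (v_old / v_new) ** 2


# Hypothetical process shrink: core voltage drops from 1.2 V to 1.0 V.
headroom = freq_headroom(1.2, 1.0)
print(f"frequency headroom at equal power: {headroom:.2f}x")

# Cross-check with an arbitrary effective capacitance and base clock.
p_old = dynamic_power(1e-9, 1.2, 100e6)
p_new = dynamic_power(1e-9, 1.0, 100e6 * headroom)
print(f"{p_old:.3f} W vs {p_new:.3f} W")
```

Because power scales with the square of voltage but only linearly with frequency, a modest voltage reduction frees up a disproportionate amount of clock (or core count) within the same power budget.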
