Author Topic: Dhrystone 2.1 on mcus  (Read 28341 times)

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Dhrystone 2.1 on mcus
« on: March 01, 2014, 06:40:22 pm »
I ran Dhrystone 2.1 (a useless benchmark) on a few MCUs that I have.

I ran the benchmark 10,000 times and then flipped a pin. Measuring the time between pin flips gives the duration of the benchmark: the shorter the duration, the faster the execution.

No optimization of any kind, for any chip - the exact same code ran on all chips.

[edit: the data is now presented as Dhrystones per MHz per second, from high to low]

Quote
Chip          Dhry/MHz/s  Compiler     Options
PIC32MX320    3,378       C32 2.x      optimized (-O3)
PIC32MX440    3,067       X32 1.21     optimized (-O3) @ 20 MHz
LM4F120       2,914       MDK-ARM      optimized (-O3 + time)
STM32F4       2,888       MDK-ARM      optimized (-O3 + time)
STM32F4       2,525       gcc-arm      optimized (-O3)
PIC24F        2,432       C30 2.x      optimized -O3 (speed)
PIC24F        2,403       XC16 pro     optimized (-O3)
PIC32MX440    2,288       X32 1.21     optimized (-O3) @ 80 MHz
LPC1343       2,087       MDK-ARM      optimized (-O3 + time)
STM32F3       1,964       MDK-ARM      optimized (-O3 + time)
LM4F120       1,911       IAR-ARM      optimized
STM32F4       1,903       IAR-ARM      optimized
PIC24F        1,901       C30 2.x      optimized -O2 (speed)
LPC1227       1,506       IAR-ARM      optimized
LPC1343       1,410       gcc-arm      optimized (-O3)
STM32F3       1,362       IAR-ARM      optimized
LM4F120       1,297       MDK-ARM
LM4F120       1,245       IAR-ARM
PIC24F        1,237       C30 2.x
PIC24H        1,195       XC16         optimized (-O3)
PIC24H        1,170       XC16 pro     optimized (-O3)
STM32F4       1,162       IAR-ARM      optimized (-O3)
PIC32MX320    1,151       C32 2.x
PIC24F        1,106       [compiler?]  optimized (-O3)
STM32F4       1,053       IAR-ARM
LPC1227       1,050       IAR-ARM
STM32F4       1,029       MDK-ARM
PIC32MX440    1,004       X32 1.21     @ 20 MHz
PIC24F        993         XC16 free
STM32F4       955         MDK-ARM      optimized (-O3)
STM32F1       921         IAR-ARM      optimized
LPC1343       906         MDK-ARM
STM32F4       902         gcc-arm
STM32F3       858         MDK-ARM
STM32F3       854         IAR-ARM
STM32F4       806         MDK-ARM
STM32F3       804         gcc-arm      optimized
STM32F3       766         gcc-arm
PIC32MX440    762         X32 1.21     @ 80 MHz
STM32F1       736         gcc-arm      optimized
MSP430F2418   734         IAR          optimized (3)
MSP430F2370   667         IAR          optimized (3)
LPC1343       664         gcc-arm
STM32F1       653         IAR-ARM
MSP430F2418   630         IAR
STM32F030F    619         MDK-ARM      O3 optimized
LPC1114       614         IAR-ARM      optimized
MSP430F2370   573         IAR
PIC24F        555         [unknown]
STM32F030F    552         MDK-ARM      O0
STM32F4       489         IAR-ARM
PIC24H        489         XC16 free
STM8S         482         IAR-STM8     optimized
P87C51MC2     470         Keil C51     optimized for speed
STM32F1       453         gcc-arm
P87C51MC2     439         Keil C51     optimized for size
STM8S         434         IAR-STM8
LPC1114       410         IAR-ARM
PIC18F26K20   380         XC8 pro
PIC18F26K20   323         XC8 free
PIC18F26K20   322         PICC18 pro
AVR90USB1286  237         gcc-avr
PIC18F26K20   168         PICC18 lite

Simulation only:
PIC32MZ       1,173
PIC32MZ       3,413                    O3
PIC24F        978         XC16
PIC24F        2,404       XC16         O3
PIC24F        1,215       C30
PIC24F        2,433       C30          O3



I was surprised:

1) PIC24F was really fast, and STM32F1/F3 sucked on a per-MHz basis.
2) avr sucked wind.
3) STM8S did OK. Evidence of the 6502's staying power.

Didn't run it on an 8051, but I would expect it to hold its own reasonably well.

edit: 1) added stm32f100 numbers.
      2) added stm32f100 + iar-arm, vs. gcc-arm.
      3) added results from IAR-ARM and rearranged the data to make it easier on the eyes.
      4) added results from pic18f
      5) updated PIC24F results (fat fingers) and added -O3 optimization
      6) added mdk-arm numbers for STM32F3. Pretty much identical unoptimized.
      7) added jaxbird's results for STM32F4 and PIC24H.
      8) added PIC24F results under XC16 free/pro. Still very high.
      9) added jaxbird's and hans' results for pic24h and pic24f/stm32f4, respectively.
      10) added LM4F120 under IAR-ARM and MDK-ARM.
      11) added LPC1343 under gcc-arm and MDK-ARM.
      12) added STM32F4 under gcc-arm and iar-arm, and mdk-arm too.
      13) added LPC1114 under iar-arm. The first CM0 chip in the comparison.
      14) added C51 (P87C51MC2) under Keil C51. Optimized for size and speed.
      15) added PIC32MX320 under C32, 2.x. (simulated)
      16) added PIC32MX440F512H results, for 80 MHz and 20 MHz
      17) added msp430. Fairly respectable scores.
      18) added LPC1227 / IAR scores.
      19) added MSP430F2370 - similar to the MSP430 scores obtained earlier.
      20) added simulated results for PIC32MZ, and PIC24F (XC16 and C30)
      21) added the results for STM32F030F, from the ghetto thread.
« Last Edit: September 11, 2014, 11:22:06 am by dannyf »
================================
https://dannyelectronics.wordpress.com/
 

Offline Bored@Work

  • Super Contributor
  • ***
  • Posts: 3932
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #1 on: March 01, 2014, 07:03:23 pm »
Quote
I was surprised:
...
2) avr sucked wind.

Not surprising. GCC without any optimization produces absolutely crap code.
I delete PMs unread. If you have something to say, say it in public.
For all else: Profile->[Modify Profile]Buddies/Ignore List->Edit Ignore List
 

Offline NANDBlog

  • Super Contributor
  • ***
  • Posts: 4478
  • Country: nl
Re: Dhrystone 2.1 on mcus
« Reply #2 on: March 01, 2014, 07:50:49 pm »
Quote
I was surprised:
...
2) avr sucked wind.

Not surprising. GCC without any optimization produces absolutely crap code.

Yeah, kinda like ";" becoming a "NOP" after compiling just so you can put a breakpoint on it. Not to mention, Cortex has a very well defined 1.25 DMIPS/MHz for the M3 and 0.93 DMIPS/MHz for the M0+, so running it again is kinda pointless. Unless you compare different compiler settings, etc...
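
For anyone comparing the table in the first post against those DMIPS/MHz figures, here is a minimal sketch of the conversion. DMIPS is conventionally Dhrystones per second divided by 1757, the VAX 11/780 reference score. The function names are made up for illustration, and the sketch assumes the table values really are Dhrystones per second per MHz.

```python
# Conventional reference: a VAX 11/780 (the "1 MIPS" machine) scores 1757 Dhrystones/s.
VAX_DHRYSTONES_PER_SEC = 1757


def dmips_per_mhz(dhrystones_per_sec_per_mhz_value):
    """Convert Dhrystones/s/MHz to DMIPS/MHz (hypothetical helper name)."""
    return dhrystones_per_sec_per_mhz_value / VAX_DHRYSTONES_PER_SEC


def dhrystones_per_sec_per_mhz(dmips_per_mhz_value):
    """Inverse conversion: DMIPS/MHz back to Dhrystones/s/MHz."""
    return dmips_per_mhz_value * VAX_DHRYSTONES_PER_SEC


if __name__ == "__main__":
    # The quoted Cortex-M3 figure of 1.25 DMIPS/MHz, in the table's units.
    print(dhrystones_per_sec_per_mhz(1.25))
    # LPC1343 (a Cortex-M3) under MDK-ARM with -O3 + time, from the first post: 2,087.
    print(round(dmips_per_mhz(2087), 2))
```

On that basis 1.25 DMIPS/MHz is about 2,196 Dhrystones/s/MHz, so the optimized LPC1343 entry (2,087) lands close to the quoted M3 figure, while the unoptimized entries sit well below it.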
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1042
  • Country: nl
Re: Dhrystone 2.1 on mcus
« Reply #3 on: March 01, 2014, 08:24:40 pm »
I was running CoreMark last week (as it so happens), to see what compiler settings did for PIC24.
I chose CoreMark because it focuses so much on embedded controllers and thus includes tests like pointer handling capability, branching etc. over pure integer speed.

I just did some more tests and with XC16 enabled for optimizations, I got:
-O0 = 9.911 iterations/second (2700b RAM, 8015b FLASH)
-O1 = 23.56 iterations/second (2630b RAM, 6486b FLASH)
-O2 = 29.64 iterations/second (2630b RAM, 6634b FLASH)
-O2 -unroll = 31.13 iterations/second (2630b RAM, 8083b FLASH)

-O3 = 28.95 iterations/second (2630b RAM, 9454b FLASH)
-O3 -unroll = 30.09 iterations/second (2630b RAM, 12348b FLASH)

-Os = 21.11 iterations/second (2630b RAM, 6484b FLASH)

Interesting compiler/benchmark combination, where optimization level 2 is faster than level 3 and s (size).  :-// No optimizations being so slow doesn't surprise me, as that setting is for development and the debugger needs to be able to find the instructions in program memory.
Also funny to see how well -O1 actually does on this benchmark. However, as always with optimizations, it depends on how stuff is written: YMMV.
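
To put the list above in relative terms, here is a small sketch that computes each setting's speedup over -O0 and picks the fastest one. The numbers are copied from the results above; the dictionary layout and function names are just for illustration.

```python
# CoreMark results quoted above for PIC24 with XC16:
# setting -> (iterations/second, RAM, flash), sizes as listed in the post.
RESULTS = {
    "-O0":         (9.911, 2700, 8015),
    "-O1":         (23.56, 2630, 6486),
    "-O2":         (29.64, 2630, 6634),
    "-O2 -unroll": (31.13, 2630, 8083),
    "-O3":         (28.95, 2630, 9454),
    "-O3 -unroll": (30.09, 2630, 12348),
    "-Os":         (21.11, 2630, 6484),
}


def speedups(results, baseline="-O0"):
    """Return each setting's iterations/second relative to the baseline setting."""
    base_rate = results[baseline][0]
    return {name: rate / base_rate for name, (rate, _ram, _flash) in results.items()}


def fastest(results):
    """Return the setting with the highest iterations/second."""
    return max(results, key=lambda name: results[name][0])


if __name__ == "__main__":
    for name, factor in speedups(RESULTS).items():
        flash = RESULTS[name][2]
        print(f"{name:12s} {factor:5.2f}x vs -O0, flash {flash}")
    print("fastest:", fastest(RESULTS))
```

Seen this way, -O1 already gets most of the gain, and the step from -O2 to -O3 costs flash without buying speed on this benchmark.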

The CoreMark benchmark is really easy to port (initialize the UART for printf, add a timebase from a 32-bit timer, done). Except for PIC18, because the program has a matrix multiply test that uses 32-bit integer array indexing. XC8 didn't like that. It said:
../coremark_v1.0/core_matrix.c:244: error: can't generate code for this expression

Anyway, I see the STM8 has 16-bit divide instructions, where the AVR only has multiply (8-bit). PIC24 has 32-bit by 16-bit divide, and 17-bit by 17-bit multiply. So I guess it wins out on a lot. Not sure why it would be faster than the Cortex-M3 though, because the M3 has 32-bit multiply/divide. That wouldn't be logical, unless the bus is being stalled all the time or something (I highly doubt that, though).
« Last Edit: March 01, 2014, 08:26:26 pm by hans »
 

Offline GiskardReventlov

  • Frequent Contributor
  • **
  • Posts: 598
  • Country: 00
  • How many pseudonyms do you have?
Re: Dhrystone 2.1 on mcus
« Reply #4 on: March 01, 2014, 09:11:36 pm »
Help me understand the numbers.

pic24fj64ga102 @ 4Mhz (8Mhz crystal), duration = 2.12 seconds -> 8.5 seconds @ 1Mhz

Is this 8.5 Dhrystones/second?

What compiler flags did you use and what versions of the compilers did you use?

As a casual, naive observer I conclude that you didn't reach any conclusion.  But maybe I missed how the chips you tested are similar or dissimilar. The only commonality I saw was that you have them.

Correction:
avr90usb1286

should be

at90usb1286

 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #5 on: March 01, 2014, 09:22:47 pm »
Quote
can't generate code for this expression

You may try some earlier compilers from HI-TECH.

Quote
That wouldn't be logical,

Agreed. I think we will know for sure once I have a look at the list file.

I also didn't keep track of code size but that's not that interesting to me.
================================
https://dannyelectronics.wordpress.com/
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 3066
  • Country: us
Re: Dhrystone 2.1 on mcus
« Reply #6 on: March 02, 2014, 07:40:13 am »
Quote
No optimization of any kind, for any chip
When you say that, do you mean that you didn't do any chip-specific optimizations, or that you completely turned off compiler optimizations (-O0) as well? The latter is practically worthless; some compilers do a lot more in the "optimization" step than others. Pick a generic compiler optimization parameter ("-O3"?) and use the closest equivalent for each compiler.
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #7 on: March 02, 2014, 12:02:03 pm »
Finished running the code on 18f. The XC8 free compiler did quite well vs. the good old picc18.
================================
https://dannyelectronics.wordpress.com/
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 17979
  • Country: nl
    • NCT Developments
Re: Dhrystone 2.1 on mcus
« Reply #8 on: March 02, 2014, 12:35:48 pm »
Quote
I ran dhrystone 2.1 (a useless benchmark) on a few mcus that I have.

I would run 10,000 times the benchmark, and then flip a pin. By measuring the duration between pin flips, we measured the duration of the benchmark. The shorter the duration, the faster the execution.

No optimization of any kind, for any chip - the exact same code ran on all chips.

Unless you used the same C library, the same code didn't run on all chips. Dhrystone results are tainted by the efficiency of the C library.

edit: After a short peek I see several calls to strcpy and strcmp. These should be replaced with functions inside the Dhrystone test so the results are not tainted by differences in the C library.
« Last Edit: March 02, 2014, 09:40:10 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline HackedFridgeMagnet

  • Super Contributor
  • ***
  • Posts: 1938
  • Country: au
Re: Dhrystone 2.1 on mcus
« Reply #9 on: March 02, 2014, 02:14:49 pm »
Quote
Unless you used the same C library the same code didn't run on all chips. Dhrystone results are tainted by the efficiency of the C library.

You're right, but at least he has gone to the trouble to produce some results, and given a rough indication of how he did them.

Which is a start, and most of us are stuck with the efficiencies of those C libraries anyway.
 

Offline jaromir

  • Supporter
  • ****
  • Posts: 256
  • Country: sk
Re: Dhrystone 2.1 on mcus
« Reply #10 on: March 02, 2014, 09:11:20 pm »
dannyf: maybe I overlooked something, but could you, please, provide your sources, say for PIC24? I'd like to verify it and maybe do tests with different MCUs.
My hobby projects: https://hackaday.io/jaromir ----------- http://jaromir.xf.cz/
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #11 on: March 02, 2014, 09:28:41 pm »
It is the standard Dhrystone package. It comes in one .h file (dhry.h) and two .c files (dhry_1.c and dhry_2.c).

Google it. They compile right away, with minimum changes.
================================
https://dannyelectronics.wordpress.com/
 

Offline GiskardReventlov

  • Frequent Contributor
  • **
  • Posts: 598
  • Country: 00
  • How many pseudonyms do you have?
Re: Dhrystone 2.1 on mcus
« Reply #12 on: March 02, 2014, 09:40:24 pm »
Quote
right, but at least he has gone to the trouble to produce some results, and given a rough indication of how he did them.

Too rough to reproduce the numbers, but it's all just for fun anyway.

avr-gcc -v
arm-gcc -v
etc.
What voltages?

Dhrystone says that it's not meant to be run with any optimizations, unless you just want to test optimizations.
But that's not the goal here. So they're just some numbers.

Quote
Which is a start, and most of us are stuck with the efficiencies of those C libraries anyway.

Inefficiencies are a bigger problem.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 3066
  • Country: us
Re: Dhrystone 2.1 on mcus
« Reply #13 on: March 03, 2014, 04:15:58 am »
Quote
Dhrystone results are tainted by the efficiency of the C library.
Of course they are.  They're "tainted" by how good the C compiler is too.  That's THE POINT.  It's a benchmark of the chip+compiler+library SYSTEM.  (and that's what SHOULD be meant by "applied no optimizations.")  Benchmarks that only measure the chip are much MORE tainted.

Quote
They compile right away, with minimum changes.
Yeah, but we can't come up with numbers that compare directly to the ones in your table without seeing exactly how the wrapper code or your external calculations work. (It currently says "Dhrystones/sec/MHz", but that doesn't look right, nor does it match up very well with published estimates.)
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 17979
  • Country: nl
    • NCT Developments
Re: Dhrystone 2.1 on mcus
« Reply #14 on: March 03, 2014, 09:57:30 am »
Quote
Dhrystone results are tainted by the efficiency of the C library.
Of course they are.  They're "tainted" by how good the C compiler is too.  That's THE POINT.  It's a benchmark of the chip+compiler+library SYSTEM.  (and that's what SHOULD be meant by "applied no optimizations.")  Benchmarks that only measure the chip are much MORE tainted.
Not if you are comparing compilers. IMHO there is not much use in including a slow and a fast C library in a benchmark test, since you can always replace a slow C library function with a faster one.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline jaxbird

  • Frequent Contributor
  • **
  • Posts: 767
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #15 on: March 03, 2014, 10:48:59 am »
Interesting comparison of mcus and compilers  :-+

Ran a couple for comparison and to add more data; same standard Dhrystone 2.1 code, I assume.

Results:

STM32F4 (168 MHz) with Keil

Compiler Options      Dhrystones/MHz/Second
-o0                   806
-o1                   932
-o2                   942
-o3                   955
-o3 -otime            3621 (likely not valid)


PIC24HJ64GP502 (80 MHz) with XC16 (free/pro)

Compiler Options      Dhrystones/MHz/Second
-o0 (free)            489
-o1 (free)            865
-o2 (pro)             1158
-o3 (pro)             1170


Edit: Added more results for PIC24H
« Last Edit: March 03, 2014, 04:22:38 pm by jaxbird »
Analog Discovery Projects: http://www.thestuffmade.com
Youtube random project videos: https://www.youtube.com/user/TheStuffMade
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #16 on: March 03, 2014, 11:30:51 am »
Quote
Yeah, but we can't come up with numbers that compare directly to the ones in your table, without seeing exactly the wrapper code or your external calculations work.

As I said in the beginning, there is nothing in my code other than calling Dhrystone and flipping a pin. You then measure the time between the pin flips and calculate Dhrystones per MHz per second.

For example: say it takes 5 seconds to run 10,000 loops of Dhrystone on a 10 MHz MCU. That's 10000 / 10 / 5, or 200 Dhrystones per MHz per second.

It may be clearer if you take a look at the code.
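
The calculation described above can be written out as a small sketch. The function and argument names are made up for illustration, and the second call assumes that the 2.12 s PIC24F figure quoted in reply #4 covers the same 10,000 runs.

```python
def dhrystones_per_mhz_per_sec(runs, clock_mhz, seconds):
    """Score = runs / clock in MHz / measured time between pin flips in seconds."""
    if runs <= 0 or clock_mhz <= 0 or seconds <= 0:
        raise ValueError("runs, clock_mhz and seconds must all be positive")
    return runs / clock_mhz / seconds


if __name__ == "__main__":
    # Example from the post: 10,000 loops in 5 s on a 10 MHz MCU.
    print(dhrystones_per_mhz_per_sec(10_000, 10, 5))
    # Assumed: 10,000 loops in 2.12 s at 4 MHz (figure quoted in reply #4).
    print(round(dhrystones_per_mhz_per_sec(10_000, 4, 2.12)))
```

Note that the clock fed into this is whatever the poster treats as "MHz" for the chip, so an oscillator frequency versus an instruction clock will shift the score by a constant factor.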
================================
https://dannyelectronics.wordpress.com/
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #17 on: March 03, 2014, 12:06:06 pm »
Quote
-o3 -otime 3621

I ran Dhrystone on an LPC2106 @ 30 MHz (12 MHz x 5 / 2) and I got 2000 plus. No time to debug it, but it wouldn't surprise me if something is being optimized away.

Running a Keil vs. IAR vs. gcc comparison would be interesting.
================================
https://dannyelectronics.wordpress.com/
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #18 on: March 03, 2014, 12:54:11 pm »
On the last one: all the benchmarking I have seen, including Keil's own, suggests an upper end of 1000 Dhrystones per MHz for armv7 chips. The Cortex chips are crippled or smaller, so lower numbers for them make sense.
================================
https://dannyelectronics.wordpress.com/
 

Offline jaxbird

  • Frequent Contributor
  • **
  • Posts: 767
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #19 on: March 03, 2014, 01:54:30 pm »
Quote
--o3 -otime 3621 -

I ran dhrystone on a lpc2106 @ 30mhz (12mhz x 5 / 2) and I got 2000 plus. No time to debug it but it wouldn't surprise me that some thing is being optimized away.

running a keil vs iar vs gcc comparison would be interesting.

Agreed, the difference is so large that it's likely not a valid result. I have not added any asserts on the outputs of the Dhrystone runs. Will try to do a validation at some point.

Yeah, it would be interesting to get some comparable results between the most used compilers and their optimizations. Of course it wouldn't necessarily mean the best result is always the best compiler, but would still be interesting though.

Analog Discovery Projects: http://www.thestuffmade.com
Youtube random project videos: https://www.youtube.com/user/TheStuffMade
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #20 on: March 04, 2014, 01:27:45 am »
Ran PIC24F again (just to be sure). The earlier numbers actually were slightly lower due to key-in errors.

Also ran -O3 on PIC24F. Over 2400 Dhrystones / MHz. That is unbelievable, :)
================================
https://dannyelectronics.wordpress.com/
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #21 on: March 04, 2014, 01:46:23 am »
Also added STM32F3's numbers under mdk-arm (4.x). A virtual tie with IAR when unoptimized.
================================
https://dannyelectronics.wordpress.com/
 

Offline nuhamind2

  • Regular Contributor
  • *
  • Posts: 138
  • Country: id
Re: Dhrystone 2.1 on mcus
« Reply #22 on: March 04, 2014, 01:56:33 am »
Could flash memory speed affect performance? At higher clocks, wait states need to be inserted, so the performance/MHz is worse at high frequency. I know that the STM32 has a flash accelerator or something, but how efficient it is I don't know, and I think a flash accelerator won't help with indirect jumps. I don't know, though, whether Dhrystone generates indirect jumps.
just my 2c
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8229
  • Country: 00
Re: Dhrystone 2.1 on mcus
« Reply #23 on: March 04, 2014, 02:20:11 am »
I combined jaxbird's results for F4 and 24H with mine in the first post. When I get some time, I will try XC16 on the 24F as well.

On wait states, I think so as well, particularly at high speed, so performance is unlikely to increase linearly with clock speed. But it would be difficult for me to quantify.
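
The PIC32MX440 rows in the first post's table give one data point on this, since the same chip was run at 20 MHz and 80 MHz. A minimal sketch of the comparison follows. The helper names are hypothetical, and the numbers by themselves say nothing about whether flash wait states are actually the cause.

```python
def per_mhz_retention(score_low_clock, score_high_clock):
    """Fraction of the per-MHz score that survives at the higher clock."""
    return score_high_clock / score_low_clock


def absolute_speedup(low_mhz, score_low, high_mhz, score_high):
    """Ratio of absolute Dhrystones/s (per-MHz score times MHz) between the two clocks."""
    return (high_mhz * score_high) / (low_mhz * score_low)


if __name__ == "__main__":
    # PIC32MX440, -O3: 3,067 at 20 MHz vs 2,288 at 80 MHz (per MHz per second).
    print(round(per_mhz_retention(3067, 2288), 3))
    print(round(absolute_speedup(20, 3067, 80, 2288), 2))
    # Same chip, unoptimized: 1,004 at 20 MHz vs 762 at 80 MHz.
    print(round(per_mhz_retention(1004, 762), 3))
```

With those figures the per-MHz score drops to roughly three quarters at 80 MHz, so a 4x clock increase buys about a 3x increase in absolute throughput.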
================================
https://dannyelectronics.wordpress.com/
 

Offline nuhamind2

  • Regular Contributor
  • *
  • Posts: 138
  • Country: id
Re: Dhrystone 2.1 on mcus
« Reply #24 on: March 04, 2014, 03:05:10 am »
Quote
I combined jaxbird's results for F4 and 24H with mine in the first post. When I get some time, I will try XC16 on the 24F as well.

On wait state, I think so as well, particularly at high speed so performance is unlikely to increase with speed linearly. But it would be difficult for me to quantify.

How about running the benchmark at a lower frequency and disabling the wait states?
 

