-
I fired up my UPduino 2.0 and loaded it with Mecrisp-Quintus Forth, running on the picorv32 SoC.
Here is a naive benchmark I tried, mostly because I still had the results from Mecrisp-Forth running the 8 queens on the F407 @168MHz years back.
0 variable solutions
0 variable nodes
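\ mask with the n lowest bits set
: bits ( n -- mask ) 1 swap lshift 1- ;
\ isolate the lowest set bit
: lowBit ( mask -- bit ) dup negate and ;
\ clear the lowest set bit
: lowBit- ( mask -- bits ) dup 1- and ;
\ remove the chosen file bit from all three masks and shift the diagonals for the next rank
: next3 ( dl dr f files -- dl dr f dl' dr' f' )
  not >r
  2 pick r@ and 2* 1+
  2 pick r@ and 2/
  2 pick r> and ;
\ try every free square of the current rank; no free files left means a solution
: try ( dl dr f -- )
  dup if
    1 nodes +!
    dup 2over and and
    begin ?dup while
      dup >r lowBit next3 recurse r> lowBit-
    repeat
  else 1 solutions +! then
  drop 2drop ;
: queens ( n -- )
  0 solutions ! 0 nodes !
  -1 -1 rot bits try
  solutions @ . ." solutions, " nodes @ . ." nodes" ;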
\
\ Mecrisp-Forth 1.2, STM32F407, 168MHz clock, serial 115k2:
\ =====================================================
\ 8 testQ 92 solutions, 1965 nodes t= 4 ms ok.
\ 10 testQ 724 solutions, 34815 nodes t= 43 ms ok.
\ 12 testQ 14200 solutions, 841989 nodes t= 1002 ms ok.
\ 13 testQ 73712 solutions, 4601178 nodes t= 5457 ms ok.
\ 14 testQ 365596 solutions, 26992957 nodes t= 31936 ms ok.
\ 16 testQ 14772512 solutions, 1126417791 nodes t= 1331997 ms ok.
\ : testQ ( N -- ) millis @ swap queens millis @ swap - ." t= " . ." ms" ;
\ Mecrisp-Quintus 0.31 for RISC-V 32 IM, picorv32, ice40UP5k, 12MHz clock, serial 115k2:
\ ===========================================================
\ 8 testQ 92 solutions, 1965 nodes t= 1612989 clocks ok.
\ 10 testQ 724 solutions, 34815 nodes t= 27066757 clocks ok.
\ 12 testQ 14200 solutions, 841989 nodes t= 650884555 clocks ok.
\ : testQ ( N -- ) $20040000 @ swap queens $20040000 @ swap - ." t= " . ." clocks" ;
So it seems to me the picorv32 is approx 3.9x slower in this benchmark compared to F407.
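For anyone who wants to cross-check the solution and node counts without a Forth system at hand, here is a rough Python port of the words above. It is only a sketch: the function names are mine, and it relies on Python's unbounded integers behaving like two's complement for `&`, `~` and `>>`, which stands in for the 32-bit cells used by Mecrisp.

```python
def queens(n):
    """Return (solutions, nodes), mirroring the Forth word `queens`."""
    solutions = 0
    nodes = 0

    def try_(dl, dr, f):
        # dl, dr, f are masks of still-free squares (diagonals and files).
        nonlocal solutions, nodes
        if f == 0:
            # No free files left: every queen has been placed.
            solutions += 1
            return
        nodes += 1
        poss = f & dl & dr          # squares free in all three masks
        while poss:
            bit = poss & -poss      # lowBit: isolate the lowest set bit
            keep = ~bit
            # next3: drop the chosen bit, shift the diagonals by one rank
            try_(((dl & keep) << 1) | 1, (dr & keep) >> 1, f & keep)
            poss &= poss - 1        # lowBit-: clear the lowest set bit

    try_(-1, -1, (1 << n) - 1)
    return solutions, nodes


print(queens(8))
```

For N=8 this should reproduce the 92 solutions and 1965 nodes shown in both logs; larger N gets slow quickly in Python.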
-
There's something fishy here. According to your log, the STM32F407 is running at 168MHz. But the picorv32 is running at... 12 MHz?
To get an idea of the raw performance, the STM32F4 line has a ~3.37 CoreMark/MHz figure. I'm not sure about the picorv32 (and I think this would depend on the exact configuration), but a figure of about 2.5 CoreMark/MHz sounds realistic (it may even be optimistic here).
So in terms of raw performance, the STM32F4 should be about 3.37*168/(2.5*12) times faster, which is ~18.9.
Unless the 12 MHz clock rate stated in your log is erroneous, I can't figure out how the F407 could be "only" about 4 times faster. Unless Mecrisp-Quintus is MUCH more efficient than Mecrisp-Forth for the STM32F4... Can someone enlighten me?
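The estimate above can be written out as a small Python sketch. The 3.37 and 2.5 CoreMark/MHz figures are the ones quoted in this post (the picorv32 value is only a guess), and the function and constant names are made up for illustration.

```python
# CoreMark/MHz figures as quoted in the post; the picorv32 one is a guess.
COREMARK_PER_MHZ_F4 = 3.37
COREMARK_PER_MHZ_PICORV32_GUESS = 2.5


def expected_speed_ratio(score_a, mhz_a, score_b, mhz_b):
    """Ratio of (score per MHz * clock in MHz) for chip A over chip B."""
    return (score_a * mhz_a) / (score_b * mhz_b)


ratio = expected_speed_ratio(
    COREMARK_PER_MHZ_F4, 168,
    COREMARK_PER_MHZ_PICORV32_GUESS, 12,
)
print(f"expected wall-clock ratio: {ratio:.1f}x")
```

Note that this is a wall-clock ratio, so it already includes the 168/12 = 14x difference in clock frequency.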
-
F407@168MHz 12 Queens 1.002sec
picorv@12MHz 12 Queens 650884555*1/(12e6)=54.24sec
(54.24/1.002)*(12/168)= 3.87x slower Fcpu to Fcpu
Let us assume the F407 does 1 Forth primitive in 1 clock.
The picorv32 typically needs ~4 clocks.
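Spelled out in Python, the normalisation looks roughly like this. The numbers are taken from the two logs for N=12; the variable names are just for illustration.

```python
F407_HZ = 168e6
F407_SECONDS_N12 = 1.002           # 1002 ms from the STM32F407 log
PICORV32_HZ = 12e6
PICORV32_CLOCKS_N12 = 650_884_555  # cycle count from the picorv32 log

# Convert the cycle count to seconds at 12 MHz.
picorv32_seconds = PICORV32_CLOCKS_N12 / PICORV32_HZ

# How much longer the picorv32 takes in real time.
wall_clock_ratio = picorv32_seconds / F407_SECONDS_N12

# Scale by the clock ratio to compare clock for clock ("Fcpu to Fcpu").
per_clock_ratio = wall_clock_ratio * PICORV32_HZ / F407_HZ

print(f"picorv32 time:   {picorv32_seconds:.2f} s")
print(f"wall-clock gap:  {wall_clock_ratio:.1f}x")
print(f"per-clock gap:   {per_clock_ratio:.2f}x")
```

So in wall-clock terms the gap is about 54x, and only the per-clock figure comes out near 3.9x.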
-
Ah, sorry. You should have made this clearer in your original post. Faster or slower when benchmarking needs clear definitions.
Anyway, now that makes sense. I haven't found CoreMark figures for the picorv32, but they do give Dhrystone figures. Most importantly, they state an average of about 4 CPI for the picorv32, whereas on a Cortex-M4 it will be close to 1. So a factor of 4 between the two on average is expected. And the CoreMark figure on a picorv32 is then likely very far from the 2.5 /MHz I was expecting. It's probably under 1 CoreMark/MHz.
Read here:
https://github.com/cliffordwolf/picorv32
(Cycles per Instruction Performance)
-
.. and when compiled with the "Acrobatics compiler" words (a stronger optimization) it needs roughly a third fewer clocks (about a 1.5x speed-up):
\ Mecrisp-Quintus 0.31a for RISC-V 32 IM, picorv32, ice40UP5k, 12MHz clock, serial 115k2:
\ ===========================================================
\ 8 testQ 92 solutions, 1965 nodes t= 1612989 clocks
\ 10 testQ 724 solutions, 34815 nodes t= 27066757 clocks
\ 12 testQ 14200 solutions, 841989 nodes t= 650884555 clocks
\ : testQ ( N -- ) $20040000 @ swap queens $20040000 @ swap - ." t= " . ." clocks" ;
\ Mecrisp-Quintus 0.32 for RISC-V 32 IM, picorv32, ice40UP5k, 24MHz clock, Acro comp., serial 115k2:
\ ===========================================================
\ 8 testQ 92 solutions, 1965 nodes t= 1083561 clocks
\ 10 testQ 724 solutions, 34815 nodes t= 17548407 clocks
\ 12 testQ 14200 solutions, 841989 nodes t= 420951159 clocks
\ : testQ ( N -- ) $20040000 @ swap queens $20040000 @ swap - ." t= " . ." clocks" ;
F407@168MHz 12 Queens 1.002sec
picorv@24MHz 12 Queens 420951159*1/(24e6)=17.54sec
(17.54/1.002)*(24/168)= 2.5x slower Fcpu to Fcpu
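To see how much the Acrobatics build gains at each board size, the cycle counts from the two picorv32 logs can be compared directly. This is a quick Python sketch; the dictionary names are mine, the numbers are copied from the logs.

```python
CLOCKS_V031 = {8: 1_612_989, 10: 27_066_757, 12: 650_884_555}
CLOCKS_V032_ACRO = {8: 1_083_561, 10: 17_548_407, 12: 420_951_159}

# Speed-up of the Acrobatics build, in clocks, for each N.
speedups = {n: CLOCKS_V031[n] / CLOCKS_V032_ACRO[n] for n in CLOCKS_V031}

for n, s in speedups.items():
    saved = 100 * (1 - 1 / s)
    print(f"N={n}: {s:.2f}x faster, {saved:.0f}% fewer clocks")

# Per-clock gap to the F407 for N=12 (1.002 s at 168 MHz).
f407_clocks_n12 = 1.002 * 168e6
per_clock_vs_f407 = CLOCKS_V032_ACRO[12] / f407_clocks_n12
print(f"per-clock gap to F407: {per_clock_vs_f407:.2f}x")
```

Since both logs count clocks, the change from 12 MHz to 24 MHz does not affect the build-to-build comparison.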