It is of course highly dependent on both the target in question and the code itself.
In my experience, -O3 yields faster execution than -O2 in many cases and is otherwise usually at least as fast; I've never personally run into a case where it was slower, so it's my usual default. (The exception is very small targets, where optimizing for size is critical: -O3 inlines code aggressively, so code size can inflate significantly, depending of course on your code structure.) I benchmarked sorting algorithms recently and consistently got about 10% faster execution with -O3 than with -O2 on PC targets. With some compute-intensive code (especially floating point), the gain can be as high as 30 to 50%.
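For reference, below is roughly how I time that kind of thing. It's not my actual benchmark, just a minimal, self-contained sketch with a hand-written quicksort; libc's qsort() is deliberately avoided, since the library is already compiled and wouldn't reflect your own optimization flags.

/* bench.c -- a minimal sketch, not my actual benchmark.
 * Build the same file at each level and compare, e.g.:
 *   gcc -O2 -o bench_o2 bench.c
 *   gcc -O3 -o bench_o3 bench.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2000000

/* plain recursive quicksort with Hoare-style partitioning */
static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    quicksort(a, lo, j);
    quicksort(a, i, hi);
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    if (!data)
        return 1;
    srand(42);                       /* fixed seed for repeatable runs */
    for (int i = 0; i < N; i++)
        data[i] = rand();

    clock_t t0 = clock();
    quicksort(data, 0, N - 1);
    clock_t t1 = clock();

    /* print a couple of elements so the sort can't be optimized away */
    printf("sorted %d ints in %.0f ms (first=%d last=%d)\n",
           N, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC, data[0], data[N - 1]);
    free(data);
    return 0;
}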
I did some tests with Bruce's code, and I confirm that I also get the same execution time with -O2 and -O3 on my Core i7: 2490 ms. I then tried -Ofast, and it's actually slightly slower at 2550 ms, which is consistent with my previous benchmarks, where I often found -Ofast slower than -O3. This is not that surprising: -Ofast is essentially -O3 plus -ffast-math, which mostly affects floating-point code, and execution time depends on many other factors, including how code and data are cached.
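To illustrate (a hypothetical snippet, not anything from Bruce's code), -ffast-math mostly buys you something on floating-point reductions like this one, where it lets the compiler reassociate and vectorize the accumulation at the cost of strict IEEE 754 semantics. On integer-heavy code like a sort, there's little for it to do.

/* fsum.c -- hypothetical example of code that -Ofast actually changes.
 * With plain -O3, the additions must happen in source order (IEEE 754
 * semantics), which serializes the loop on the accumulator. With -Ofast
 * (-ffast-math), the compiler may reassociate the sum and use SIMD,
 * often much faster, at the cost of slightly different rounding. */
#include <stddef.h>

double fsum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}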
Of course, benchmarking across different targets is a tricky business, and how you do it depends entirely on your goals. If you want to know the fastest a given algorithm can execute on a given CPU, you could write it directly in optimized assembly, using any specialized instructions available to speed things up; that would really take full advantage of said CPU. But if you want more of a typical "feel" for what you'd get with real-life code in a high-level language, using generic C and moderate optimization levels makes sense.