Author Topic: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO  (Read 35657 times)


Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #150 on: November 20, 2019, 08:20:57 am »
OK, it's re-running with nSieve as an int. i for the loop was always an int.

Also I checked my xCore200 board with 5 threads, the max I can do pretty quickly... 113655.671  milliseconds(+/- 20ns). Little slower than the XS1 board running 4 threads but it's not that surprising since it ends up just waiting on the final thread but the XS1 gets 125MHz threads. The XS1 would probably be even faster running 8 threads since the earlier ones would finish and move the processing time along to later threads.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #151 on: November 20, 2019, 08:37:15 am »
Hmm .. ok. That puts it closer to Skylake than I expected, but ok.

If a 47W laptop part is throttling then my 15W TDP 8650U is probably throttling even more :-) But it's damn quick. I don't normally observe it throttling in a 3 second single core test. *Maybe* it's going from 4.2 to 3.9. But that makes it even fewer than 11.5b clocks .. that would be 10.7. It takes (in a NUC) about 20 seconds of all-core work to drop back to 3.4, and even after 20 minutes it's still at 2.8. The same 8650U CPU in a ThinkPad drops back much more quickly, and gets to 2.2 at the end of the same workload.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #152 on: November 20, 2019, 08:45:00 am »
Also I checked my xCore200 board with 5 threads, the max I can do pretty quickly... 113655.671  milliseconds(+/- 20ns). Little slower than the XS1 board running 4 threads but it's not that surprising since it ends up just waiting on the final thread but the XS1 gets 125MHz threads. The XS1 would probably be even faster running 8 threads since the earlier ones would finish and move the processing time along to later threads.

I'm not sure I understand how fine level threading will work on an xCore for this code. Does it get the correct number of primes? Are the arrays duplicated?
 

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #153 on: November 20, 2019, 08:54:08 am »
Also I checked my xCore200 board with 5 threads, the max I can do pretty quickly... 113655.671  milliseconds(+/- 20ns). Little slower than the XS1 board running 4 threads but it's not that surprising since it ends up just waiting on the final thread but the XS1 gets 125MHz threads. The XS1 would probably be even faster running 8 threads since the earlier ones would finish and move the processing time along to later threads.

I'm not sure I understand how fine level threading will work on an xCore for this code. Does it get the correct number of primes? Are the arrays duplicated?

The way I set it up was running it through from start to end in a single thread. Any time it got to a point where a thread would start I'd basically print out the function state data. So each thread would start with those state variables(including arrays) and then break when the proper number of primes was done.

https://gist.github.com/Maginnovision/2f7bd99afeeed351d421573950fbfdee

The main was basically:
Code: [Select]
#ifdef EXPLORER5
    unsafe {
        timing <: 1;
        par {
            EX5countPrime0(&res1);
            EX5countPrime1(&res2);
            EX5countPrime2(&res3);
            EX5countPrime3(&res4);
            EX5countPrime4(&res5);
        }
        timing <: 0;

        res = res1 + res2 + res3 + res4 + res5;
        printf("%d primes(1) found\n", res1);
        printf("%d primes(2) found\n", res2);
        printf("%d primes(3) found\n", res3);
        printf("%d primes(4) found\n", res4);
        printf("%d primes(5) found\n", res5);
        printf("%d primes(total) found\n\n", res);
    }
#endif

The timing was done by a teensy 3.6 monitoring the "timing" gpio.
« Last Edit: November 20, 2019, 08:56:50 am by maginnovision »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #154 on: November 20, 2019, 09:09:14 am »
The way I set it up was running it through from start to end in a single thread. Any time it got to a point where a thread would start I'd basically print out the function state data. So each thread would start with those state variables(including arrays) and then break when the proper number of primes was done.

https://gist.github.com/Maginnovision/2f7bd99afeeed351d421573950fbfdee

Ahhh .. so it wasn't starting the algorithm from zero knowledge .. each thread started from a snapshot made during a prior "learning" run.

So the total amount of work is the same, but it's actually not running the algorithm multithreaded.

What happens if you start all the threads from empty arrays, start each thread at trial = 3 + threadNum, and increment trial by NumThreads each time at try_next? And then sum nPrimes from all the threads at the end?  Or can you do a mutex with a single global "trial" variable? I assume there's an "atomic increment memory" instruction?
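
The single shared "trial" idea can be sketched with C11 atomics. This is only an illustration under assumptions: `shared_trial` and `next_trial` are hypothetical names that do not appear in primes.c, and the xCore toolchain may offer something other than `<stdatomic.h>`. Each caller claims the next odd candidate with one atomic add, so no two threads get the same number.

```c
#include <stdatomic.h>

/* Hypothetical shared candidate counter; starts at the first odd trial, 3. */
static atomic_int shared_trial = 3;

/* Atomically claim the current candidate and advance the counter by 2.
   atomic_fetch_add returns the value held before the addition, so each
   caller receives a distinct odd number. */
int next_trial(void) {
    return atomic_fetch_add(&shared_trial, 2);
}
```

This only hands out candidates. The threads would still need their own copies of primes[] and sieve[], or some locking around shared ones, which the sketch leaves out.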

 

Offline techman-001

  • Frequent Contributor
  • **
  • !
  • Posts: 748
  • Country: au
  • Electronics technician for the last 50 years
    • Mecrisp Stellaris Unofficial UserDoc
Re: Raspberry Pi 4
« Reply #155 on: November 20, 2019, 03:18:39 pm »
Got my Teensy 4.0 board and set it up.

I get the same results as GeorgeOfTheJungle, at 600 MHz, to the ms, adding my code to an Arduino sketch and adapting main() slightly and calling it from setup():
Code: [Select]
int main(){
  long beg = millis();
  int res = countPrimes();
  long m = millis() - beg;
  Serial.print(res);
  Serial.print(" primes found in ");
  Serial.print(m);
  Serial.println(" ms");
  return 0;
}

3713160 primes found in 37381 ms "faster" (the default)
3713160 primes found in 43516 ms "fast"

Verified that "fast" is -O1 and "faster" is -O2.  Compile line for "fast":

/home/bruce/software/arduino-1.8.10/hardware/teensy/../tools/arm/bin/arm-none-eabi-gcc -O1 -Wl,--gc-sections,--relax -T/home/bruce/software/arduino-1.8.10/hardware/teensy/avr/cores/teensy4/imxrt1062.ld -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -o /tmp/arduino_build_829669/Blink.ino.elf /tmp/arduino_build_829669/sketch/Blink.ino.cpp.o /tmp/arduino_build_829669/core/core.a -L/tmp/arduino_build_829669 -larm_cortexM7lfsp_math -lm -lstdc++

The code is 228 bytes long which is in line with some other Thumb2 results I've had but bigger than some. The gcc is pretty old .. 5.4.1 20160919.

Something that confuses me is that the objdump supplied in arduino-1.8.10/hardware/teensy/../tools/arm/bin/arm-none-eabi-objdump doesn't understand some of the instructions in the elf file!

Apologies for butting in, but I love benchmarks for their fun value and just had to try your 3713160 primes with Forth.

I used a generic Forth primes algorithm written by Mark Willis and archived from a post he made on Usenet in 2011.

It compiled to 184 bytes under Mecrisp-Stellaris running on an STM32F103 with 8kB RAM and 128kB Flash clocked at 72 MHz.

3713160 primes took 219927 ms not printing the primes
3713160 primes took 283662 ms after printing, the last  prime being 3713159

... 3712627 3712669 3712679 3712697 3712699 3712711 3712717 3712721 3712739 3712747 3712757 3712769 3712801 3712823 3712831 3712843 3712871 3712873 3712889 3712897 3712909 3712927 3712949 3712979 3712981 3713027 3713041 3713053 3713057 3713069 3713071 3713077 3713081 3713147 3713153 3713159

 ;D
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: Raspberry Pi 4
« Reply #156 on: November 20, 2019, 06:32:09 pm »
3713160 primes took 219927 ms not printing the primes
3713160 primes took 283662 ms after printing, the last  prime being 3713159

... 3712627 3712669 3712679 3712697 3712699 3712711 3712717 3712721 3712739 3712747 3712757 3712769 3712801 3712823 3712831 3712843 3712871 3712873 3712889 3712897 3712909 3712927 3712949 3712979 3712981 3713027 3713041 3713053 3713057 3713069 3713071 3713077 3713081 3713147 3713153 3713159

Looks like you've found all primes less than 3713160, not the first 3713160 primes. The last one should be 62710561, which is almost 17 times bigger.
 
3713159 is the 264262nd prime, so by that measure you've gone about 1/14th of the way.

A full run will take about 30 times longer. It gets harder as you go along.
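
The difference between the two readings of the parameter is easy to see on small numbers. A minimal sketch in C using plain trial division (not the sieve from primes.c); `is_prime`, `count_primes_below` and `nth_prime` are names made up for this illustration:

```c
#include <assert.h>

/* Trial division; fine for small illustrative inputs. */
static int is_prime(int n) {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return 0;
    return 1;
}

/* How many primes p satisfy p < limit? */
int count_primes_below(int limit) {
    int count = 0;
    for (int n = 2; n < limit; ++n)
        count += is_prime(n);
    return count;
}

/* The k-th prime, counting from 1 (nth_prime(1) == 2). */
int nth_prime(int k) {
    assert(k >= 1);
    int n = 1;
    while (k > 0)
        if (is_prime(++n)) --k;
    return n;
}
```

With these, count_primes_below(100) is 25 while nth_prime(100) is 541, so "100" as a limit and "100" as a count ask for very different amounts of work.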
 

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #157 on: November 20, 2019, 07:56:58 pm »
The way I set it up was running it through from start to end in a single thread. Any time it got to a point where a thread would start I'd basically print out the function state data. So each thread would start with those state variables(including arrays) and then break when the proper number of primes was done.

https://gist.github.com/Maginnovision/2f7bd99afeeed351d421573950fbfdee

Ahhh .. so it wasn't starting the algorithm from zero knowledge .. each thread started from a snapshot made during a prior "learning" run.

So the total amount of work is the same, but it's actually not running the algorithm multithreaded.

What happens if you start all the threads from empty arrays, start each thread at trial = 3 + threadNum, and increment trial by NumThreads each time at try_next? And then sum nPrimes from all the threads at the end?  Or can you do a mutex with a single global "trial" variable? I assume there's an "atomic increment memory" instruction?

So far these are my results:
Code: [Select]
1 Thread - 383701084 microseconds, 3713160 primes(total) found
That's it, because it's been running for 2 hours without finishing 2 threads, so I'm guessing that doesn't work. I didn't sleep last night because my kids are sick but if you think of anything else to try I'm glad to, but I won't be coming up with a true multithreaded version soon, haha. I also had some significantly different results from the 2560, so I'm running it again to make sure it doesn't give me a different result. Lastly, your primes.txt got changed at some point(probably testing your 2560?) and it now has SZ defined as 100 by default, not 1000.
« Last Edit: November 20, 2019, 07:59:04 pm by maginnovision »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #158 on: November 20, 2019, 08:05:05 pm »
Lastly, your primes.txt got changed at some point(probably testing your 2560?) and it now has SZ defined as 100 by default, not 1000.

Oops. Must have pushed a temporary version by accident. Fixed.
 

Offline techman-001

  • Frequent Contributor
  • **
  • !
  • Posts: 748
  • Country: au
  • Electronics technician for the last 50 years
    • Mecrisp Stellaris Unofficial UserDoc
Re: Raspberry Pi 4
« Reply #159 on: November 20, 2019, 08:31:44 pm »
3713160 primes took 219927 ms not printing the primes
3713160 primes took 283662 ms after printing, the last  prime being 3713159

... 3712627 3712669 3712679 3712697 3712699 3712711 3712717 3712721 3712739 3712747 3712757 3712769 3712801 3712823 3712831 3712843 3712871 3712873 3712889 3712897 3712909 3712927 3712949 3712979 3712981 3713027 3713041 3713053 3713057 3713069 3713071 3713077 3713081 3713147 3713153 3713159

Looks like you've found all primes less than 3713160, not the first 3713160 primes. The last one should be 62710561, which is almost 17 times bigger.
 
3713159 is the 264262nd prime, so by that measure you've gone about 1/14th of the way.

A full run will take about 30 times longer. It gets harder as you go along.

Thanks for the update, <DUH> I assumed that the single parameter in Mark's program was for the number of primes, but it looks like it was for the maximum prime itself. Trying now with 62710561 :)

« Last Edit: November 20, 2019, 08:46:18 pm by techman-001 »
 

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #160 on: November 20, 2019, 08:36:10 pm »
Code: [Select]
#define SZ 1000
int primes[SZ], sieve[SZ];
int nSieve = 0;

int countPrimes() {
    primes[0] = 2; sieve[0] = 4; ++nSieve;
    int nPrimes = 1, trial = 3+1, sqr = 2;
    while (1) {
        while (sqr * sqr <= trial) ++sqr;
        --sqr;
        for (int i = 0; i < nSieve; ++i) {
            if (primes[i] > sqr) goto found_prime;
            while (sieve[i] < trial) sieve[i] += primes[i];
            if (sieve[i] == trial) goto try_next;
        }
        break;
    found_prime:
        if (nSieve < SZ) {
            primes[nSieve] = trial;
            sieve[nSieve] = trial * trial;
            ++nSieve;
            // printf("Saved %d: %d\n", nSieve, trial);
        }
        ++nPrimes;
    try_next:
        trial += 2;
    }
    return nPrimes;
}

This is what thread 1 would run with threads 0..1, and it fails to finish on my PC within a few minutes. Since the normal version finishes in 2.435 seconds there, it's not just the micro. I let it run for 7 minutes on the PC.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #161 on: November 20, 2019, 09:34:50 pm »
Ah, yeah, it's not going to find any primes at all if you only test even numbers :-)

The same change but with initial trial kept at 3 not 3+1 takes 3.35 sec instead of 3.40 sec on the machine I'm sitting at now.

So, yeah that scheme for threading isn't going to work. Need to stop one thread getting ahead of the others, at least until the arrays are filled.

It might make sense to just use a single thread until SZ primes (1000) have been found and the arrays filled. That's a very small proportion of the total time.
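
Why the 3 + threadNum start fails with two threads can be checked directly. The sketch below is hypothetical (neither helper is from primes.c): `trial_at` gives the candidates of the scheme as proposed (start at 3 + thread, step by numThreads), and `odd_trial_at` is one possible variant, an assumption of this example, that starts at 3 + 2*thread and steps by 2*numThreads so every thread sees only odd numbers.

```c
/* Candidate number `step` (0-based) for `thread` under the proposed scheme:
   start at 3 + thread, advance by numThreads each time. */
int trial_at(int thread, int numThreads, int step) {
    return 3 + thread + step * numThreads;
}

/* Assumed variant: start at 3 + 2*thread, advance by 2*numThreads,
   so every candidate is odd and the threads never overlap. */
int odd_trial_at(int thread, int numThreads, int step) {
    return 3 + 2 * thread + step * 2 * numThreads;
}
```

With two threads, trial_at(1, 2, step) yields 4, 6, 8, ... so thread 1 never sees a prime, whereas odd_trial_at(1, 2, step) yields 5, 9, 13, ... This only fixes which numbers get tested; it does nothing about one thread running ahead before the arrays are filled.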
 

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #162 on: November 20, 2019, 11:08:14 pm »
Ah, yeah, it's not going to find any primes at all if you only test even numbers :-)

The same change but with initial trial kept at 3 not 3+1 takes 3.35 sec instead of 3.40 sec on the machine I'm sitting at now.

So, yeah that scheme for threading isn't going to work. Need to stop one thread getting ahead of the others, at least until the arrays are filled.

It might make sense to just use a single thread until SZ primes (1000) have been found and the arrays filled. That's a very small proportion of the total time.

I had tried another version which immediately aborted, but I thought the even-numbers version was more indicative of the problem.

Also, I've now had 3 results in a row with the 2560 at 9,751,193 millis. So it seems that by default it was using the ext RAM. When we were using this board previously all memory was mapped, so I never thought about what it would do if it wasn't told what to do. Scaled for clock speed, it's not far off your result (actually a little quicker).

It wouldn't be too hard to block the other threads using a lock that thread 0 holds until the arrays are filled/done. I think only having the arrays populated will cause a ton of duplicate work, though: every thread beyond the first would basically be doing the same work. With trial = 3 + threadnumber and trial += totalthreads the work would be somewhat duplicated, but not entirely.
 

Offline GeorgeOfTheJungle (Topic starter)

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #163 on: November 20, 2019, 11:25:34 pm »
Apologies for butting in, but I love benchmarks for their fun value and just had to try your 3713160 primes with Forth.

Apologies for butting in, but (idem) with JavaScript :) :
primes.js: 3713160 primes found in 14078 ms

Quote from: primes.js
(function countPrimes (SZ, primes, sieve, nPrimes, nSieve, trial, sqr, t0) {
loop:
    while (1) {
        trial+= 2;
        while (sqr*sqr <= trial) sqr++;
        sqr--;
        for (var j=0; j<nSieve; j++) {
            if (primes[j] > sqr) {
                if (nSieve < SZ) {
                    primes.push(trial);
                    sieve.push(trial*trial);
                    nSieve++;
                    //console.log(nSieve, trial);
                }
                nPrimes++;
                continue loop;
            }
            while (sieve[j] < trial) sieve[j]+= primes[j];
            if (sieve[j] === trial) continue loop;
        }
        break;
    }
    console.log(nPrimes+ " primes found in "+ (Date.now()- t0)+ " ms");
})(1000, [2], [4], 1, 1, 1, 2, Date.now());
« Last Edit: November 22, 2019, 06:12:32 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline techman-001

  • Frequent Contributor
  • **
  • !
  • Posts: 748
  • Country: au
  • Electronics technician for the last 50 years
    • Mecrisp Stellaris Unofficial UserDoc
Re: Raspberry Pi 4
« Reply #164 on: November 21, 2019, 12:45:05 am »
3713160 primes took 219927 ms not printing the primes
3713160 primes took 283662 ms after printing, the last  prime being 3713159

... 3712627 3712669 3712679 3712697 3712699 3712711 3712717 3712721 3712739 3712747 3712757 3712769 3712801 3712823 3712831 3712843 3712871 3712873 3712889 3712897 3712909 3712927 3712949 3712979 3712981 3713027 3713041 3713053 3713057 3713069 3713071 3713077 3713081 3713147 3713153 3713159

Looks like you've found all primes less than 3713160, not the first 3713160 primes. The last one should be 62710561, which is almost 17 times bigger.
 
3713159 is the 264262nd prime, so by that measure you've gone about 1/14th of the way.

A full run will take about 30 times longer. It gets harder as you go along.

Thanks for the update, <DUH> I assumed that the single parameter in Mark's program was for the number of primes, but it looks like it was for the maximum prime itself. Trying now with 62710561 :)

Reaching the prime of 62710561 took 2894115 ms on an STM32F103 MCU (same as in the Blue Pill) @ 72 MHz using Forth. I'd expect it to be perhaps 3x quicker using assembly or C?

This was about 12.9 times longer than my previous effort to the prime of 3713160.
 

Online iMo

  • Super Contributor
  • ***
  • Posts: 4675
  • Country: nr
  • It's important to try new things..
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #165 on: November 21, 2019, 09:34:27 am »
It is about running the same source on different archs and measuring elapsed time, imho, not about the result :)
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #166 on: November 21, 2019, 08:40:27 pm »
It is about running the same source on different archs and measuring elapsed time, imho, not about the result :)

Yes, my idea was to run exactly the same C code and with as similar as possible quality of compiler (e.g. gcc -O1) on a wide range of machines. It's not too easy to make something that takes long enough to measure it on a fast machine but can fit on to a very small machine, AND that can't be optimized to nothing by stupid compiler tricks.

Translating the same algorithm to a similar language with a compiler or JIT can be interesting too. I've actually had a Java version at http://hoult.org/primes.java for some time, but haven't publicized it.

I made things a little bit harder for this by using goto in the C code. They are quite structured so it translates easily to a language with labelled break & continue.

Using a very different language such as Forth is maybe interesting to compare different languages on the same machine. Using a completely different algorithm is .. well, that's another comparison again. Fun to play with sure.

My primes algorithm is probably not the best one in the world :-) Actually, I wrote essentially the same algorithm in FORTRAN IV in 1980 when I was a 17 year old high school student, entered it on punched cards, and ran it on the Burroughs B1700 in a computer bureau in Whangarei one night. In that program I printed all the primes less than 1,000,000. I remember that I used a 3-way "computed goto" :-) So I have a little bit of an attachment to this algorithm. In fact that FORTRAN program was a bit more sophisticated as it started with trial=5 and then incremented it alternately by 2 or 4 (5, 7, 11, 13, 17, 19, 23, 25 ...) so not even testing multiples of 2 or 3.
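
The alternating 2/4 step is simple to reproduce. A minimal sketch in C (not the original FORTRAN; `next_candidate` and the `step` variable are made up here): flipping the step between 2 and 4 visits only numbers of the form 6k±1, which skips every multiple of 2 and 3.

```c
/* Advance *trial to the next number not divisible by 2 or 3.
   *step must start at 2 when *trial starts at 5; it then alternates
   2, 4, 2, 4, ... giving 5, 7, 11, 13, 17, 19, 23, 25, ... */
int next_candidate(int *trial, int *step) {
    *trial += *step;
    *step = 6 - *step;   /* 2 -> 4, 4 -> 2 */
    return *trial;
}
```

Starting from trial = 5 and step = 2, successive calls return 7, 11, 13, 17, 19, 23, 25. Note that 25 is still a candidate: the wheel only removes multiples of 2 and 3, the sieve has to reject the rest.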

It was interesting when "sieve" became a commonly used benchmark on microcomputers a few years later it used a completely different implementation, with a bitmap of all the numbers in the range of interest (which would need 7.5 MB of memory to find the same primes as the program we are using here!) and as each prime is found clearing the bits for all multiples of that prime.
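
For comparison, the bitmap variant looks roughly like this. It is a sketch, not the benchmark code from those years, and `count_primes_bitmap` is a made-up name. One bit per number up to 62710561 comes to roughly 7.8 million bytes, i.e. about 7.5 MiB, which matches the figure above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Count the primes below `limit` using one bit per number.
   A set bit means "known composite". Returns -1 if allocation fails. */
long count_primes_bitmap(uint32_t limit) {
    if (limit <= 2) return 0;                      /* no primes below 2 */
    uint8_t *composite = calloc((size_t)limit / 8 + 1, 1);
    if (!composite) return -1;

    long count = 0;
    for (uint64_t n = 2; n < limit; ++n) {
        if (composite[n >> 3] & (1u << (n & 7))) continue;
        ++count;                                   /* n is prime */
        /* Mark multiples of n, starting at n*n as in the incremental version. */
        for (uint64_t m = n * n; m < limit; m += n)
            composite[m >> 3] |= (uint8_t)(1u << (m & 7));
    }
    free(composite);
    return count;
}
```

The memory cost grows with the largest number examined, whereas the incremental version in this thread only keeps SZ entries in primes[] and sieve[].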

Both algorithms can I think be justifiably called "Sieve of Eratosthenes". The bitmap one is perhaps closer to what Eratosthenes actually did. Looking now, I see that Wikipedia lists my version as "Incremental sieve" https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Incremental_sieve. Of course Wikipedia wasn't available to me in hick-town NZ in 1980 so I was forced to come up with the algorithm (including the point Wikipedia notes of starting at the square of each new prime) by myself.

Ha!! Wikipedia's reference for "Incremental Sieve", a 2008 paper, calls my version "The Genuine Sieve of Eratosthenes" and the bitmap one "unfaithful sieve"! https://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf

But anyway, this is all just in fun, so any way people want to take the time to play around with something a bit different is fine by me, and interesting :-)
 
The following users thanked this post: 2N3055, iMo

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4196
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #167 on: November 22, 2019, 01:17:07 am »
Quote
I wrote essentially the same algorithm in FORTRAN IV in 1980
Heh.  I wrote a similar program at about the same time, in PDP10 assembler. It was a bit memorable because it was the first time I "noticed" that printing numbers was "expensive" compared to just calculating them...


Hmm...  Oh look - still around, I think!

Code: [Select]
start2: move CNT,count          ;number of primes that we want total
        movei THIS,3
        movei PLACE,1           ;last used spot in the table
NUM.LP: addi THIS,2             ;step to next prime
        setz INDEX,             ;start with first prime in table
TST.LP: move TEST,THIS
        idiv TEST,TABLE(INDEX)
        jumpe REM,NUM.LP        ;evenly divisible, go to next number to test
        move TEST,TABLE(INDEX) 
        imul TEST,TEST          ;square the current probe.
        camg TEST,THIS          ;result bigger than test number -> done
        aoja INDEX,TST.LP
;Current number is Prime!
        addi PLACE,1            ;next spot in the table
        movem THIS,TABLE(PLACE) ;save the prime
        movei 1,101
        move 3,[5,,^d10]
        move 2,this
        nout
         trn
        sojg CNT,NUM.LP         ;try the next odd number
        haltf

        lit
        var

TABLE:  2                       ;first two primes, as a base
        3
;area to be filled in by program...
end start
 

Offline senso

  • Frequent Contributor
  • **
  • Posts: 951
  • Country: pt
    • My AVR tutorials
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #168 on: November 22, 2019, 03:24:33 pm »
Gave this a try, for curiosity's sake.
Ubuntu inside a VirtualBox VM, running with 2 virtual cores, real CPU is a i7-8750H tweaked a bit with ThrottleStop.

gcc -O1:

Code: [Select]
Starting run
3713160 primes found in 2797 ms
242 bytes of code in countPrimes()

real 0m2,800s
user 0m2,794s
sys 0m0,004s

With gcc -O2 something funny happens: it is slower and reports a negative size.

Code: [Select]
Starting run
3713160 primes found in 3028 ms
-416 bytes of code in countPrimes()

real 0m3,038s
user 0m3,021s
sys 0m0,008s

gcc -Os:
Code: [Select]
Starting run
3713160 primes found in 3181 ms
-410 bytes of code in countPrimes()

real 0m3,186s
user 0m3,178s
sys 0m0,004s

And gcc -O0:

Code: [Select]
Starting run
3713160 primes found in 8623 ms
392 bytes of code in countPrimes()

real 0m8,643s
user 0m8,612s
sys 0m0,012s

And for fun, latest version of VS Community Edition 2017, C++ compiler, O1 optimization, on native Win10:
Code: [Select]
Starting run
3713160 primes found in 3460 ms
147 bytes of code in countPrimes()

So, Linux on a VM is 600ms faster than the native MS compiler :wtf:
« Last Edit: November 22, 2019, 03:39:30 pm by senso »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 3998
  • Country: nz
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #169 on: November 22, 2019, 05:05:26 pm »
Finding the code size from within the program itself is hacky and depends on countPrimes() and main() staying adjacent in the final binary and in that order. It seems to be reasonably reliable on gcc with -O1, but not at higher levels.
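
The in-program measurement itself is not shown in this thread, but the general trick is presumably along these lines. Treat it as a hypothetical sketch: `work`, `end_marker` and `code_size_estimate` are invented names, and converting function pointers to integers is implementation-defined rather than portable C. The result only means something if the compiler and linker happen to place `end_marker` directly after `work`; if they swap the order, the difference comes out negative.

```c
#include <stdint.h>

/* Function whose size we want to estimate. */
int work(int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += i * i;
    return sum;
}

/* Marker defined right after work() in the source; nothing guarantees
   it also follows work() in the final binary. */
int end_marker(void) { return 42; }

/* Signed distance in bytes between the two entry points. */
long code_size_estimate(void) {
    return (long)((intptr_t)end_marker - (intptr_t)work);
}
```

The value also includes any alignment padding between the two functions, so even when the order holds it is an upper bound rather than an exact size.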

147 bytes of code is impressively small from vc++. Much smaller than any other ISA so I'm not sure that's a genuine number. Would have to look at the disassembly to know.

I also found -O1 faster than -O2 on my i7-8650U machines.
 

Offline gmb42

  • Frequent Contributor
  • **
  • Posts: 294
  • Country: gb
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #170 on: November 22, 2019, 05:18:28 pm »
And for fun, latest version of VS Community Edition 2017, C++ compiler, O1 optimization, on native Win10:
Code: [Select]
Starting run
3713160 primes found in 3460 ms
147 bytes of code in countPrimes()

So, Linux on a VM is 600ms faster than the native MS compiler :wtf:

The VS optimisation flags are a little different from gcc's. For VS 2017 they are:

Code: [Select]
/O1 maximum optimizations (favor space) /O2 maximum optimizations (favor speed)
/Ob<n> inline expansion (default n=0)   /Od disable optimizations (default)
/Og enable global optimization          /Oi[-] enable intrinsic functions
/Os favor code space                    /Ot favor code speed
/Ox optimizations (favor speed)         /Oy[-] enable frame pointer omission
/favor:<blend|ATOM> select processor to optimize for, one of:
    blend - a combination of optimizations for several different x86 processors
    ATOM - Intel(R) Atom(TM) processors

So O1 actually minimises space as can be seen in the output that says only
Code: [Select]
147 bytes of code in countPrimes()
On my machine (i7-8700) with /O1:

Code: [Select]
Starting run
3713160 primes found in 3145 ms
146 bytes of code in countPrimes()

and with /O2:

Code: [Select]
Starting run
3713160 primes found in 3115 ms
176 bytes of code in countPrimes()
 

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #171 on: November 22, 2019, 05:47:18 pm »
On my Mac -Os is smaller (172 / 224) and faster (4.9 / 5.6) than -O1

Quote
$ gcc -Os /primes.c ; time ./a.out
Starting run
3713160 primes found in 4977 ms
172 bytes of code in countPrimes()

real   0m4.981s
user   0m4.973s
sys   0m0.006s

$ gcc -O1 /primes.c ; time ./a.out
Starting run
3713160 primes found in 5603 ms
224 bytes of code in countPrimes()

real   0m5.612s
user   0m5.591s
sys   0m0.016s

And -O0 is slower than JavaScript...

primes.js: 3713160 primes found in 14078 ms

Quote
$ gcc -O0 /primes.c ; time ./a.out
Starting run
3713160 primes found in 18625 ms
400 bytes of code in countPrimes()

real   0m18.640s
user   0m18.597s
sys   0m0.032s
« Last Edit: November 22, 2019, 05:49:46 pm by GeorgeOfTheJungle »
 

Offline GeorgeOfTheJungle (Topic starter)

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #172 on: November 22, 2019, 05:56:53 pm »
Finding the code size from within the program itself is hacky and depends on countPrimes() and main() staying adjacent in the final binary and in that order. It seems to be reasonably reliable on gcc with -O1, but not at higher levels.

147 bytes of code is impressively small from vc++. Much smaller than any other ISA so I'm not sure that's a genuine number. Would have to look at the disassembly to know.

I also found -O1 faster than -O2 on my i7-8650U machines.

Use nm:

Quote
$ nm ./a.out
0000000100000000 T __mh_execute_header
                 U _clock
0000000100000d8a T _countPrimes
0000000100000e36 T _main
0000000100001030 S _nSieve
0000000100001040 S _primes
                 U _printf
                 U _puts
0000000100001fe0 S _sieve
                 U dyld_stub_binder

0xe36-0xd8a= 172
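
The same subtraction can be scripted. A small sketch in C, where `symbol_size` is a hypothetical helper that takes two hex addresses as printed by nm: the symbol itself and the next symbol in the text section.

```c
#include <stdlib.h>

/* Size in bytes of a symbol, given its address and the address of the
   symbol that follows it, both as hex strings in nm's output format. */
unsigned long long symbol_size(const char *start_hex, const char *next_hex) {
    unsigned long long start = strtoull(start_hex, NULL, 16);
    unsigned long long next  = strtoull(next_hex, NULL, 16);
    return next - start;
}
```

Called as symbol_size("0000000100000d8a", "0000000100000e36") it gives 172. This assumes _main really is the next symbol after _countPrimes, which is the case in the listing above.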
 

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #173 on: November 22, 2019, 07:24:26 pm »
Nobody got better than my 2.435s yet? It's not even a fast PC. Maybe it's all the mobile CPUs.
 

Online iMo

  • Super Contributor
  • ***
  • Posts: 4675
  • Country: nr
  • It's important to try new things..
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #174 on: November 22, 2019, 08:45:53 pm »
Pelles C, Win7 64b, i3-6320 3900 MHz
-std:C11 -Tx64-coff -Ot -Ob1 -fp:precise -W1 -Gr

3713160 primes found in 3603 ms
224 bytes of code in countPrimes()


« Last Edit: November 22, 2019, 08:49:14 pm by imo »
 

