Author Topic: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO  (Read 36060 times)


Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Raspberry Pi 4
« Reply #100 on: November 14, 2019, 08:16:16 am »
The .pdf (*) says (page 2) "Tightly coupled GPIOs, operating at the same frequency as Arm".

I was hoping to see ~ 1/2 the cpu clock, but it's only 1/4th:

Code: [Select]
#define PIN 13

void setup () { pinMode(PIN, OUTPUT); }

void loop () {
  while (1) {
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
  }
}

Gives 150MHz @600MHz cpu clock. What am I doing wrong?
« Last Edit: November 14, 2019, 02:50:16 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4034
  • Country: nz
Re: Raspberry Pi 4
« Reply #101 on: November 14, 2019, 08:21:29 am »
Dammit .. I obviously *need* an Arduino mega 2560. Ordered a clone for $15.

He who dies with the most toys wins.
« Last Edit: November 14, 2019, 08:23:04 am by brucehoult »
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
Re: Raspberry Pi 4
« Reply #102 on: November 14, 2019, 08:45:51 am »
Unroll the loop a bit?
Code: [Select]
void loop () {
  while (1) {
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
  }
}
Email: OwOwOwOwO123@outlook.com
 
The following users thanked this post: GeorgeOfTheJungle

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Raspberry Pi 4
« Reply #103 on: November 14, 2019, 08:59:13 am »
Unroll the loop a bit?
Code: [Select]
void loop () {
  while (1) {
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
  }
}

That's what I tried first, but it makes no difference. It looks as if the GPIO bus clock were 1/2 the CPU clock? But that's not what the .pdf says.
« Last Edit: November 14, 2019, 02:37:08 pm by GeorgeOfTheJungle »
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
Re: Raspberry Pi 4
« Reply #104 on: November 14, 2019, 01:42:58 pm »
Zynq-7010 (Cortex A9) @ 650MHz:
Code: [Select]
$ gcc primes.c -o primes -O3
$ time ./primes
Starting run
3713160 primes found in 39728 ms
-396 bytes of code in countPrimes()

real    0m39.745s
user    0m39.723s
sys     0m0.011s
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
650000

i7-4700MQ @ 2.40GHz:
Code: [Select]
$ clang-7 primes.c -o primes -Ofast
$ time ./primes
Starting run
3713160 primes found in 4707 ms
256 bytes of code in countPrimes()

real 0m4.708s
user 0m4.708s
sys 0m0.000s
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
2400000
$ grep MHz /proc/cpuinfo
cpu MHz : 2394.522
cpu MHz : 2395.176
cpu MHz : 2394.455
cpu MHz : 2394.955
cpu MHz : 2394.427
cpu MHz : 2394.608
cpu MHz : 2394.432
cpu MHz : 2394.821
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
Re: Raspberry Pi 4
« Reply #105 on: November 14, 2019, 02:34:10 pm »
Quote
Yes, that's the first thing I tried, but doesn't work. It looks as if the gpio bus clock is 1/2 the cpu clock? But that's not what the .pdf says...

Can you look at the disassembly and see if the accesses are single instruction? Also, I see the GPIO is on an AHB bus and behind an adapter of some sort ("AIPS-Lite"). Maybe the core waits for the write response when accessing uncached memory? I would imagine accessing peripheral areas also stalls the entire processor until the access is acknowledged...
 
The following users thanked this post: GeorgeOfTheJungle

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Raspberry Pi 4
« Reply #106 on: November 14, 2019, 02:45:44 pm »
Also I see GPIO is on a AHB bus and behind an adapter of some sort ("AIPS-Lite"). Maybe the core waits for the write response when accessing uncached memory? I would imagine accessing peripheral areas also stalls the entire processor until the access is acknowledged...

Maybe, but then, why do they say "Tightly coupled GPIOs, operating at the same frequency as Arm" in the pdf? What I'm seeing here is similar to what the STM32s do, where the gpio bus runs at a different speed from a different clock than the CPU.

Quote
Can you look at the disassembly and see if the accesses are single instruction?

I'm using the arduino IDE, I don't know how to do that with this.

Edit:
There are 5 "gpio modules" GPIO1..5, maybe they're not all equal? Pin 13 (what I'm using) is on GPIO2, perhaps some other port is faster?
« Last Edit: November 14, 2019, 02:56:42 pm by GeorgeOfTheJungle »
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
Re: Raspberry Pi 4
« Reply #107 on: November 14, 2019, 03:53:02 pm »
Operating frequency of the GPIO peripheral doesn't mean shit. If it takes N cycles to get a write request through the interconnect to the peripheral and the write response back, then that's N cycles the processor can NOT DO ANYTHING because it's required to guarantee order of accesses (this is IO memory and not cached memory). I think 2 cycles for a GPIO write is already very good. If you want it down to 1 cycle the GPIO controller must be integrated into the CPU itself.
 
The following users thanked this post: GeorgeOfTheJungle

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8646
  • Country: gb
Re: Raspberry Pi 4
« Reply #108 on: November 14, 2019, 03:57:08 pm »
Unroll the loop a bit?
Code: [Select]
void loop () {
  while (1) {
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTSET = CORE_PIN13_BITMASK;
    CORE_PIN13_PORTCLEAR = CORE_PIN13_BITMASK;
  }
}

That's what I tried first, but does nothing. It looks as if the gpio bus clock was 1/2 the cpu clock? But that's not what the .pdf says.
On most simple machines I would expect what you get, if the GPIOs are on the full speed bus. One cycle to get the set instruction. One cycle to write to the GPIO. One cycle to get the clear instruction. One cycle to write to the GPIO. Rinse and repeat.
« Last Edit: November 14, 2019, 03:59:01 pm by coppice »
 
The following users thanked this post: GeorgeOfTheJungle

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14470
  • Country: fr
Re: Raspberry Pi 4
« Reply #109 on: November 14, 2019, 04:25:15 pm »
I did some tests with Bruce's code, and I confirm that with his code I also get the same execution time with -O2 and -O3 on my Core i7: 2490 ms. Now I tried with -Ofast, and it's actually slightly slower (which is consistent with my previous benchmarks with -Ofast which I've found often slower than -O3 actually), with 2550 ms. This is not that surprising, as execution time depends on many factors including how code and data are cached.

On an i7-8650U (which I have two of .. a NUC and a ThinkPad X1 Carbon) it's actually faster with -O1 (2735ms) than with -O2 or -O3 (3428ms) !!

On my i7-5930K, I get: 2490 ms for -O2 and -O3, but 2605 ms for -O1... (GCC 9.2.0 here if that matters.)

At this level semi-random things such as how code (especially branch targets) happen to fall in cache lines makes a big difference. And ASLR makes it vary from run to run.

Yup. Also 1/ not all GCC back-ends are born equal, some issue much better code for their given target than others, 2/ even when using "similar" targets (Core-i7 here), there can be huge difference running the exact same object code. The i7-5930K (even though now a bit old) is still a power horse, and it supports quad-channel RAM, and probably a lot more cache than the typical CPUs used on laptops (I'd have to check with yours.) I compiled it as 64-bit executables if that makes a difference, don't know if you did or if you only tested 32-bit builds.

I also checked with my laptop, which has a (relatively old) i7-2600M, and I get about twice the execution time, but -O1 is still slower than -O2 or -O3 on it. On laptops (on mine for sure) you are likely using some kind of "on-demand" frequency governor, so you never really get the top performance, and performance can vary depending on many more factors than on systems running at a fixed frequency. (The fun fact is that at -O1, on my laptop, execution times seemed to vary much more between runs than with -O3.)

It would fit on an Apple ][ or C64 or Atari XL. But no one has C compilers for them. C compilers for Z80 suck but at least they exist. Anyone have a working Speccy or Amstrad CPC or something?

C compilers for Z80 do exist yeah. I remember people writing Gameboy apps in C for instance. Dunno if the compiler sucked, but it sure seemed to work fine.

All I have is a Sinclair QL (68008), there are C compilers but have never used any. Don't really feel like firing it up again and fiddle with this at the moment, but it should be doable.

Wouldn't one of the emulators for the Spectrum or CPC be fine for this? (I guess there are some emulators that should be relatively accurate timing wise?)
« Last Edit: November 14, 2019, 04:30:46 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4034
  • Country: nz
Re: Raspberry Pi 4
« Reply #110 on: November 14, 2019, 05:23:06 pm »
Quote
Can you look at the disassembly and see if the accesses are single instruction?

I'm using the arduino IDE, I don't know how to do that with this.

The Arduino IDE actually makes it relatively simple to do this. Make a trivial change to your source code -- maybe add and delete a character -- and then hit the "check"/"compile" button. In the panel at the bottom (maybe make it bigger) you'll see a line like the following, with the first (very long) "word" ending in gcc and the path to your eventual executable binary file in the middle (here Blink.ino.elf) after a "-o". Or for certain targets the gcc might be ld instead.

/home/bruce/software/arduino-1.8.10/hardware/teensy/../tools/arm/bin/arm-none-eabi-gcc -O1 -Wl,--gc-sections,--relax -T/home/bruce/software/arduino-1.8.10/hardware/teensy/avr/cores/teensy4/imxrt1062.ld -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -o /tmp/arduino_build_829669/Blink.ino.elf /tmp/arduino_build_829669/sketch/Blink.ino.cpp.o /tmp/arduino_build_829669/core/core.a -L/tmp/arduino_build_829669 -larm_cortexM7lfsp_math -lm -lstdc++

Open a terminal window (from your OS, nothing to do with gcc) and copy and paste the bit with gcc or ld and the output file. Don't try to run it yet!

/home/bruce/software/arduino-1.8.10/hardware/teensy/../tools/arm/bin/arm-none-eabi-gcc  /tmp/arduino_build_829669/Blink.ino.elf

Now just replace the "gcc" bit by "objdump -d":

/home/bruce/software/arduino-1.8.10/hardware/teensy/../tools/arm/bin/arm-none-eabi-objdump -d  /tmp/arduino_build_829669/Blink.ino.elf

You can run that.

If you can't scroll your terminal window backwards then you might want to put " | more" (or " | less") on the end, or redirect the output to a file with " >/home/bruce/myDisassembly.txt" or whatever other location or name you want. (Your name probably isn't Bruce...)

If the compiler is gcc then you can get an assembly language listing by instead finding the line that compiled your code ("Blink.ino.cpp") to an object file ("-o .../Blink.ino.cpp.o"). You can just copy and paste the whole line into your console/terminal window and re-run it. If you add to the end " -g -Wa,-adhl" then you'll get a listing printed to the terminal with the original lines of C code, the generated assembly language, and the binary (hex) code for the instructions.
 
The following users thanked this post: GeorgeOfTheJungle

Offline iMo

  • Super Contributor
  • ***
  • Posts: 4782
  • Country: pm
  • It's important to try new things..
Re: Raspberry Pi 4
« Reply #111 on: November 14, 2019, 06:17:21 pm »
The .pdf (*) says (page 2) "Tightly coupled GPIOs, operating at the same frequency as Arm".
I was hoping to see ~ 1/2 the cpu clock, but it's only 1/4th:
Here is something on that:
https://community.nxp.com/docs/DOC-342954
 
The following users thanked this post: GeorgeOfTheJungle

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #112 on: November 14, 2019, 06:30:03 pm »
It seems that the writes take two cycles :-(

Code: [Select]
#define PIN 13
volatile unsigned int* _set= (volatile unsigned int*) 0x42004084;
volatile unsigned int* _clr= (volatile unsigned int*) 0x42004088;
volatile unsigned int* _flip= (volatile unsigned int*) 0x4200408c;

void setup () { pinMode(PIN, OUTPUT); }
void loop () {
  while (1) { *_flip= 0x8; *_flip= 0x8; *_flip= 0x8; *_flip= 0x8; }
}
« Last Edit: November 25, 2019, 10:44:51 am by GeorgeOfTheJungle »
 

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Raspberry Pi 4
« Reply #113 on: November 14, 2019, 06:35:31 pm »
The .pdf (*) says (page 2) "Tightly coupled GPIOs, operating at the same frequency as Arm".
I was hoping to see ~ 1/2 the cpu clock, but it's only 1/4th:
Here is something on that:
https://community.nxp.com/docs/DOC-342954

 :-+

Quote
RT1060 provides two set of GPIOs registers to control pads output. GPIO1 to GPIO3 is general GPIO, and GPIO6 to GPIO8 is tightly GPIO, but they share the same pad, that means the gpio pin can select from GPIO1/2/3 to GPIO6/7/8.

Then there's still hope! Because I'm using GPIO2.
« Last Edit: November 14, 2019, 06:43:28 pm by GeorgeOfTheJungle »
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8646
  • Country: gb
Re: Raspberry Pi 4
« Reply #114 on: November 14, 2019, 06:36:52 pm »
Quote
It would fit on an Apple ][ or C64 or Atari XL. But no one has C compilers for them.

C compilers for the Apple ][ existed. I used one of them. If they existed for the Apple ][, I'm sure they existed for the C64 as well.

Quote
C compilers for Z80 suck but at least they exist. Anyone have a working Speccy or Amstrad CPC or something?

C compilers for the Z80 were fine, but most C compilers used for Z80s were actually 8080 compilers, and the restricted instruction set they spewed out certainly hampered performance. Nonetheless, huge amounts of widely used CP/M and embedded Z80 code were developed in C, and ran very well.
 

Offline iMo

  • Super Contributor
  • ***
  • Posts: 4782
  • Country: pm
  • It's important to try new things..
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #115 on: November 14, 2019, 06:37:54 pm »
Quote
When using the register DR_TOGGLE and the fast GPIO we will get the best performance of the pin.
From the above link.. Did you try it?
 

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #116 on: November 14, 2019, 06:40:39 pm »
Quote
When using the register DR_TOGGLE and the fast GPIO we will get the best performance of the pin.
From the above link.. Did you try it?

Yes:
Code: [Select]
volatile unsigned int* _flip= (volatile unsigned int*) 0x4200408c;
But I'm using the wrong GPIO group it seems.
 

Offline iMo

  • Super Contributor
  • ***
  • Posts: 4782
  • Country: pm
  • It's important to try new things..
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #117 on: November 14, 2019, 06:46:55 pm »
My bet with fast GPIO group and unrolled loop you get 300MHz toggle :)
 
The following users thanked this post: GeorgeOfTheJungle

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: Raspberry Pi 4
« Reply #118 on: November 14, 2019, 06:57:03 pm »
Dammit .. I obviously *need* an Arduino mega 2560. Ordered a clone for $15.

He who dies with the most toys wins.

If that doesn't work I can always try with some protos that didn't work out. 2560's with 256k RAM. Would be slightly slower than all on chip but they can run 16 or 20 MHz.
 

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #119 on: November 14, 2019, 07:18:18 pm »
My bet with fast GPIO group and unrolled loop you get 300MHz toggle :)

Bad news... I'm using GPIO7 already :-(



https://github.com/PaulStoffregen/cores/blob/master/teensy4/imxrt.h#L5039-L5050

Code: [Select]
volatile unsigned int* _flip= (volatile unsigned int*) 0x4200408c;
« Last Edit: November 14, 2019, 07:45:00 pm by GeorgeOfTheJungle »
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6259
  • Country: fi
    • My home page and email address
Re: Raspberry Pi 4
« Reply #120 on: November 14, 2019, 08:14:10 pm »
Quote
Can the MCU on the Teensy 4.0 board really run reliably @1GHz? That's pretty impressive. Is it not getting too hot?

You do need a heatsink on the i.MX chip.

Quote
I was hoping to see ~ 1/2 the cpu clock, but it's only 1/4th:

Each bus load/store does take two clocks, because the ARMv7-M Thumb-2 single load and store instructions take two cycles each, so 1/4th of the CPU clock on the output pin is expected.

(The Processor Instruction Timings section in the Cortex-M3 Technical Reference Manual gives 2 clocks per load/store instruction, with a footnote saying that "Generally, load-store instructions take two cycles for the first access [and one cycle for each additional access]", so I believe the two cycles per I/O load/store is expected.  Note: while the i.MX RT1060 is a Cortex-M7, Cortex-M7 instruction timings are not published, so I went with the Cortex-M3 figures.)

Teensy 4.0 does have the processor cycle counter enabled and directly accessible as the ARM_DWT_CYCCNT macro, BTW.
« Last Edit: November 14, 2019, 08:24:11 pm by Nominal Animal »
 
The following users thanked this post: GeorgeOfTheJungle

Offline iMo

  • Super Contributor
  • ***
  • Posts: 4782
  • Country: pm
  • It's important to try new things..
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #121 on: November 14, 2019, 09:40:41 pm »
Yep, it looks like 150 MHz is the max toggling frequency, a pity..
https://www.nxp.com/docs/en/application-note/AN12240.pdf
 
The following users thanked this post: GeorgeOfTheJungle

Offline maginnovision

  • Super Contributor
  • ***
  • Posts: 1963
  • Country: us
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #122 on: November 14, 2019, 10:17:27 pm »
What does the 150MHz toggling look like on a scope? Is it pretty clean? I still haven't thought of a reason to buy a teensy 4. Waiting on the 3.6 form factor with sd slot.
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6259
  • Country: fi
    • My home page and email address
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #123 on: November 15, 2019, 01:02:51 am »
Yep, it looks the 150Mhz is the max toggling freq
You should be able to reach 240 MHz if you overclock to 960MHz, although you need a heatsink for the i.MX RT1060 chip.  :P

What does the 150MHz toggling look like on a scope?
I wish I had a scope with that kind of bandwidth!

I don't think I have seen any scope screenshots of that at the PJRC forum, but based on the discussions around PWM, timers, SPI/I2C I/O, especially in the looong beta test thread, I think one needs to go an order of magnitude lower in the frequency to generate controllable/useful outputs.  Just my opinion, though.
 

Offline GeorgeOfTheJungleTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: RPi 4 / STM32 / ESP32 / Teensy 4 / RISC-V GAZPACHO
« Reply #124 on: November 15, 2019, 07:11:47 am »
What does the 150MHz toggling look like on a scope? Is it pretty clean? I still haven't thought of a reason to buy a teensy 4. Waiting on the 3.6 form factor with sd slot.

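This is with a Micsig TO1074, 130MHz probe, Teensy @960MHz. There's an interrupt that kicks in every 1ms.

[Attachment 872212-0: scope screenshot]

Agilent DSO7104, 1165A probe, Teensy @1008MHz:

[Attachment 872310-1: scope screenshot]

Teensy @600MHz:

[Attachment 872298-2: scope screenshot]

The interrupt:

[Attachment 872304-3: scope screenshot]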
« Last Edit: November 15, 2019, 11:17:30 am by GeorgeOfTheJungle »
 
The following users thanked this post: iMo, Nominal Animal

