#define LED GPIO_NUM_4

void setup() {
  pinMode(LED, OUTPUT);
}

void IRAM_ATTR loop() { // Unrolled loop to avoid cache miss / branches
  while (1) {
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED); // Set led pin high. GPIO_OUT_W1TS / GPIO_OUT_W1TC work like STM32 BSRR registers (Set/Reset mask)
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED); // Set led pin low
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED); // And so on
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED);
  }
}
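For anyone more familiar with STM32, a minimal sketch of the BSRR analogy the comment refers to (CMSIS-style register access; port and pin are illustrative):

// GPIOx_BSRR: writing bit n (0-15) sets pin n, writing bit n+16 resets it,
// so a single 32-bit store sets or clears pins without a read-modify-write.
GPIOA->BSRR = (1U << 4);        // PA4 high, like GPIO_OUT_W1TS_REG
GPIOA->BSRR = (1U << (4 + 16)); // PA4 low,  like GPIO_OUT_W1TC_REG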
Only Core 0: 4220ms
Only Core 1: 4273ms
Core 0: 4226ms
Core 1: 4235ms
So 240MHz dual-core, but it performs like a 45HP car with 300KG in the trunk and a brick under the gas pedal... but hey, with "Sport", "Turbo" and "V6" stickers :-DD
This is not a CPU but an MCU, targeting external circuitry, so yes, I/O speed is very important.
void IRAM_ATTR loop(){ // Unrolled loop to avoid cache miss / branches
40375144: 004136 entry a1, 32
while(1){
WRITE_PERI_REG(GPIO_OUT_W1TS_REG,1<<LED); // Set led pin high. GPIO_OUT_W1TS / GPIO_OUT_W1TC work like STM32 BSRR registers (Set/Reset mask)
40375147: fcafa1 l32r a10, 40374404 <_iram_text_start>
WRITE_PERI_REG(GPIO_OUT_W1TC_REG,1<<LED); // Set led pin low
4037514a: fcaf91 l32r a9, 40374408 <_iram_text_start+0x4>
WRITE_PERI_REG(GPIO_OUT_W1TS_REG,1<<LED); // Set led pin high. GPIO_OUT_W1TS / GPIO_OUT_W1TC work like STM32 BSRR registers (Set/Reset mask)
4037514d: 081c movi.n a8, 16
4037514f: 0020c0 memw
40375152: 0a89 s32i.n a8, a10, 0
WRITE_PERI_REG(GPIO_OUT_W1TC_REG,1<<LED); // Set led pin low
40375154: 0020c0 memw
40375157: 0989 s32i.n a8, a9, 0
WRITE_PERI_REG(GPIO_OUT_W1TS_REG,1<<LED); // And so on
40375159: 0020c0 memw
4037515c: 0a89 s32i.n a8, a10, 0

Quote: These ARM32 processors use an ARM32 core (which they bought in from ARM as a "block") ...

But it wouldn't be entirely surprising if a "memw" to specific memory regions caused a trap to OS code that carefully manages access by multiple CPUs/etc. :-(
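For context, memw is documented as the Xtensa memory-ordering (barrier) instruction, and xtensa gcc inserts it before volatile memory references because -mserialize-volatile is the default for the target. A hedged build-flag experiment (the flag is in gcc's documented Xtensa options; the effect on toggle rate is an assumption, not measured here):

// Rebuild the toggle loop with MEMW suppression around volatile accesses:
//   xtensa-esp32s3-elf-gcc -O2 -mno-serialize-volatile blink.c
// The W1TS/W1TC pair should then compile to back-to-back s32i.n stores
// with no memw between them (memory-ordering guarantees are weakened accordingly).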
You can always allow full speed and leave the performance/watt selection to the user, just like any STM32.
Quote: The compiler might even be able to do some optimizations at compile time today, I'm sure.
But for MCUs, newlib-nano 1, 2 and derivatives such as picolibc will copy word by word (long, in fact) with some loop unrolling, if the source and destination addresses are both word aligned. Unless...
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
// byte by byte copy
It's worth investigating exactly which version of memcpy your particular build ends up giving you...
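A simplified sketch of that word-wise strategy (illustrative only, condensed from the description above, not the actual newlib-nano source):

#include <stddef.h>
#include <stdint.h>

void *memcpy_sketch(void *dst, const void *src, size_t n) {
    uint8_t *d = dst;
    const uint8_t *s = src;
    // Word-wise path, taken only if both pointers are long-aligned
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(long) - 1)) == 0) {
        long *dw = (long *)d;
        const long *sw = (const long *)s;
        while (n >= 4 * sizeof(long)) { // modest unrolling: 4 longs per pass
            *dw++ = *sw++; *dw++ = *sw++;
            *dw++ = *sw++; *dw++ = *sw++;
            n -= 4 * sizeof(long);
        }
        d = (uint8_t *)dw;
        s = (const uint8_t *)sw;
    }
    while (n--) // unaligned case and remaining tail: byte by byte
        *d++ = *s++;
    return dst;
}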
Yes, I already said 16MHz effective rate! Not terrible, but I think the 80MHz bus is a bit silly when you have 2x 240MHz cores.

The 512K is a banner spec. If you dive into the datasheet, only 416K is contiguous; 32K is ICache and up to 64K is DCache (416K + 32K + 64K = 512K).
The mouth-watering 8MB PSRAM sounds great, but in real life it is rather limited.
memcpy tests copying 32K uint8_t, interleaving two buffers to avoid caching:

TX SZ: 67108864 Bytes (64MB)
SRAM->SRAM: 176ms (363MB/s)
PSRAM->SRAM: 2428ms (26MB/s)
PSRAM->PSRAM: 7061ms (9MB/s)
No idea why SRAM is achieving 363MB/s?
I expected memcpy to copy one byte at a time, so at best 240MB/s for a 240MHz CPU.
Changing the buffers to uint32_t gave the same results.
Also, of the 512KB, only 295KB were available to the user, with BT, Wi-Fi... everything disabled. That was a bit of a disappointment.
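For reference, a hedged sketch of how such an interleaved test can be set up under Arduino/ESP-IDF (heap_caps_malloc and the MALLOC_CAP_* flags are the real ESP-IDF API; the harness itself is illustrative, not the code used above):

#include <Arduino.h>
#include <string.h>
#include "esp_heap_caps.h"

static const size_t BUF = 32 * 1024;            // 32K per buffer
static const size_t TOTAL = 64UL * 1024 * 1024; // 64MB moved in total

// Alternate between two src/dst pairs so the data cache can't keep
// serving the same 32K working set.
static void bench(const char *name, uint8_t *d0, uint8_t *s0,
                  uint8_t *d1, uint8_t *s1) {
  uint32_t t0 = millis();
  for (size_t done = 0; done < TOTAL; done += 2 * BUF) {
    memcpy(d0, s0, BUF); // pair 0
    memcpy(d1, s1, BUF); // pair 1 evicts pair 0's cache lines
  }
  uint32_t dt = millis() - t0;
  Serial.printf("%s: %lums (%luMB/s)\n", name, (unsigned long)dt,
                (unsigned long)((TOTAL >> 20) * 1000 / dt));
}

void setup() {
  Serial.begin(115200);
  uint8_t *sram[4], *psram[4];
  for (int i = 0; i < 4; i++) { // internal SRAM vs. external PSRAM buffers
    sram[i]  = (uint8_t *)heap_caps_malloc(BUF, MALLOC_CAP_INTERNAL);
    psram[i] = (uint8_t *)heap_caps_malloc(BUF, MALLOC_CAP_SPIRAM);
  }
  bench("SRAM->SRAM",   sram[0],  sram[1],  sram[2],  sram[3]);
  bench("PSRAM->SRAM",  sram[0],  psram[1], sram[2],  psram[3]);
  bench("PSRAM->PSRAM", psram[0], psram[1], psram[2], psram[3]);
}

void loop() {}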
Quote: This compiler uses -Os (optimize for size) by default.

Which memcpy you have access to is a link-level thing, not a compiler thing.
Quote: With -Ofast, gcc will inline fixed-length memcpy with size up to 64 bytes.

Hmm. That raises the interesting possibility that a user-written loop of smaller memcpy() calls would be faster than a single large memcpy(), depending on the library...
The compiler might even be able to do some optimizations at compile time today, I'm sure.

With -Ofast, gcc will inline fixed-length memcpy with size up to 64 bytes, clang up to 16 bytes (using target cortex-m7).
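This is easy to check in a disassembly or on Compiler Explorer; a minimal sketch (the 64/16-byte cutoffs are the posters' observations for their targets, not verified here):

#include <string.h>

// Constant small size: gcc at -Ofast/-O2 typically expands this inline
// as a handful of word loads/stores instead of calling the library.
void copy64(void *dst, const void *src) {
  memcpy(dst, src, 64);
}

// Runtime size: a real library call is emitted, so which memcpy the
// linker resolves (newlib, newlib-nano, picolibc) matters again.
void copyn(void *dst, const void *src, size_t n) {
  memcpy(dst, src, n);
}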
Quote: I just thought I'd point out that none of the ESP processors use an ARM core. Most use an Xtensa core from Tensilica (same principles apply; Tensilica just isn't as successful as ARM). Some of the newer ESP chips use a RISC-V core.

I thought the Xtensa cores were designed to be space-efficient and easy to accommodate in FPGAs and ASICs, not necessarily high-performance. Properties like "1 cycle = 1 instruction" pipelines and wide flash read-ahead tend to conflict with a small footprint.