#define LED GPIO_NUM_4

void setup() {
  pinMode(LED, OUTPUT);
}

void IRAM_ATTR loop() { // Unrolled loop to avoid cache miss / branches
  while (1) {
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED); // Set led pin high. GPIO_OUT_W1TS / GPIO_OUT_W1TC work like STM32 BSRR registers (Set/Reset mask)
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED); // Set led pin low
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED); // And so on
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TS_REG, 1 << LED);
    WRITE_PERI_REG(GPIO_OUT_W1TC_REG, 1 << LED);
  }
}
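For anyone more familiar with STM32, a minimal sketch of the BSRR analogy the comment refers to (CMSIS-style register access; port and pin are illustrative):

// GPIOx_BSRR: writing bit n (0-15) sets pin n, writing bit n+16 resets it,
// so a single 32-bit store sets or clears pins without a read-modify-write.
GPIOA->BSRR = (1U << 4);        // PA4 high, like GPIO_OUT_W1TS_REG
GPIOA->BSRR = (1U << (4 + 16)); // PA4 low,  like GPIO_OUT_W1TC_REG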
Only Core 0: 4220ms
Only Core 1: 4273ms
Core 0: 4226ms
Core 1: 4235ms
So 240MHz dual-core, but it performs like a 45HP car with 300KG in the trunk and a brick under the gas pedal... but hey, with "Sport", "Turbo" and "V6" stickers :-DD
This is not a CPU but an MCU, targeting external circuitry, so yes, I/O speed is very important.
void IRAM_ATTR loop(){ // Unrolled loop to avoid cache miss / branches
40375144: 004136 entry a1, 32
while(1){
WRITE_PERI_REG(GPIO_OUT_W1TS_REG,1<<LED); // Set led pin high. GPIO_OUT_W1TS / GPIO_OUT_W1TC work like STM32 BSRR registers (Set/Reset mask)
40375147: fcafa1 l32r a10, 40374404 <_iram_text_start>
WRITE_PERI_REG(GPIO_OUT_W1TC_REG,1<<LED); // Set led pin low
4037514a: fcaf91 l32r a9, 40374408 <_iram_text_start+0x4>
WRITE_PERI_REG(GPIO_OUT_W1TS_REG,1<<LED); // Set led pin high. GPIO_OUT_W1TS / GPIO_OUT_W1TC work like STM32 BSRR registers (Set/Reset mask)
4037514d: 081c movi.n a8, 16
4037514f: 0020c0 memw
40375152: 0a89 s32i.n a8, a10, 0
WRITE_PERI_REG(GPIO_OUT_W1TC_REG,1<<LED); // Set led pin low
40375154: 0020c0 memw
40375157: 0989 s32i.n a8, a9, 0
WRITE_PERI_REG(GPIO_OUT_W1TS_REG,1<<LED); // And so on
40375159: 0020c0 memw
4037515c: 0a89 s32i.n a8, a10, 0

Quote: These ARM32 processors use an ARM32 core (which they bought in from ARM as a "block") ...

But it wouldn't be entirely surprising if a "memw" to specific memory regions caused a trap to OS code that carefully manages access by multiple CPUs/etc. :-(
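For context, memw is documented as the Xtensa memory-ordering (barrier) instruction, and xtensa gcc inserts it before volatile memory references because -mserialize-volatile is the default for the target. A hedged build-flag experiment (the flag is in gcc's documented Xtensa options; the effect on toggle rate is an assumption, not measured here):

// Rebuild the toggle loop with MEMW suppression around volatile accesses:
//   xtensa-esp32s3-elf-gcc -O2 -mno-serialize-volatile blink.c
// The W1TS/W1TC pair should then compile to back-to-back s32i.n stores
// with no memw between them (memory-ordering guarantees are weakened accordingly).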
You can always allow full speed and leave the performance/watt selection to the user, just like any STM32.
Quote: The compiler might even be able to do some optimizations at compile time today, I'm sure.
But for MCUs, newlib-nano 1, 2 and derivatives such as picolibc will copy word by word (long, in fact) with some loop unrolling, if the source and destination addresses are both word aligned. Unless...
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
// byte by byte copy
It's worth investigating exactly which version of memcpy your particular build ends up giving you...
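A simplified sketch of that word-wise strategy (illustrative only, condensed from the description above, not the actual newlib-nano source):

#include <stddef.h>
#include <stdint.h>

void *memcpy_sketch(void *dst, const void *src, size_t n) {
    uint8_t *d = dst;
    const uint8_t *s = src;
    // Word-wise path, taken only if both pointers are long-aligned
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(long) - 1)) == 0) {
        long *dw = (long *)d;
        const long *sw = (const long *)s;
        while (n >= 4 * sizeof(long)) { // modest unrolling: 4 longs per pass
            *dw++ = *sw++; *dw++ = *sw++;
            *dw++ = *sw++; *dw++ = *sw++;
            n -= 4 * sizeof(long);
        }
        d = (uint8_t *)dw;
        s = (const uint8_t *)sw;
    }
    while (n--) // unaligned case and remaining tail: byte by byte
        *d++ = *s++;
    return dst;
}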
Yes, I already said 16MHz effective rate! Not terrible, but I think the 80MHz bus is a bit silly when you have 2x 240MHz cores.

The 512K is a banner spec. If you dive into the datasheet, only 416K is contiguous; 32K is ICache and up to 64K is DCache (416K + 32K + 64K = 512K).
The mouth-watering 8MB PSRAM sounds great, but in real life it is rather limited.
memcpy tests copying 32K uint8_t, interleaving two buffers to avoid caching:

TX SZ: 67108864 Bytes (64MB)
SRAM->SRAM: 176ms (363MB/s)
PSRAM->SRAM: 2428ms (26MB/s)
PSRAM->PSRAM: 7061ms (9MB/s)
No idea why SRAM is achieving 363MB/s?
I expected memcpy to copy one byte at a time, so at best 240MB/s for a 240MHz CPU.
Changing the buffers to uint32_t gave the same results.
Also, of the 512KB, only 295KB were available to the user, with BT, Wi-Fi... everything disabled. That was a bit of a disappointment.
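For reference, a hedged sketch of how such an interleaved test can be set up under Arduino/ESP-IDF (heap_caps_malloc and the MALLOC_CAP_* flags are the real ESP-IDF API; the harness itself is illustrative, not the code used above):

#include <Arduino.h>
#include <string.h>
#include "esp_heap_caps.h"

static const size_t BUF = 32 * 1024;            // 32K per buffer
static const size_t TOTAL = 64UL * 1024 * 1024; // 64MB moved in total

// Alternate between two src/dst pairs so the data cache can't keep
// serving the same 32K working set.
static void bench(const char *name, uint8_t *d0, uint8_t *s0,
                  uint8_t *d1, uint8_t *s1) {
  uint32_t t0 = millis();
  for (size_t done = 0; done < TOTAL; done += 2 * BUF) {
    memcpy(d0, s0, BUF); // pair 0
    memcpy(d1, s1, BUF); // pair 1 evicts pair 0's cache lines
  }
  uint32_t dt = millis() - t0;
  Serial.printf("%s: %lums (%luMB/s)\n", name, (unsigned long)dt,
                (unsigned long)((TOTAL >> 20) * 1000 / dt));
}

void setup() {
  Serial.begin(115200);
  uint8_t *sram[4], *psram[4];
  for (int i = 0; i < 4; i++) { // internal SRAM vs. external PSRAM buffers
    sram[i]  = (uint8_t *)heap_caps_malloc(BUF, MALLOC_CAP_INTERNAL);
    psram[i] = (uint8_t *)heap_caps_malloc(BUF, MALLOC_CAP_SPIRAM);
  }
  bench("SRAM->SRAM",   sram[0],  sram[1],  sram[2],  sram[3]);
  bench("PSRAM->SRAM",  sram[0],  psram[1], sram[2],  psram[3]);
  bench("PSRAM->PSRAM", psram[0], psram[1], psram[2], psram[3]);
}

void loop() {}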
Quote: This compiler uses -Os (optimize for size) by default.

Which memcpy you have access to is a link-level thing, not a compiler thing.
Quote: With -Ofast, gcc will inline fixed-length memcpy with size up to 64 bytes.

Hmm. That raises the interesting possibility that a user-written loop of smaller memcpy() calls would be faster than a single large memcpy(), depending on the library...
The compiler might even be able to do some optimizations at compile time today, I'm sure.

With -Ofast, gcc will inline fixed-length memcpy with size up to 64 bytes, clang up to 16 bytes (using target cortex-m7).
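This is easy to check in a disassembly or on Compiler Explorer; a minimal sketch (the 64/16-byte cutoffs are the posters' observations for their targets, not verified here):

#include <string.h>

// Constant small size: gcc at -Ofast/-O2 typically expands this inline
// as a handful of word loads/stores instead of calling the library.
void copy64(void *dst, const void *src) {
  memcpy(dst, src, 64);
}

// Runtime size: a real library call is emitted, so which memcpy the
// linker resolves (newlib, newlib-nano, picolibc) matters again.
void copyn(void *dst, const void *src, size_t n) {
  memcpy(dst, src, n);
}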
Quote: I just thought I'd point out that none of the ESP processors use an ARM core. Most use an Xtensa core from Tensilica (same principles apply; Tensilica just isn't as successful as ARM). Some of the newer ESP chips use a RISC-V core.

I thought the Xtensa cores were designed to be space-efficient and easy to accommodate in FPGAs and ASICs, not necessarily high-performance. Properties like "1 cycle = 1 instruction" pipelines and wide flash read-ahead tend to conflict with a small footprint.