Author Topic: uC caching program parts to RAM (Read 3685 times)

Harvs · « **on:** April 20, 2014, 07:32:57 am »

I've never used this technique but having heard it mentioned a few times I'm just wondering how often if at all others are using it.

The concept being, with uC speeds outstripping flash performance it's possible to load time critical function/s into SRAM and execute from there. With some uCs also including instruction caches and even prefetch units I'm wondering what sort of real-world gains people have seen by doing this "software caching"?

abyrvalg · « **Reply #1 on:** April 20, 2014, 09:06:37 am »

I've seen this many times in older NOR-based ARM7/9 mobile phone firmware: frequently used functions like memcpy and memset are moved to internal RAM or TCM (tightly coupled memory) at startup. But I have no data on the performance improvement. All such cases were static (pin a single piece of code there, no swapping at runtime); I've never seen a dynamic implementation (moving pieces of code costs some CPU time too, so a careful evaluation of what is cheaper is needed). Also, these tricks don't appear in newer Cortex-Ax based models; it looks like they just rely on bigger caches now.

westfw · « **Reply #2 on:** April 20, 2014, 09:41:05 am »

It gets complicated. A lot of uCs can't run code from their RAM (e.g. AVR), because they have strict "Harvard Architectures" where the program memory bus is completely separate from the data memory bus, and perhaps even a different width. (The PIC baseline architecture has 12-bit program memory, for example.)

Some of the MIPS (PIC32) and ARM microcontrollers can execute from RAM, but they have a "modified" Harvard Architecture that still has a separate bus to flash for program fetches. That means that while the flash is slower than RAM, using RAM results in more bus contention if you're trying to use the RAM bus for both instructions AND data. So it's not easy to decide what can best benefit from being moved to RAM. (I'm reminded of my beloved PDP10, on which the 16 "registers" were also addressable as memory, and you could make code faster by copying it into the registers and running it there. But then you'd have fewer registers available, and had to absorb the overhead of moving the code into the registers...)

There was at least one DSP-like processor I read about some years ago that had large amounts of on-chip RAM, and for NORMAL execution it would load the RAM from external non-volatile memory of some kind. (Flash has gotten faster since then, though.)

At some point, vendors give up and put a well-defined instruction cache on the chip, instead of those quirky "flash accelerators", and then it doesn't matter so much.

Jeroen3 · « **Reply #3 on:** April 20, 2014, 10:31:54 am »

This becomes an issue when you exceed 50 MHz. For example, the STM32F4 series runs at around 170 MHz with a flash prefetcher, 128-bit wide flash and Thumb mode. This way sequential code runs fast. But as soon as you jump, the prefetch is lost and you lose time.
But executing from RAM is not faster, for some reason.

With the NXP LPC4300 flashless series you can run at about 50 to 70 MHz from external quad SPI flash, but the full 200 MHz can only be used from RAM. Pay attention to which RAM banks you use, since the ARM I, D and S buses do not connect to all banks.

Harvs · « **Reply #4 on:** April 20, 2014, 10:40:15 am »

Interesting responses. The Cortex M3/4 series is what I was really thinking about, and statically loading functions at start up. Sounds like there's no point.

If the STM32F4 is no quicker from RAM, I wonder what processors they've been talking about loading to SRAM for speed.

Jeroen3 · « **Reply #5 on:** April 20, 2014, 02:07:40 pm »

Statically relocating the interrupt vector table and some interrupt handlers might have some benefits.
But take a look at how the Cortex M3 and M4 access the bus. They use a multi-bus architecture where the chip manufacturer also does some multilayer bus-fu.
They interface with an I, D and System bus.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf

There are also other cases where dynamically loaded code is useful.
For example, in a device which has multiple boot-selected operating modes and not enough executable flash to store it all.
Or in a bootloader, since you can't read and write flash at the same time.
Just make sure you CRC the code blocks; you don't want runaways due to failed code copying.
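
The CRC check on a copied block can be sketched in plain C as below. This is only an illustration under assumptions: the names `crc32_calc` and `load_code_block` are made up for this example, the CRC-32 variant (reflected polynomial 0xEDB88320) is just one common choice, and a real part would more likely use a table-driven or hardware CRC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, the zlib/Ethernet variant).
 * Small but slow; chosen here only to keep the sketch self-contained. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right one bit; XOR in the polynomial if the bit shifted out was 1. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical helper: copy a code block into RAM, then verify the RAM copy
 * against the CRC stored with the image. Returns 1 if the copy is intact,
 * 0 if the caller must not jump into it. */
static int load_code_block(uint8_t *ram_dst, const uint8_t *src,
                           size_t len, uint32_t expected_crc)
{
    memcpy(ram_dst, src, len);
    /* Check the destination, not the source: a bad copy is what we guard against. */
    return crc32_calc(ram_dst, len) == expected_crc;
}
```

The caller would only branch into `ram_dst` when `load_code_block` returns 1, and fall back to a safe state otherwise.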

andyturk · « **Reply #6 on:** April 20, 2014, 03:04:42 pm »

Quote from: Harvs on April 20, 2014, 10:40:15 am

Interesting responses. The Cortex M3/4 series is what I was really thinking about, and statically loading functions at start up. Sounds like there's no point.

Never done it myself, but one of the theoretical advantages is power savings. If your device is going to sleep most of the time and all your ISRs are in RAM, you can save juice by turning off the flash.

T3sl4co1l · « **Reply #7 on:** April 20, 2014, 04:11:03 pm »

One could also envision cryptographic applications, fail-dead programs for example. The chip is powered up, code bootstrapped into RAM (instead of programming it in Flash), and left running forever (on supplied power or battery backup) until tampered with. Once powered down, the program is gone. Except it's not, exactly, because SRAM and DRAM are both known to survive for certain periods of time, particularly at low voltages/temperatures. So there could be a memory wipe routine in the bootloader, so if it's powered up again, it clears itself besides.
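
A wipe routine of the kind described could look roughly like the sketch below. The function name `wipe_ram` is hypothetical, and which address range to clear (and whether a single zero pass is enough against remanence) depends on the part and the threat model.

```c
#include <stddef.h>
#include <stdint.h>

/* Overwrite a RAM region with zeros. The volatile pointer keeps the compiler
 * from discarding the stores as dead writes when the buffer is never read again. */
static void wipe_ram(void *start, size_t len)
{
    volatile uint8_t *p = (volatile uint8_t *)start;
    while (len--) {
        *p++ = 0;
    }
}
```

In a bootloader this would run early, before anything else gets a chance to read the old RAM contents.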

Tim

Hypernova · « **Reply #8 on:** April 20, 2014, 04:29:54 pm »

Pretty much par for the course for the TI 28xx series DSPs I use at work. You can specify that you want a function/const to run from RAM using a #pragma, then set up the linker to give you two sets of pointers at build time (where it is in flash and where it should be in RAM). During POR you use memcopy to move the code into RAM using the pointers. I get about a 3x boost in speed.
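
The copy step at power-on reset can be sketched as below. This is a generic illustration, not TI's library routine: `copy_to_ram` and the symbol names in the comment are hypothetical, and 16-bit words are used because the C28x addresses memory in 16-bit units. The section placement itself (the #pragma and the linker command file) is toolchain-specific and not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the words in [load_start, load_end) to run_start and return the number
 * of words copied. On target, the three addresses would come from
 * linker-generated symbols: where the section is stored in flash (load
 * address) and where it is meant to execute in RAM (run address). */
static size_t copy_to_ram(const uint16_t *load_start, const uint16_t *load_end,
                          uint16_t *run_start)
{
    size_t words = 0;
    while (load_start < load_end) {
        *run_start++ = *load_start++;
        words++;
    }
    return words;
}

/* On target, the startup call would look roughly like this
 * (symbol names are assumptions, defined by the linker setup):
 *
 *   extern const uint16_t ramfuncs_load_start[], ramfuncs_load_end[];
 *   extern uint16_t ramfuncs_run_start[];
 *   copy_to_ram(ramfuncs_load_start, ramfuncs_load_end, ramfuncs_run_start);
 */
```

The copy has to happen before the first call to any function placed in the RAM section, which is why it normally sits in the reset path.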

BravoV · « **Reply #9 on:** April 20, 2014, 05:04:35 pm »

Quoting from one of TI's ARM Cortex-M4F datasheets.

Btw, for this particular chip, the address range 0x0000.0000-0x0003.FFFF is the on-chip flash memory.

theatrus · « **Reply #10 on:** April 24, 2014, 03:54:48 pm »

As stated here, the D and I busses can run in parallel, so if you are forcing fetches from RAM for both instructions and data, your speedup might not materialize (though branch-heavy, register-heavy code will fly).

Depending on your compiler and linker setup, you can also use position independent code (basically everything is PC relative). There is usually a small speed penalty, but you are not forcing a specific layout, and your application can swap code on the fly.

This is very uncommon on microcontrollers.

Kjelt · « **Reply #11 on:** April 24, 2014, 08:47:42 pm »

I have seen it used, and recommended by the manufacturer, for an STM8 when writing to the flash, since writing makes the entire flash inaccessible for a few ms. So the code that writes the data into the flash is copied to a separate RAM area and executed from there until the flash is ready again, and then it returns.

