Author Topic: How to use external SRAM on microcontrollers? (Read 18720 times)

technix · « **on:** June 03, 2017, 04:14:57 pm »

Microcontrollers like the ATmega2560 (Arduino Mega 2560) and STM32F407 have an external memory bus. How do I use it in a way that makes sense?

Take one of my STM32F407ZGT6 kits for example: it has an external IS62WV51216BLL-55TLI chip on it, giving 1MB of 1/10 speed SRAM. It has 1MB of internal 1/8 speed Flash and 192kB of full-speed SRAM. How should I allocate code and data between those? How do I write the initialization code to utilize them?

Expanding on this idea, for advanced microcontrollers with a complex internal bus, how do I make the code perform best, especially when DMA is involved?

Kleinstein · « **Reply #1 on:** June 03, 2017, 05:08:03 pm »

If possible one should look for a µC with sufficient internal RAM. So external RAM to a small µC like the Mega2560 usually does not make much sense, if there are alternative ARMs with enough memory. External memory is usually slow and uses quite a lot of pins / connections. So if possible it is something to avoid.

Configuration of extra RAM depends on the chip. The best use also depends on both the chip and the problem. Usually chip-internal RAM is much faster than external. If a lot of RAM is needed, a kind of SoC chip or module (e.g. Raspberry Pi or similar) might be an alternative, even if this means having an extra µC for I/O functions not offered by the SoC.

Bruce Abbott · « **Reply #2 on:** June 03, 2017, 05:40:52 pm »

It makes sense if you need a lot of RAM and not much I/O. The ATmega2560 only has 8k of internal RAM, but 86 I/O pins. Only 54 of those pins are available on the Arduino Mega 2560, but that is still plenty to add external memory. External RAM is slower than internal RAM but faster than serial memory devices, and does not have to be treated any differently from normal RAM.

The ATmega2560 is limited to 64k of address space, but even more RAM can be added by banking. If you already have an Arduino it might make more sense to add a 'shield' than replace it with an MCU with more internal RAM.

512Kb SRAM expansion for the Arduino Mega
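The banking idea can be sketched as plain address arithmetic: a large linear offset is split into a bank number and an offset inside the window the CPU can see. The window size and all names below are assumptions for illustration, not details of any particular shield.

```c
#include <stdint.h>

/* Hypothetical window size: how much of the external RAM is visible
 * in the AVR's 64k data space at once. Real hardware decides this. */
#define BANK_WINDOW_BYTES 32768UL

typedef struct {
    uint8_t  bank;    /* which bank must be selected */
    uint16_t offset;  /* byte offset inside the visible window */
} banked_addr_t;

/* Split a linear offset into external RAM into (bank, offset). */
static banked_addr_t split_linear(uint32_t linear)
{
    banked_addr_t a;
    a.bank   = (uint8_t)(linear / BANK_WINDOW_BYTES);
    a.offset = (uint16_t)(linear % BANK_WINDOW_BYTES);
    return a;
}
```

On real hardware the bank number would then be written to whatever port pins or latch drive the upper address lines before the access is made through the window.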

hans · « **Reply #3 on:** June 03, 2017, 06:25:10 pm »

If you want to run code/data seamlessly from external memory devices, then make sure said devices can integrate into the memory map of your device, preferably unbanked.

Of course you can connect a (parallel) SRAM chip to any microcontroller, given that it has enough I/O; but you don't want to introduce read/write memory functions that slow down access, and above all the memory is then not seamlessly accessible through the memory map.
Well, you still can, but if you need the performance then you don't.

The FSMC on STM32F407 (and others) can integrate in the memory map (figure 18). It has a maximum speed of 60MHz, which is slower than internal SRAM, but still workable.

You can just initialize FSMC for your memory device and then access it through the physical address from figure 18. That's what I did in a project where I used 512kB SRAM as a bulk sample buffer. Using some pointers starting at 0x6000 0000, that works.
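A minimal sketch of the "pointers starting at 0x6000 0000" approach, written so it also runs on a host: the base address is passed in rather than hard-coded. The type and function names are made up for illustration, and the FSMC initialization itself is assumed to have been done elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* On target, after FSMC init, one might call:
 *     sample_buf_init(&buf, (volatile uint16_t *)0x60000000u, 262144u);
 * The capacity is an assumed figure for a 512kB, 16-bit-wide SRAM. */

typedef struct {
    volatile uint16_t *base;  /* start of the memory-mapped SRAM region */
    size_t capacity;          /* number of 16-bit samples that fit */
    size_t count;             /* samples stored so far */
} sample_buf_t;

static void sample_buf_init(sample_buf_t *b, volatile uint16_t *base,
                            size_t capacity)
{
    b->base = base;
    b->capacity = capacity;
    b->count = 0;
}

/* Append one sample; returns 0 on success, -1 if the buffer is full. */
static int sample_buf_push(sample_buf_t *b, uint16_t sample)
{
    if (b->count >= b->capacity)
        return -1;
    b->base[b->count++] = sample;
    return 0;
}
```

Keeping the whole region behind one small struct in one file is what keeps this kind of manual layout maintainable.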

If you want to place variables from your C code with preinitialized values, then things become a bit more complicated. I think you'll need to modify the linker script to tell the linker it can place data there, and also modify the startup code so it copies data from FLASH to RAM, or zeroes the values in that region. In addition, the startup code needs to initialize the FSMC before it starts doing this. This all happens before main() is called; some IDEs will even hide the startup code from the user if it's not necessary to show it.
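The copy-and-zero step can be sketched as one small routine. On target the five pointers would come from symbols defined in the linker script; here they are ordinary parameters (hypothetical names) so the logic can be exercised on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy initialized data from its load image (in flash) to its run
 * address, then zero the .bss-like area that follows. The external
 * memory controller must already be initialized when this runs. */
static void init_ram_region(uint32_t *data_start, uint32_t *data_end,
                            const uint32_t *load_image,
                            uint32_t *bss_start, uint32_t *bss_end)
{
    for (uint32_t *dst = data_start; dst < data_end; ++dst)
        *dst = *load_image++;
    for (uint32_t *dst = bss_start; dst < bss_end; ++dst)
        *dst = 0u;
}
```

A startup file would call something like this once per RAM region (internal and external), after the FSMC setup and before main().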

DMA to external memory is not a problem at all. If you look at figure 5 of the STM32F407 datasheet, you'll see that FSMC connects straight into the AHB matrix. DMA operations are just Memory <> Memory transfers, and can still achieve many MegaBytes/s transfer speeds without much problem.

If you step up to the STM32F46x series, they also have a quad SPI mode with memory integration. You configure the peripheral so it knows how to random-access and read the serial memory, and the hardware will handle the rest for you whenever you want to access it from your code, DMA, etc.

technix · « **Reply #4 on:** June 03, 2017, 07:34:14 pm »

Quote from: hans on June 03, 2017, 06:25:10 pm

If you step up to the STM32F46x series, they also have a quad SPI mode with memory integration. You configure the peripheral so it knows how to random-access and read the serial memory, and the hardware will handle the rest for you whenever you want to access it from your code, DMA, etc.

I do have a few STM32F756ZGT6 (I just skipped the F46x series) and that thing has both QSPI and SDRAM support. I can attach 32MB of NOR Flash and 64MB of SDRAM to it (W25Q256 + MT48LC32M16A2), and this overshadows the built-in memory very quickly. Is it a good idea to put code in QSPI and XIP directly off it, and put data in DRAM? This way maybe I can use the chip with the least Flash and SRAM, and the internal resources are only used to run a small bootloader.

technix · « **Reply #5 on:** June 03, 2017, 07:38:30 pm »

Quote from: Kleinstein on June 03, 2017, 05:08:03 pm

If possible one should look for a µC with sufficient internal RAM. So external RAM to a small µC like the Mega2560 usually does not make much sense, if there are alternative ARMs with enough memory. External memory is usually slow and uses quite a lot of pins / connections. So if possible it is something to avoid.

Configuration of extra RAM depends on the chip. The best use also depends on both the chip and the problem. Usually chip-internal RAM is much faster than external. If a lot of RAM is needed, a kind of SoC chip or module (e.g. Raspberry Pi or similar) might be an alternative, even if this means having an extra µC for I/O functions not offered by the SoC.

I am talking about ARMs with external RAM added. Things like an IPv4/IPv6 dual stack need a lot of RAM once the application grows beyond a handful of sockets (four or five used by mDNS, one for NDP, three per connection for HTTP/TLS; RAM goes away real fast).

dgtl · « **Reply #6 on:** June 03, 2017, 09:07:23 pm »

As usual, there are nice solutions and then there are hacks. On ARMs the external SRAM is memory-mapped, so it is accessible somewhere in the memory address range. See the uC documentation for where that region is (search for memory map).
The proper solution depends on the compiler toolchain used. For gcc, the memory layout is controlled by the linker script (.ld file). Usually you add another memory region for the external RAM, then use the "section" attribute on global or static variables to redirect them to that SRAM linker section. For dynamically allocated RAM, malloc usually takes RAM from one location, which can be either external SRAM or internal RAM; setting this up again depends on the tools used (the gcc libc for ARM microcontrollers provides malloc, but that itself depends on a user-provided sbrk() function to hand out large memory ranges when needed, and different toolchain vendors may provide their own versions).
The hackish way is to just declare a pointer somewhere in the SRAM region and start using it. In that case the compiler does not know about the memory layout, so you have to allocate the memory and calculate the address of each variable on paper yourself. Take the address, cast it to a pointer of your data type and start using that pointer. Take care not to violate pointer alignment requirements, otherwise very strange bugs start happening. When your external memory holds just one or a couple of variables (structures etc.), this way is not too bad. If you are allocating tens or hundreds of variables manually, you're doing it wrong.
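For the manual route, the on-paper address calculation and the alignment care can be delegated to a tiny bump allocator over the region. This is only a sketch under assumptions: the names are invented, nothing is ever freed, and the region bounds would be the external SRAM's address range on target.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t next;  /* first free address */
    uintptr_t end;   /* one past the last usable address */
} region_t;

/* Carve `size` bytes out of the region, aligned to `align`
 * (which must be a power of two). Returns NULL when it does not fit. */
static void *region_alloc(region_t *r, size_t size, size_t align)
{
    uintptr_t mask  = (uintptr_t)align - 1u;
    uintptr_t start = (r->next + mask) & ~mask;

    /* Reject wrap-around and anything that would run past the end. */
    if (start < r->next || start > r->end || size > r->end - start)
        return NULL;

    r->next = start + size;
    return (void *)start;
}
```

With the linker-script route none of this is needed: a variable is tagged with GCC's section attribute, using whatever section name the linker script defines, and the linker does the placement.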

hans · « **Reply #7 on:** June 04, 2017, 08:46:56 pm »

Quote from: technix on June 03, 2017, 07:34:14 pm

Quote from: hans on June 03, 2017, 06:25:10 pm
If you step up to the STM32F46x series, they also have a quad SPI mode with memory integration. You configure the peripheral so it knows how to random-access and read the serial memory, and the hardware will handle the rest for you whenever you want to access it from your code, DMA, etc.
I do have a few STM32F756ZGT6 (I just skipped the F46x series) and that thing has both QSPI and SDRAM support. I can attach 32MB of NOR Flash and 64MB of SDRAM to it (W25Q256 + MT48LC32M16A2), and this overshadows the built-in memory very quickly. Is it a good idea to put code in QSPI and XIP directly off it, and put data in DRAM? This way maybe I can use the chip with the least Flash and SRAM, and the internal resources are only used to run a small bootloader.

If you don't care about your code being stolen, then yes you could run code from QSPI or DRAM. AFAIK there is no code protection for QSPI devices.

Also, code will run slower from QSPI. Random access times can be quite slow compared to internal FLASH. Fortunately the m7 has larger caches so it's probably a far better pick than the m4 for this use case.
But cache misses can still occur on large jumps, like interrupts, and will slow things down.
Of course, you could place interrupts in internal FLASH if you're really inclined to put some extra effort in.

Quote from: dgtl on June 03, 2017, 09:07:23 pm

As usual, there are nice solutions and then there are hacks. On ARMs the external SRAM is memory-mapped, so it is accessible somewhere in the memory address range. See the uC documentation for where that region is (search for memory map).
The proper solution depends on the compiler toolchain used. For gcc, the memory layout is controlled by the linker script (.ld file). Usually you add another memory region for the external RAM, then use the "section" attribute on global or static variables to redirect them to that SRAM linker section. For dynamically allocated RAM, malloc usually takes RAM from one location, which can be either external SRAM or internal RAM; setting this up again depends on the tools used (the gcc libc for ARM microcontrollers provides malloc, but that itself depends on a user-provided sbrk() function to hand out large memory ranges when needed, and different toolchain vendors may provide their own versions).
The hackish way is to just declare a pointer somewhere in the SRAM region and start using it. In that case the compiler does not know about the memory layout, so you have to allocate the memory and calculate the address of each variable on paper yourself. Take the address, cast it to a pointer of your data type and start using that pointer. Take care not to violate pointer alignment requirements, otherwise very strange bugs start happening. When your external memory holds just one or a couple of variables (structures etc.), this way is not too bad. If you are allocating tens or hundreds of variables manually, you're doing it wrong.

Agreed.
Doing manual memory allocation for your program only works on a small scale, like the case I pointed out where one set of entities occupies the whole external SRAM. Fixed layout, a single file manages it, so it is still easy to maintain.
You don't want to do manual memory allocation on a much larger scale than that. I think if I wanted to share external SRAM between more than 2-3 different entities, I would probably look at fixing the linker script instead.

westfw · « **Reply #8 on:** June 04, 2017, 11:31:10 pm »

Quote

Take one of my STM32F407ZGT6 kits for example: it has an external IS62WV51216BLL-55TLI chip on it, giving 1MB of 1/10 speed SRAM. It has 1MB of internal 1/8 speed Flash and 192kB of full-speed SRAM. How should I allocate code and data between those?

Ah, but the 1/8 speed Flash has an "accelerator" attached...

In general, if I had 10x more "slow external RAM" than "faster internal RAM", I'd plan to configure the compiler/linker to use the external memory for all the general purpose stuff, and hope that was fast enough. (1/10th speed SRAM on STM32F4xx is still 60ns memory, right? That's pretty fast.) Then, if needed, I'd start moving critical data structures into the on-chip fast RAM, AFTER careful analysis ("premature optimization is the root of much pain.") OTOH, if I had an application that needed a lot of external RAM, I'd be looking for a chip with on-chip cache for that memory...
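The "1/10 speed" figure can be sanity-checked with a little arithmetic, assuming the 168MHz core clock mentioned elsewhere in this thread and the 55ns access time implied by the "-55" in the SRAM part number. The helper below is a hypothetical illustration that ignores address setup, bus turnaround and other FSMC timing details.

```c
#include <stdint.h>

/* Whole CPU clock cycles needed to cover an access time, rounded up:
 * cycles = ceil(access_ns * clock_mhz / 1000). */
static uint32_t cycles_for_access(uint32_t access_ns, uint32_t clock_mhz)
{
    return (access_ns * clock_mhz + 999u) / 1000u;
}
```

For 55ns at 168MHz this gives 10 cycles (one cycle is about 5.95ns), which lines up with the 1/10 speed estimate; a 60ns access would need 11.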

Does the on-chip DMA access the external memory? That could change things. OTOH, external DMA controllers might be accessing the external RAM, when they can't access the internal RAM.

Quote

How to write the initialization code to utilize them?

Well, you have to initialize the Flexible Static Memory Controller before you can use the external SRAM. If you're going to use that for most of your data, it means modifying the C startup code to do the initialization before it copies .data from flash or bzero's .bss...

alank2 · « **Reply #9 on:** June 05, 2017, 01:38:05 am »

Some AVR's like the xmega384c3 have up to 32K sram. I'm working on a mega1284p right now that has 16K. I could imagine using an external sram and writing a driver to load/store information from it to local sram, but I wonder how much compiler support there is for using external sram like it was internal...

technix · « **Reply #10 on:** June 05, 2017, 04:18:27 am »

Quote from: westfw on June 04, 2017, 11:31:10 pm

Quote
Take one of my STM32F407ZGT6 kits for example: it has an external IS62WV51216BLL-55TLI chip on it, giving 1MB of 1/10 speed SRAM. It has 1MB of internal 1/8 speed Flash and 192kB of full-speed SRAM. How should I allocate code and data between those?
Ah, but the 1/8 speed Flash has an "accelerator" attached...

The accelerator doesn't help when interrupts are involved anyway...

Quote from: westfw on June 04, 2017, 11:31:10 pm

In general, if I had 10x more "slow external RAM" than "faster internal RAM", I'd plan to configure the compiler/linker to use the external memory for all the general purpose stuff, and hope that was fast enough. (1/10th speed SRAM on STM32F4xx is still 60ns memory, right? That's pretty fast.) Then, if needed, I'd start moving critical data structures into the on-chip fast RAM, AFTER careful analysis ("premature optimization is the root of much pain.") OTOH, if I had an application that needed a lot of external RAM, I'd be looking for a chip with on-chip cache for that memory...

If the external volatile memory is significantly larger (e.g a 64MB DRAM on STM32F756ZGT6) I would store the application code compressed in Flash and include a small bootloader that decompresses the application image into the DRAM before execution. Also I will keep the ELF format intact so the decompression routine can place the sections in correct locations before application code starts.

Quote from: westfw on June 04, 2017, 11:31:10 pm

Does the on-chip DMA access the external memory? That could change things. OTOH, external DMA controllers might be accessing the external RAM, when they can't access the internal RAM.

Yes STM32F4 DMA can access the external memory.

Quote from: westfw on June 04, 2017, 11:31:10 pm

Quote
How to write the initialization code to utilize them?
Well, you have to initialize the Flexible Static Memory Controller before you can use the external SRAM. If you're going to use that for most of your data, it means modifying the C startup code to do the initialization before it copies .data from flash or bzero's .bss...

Bruce Abbott · « **Reply #11 on:** June 05, 2017, 06:37:23 am »

Quote from: alank2 on June 05, 2017, 01:38:05 am

Some AVR's like the xmega384c3 have up to 32K sram. I'm working on a mega1284p right now that has 16K. I could imagine using an external sram and writing a driver to load/store information from it to local sram, but I wonder how much compiler support there is for using external sram like it was internal...

The ATmega1284p does not have an external bus interface so you can't use external RAM 'like it was internal', you have to treat it like a peripheral (unless you do something silly like creating a virtual machine).

How much of a problem this is depends on what you need the extra RAM for. Obviously you can't run compiled code or allocate C variables in it, but it could hold code and variables for an interpreted language. If it's just to store data then you would use it like any other external memory. Advantages might include shorter access time (compared to SPI or I2C), faster writing and not having to worry about wearout (compared to a serial EEPROM or SD Card).

westfw · « **Reply #12 on:** June 05, 2017, 09:32:24 am »

Quote

Quote
but the 1/8 speed Flash has an "accelerator" attached...
The accelerator doesn't help when interrupts are involved anyway...

Really? Admittedly, the datasheets I have don't seem to describe the "ART Accelerator" very much (is it separate from the caches controlled by the FLASH_ACR register? I can't tell.) But it is documented that the flash interface is 128 bits wide, so each access fetches up to 8 instructions, which I would think would push internal flash well beyond external RAM in performance. It does claim "performance equiv to zero wait-states" at up to 168MHz...

Quote

If the external volatile memory is significantly larger (e.g a 64MB DRAM on STM32F756ZGT6) I would store the application code compressed in Flash and include a small bootloader that decompresses the application image into the DRAM before execution.

It depends on whether your code is big, or your data is big. You won't get 32:1 compression (1MB flash on that chip, right?) of code, anyway.
(heh. Ask about decompressing ROM to RAM with a rom bootstrap, running the RAM image to download a smaller RAM "secondary bootstrap" image from the network, and using THAT to download the bigger (and final) RAM image that you actually wanted to run... Ahh, back when memory was a lot more scarce than today!)
Also, the F7 has "conventional" cache in addition to the "accelerator", so I'd expect it to do better with external memory than the F4s.
(I haven't read the datasheet carefully enough to be sure, but I visualize "conventional cache" as being coupled with the CPU (helps regardless of memory system), while the "accelerator" is coupled (and tightly) to the flash memory controller.)

technix · « **Reply #13 on:** June 05, 2017, 01:14:25 pm »

Quote from: westfw on June 05, 2017, 09:32:24 am

Quote
Quote
but the 1/8 speed Flash has an "accelerator" attached...
The accelerator doesn't help when interrupts are involved anyway...
Really? Admittedly, the datasheets I have don't seem to describe the "ART Accelerator" very much (is it separate from the caches controlled by the FLASH_ACR register? I can't tell.) But it is documented that the flash interface is 128 bits wide, so each access fetches up to 8 instructions, which I would think would push internal flash well beyond external RAM in performance. It does claim "performance equiv to zero wait-states" at up to 168MHz...

It is likely a crude pipeline cache. When the interrupt fires, it pulls a read from the beginning of the Flash - a cache flush. Then the ISR is loaded - a second flush. Finally, after the ISR returns, a third cache flush is required. Three flushes per interrupt. Ouch.

Quote from: westfw on June 05, 2017, 09:32:24 am

Quote
If the external volatile memory is significantly larger (e.g a 64MB DRAM on STM32F756ZGT6) I would store the application code compressed in Flash and include a small bootloader that decompresses the application image into the DRAM before execution.
It depends on whether your code is big, or your data is big. You won't get 32:1 compression (1MB flash on that chip, right?) of code, anyway.
(heh. Ask about decompressing ROM to RAM with a rom bootstrap, running the RAM image to download a smaller RAM "secondary bootstrap" image from the network, and using THAT to download the bigger (and final) RAM image that you actually wanted to run... Ahh, back when memory was a lot more scarce than today!)
Also, the F7 has "conventional" cache in addition to the "accelerator", so I'd expect it to do better with external memory than the F4s.
(I haven't read the datasheet carefully enough to be sure, but I visualize "conventional cache" as being coupled with the CPU (helps regardless of memory system), while the "accelerator" is coupled (and tightly) to the flash memory controller.)

For the '756 I can put the compressed code image in 16MB or 32MB of QSPI (W25Q128 or W25Q256.) W25Q128 is actually very cheap here. And a 1/2 or 1/4 compression ratio is doable using lzop or gzip.

mubes · « **Reply #14 on:** June 10, 2017, 09:22:51 am »

Quote from: technix on June 05, 2017, 01:14:25 pm

It is likely a crude pipeline cache. When the interrupt fires, it pulls a read from the beginning of the Flash - a cache flush. Then the ISR is loaded - a second flush. Finally, after the ISR returns, a third cache flush is required. Three flushes per interrupt. Ouch.

....which makes it very worthwhile moving your ISRs and vector table into the fast ram. A few magic incantations in your linker file and startup script (assuming they're not already there....a surprising number are) and you've got fast interrupt response with a heck of a lot less impact on the rest of the system...only cost is the ram and a tiny bit of startup delay. If you're uberconcerned about such things I guess you could even use the mpu to stop them getting trodden on accidentally.
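The vector-table half of that can be sketched as a plain copy. The names here are illustrative only, and the target-specific parts are left out so the code runs on a host: on a Cortex-M the RAM table has to meet the alignment the core requires, and its address is then written to the vector table offset register (SCB->VTOR in CMSIS).

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*isr_t)(void);

/* Placeholder handler, present only so the sketch has something to copy. */
static void default_handler(void)
{
}

/* Copy `count` vector entries from the flash table into a RAM table.
 * Pointing the core at the RAM table is a separate, target-specific step. */
static void copy_vectors(isr_t *ram_table, const isr_t *flash_table,
                         size_t count)
{
    for (size_t i = 0; i < count; ++i)
        ram_table[i] = flash_table[i];
}
```

Moving the ISR bodies themselves is a linker-script matter: they go into a section that is loaded from flash and copied to RAM at startup, the same way initialized data is.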

Dave

