STM32F4xx and core coupled memory: comments?

I am playing with the STM32F407-VE right now. I fixed an error in the ST linker script: it declared 192K of RAM in one block, whereas the manual states 128K in the main block and 64K of core coupled memory (CCM) at a completely different address. So now I have two memory areas.

The CCM memory has a separate bus and won't contend with DMA (CCM is only usable by the CPU). So here are my thoughts.
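Since DMA cannot reach CCM, one cheap safeguard is to check buffer addresses before handing them to a DMA stream. The following is only a sketch: the base addresses and sizes are the ones given in the STM32F407 memory map (CCM at 0x10000000, main SRAM at 0x20000000), while the helper names `in_ccm` and `dma_buffer_ok` are made up for illustration and are not part of any ST library. It only considers RAM buffers, not flash, peripherals or FSMC.

```c
#include <stdbool.h>
#include <stdint.h>

/* STM32F407 on-chip RAM regions as listed in the reference manual. */
#define CCM_BASE   0x10000000u
#define CCM_SIZE   (64u * 1024u)    /* CPU only, not reachable by DMA */
#define SRAM_BASE  0x20000000u
#define SRAM_SIZE  (128u * 1024u)   /* SRAM1 (112K) + SRAM2 (16K)     */

/* True if [addr, addr + len) lies entirely inside [base, base + size). */
static bool range_in(uint32_t addr, uint32_t len, uint32_t base, uint32_t size)
{
    return len > 0u && len <= size &&
           addr >= base && (addr - base) <= (size - len);
}

/* Hypothetical helper: is the whole buffer in core coupled memory? */
bool in_ccm(uint32_t addr, uint32_t len)
{
    return range_in(addr, len, CCM_BASE, CCM_SIZE);
}

/* Hypothetical helper: is the whole buffer in DMA-reachable main SRAM? */
bool dma_buffer_ok(uint32_t addr, uint32_t len)
{
    return range_in(addr, len, SRAM_BASE, SRAM_SIZE);
}
```

Calling `dma_buffer_ok((uint32_t)buf, sizeof buf)` in a debug build before starting a transfer would catch a buffer that accidentally ended up in CCM.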

64K is more than enough for my stack and data, so why not put those in the CCM area? I can still use the 128K main RAM for large data blocks and, of course, DMA.

I don't expect to be using heavy DMA soon, but it seems like a good plan to start. DMA won't affect interrupt latency. And if I goof up on DMA pointers, it won't harm my program memory.

Thanks for any comments!

There is nothing wrong with this approach. However, don't expect a significant performance boost. With zero wait state memory, even if the access has to go through the bus, it is still pretty fast.

In any scenario where moving things to CCM makes or breaks a project, you should probably be looking at a higher performing MCU. There are good exceptions to this, of course.

I think there is nothing wrong with assigning CCM to the processor stack, heap and variables. You don't have to tell the C compiler all about the capabilities of the chip if you don't want the C program to use any of the 128K SRAM. Just design your program with the 64KB of CCM in mind so that the DMA controller has as much uncontested access to SRAM as possible.

However, I do somewhat agree with ataradov in that if you need to do this, you're probably pushing past the performance envelope of this chip. Behaviour that has to run in real time (which is often why it's fed by DMA in the first place) will favor predictable memory access timing and access arbitration. If the DMA controller is pushing so much data around that individual streams can stall one another, you may get some timing jitter, or a kind of 'intermodulation' product of the two stream transfer periods.

That is perhaps not important for peripherals that have built-in FIFOs and rely on burst performance, but in my experience ST parts have very few peripheral FIFOs because the chip has got so many DMA streams (real PITA).

Also, the main 128K SRAM is itself split into 2 modules of 112K and 16K. In theory that allows multiple streams (e.g. ethernet and USB HS alongside GP DMA streams) to access the two SRAM regions at the same time.
I'm actually surprised that ST's documentation doesn't clearly state the memory mapped regions of all user memory (e.g. SRAM, FLASH, FSMC). Tables 3 and 4 in the user manual do describe some memory regions for boot options, but they do not state where CCM is within the memory address space.

Afaik the F407 can work with 192kB of internal SRAM off C as a single block. At least I had no issues with it in the past. I did have to mess with the linker script when I executed the code off the internal SRAM or off the external SRAM (or to have the heap in the external SRAM).

Putting stack in core-coupled RAM is what I tend to do by default when this RAM is available. It's the simplest thing you can just do and expect some performance improvement and a bit more predictability to DMA access timing. While at it, put globals and statics that are written and read by CPU only there as well.
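As a rough sketch of the "CPU-only globals and statics go to CCM" idea with GCC: individual variables can be steered into a dedicated output section with a section attribute. The section name `.ccmram` is an assumption here; it has to match whatever the linker script maps onto the CCM region. The filter itself (`filter_init`, `filter_update`) is a hypothetical example, not code from this thread. Depending on the startup code, such a section may not be zeroed or initialised automatically, so the sketch initialises its state explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the linker script defines an output section ".ccmram" placed
 * in the 64K CCM region. On non-ARM builds the macro expands to nothing so
 * the same file can be compiled and tested on a PC. */
#if defined(__arm__) && defined(__GNUC__)
#define CCM_DATA __attribute__((section(".ccmram")))
#else
#define CCM_DATA
#endif

#define HISTORY_LEN 8u

/* CPU-only state: safe to keep in CCM because DMA never touches it. */
static uint16_t history[HISTORY_LEN] CCM_DATA;
static uint32_t running_sum CCM_DATA;
static size_t   head CCM_DATA;

/* Buffers written by DMA must stay in main SRAM (default .bss/.data). */
uint16_t adc_dma_buf[256];

/* Explicit init, in case the startup code does not clear the CCM section. */
void filter_init(void)
{
    for (size_t i = 0; i < HISTORY_LEN; i++) {
        history[i] = 0;
    }
    running_sum = 0;
    head = 0;
}

/* Moving average over the last HISTORY_LEN samples. */
uint16_t filter_update(uint16_t sample)
{
    running_sum -= history[head];
    history[head] = sample;
    running_sum += sample;
    head = (head + 1u) % HISTORY_LEN;
    return (uint16_t)(running_sum / HISTORY_LEN);
}
```

Moving the stack is a linker script matter rather than a C one: typically the initial stack pointer symbol is set to the end of the CCM region.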

Don't expect a massive performance boost, though. ARM has a reasonable number of registers, and compilers optimize well enough that routines usually do not need to use that much stack.

Clever use of buses with DMA is more significant. For example, I had a project where I ran out of performance with two ADCs each doing 16-bit DMA accesses to different parts of memory, and two different interrupt handlers each reading out the 8 previous values from its own ADC and averaging them. As soon as I changed the ADCs to run in synchronized Dual Mode, generating 32-bit DMA accesses where the two half-words come from the different ADCs, and had the two ISRs share the work, doing just four (not eight anymore) 32-bit loads and averaging with SIMD instructions, the performance was more than enough, with a good safety margin. This required some inline assembly and dedicating one general-purpose register to intermediate math, preventing the compiler from using it (gcc: -ffixed-r11), but was definitely worth it.
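To illustrate the packed-word trick in plain C (without the inline assembly or SIMD instructions mentioned above): when each 32-bit DMA word carries one sample per ADC, both channels can be summed with a single 32-bit addition per word, as long as the per-lane sums cannot overflow 16 bits. This is a sketch under assumptions: ADC1 in the low half-word and ADC2 in the high half-word, 12-bit right-aligned samples, and at most 16 words per call (16 x 4095 still fits in 16 bits). The names `adc_pair_t` and `average_packed` are hypothetical, and how the original project split the work between its two ISRs is not known.

```c
#include <stddef.h>
#include <stdint.h>

/* One averaged result per ADC. */
typedef struct {
    uint16_t adc1;  /* from the low half-words  */
    uint16_t adc2;  /* from the high half-words */
} adc_pair_t;

/* Maximum word count for which two 12-bit lanes cannot carry into each
 * other: 16 * 4095 = 65520 < 65536. */
#define MAX_PACKED_WORDS 16u

/* Average n packed dual-ADC words. Each word holds an ADC1 sample in
 * bits 15:0 and an ADC2 sample in bits 31:16 (assumed layout).
 * Returns {0, 0} if n is 0 or larger than MAX_PACKED_WORDS. */
adc_pair_t average_packed(const uint32_t *buf, size_t n)
{
    adc_pair_t result = { 0, 0 };
    if (buf == NULL || n == 0u || n > MAX_PACKED_WORDS) {
        return result;
    }

    /* One 32-bit add per word accumulates both lanes at once. */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
    }

    /* Split the lanes only once, at the end. */
    result.adc1 = (uint16_t)((sum & 0xFFFFu) / n);
    result.adc2 = (uint16_t)((sum >> 16) / n);
    return result;
}
```

With four words this needs four 32-bit loads for both channels together, compared with eight 16-bit loads when each ADC has its own buffer.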

