Author Topic: I think I understand the STM32F7/H7 value line MCU. (Read 5051 times)

technix · « **on:** November 05, 2018, 12:32:16 pm »

Those chips are the STM32F730/F750/H750 parts. Although they are fast Cortex-M7 parts, they are severely short on Flash: 64kB for F730/F750 and 128kB for H750, while they still have the full 256kB/320kB/1MB SRAM of the other parts in the same product line.

I think I understand what is going on now: STMicroelectronics is emulating a ROMless processor with this. All of the chips above have QSPI with XIP support like the rest of the product line, and I have the feeling that whatever goes into the small amount of on-chip Flash was never intended to be the actual application. The developer is supposed to write their own bootloader, load that into the internal Flash, and put the application code into QSPI. The bootloader would initialize the QSPI for XIP, a task that cannot easily be done using strap pins or option bytes given the intricacy of the settings, and then jump into the application.
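
A minimal sketch of the last step of such a bootloader, assuming the QSPI controller has already been put into memory-mapped mode (that setup is device-specific and not shown). The helper names `image_looks_valid` and `jump_to_image` are hypothetical; the `0x90000000` window and the `VTOR` address come from the STM32F7/H7 and Cortex-M documentation. The check only looks at the first two vector table words, so treat it as a sanity test, not as image authentication.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory-mapped QSPI window on STM32F7/H7. */
#define QSPI_XIP_BASE 0x90000000u

/* First two words of a Cortex-M vector table. */
typedef struct {
    uint32_t initial_sp;    /* value loaded into MSP at reset */
    uint32_t reset_handler; /* entry point, bit 0 set for Thumb code */
} vector_head_t;

/* Hypothetical helper: sanity-check an image before jumping to it.
 * RAM and code ranges are [start, end); the initial SP may equal ram_end
 * because the stack grows downward from the top of RAM. */
static bool image_looks_valid(const vector_head_t *v,
                              uint32_t ram_start, uint32_t ram_end,
                              uint32_t code_start, uint32_t code_end)
{
    bool sp_ok = v->initial_sp > ram_start &&
                 v->initial_sp <= ram_end &&
                 (v->initial_sp & 0x7u) == 0u;      /* 8-byte aligned */
    bool pc_ok = (v->reset_handler & 1u) == 1u &&   /* Thumb bit set */
                 v->reset_handler >= code_start &&
                 v->reset_handler < code_end;
    return sp_ok && pc_ok;
}

#if defined(__arm__) && defined(__GNUC__)
/* Target-only part (assumption: interrupts are already disabled and
 * QSPI is already memory-mapped). Compiled out on a host build. */
static void jump_to_image(const vector_head_t *v)
{
    void (*entry)(void) = (void (*)(void))(uintptr_t)v->reset_handler;
    *(volatile uint32_t *)0xE000ED08u = QSPI_XIP_BASE; /* SCB->VTOR */
    __asm volatile ("msr msp, %0" : : "r" (v->initial_sp));
    entry();
}
#endif
```

On the target, the bootloader would point a `const vector_head_t *` at `QSPI_XIP_BASE`, call `image_looks_valid` with the RAM range of the specific part, and only then call `jump_to_image`. An erased flash reads as `0xFFFFFFFF`, which the check rejects.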

ataradov · « **Reply #1 on:** November 05, 2018, 05:46:33 pm »

Well, yes. I thought this was obvious. Furthermore, with this much RAM, you can load your firmware into RAM, get really good performance, and use external flash for data only.

iMo · « **Reply #2 on:** November 05, 2018, 05:51:39 pm »

The flash technology in those MCUs is slow, and therefore a bottleneck. Flash read throughput might be, say, 30-40MB/sec at most (with no ART cache in use). That is why the MCU makers are moving to flash-less chips: load the code from somewhere outside the MCU into the fast internal SRAM and execute it there.

djacobow · « **Reply #3 on:** November 05, 2018, 05:52:55 pm »

It's also not uncommon that a few core routines are important for performance and the vast majority of your code would be just as happy on a 1 MHz 8 bitter as on a fast M7, in which case, put the routines that need to be fast in flash and let everything else live in QSPI.

ataradov · « **Reply #4 on:** November 05, 2018, 05:54:04 pm »

Quote from: djacobow on November 05, 2018, 05:52:55 pm

It's also not uncommon that a few core routines are important for performance and the vast majority of your code would be just as happy on a 1 MHz 8 bitter as on a fast M7, in which case, put the routines that need to be fast in flash and let everything else live in QSPI.

Internal flash is not that much faster than external one.

SiliconWizard · « **Reply #5 on:** November 05, 2018, 05:57:33 pm »

This is what cache is for.

djacobow · « **Reply #6 on:** November 05, 2018, 05:58:07 pm »

Quote from: ataradov on November 05, 2018, 05:54:04 pm

Quote from: djacobow on November 05, 2018, 05:52:55 pm
It's also not uncommon that a few core routines are important for performance and the vast majority of your code would be just as happy on a 1 MHz 8 bitter as on a fast M7, in which case, put the routines that need to be fast in flash and let everything else live in QSPI.
Internal flash is not that much faster than external one.

OK, then RAM, then. :-) The concept is still the same. There is a hierarchy of memory performance and putting your whole code into one section is simple but inefficient. You can learn a lot by profiling your application and getting familiar with sections and linker scripts.
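
As a rough illustration of the sections idea, here is a sketch that tags one hot routine for placement in fast memory and leaves everything else in the default `.text`. The section name `.ramfunc` and the macro `HOT_CODE` are assumptions; the linker script has to map that section to ITCM or SRAM, and the startup code has to copy it there before the first call. On a non-ARM host build the macro expands to nothing, so the code can be tested on a PC.

```c
#include <stddef.h>
#include <stdint.h>

/* ".ramfunc" is an assumed section name that the linker script must define. */
#if defined(__GNUC__) && defined(__arm__)
#define HOT_CODE __attribute__((section(".ramfunc"), noinline))
#else
#define HOT_CODE /* host build: ordinary placement */
#endif

/* Hot path: Q15 dot product with a 64-bit accumulator (no saturation). */
HOT_CODE int64_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Cold path: stays in the default .text section (QSPI in this scheme). */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
    }
    return sum;
}
```

Which routines deserve the attribute is exactly what profiling tells you; the map file shows where each function actually ended up.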

djacobow · « **Reply #7 on:** November 05, 2018, 06:01:33 pm »

Quote from: SiliconWizard on November 05, 2018, 05:57:33 pm

This is what cache is for.

Cache is great, but there are times you want very predictable performance or for some reason want to know for sure that something will always run as quickly as possible, in which case blocks of instruction ram or lockable cache lines can become interesting. I think in general-purpose architectures like PCs, it's hard to make use of such features, but in embedded they can make a lot of sense.

SiliconWizard · « **Reply #8 on:** November 05, 2018, 06:11:58 pm »

Yes, although relying on cycle-exact execution on such processors may not necessarily be the wisest design decision.

But yes, in this particular case, you can execute code from RAM.
Assuming that RAM access is perfectly predictable, as on some processors RAM accesses may still pass through cache mechanisms (that you would then have to disable).

ataradov · « **Reply #9 on:** November 05, 2018, 06:16:21 pm »

If you need absolutely predictable execution, then TCM is the only way. Everything else depends on the bus utilization at least.

ajb · « **Reply #10 on:** November 05, 2018, 07:02:16 pm »

Quote from: SiliconWizard on November 05, 2018, 06:11:58 pm

as on some processors RAM accesses may still pass through cache mechanisms (that you would then have to disable).

Hopefully any device that has a cache also has an MPU that can disable the cache on a per-region basis. This becomes really important for things like peripherals hanging off an external memory bus or, on parts like the STM32F7 where only the core goes through the cache, anything involving DMA.
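
A sketch of what such a per-region setting looks like on an ARMv7-M MPU, assuming a DMA buffer region that should be read/write, never executed, and not cached. The bit positions are the architectural `MPU_RASR` fields; the helper names `rasr_size_field` and `rasr_noncacheable_rw` are made up for illustration, and only the value computation is shown so it can be checked on a host.

```c
#include <stdint.h>

/* ARMv7-M MPU_RASR bit fields. */
#define RASR_ENABLE   (1u << 0)
#define RASR_SIZE(n)  ((uint32_t)(n) << 1)   /* region is 2^(n+1) bytes */
#define RASR_TEX(n)   ((uint32_t)(n) << 19)
#define RASR_AP_FULL  (3u << 24)             /* read/write at all privilege levels */
#define RASR_XN       (1u << 28)             /* execute never */

/* Hypothetical helper: SIZE field for a power-of-two region of at least
 * 32 bytes, or 0 if the size is not representable. */
static uint32_t rasr_size_field(uint32_t bytes)
{
    if (bytes < 32u || (bytes & (bytes - 1u)) != 0u) {
        return 0u;
    }
    uint32_t log2 = 0u;
    while ((bytes >> log2) != 1u) {
        log2++;
    }
    return log2 - 1u;
}

/* Hypothetical helper: RASR value for a Normal, non-cacheable, RW, XN
 * region (TEX=0b001 with C=0 and B=0). Returns 0 for an invalid size. */
static uint32_t rasr_noncacheable_rw(uint32_t bytes)
{
    uint32_t size = rasr_size_field(bytes);
    if (size == 0u) {
        return 0u;
    }
    return RASR_XN | RASR_AP_FULL | RASR_TEX(1) | RASR_SIZE(size) | RASR_ENABLE;
}
```

On the target, the returned value would be written to `MPU_RASR` after `MPU_RBAR` has been loaded with the region base, which must be aligned to the region size. For a 32kB buffer this gives `0x1308001D`.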

krho · « **Reply #11 on:** November 05, 2018, 07:46:14 pm »

The parts are uninteresting, because they don't offer encryption of what's XIPed from QSPI.

ajb · « **Reply #12 on:** November 05, 2018, 09:08:41 pm »

Quote from: krho on November 05, 2018, 07:46:14 pm

The parts are uninteresting, because they don't offer encryption of what's XIPed from QSPI.

But you could decrypt the QSPI contents into RAM, then execute it there. Obviously decrypting into external memory would present a big security hole, so you'd want to run from internal RAM, but you could add external RAM for data storage and then you have the entire on-board RAM available for the application. 1MB of application space is pretty good, especially since it doesn't need to contain your bootloader or even initialization values if you set up your linker properly.

Although since we're talking about ST, trying to use QSPI and the FMC and more than two or three other peripherals at once usually requires using the largest available package and dealing with a lot of stupid PCB routing.

MT · « **Reply #13 on:** November 06, 2018, 12:30:09 am »

Ah! Yes, the stupid PCB routing, thanks ST for placing QSPI BK2 NCS on the other side of the chip!
Thank you, thank you!

chickenHeadKnob · « **Reply #14 on:** November 06, 2018, 01:25:35 am »

Quote from: MT on November 06, 2018, 12:30:09 am

Ah! Yes, the stupid PCB routing, thanks ST for placing QSPI BK2 NCS on the other side of the chip!
Thank you, thank you!

I was just doing the CubeMX thing for the H750V in the 100-pin QFP package. Why does ST do everything in such a random, haphazard way? Infuriating. If you select Ethernet RMII you lose ADC/DAC I/O, and the FMC data/address pins are on all sides of the chip. Nothing is ever logical.

MT · « **Reply #15 on:** November 06, 2018, 02:32:33 am »

Have wondered the same for a long time. It's just a frustrating pain in the arse. Perhaps they just use the same floorplan masks (if possible? I dunno?!) for everything per package size and then just change individual CPU/peripheral blocks within as they go along creating all the MCU variants?!

ehughes · « **Reply #16 on:** November 06, 2018, 03:25:56 am »

Flash is the most expensive part of an MCU. Future MCUs (40nm process and below) are foregoing flash to save cost. It is cheaper to add external QSPI or HyperFlash.

NOR flash needs a thick gate oxide layer (extremely high capacitance). There isn't a NOR flash that can be read faster than 25 to 40MHz depending on process. The best you can do is play games with organization (cache, really wide data paths). Parts rated at 133/166MHz are a joke; the actual cells are read much slower. QSPI performance can be all over the map, and cache can be rendered useless pretty easily.

Example:

I am optimizing a DSP application for the i.MX RT 1021. We are trying to get enough performance from the external QSPI, as external HyperFlash is expensive and there isn't quite enough internal TCM to run everything internally. How you split up your RAM is crucial.

Q15 RFFT 128 Points.

Average of 57000 cycles (min of 27k cycles and max of 72k cycles over 1000 runs). This is the CMSIS RFFT function running from cached QSPI. As a reference, a Cortex-M4 with internal flash uses about 5.6k cycles for the same transform (assume all other things equal, spherical cows, etc.)

With the same code and just the FFT twiddle factors stored in DTCM, I get 3.1k cycles per FFT with only a 100 cycle spread over 1000 test runs.

Running from QSPI can be *very* non-deterministic. Simple changes to critical constants/code location can yield *an order of magnitude* of performance difference.
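
For anyone wanting to collect min/max/average numbers like these, a minimal sketch of a cycle statistics accumulator follows. The type and function names are hypothetical. On a Cortex-M target the per-run cycle count would typically be the difference of two `DWT->CYCCNT` reads (not shown here, since it needs target headers); unsigned 32-bit subtraction keeps that difference correct across a single counter wrap.

```c
#include <stddef.h>
#include <stdint.h>

/* Running statistics over per-run cycle counts. */
typedef struct {
    uint32_t min;
    uint32_t max;
    uint64_t sum;
    uint32_t count;
} cycle_stats_t;

static void stats_init(cycle_stats_t *s)
{
    s->min = UINT32_MAX;
    s->max = 0u;
    s->sum = 0u;
    s->count = 0u;
}

/* Record one measurement (for example end_cycles - start_cycles). */
static void stats_add(cycle_stats_t *s, uint32_t cycles)
{
    if (cycles < s->min) { s->min = cycles; }
    if (cycles > s->max) { s->max = cycles; }
    s->sum += cycles;
    s->count++;
}

/* Integer mean, or 0 if nothing was recorded. */
static uint32_t stats_mean(const cycle_stats_t *s)
{
    return s->count ? (uint32_t)(s->sum / s->count) : 0u;
}

/* Max minus min: a quick indicator of how deterministic the code is. */
static uint32_t stats_spread(const cycle_stats_t *s)
{
    return s->count ? s->max - s->min : 0u;
}
```

The spread is the number to watch when moving code or constants between QSPI, cache and TCM: a small spread with a low mean is what the DTCM case above shows, a large spread is the cached QSPI case.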

This is the future of how MCUs (external flash via QSPI/HyperFlash) will be produced so get ready!

ataradov · « **Reply #17 on:** November 06, 2018, 03:30:06 am »

My hope is that we will come up with a better interface for the flash. Something like ULPI for USB transceivers. It does not have to be super low pin count, since I expect most manufacturers to integrate the separate die in the same package (similar to what GigaDevice does). Having a separate chip is very expensive in large volume.

If I were in charge of any of this, I would jump on defining this protocol immediately.

technix · « **Reply #18 on:** November 06, 2018, 05:05:29 am »

Quote from: ataradov on November 06, 2018, 03:30:06 am

My hope is that we will come up with a better interface for the flash. Something like ULPI for USB transceivers. It does not have to be super low pin count, since I expect most manufacturers to integrate the separate die in the same package (similar to what GigaDevice does). Having a separate chip is very expensive in large volume.

If I were in charge of any of this, I would jump on defining this protocol immediately.

I think for most vendors QSPI/OctoSPI for NOR and a raw NAND interface for NAND are already good enough.

Quote from: chickenHeadKnob on November 06, 2018, 01:25:35 am

Quote from: MT on November 06, 2018, 12:30:09 am
Ah! Yes, the stupid PCB routing, thanks ST for placing QSPI BK2 NCS on the other side of the chip!
Thank you, thank you!

I was just doing the CubeMX thing for the H750V in the 100-pin QFP package. Why does ST do everything in such a random, haphazard way? Infuriating. If you select Ethernet RMII you lose ADC/DAC I/O, and the FMC data/address pins are on all sides of the chip. Nothing is ever logical.

Ditto here, trying to lay out an F750Z in 144-pin QFP. I had to sacrifice the built-in Ethernet for USB HS, SDRAM and QSPI to coexist, so the plan now has to rely on a DM9000A for Ethernet.

ajb · « **Reply #19 on:** November 06, 2018, 05:54:06 am »

I really have to wonder how much it would actually cost (I guess mainly in die routing?) to just implement a 100% crosspoint matrix for 100-200 IO pins. Actually I guess you probably have about as many peripheral endpoints as IO pins in a typical MCU these days so that does seem like a lot of switches now that I've typed that number out.

Sure would make PCB layout easier, though.

ataradov · « **Reply #20 on:** November 06, 2018, 05:58:59 am »

Quote from: ajb on November 06, 2018, 05:54:06 am

I really have to wonder how much it would actually cost (I guess mainly in die routing?) to just implement a 100% crosspoint matrix for 100-200 IO pins. Actually I guess you probably have about as many peripheral endpoints as IO pins in a typical MCU these days so that does seem like a lot of switches now that I've typed that number out.

This will be very expensive and probably will not work at all in the general case. Atmel/Microchip MCUs traditionally had a decent number of multiplexing options, but even there the fast interfaces (USB, SDRAM) have only one option, and that option is whatever pads were closest on the die. Otherwise it is very hard to get decent signal integrity.

But good and convenient multiplexing for slower interfaces is absolutely possible.

technix · « **Reply #21 on:** November 06, 2018, 09:42:25 pm »

Quote from: ajb on November 06, 2018, 05:54:06 am

I really have to wonder how much it would actually cost (I guess mainly in die routing?) to just implement a 100% crosspoint matrix for 100-200 IO pins. Actually I guess you probably have about as many peripheral endpoints as IO pins in a typical MCU these days so that does seem like a lot of switches now that I've typed that number out.

Sure would make PCB layout easier, though.

I think that would cost about the same as an FPGA with ~10k LUTs for a 100-pin matrix. Instead of a microcontroller, it would be easier to design such a board around an FPGA with a Cortex-M4F hard core in the center.

Speaking of which, it would likely be economically viable to produce such a chip with the main FPGA+ARM die on a 28nm node, and to package in some stacked QSPI as shared ARM code and FPGA bitstream storage.

