Author Topic: Memory model for Microcontrollers


Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #125 on: May 08, 2024, 11:53:59 am »
While we are on this subject area, one question comes to mind.

On 'proper', larger CPUs/computers, i.e. NOT most small/medium microcontrollers, there are usually memory management units, virtual memory and operating systems.  So each significant process/program, and hence each individual core/thread, can appear (and be treated) as if it had an entire computer to itself, potentially starting at address $0000.., with little or no worry about conflicting/sharing with anything else.
With some exceptions, of course.

But (especially lower-end) microcontrollers typically don't have memory management units or virtual memory, and quite often don't run operating systems (with many exceptions).

Yet even at the low end of microcontrollers these days, e.g. the Raspberry Pi Pico (a pair of small Arm Cortex-M0+ cores), there can be two or more separate cores.

So how do the compiler(s), linkers etc. cope with such possible conflicts, and with uncertainty over what the other core(s)/threads are using or not?

Or in other words, you could have two completely separate main()s (one running on one core, the other running on the second), which MUST mostly always use completely different locations in memory (except where they are intentionally sharing memory).

My guess would be that the resources get split up (using linker symbols etc.) into smaller sections, with each core using its own set of linker symbols.  But that must get very complicated if one core runs a C program and the other runs a completely different language (if that is even allowed).

Fortunately, most/many programs on the Pi Pico are just fine using only a single core.  So I've been fine, up to now.
« Last Edit: May 08, 2024, 11:56:14 am by MK14 »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4219
  • Country: us
Re: Memory model for Microcontrollers
« Reply #126 on: May 08, 2024, 07:41:13 pm »
Quote
two or more separate cores.
Usually you compile one set of sources into a single binary, so that stage of development doesn't know that there will be more than one CPU.  Stack and heap can be assigned to each CPU (or task) at runtime, but data, text, and rodata can be freely and randomly interleaved.
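
To make this concrete, here is a minimal sketch of that single-binary model, assuming the Raspberry Pi Pico SDK (the function and variable names are mine, not the SDK's): both cores' text, data and rodata live in the one image, and the SDK gives core 1 its own stack when it is launched.

Code: [Select]
/* One binary, two cores, on the RP2040 (Pico SDK). */
#include "pico/stdlib.h"
#include "pico/multicore.h"

static volatile uint32_t shared_counter;  /* a deliberately shared global */

static void core1_entry(void) {           /* in effect, main() for core 1 */
    for (;;) {
        shared_counter++;                 /* unsynchronized on purpose: demo only */
        sleep_ms(1000);
    }
}

int main(void) {
    multicore_launch_core1(core1_entry);  /* SDK allocates core 1's stack */
    for (;;) {
        tight_loop_contents();            /* core 0 carries on with its own work */
    }
}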
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #127 on: May 08, 2024, 08:42:21 pm »
Quote
two or more separate cores.
Usually you compile one set of sources into a single binary, so that stage of development doesn't know that there will be more than one CPU.  Stack and heap can be assigned to each CPU (or task) at runtime, but data, text, and rodata can be freely and randomly interleaved.

Thanks, that makes a lot of sense.

I didn't realise that such a simple and elegant solution existed for that problem.  I was expecting something longer and more complicated.

Also, there might have to be rules/limits so that only the first thread/core can access the hardware registers (as in I/O), to avoid possible race hazards, locking conditions and things.  Otherwise, mechanisms such as mutexes might resolve such issues.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #128 on: May 08, 2024, 09:40:28 pm »
The simplest locking primitives when you have more than one core in a microcontroller are usually spinlocks.  They are named that, because a core waiting for the spinlock to become free will just spin in a tight loop.

RP2040, for example, provides 32 hardware spinlocks.  On other architectures, either an atomic compare-exchange or load-linked/store-conditional is used (in a loop), with each core or task having its own identifying value it tries to set the spinlock to.  They spin in the loop, trying until they succeed.

The main difficulty in implementation is the interactions with caches (cache coherency, atomic accesses), requiring careful construction for each architecture: a naïve implementation will look like it should work, but because of cache coherency or memory atomicity details, more than one core may believe they own the spinlock at the same time!  Also, if the cores have separate caches, the cacheline containing the spinlock may end up being ping-ponged between RAM and the two contending cores, making the spinning part amazingly slow.  Which is why operating systems typically provide "better" locking primitives, which let other stuff run while tasks wait to grab a lock, and only use spinlocks for very short durations, like copying some variable or object only.
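
For illustration, here is a minimal software spinlock sketch in portable C11 atomics.  Note the assumption: the target must actually provide an atomic test-and-set, which (as comes up later in this thread) a Cortex-M0+ does not; that is exactly why the RP2040 provides hardware spinlocks instead.

Code: [Select]
#include <stdatomic.h>

/* The lock is a single atomic flag; ATOMIC_FLAG_INIT means "unlocked". */
static atomic_flag lock = ATOMIC_FLAG_INIT;

static inline void spin_lock(atomic_flag *l) {
    /* Atomic test-and-set: spin until the previous holder clears the flag. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* a pause/yield hint could go here */
}

static inline void spin_unlock(atomic_flag *l) {
    atomic_flag_clear_explicit(l, memory_order_release);
}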
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #129 on: May 08, 2024, 10:26:21 pm »
Quote
The simplest locking primitives when you have more than one core in a microcontroller are usually spinlocks.  They are named that, because a core waiting for the spinlock to become free will just spin in a tight loop.

RP2040, for example, provides 32 hardware spinlocks.  On other architectures, either an atomic compare-exchange or load-linked/store-conditional is used (in a loop), with each core or task having its own identifying value it tries to set the spinlock to.  They spin in the loop, trying until they succeed.

The main difficulty in implementation is the interactions with caches (cache coherency, atomic accesses), requiring careful construction for each architecture: a naïve implementation will look like it should work, but because of cache coherency or memory atomicity details, more than one core may believe they own the spinlock at the same time!  Also, if the cores have separate caches, the cacheline containing the spinlock may end up being ping-ponged between RAM and the two contending cores, making the spinning part amazingly slow.  Which is why operating systems typically provide "better" locking primitives, which let other stuff run while tasks wait to grab a lock, and only use spinlocks for very short durations, like copying some variable or object only.

That's clever, that the RP2040 (Pico), for example, provides 32 hardware spinlocks, using the SIO unit to create them.  Since the SIO connects to both cores, it (presumably) internally has 32 bits' worth of state for doing it.

As you detailed, caches can be problematic.  They can also introduce significant uncertainty in how real-time software will behave, especially timing-wise.

32 spinlocks should be enough for quite a few independent sections of software to do their own things.

I suppose there could also be a mechanism more like a callback (but to help with multiple cores), whereby a core can carry on doing other stuff until the locked thing(s) it wanted are ready.

Given the extensive, long and sometimes prohibitively expensive software development costs, it might be better to get a single-core CPU with much greater speed (clock frequency and/or IPC) for that single core, rather than handle all the complexities of coping with multiple cores.

On the other hand, it can be fun, educational, and save money on parts cost if something is going into large-scale mass production.  Then the dual-core RP2040 can make sense, making it worth the extra time invested in such work.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #130 on: May 09, 2024, 12:58:38 pm »
Quote
Given the extensive, long and sometimes prohibitively expensive software development costs, it might be better to get a single-core CPU with much greater speed (clock frequency and/or IPC) for that single core, rather than handle all the complexities of coping with multiple cores.
I disagree, but my opinion is colored by my own experience, and having specialized in parallel and distributed computing (in HPC, simulations and such).

When you can separate the logical tasks for each core, the end result can be much simpler than the same functionality running on a single core.
The key difference is understanding and using the various mechanisms one can use for communications (and passing data) between the cores, and how to separate the "jobs" effectively.

In other words, you do need to learn/know more programming techniques, but it is worth it in the end.  Message passing, message queues and mailboxes are extremely common, and very often used in systems programming, especially in low-level graphics interfaces.  Inter-core interrupts (where code on one core causes an interrupt on a designated other core, or on all other cores) are also useful, but more for synchronization than data/message passing.  RPis use a mailbox interface for communicating between the VC and the ARM core, for example.  If you care to learn by a combination of studying the background idea/theory and then experimenting, I assure you, you can get the hang of it quickly.
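
As a small, hedged example of that message-passing style on the RP2040: the Pico SDK exposes the inter-core FIFO, a pair of hardware mailbox queues between the two cores (the wrapper function names below are mine).

Code: [Select]
#include "pico/multicore.h"

/* Core 0 side: send a 32-bit message/command to core 1.
 * Blocks if the hardware FIFO (8 entries deep) is full. */
static void send_to_core1(uint32_t msg) {
    multicore_fifo_push_blocking(msg);
}

/* Core 1 side: block until the next message from core 0 arrives. */
static uint32_t receive_from_core0(void) {
    return multicore_fifo_pop_blocking();
}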

It does require some experience to design the separation between logical workloads in a parallel/multi-core system effectively, though.
If we look at historical precedents, the first parallelization primitives in programming languages revolved around parallelizing loops and such, which really isn't that useful in real life; we just had to learn the better ways, eventually.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #131 on: May 09, 2024, 01:50:21 pm »
Quote
I disagree, but my opinion is colored by my own experience, and having specialized in parallel and distributed computing (in HPC, simulations and such).

When you can separate the logical tasks for each core, the end result can be much simpler than the same functionality running on a single core.
The key difference is understanding and using the various mechanisms one can use for communications (and passing data) between the cores, and how to separate the "jobs" effectively.

In other words, you do need to learn/know more programming techniques, but it is worth it in the end.  Message passing, message queues and mailboxes are extremely common, and very often used in systems programming, especially in low-level graphics interfaces.  Inter-core interrupts (where code on one core causes an interrupt on a designated other core, or on all other cores) are also useful, but more for synchronization than data/message passing.  RPis use a mailbox interface for communicating between the VC and the ARM core, for example.  If you care to learn by a combination of studying the background idea/theory and then experimenting, I assure you, you can get the hang of it quickly.

It does require some experience to design the separation between logical workloads in a parallel/multi-core system effectively, though.
If we look at historical precedents, the first parallelization primitives in programming languages revolved around parallelizing loops and such, which really isn't that useful in real life; we just had to learn the better ways, eventually.

I don't think there are any right or wrong answers as such.  It is like many engineering things, whereby there are millions of different ways of skinning a cat.

Some time after I posted what you quoted, I partly changed my mind, because some real-time tasks would actually benefit from the separation (partly like you describe).  I was thinking more along the lines of: one of the CPUs could take any intensive interrupt-handling sections, with the corresponding timing jitter and increases in maximum latency,
but the other core could have few or no interrupts, and hence rather deterministic timing as it progresses through what it is supposed to do.

I see your point, which is a very good one.  Which (if I understood it correctly) is saying that splitting real-life software tasks/jobs (presumably in some but not all cases) between different CPUs (or threads) can be a useful way of breaking/partitioning a task down into efficiently sized 'chunks' for software developers to handle.

Because single-core performance (barring possible future developments/inventions, e.g. quantum computers, although that would be more like a huge number of tiny cores acting in parallel) is unlikely to speed up that much, because of laws-of-physics limitations, such as the speed of light, and limits on how small, low-capacitance and fast real-life producible integrated circuits can become.
Also, chasing raw clock frequency tends to use disproportionately more power than lower-frequency solutions.

Whereas having an ever-increasing number of cores in the same processor package, for CPUs or graphics cards, tends to be cost-, size- and power-efficient.  So that could well be the way forward.

But there could be barriers, such as Amdahl's law.
https://en.wikipedia.org/wiki/Amdahl%27s_law
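
For reference, Amdahl's law bounds the achievable speedup: if a fraction $p$ of the work parallelizes perfectly over $N$ cores, then

$S(N) = \dfrac{1}{(1 - p) + p/N}$

so with e.g. $p = 0.95$, two cores give $S(2) = 1/(0.05 + 0.475) \approx 1.9$, and no number of cores can push it past $S(\infty) = 1/(1 - p) = 20$.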
« Last Edit: May 09, 2024, 01:52:00 pm by MK14 »
 

Offline JPortici

  • Super Contributor
  • ***
  • Posts: 3473
  • Country: it
Re: Memory model for Microcontrollers
« Reply #132 on: May 09, 2024, 05:35:44 pm »
Quote
While we are on this subject area, one question comes to mind.

On 'proper', larger CPUs/computers, i.e. NOT most small/medium microcontrollers, there are usually memory management units, virtual memory and operating systems.  So each significant process/program, and hence each individual core/thread, can appear (and be treated) as if it had an entire computer to itself, potentially starting at address $0000.., with little or no worry about conflicting/sharing with anything else.
With some exceptions, of course.

But (especially lower-end) microcontrollers typically don't have memory management units or virtual memory, and quite often don't run operating systems (with many exceptions).

Yet even at the low end of microcontrollers these days, e.g. the Raspberry Pi Pico (a pair of small Arm Cortex-M0+ cores), there can be two or more separate cores.

So how do the compiler(s), linkers etc. cope with such possible conflicts, and with uncertainty over what the other core(s)/threads are using or not?

Or in other words, you could have two completely separate main()s (one running on one core, the other running on the second), which MUST mostly always use completely different locations in memory (except where they are intentionally sharing memory).

My guess would be that the resources get split up (using linker symbols etc.) into smaller sections, with each core using its own set of linker symbols.  But that must get very complicated if one core runs a C program and the other runs a completely different language (if that is even allowed).

Fortunately, most/many programs on the Pi Pico are just fine using only a single core.  So I've been fine, up to now.

My extremely limited experience with dual-core MCUs is that they are completely separate systems: they share the GPIO and some mailbox system to communicate with each other, but each core is in fact a separate MCU: separate memory, separate peripherals.  So each core effectively runs its own firmware, treats the other core(s) as independent entities, and uses the mailboxes to communicate.

Then if you use an RTOS with multicore support, you can select which task goes to which core, either at compile time or at runtime... FreeRTOS, for example, supports some kind of multicore multiprocessing.
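
As a hedged illustration: with the FreeRTOS SMP kernel (V11 or later), pinning tasks to cores at creation time looks roughly like the sketch below.  The task names are made up, and configNUMBER_OF_CORES = 2 plus configUSE_CORE_AFFINITY = 1 are assumed in FreeRTOSConfig.h.

Code: [Select]
#include "FreeRTOS.h"
#include "task.h"

/* Hypothetical application tasks. */
extern void vSensorTask(void *pvParameters);
extern void vCommsTask(void *pvParameters);

void vCreatePinnedTasks(void) {
    /* The affinity mask has bit n set if the task may run on core n. */
    xTaskCreateAffinitySet(vSensorTask, "sensor", configMINIMAL_STACK_SIZE,
                           NULL, 2, (1u << 0), NULL);  /* core 0 only */
    xTaskCreateAffinitySet(vCommsTask, "comms", configMINIMAL_STACK_SIZE,
                           NULL, 2, (1u << 1), NULL);  /* core 1 only */
}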
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #133 on: May 09, 2024, 05:56:02 pm »
Quote
Then if you use an RTOS with multicore support, you can select which task goes to which core, either at compile time or at runtime... FreeRTOS, for example, supports some kind of multicore multiprocessing.

That sounds like a good idea.  The extra resources of multiple cores, combined with the extra features of an OS such as the aforementioned FreeRTOS, sound interesting for some projects.  It would mean a project running on a very low-cost RP2040 Raspberry Pi Pico could have some of the functionality of its bigger brothers, such as the Raspberry Pi 5,
but without their bulky size, relatively heavy power consumption, and the difficulties that might arise because a full-blown Linux, Windows or similar OS is too much for most microcontroller projects.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #134 on: May 09, 2024, 06:00:43 pm »
I swear I posted a reply, but it vanished into thin air!  :scared:

Quote
I don't think there are any right or wrong answers as such.
Fully agreed; that's why I described mine as an opinion I currently have due to my own personal experiences.

Quote
Which (if I understood it correctly) is saying that splitting real-life software tasks/jobs (presumably in some but not all cases) between different CPUs (or threads) can be a useful way of breaking/partitioning a task down into efficiently sized 'chunks' for software developers to handle.
Yes, exactly.

Splitting code into separate tasks can make each of them much more manageable even on a single-core device, although then you do need a task scheduler, and to think about priorities and pre-emption.  On a multi-core device with one main task per core, you don't even need a task scheduler.

As mentioned, FreeRTOS provides one such task scheduler and locking primitives and message mailbox primitives for you.

In my own case, the overall approach has led to me considering using more than one MCU in many of my designs, so that the main MCU doesn't need to be so powerful (so a cheaper and more energy-efficient MCU suffices).  For example, the ultra-cheap Padauk pdk14/pdk15 ones (programmed using SDCC and Free PDK) for additional PWM, or PDM generation in a tight loop.

In the same vein, I like to use microcontrollers with native USB, high-speed (480 Mbit/s) if possible, full-speed (12 Mbit/s) at minimum, for I/O with single-board computers running Linux.  It is easy for the microcontrollers to handle the timing requirements et cetera, and just transfer the not-so-timing-critical commands and bulk data via USB.  These don't need to be expensive either, as the WCH CH55x are cheap (CH552G about 0.50€ in singles!) and easily programmed using SDCC and ch55xduino.
My favourite, of course, is the Teensy 4.0, which, using the default Teensyduino USB Serial implementation, can easily sustain 200+ Mbit/s (25 Mbytes/s) to a Linux host, with no other requirement on the Teensy side than writing to the USB stream in chunks of at least a few bytes, instead of individual bytes.

I haven't even checked how fast I can transfer to the Teensy using Linux USB bulk transfers, because at 25 Mbytes/s you can update a 320×200-pixel 16-bit display at over 160 fps; the bottleneck is the Teensy-to-display link, not the USB Serial or tty layer.

Quote
But there could be barriers, such as Amdahl's law.
https://en.wikipedia.org/wiki/Amdahl%27s_law
The main practical cause for Amdahl's law in HPC simulations is the simulator design: they do computation and communication in separate steps instead of in parallel.

Let me rant a bit, and show you a real-world example of this.

Consider a 2D simulation limited to a rectangular area.  You split it in roughly equal sections, with each core (or node in HPC terms) handling one.  The border regions between neighbors need to be synchronized between nodes, so most simulators just do some computation, then transfer the changes in the border region to their neighbors, in turns.  Because the width of the border region is fixed –– it is at least the maximum interaction distance within the simulation –– it forms the fixed, non-parallelizable term in Amdahl's law, limiting the benefit of adding more cores to work on the same problem.

My solution is to subdivide the region each core handles.  For example, if the region is split in one dimension, then we need four subregions per core, with roughly equal amounts of work:
    ╔═╤═╤═╤═╦═╤═╤═╤═╦═╤═╤═╤═╗
    ║1│2│3│4║1│2│3│4║1│2│3│4║ (three regions/cores)
    ╚═╧═╧═╧═╩═╧═╧═╧═╩═╧═╧═╧═╝
  • All cores work on region 3.
  • All cores work on region 4, for double the amount.
  • All cores work on region 3.  Each core also sends the relevant changes from region 4 to their neighbor on the right, and receives the same information from their neighbor on the left.
  • All cores work on region 2.
  • All cores work on region 1, for double the amount.
  • All cores work on region 2.  Each core also sends the relevant changes from region 1 to their neighbor on the left, and receives the same from their neighbor on the right.
After each such cycle, even detailed balance is maintained (all reversible events have exactly the same probability in either direction; for example, a particle wandering from subregion 1 to subregion 4, and vice versa; but depending on the subregion of origin, you need to start the examination from a different sub-step!).  It's easy to verify numerically, but I've never bothered to find out how to write the proof in a form acceptable to mathy papers.
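
Purely as an illustration of the cycle above, a hypothetical sketch in C: every function and variable here is a made-up placeholder (declared extern only so the sketch is a valid translation unit), and a real simulator would perform the transfers with e.g. non-blocking MPI while the middle subregions are being computed.

Code: [Select]
typedef struct region region_t;
extern region_t subregion[5];        /* indices 1..4, as in the diagram */
extern void compute(region_t *);
extern void exchange_right_async(region_t *);  /* send right, receive from left */
extern void exchange_left_async(region_t *);   /* send left, receive from right */
extern void wait_for_transfers(void);

void cycle(void)
{
    compute(&subregion[3]);
    compute(&subregion[4]);          /* double the amount of work */
    compute(&subregion[4]);
    exchange_right_async(&subregion[4]);
    compute(&subregion[3]);          /* computed while the data is in flight */
    wait_for_transfers();
    compute(&subregion[2]);
    compute(&subregion[1]);          /* double the amount of work */
    compute(&subregion[1]);
    exchange_left_async(&subregion[1]);
    compute(&subregion[2]);          /* computed while the data is in flight */
    wait_for_transfers();
}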

This pushes the Amdahl's law limit much higher: to where the transfer of data takes longer than the computation in the sub-step.  Up to that point, this scales linearly as the number of cores increases (quite perfectly so in practice!).

When the simulated region is split across multiple cores in two or three dimensions, you need a minimum of 4×4=16 to 4×4×4=64 cells, with a 2×2 or 2×2×2 computing region traveling over them in a (near) Hilbert curve, and similarly two cells between the working areas of neighboring cores.

Unfortunately, scientists creating these simulators do not seem to be able to handle this kind of spatial complexity in their programming, so their response is typically "That's way too complicated! I'll just request more CPU time from the cluster and use the same crappy code everyone else uses".

Yes, I am bitter!  When I was young, I didn't believe scientists could be this short-sighted.  And I cannot really read papers on materials physics using Monte Carlo methods anymore, because the discontinuities at simulation boundaries in their plots make my eyes bleed.  :rant:

Ahem.  :-[  Back to the topic at hand.

In many cases the Amdahl limit is a consequence of data transfers, either their latency or limited bandwidth.  Heck, even on x86-64/AMD64 processors, the RAM ↔ cache bandwidth and cache-miss latencies (and in multithreaded processes, cacheline ping-pong between cores) are typically the limiting factor, not the computational capabilities of a core.
« Last Edit: May 09, 2024, 06:55:29 pm by Nominal Animal »
 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #135 on: May 09, 2024, 07:13:43 pm »
Quote
My extremely limited experience with dual-core MCUs is that they are completely separate systems: they share the GPIO and some mailbox system to communicate with each other, but each core is in fact a separate MCU: separate memory, separate peripherals.  So each core effectively runs its own firmware, treats the other core(s) as independent entities, and uses the mailboxes to communicate.
This is also common when the cores are of different types.  For example, in Raspberry Pis, the ARM core and the VC core use mailboxes to communicate with each other.

It is also very common with all sorts of graphics controllers on ARM, x86, and x86-64.

I like the approach.  The main gotcha is fully, carefully and completely documenting the interface between them, and either sticking to it, or keeping the documentation ahead of the implementation.  Letting the documentation lapse is an utterly fatal failure in the long term.  I'm serious, no hyperbole.

The documentation serves a dual purpose: it describes the system for those who do not yet know, or have forgotten, some of the details; but it also forces those who do know to organize their thoughts and describe their knowledge of it.  Multi-core bugs typically arise from misunderstanding how the system is supposed to work, from the difference between what the programmer intended and what they wrote in code, or from the complex interactions between separate pieces of code (a typical example being locking order and locking schemes in general).  I really cannot stress enough the importance of correct, full documentation of the interfaces/boundaries.
« Last Edit: May 09, 2024, 07:16:54 pm by Nominal Animal »
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #136 on: May 09, 2024, 07:17:03 pm »
Quote
I swear I posted a reply, but it vanished into thin air!  :scared:

Next time, use a spinlock! (Joke)   :-DD

Quote
has led to me considering using more than one MCU in many of my designs, so that the main MCU doesn't need to be so powerful (so a cheaper and more energy-efficient MCU suffices)

That's a tricky one.  Technically speaking, especially from a software point of view, but also on the real-life power consumption figures and component costs (BOMs), you are right.

But in practice, I can imagine (if it was for a large department in a business) the upper managers and most senior technical staff saying no!

Because if you have a single, large MCU doing everything, that will be seen as just having one piece of software/firmware inside it.  So there is only one (albeit very large and complicated) piece of software to write, test and get working.

But suppose that instead you have 10 relatively small MCUs doing the same task.  Technically speaking, I accept that it could use a lot less power, cost a lot less, and allow much more efficient partitioning of the software development process, potentially making it cheaper, quicker and more reliable to create.

The upper management may raise the following concerns:

Instead of there being one single MCU that needs to work, there would be ten, so it could (in rough theory) be ten times more likely to break, and hence be less reliable.

There would (could) be ten sets of software/firmware to download, install, test, develop and maintain.  Hence ten smaller software teams, but with people sometimes leaving, vacations, sickness etc.  At any given time, one or more of those teams could be non-working, because the sole key player of that team is unavailable at the moment.

If MCU hardware malfunction was suspected as a cause of the current issues, one MCU would usually be a lot quicker and easier to replace than ten smaller ones.  But not always.

Instead of having one big set of manuals for the sole big MCU, there could be 4 or 5 different sets of manuals, one for each different MCU type.  That would be a lot of work for a new team member to absorb, especially at the beginning.

If there is a need for debugging, then as well as uncertainty over whether it is an issue with the software or the hardware, there would also be the question of which of the n (e.g. ten) MCUs is causing the issue.

In practice, communications systems risk being problematic: a source of very complicated, and perhaps exceedingly rare and/or difficult-to-reproduce, issues and bugs.
A big single MCU doesn't necessarily need to communicate with anything else (depending on the project requirements).
But a ten-MCU system, for example, often (but not always; they could be almost 100% independent for some projects) needs very extensive communications, which could have very hard-to-trace bugs or issues in the communications protocols.

Also, easily debugging such systems, and being able to check (trace) all those communications, could be extremely difficult, expensive and time-consuming.

E.g. what if every hundredth (1 in 100) time the system is switched on, it fails to 'boot up' and start reliably?  It would be a real pain switching it on and off hundreds of times.  It might take 30 seconds for it to fully boot up into its menus or whatever it does.  So, very time-consuming if there are problems.
Then the issue would be: which combination of those ten MCUs is causing the issue(s)?
If you want to manually flick through the software to see if you can spot any potential weaknesses, there could be ten sets of such software listings, which is harder and (probably) takes longer.

But overall, I accept that perhaps experimentation, and real-life experience of using those two alternative techniques, would help find out which is best (if any method is), rather than using human intuition and gut feelings, which are not always right.
 
The following users thanked this post: Nominal Animal

Offline JPortici

  • Super Contributor
  • ***
  • Posts: 3473
  • Country: it
Re: Memory model for Microcontrollers
« Reply #137 on: May 09, 2024, 07:20:12 pm »
Quote
Then if you use an RTOS with multicore support, you can select which task goes to which core, either at compile time or at runtime... FreeRTOS, for example, supports some kind of multicore multiprocessing.

That sounds like a good idea.  The extra resources of multiple cores, combined with the extra features of an OS such as the aforementioned FreeRTOS, sound interesting for some projects.  It would mean a project running on a very low-cost RP2040 Raspberry Pi Pico could have some of the functionality of its bigger brothers, such as the Raspberry Pi 5,
but without their bulky size, relatively heavy power consumption, and the difficulties that might arise because a full-blown Linux, Windows or similar OS is too much for most microcontroller projects.

I guess it depends on the task at hand... and on whether the cores are symmetric (identical) or not.
For example, it's very rare for me to need a variable number of threads at runtime, so I know in advance how many resources I need; hence I design the tasks (or the firmware) to reside on the appropriate core.

In case you wonder, it's common for multi-core MCUs to be asymmetric (for example M4+M0, Cortex-A + Cortex-M, or the dsPIC33CH, where the cores and/or peripherals are different), unless you go for something like Cortex-R, which is symmetric; but in that case you usually want to run the cores in lockstep mode.
 
The following users thanked this post: MK14, Nominal Animal

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #138 on: May 09, 2024, 07:29:28 pm »
Quote
I can imagine the upper managers and most senior technical staff saying no!
Definitely yes.  Which makes me want to :rant:

I often remind people that my own opinions are geared towards making better stuff and making things better, to learn to make better things; and not towards getting employed with a good salary or making a profitable business.  To me, there are too many of the latter already, and not nearly enough of the former, because in the current electronics and software markets the two are usually/typically mutually exclusive.  Feel free to disagree, too: the situation, needs, and balance vary from person to person.  Mine is just one approach among many.
 
The following users thanked this post: MK14

Online nctnico

  • Super Contributor
  • ***
  • Posts: 27152
  • Country: nl
    • NCT Developments
Re: Memory model for Microcontrollers
« Reply #139 on: May 09, 2024, 07:43:54 pm »
Quote
Because if you have a single, large MCU doing everything, that will be seen as just having one piece of software/firmware inside it.  So there is only one (albeit very large and complicated) piece of software to write, test and get working.

But suppose that instead you have 10 relatively small MCUs doing the same task.  Technically speaking, I accept that it could use a lot less power, cost a lot less, and allow much more efficient partitioning of the software development process, potentially making it cheaper, quicker and more reliable to create.
The problem with this approach is that partitioning into separate processes is very hard to get right.  Processes typically need to communicate at many different levels, which hugely complicates timing relationships.  In practice, separating into different processes often leads to more problems than it solves, so it needs to be considered very carefully.  My personal rule of thumb is to only split processes when their functionality really, truly needs to happen in parallel.  This also requires making a good structural design of how data is processed and how states are going to be handled.  Just starting to code is going to end in a mess, especially when teams do their own thing without coordination and specifications.
« Last Edit: May 09, 2024, 07:45:29 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #140 on: May 09, 2024, 08:00:21 pm »
Quote
The problem with this approach is that partitioning into separate processes is very hard to get right.  Processes typically need to communicate at many different levels, which hugely complicates timing relationships.  In practice, separating into different processes often leads to more problems than it solves, so it needs to be considered very carefully.  My personal rule of thumb is to only split processes when their functionality really, truly needs to happen in parallel.  This also requires making a good structural design of how data is processed and how states are going to be handled.  Just starting to code is going to end in a mess, especially when teams do their own thing without coordination and specifications.

You have raised some very good points, thanks!

That makes perfect sense.

I'd also like to add that, in my understanding, many projects end up changing during the project development phase.
Perhaps because the customers keep changing their minds and/or keep fiddling with the specifications (warning: those issues get multiplied by a large factor if any government department is involved!).

But also, while the thing (project) is being developed, the original ideas can need extensive reworking to work reliably and well.  So there can be lots of changes.

So, having one big MCU means that it can absorb many of these changes.

But as you appeared to state, splitting up the functionality between different units (MCUs) could lead to issues further down the line.

E.g. the big/main MCU has 512 KB of RAM and the smallest additional MCUs have only 96 bytes each.

So, if things change and all the code is in a sole, big MCU, and project changes mean the smaller tasks need an additional 256 bytes (because of buffers to support bigger and wider command queues or whatever), it is no problem: a few extra sets of 256 bytes is nothing when a total of 512 KB of RAM is available.

But if the smallest MCUs were specified as having only 96 bytes of RAM in total, that would be a major headache.
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4586
  • Country: gb
Re: Memory model for Microcontrollers
« Reply #141 on: May 09, 2024, 08:10:44 pm »
Quote
I guess it depends on the task at hand... and on whether the cores are symmetric (identical) or not.
For example, it's very rare for me to need a variable number of threads at runtime, so I know in advance how many resources I need; hence I design the tasks (or the firmware) to reside on the appropriate core.

In case you wonder, it's common for multi-core MCUs to be asymmetric (for example M4+M0, Cortex-A + Cortex-M, or the dsPIC33CH, where the cores and/or peripherals are different), unless you go for something like Cortex-R, which is symmetric; but in that case you usually want to run the cores in lockstep mode.

Now you mention it, I had only partially noticed that many MCUs are asymmetric.

I presume that the MCU manufacturers know what they are doing.  So either their own research and experience, or customer requests and requirements (especially from very big customers), or even practicalities of the manufacturing process (such as keeping overall power consumption within reasonable limits), is why they have done that.

Thanks for pointing it out.

Maybe there is a tendency for one of the processors to handle the real-time I/O and the related interrupts and timings.  So it doesn't need more complicated instructions, such as floating point, nor maximum performance, as a lot of its time might be 'wasted' in polling loops, waiting for interrupts to occur, and other housekeeping tasks.

But the other processor runs (perhaps) the algorithms and other heavy stuff, so it needs the best performance, floating point, and other extra bits and pieces.

I also wonder if it is done to reduce the licensing fees for the Arm CPU cores?
As the M0(+) is free or relatively cheap for some (I think), but the M4 etc. floating-point versions probably do get charged licence fees.
« Last Edit: May 09, 2024, 08:14:53 pm by MK14 »
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 27152
  • Country: nl
    • NCT Developments
Re: Memory model for Microcontrollers
« Reply #142 on: May 09, 2024, 08:17:38 pm »
The problem not only exists when using multiple CPUs (or even separate microcontrollers), but also when using pre-emptive (time-slicing) multitasking.  For a project I'm involved in, there is an ongoing effort to rewrite the entire codebase from scratch, due to a bad design of separate processes, sharing of information, and timing problems.  The original design simply is unfixable.

When using multiple physical CPUs / microcontrollers, you have the additional problem that you don't know the other unit's state, and that communication between the physical units will fail at some point.  This means you'll need to implement methods to recover from / deal with communication and node problems.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: MK14

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14620
  • Country: fr
Re: Memory model for Microcontrollers
« Reply #143 on: May 09, 2024, 08:33:55 pm »
Quote
Given the extensive, long and sometimes prohibitively expensive software development costs, it might be better to get a single-core CPU with much greater speed (clock frequency and/or IPC) for that single core, rather than handle all the complexities of coping with multiple cores.
I disagree, but my opinion is colored by my own experience, and having specialized in parallel and distributed computing (in HPC, simulations and such).

When you can separate the logical tasks for each core, the end result can be much simpler than the same functionality running on a single core.
The key difference is understanding and using the various mechanisms one can use for communications (and passing data) between the cores, and how to separate the "jobs" effectively.

Yes. I agree with that. But this "key" difference is not well known, even less so mastered, by most. That's unfortunate.

I find the RP2040 a very nice starting point for learning how to design embedded multi-core systems.  It has hardware spinlocks, a very simple mailbox between the 2 cores, and banked SRAM, which allows the 2 cores to access RAM concurrently when their data is placed in different banks.  That gives a lot to play with, while still being simple enough to grasp in just a few hours.

 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #144 on: May 09, 2024, 09:30:10 pm »
Quote
The problem with this approach is that partitioning into separate processes is very hard to get right.
Quote
The key difference is understanding and using the various mechanisms one can use for communications (and passing data) between the cores, and how to separate the "jobs" effectively.
Yes. I agree with that. But this "key" difference is not well known, even less so mastered, by most. That's unfortunate.

Both true.

This does require experience and an understanding of the possible approaches, and of their relative merits and downsides, definitely.  Yet the knowledge is not that complicated or massive; the volume of experience and understanding needed is not that big.

I do a lot of POSIX+GNU C development on Linux, and extremely often use POSIX threads.  (The default stack size is immense, so it is good practice to set a smaller/configurable stack size for your worker threads; it's just four or five lines of C more, as the sketch below shows.)
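
For concreteness, those few lines sketched out (the 64 KiB figure is just an example; it must be at least PTHREAD_STACK_MIN):

Code: [Select]
#include <pthread.h>

extern void *worker(void *arg);           /* your thread function */

int start_worker(pthread_t *tid, void *arg) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);  /* instead of the multi-MiB default */
    int err = pthread_create(tid, &attr, worker, arg);
    pthread_attr_destroy(&attr);
    return err;                           /* 0 on success, errno-style code otherwise */
}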
x86, x86-64, and Armv8.1-A also have richer atomic operations (not just LL/SC or cmpxchg, but atomic add/and/or/xor) exposed via compiler-provided atomic built-ins, which allow very interesting lockless thread-safe structures.

I suspect, but am not certain, that writing useful tools and experimenting with abstract data structures and with methods of splitting an overall workload into separate tasks/threads/processes in an effective and maintainable way, in the Linux/BSD/macOS POSIX+GNU C environment, is a useful way to learn parallel (and, if you add e.g. MPI, say OpenMPI, distributed) programming techniques useful for embedded programming.  Things like Unix domain stream sockets are pretty near perfect analogues for error-free full-duplex serial communications, and POSIX signal delivery behaves very similarly to interrupts (with selective masking/blocking of signals on a per-thread basis).

In comparison, I do not believe using e.g. the OpenMP extensions to C is useful at all.

I do test all my ideas in a fully-hosted C environment on Linux first, before writing them for a microcontroller.  Because I understand the difference between freestanding and hosted C environments, I know which C library features to avoid to keep the implementation portable to limited embedded environments, and which Linux/POSIX features "map" nicely to embedded features.  In many cases, I simulate (to a varying degree; sometimes just the user interface, sometimes everything) what I'd like the microcontroller to achieve, before even picking which microcontroller dev board I'll use!
Mostly, my projects interface to an SBC, though, and I haven't made any truly complex microcontroller projects yet.

I definitely believe the underlying problem (the difficulty of learning how to split the workload efficiently, and which synchronization and communication primitives to use) is an issue of a lack of well-known good examples, documentation, and other learning materials, and not of the concepts involved being hard per se.  It is very different, though, which means that the more experience you have in single-threaded C, the harder it can be to un-learn the details that work differently in a parallel or distributed environment, before one can truly learn and understand the new, different stuff.
 

Offline tellurium

  • Frequent Contributor
  • **
  • Posts: 267
  • Country: ua
Re: Memory model for Microcontrollers
« Reply #145 on: May 09, 2024, 10:27:14 pm »
So why don't you write a guide, guys, and share the knowledge you have.
Take rp2040 as an example, and create a simple lockless queue that two tasks on two cores can use to exchange data.
Open source embedded network library https://github.com/cesanta/mongoose
TCP/IP stack + TLS1.3 + HTTP/WebSocket/MQTT in a single file
 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #146 on: May 09, 2024, 11:19:59 pm »
Quote
Take rp2040 as an example, and create a simple lockless queue that two tasks on two cores can use to exchange data.
The RP2040 has Cortex-M0+ (ARMv6-M) cores, which do not have any kind of atomic read-modify-write support; lockless data structures cannot be implemented on them.  That's why they added the hardware spinlock support: software atomics are impossible on the Cortex-M0+.

Cortex-M3/M4/M7 (ARMv7-M) supports load-linked/store-conditional (LL/SC) atomics, via LDREX (load-exclusive) and STREX (store-exclusive); see e.g. the ARM Cortex-M3 documentation here.

Basically, LDREX loads (8-, 16-, or 32-bit) data from an address, and you have a few cycles to do something with it.  The following STREX to the same address succeeds only if the other core(s) did not access the same address in between (and the "do something with it" did not include executing a CLREX instruction); it takes both the address to store to in a register and a destination register that receives the success/failure flag.

With LDREX/STREX (or any other LL/SC implementation), "lockless" structures are inherently spinlocks.  For example, let's say you have a counter, and you want each caller to obtain a separate value, monotonically increasing without gaps:
Code: [Select]
unsigned int v = 0;

unsigned int next(void) {
    /* Atomically increments v and returns its previous value,
       so concurrent callers each obtain a distinct, gapless value. */
    return __atomic_fetch_add(&v, 1, __ATOMIC_SEQ_CST);
}
On x86 and x86-64, the function simplifies to loading 1 in some register and then doing a single lock xadd instruction.  The lock prefix means the processor will handle the synchronization across cores, and it will be atomic.  It can take some cycles longer than a normal xadd, but it will not spin.

On ARMv8.1-A, it simplifies to loading the address and 1 into some registers, and then doing a single ldaddal atomic-add instruction (with st-form aliases such as staddl when the result is not needed), which is equivalent to the lock xadd on x86/x86-64.

On Cortex-M3/M4/M7, it compiles to a loop around ldrex, adds, strex, with dmb (data memory barrier) instructions surrounding the loop.  These normally end up iterating only once, and I do not believe it is possible for multiple cores to stay in perfect lockstep for more than a few iterations (even if you try to), because of caches and the underlying bus hardware.  While it looks and acts atomic at the C level, I don't really consider this "lockless".

On Cortex-M0/M0+, you need to implement your own __atomic_fetch_add(), as the compiler just generates a call to it; the instruction set simply doesn't have a way to implement it inline.  The implementation on the RP2040 would use a dedicated spinlock (typically the same one for all "atomic" accesses) for the duration of the load-modify-store cycle.  It, too, normally spins/blocks only for the duration of one load-modify-store cycle, but isn't "lockless" by any interpretation.
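
As a hedged sketch (names mine, not from the SDK) of what such a spinlock-backed implementation might look like on the RP2040, using the Pico SDK's hardware spinlocks.  Note that spin_lock_blocking() also disables interrupts on the calling core, which keeps the short critical section from deadlocking against an ISR on the same core.

Code: [Select]
#include "hardware/sync.h"

/* One hardware spinlock shared by all emulated "atomic" operations. */
static spin_lock_t *atomic_lock;

void atomic_emulation_init(void) {
    atomic_lock = spin_lock_instance(spin_lock_claim_unused(true));
}

unsigned int my_atomic_fetch_add(volatile unsigned int *v, unsigned int n) {
    uint32_t irq = spin_lock_blocking(atomic_lock);  /* disable IRQs + take the lock */
    unsigned int old = *v;
    *v = old + n;
    spin_unlock(atomic_lock, irq);                   /* release lock, restore IRQs */
    return old;
}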
 
The following users thanked this post: MK14

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14620
  • Country: fr
Re: Memory model for Microcontrollers
« Reply #147 on: May 09, 2024, 11:26:38 pm »
Note that the Pico SDK does include such queues, already implemented for you using spinlocks.  You can have a look at the source code; they've done a decent job.

Links:
https://github.com/raspberrypi/pico-sdk/tree/master/src/common/pico_sync
https://github.com/raspberrypi/pico-sdk/tree/master/src/common/pico_util
« Last Edit: May 10, 2024, 12:23:31 am by SiliconWizard »
 
The following users thanked this post: MK14, Nominal Animal

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6369
  • Country: fi
    • My home page and email address
Re: Memory model for Microcontrollers
« Reply #148 on: May 09, 2024, 11:33:12 pm »
Quote
So why don't you write a guide, guys, and share the knowledge you have.
My own learning path was so different from that of those learning programming today that I don't know what the effective approach to learning this stuff is.  I don't even have any good books to recommend!

I have thought about it, though; a lot, actually.  The problem is I need a set of victims, er, learners to experiment on first, to see exactly which approaches and descriptions of the concepts involved actually work on today's programmers.

Besides, I'm still broken, with any kind of responsibility or making promises causing me to fail, for now.  I can answer questions and help, and do my own stuff if I don't show it to others or in public, but that's about it.
 
The following users thanked this post: MK14, JPortici

Online nctnico

  • Super Contributor
  • ***
  • Posts: 27152
  • Country: nl
    • NCT Developments
Re: Memory model for Microcontrollers
« Reply #149 on: May 10, 2024, 10:26:29 am »
Quote
So why don't you write a guide, guys, and share the knowledge you have.
Quote
My own learning path was so different from that of those learning programming today that I don't know what the effective approach to learning this stuff is.  I don't even have any good books to recommend!
Same here.  A good start would be to create a mental model people can use to 'visualise' how processes interact when they run in parallel (either on multiple CPUs and/or by time-slicing).  Thinking back to my EE studies, this topic wasn't explored very deeply in the software-engineering-related classes.  Maybe the problem is that educators have the idea that processes run in isolation, without any interaction between them.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

