Quote: "two or more, separate cores."

Usually you compile one set of sources into a single binary, so that stage of development doesn't know there is going to be more than one CPU. Stack and heap can be assigned to each CPU (or task) at runtime, but data, text, and rodata can be freely and randomly interleaved.
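On the Pico, for instance, you can see this directly: one binary, but the SDK lets you hand core 1 its own stack when you launch it. A minimal sketch, assuming the pico-sdk's pico/multicore.h API:

#include "pico/stdlib.h"
#include "pico/multicore.h"

static uint32_t core1_stack[1024];   /* core 1's stack: just an ordinary array in the image */

static void core1_entry(void) {
    for (;;) {
        /* core 1's work goes here */
        tight_loop_contents();
    }
}

void start_core1(void) {
    /* Same binary, same text/data/rodata, but this core gets its own
       stack, assigned at runtime. */
    multicore_launch_core1_with_stack(core1_entry, core1_stack,
                                      sizeof(core1_stack));
}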
The simplest locking primitives when you have more than one core in a microcontroller are usually spinlocks. They are named that because a core waiting for the spinlock to become free just spins in a tight loop.
The RP2040, for example, provides 32 hardware spinlocks. On other architectures, either an atomic compare-exchange or a load-linked/store-conditional pair is used (in a loop), with each core or task having its own identifying value that it tries to set the spinlock to; they spin in that loop until they succeed.
The main difficulty in implementation is the interaction with caches (cache coherency, atomic accesses), requiring careful construction for each architecture: a naïve implementation will look like it should work, but because of cache-coherency or memory-atomicity details, more than one core may believe it owns the spinlock at the same time! Also, if the cores have separate caches, the cache line containing the spinlock may end up being ping-ponged between RAM and the two contending cores, making the spinning part amazingly slow. This is why operating systems typically provide "better" locking primitives, which let other stuff run while tasks wait to grab a lock, and reserve spinlocks for very short durations, like copying a single variable or object.
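To make that concrete, here is a naïve-but-illustrative spinlock built on GCC's __atomic builtins, essentially the compare-exchange loop described above. The acquire/release orderings stand in for the architecture-specific barrier work; treat it as a sketch, not production code:

static volatile unsigned int lock = 0;   /* 0 = free, otherwise the owner's ID */

void spin_lock(unsigned int my_id) {
    unsigned int expected;
    /* This tight retry loop is the "spin"; under contention it is also
       where the cacheline ping-pong described above happens. */
    do {
        expected = 0;   /* the exchange only succeeds if the lock is free */
    } while (!__atomic_compare_exchange_n(&lock, &expected, my_id,
                                          0 /* strong */,
                                          __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
}

void spin_unlock(void) {
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
}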
Given the extensive, long, and sometimes prohibitively expensive software development costs, it might be better to get a single-core CPU with much greater speed (clock frequency and/or IPC) for that single core, rather than handle all the complexities of coping with multiple cores.
I disagree, but my opinion is colored by my own experience, and having specialized in parallel and distributed computing (in HPC, simulations and such).
When you can separate the logical tasks for each core, the end result can be much simpler than the same functionality running on a single core.
The key difference is understanding and using the various mechanisms one can use for communications (and passing data) between the cores, and how to separate the "jobs" effectively.
In other words, you do need to learn/know more programming techniques, but it is worth it in the end. Message passing, message queues, and mailboxes are extremely common and very often used in systems programming, especially in low-level graphics interfaces. Inter-core interrupts (where code on one core raises an interrupt on a designated other core, or on all other cores) are also useful, but more for synchronization than for data/message passing. RPis use a mailbox interface for communicating between the VC and the ARM cores, for example; see the sketch below. If you care to learn by combining the background idea/theory with experimentation, I assure you that you can get the hang of it quickly.
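For instance, the RP2040's SDK exposes the inter-core hardware FIFOs as exactly this kind of mailbox. A minimal sketch, assuming the pico-sdk's pico/multicore.h API:

#include "pico/stdlib.h"
#include "pico/multicore.h"

/* Core 1: wait for 32-bit words to arrive in the hardware mailbox FIFO. */
static void core1_entry(void) {
    for (;;) {
        uint32_t msg = multicore_fifo_pop_blocking();   /* blocks until data arrives */
        (void)msg;   /* ... act on msg ... */
    }
}

int main(void) {
    multicore_launch_core1(core1_entry);   /* start the second core */
    multicore_fifo_push_blocking(42);      /* send a message to core 1 */
    for (;;) tight_loop_contents();
}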
It does require some experience to design the separation between logical workloads in a parallel/multi-core system effectively, though.
If we look at historical precedents, the first parallelization primitives in programming languages revolved around parallelizing loops and such, which really isn't that useful in real life; we just had to learn the better ways, eventually.
While we are on this subject area, one question comes to mind.
On 'proper', larger CPUs/computers (i.e. NOT most small/medium microcontrollers), there are usually memory management units, virtual memory, and operating systems. So each significant process/program (and hence each individual core/thread) can appear, and be treated, as if it had an entire computer to itself, potentially starting at address $0000.., with little or no worry about conflicting/sharing with anything else.
With some exceptions, of course.
But (especially lower-end) microcontrollers typically don't have memory management units or virtual memory, and quite often don't run operating systems (with many exceptions).
Yet even at the low end of microcontrollers these days, e.g. the Raspberry Pi Pico (a pair of small Arm Cortex-M0+ cores), they can have two or more separate cores.
So how do the compilers, linkers, etc. cope with the possible conflicts, and the uncertainty over what the other core(s)/threads are using or not?
Or in other words: you could have two completely separate main()s (one running on one core, the other running on the second core), which MUST mostly use completely different locations in memory (except where they are intentionally sharing memory).
My guess would be that the resources get split up (using linker symbols etc.) into smaller sections, with each core using its own set of linker symbols. But that must get very complicated if one core runs a C program and the other runs a completely different language (if that is even allowed).
Fortunately, most/many programs on the Pi Pico are just fine using only a single core. So I've been fine, up to now.
Then, if you use an RTOS with multicore support, you can select which task goes to which core, either at compile time or at runtime. FreeRTOS, for example, supports a form of multicore multiprocessing; see the sketch below.
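For example, with the FreeRTOS SMP kernel you can pin a task to a core with an affinity mask at creation time. A sketch, assuming the SMP kernel's xTaskCreateAffinitySet API (needs configUSE_CORE_AFFINITY enabled; sensor_task is a made-up example):

#include "FreeRTOS.h"
#include "task.h"

static void sensor_task(void *params) {   /* hypothetical example task */
    (void)params;
    for (;;) { /* ... */ }
}

void create_pinned_task(void) {
    /* Bit n of the affinity mask means "allowed to run on core n". */
    xTaskCreateAffinitySet(sensor_task, "sensor",
                           configMINIMAL_STACK_SIZE, NULL,
                           tskIDLE_PRIORITY + 1,
                           (1u << 1),   /* pin to core 1 only */
                           NULL);
}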
I don't think there are any right or wrong answers as such.
Which (if I understood it correctly) is saying that splitting real-life software tasks/jobs (presumably in some, but not all, cases) between different CPUs (or threads) can be a useful way of partitioning a task down into efficiently sized 'chunks' for software developers to handle.
But there could be barriers, such as Amdahl's law:
https://en.wikipedia.org/wiki/Amdahl%27s_law
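To make that concrete: Amdahl's law says the speedup on N cores is 1 / ((1 - p) + p/N), where p is the fraction of the work that can be parallelized. So even if 90% of a job parallelizes perfectly, two cores give 1 / (0.1 + 0.9/2) ≈ 1.8x rather than 2x, and no number of cores can ever beat 10x.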
My extremely limited experience with dual-core MCUs is that they are completely separated systems: they share the GPIO and some mailbox system to communicate with each other, but otherwise each core is in fact a separate MCU: separate memory, separate peripherals. So each core effectively runs its own firmware, treats the other core(s) as independent entities, and uses the mailboxes to communicate.
I swear I posted a reply, but it vanished into thin air!
This has led me to consider using more than one MCU in many of my designs, so that the main MCU doesn't need to be so powerful (so a cheaper and more energy-efficient MCU suffices).
Quote: "Then, if you use an RTOS with multicore support, you can select which task goes to which core, either at compile time or at runtime. FreeRTOS, for example, supports a form of multicore multiprocessing."
That sounds like a good idea. The extra resources of multiple cores, combined with the extra features of an OS such as the aforementioned FreeRTOS, sound interesting for some projects. It would mean a project running on a very low-cost RP2040 Raspberry Pi Pico could have some of the functionality of its bigger brothers, such as the Raspberry Pi 5.
But without their bulky size, relatively heavy power consumption, and the difficulties that might arise because a full-blown Linux, Windows, or similar OS is too much for most microcontroller projects.
I can imagine the upper managers and most senior technical staff saying no!
Because if you have a single, large MCU doing everything, that will be seen as just one piece of software/firmware inside it. So there is only one (albeit very large and complicated) piece of software to write, test, and get working.
But if, instead, you have 10 relatively small MCUs doing the same task, that looks like ten separate programs to manage, even though, technically speaking, I accept that it could use a lot less power, cost a lot less, and allow much more efficient partitioning of the software development process, potentially making it cheaper, quicker, and more reliable to create.
The problem with this approach is that partitioning into separate processes is very hard to get right. Processes typically need to communicate at many different levels, which hugely complicates timing relationships. In practice, separating into different processes often leads to more problems than it solves, so it needs to be considered very carefully. My personal rule of thumb is to only split processes when their functionality really, truly needs to happen in parallel. This also requires making a good structural design of how data is processed and how states are going to be handled. Just starting to code is going to end in a mess, especially when teams do their own thing without coordination and specifications.
I guess it depends on the task at hand... and on whether the cores are symmetric (identical) or not.
For example, it's very rare for me to need a variable number of threads at runtime, so I know in advance how many resources I need; hence I design the tasks (or the firmware) to reside on the appropriate core.
In case you wonder, it's common for MCUs to be asymmetrical (for example M4+M0, Cortex-A + Cortex-M, or the dsPIC33CH, in which the cores and/or peripherals are different), unless you go for something like Cortex-R, which is symmetrical; but in that case you usually want to run the cores in lockstep mode.
Quote: "The problem with this approach is that partitioning into separate processes is very hard to get right."
Quote: "The key difference is understanding and using the various mechanisms one can use for communications (and passing data) between the cores, and how to separate the "jobs" effectively."

Yes. I agree with that. But this "key" difference is unfortunately not well known, even less so mastered, by most. That's unfortunate.
Take the RP2040 as an example, and create a simple lockless queue that two tasks on two cores can use to exchange data. A basic atomic building block:
unsigned int v = 0;

/* Atomically increment v and return its previous value; safe to call
   from both cores at once. */
unsigned int next(void) {
    return __atomic_fetch_add(&v, 1, __ATOMIC_SEQ_CST);
}
On x86 and x86-64, the function simplifies to loading 1 into a register and then doing a single lock xadd instruction. The lock prefix means the processor handles the synchronization across cores, and the whole read-modify-write is atomic. It can take some cycles longer than a normal xadd, but it will not spin.

So why don't you write a guide, guys, and share the knowledge you have?
Quote: "So why don't you write a guide, guys, and share the knowledge you have?"

My own learning path was so different from that of those learning programming today that I don't know what the effective approach to learning this stuff is. I don't even have any good books to recommend!
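For what it's worth, here is a minimal sketch of the kind of lockless queue mentioned above: a single-producer/single-consumer ring buffer, one task per core, built on the same GCC __atomic builtins as the counter example. Untested illustration only; a real RP2040 version still needs the care with memory details discussed earlier:

#define QSIZE 16u   /* capacity; must be a power of two */

static unsigned int buf[QSIZE];
static unsigned int head = 0;   /* written only by the producer core */
static unsigned int tail = 0;   /* written only by the consumer core */

/* Producer (e.g. core 0): returns 0 if the queue is full. */
int q_push(unsigned int item) {
    unsigned int h = __atomic_load_n(&head, __ATOMIC_RELAXED);
    unsigned int t = __atomic_load_n(&tail, __ATOMIC_ACQUIRE);
    if (h - t == QSIZE)
        return 0;                /* full */
    buf[h % QSIZE] = item;
    __atomic_store_n(&head, h + 1, __ATOMIC_RELEASE);   /* publish the item */
    return 1;
}

/* Consumer (e.g. core 1): returns 0 if the queue is empty. */
int q_pop(unsigned int *item) {
    unsigned int t = __atomic_load_n(&tail, __ATOMIC_RELAXED);
    unsigned int h = __atomic_load_n(&head, __ATOMIC_ACQUIRE);
    if (h == t)
        return 0;                /* empty */
    *item = buf[t % QSIZE];
    __atomic_store_n(&tail, t + 1, __ATOMIC_RELEASE);   /* free the slot */
    return 1;
}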