Author Topic: shared memory in multi-CPU systems: looking for books, docs, ...  (Read 5158 times)


Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #25 on: December 05, 2024, 02:15:28 am »
Yeah, they're not intended to run the same OS on both. One is meant to run Linux (or similar) and the other to run bare-metal programs.

Have you looked into the "mailbox"?

I use my Duo literally every day -- it's pretty much permanently plugged into one of my laptop's USB ports -- but only for testing user-mode Linux programs, so far.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #26 on: December 05, 2024, 12:17:17 pm »
...
So to access a shared RAM block, you have to properly invalidate/clean the respective caches, manually. Quite fun.

I think they were essentially meant to communicate via a mailbox and not directly share any RAM, ...

My limited understanding is that the HPC mob has standardised on message passing. The abstraction and the high-level code are usable on any architecture (with appropriate tiny hand-crafted primitives) and scale to thousands of processors. Caches and shared memory, on the other hand... not so much.
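For concreteness, a minimal sketch in standard MPI (not any particular machine's code): rank 0 hands a block of doubles to rank 1, and the same source runs unchanged whether there are 2 ranks or thousands.

Code: [Select]
/* Minimal MPI sketch: rank 0 sends a block of doubles to rank 1.
   Compile with mpicc, run with e.g. "mpirun -np 2 ./a.out". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double block[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        block[0] = 42.0;
        MPI_Send(block, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(block, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received, block[0] = %g\n", block[0]);
    }

    MPI_Finalize();
    return 0;
}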

There are 32-core-per-chip MCUs available at Digikey, and if 32 isn't enough you can transparently (at the software level) add more chips in "parallel".

The future is already here, but it isn't evenly distributed. Time for software to catch up.
« Last Edit: December 05, 2024, 12:22:44 pm by tggzzz »
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #27 on: December 05, 2024, 12:20:36 pm »
So to access a shared RAM block, you have to properly invalidate/clean the respective caches, manually. Quite fun.

It is a somewhat manageable task if you write the software yourself and it is relatively simple, but in practice this hardware situation has kept Linux and *BSD from running decently on the SGI IP28/R10K for 25 years (no hw support/trick at all for cache coherence), and that teaches something: people are not willing to spend that much time fixing hardware problems in software.

In fact, either you are paid to do it, or you don't do it, because it is a boring pain!
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #28 on: December 05, 2024, 01:13:09 pm »
My limited understanding is that the HPC mob has standardised on message passing.
It is usable on any architecture (with appropriate tiny hand-crafted primitives) and is scalable to thousands of processors.
Caches, on the other hand...

Cache is evil, but, worse still, "cache-coherent Non-Uniform Memory Architecture" is what I see in my MIPS5+ prototype:
  • "ccNUMA" system architecture
  • hw coherence manager
  • by default it is disabled, you can enable it by grounding a pin ... the first thing I did was solder a jumper to ground, so as to be 100% sure that it is always enabled!
  • local memories are shared in a Single-System Image (SSI)
  • single shared address space (64bit)
  • single copy of operating system (XINU in my case)
  • designed to scale to very large processor counts (128-192 ... only 8 in my case)
  • MIPS5+, 4-way super-scalar microprocessor
  • not compatible with MIPS-4 Instruction Set Architecture -> there are no AS/C compilers!
  • out-of-order execution -> 26 of the 800 user-manual pages are about "bad side effects"
  • two Floating-point Execution Units
  • two Fixed-point Execution Units
  • each queue can issue one instruction per cycle: { Add, Mul, MAC, Div }
  • each unit is controlled by an 8-entry Flt.Pt. queue
  • each unit can trap an exception
  • large virtual and physical address spaces
  • 52-bit virtual address (data)
  • 48-bit physical address (CPU side, 64-bit address register)
  • very large TLB pages: 64M, 256M and 1G page sizes!!!!
  • very wild cop0: when it receives an interrupt/exception it only sets a bit to say whether or not it has completed the current instruction, with the precise idea that the software will then work out what happened and compute the return address (PC_next rather than PC_curr). On top of that, if there is an exception all pending LL/SC sequences are cancelled (flushed), which is quite serious on the synchronization side because everything must be cancelled and resynchronized (see the sketch after this list)
  • kseg0 is uncached, you can address it to an experimental tr-mem
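To illustrate the LL/SC point above, here is a generic MIPS retry loop (GCC inline asm assumed, definitely not the prototype's own code): any exception between the LL and the SC drops the reservation, the SC fails, and the whole sequence has to be retried.

Code: [Select]
/* Generic MIPS LL/SC retry loop (GCC inline asm assumed), NOT the prototype's
   actual code: atomically increments *p. If an exception/interrupt lands
   between LL and SC, the reservation is lost, SC writes 0 into tmp, and the
   loop retries. */
#include <stdint.h>

static inline uint32_t atomic_inc(volatile uint32_t *p)
{
    uint32_t old, tmp;
    __asm__ __volatile__(
        "1:  ll     %0, %2      \n"   /* load-linked the current value       */
        "    addiu  %1, %0, 1   \n"   /* compute old + 1                     */
        "    sc     %1, %2      \n"   /* store-conditional: %1 = 1 if it hit */
        "    beqz   %1, 1b      \n"   /* reservation lost? retry             */
        "    nop                \n"   /* branch delay slot                   */
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        :
        : "memory");
    return old;
}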

Being a prototype, it's a full debugging challenge. It has extra hw to trace some of the activity hidden on chip.
Deep sub-micron features are difficult to observe in testers and impossible to observe in a system, especially in large systems with many processors.
The failure point may be difficult to find, and the exact failure condition must be recreated.

It's dead, a project they decided not to continue  :-//

Unfortunately I don't have all the documentation; worse still, it's a prototype that is not 100% working, and all the analysis tools are missing, which I'm sure they preferred to destroy rather than release.

One of the manuals references external documents that are missing:
+ Architecture Verification Programs (AVP)
+ Micro-architecture Verification Programs (MVP)
+ Random diagnostics from programmable random code generators (missing testbench)
+ self-checked and/or compared Diagnostics with a reference machine (which is also missing)
...
for which debugging equipment is mentioned, in particular one of the ICEs for which my company was contacted.

I literally saved a board and two DVDs from the hydraulic press, and I'm trying to understand how it works.
I'd like to recreate some of its features in HDL (the trmem, in particular, or any other good shared mem mechanism), obviously simplifying them.
The goal is to learn something and improve my own RISC-ish softcore toy.

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #29 on: December 05, 2024, 01:48:12 pm »
My limited understanding is that the HPC mob has standardised on message passing.
It is usable on any architecture (with appropriate tiny hand-crafted primitives) and is scalable to thousands of processors.
Caches, on the other hand...

Cache is evil, but, worse still, "cache-coherent Non-Uniform Memory Architecture" is what I see in my MIPS5+ prototype:
<horror story omitted for brevity>

Oh, yes indeed.

There is a long and inglorious history of hardware micro-architecture being exposed to software, on the presumption that software will be able to deal with it. The recent poster child for that was the Itanic.

Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #30 on: December 05, 2024, 02:30:54 pm »
Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.

This is the issue that worries me the most with 128-192 cores, but I have no idea how they solved it, nor how to try to do something myself.
The prototype board I play on only has 8 CPUs and offers the possibility of selectively disabling various things.

For now I have created my "comfort zone"

- cores forced to "execution in-order"
- ccNUMA used as little as possible
- everything runs in kseg0, so uncached
- all shared memory passes through trmem (I only have 32Kbyte per core)

so I can manage that monster in my spare time, but not exploit its potential  :D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #31 on: December 05, 2024, 02:46:52 pm »


This is a representation of my board.
I don't know how those "routers" are implemented.
It's something inside a big FPGA, and I don't have any HDL source.

In the documentation I found a note saying it was done in collaboration with IBM.
IBM was (is?) famous for having similar chips, used in mainframes and POWER workstations.
Too bad all that documentation is NOT public.
I have never been able to see a single manual or datasheet.

I use the default router settings, which do their job transparently: the address range is divided equally, with each CPU island getting a slice of the same size.
Surely this configuration can be reprogrammed, but I don't know exactly how to do it.
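For what it's worth, the "equal slice per island" decode would look roughly like this (purely my own guess at the arithmetic; the real router configuration is inside the FPGA and undocumented):

Code: [Select]
/* Hypothetical decode for 8 CPU islands splitting one 48-bit physical
   address space into equal slices: the island is just the top 3 bits. */
#include <stdint.h>

#define N_ISLANDS   8u
#define PHYS_BITS   48
#define SLICE_SIZE  (1ULL << (PHYS_BITS - 3))     /* total space / 8 islands */

static inline unsigned island_of(uint64_t paddr)
{
    return (unsigned)(paddr / SLICE_SIZE);        /* 0 .. 7                  */
}

static inline uint64_t island_base(unsigned island)
{
    return (uint64_t)island * SLICE_SIZE;         /* first address of slice  */
}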

Anyway, it is technology from 2004, about 20 years old in computing terms, which means that today things certainly do not work that way.
And since they abandoned the project... I also fear that this was not the right direction to take at all.

Dunno :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #32 on: December 05, 2024, 04:44:13 pm »
There is a long and inglorious history of hardware micro-architecture being exposed to software, on the presumption that software will be able to deal with it. The recent poster child for that was the Itanic.

Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.

In the end, if you have hundreds of cores contending on the same cache line you are going to have a bad day.

So the solution to this has been to reduce contention, which reduces cache coherence traffic.  A couple of the big ideas here are lock free queues and read copy update.  Both make heavy use of atomic instructions rather than mutex locks.  The problem with a mutex is that it creates cache invalidation even when there is no contention.

Read-copy-update allows read-mostly data structures to be completely contention-free for readers, while writers update with atomic operations (generally pointer swaps) and defer reclamation until after all other threads have synchronized for some other reason.
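A minimal sketch of that pointer-swap idea in plain C11 atomics (names are made up; deferred reclamation, the hard part, is omitted):

Code: [Select]
/* Minimal RCU-style "publish a new copy" sketch in plain C11 atomics.
   Readers never lock; the writer copies, updates, and publishes with a
   release store. Deferred reclamation of the old copy is OMITTED here;
   a real RCU (or hazard pointers) would handle that part. */
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct config { int threshold; int verbose; };

static _Atomic(struct config *) current_cfg;   /* set up once at init */

int reader_get_threshold(void)
{
    struct config *c = atomic_load_explicit(&current_cfg, memory_order_acquire);
    return c->threshold;                       /* contention-free read */
}

void writer_set_threshold(int value)
{
    struct config *old = atomic_load_explicit(&current_cfg, memory_order_relaxed);
    struct config *fresh = malloc(sizeof *fresh);
    memcpy(fresh, old, sizeof *fresh);         /* read ... copy ...    */
    fresh->threshold = value;                  /* ... update ...       */
    atomic_store_explicit(&current_cfg, fresh, memory_order_release);  /* publish */
    /* free(old) must be deferred until all readers have moved on (omitted) */
}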

Lock-free queues allow essentially zero-copy message passing.  You push a pointer through a queue to a consumer, which then takes ownership of the referenced object, but since the object is in shared memory there is no explicit copy.
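For example, a single-producer/single-consumer queue of pointers needs nothing more than two counters and acquire/release atomics; a generic C11 sketch, not taken from any particular library:

Code: [Select]
/* Single-producer/single-consumer lock-free queue of pointers: the producer
   pushes a pointer, the consumer takes ownership of the object, nothing is
   copied. Generic C11 sketch. */
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 256                                  /* power of two            */

struct spsc_queue {
    void *slot[QSIZE];
    _Atomic unsigned head;                         /* written by producer     */
    _Atomic unsigned tail;                         /* written by consumer     */
};

bool spsc_push(struct spsc_queue *q, void *msg)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return false;                              /* full                    */
    q->slot[h % QSIZE] = msg;                      /* hand over the pointer   */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

bool spsc_pop(struct spsc_queue *q, void **msg)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (h == t)
        return false;                              /* empty                   */
    *msg = q->slot[t % QSIZE];                     /* consumer now owns it    */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}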

Other important advances are improved avoidance of false sharing, bigger overall caches with more levels to allow faster synchronization between "nearby" cores.  Better runtime performance counters and smarter operating systems allow better scheduling affinity and page migration. Better tools for detecting race conditions allow developers to more easily write concurrent code. 

The basic idea is that you still have to write code to reduce shared mutable state as much as possible, just like when using non-coherent or message-passing systems, but when it comes time to synchronize it's still lower overhead to have the hardware do it.  And if you mess up and have contention, the symptom is that your code runs slower, and you have performance tools to identify hotspots.  With cache-incoherent systems the symptom is random, infrequent, and silent data corruption.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #33 on: December 05, 2024, 05:05:10 pm »
Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.

This is the issue that worries me the most with 128-192 cores, but I have no idea how they solved it, nor how to try to do something myself

What makes you think they did solve it, or that it is even solvable? :)

Software is an "externality", i.e. somebody else's problem.

Quote
The prototype board I play on only has 8 CPUs and offers the possibility of selectively disabling various things.

For now I have created my "comfort zone"

- cores forced to "execution in-order"
- ccNUMA used as little as possible
- everything runs in kseg0, so uncached
- all shared memory passes through trmem (I only have 32Kbyte per core)

so I can manage that monster in my spare time, but not exploit its potential  :D

You really ought to look at and understand the XMOS xCORE hardware, xC software, and toolchain.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #34 on: December 05, 2024, 05:26:16 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #35 on: December 05, 2024, 07:44:00 pm »


Ampere
192 cores
« Last Edit: December 05, 2024, 07:46:55 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #36 on: December 05, 2024, 08:29:43 pm »
That server is a strange design. The heat pipe assembly has narrowly spaced fins, which will need significant pressure to push the air through the gaps. However, I see no guiding to force the air through, rather than around, that assembly.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #37 on: December 05, 2024, 10:25:44 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.

What do you mean it's gone quiet?  Lock free queues especially are well developed and in wide use.  I'm not so sure about hashes.  I know that some uses of hash tables that are infrequently updated instead use lock-free trees with RCU.  In both cases, they are usually only used where you need very high parallelism -- otherwise mutex locks or spinlocks work just fine.
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #38 on: December 05, 2024, 10:50:13 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.

What do you mean it's gone quiet?  Lock free queues especially are well developed and in wide use.  I'm not so sure about hashes.  I know that some uses of hash tables that are infrequently updated instead use lock-free trees with RCU.  In both cases, they are usually only used where you need very high parallelism -- otherwise mutex locks or spinlocks work just fine.
Everything is moving towards high parallelism these days. If you start something now that you hope to have a long future, you should probably be looking at how the locks will work out with 1000 cores. It's usually pretty ugly, and you really need to keep those locks to a minimum.

15 to 20 years ago people were working on all kinds of lock free things. Queues are an easy one in many cases. Hashes are tough. Many things have defeated people. I don't expect to hear much about lock free queues these days, but I hope people are still working on other things. Even if they can't make them work, maybe they will identify enhancements to hardware that would make them work better in the future. It looks like we are in for more and more parallelism. We have to learn to make it work well for a wider range of tasks.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #39 on: December 05, 2024, 11:18:35 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.

What do you mean it's gone quiet?  Lock free queues especially are well developed and in wide use.  I'm not so sure about hashes.  I know that some uses of hash tables that are infrequently updated instead use lock-free trees with RCU.  In both cases, they are usually only used where you need very high parallelism -- otherwise mutex locks or spinlocks work just fine.
Everything is moving towards high parallelism these days. If you start something now that you hope to have a long future, you should probably be looking at how the locks will work out with 1000 cores. It's usually pretty ugly, and you really need to keep those locks to a minimum.

It has been moving that way for a few decades; the high performance computing mob have always been pushing technology and finding the pain points.

1000 cores indiscriminately sharing the same memory won't work. The nearest you can get is with barrel processors (e.g. Sun's Niagara T series), which are useful for "embarrassingly parallel" applications.

Message passing does scale, but applications need to find the sweet spot between
  • fine grained means smaller data transfers, but frequent setup/teardown can be painful
  • coarse grained means fewer data transfers, but more data has to be "separated" and transferred
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #40 on: December 05, 2024, 11:32:46 pm »
It has been moving that way for a few decades; the high performance computing mob have always been pushing technology and finding the pain points.
HPC workloads are very different from the highly parallel things which are growing now. You can sidestep most of the blockages with HPC workloads. Other workloads have a mass of choke points.
1000 cores indiscriminately sharing the same memory won't work. The nearest you can get is with barrel processors (e.g. Sun's Niagara T series) which are useful for "embarassingly parallel" applications.
It's a long time since people tried to have a lot of processors accessing a truly uniform memory system. With modern memory layouts it isn't really the memory you need to worry about; it's the traffic ensuring consistency in the memory. If you can get the number of locks way down, you can keep growing the CPU count for a wide range of loads.
Message passing does scale, but applications need to find the sweet spot between
  • fine grained means smaller data transfers, but frequent setup/teardown can be painful
  • coarse grained means fewer data transfers, but more data has to be "separated" and transferred
Unless the messages are flowing on a regular heartbeat, they eventually become a congestion point. The regular heartbeat approach suits highly regular processing, like HPC. It doesn't suit more complex workloads. Communication is just overhead, and latency.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #41 on: December 06, 2024, 02:06:47 pm »
Is there any good book about these new (to me) technologies?
Mine are all from 1995, 1997, and 2005.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #42 on: December 06, 2024, 04:55:12 pm »
Even if they can't make them work, maybe they will identify enhancements to hardware that would make them work better in the future.

One form of hardware assist that has a lot of interest but has been plagued with implementation problems is lock elision.  The basic idea is to take advantage of speculative execution to speculatively skip a lock acquisition.  If you get to the release with no memory conflict you basically skip the lock altogether.  If there is a conflict you throw out the speculative state, restart and acquire the lock properly.
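For reference, the classic elided-spinlock pattern looked roughly like this with Intel's RTM intrinsics (a sketch of the idea only; it needs -mrtm and a CPU that still has TSX enabled):

Code: [Select]
/* Classic lock-elision pattern on a tiny spinlock, using Intel RTM
   intrinsics (_xbegin/_xend/_xabort, compile with -mrtm, needs a CPU with
   TSX still enabled). Sketch of the idea only, not production code. */
#include <immintrin.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void elided_lock(spinlock_t *l)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Speculating: reading the lock word puts it in our read set, so a
           real owner taking the lock aborts this transaction. */
        if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0)
            return;                      /* run the critical section without the lock */
        _xabort(0xff);                   /* lock is held: abort, fall through         */
    }
    /* Fallback: transaction aborted (conflict, or the lock was held). */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;                                /* spin until we really own the lock         */
}

static void elided_unlock(spinlock_t *l)
{
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0)
        _xend();                         /* we were speculating: commit               */
    else
        atomic_store_explicit(&l->locked, 0, memory_order_release);
}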

One problem with lock free data structures is that writing everything in terms of cmpxchg can be tricky, and it doesn't take more than a handful of atomic operations before it's more efficient to just acquire a lock. Hardware lock elision / transactional memory lets you effectively create your own atomic operations, and leverages the hardware speculative execution engine to roll them back if they fail.
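As a small illustration of the "everything in terms of cmpxchg" style, here is a Treiber-style lock-free stack push as a C11 compare-exchange loop (a generic sketch, not from any particular codebase):

Code: [Select]
/* Treiber-style lock-free stack push: the whole update is one
   compare-exchange loop. Pop is where it gets tricky (ABA, reclamation),
   which is exactly the sort of thing elision/TM was meant to simplify. */
#include <stdatomic.h>

struct node { struct node *next; int value; };

void push(_Atomic(struct node *) *top, struct node *n)
{
    struct node *old = atomic_load_explicit(top, memory_order_relaxed);
    do {
        n->next = old;                   /* link in front of the current top */
    } while (!atomic_compare_exchange_weak_explicit(
                 top, &old, n,
                 memory_order_release, memory_order_relaxed));
}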

Unfortunately Intel's attempts to implement this had a lot of security problems and were disabled.  Newer Intel microarchitectures don't support it, and I think POWER abandoned it as well.  I'm not sure if it's an idea that will eventually come back or if it's effectively obsoleted by better alternatives.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #43 on: December 06, 2024, 05:08:03 pm »
Even if they can't make them work, maybe they will identify enhancements to hardware that would make them work better in the future.

One form of hardware assist that has a lot of interest but has been plagued with implementation problems is lock elision.  The basic idea is to take advantage of speculative execution to speculatively skip a lock acquisition.  If you get to the release with no memory conflict you basically skip the lock altogether.  If there is a conflict you throw out the speculative state, restart and acquire the lock properly.

I bet the people that dreamed that up were previously working on the Itanic!

Quote
One problem with lock free data structures is that writing everything in terms of cmpxchg can be tricky, and it doesn't take more than a handful of atomic operations before it's more efficient to just acquire a lock. Hardware lock elision / transactional memory lets you effectively create your own atomic operations, and leverages the hardware speculative execution engine to roll them back if they fail.

Unfortunately Intel's attempts to implement this had a lot of security problems and were disabled.  Newer Intel microarchitectures don't support it, and I think POWER abandoned it as well.  I'm not sure if it's an idea that will eventually come back or if it's effectively obsoleted by better alternatives.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #44 on: December 07, 2024, 06:53:37 am »
Yeah, they're not intended to run the same OS on both. One is meant to run Linux (or similar) and the other to run bare-metal programs.

Yep, but they do share all RAM, so that is kind of odd. They basically just threw in a second core, bridged it to the same bus, and off you go.

Have you looked into the "mailbox"?

Yep, implemented that as well. It's certainly handy for passing short messages and as a synchronization means. But each channel can only hold up to 8 bytes, so sharing large blocks of data this way is impractical. (To be fair, there are 8 channels available and you can use them freely, so even if less practical, you can always use all 8 channels simultaneously to pass up to 64 bytes per transaction. But that's still much slower and clunkier than directly sharing a data block, if said block is large.)

As I said, handling cache sync manually does work though, for directly sharing RAM. That's useful for larger buffers and not too complicated; it just takes a bit of work. (But yes, that would make it pretty inconvenient for sharing RAM in the context of a general-purpose OS.) So I mentioned it as an example of a system with no cache coherency at all and how it can be a good exercise.
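To sketch that flow (all function names below are hypothetical placeholders for whatever the SDK or CSRs actually provide, since the real API is sparsely documented): the producer cleans the buffer out to RAM before posting a short mailbox message, and the consumer invalidates before reading.

Code: [Select]
/* Manual coherency for a buffer shared between the two cores.
   Every function below is a HYPOTHETICAL placeholder, not the real SDK API;
   only the order of operations matters. The mailbox message here is a
   32-bit offset plus a 32-bit length, which fits one 8-byte channel. */
#include <stdint.h>
#include <stddef.h>

extern void dcache_clean_range(uintptr_t addr, size_t len);       /* write-back   */
extern void dcache_invalidate_range(uintptr_t addr, size_t len);  /* drop lines   */
extern void mbox_send(int chan, uint32_t w0, uint32_t w1);        /* 8-byte msg   */
extern void mbox_recv(int chan, uint32_t *w0, uint32_t *w1);
extern void fill_buffer(uint8_t *buf, uint32_t len);
extern void consume_buffer(const uint8_t *buf, uint32_t len);

#define CHAN_DATA 0

void producer_side(uint8_t *shared_base, uint32_t offset, uint32_t len)
{
    uint8_t *buf = shared_base + offset;
    fill_buffer(buf, len);                         /* produce into shared RAM   */
    dcache_clean_range((uintptr_t)buf, len);       /* push dirty lines to RAM   */
    mbox_send(CHAN_DATA, offset, len);             /* tiny message: where + how much */
}

void consumer_side(uint8_t *shared_base)
{
    uint32_t offset, len;
    mbox_recv(CHAN_DATA, &offset, &len);
    dcache_invalidate_range((uintptr_t)(shared_base + offset), len); /* drop stale lines */
    consume_buffer(shared_base + offset, len);     /* reads now come from RAM   */
}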

I'm pretty happy with those chips - overall, one can do a lot more than I even thought initially, completely baremetal, so that's a big win.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #45 on: December 07, 2024, 10:28:43 am »
Yeah, they're not intended to run the same OS on both. One is meant to run Linux (or similar) and the other to run bare-metal programs.

Yep, but they do share all RAM, so that is kind of odd. They basically just threw in a second core, bridged it to the same bus, and off you go.

Have you looked into the "mailbox"?

Yep, implemented that as well. It's certainly handy for passing short messages and as a synchronization means. But each channel can only hold up to 8 bytes, so sharing large blocks of data this way is impractical. (To be fair, there are 8 channels available and you can use them freely, so even if less practical, you can always use all 8 channels simultaneously to pass up to 64 bytes per transaction. But that's still much slower and clunkier than directly sharing a data block, if said block is large.)

As I said, handling cache sync manually does work though, for directly sharing RAM. That's useful for larger buffers and not too complicated; it just takes a bit of work. (But yes, that would make it pretty inconvenient for sharing RAM in the context of a general-purpose OS.) So I mentioned it as an example of a system with no cache coherency at all and how it can be a good exercise.

If you have a language designed presuming multicore operation, then the difference between mailboxes and shared memory can be handled transparently.

No need to hope that a random person implementing a design (and an even more random person maintaining/extending it) will not make a subtle mistake.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #46 on: December 07, 2024, 11:30:14 am »
I'm pretty happy with those chips - overall, one can do a lot more than I even thought initially, completely baremetal, so that's a big win.

you are one of the few who speaks well of it.
There are people who had harsh words about the fact that the documentation is poor  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #47 on: December 07, 2024, 11:48:00 am »
[ ref to mailbox ] is certainly handy for passing short messages and as a synchronization means.
But each channel can only hold up to 8 bytes, so sharing large blocks of data this way is impractical.

Yup, you always have to settle, nothing is ever ideal  :-//

With caches there are several problems, including one regarding the size of the cache block!

That is, the compiler's machine layer must take into account that when you invalidate a cache line you invalidate a "block of memory" of a certain size, and in that block there must be one and only ONE variable of interest and nothing else.

If you have two mutexes, they must never be allocated at contiguous addresses, or they will both end up in the same cache block, with disastrous consequences: if you SC that block, it's as if you were SC'ing both mutexes!

myC is not able to handle this issue correctly, so as a temporary solution every mutex that is the object of ll/sc instructions is byte-padded until the cache block size is filled (see the sketch below).

It wastes memory, but at least things work as they should.
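Something along these lines, in standard C/GCC rather than myC, with 64 bytes as an assumed block size:

Code: [Select]
/* Force each ll/sc lock word onto its own cache block so an SC (or an
   invalidation) on one cannot touch its neighbour. 64 is an ASSUMED block
   size; use the real one for the target. Standard C/GCC shown, not myC. */
#include <stdint.h>

#define CACHE_BLOCK 64

typedef union {
    volatile uint32_t lock;              /* the word ll/sc actually targets */
    uint8_t pad[CACHE_BLOCK];            /* fill out the rest of the block  */
} __attribute__((aligned(CACHE_BLOCK))) padded_mutex_t;

/* Adjacent mutexes now land in different cache blocks: no false sharing. */
static padded_mutex_t mtx_a, mtx_b;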
« Last Edit: December 07, 2024, 12:34:31 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline dobsonr741

  • Frequent Contributor
  • **
  • Posts: 706
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #48 on: December 07, 2024, 02:57:59 pm »
A little insight into the Apple M silicon architecture: https://developer.apple.com/videos/play/wwdc2020/10686 where not just the various CPU cores but also the GPU and AI cores share the same memory. It will not reveal the details of cache coherence, but you will hear how a developer on the platform can navigate it.
 
The following users thanked this post: DiTBho

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #49 on: December 09, 2024, 01:09:29 am »
I'm pretty happy with those chips - overall, one can do a lot more than I even thought initially, completely baremetal, so that's a big win.

you are one of the few who speaks well of it.
There are people who had harsh words about the fact that the documentation is poor  :o :o :o

A lot is left undocumented, but you can cross-reference info from the datasheets for the SG2000/2002 (which have a bit more info, but barely) and, of course, all the source code they made available. They also have complementary docs (although limited) in their GitHub repos.

That definitely took work and a fair bit of "reverse-engineering". But it is cheap and pretty powerful, and enough is documented (even if it's sparse and requires a lot of work) to be able to use it bare-metal (which was not the intent of the guys who made it available to the public), so that's still a pretty positive point. Try doing that with any typical SBC out there.

You get two RISC-V cores with performance similar to an i.MX RT1060, at less than 1/3 the price, and with 64MB of embedded DDR2. Actually, with the vector extension on the first core, plus the second core (without vector, but otherwise with the same double-precision FPU as the first), you get more performance. I have use cases in mind, in particular for audio processing applications. The second core can be entirely dedicated to UI stuff.

A few years back, I got interested in the K210 (which was also a dual-core 64-bit RISC-V MCU), but it had literally NO documentation at all (you just had an SDK and that's it), a datasheet that was barely a product brief, and frankly a pretty unhelpful tech team. The double-precision part of the FPU was permanently disabled in their SDK and nobody ever answered why. They had an FFT coprocessor, but limited to Q16, and I got shorter execution times for an FFT of the same size in double-precision FP on the CV1800B than I did with the coprocessor of the K210. So, there you have it.
 

