Author Topic: shared memory in multi-CPU systems: looking for books, docs, ...  (Read 5155 times)


Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
shared memory in multi-CPU systems: looking for books, docs, ...
« on: November 27, 2024, 09:06:19 pm »
hi
I'm studying multi-CPU systems that involve shared memory.

In particular MIPS4 superscalar systems { R10k, R12K, R14K, R16K, ... }, for which I have some documentation.
There are only a few sw&hw examples and the most recent is dated 1994.

2024 - 1994  :o :o :o

I wonder: how do modern { POWER, ARM, RISC-V, ... } systems with several cores work internally regarding shared ram?
Is there any study material?

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: WillTurner

Offline WillTurner

  • Regular Contributor
  • *
  • Posts: 66
  • Country: au
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #1 on: November 27, 2024, 10:25:01 pm »
Raspberry Pi Pico datasheet  :)
 
The following users thanked this post: dobsonr741

Offline cosmicray

  • Frequent Contributor
  • **
  • Posts: 343
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #2 on: November 27, 2024, 10:53:46 pm »
Apple sold some Xserve rack-mount systems around 2007-2009 that had dual Intel processors. A quick review suggests the main memory was shared, but I'm not positive about that.
it's only funny until someone gets hurt, then it's hilarious - R. Rabbit
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #3 on: November 27, 2024, 11:30:13 pm »
Well, my boss's Ampere SoC (arm64) internally has 128 cores. They somehow share slices of RAM. I don't know how.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #4 on: November 28, 2024, 12:43:56 am »
I see a lot of "I don't really know" answers...

On modern machines with up to 32 or 64 cores at least, the answer is that the RAM is shared equally between all CPUs, but caches are used to drastically decrease the number of times a CPU talks to RAM.

A pretty typical setup is:

- 32k or 64k each of very fast (~3 clock cycles access) private instruction and data cache on each CPU core

- 0.5 to 2 MB of L2 cache shared typically between a group of 4 cores

- if more than 4 cores, then typically about 1-2 MB/core of L3 cache. This will often be NUMA with each group of maybe 4 cores having fast access to 4 MB of L3 cache, but slower access to the other 28 MB or 60 MB etc of L3 if it's not being heavily used. Slower, but still two or more times faster than RAM.

In larger machines the RAM can also be split into chunks that are more quickly accessed by one cluster of cores, but also accessible more slowly (over some kind of bus) by cores in other clusters.
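
If you want to see those levels on your own machine, a quick-and-dirty pointer-chasing probe makes the L1/L2/L3/RAM steps visible as the working set grows. This is an illustrative sketch only, nothing vendor-specific:

Code: [Select]
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile void *sink;                       /* defeats dead-code elimination */

static double chase(size_t bytes, size_t iters)
{
    size_t n = bytes / sizeof(void *);
    void **buf  = malloc(n * sizeof *buf);
    size_t *idx = malloc(n * sizeof *idx);

    /* Build a random cyclic permutation so the hardware prefetcher can't help. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);      /* rand() is good enough for a demo */
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                          /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    free(idx);
    free(buf);
    return ns / (double)iters;                    /* average latency per dependent load */
}

int main(void)
{
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4)
        printf("%6zu KiB : %.1f ns/load\n", kib, chase(kib * 1024, 20u * 1000 * 1000));
    return 0;
}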

Tech web site reviews of new CPUs often examine such details e.g. here for the 2990wx that I used to build my work PC in early 2019.

https://www.legitreviews.com/amd-ryzen-threadripper-2990wx-processor-review_207135
 
The following users thanked this post: SiliconWizard, WillTurner

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #5 on: November 28, 2024, 01:13:33 pm »
I see a lot of "I don't really know" answers...

E.g. talking about MIPS4, the only documentation you can find publicly online is about R10K, which is documented as a four-way superscalar design that
implements register renaming and executes instructions out-of-order.

However, when you look at the implementations (e.g. workstations, servers), unless they are development SBCs where you pay to have all the documentation with both hw and sw examples (e.g. Wind River VxWorks SBCs), you don't really understand how things work.

Worse still, MIPS4 raises a problem where the "out-of-order" nature of the CPU meets the branch prediction and speculative execution of a purist RISC design: one or more instructions may begin execution during each cycle, and each instruction takes several or many cycles to complete, so when a branch instruction is decoded its branch condition may not yet be known. However, the R10000 processor can predict whether the branch is taken, and then continue decoding and executing subsequent instructions along the predicted path.

The problem with cache-based shared RAM shows up when a branch prediction is wrong, because the processor must back up to the original branch and take the other path. This technique is called "speculative execution": whenever the processor discovers a mispredicted branch, it aborts all speculatively executed instructions and restores the processor's state to the state it held before the branch.

And here, specifically, is the problem: the manual says the cache state is not restored, and this is clearly a side effect of speculative execution.

So, if the speculation involved a Store Conditional (SC): will it be restored? ...

Well ... it depends on whether there is an external coherence agent or not.
If there is not, then - the manual also says - if the cache is involved, it won't be restored, so this is a real mess that needs at least a "sw barrier".

-

This is how MIPS4 CPUs work, and there are different hw implementations of what is around the CPU.

Speaking of simple MIPS4-based systems with 1 CPU and 1 GFX, the CPU-to-{ COP2, GFX, DMA, ... } approach uses the cache as a mechanism for the CPU to know if a cell has been modified.

In particular there are:
---
  • "Non-Cache Coherent" systems, coherency is not maintained between CPU Caches and external Bus DMA. Which usually means cache writeback/invalidate needs to be performed by software (barrier + voodoo code) before/after { COP2, GFX, DMA, ... } requests.
  • "Cache Coherent" systems, coherency is maintained by an external agent that uses the multiprocessor primitives provided by the processor (LL/SC instructions in a conditional block) to maintain cache coherency for interventions and invalidations. External duplicate tags can be used by the external agent to filter external coherency requests and automatically update

There are MIPS4 implementations with 128 CPUs; they use NUMA (proprietary NUMAflex technology), but I can't find any documentation.
They certainly don't use simple hw mechanisms like a simple cache-coherency agent, nor do they use trivial dual-port RAM.

On modern machines with up to 32 or 64 cores at least, the answer is that the RAM is shared equally between all CPUs, but caches are used to drastically decrease the number of times a CPU talks to RAM.

The problem is how the "shared ram" is implemented.

In my case, I don't use any cache. Instead I use a simple dual-port RAM paired with a coherence agent embedded into the dual-port RAM.
I call it "tr-mem", as it's a simple "transactional memory".

It's not difficult to make a dual-port RAM; it becomes a little more difficult to make a quad-port RAM, which also exists as an ASIC, so it's still feasible and sensible, but I've never seen an octo-port RAM.

I think the 128-CPU system uses a ram model that leverages super fast packet networking and wired routers (cross-bar matrix?), coupled with cache systems with cache-coherence agents.

Dunno  :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #6 on: November 28, 2024, 01:24:03 pm »
(
the issue I was talking about is already handled in upstream GCC.
See the `-mr10k-cache-barrier=setting` option.
)
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #7 on: November 28, 2024, 01:46:34 pm »
The problem with cache-based shared ram is when a branch prediction is wrong, because the processor must back up to the original branch and take the other path. This technique is called "speculative execution", and whenever the processor discovers a mispredicted branch, it aborts all speculatively-executed instructions and restores the processor's state to the state it held before the branch.

And here specifically the problem - the manual says - the cache state is not restored, and this is clearly a side effect of speculative execution.

I think you must be confused.

A store that is executed speculatively can NOT be allowed to update the cache if it is aborted. I've never heard of any machine that would actually write to its private L1 cache and then undo it later, though that would be a valid implementation. The usual thing is to have a store queue which allows the queue to be partially flushed on a mis-speculation, and stores are not actually written to L1 cache (or RAM, for a write-through cache) until the store is no longer speculative.

Depending on the memory model, non-speculative stores that are after a speculative store in program order can be allowed to write to cache/RAM first. That's another story :-)

There is NO WAY that a speculative store can be allowed to update cache and then not have that undone on mis-speculation. JUST NO WAY. That computer will just asplode.

If the manual says "the cache state is not restored" then I expect that is in a section about speculative LOADS, which can load the data into the cache (possibly causing something else to be flushed to make room) even while they are still speculative, and on a mis-speculation you can leave the data that you didn't actually need in the cache.

There is no correctness problem with doing that, only a possible inefficiency in that data that you might turn out to need later got flushed to make room. And -- a concern from much more recently than the R10K was designed -- it creates a security timing side-channel.
 
The following users thanked this post: DiTBho

Offline radiogeek381

  • Regular Contributor
  • *
  • Posts: 133
  • Country: us
    • SoDaRadio
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #8 on: November 28, 2024, 06:34:43 pm »
First, try either of the Hennessy and Patterson textbooks.  In "Quantitative Approach" there's a section on memory consistency and another on coherence.

Second, I'd suggest you work on understanding the coherence scheme without burdening yourself with the dynamics of LL/SC.  I've worked on both Alpha and MIPS implementations, both have LL/SC, and there are more ways to do it wrong than to do it right. Start with the basic consistency mechanisms, then the implications of speculation.

For what it is worth, both Alpha and MIPS allowed state changes in the cache that were caused by a load on a bad (incorrectly speculated) path.  Once the address enters the memory/cache interface, any fill/displacement/victimization is going to happen.  If it weren't for some security implications, this would be a really good thing, as it allows a load following a mispredicted hammock to prefetch interesting data into the cache.  For that matter, a processor can do whatever prefetching into caches it wants to do, provided that the addresses are "valid." 

Stores, on the other hand, are different.  They must be "held" until the store instruction (and all previous instructions in program order) have retired. Smart designs squirrel the store away in a write buffer that is visible to the processor doing the write (so that a load after a store gets the new value even though the store hasn't retired yet.  Since the load can't retire until the store is ready to retire, there's no hazard in this speculation.). The write buffer entry is pushed to the cache/memory interface at some point in the future.

LL/SC is wayyyy more complicated.  The fundamental rules are:

LL on processor A sets some state that says "I did an LL to an address that matches some pattern".
SC on processor A fails unless its address matches the same pattern. It clears the LL state.
Any SC that arrives at A from some other processor clears the LL state.

No other operation from any other processor (even a write) can affect the LL state.

And here's where it gets really nasty.  What does it mean for an SC operation on processor A to occur before an SC arriving from some other processor? This is where lots of things get tied up in the processor pipeline, the cache coherence scheme, and all kinds of other things.  In addition, there are some nasty issues relating to instructions that may appear between the LL and SC.  I don't remember exactly what the restrictions were, but I know that Alpha's rules were the result of an extended discussion.
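
For concreteness, this is roughly what the LL/SC retry loop looks like from the programmer's side -- a minimal GCC inline-asm sketch for MIPS, purely illustrative and not taken from any of the manuals discussed here:

Code: [Select]
/* Atomically increments *p and returns the value the LL observed. */
static inline int atomic_inc_return_old(volatile int *p)
{
    int old, tmp;
    __asm__ __volatile__(
        "   .set  push        \n"
        "   .set  noreorder   \n"
        "1: ll    %0, %2      \n"   /* load-linked: sets the reservation          */
        "   addiu %1, %0, 1   \n"   /* compute the new value                      */
        "   sc    %1, %2      \n"   /* store-conditional: %1 = 1 ok, 0 = lost     */
        "   beqz  %1, 1b      \n"   /* reservation lost -> retry the whole block  */
        "   nop               \n"   /*   (branch delay slot)                      */
        "   .set  pop         \n"
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        :
        : "memory");
    return old;
}

Note that only a trivial ALU op sits between the LL and the SC, which keeps it inside the restrictions mentioned above.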

 
The following users thanked this post: DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #9 on: November 28, 2024, 08:21:38 pm »
+= Hennessy and Patterson, Quantitative Approach (kindle version) 
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: eeproks

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #10 on: November 28, 2024, 08:46:32 pm »
More than 20 years ago there was this old discussion - something shocking for anyone who thinks that cache coherence is handled transparently by the CPU.

At the time it was not well known that installing a MIPS4 R10K on a platform designed to host a MIPS4 R5K could lead to very serious kernel problems, whether NetBSD, OpenBSD or Linux.

There was no platform-side documentation, and even today there is only CPU-side documentation - and not even very comprehensive documentation - especially when you find that the MIPS R10K needs external hw to manage cache coherency, otherwise there are very serious problems: no hw examples!

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline radiogeek381

  • Regular Contributor
  • *
  • Posts: 133
  • Country: us
    • SoDaRadio
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #11 on: November 28, 2024, 10:48:09 pm »
Quote
More than 20 years ago there was this old discussion, something shocking, for anyone who thinks that cache coherence is handled transparently by the CPU.

That conversation was pretty peculiar to an R10K behavior and the interaction between speculation and writes to memory mapped IO.  Again, this is way into the weeds, and should not be seen as "normal" in the industry. And being SGI's first OOO design there are many things that they learned along the way.

The *implementation or mechanism* for the cache coherence logic is not part of an instruction set architectural specification. The *behavior* of the coherence logic is what belongs in the architectural specification.

An ISA spec is a commitment between the processor designer and the programmer. (There may be multiple sets of commitments based on system vs. user mode programming.) Even within a single architecture family the cache coherence scheme is likely different for each product in the line. Some configurations require directory based schemes. Some can snoop. Some have separate coherence networks. Others synchronize at a common point. For a given instruction set, all are required to show the same behavior relating to the apparent order of writes from processor X as sensed by processor Y. Coherence commitments and memory consistency commitments *are* part of the ISA contract. But their implementation is not.

Again, this is founded on my experience with Vax, Alpha, and MIPS so may not be representative of x86, Arm, or RISC V.

So, the stack looks like this:  Memory consistency model (very much an ISA component) -> Cache coherence model (an implementation detail, but must provide the stated memory consistency) -> semaphore signaling function (like LL/SC vs. active messages vs. global lock vs...)



****

as an aside: Writes to memory mapped IO space should never be committed until the associated store instruction is retireable. (Writes to any memory location must be predicated by this.) Machines that can't make that guarantee were probably not designed as intended. There is, however, a problem when a READ of an IO register produces a side-effect. Then mis-speculated loads can cause unexpected things to happen. But professional architects have long known that IO registers with READ side-effects are dangerous, and so their use is rare, and ought to be tied to barrier instructions.

For various other reasons, some architectures require a barrier of some sort between writes to IO space and other operations. This prevents a situation where a write to "LaunchRocket.GO" completes before the write to "Rocket.Destination" register. There are similar hazards associated with races between data/control buffer fills and interrupt notification.
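
A minimal sketch of that ordering rule in C; the register names, addresses and the wmb() macro are invented for the example (on MIPS a SYNC does the job, other ISAs have their own store barrier, e.g. DMB ST on Arm or fence w,o on RISC-V):

Code: [Select]
#include <stdint.h>

/* Hypothetical uncached (KSEG1-style) device registers, invented for the example. */
#define ROCKET_DEST (*(volatile uint32_t *)0xBF000000)
#define ROCKET_GO   (*(volatile uint32_t *)0xBF000004)

/* MIPS-flavoured write barrier: SYNC orders the two uncached stores. */
#define wmb() __asm__ __volatile__("sync" ::: "memory")

static void launch(uint32_t destination)
{
    ROCKET_DEST = destination;   /* program the destination register first        */
    wmb();                       /* make sure it cannot complete after...         */
    ROCKET_GO   = 1;             /* ...the write that actually triggers the launch */
}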

=====

As a general rule, memory systems should deal with loads as they arrive, not as they retire. This may change ownership or validity of cache blocks.

Memory systems should deal with stores when they are certain to be retired. This is the minimum: practical designs will include write buffers that furnish data to "local" loads appearing after the corresponding store in program order.

=====
 
The following users thanked this post: DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #12 on: November 28, 2024, 11:33:34 pm »
Even within a single architecture family the cache coherence scheme is likely different for each product in the line.
Some configurations require directory based schemes.
Some can snoop.
Some have separate coherence networks.


That's exactly the kind of book and docs I am looking for.
Real documented sw/hw examples on real platforms!

I should also look at the SMP of the PowerMac G4 and the C8K HPPA2, both 2xCPU.
In this case the documentation is (in?) the Linux kernel!
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #13 on: November 29, 2024, 12:09:50 am »
An ISA spec is a commitment between the processor designer and the programmer.

Exactly.

An implementation can do anything it likes, as long as the final result is guaranteed to be AS IF the code was run on the simplest non-pipelined, no-cache, multiple-CPI implementation.

Quote
Again, this is founded on my experience with Vax, Alpha, and MIPS so may not be representative of x86, Arm, or RISC V.

R10K and Pentium Pro were very much the pioneers in this stuff -- even eclipsing what mainframes had been doing -- and they made some mistakes which have since been learned from.

For how things *should* work, look at Aarch64 and RISC-V, which have both learned a lot of lessons from the pioneering R10K, Alpha, P6 work. ESPECIALLY, I think, RISC-V which has gone out of its way to bring in world experts from both industry and academia to get various parts of the design right.

A prime example is the memory model, which was done quite early on:

https://googlegroups.com/a/groups.riscv.org/group/isa-dev/attach/3434a200dbcc0/memory-model-spec.pdf?part=0.2

(that's also an appendix in the main ISA manual now)
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #14 on: November 29, 2024, 03:23:19 pm »
memory-model-spec

+=memory-model-spec.pdf?part=0.2 (kindle)

That's exactly what I am looking for, thanks!  :-+
edit:
Very very useful even for the toy architecture I work on from time to time
« Last Edit: November 29, 2024, 03:41:59 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #15 on: November 29, 2024, 03:39:29 pm »
experience with Vax, Alpha, and MIPS

on which MIPS-platforms?

mine:
- SGI IP30/R10K, single core, coherent
- SGI IP32/R10K, single core, non-coherent (sold)
- Atlas MIPS32R2, single core, coherent
- IDT-R2K, MIPS R2000 single core goldcap, N.A. doesn't have any cache controller
- FON2 MIPS32R2, 1xcore/SoC, coherent
- RSP MIPS32R2, 1xcore/SoC, coherent
- RB532 MIPS32R2, 1xcore/SoC, coherent
- RBM33G MIPS32R2, 2xcore/SoC, coherent
- MIPS5++ prototype, softcore/FPGA, N.A.  doesn't have any cache controller

supporting { UMon/fw, YaMon/fw, Linux/Kernel, XINU/Kernel }
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #16 on: November 29, 2024, 04:12:21 pm »
Typically, IO addresses are mapped to an "uncached region", so that "uncached reads" and "uncached writes" are guaranteed to be non-speculative.

However, there is a further *BIG* problem with the R10K: a speculative block read to an unpredictable physical address can occur if the CPU speculatively fetches data due to a Load or Jump Register instruction specifying a register that still holds a "garbage" value during the speculative phase.

So, speculative block accesses to load-sensitive I/O areas "can" present an unwanted side-effect.

I had to modify the machine layer of my myC compiler to ensure that a "garbage" register is never used around an IO operation.

It's easy to fix once you know about it, but it's a real pain if you don't even know that this problem exists.
I don't know how many hours I lost before I figured out why YaMon/R10K crashed every now and then while accessing the memory-mapped serial port.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #17 on: November 30, 2024, 03:57:46 pm »
...

In short the "R10K on DMA" problem may be summarized as follows:
- R10K may start executing load/store instructions that wouldn't be started by normal program flow
- this can happen due to mispredicted branch
- in "cache-coherent system" it would just cancel this instruction later on, and no one would notice anything.
- in "non cache-coherent", it is impossible to cancel loads or stores <------ it says "cancel", I guess from a pending IO queue
- that may result in cache line being marked as dirty.
- if there is DMA going on to the same location during that time, data will be lost after write-back.
- speculative loads are not as much of a problem
- they can be taken care of by extra cache invalidation op after each DMA.

A workaround, specific to a platform that comes with a special HW feature that causes any cached-coherent stores to all but the first (x) Mbyte of KSEG0 addresses to fail:
1) this allows KSEG0 to be configured as "cached coherent"
2) put the most important firmware/kernel code into the first (x) Mbyte of RAM
3) use HIGHMEM for the rest (in Linux kernel terms)
4) any memory that is the target of a DMA transfer will be removed from the TLB
5) and is thus inaccessible while DMA is in progress <-------------- BINGO!?!
6) (well no, you still have to) make sure it isn't brought back into the TLB by an accidental reference


That's MIPS R10K  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #18 on: November 30, 2024, 07:03:35 pm »
I wonder: how do modern { POWER, ARM, RISC-V, ... } systems with several cores work internally regarding shared ram?
Is there any study material?

To answer this specifically: all modern systems including those named are fully cache coherent.  Non cache coherent shared memory is pretty much dead in the realm of general purpose computing.  Instead, a combination of more cache levels to reduce global memory access and smarter software and operating systems are used to reduce contention.  Cache coherence basically means that a cache needs to have exclusive access to a given cache line to perform a write.  This is accomplished by cache invalidation messages exchanged over the processor interconnect.

Up through most high end desktop CPUs, most systems have a single memory controller (possibly dual channel) shared equally between all cores.  This main memory is symmetric.

Systems like the threadripper pro and high core count server CPUs, as well as most multi socket systems have multiple memory controllers attached to different core complexes / sockets.  This means that each CPU has its own locally attached memory and to talk to other memory it has to use the CPU to CPU interconnect.  This is still cache coherent, it just comes with a performance impact.  Page and thread migration is used to try to keep processors accessing local memory as much as possible.

Within the realm of cache coherence there is a fairly wide range of memory ordering guarantees.  Power and ARM have weaker memory ordering; SPARC and x86 have stronger memory ordering known as total store ordering, where all CPUs observe writes to main memory in a single global order.

You can think of this like so: pretty much any high performance processor has a store buffer.  It's one of the most valuable caches in a system because stores can trivially be deferred and blocking on stores would be a huge performance hit. When a CPU does a load, it checks the store buffer before the caches.  Total store ordering just means that 1) writes to the store buffer go in program order, 2) nothing in the store buffer can be visible to any other core, 3) commits from the store buffer become globally visible simultaneously via the cache coherence, and 4) commits from the store buffer must happen in order.  If the oldest write in the buffer is blocked due to a cache miss or something, no other writes can be committed.

ARM and Power for example don't guarantee all of these rules and don't have total store ordering.  This potentially allows avoiding some unnecessary stalls at the expense of requiring more explicit barriers when synchronization is needed.
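
A minimal C11 sketch of why that matters to the programmer: the release/acquire pair below is nearly free on a TSO machine, while on ARM/Power it is what actually emits the required barriers (illustrative only):

Code: [Select]
#include <stdatomic.h>
#include <stdint.h>

static uint32_t   payload;   /* plain data                     */
static atomic_int ready;     /* publication flag, starts at 0  */

void producer(uint32_t value)
{
    payload = value;                                          /* 1: write the data  */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* 2: then publish it */
}

int consumer(uint32_t *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) { /* saw the flag...        */
        *out = payload;                                       /* ...so step 1 is visible */
        return 1;
    }
    return 0;
}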

Device memory, in contrast to main memory, is often uncached.  When cached (for instance when used for shared memory with an accelerator) it's usually cache coherent as well.   The mechanism is generally different, since it's attached on an expansion bus like PCIe or CXL, but it is doing fundamentally the same thing: cache lines are either shared and read-only, or exclusive and writeable by a single agent.
 
The following users thanked this post: DiTBho

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #19 on: November 30, 2024, 08:12:50 pm »
One thing that has changed between the era of the R10K and now is that the memory controllers have moved on-chip or in-package.

Back in the 90s, processors exposed their front side bus and the system designer was responsible for hooking up the memory controller(s) and possibly the upper cache levels to the CPUs.  This meant that more of the implementation of cache coherence and memory ordering was up to the system designer not the CPU itself.  Different systems could have different guarantees even if using the same processor. 

Modern systems have their memory controllers and cache coherence controllers built into the processor, whether on the same die or a chiplet. 
 
The following users thanked this post: DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #20 on: November 30, 2024, 08:35:14 pm »
One thing that has changed between the era of R10k and now is that the memory controllers have moved on chip in package. 

Back in the 90s, processors exposed their front side bus and the system designer was responsible for hooking up the memory controller(s) and possibly the upper cache levels to the CPUs.  This meant that more of the implementation of cache coherence and memory ordering was up to the system designer not the CPU itself. 

On my IDT R2K development board the cache controller is an optional module that you can decide to install.
The default configuration of the dev kit doesn't include any cache, nor any cache controller.
You can also choose the CPU module: { R2000, R3000 }.
The FPU is also an external chip on a further module you can decide to install.

With MIPS32R2, you have "SoCs", all-in-one.

Different systems could have different guarantees even if using the same processor. 

just three examples of what I've worked on with SGI/MIPS
- IP28, MIPS R10K -> non-coherent, with no special HW feature that causes cached-coherent stores to all but the first (x) Mbyte of KSEG0 addresses to fail.
- IP32, MIPS R10K -> non-coherent, but with a special HW feature that causes any cached-coherent stores to all but the first (x) Mbyte of KSEG0 addresses to fail.
- IP30, MIPS R10K -> coherent, piece of cake
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Postal2

  • Frequent Contributor
  • **
  • !
  • Posts: 826
  • Country: 00
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #21 on: December 01, 2024, 06:38:45 am »
If anyone is reading this by chance in hopes of getting information, please refer to the Nvidia CUDA documentation.
 
The following users thanked this post: DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #22 on: December 04, 2024, 08:23:52 am »
Maybe also the book I posted in the FPGA area here on the forum - pipelined multi-core MIPS implementation and correctness - I took the paper version, which was delivered 4000 km from where I am now.

The core of that discussion is different from this specific discussion area, but it also covers multi-core, cache, and pipeline, and is specific to MIPS.

I have to get home.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #23 on: December 04, 2024, 08:19:52 pm »
OT, but interesting book

+= The Cray X-MP/Model 24: A Case Study in Pipelined Architecture and Vector Processing (Robbins)
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #24 on: December 05, 2024, 01:59:33 am »
As a maybe "interesting" (or fun) fact, the CV1800B (in the milkV duo boards) and probably similar for the other SoCs of the duo series (SG2000/2002) although I haven't tested the latter yet: there are two RISC-V cores inside, both have access to the main RAM, each has its own cache and there is absolutely no cache coherency mechanism. They are implemented as two completely separate CPUs (they even have the same hart ID!) that happen to share the same peripherals and main RAM.

So to access a shared RAM block, you have to properly invalidate/clean the respective caches, manually. Quite fun.

I think they were essentially meant to communicate via a mailbox and not directly share any RAM, which is why those are the only examples you'll find in the official SDK. But since I went baremetal with this SoC, I played with sharing RAM, and that was as barebones as it gets. A good exercise, although not exactly fun. Not ultra efficient either, since sharing large blocks involves invalidating/cleaning the cache either line by line or as a whole, which takes some time.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #25 on: December 05, 2024, 02:15:28 am »
Yeah, they're not intended to run the same OS on both. One is meant to run Linux (or similar) and the other to run bare-metal programs.

Have you looked into the "mailbox"?

I use my Duo literally every day -- it's pretty much permanently plugged into one of my laptop's USB ports -- but only for testing user-mode Linux programs, so far.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #26 on: December 05, 2024, 12:17:17 pm »
...
So to access a shared RAM block, you have to properly invalidate/clean the respective caches, manually. Quite fun.

I think they were essentially meant to communicate via a mailbox and not directly share any RAM, ...

My limited understanding is that the HPC mob has standardised on message passing. The abstraction and high level code is usable on any architecture (with appropriate tiny hand-crafted primitives) and is scalable to thousands of processors. Caches and shared memory on the other hand... not so much.
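
For anyone who hasn't seen it, this is what that message-passing style looks like in MPI -- a minimal, purely illustrative sketch where nothing is shared and rank 0 explicitly sends a block to rank 1:

Code: [Select]
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, data[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            data[0] = 42;
            MPI_Send(data, 1024, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit copy */
        } else if (rank == 1) {
            MPI_Recv(data, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got %d\n", data[0]);
        }
    }

    MPI_Finalize();
    return 0;
}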

There are 32core/chip MCUs available at Digikey, and if 32 isn't enough you can transparently (at the software level) add more chips in "parallel". 

The future is already here, but it isn't evenly distributed. Time for software to catch up.
« Last Edit: December 05, 2024, 12:22:44 pm by tggzzz »
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #27 on: December 05, 2024, 12:20:36 pm »
So to access a shared RAM block, you have to properly invalidate/clean the respective caches, manually. Quite fun.

It is a somewhat manageable task if you write the software yourself and it is relatively simple, but in fact this hardware situation has not allowed Linux and *BSD to run decently on the SGI IP28/R10K for 25 years (no hw support/trick at all for the cache coherence), and this teaches that people are not willing to spend so much time fixing hw things in software.

In fact, either you are paid to do it, or you don't do it, because it is a boring pain!
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #28 on: December 05, 2024, 01:13:09 pm »
My limited understanding is that the HPC mob has standardised on message passing.
It is usable on any architecture (with appropriate tiny hand-crafted primitives) and is scalable to thousands of processors.
Caches, on the other hand...

Cache is evil, but, worse still, "cache-coherent Non-Uniform Memory Architecture" is what I see in my MIPS5+ prototype:
  • "ccNUMA" system architecture
  • hw coherence manager
  • by default it is disabled, you can enable it by grounding a pin ... the first thing I did was solder a jumper to ground, so as to be 100% sure that it is always enabled!
  • local memories are shared in a Single-System Image (SSI)
  • single shared address space (64bit)
  • single copy of operating system (XINU in my case)
  • designed to scale to very large processor counts (128-192 ... only 8 in my case)
  • MIPS5+, 4-way super-scalar microprocessor
  • not compatible with MIPS-4 Instruction Set Architecture -> there are no AS/C compilers!
  • out-of-order execution -> 26 of the 800 UM pages are about "bad side effects"
  • two Floating-point Execution Units
  • two Fixed-point Execution Units
  • each queue can issue one instruction per cycle: { Add, Mul, MAC, Div }
  • each unit is controlled by an 8-entry Flt.Pt. queue
  • each unit can trap an exception
  • large virtual and physical address spaces
  • 52-bit virtual address (data)
  • 48-bit physical address (CPU side, 64-bit address register)
  • very large TLB page sizes: 64M, 256M and 1G!!!!
  • very wild cop0: when it receives an interrupt/exception, it only sets a bit to say whether or not it has completed the current instruction, with the precise idea that the software will then work out what happened and therefore calculate the return address, PC_next rather than PC_curr. In addition, if there is an exception all LL/SC sequences are cancelled (flushed), which is quite serious on the synchronization side because everything must be cancelled and resynchronized
  • kseg0 is uncached; you can point it at an experimental tr-mem

Being a prototype, it's a full debugging challenge. It has extra hw to trace some of the activity hidden on-chip.
Deep sub-micron features are difficult to observe on testers, impossible in a system, especially in large systems with many processors.
The failure point may be difficult to find, and the exact failure condition must be recreated.

It's dead, a project they decided not to continue  :-//

Unfortunately I don't have all the documentation; worse still, it's a prototype that is not 100% working, and all the analysis tools are missing, which I'm sure they preferred to destroy rather than release.

One of the manuals cites external documents that are missing:
+ Architecture Verification Programs (AVP)
+ Micro-architecture Verification Programs (MVP)
+ Random diagnostics from programmable random code generators (missing testbench)
+ self-checked and/or compared Diagnostics with a reference machine (which is also missing)
...
for which debugging equipment is mentioned. In particular one of the ICEs for which my company was contacted.

I literally saved a board and two DVDs from the hydraulic press, and I'm trying to understand how it works.
I'd like to recreate some of its features in HDL (the trmem, in particular, or any other good shared mem mechanism), obviously simplifying them.
The goal is to learn something and improve my own RISC-ish softcore toy.

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #29 on: December 05, 2024, 01:48:12 pm »
My limited understanding is that the HPC mob has standardised on message passing.
It is usable on any architecture (with appropriate tiny hand-crafted primitives) and is scalable to thousands of processors.
Caches, on the other hand...

Cache is evil, but, worse still, "cache-coherent Non-Uniform Memory Architecture" is what I see in my MIPS5+ prototype:
<horror story omitted for brevity>

Oh, yes indeed.

There is a long and inglorious history of hardware micro-architecture being exposed to software, on the presumption that software will be able to deal with it. The recent poster child for that was the Itanic.

Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the cache coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #30 on: December 05, 2024, 02:30:54 pm »
Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the cache coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.

This is the issue that worries me the most with 128-192 cores, but I have no idea how they solved it, nor how to try to do something myself.
The prototype board I play on only has 8 CPUs and offers the possibility of selectively disabling various things.

For now I have created my "comfort zone"

- cores forced to "execution in-order"
- ccNUMA used as little as possible
- everything runs in kseg0, so uncached
- all shared memory passes through trmem (I only have 32Kbyte per core)

so I can manage that monster in my spare time, but not exploit its potential  :D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #31 on: December 05, 2024, 02:46:52 pm »


This is the representation of my board.
I don't know how those "routers" are implemented.
It's something into a big FPGA. I don't have any HDL source.

In the documentation I found a note that says it was developed in collaboration with IBM.
IBM was (is?) famous for having similar chips, used in mainframes and POWER workstations.
Too bad all the documentation is NOT public.
I have never been able to see a single manual or datasheet.

I use the default router settings, which do their job transparently; the address range is divided equally, with each CPU island getting a slice of the same size.
Surely this configuration can be reprogrammed, but I don't know exactly how to do it.

Anyway, it is technology from 2004 - about 20 years of computing ago - which means that today things certainly do not work that way.
And since they abandoned the project... I also fear that this was not the right direction to take at all.

Dunno :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #32 on: December 05, 2024, 04:44:13 pm »
There is a long and inglorious history of hardware micro-architecture being exposed to software, on the presumption that software will be able to deal with it. The recent poster child for that is was the Itanic.

Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the cache coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.

In the end, if you have hundreds of cores contending on the same cache line you are going to have a bad day.

So the solution to this has been to reduce contention, which reduces cache coherence traffic.  A couple of the big ideas here are lock-free queues and read-copy-update.  Both make heavy use of atomic instructions rather than mutex locks.  The problem with a mutex is that it creates cache invalidation even when there is no contention.

Read-copy-update allows read mostly data structures to be completely contention free for readers, and writers update with atomic operations (generally pointer swaps) and defer reclamation until after all other threads have synchronized for some other reason.

Lock free queues allow essentially zero copy message passing.  You push a pointer through a queue to a consumer which then takes ownership of the referenced object, but since the object is in shared memory there is no explicit copy.
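
A minimal single-producer/single-consumer sketch of such a queue in C11, passing pointers as described above (sizes and names are just illustrative):

Code: [Select]
#include <stdatomic.h>
#include <stddef.h>

#define QSIZE 256                         /* must be a power of two */

struct spsc_queue {
    void          *slot[QSIZE];
    _Atomic size_t head;                  /* written only by the consumer */
    _Atomic size_t tail;                  /* written only by the producer */
};

/* Producer side: returns 0 if the queue is full. */
static int spsc_push(struct spsc_queue *q, void *msg)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE)
        return 0;                         /* full */
    q->slot[t & (QSIZE - 1)] = msg;       /* hand over the pointer, no copy */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns NULL if the queue is empty. */
static void *spsc_pop(struct spsc_queue *q)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return NULL;                      /* empty */
    void *msg = q->slot[h & (QSIZE - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return msg;
}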

Other important advances are improved avoidance of false sharing, bigger overall caches with more levels to allow faster synchronization between "nearby" cores.  Better runtime performance counters and smarter operating systems allow better scheduling affinity and page migration. Better tools for detecting race conditions allow developers to more easily write concurrent code. 

The basic idea is that you still have to write code to reduce shared mutable state as much as possible just like when using non coherent or message passing systems but when it comes time to synchronize it's still lower overhead to have the hardware do it.  And if you mess up and have contention, the symptom is your code runs slower and you have performance tools to identify hotspots.  With cache incoherent systems the symptom is random infrequent and silent data corruption.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #33 on: December 05, 2024, 05:05:10 pm »
Or, with cache coherence implemented in hardware, once you have "too many" cores+caches the cache coherence traffic expands exponentially and eventually dominates the traffic and memory latency. IIRC that hit Andy Bechtolsheim's attempts to have too many interconnected AMD Opterons.

This is the issue that worries me the most with 128-192 cores, but I have no idea how they solved it, nor how to try to do something myself

What makes you think they did solve it, or that it is even solvable? :)

Software is an "externality", i.e. somebody else's problem.

Quote
The prototype board I play on only have 8 CPUs and offers the possibility of selectively disabling various things

For now I have created my "comfort zone"

- cores forced to "execution in-order"
- ccNUMA used as little as possible
- everything runs in kseg0, so uncached
- all shared memory passes through trmem (I only have 32Kbyte per core)

so I can manage that monster in my spare time, but not exploit its potential  :D

You really ought to look at and understand the XMOS xCORE hardware, xC software, and toolchain.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #34 on: December 05, 2024, 05:26:16 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #35 on: December 05, 2024, 07:44:00 pm »


Ampere
192 cores
« Last Edit: December 05, 2024, 07:46:55 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #36 on: December 05, 2024, 08:29:43 pm »
That server is a strange design. The heat pipe assembly has narrowly spaced fins, which will need significant pressure to push the air through the gaps. However, I see no guiding to force the air through, rather than around, that assembly.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #37 on: December 05, 2024, 10:25:44 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.

What do you mean it's gone quiet?  Lock free queues especially are well developed and in wide use.  I'm not so sure about hashes.  I know that some uses of hash tables that are infrequently updated instead use lock-free trees with RCU.  In both cases, they are usually only used in cases where you need very high parallelism -- otherwise mutex locks or spinlocks work just fine.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #38 on: December 05, 2024, 10:50:13 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.

What do you mean it's gone quiet?  Lock free queues especially are well developed and in wide use.  I'm not so sure about hashes.  I know that some uses of hash tables that are infrequently updates instead use lock-free trees with RCU.  In both cases, they are usually only used in cases where you need very high parallelism -- otherwise mutex locks or spinlocks work just fine.
Everything is moving towards high parallelism these days. If you start something now that you hope to have a long future you should probably be looking at how the locks will work out with 1000 cores. It's usually pretty ugly, and you really need to keep those locks to a minimum.

15 to 20 years ago people were working on all kinds of lock free things. Queues are an easy one in many cases. Hashes are tough. Many things have defeated people. I don't expect to hear much about lock free queues these days, but I hope people are still working on other things. Even if they can't make them work, maybe they will identify enhancements to hardware that would make them work better in the future. It looks like we are in for more and more parallelism. We have to learn to make it work well for a wider range of tasks.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #39 on: December 05, 2024, 11:18:35 pm »
A couple of the big ideas here are lock free queues and read copy update.
Lock free algorithms - lock free queues, lock free hashes, etc. - were a hot topic 20 years ago. It has gone very quiet, which is sad. People like Cliff Click were quite prominent in lock free hashes for an obvious reason. He was with a company making hardware with a huge number of cores, and the traffic associated with locking was killing them. As we now have 96 cores in a single AMD x86_64 package the problem Cliff Click faced nearly 20 years ago is now with all of us.

What do you mean it's gone quiet?  Lock free queues especially are well developed and in wide use.  I'm not so sure about hashes.  I know that some uses of hash tables that are infrequently updates instead use lock-free trees with RCU.  In both cases, they are usually only used in cases where you need very high parallelism -- otherwise mutex locks or spinlocks work just fine.
Everything is moving towards high parallelism these days. If you start something now that you hope to have a long future you should probably be looking at how the locks will work out with 1000 cores. It's usually pretty ugly, and you really need to keep those locks to a minimum.

It has been moving that way for a few decades; the high performance computing mob have always been pushing technology and finding the pain points.

1000 cores indiscriminately sharing the same memory won't work. The nearest you can get is with barrel processors (e.g. Sun's Niagara T series) which are useful for "embarrassingly parallel" applications.

Message passing does scale, but applications need to find the sweet spot between
  • fine grained means smaller data transfers, but frequent setup/teardown can be painful
  • coarse grained means fewer data transfers, but more data has to be "separated" and transferred
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #40 on: December 05, 2024, 11:32:46 pm »
It has been moving that way for a few decades; the high performance computing mob have always been pushing technology and finding the pain points.
HPC workloads are very different from the highly parallel things which are growing now. You can sidestep most of the blockages with HPC workloads. Other workloads have a mass of choke points.
1000 cores indiscriminately sharing the same memory won't work. The nearest you can get is with barrel processors (e.g. Sun's Niagara T series) which are useful for "embarassingly parallel" applications.
It's a long time since people tried to have a lot of processors accessing a truly uniform memory system. With modern memory layouts it isn't really the memory you need to worry about. It's the traffic ensuring consistency in the memory. If you can get the number of locks way down, you can keep growing the CPU count for a wide range of loads.
Message passing does scale, but applications need to find the sweet spot between
  • fine grained means smaller data transfers, but frequent setup/teardown can be painful
  • coarse grained means fewer data transfers, but the more data has to be "separated" and transferred
Unless the messages are flowing on a regular heartbeat, they eventually become a congestion point. The regular heartbeat approach suits highly regular processing, like HPC. It doesn't suit more complex workloads. Communication is just overhead, and latency.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #41 on: December 06, 2024, 02:06:47 pm »
Is there any good book about these new (to me) technologies?
Mine are all from 1995-1997-2005.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #42 on: December 06, 2024, 04:55:12 pm »
Even if they can't make them work, maybe they will identify enhancements to hardware that would make them work better in the future.

One form of hardware assist that has a lot of interest but has been plagued with implementation problems is lock elision.  The basic idea is to take advantage of speculative execution to speculatively skip a lock acquisition.  If you get to the release with no memory conflict you basically skip the lock altogether.  If there is a conflict you throw out the speculative state, restart and acquire the lock properly.

One problem with lock free data structures is that writing everything in terms of cmpxchg can be tricky, and it doesn't take more than a handful of atomic operations before it's more efficient to just acquire a lock. Hardware lock elision / transactional memory lets you effectively create your own atomic operations, and leverages the hardware speculative execution engine to roll them back if they fail.
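
As a tiny example of the cmpxchg style, here is a minimal lock-free push onto a singly-linked stack in C11 (a Treiber stack; the node type is invented for the sketch, and a production version would also have to deal with the ABA problem on pop):

#include <stdatomic.h>

struct node {
    struct node *next;
    int value;
};

/* 'top' points at the atomic head pointer of the stack. */
void push(_Atomic(struct node *) *top, struct node *n)
{
    struct node *old = atomic_load(top);
    do {
        n->next = old;   /* link against the snapshot we last saw */
    } while (!atomic_compare_exchange_weak(top, &old, n));   /* retry on conflict */
}

Each retry costs another atomic operation, which is exactly why a handful of these quickly becomes more expensive than just taking a lock.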

Unfortunately Intel's attempts to implement this had a lot of security problems and were disabled.  Newer Intel microarchitectures don't support it, and I think POWER abandoned it as well.  I'm not sure if it's an idea that will eventually come back or if it's effectively obsoleted by better alternatives.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #43 on: December 06, 2024, 05:08:03 pm »
Even if they can't make them work, maybe they will identify enhancements to hardware that would make them work better in the future.

One form of hardware assist that has a lot of interest but has been plagued with implementation problems is lock elision.  The basic idea is to take advantage of speculative execution to speculatively skip a lock acquisition.  If you get to the release with no memory conflict you basically skip the lock altogether.  If there is a conflict you throw out the speculative state, restart and acquire the lock properly.

I bet the people that dreamed that up were previously working on the Itanic!

Quote
One problem with lock free data structures is that writing everything in terms of cmpxchg can be tricky, and it doesn't take more than a handful of atomic operations before it's more efficient to just acquire a lock. Hardware lock elision / transactional memory lets you effectively create your own atomic operations, and leverages the hardware speculative execution engine to roll them back if they fail.

Unfortunately Intel's attempts to implement this had a lot of security problems and were disabled.  Newer Intel microarchitectures don't support it, and I think POWER abandoned it as well.  I'm not sure if it's an idea that will eventually come back or if it's effectively obsoleted by better alternatives.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #44 on: December 07, 2024, 06:53:37 am »
Yeah, they're not intended to run the same OS on both. One is meant to run Linux (or similar) and the other to run bare-metal programs.

Yep, but they do share all RAM, so that is kind of odd. They basically just threw in a second core, bridged it to the same bus, and off you go.

Have you looked into the "mailbox"?

Yep, implemented that as well. It's certainly handy for passing short messages and as a synchronization means. But each channel can only hold up to 8 bytes, so sharing large blocks of data this way is impractical. (To be fair, there are 8 channels available and you can use them freely, so even if less practical, you can always use all 8 channels simultaneously to pass up to 64 bytes per transaction. But that's still much slower and clunkier than directly sharing a data block, if said block is large.)

As I said, handling cache sync manually does work, though, for directly sharing RAM. That's useful for larger buffers and not too complicated; it just takes a bit of work. (But yes, that would make it pretty inconvenient for sharing RAM in the context of a general-purpose OS.) So I mentioned that as an example of a system with no cache coherency at all, and how it can be a good exercise.
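
For the record, the manual-coherency handoff I mean is basically the following pattern; the cache_wb_range/cache_inv_range and mailbox_send/mailbox_wait names are hypothetical placeholders for whatever the SoC's SDK or your own baremetal code actually provides:

/* Hypothetical helpers -- names invented for this sketch. */
extern void     cache_wb_range(void *addr, unsigned len);   /* write dirty lines back to RAM */
extern void     cache_inv_range(void *addr, unsigned len);  /* discard stale cached lines    */
extern void     mailbox_send(unsigned value);
extern unsigned mailbox_wait(void);

#define BUF_SIZE 4096
static unsigned char shared_buf[BUF_SIZE];   /* placed in RAM visible to both cores */

/* Core A: fill the buffer, push it out of the cache, then signal. */
void producer_send(const unsigned char *src, unsigned len)
{
    for (unsigned i = 0; i < len; i++)
        shared_buf[i] = src[i];
    cache_wb_range(shared_buf, len);   /* make the data visible in RAM    */
    mailbox_send(len);                 /* tiny message: "len bytes ready" */
}

/* Core B: wait for the signal, drop stale cache lines, then read. */
void consumer_receive(unsigned char *dst)
{
    unsigned len = mailbox_wait();
    cache_inv_range(shared_buf, len);  /* don't read yesterday's copy */
    for (unsigned i = 0; i < len; i++)
        dst[i] = shared_buf[i];
}

The mailbox only carries the length (well within its 8 bytes) and the bulk data goes through the shared RAM; the only trap is forgetting one of the two cache operations, which is exactly why you wouldn't want a general-purpose OS to rely on this.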

I'm pretty happy with those chips - overall, one can do a lot more than I even thought initially, completely baremetal, so that's a big win.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 21426
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #45 on: December 07, 2024, 10:28:43 am »
Yeah, they're not intended to run the same OS on both. One is meant to run Linux (or similar) and the other to run bare-metal programs.

Yep, but they do share all RAM, so that is kind of odd. They basically just threw in a second core, bridged it to the same bus, and off you go.

Have you looked into the "mailbox"?

Yep, implemented that as well. It's certainly handy for passing short messages and as a synchronization means. But each channel can only hold up to 8 bytes, so sharing large blocks of data this way is impractical. (To be fair, there are 8 channels available and you can use them freely, so even if less practical, you can always use all 8 channels simultaneously to pass up to 64 bytes per transaction. But that's still much slower and clunkier than directly sharing a data block, if said block is large.)

As I said, handling cache sync manually does work, though, for directly sharing RAM. That's useful for larger buffers and not too complicated; it just takes a bit of work. (But yes, that would make it pretty inconvenient for sharing RAM in the context of a general-purpose OS.) So I mentioned that as an example of a system with no cache coherency at all, and how it can be a good exercise.

If you have a language designed presuming multicore operation, then the difference between mailboxes and shared memory can be handled transparently.

No need to hope that a random person implementing a design (and an even more random person maintaining/extending it) will not make a subtle mistake.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #46 on: December 07, 2024, 11:30:14 am »
I'm pretty happy with those chips - overall, one can do a lot more than I even thought initially, completely baremetal, so that's a big win.

You are one of the few who speaks well of it.
There are people who had harsh words about how poor the documentation is  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #47 on: December 07, 2024, 11:48:00 am »
[ ref to mailbox ] is certainly handy for passing short messages and as a synchronization means.
But each channel can only hold up to 8 bytes, so sharing large blocks of data this way is impractical.

Yup, you always have to settle, nothing is ever ideal  :-//

With caches there are several problems, including one regarding the size of the cache block!

That is, the compiler's machine layer must take into account that when you invalidate a cache line you invalidate a "block of memory" of a "certain size", and that block must contain one and only ONE variable of interest and nothing else.

If you have two mutexes, they must never be allocated at contiguous addresses, or they will both end up in the same cache block with disastrous consequences because if you SC that block, it's as if you were SC'ing both mutexes!

myC is not able to handle this issue correctly, so as a temporary solution all mutexes that are the target of ll/sc instructions are padded with filler bytes until the cache block size is filled.

It wastes memory, but at least things work as they should.
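
In standard C11 the same workaround can be written explicitly, assuming (purely for the example) a 64-byte cache block; the real size has to come from the CPU manual or be detected at startup:

#define CACHE_BLOCK 64   /* assumed for the example; check the real hardware */

struct padded_mutex {
    _Alignas(CACHE_BLOCK) volatile int lock;      /* the ll/sc target           */
    char pad[CACHE_BLOCK - sizeof(int)];          /* fill the rest of the block */
};

/* Two locks, guaranteed to live in two different cache blocks. */
static struct padded_mutex m1, m2;

Each lock now burns a whole cache block, which is the memory waste mentioned above, but at least two locks can no longer share a block (no false sharing). Whether that is also enough for the LL/SC reservation granule is a separate, implementation-defined question, as discussed further down the thread.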
« Last Edit: December 07, 2024, 12:34:31 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline dobsonr741

  • Frequent Contributor
  • **
  • Posts: 706
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #48 on: December 07, 2024, 02:57:59 pm »
A little insight into the Apple M-series silicon architecture: https://developer.apple.com/videos/play/wwdc2020/10686 where not just the various CPU cores but also the GPU and AI cores share the same memory. It won't reveal the details of cache coherence, but you will hear how a developer on the platform can navigate it.
 
The following users thanked this post: DiTBho

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #49 on: December 09, 2024, 01:09:29 am »
I'm pretty happy with those chips - overall, one can do a lot more than I even thought initially, completely baremetal, so that's a big win.

you are one of the few who speaks well of it.
There are people who had harh words about the fact that the documentation is poor  :o :o :o

A lot is left undocumented, but you can cross-reference info in the datasheets for the SG2000/2002 (which have a bit more info, but barely) and, of course, all the source code they made available. They also have complementary docs (although limited) in their github repos.

That definitely took work and a fair bit of "reverse-engineering". But it is cheap and pretty powerful, and still has enough documented (even if it's sparse and requires a lot of work) to be able to use it baremetal (which was not the intent of the guys who made it available to the public), so that's still a pretty positive point. Try doing that with any kind of typical SBC out there.

You get two RISC-V cores with similar performance to an iMXRT1060, at less than 1/3 the price, and with 64MB of embedded DDR2. Actually, with the vector extension on the first core, and the second core (without vector, but otherwise the same double-precision FPU as the first core), you get more performance. I have use cases in mind, in particular for audio processing applications. The second core can be entirely dedicated to UI stuff.

A few years back, I got interested in the K210 (which was also a dual-core 64-bit RISC-V MCU), but it had literally NO documentation at all (you just had an SDK and that's it), a datasheet which was barely a product brief, and frankly a pretty unhelpful tech team. The double-precision part of the FPU was permanently disabled in their SDK and nobody ever answered why. They had an FFT coprocessor, but it was limited to Q16, and I got a shorter execution time for an FFT of the same size in double-precision FP with the CV1800B than I did with the coprocessor of the K210. So, there you have it.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #50 on: December 09, 2024, 02:33:15 am »
That definitely took work and a fair bit of "reverse-engineering". But it is cheap and pretty powerful, and still has enough documented (even if it's sparse and requires a lot of work) to be able to use it baremetal (which was not the intent of the guys who made it available to the public), so that's still a pretty positive point. Try doing that with any kind of typical SBC out there.

Are you publishing the consolidated information anywhere?
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #51 on: December 09, 2024, 04:05:28 am »
That definitely took work and a fair bit of "reverse-engineering". But it is cheap and pretty powerful, and still has enough documented (even if it's sparse and requires a lot of work) to be able to use it baremetal (which was not the intent of the guys who made it available to the public), so that's still a pretty positive point. Try doing that with any kind of typical SBC out there.

Are you publishing the consolidated information anywhere?

I'm considering that, just have to find enough time.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #52 on: December 09, 2024, 08:18:36 am »
Don't talk to me about SBC documentation, it's a sore point for me  :o :o :o

Saving my ELTK, a 2x68060 VME board with hw mailbox and hw semaphores, from the hydraulic press cost me only two bottles of good red wine, to get the guys paid to destroy those boards to turn a blind eye. Officially they didn't see me save that board; officially, the company that commissioned the "cleaning" of the lab believes it was destroyed.

However, documentation was a bloody sore point, because they had already destroyed everything before I was able to save anything, and there is literally nothing on the web.

This type of board was also used in industrial sewing machines and in other fields. It's very old stuff: the early 90s were the modern era for it, when people didn't upload anything as big as a scan of several books, partly because of the cost of uploading.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #53 on: December 09, 2024, 10:02:48 am »
That definitely took work and a fair bit of "reverse-engineering". But it is cheap and pretty powerful, and still has enough documented (even if it's sparse and requires a lot of work) to be able to use it baremetal (which was not the intent of the guys who made it available to the public), so that's still a pretty positive point. Try doing that with any kind of typical SBC out there.

Are you publishing the consolidated information anywhere?

I'm considering that, just have to find enough time.

If you send me the notes, I could try to find some time to turn them into English.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #54 on: December 25, 2024, 10:54:34 am »
  • The Art of Multiprocessor Programming, by Maurice Herlihy, Nir Shavit

Interesting book  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10117
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #55 on: December 28, 2024, 08:01:55 pm »
  • The Art of Multiprocessor Programming, by Maurice Herlihy, Nir Shavit

Interesting book  :o :o :o
"Written by the world's most revered experts in multiprocessor programming and performance" - so modest.
 

Offline radiogeek381

  • Regular Contributor
  • *
  • Posts: 133
  • Country: us
    • SoDaRadio
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #56 on: January 04, 2025, 12:13:18 am »

If you have two mutexes, they must never be allocated at contiguous addresses, or they will both end up in the same cache block with disastrous consequences because if you SC that block, it's as if you were SC'ing both mutexes!


It is a good idea to make sure two mutexes don't appear in the same block. But a careful reading of the definition of LL/SC will show that there is no architectural commitment as to the size of the address region that is being tracked by the LL address.  That is, an implementation that causes an SC to fail if *any* process does an intervening SC within some range that is independent of the cache block size can still be compliant.

In other words, for processes A and B running on different processors:

A:: LL [address X]
B:: LL [address Y]
B:: SC [Y]
A:: SC [X]

and
B:: LL [Y]
A:: LL [X]
B:: SC [Y]
A:: SC [X]

in both cases A may be allowed to fail because of the intervening SC by process B, if X and Y are within some contiguous range. (For instance, the initial Alpha architecture said that the region was at least 8 aligned bytes, and at most one page.)

MIPS had an identical requirement -- see the description of "SC" on page 302 of "MIPS64® Architecture For Programmers Volume II: The MIPS64® Instruction Set".

There are lots of reasons not to put two semaphores in the same cache block, but LL/SC isn't one of them.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #57 on: January 04, 2025, 03:27:00 pm »

If you have two mutexes, they must never be allocated at contiguous addresses, or they will both end up in the same cache block with disastrous consequences because if you SC that block, it's as if you were SC'ing both mutexes!


It is a good idea to make sure two mutexes don't appear in the same block. But a careful reading of the definition of LL/SC will show that there is no architectural commitment as to the size of the address region that is being tracked by the LL address.  That is, an implementation that causes an SC to fail if *any* process does an intervening SC within some range that is independent of the cache block size can still be compliant.

In other words, for processes A and B running on different processors:

A:: LL [address X]
B:: LL [address Y]
B:: SC [Y]
A:: SC [X]

and
B:: LL [Y]
A:: LL [X]
B:: SC [Y]
A:: SC [X]

in both cases A may be allowed to fail because of the intervening SC by process B, if X and Y are within some contiguous range. (For instance, the initial Alpha architecture said that the region was at least 8 aligned bytes, and at most one page.)

MIPS had an identical requirement -- see the description of "SC" on page 302 of "MIPS64® Architecture For Programmers Volume II: The MIPS64® Instruction Set".

There are lots of reasons not to put two semaphores in the same cache block, but LL/SC isn't one of them.

I am reading several books, and as far as I understand, it's all implementation-defined.

MIPS R2K and R3K did not implement any atomic read-modify-write instructions.
MIPS R4K was the first.

The load-linked instruction performs the first half of an atomic read-modify-write operation: it loads a value from memory and sets a flag in the hardware to indicate that a read-modify-write operation is in progress on that location. The read-modify-write operation is completed by using the store-conditional instruction to store any desired value back to the memory location loaded from, but it does so only if the hardware flag is still set.

Any store done to this location by any CPU or IO device since the load-linked instruction was executed will cause this flag to be cleared. Therefore, if the store-conditional instruction finds the flag still set, it is guaranteed that the location hasn't changed since the load-linked instruction was executed, and the entire sequence of instructions starting with the load-linked and ending with the store-conditional has been executed atomically with respect to the associated memory location.

These two basic instructions can be used to construct more sophisticated atomic operations; anyway, it all depends on how the flag is handled.
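
As a sketch of how such an operation is typically built, an atomic add on classic 32-bit MIPS (R4K and later) looks roughly like this in GCC inline assembly; memory barriers and assembler reorder details are deliberately glossed over:

/* Returns the value seen before the add. */
static inline int atomic_add(volatile int *p, int inc)
{
    int old, tmp;
    __asm__ __volatile__(
        "1:  ll    %0, %2     \n"   /* load-linked: read *p and arm the link flag   */
        "    addu  %1, %0, %3 \n"   /* compute the new value                        */
        "    sc    %1, %2     \n"   /* store-conditional: writes 1 to %1 on success */
        "    beqz  %1, 1b     \n"   /*   0 means the flag was lost -> retry         */
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        : "r"(inc)
        : "memory");
    return old;
}

Swap, test-and-set, compare-and-swap and so on are all variations of the same retry loop; how wide the region guarded by the link flag is, and who can clear it, is exactly the implementation-defined part.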

The flag is usually (MIPS4K does it this way) maintained by the cache controller and is invisible to the software.
There are other possibilities:
  • Cache-Based
  • Exclusive Monitor-based
  • TrMem-based (my implementation)
  • ...

If it doesn't depend on the cache block size (which is a serious problem on the MIPS 4K; I know because I've been banging my head against it for months), then I think MIPS64 uses an "Exclusive Monitor" to implement exclusive access to memory via load-linked/store-conditional.

ARM uses the Exclusives Reservation Granule technique: when an exclusive monitor tags an address, the minimum region that can be tagged for exclusive access is called the Exclusives Reservation Granule (ERG). The ERG is implementation defined, in the range 8-2048 bytes, in multiples of two bytes.

Once again, "portable code" must not assume anything about ERG size.

Worse still, ARM uses LDREX/STREX for multi-processors, but they are not "scalable" down to uniprocessors. These instructions do not do what many folks think they do: they are *ONLY* for multiprocessor systems; uniprocessor systems should consider using "swap".

 :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

