Author Topic: shared memory in multi-CPU systems: looking for books, docs, ...  (Read 5161 times)


Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
shared memory in multi-CPU systems: looking for books, docs, ...
« on: November 27, 2024, 09:06:19 pm »
hi
I'm studying multi-CPU systems that involve shared memory.

In particular, MIPS4 superscalar systems { R10K, R12K, R14K, R16K, ... }, for which I have some documentation.
There are only a few sw & hw examples, and the most recent is dated 1994.

2024 - 1994  :o :o :o

I wonder: how do modern { POWER, ARM, RISC-V, ... } systems with several cores work internally regarding shared RAM?
Is there any study material?

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: WillTurner

Offline WillTurner

  • Regular Contributor
  • *
  • Posts: 66
  • Country: au
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #1 on: November 27, 2024, 10:25:01 pm »
Raspberry Pi Pico datasheet  :)
 
The following users thanked this post: dobsonr741

Offline cosmicray

  • Frequent Contributor
  • **
  • Posts: 343
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #2 on: November 27, 2024, 10:53:46 pm »
Apple sold some Xserve rack-mount systems around 2007-2009 that had dual Intel processors. A quick review suggests the main memory was shared, but I'm not positive about that.
it's only funny until someone gets hurt, then it's hilarious - R. Rabbit
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #3 on: November 27, 2024, 11:30:13 pm »
Well, my boss's Ampere SoC (arm64) has 128 cores internally. They somehow share slices of RAM; I don't know how.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #4 on: November 28, 2024, 12:43:56 am »
I see a lot of "I don't really know" answers...

On modern machines with up to 32 or 64 cores at least, the answer is that the RAM is shared equally between all CPUs, but caches are used to drastically decrease the number of times a CPU talks to RAM.

A pretty typical setup is:

- 32 KB or 64 KB each of very fast (~3 clock cycle access) private instruction and data cache on each CPU core

- 0.5 to 2 MB of L2 cache shared typically between a group of 4 cores

- if more than 4 cores, then typically about 1-2 MB/core of L3 cache. This will often be NUMA, with each group of maybe 4 cores having fast access to its 4 MB of L3 cache but slower access to the other 28 MB or 60 MB etc. of L3 if that isn't being heavily used. Slower, but still two or more times faster than RAM.

In larger machines the RAM can also be split into chunks that are more quickly accessed by one cluster of cores, but also accessible more slowly (over some kind of bus) by cores in other clusters.
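To make the "caches avoid talking to RAM" point concrete: the flip side is that two cores writing into the same cache line keep stealing that line from each other. A minimal sketch in plain C (my own illustration, not from any vendor doc), assuming a 64-byte line:

Code: [Select]
#include <stdint.h>

/* Assumed line size; real parts use 32/64/128 bytes. */
#define CACHE_LINE 64

/* BAD: both counters share one cache line, so two cores incrementing
 * them force the coherence protocol to bounce the line back and forth. */
struct counters_bad {
    uint64_t core0_hits;
    uint64_t core1_hits;
};

/* BETTER: each counter sits in its own cache line, so each core keeps
 * its line in Modified state and rarely disturbs the other core or RAM.
 * (For real code, also align the struct itself to CACHE_LINE.)         */
struct counters_good {
    uint64_t core0_hits;
    char     pad0[CACHE_LINE - sizeof(uint64_t)];
    uint64_t core1_hits;
    char     pad1[CACHE_LINE - sizeof(uint64_t)];
};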

Tech web site reviews of new CPUs often examine such details, e.g. here for the 2990WX that I used to build my work PC in early 2019.

https://www.legitreviews.com/amd-ryzen-threadripper-2990wx-processor-review_207135
 
The following users thanked this post: SiliconWizard, WillTurner

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #5 on: November 28, 2024, 01:13:33 pm »
Quote
I see a lot of "I don't really know" answers...

E.g., talking about MIPS4, the only documentation you can find publicly online is about the R10K, which is documented as a four-way superscalar design that implements register renaming and executes instructions out of order.

However, when you look at the implementations (e.g. workstations, servers), unless they are development SBCs where you pay to have all the documentation with both hw and sw examples (e.g. Wind River VxWorks SBCs), you don't really understand how things work.

Worse still, MIPS4 opens a problem where the "out-of-order" nature of the CPU is paired with the branch prediction and speculative execution nature of a purist RISC design: one or more instructions may begin execution during each cycle, and each instruction takes several or many cycles to complete, so when a branch instruction is decoded, its branch condition may not yet be known. However, the R10000 processor can predict whether the branch is taken, and then continue decoding and executing subsequent instructions along the predicted path.

The problem for cache-based shared RAM arises when a branch prediction is wrong, because the processor must back up to the original branch and take the other path. This technique is called "speculative execution", and whenever the processor discovers a mispredicted branch, it aborts all speculatively-executed instructions and restores the processor's state to the state it held before the branch.

And here, specifically, is the problem: the manual says the cache state is not restored, and this is clearly a side effect of speculative execution.

So, if the speculation involved a Store Conditional (SC): will it be restored? ...

Well ... it depends on whether there is an external coherence agent or not.
If there is not, then - the manual also says - if the cache is involved, it won't be restored, so this is a real mess that needs at least a "sw barrier".

-

This is how MIPS4 CPUs work, and there are different hw implementations of what sits around the CPU.

Speaking of simple MIPS4-based systems with 1 CPU and 1 GFX, the CPU-to-{ COP2, GFX, DMA, ... } approach uses the cache as a mechanism for the CPU to know if a cell has been modified.

In particular there are:
---
  • "Non-Cache Coherent" systems: coherency is not maintained between the CPU caches and external bus DMA, which usually means cache writeback/invalidate needs to be performed by software (barrier + voodoo code) before/after { COP2, GFX, DMA, ... } requests (a sketch of that follows below).
  • "Cache Coherent" systems: coherency is maintained by an external agent that uses the multiprocessor primitives provided by the processor (LL/SC instructions in a conditional block) to maintain cache coherency for interventions and invalidations. External duplicate tags can be used by the external agent to filter external coherency requests and automatically update them.

There are MIPS4 implementations with 128 CPUs; they use NUMA (proprietary NUMAflex technology), but I can't find any documentation.
They certainly don't use simple hw mechanisms like a simple cache coherency agent, nor do they use trivial dual-port RAM.

Quote
On modern machines with up to 32 or 64 cores at least, the answer is that the RAM is shared equally between all CPUs, but caches are used to drastically decrease the number of times a CPU talks to RAM.

The problem is how the "shared ram" is implemented.

In my case, I don't use any cache. Instead I use a simple dual-port RAM paired with a coherence agent embedded in the dual-port RAM.
I call it "tr-mem", as it's a simple "transactional memory".

It's not difficult to make a dual-port RAM; it becomes a little more difficult to make a quad-port RAM, which also exists as an ASIC chip, so it's still feasible and sensible, but I've never seen an octo-port RAM.

I think the 128-CPU system uses a RAM model that leverages super-fast packet networking and wired routers (a cross-bar matrix?), coupled with cache systems with cache-coherence agents.

Dunno  :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #6 on: November 28, 2024, 01:24:03 pm »
(
that issue I was talking about has already been merged into upstream GCC.
See the `-mr10k-cache-barrier=setting` option.
)
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #7 on: November 28, 2024, 01:46:34 pm »
Quote
The problem for cache-based shared RAM arises when a branch prediction is wrong, because the processor must back up to the original branch and take the other path. This technique is called "speculative execution", and whenever the processor discovers a mispredicted branch, it aborts all speculatively-executed instructions and restores the processor's state to the state it held before the branch.

And here, specifically, is the problem: the manual says the cache state is not restored, and this is clearly a side effect of speculative execution.

I think you must be confused.

A store that is executed speculatively can NOT be allowed to update the cache if it is aborted. I've never heard of any machine that would actually write to its private L1 cache and then undo it later, though that would be a valid implementation. The usual thing is to have a store queue which allows the queue to be partially flushed on a mis-speculation, and stores are not actually written to L1 cache (or RAM, for a write-through cache) until the store is no longer speculative.

Depending on the memory model, non-speculative stores that are after a speculative store in program order can be allowed to write to cache/RAM first. That's another story :-)

There is NO WAY that a speculative store can be allowed to update cache and then not have that undone on mis-speculation. JUST NO WAY. That computer will just asplode.

If the manual says "the cache state is not restored" then I expect that is in a section about speculative LOADS, which can load the data into the cache (possibly causing something else to be flushed to make room) even while they are still speculative, and on a mis-speculation you can leave the data that you didn't actually need in the cache.

There is no correctness problem with doing that, only a possible inefficiency in that data that you might turn out to need later got flushed to make room. And -- a concern from much more recently than when the R10K was designed -- it creates a security timing side-channel.
 
The following users thanked this post: DiTBho

Offline radiogeek381

  • Regular Contributor
  • *
  • Posts: 133
  • Country: us
    • SoDaRadio
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #8 on: November 28, 2024, 06:34:43 pm »
First, try either of the Hennessy and Patterson textbooks.  In "Quantitative Approach" there's a section on memory consistency and another on coherence.

Second, I'd suggest you work on understanding the coherence scheme without burdening yourself with the dynamics of LL/SC.  I've worked on both Alpha and MIPS implementations, both have LL/SC, and there are more ways to do it wrong than to do it right. Start with the basic consistency mechanisms, then the implications of speculation.

For what it is worth, both Alpha and MIPS allowed state changes in the cache that were caused by a load on a bad (incorrectly speculated) path.  Once the address enters the memory/cache interface, any fill/displacement/victimization is going to happen.  If it weren't for some security implications, this would be a really good thing, as it allows a load following a mispredicted hammock to prefetch interesting data into the cache.  For that matter, a processor can do whatever prefetching into caches it wants to do, provided that the addresses are "valid." 

Stores, on the other hand, are different.  They must be "held" until the store instruction (and all previous instructions in program order) have retired. Smart designs squirrel the store away in a write buffer that is visible to the processor doing the write (so that a load after a store gets the new value even though the store hasn't retired yet.  Since the load can't retire until the store is ready to retire, there's no hazard in this speculation.). The write buffer entry is pushed to the cache/memory interface at some point in the future.

LL/SC is wayyyy more complicated.  The fundamental rules are:

LL on processor A sets some state that says "I did an LL to an address that matches some pattern".
SC on processor A fails unless its address matches the same pattern. It clears the LL state.
Any SC that arrives at A from some other processor clears the LL state.

No other operation from any other processor (even a write) can affect the LL state.

And here's where it gets really nasty.  What does it mean for an SC operation on processor A to occur before an SC arriving from some other processor? This is where lots of things get tied up in the processor pipeline, the cache coherence scheme, and all kinds of other things.  In addition, there are some nasty issues relating to instructions that may appear between the LL and SC.  I don't remember exactly what the restrictions were, but I know that Alpha's rules were the result of an extended discussion.
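For readers who haven't met LL/SC before, here is a minimal sketch of how those rules get used in practice: an atomic add written with GCC-style MIPS inline assembly, roughly the pattern found in Linux's MIPS atomic ops (illustrative only, not any vendor's reference code). The SC writes 1 or 0 into its register to report success, and the loop simply retries until the link survives.

Code: [Select]
/* Atomic *p += v using LL/SC (MIPS II and later). */
static inline void atomic_add(volatile int *p, int v)
{
    int tmp;
    __asm__ __volatile__(
        "1:  ll   %0, %1      \n"  /* load-linked: read *p and arm the link   */
        "    addu %0, %0, %2  \n"  /* compute the new value                   */
        "    sc   %0, %1      \n"  /* store-conditional: %0 = 1 ok, 0 failed  */
        "    beqz %0, 1b      \n"  /* link was lost: restart the whole RMW    */
        "    nop              \n"  /* branch delay slot                       */
        : "=&r"(tmp), "+m"(*p)
        : "r"(v)
        : "memory");
}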

 
The following users thanked this post: DiTBho

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #9 on: November 28, 2024, 08:21:38 pm »
+= Hennessy and Patterson, Quantitative Approach (kindle version) 
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: eeproks

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #10 on: November 28, 2024, 08:46:32 pm »
More than 20 years ago there was this old discussion; something shocking for anyone who thinks that cache coherence is handled transparently by the CPU.

At the time it was not well known that installing a MIPS4 R10K on a platform designed to host a MIPS4 R5K could lead to very serious kernel problems, whether NetBSD, OpenBSD or Linux.

There was no platform-side documentation, and even today there is only CPU-side documentation, and not even very comprehensive documentation at that, especially when you find out that the MIPS R10K needs external hw to manage cache coherency, otherwise there are very serious problems: no hw examples!

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline radiogeek381

  • Regular Contributor
  • *
  • Posts: 133
  • Country: us
    • SoDaRadio
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #11 on: November 28, 2024, 10:48:09 pm »
Quote
More than 20 years ago there was this old discussion, something shocking, for anyone who thinks that cache coherence is handled transparently by the CPU.

That conversation was pretty peculiar to an R10K behavior and the interaction between speculation and writes to memory-mapped IO.  Again, this is way into the weeds, and should not be seen as "normal" in the industry. And being SGI's first OOO design, there are many things that they learned along the way.

The *implementation or mechanism* for the cache coherence logic is not part of an instruction set architectural specification. The *behavior* of the coherence logic is the component of the architectural specification.

An ISA spec is a commitment between the processor designer and the programmer. (There may be multiple sets of commitments based on system vs. user mode programming.) Even within a single architecture family the cache coherence scheme is likely different for each product in the line. Some configurations require directory based schemes. Some can snoop. Some have separate coherence networks. Others synchronize at a common point. For a given instruction set, all are required to show the same behavior relating to the apparent order of writes from processor X as sensed by processor Y. Coherence commitments and memory consistency commitments *are* part of the ISA contract. But their implementation is not.

Again, this is founded on my experience with Vax, Alpha, and MIPS so may not be representative of x86, Arm, or RISC V.

So, the stack looks like this:  Memory consistency model (very much an ISA component) -> Cache coherence model (an implementation detail, but must provide the stated memory consistency) -> semaphore signaling function (like LL/SC vs. active messages vs. global lock vs...)



****

as an aside: Writes to memory mapped IO space should never be committed until the associated store instruction is retireable. (Writes to any memory location must be predicated by this.) Machines that can't make that guarantee were probably not designed as intended. There is, however, a problem when a READ of an IO register produces a side-effect. Then mis-speculated loads can cause unexpected things to happen. But professional architects have long known that IO registers with READ side-effects are dangerous, and so their use is rare, and ought to be tied to barrier instructions.

For various other reasons, some architectures require a barrier of some sort between writes to IO space and other operations. This prevents a situation where a write to "LaunchRocket.GO" completes before the write to "Rocket.Destination" register. There are similar hazards associated with races between data/control buffer fills and interrupt notification.
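To make the LaunchRocket example concrete, a minimal C sketch (the register addresses are invented placeholders): volatile keeps the compiler from reordering the two MMIO stores, and the explicit barrier keeps a weakly ordered CPU/bus from doing it either.

Code: [Select]
#include <stdint.h>

/* Hypothetical memory-mapped rocket controller registers. */
#define ROCKET_DEST  (*(volatile uint32_t *)0x1F000000u)
#define ROCKET_GO    (*(volatile uint32_t *)0x1F000004u)

static void launch(uint32_t destination)
{
    ROCKET_DEST = destination;   /* program the destination first             */
    __sync_synchronize();        /* full barrier (GCC builtin): the GO write   */
                                 /* must not become visible before this point  */
    ROCKET_GO = 1;               /* only now fire the launch command           */
}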

=====

As a general rule, memory systems should deal with loads as they arrive, not as they retire. This may change ownership or validity of cache blocks.

Memory systems should deal with stores when they are certain to be retired. This is the minimum: practical designs will include write buffers that furnish data to "local" loads appearing after the corresponding store in program order.

=====
 
The following users thanked this post: DiTBho

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #12 on: November 28, 2024, 11:33:34 pm »
Quote
Even within a single architecture family the cache coherence scheme is likely different for each product in the line.
Some configurations require directory based schemes.
Some can snoop.
Some have separate coherence networks.


That's exactly the kind of book and docs I am looking for.
Real, documented sw/hw examples on real platforms!

I should also look at the SMP of the PowerMac G4 and the C8K HPPA2, both 2xCPU.
In this case, the documentation is (in?) the Linux kernel!
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4832
  • Country: nz
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #13 on: November 29, 2024, 12:09:50 am »
Quote
An ISA spec is a commitment between the processor designer and the programmer.

Exactly.

An implementation can do anything it likes, as long as the final result is guaranteed to be AS IF the code was run on the simplest non-pipelined, no-cache, multi-CPI implementation.

Quote
Again, this is founded on my experience with Vax, Alpha, and MIPS so may not be representative of x86, Arm, or RISC V.

R10K and Pentium Pro were very much the pioneers in this stuff -- even eclipsing what mainframes had been doing -- and they made some mistakes which have since been learned from.

For how things *should* work, look at Aarch64 and RISC-V, which have both learned a lot of lessons from the pioneering R10K, Alpha, P6 work. ESPECIALLY, I think, RISC-V which has gone out of its way to bring in world experts from both industry and academia to get various parts of the design right.

A prime example is the memory model, which was done quite early on:

https://googlegroups.com/a/groups.riscv.org/group/isa-dev/attach/3434a200dbcc0/memory-model-spec.pdf?part=0.2

(that's also an appendix in the main ISA manual now)
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #14 on: November 29, 2024, 03:23:19 pm »
memory-model-spec

+=memory-model-spec.pdf?part=0.2 (kindle)

That's exactly what I am looking for, thanks!  :-+
edit:
Very, very useful, even for the toy architecture I work on from time to time.
« Last Edit: November 29, 2024, 03:41:59 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #15 on: November 29, 2024, 03:39:29 pm »
Quote
experience with Vax, Alpha, and MIPS

on which MIPS platforms?

mine:
- SGI IP30/R10K, single core, coherent
- SGI IP32/R10K, single core, non-coherent (sold)
- Atlas MIPS32R2, single core, coherent
- IDT-R2K, MIPS R2000 single core goldcap, N.A. doesn't have any cache controller
- FON2 MIPS32R2, 1xcore/SoC, coherent
- RSP MIPS32R2, 1xcore/SoC, coherent
- RB532 MIPS32R2, 1xcore/SoC, coherent
- RBM33G MIPS32R2, 2xcore/SoC, coherent
- MIPS5++ prototype, softcore/FPGA, N.A.  doesn't have any cache controller

supporting { UMon/fw, YaMon/fw, Linux/Kernel, XINU/Kernel }
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #16 on: November 29, 2024, 04:12:21 pm »
Typically, IO addresses are mapped to an "uncached region", as this way "uncached reads" and "uncached writes" are guaranteed to be non-speculative.
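On 32-bit MIPS that uncached region is simply the fixed KSEG1 window: physical address ORed with 0xA0000000, unmapped and uncached. A minimal sketch, with a placeholder UART address that doesn't belong to any real board:

Code: [Select]
#include <stdint.h>

/* KSEG1: unmapped, uncached view of the low 512 MB of physical space. */
#define KSEG1ADDR(pa)   (0xA0000000u | (uint32_t)(pa))

/* Placeholder physical address of a memory-mapped UART data register. */
#define UART_PHYS       0x18000000u
#define UART_DATA       (*(volatile uint8_t *)KSEG1ADDR(UART_PHYS))

static void uart_putc(char c)
{
    UART_DATA = (uint8_t)c;   /* uncached store, never allocated in cache */
}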

However, there is a further *BIG* problem with the R10K: a speculative block read request to an unpredictable physical address can occur if the CPU speculatively fetches data due to a Load or Jump Register instruction specifying a register that still holds a "garbage" value during the speculative phase.

So, speculative block accesses to load-sensitive I/O areas "can" produce unwanted side-effects.

I had to modify the machine layer of my myC compiler to ensure that a "garbage" register never occurs around an IO operation.

It's easy to fix once you know about it, but it's a real pain if you don't even know that this problem exists.
I don't know how many hours I lost before I figured out why YaMon/R10K crashed every now and then while accessing the memory-mapped serial port.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #17 on: November 30, 2024, 03:57:46 pm »
...

In short, the "R10K on DMA" problem may be summarized as follows:
- the R10K may start executing load/store instructions that wouldn't be started by normal program flow
- this can happen due to a mispredicted branch
- in a "cache-coherent" system it would just cancel this instruction later on, and no one would notice anything
- in a "non cache-coherent" system, it is impossible to cancel loads or stores <------ it says "cancel", I guess from a pending IO queue
- that may result in a cache line being marked as dirty
- if there is DMA going on to the same location during that time, data will be lost after write-back
- speculative loads are not as much of a problem
- they can be taken care of by an extra cache invalidation op after each DMA (a sketch follows after the workaround list below)

A workaround, specific to a platform that comes with a special HW feature that causes any cache-coherent stores to all but the first (x) MByte of KSEG0 addresses to fail:
1) this allows configuring KSEG0 as "cached coherent"
2) put the most important firmware/kernel code into the first (x) MByte of RAM
3) use HIGHMEM for the rest (in Linux kernel terms)
4) any memory that is the target of a DMA transfer will be removed from the TLB
5) and thus be inaccessible while DMA is in progress <-------------- BINGO!?!
6) (well, no, you still have to) make sure it isn't brought back into the TLB by an accidental reference
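The "extra cache invalidation op after each DMA" from the first list typically looks like this sketch: loop over the buffer with the CACHE Hit_Invalidate_D op, so any stale (possibly speculatively filled) lines are discarded before the CPU reads what the DMA engine wrote. Line size and op encoding here assume a generic MIPS32 core; the R10K's actual line sizes differ, so treat it as the shape of the fix, not the exact code.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE 32   /* assumed primary D-cache line size */

static void dcache_inv_range(void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(DCACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += DCACHE_LINE)      /* 0x11 = Hit_Invalidate_D */
        __asm__ __volatile__("cache 0x11, 0(%0)" : : "r"(p) : "memory");

    __asm__ __volatile__("sync" : : : "memory");  /* order vs. later loads */
}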


That's MIPS R10K  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #18 on: November 30, 2024, 07:03:35 pm »
Quote
I wonder: how do modern { POWER, ARM, RISC-V, ... } systems with several cores work internally regarding shared RAM?
Is there any study material?

To answer this specifically: all modern systems, including those named, are fully cache coherent.  Non-cache-coherent shared memory is pretty much dead in the realm of general-purpose computing.  Instead, a combination of more cache levels to reduce global memory access and smarter software and operating systems is used to reduce contention.  Cache coherence basically means that a cache needs to have exclusive access to a given cache line to perform a write.  This is accomplished by cache invalidation messages exchanged over the processor interconnect.

Up through most high-end desktop CPUs, most systems have a single memory controller (possibly dual-channel) shared equally between all cores.  This main memory is symmetric.

Systems like the Threadripper Pro and high-core-count server CPUs, as well as most multi-socket systems, have multiple memory controllers attached to different core complexes / sockets.  This means that each CPU has its own locally attached memory, and to talk to other memory it has to use the CPU-to-CPU interconnect.  This is still cache coherent, it just comes with a performance impact.  Page and thread migration is used to try to keep processors accessing local memory as much as possible.

Within the realm of cache coherence there is a fairly wide range of memory ordering guarantees.  Power and ARM have weaker memory ordering; SPARC and x86 have stronger memory ordering, known as total store ordering, where all CPUs observe stores to main memory in the same order.

You can think of it like this: pretty much any high-performance processor has a store buffer.  It's one of the most valuable caches in a system, because stores can trivially be deferred and blocking on stores would be a huge performance hit.  When a CPU does a load, it checks the store buffer before the caches.  Total store ordering just means that 1) writes to the store buffer go in program order, 2) nothing in the store buffer can be visible to any other core, 3) commits from the store buffer are globally visible simultaneously via the cache coherence, and 4) commits from the store buffer must happen in order.  If the oldest write in the buffer is blocked due to a cache miss or something, no other writes can be committed.

ARM and Power, for example, don't guarantee all of these rules and don't have total store ordering.  This potentially allows avoiding some unnecessary stalls, at the expense of requiring more explicit barriers when synchronization is needed.
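The textbook way to see the difference is the store-buffering litmus test; a minimal sketch with C11 atomics (illustrative only). With relaxed ordering, both threads reading 0 is allowed on ARM/Power, and in fact on x86 too, because each core's own store can still be sitting in its store buffer; making all four accesses seq_cst forbids that outcome, at the cost of a full barrier on weakly ordered machines.

Code: [Select]
#include <stdatomic.h>

/* Store-buffering litmus test:
 *   thread0: x = 1; r0 = y;      thread1: y = 1; r1 = x;
 * Question: can both r0 and r1 end up 0?                            */
atomic_int x, y;

int thread0(void)   /* run concurrently with thread1 on another core */
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    return atomic_load_explicit(&y, memory_order_relaxed);   /* r0 */
}

int thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return atomic_load_explicit(&x, memory_order_relaxed);   /* r1 */
}

/* With relaxed (and even release/acquire) ordering, r0 == r1 == 0 is a
 * legal outcome: each store may still be in its core's store buffer when
 * the other core's load runs.  With memory_order_seq_cst on all four
 * accesses, that outcome is forbidden.                                  */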

Device memory, in contrast to main memory, is often uncached.  When cached (for instance when used for shared memory with an accelerator), it's usually cache coherent as well.   The mechanism is generally different, since it's attached on an expansion bus like PCIe or CXL, but it's doing fundamentally the same thing: cache lines are either shared and read-only, or exclusive and writeable by a single agent.
 
The following users thanked this post: DiTBho

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4059
  • Country: us
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #19 on: November 30, 2024, 08:12:50 pm »
One thing that has changed between the era of the R10K and now is that the memory controllers have moved on-chip or in-package.

Back in the 90s, processors exposed their front side bus and the system designer was responsible for hooking up the memory controller(s) and possibly the upper cache levels to the CPUs.  This meant that more of the implementation of cache coherence and memory ordering was up to the system designer not the CPU itself.  Different systems could have different guarantees even if using the same processor. 

Modern systems have their memory controllers and cache coherence controllers built into the processor, whether on the same die or a chiplet. 
 
The following users thanked this post: DiTBho

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #20 on: November 30, 2024, 08:35:14 pm »
Quote
One thing that has changed between the era of the R10K and now is that the memory controllers have moved on-chip or in-package.

Back in the 90s, processors exposed their front side bus and the system designer was responsible for hooking up the memory controller(s) and possibly the upper cache levels to the CPUs.  This meant that more of the implementation of cache coherence and memory ordering was up to the system designer not the CPU itself. 

On my IDT R2K development board, the cache controller is an optional module that you can decide to install.
The default configuration of the dev kit doesn't include any cache, nor any cache controller.
You can also choose the CPU module: { R2000, R3000 }.
The FPU is also an external chip on a further module you can decide to install.

With MIPS32R2, you have "SoCs", all-in-one.

Quote
Different systems could have different guarantees even if using the same processor.

Just three examples of what I've worked on with SGI/MIPS:
- IP28, MIPS R10K -> non-coherent, without the special HW feature that causes any cache-coherent stores to all but the first (x) MByte of KSEG0 addresses to fail
- IP32, MIPS R10K -> non-coherent, but with a special HW feature that causes any cache-coherent stores to all but the first (x) MByte of KSEG0 addresses to fail
- IP30, MIPS R10K -> coherent, piece of cake
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Postal2

  • Frequent Contributor
  • **
  • Posts: 826
  • Country: 00
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #21 on: December 01, 2024, 06:38:45 am »
If anyone is reading this by chance in hopes of getting information, please refer to the Nvidia CUDA documentation.
 
The following users thanked this post: DiTBho

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #22 on: December 04, 2024, 08:23:52 am »
Maybe also the book I posted in the FPGA area here on the forum - pipelined multi-core MIPS implementation and correctness - I got the paper version, which was delivered 4000 km from where I am now.

The core of that discussion is different from this specific discussion area, but it also covers multi-core, cache, and pipeline, and is specific to MIPS.

I have to get home.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4429
  • Country: gb
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #23 on: December 04, 2024, 08:19:52 pm »
OT, but interesting book

+= The Cray X-MP/Model 24: A Case Study in Pipelined Architecture and Vector Processing (Robbins)
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15894
  • Country: fr
Re: shared memory in multi-CPU systems: looking for books, docs, ...
« Reply #24 on: December 05, 2024, 01:59:33 am »
As a maybe "interesting" (or fun) fact: the CV1800B (in the MilkV Duo boards), and probably similarly the other SoCs of the Duo series (SG2000/2002), although I haven't tested those yet, has two RISC-V cores inside. Both have access to the main RAM, each has its own cache, and there is absolutely no cache coherency mechanism. They are implemented as two completely separate CPUs (they even have the same hart ID!) that happen to share the same peripherals and main RAM.

So to access a shared RAM block, you have to properly invalidate/clean the respective caches, manually. Quite fun.

I think they were essentially meant to communicate via a mailbox and not directly share any RAM, which is why those are the only examples you'll find in the official SDK. But since I went bare-metal with this SoC, I played with sharing RAM, and it was as barebones as it gets. A good exercise, although not exactly fun. Not ultra-efficient either, since sharing large blocks involves invalidating/cleaning the cache either line by line or as a whole, which takes some time.
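For anyone wanting to try the same thing, the protocol usually ends up shaped like the sketch below. The two cache-maintenance helpers and the flag address are hypothetical placeholders for whatever the platform/vendor SDK actually provides; the point is just the order of operations: the writer cleans (writes back) its D-cache, the reader invalidates its own, and the "ready" flag lives in an uncached (or mailbox) location so it needs no maintenance at all.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform-provided cache maintenance over an address range. */
extern void dcache_clean_range(void *addr, size_t len);       /* write back */
extern void dcache_invalidate_range(void *addr, size_t len);  /* discard    */

#define SHARED_LEN 4096
static uint8_t shared_buf[SHARED_LEN];                  /* ordinary cached RAM */
#define READY_FLAG (*(volatile uint32_t *)0x8F000000u)  /* placeholder, uncached */

/* Core 0: producer */
void send(const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        shared_buf[i] = src[i];
    dcache_clean_range(shared_buf, len);  /* push the data out of core 0's cache */
    READY_FLAG = (uint32_t)len;           /* uncached: visible to core 1 at once */
}

/* Core 1: consumer */
size_t receive(uint8_t *dst)
{
    size_t len;
    while ((len = READY_FLAG) == 0)
        ;                                      /* poll the uncached flag       */
    dcache_invalidate_range(shared_buf, len);  /* drop any stale cached copy   */
    for (size_t i = 0; i < len; i++)
        dst[i] = shared_buf[i];
    READY_FLAG = 0;
    return len;
}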
 

