I see a lot of "I don't really know" answers...
Take MIPS4, for example: the only documentation you can find publicly online covers the R10K (R10000), which is documented as a four-way superscalar design that
implements register renaming and executes instructions out of order.
However, when you look at actual implementations (e.g. workstations, servers), you can't really tell how things work unless they are development SBCs where you pay for full documentation with both hw and sw examples (e.g. Wind River VxWorks SBCs).
Worse still, MIPS4 opens a problem with the "out-of-order" nature of the CPU combined with the branch prediction and speculative execution of a purist RISC design: one or more instructions may begin execution during each cycle, and each instruction takes several or many cycles to complete, so when a branch instruction is decoded its branch condition may not yet be known. The R10000 processor can, however, predict whether the branch is taken and then continue decoding and executing subsequent instructions along the predicted path.
The problem with cache-based shared RAM shows up when a branch prediction is wrong, because the processor must back up to the original branch and take the other path. Executing along a predicted path is called "speculative execution"; whenever the processor discovers a mispredicted branch, it aborts all speculatively executed instructions and restores the processor's state to the state it held before the branch.
And here is the specific problem: the manual says that
the cache state is not restored, and this is clearly a side effect of speculative execution.
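
To make that side effect concrete, here is a toy model in C. It is only a sketch under loud assumptions: `toy_cpu`, `spec_begin`, `spec_load` and `spec_abort` are hypothetical names invented for illustration, and the model has nothing to do with the real R10000 pipeline beyond the one property quoted above (registers roll back, cache contents do not).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_REGS  4
#define TOY_LINES 8

/* Hypothetical toy CPU: a few registers plus one "present" bit per cache line. */
typedef struct {
    uint32_t reg[TOY_REGS];
    bool     line_present[TOY_LINES];
} toy_cpu;

/* The checkpoint covers the registers only; cache state is not part of it. */
typedef struct {
    uint32_t reg[TOY_REGS];
} toy_checkpoint;

void toy_cpu_reset(toy_cpu *c)
{
    memset(c, 0, sizeof *c);
}

/* Taken when a branch is predicted: remember the register state. */
toy_checkpoint spec_begin(const toy_cpu *c)
{
    toy_checkpoint k;
    memcpy(k.reg, c->reg, sizeof k.reg);
    return k;
}

/* A speculative load: writes a register and pulls a line into the cache. */
void spec_load(toy_cpu *c, unsigned rd, unsigned line, uint32_t value)
{
    assert(rd < TOY_REGS && line < TOY_LINES);
    c->reg[rd] = value;
    c->line_present[line] = true;
}

/* Misprediction: registers are rolled back, the cache is left as it is. */
void spec_abort(toy_cpu *c, const toy_checkpoint *k)
{
    memcpy(c->reg, k->reg, sizeof c->reg);
}
```

After `spec_begin`, `spec_load(&c, 1, 5, 99)` and `spec_abort`, register 1 is back to its old value while `line_present[5]` stays set: the aborted path has left a visible trace in the cache.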
So, if the speculatively executed path involved a Store Conditional (SC): will it be restored? ...
Well ... it depends on whether there is an external coherence agent or not.
If there is not, then (the manual also says) if the cache is involved it won't be restored, so this is a real mess that needs at least a "sw barrier".
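
As a rough idea of what such a "sw barrier" around an LL/SC-style update looks like, here is a minimal sketch in portable C11. The helper name `shared_add_u32` is hypothetical. I am also assuming that a MIPS compiler lowers the compare-exchange to an LL/SC retry loop and the fences to `sync`; that mapping is my assumption, not something the manual excerpt above states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add `delta` to a shared counter, return the new value. */
uint32_t shared_add_u32(_Atomic uint32_t *counter, uint32_t delta)
{
    assert(counter != NULL);

    uint32_t expected = atomic_load_explicit(counter, memory_order_relaxed);
    uint32_t desired;

    /* Barrier before the update: earlier loads/stores are ordered first. */
    atomic_thread_fence(memory_order_seq_cst);

    do {
        desired = expected + delta;
        /* On failure `expected` is refreshed with the current value and the
         * loop retries, much like a failed SC branching back to its LL. */
    } while (!atomic_compare_exchange_weak_explicit(
                 counter, &expected, desired,
                 memory_order_relaxed, memory_order_relaxed));

    /* Barrier after the update: later accesses are ordered after it. */
    atomic_thread_fence(memory_order_seq_cst);
    return desired;
}
```

The update itself is deliberately relaxed, so all the ordering comes from the two explicit fences; that keeps the barrier visible in the code instead of hiding it inside the atomic operation.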
-
This is how MIPS4 CPUs work, and there are different hw implementations of what is around the CPU.
Speaking of simple MIPS4-based systems with 1 CPU and 1 GFX, the CPU-to-{ COP2, GFX, DMA, ... } path uses the cache as the mechanism by which the CPU knows whether a cell has been modified.
In particular there are:
---
- "Non-Cache Coherent" systems: coherency is not maintained between CPU caches and external bus DMA, which usually means cache writeback/invalidate must be performed by software (barrier + voodoo code) before/after { COP2, GFX, DMA, ... } requests.
- "Cache Coherent" systems: coherency is maintained by an external agent that uses the multiprocessor primitives provided by the processor (LL/SC instructions in a conditional block) to maintain cache coherency for interventions and invalidations. External duplicate tags can be used by the external agent to filter external coherency requests and automatically update
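
The writeback/invalidate dance of the non-coherent case can be shown with a toy model. This is a sketch only: `toy_system` and every function below are hypothetical, "cache" and "RAM" are just two arrays, and on real hardware the writeback and invalidate steps would be cache-maintenance instructions plus a barrier, not C functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 8

/* Toy model: the DMA device sees `ram`, the CPU sees `cache` first. */
typedef struct {
    uint32_t ram[MEM_WORDS];
    uint32_t cache[MEM_WORDS];
    bool     valid[MEM_WORDS];
    bool     dirty[MEM_WORDS];
} toy_system;

void toy_init(toy_system *s)
{
    memset(s, 0, sizeof *s);
}

/* CPU store: lands in the cache only (write-back policy). */
void cpu_write(toy_system *s, unsigned i, uint32_t v)
{
    assert(i < MEM_WORDS);
    s->cache[i] = v;
    s->valid[i] = true;
    s->dirty[i] = true;
}

/* CPU load: served from the cache when the line is valid, else refilled. */
uint32_t cpu_read(toy_system *s, unsigned i)
{
    assert(i < MEM_WORDS);
    if (!s->valid[i]) {
        s->cache[i] = s->ram[i];
        s->valid[i] = true;
        s->dirty[i] = false;
    }
    return s->cache[i];
}

/* Software step before a DMA read: push a dirty line out to RAM. */
void cache_writeback(toy_system *s, unsigned i)
{
    assert(i < MEM_WORDS);
    if (s->valid[i] && s->dirty[i]) {
        s->ram[i] = s->cache[i];
        s->dirty[i] = false;
    }
}

/* Software step after a DMA write: drop the line so the next load refills. */
void cache_invalidate(toy_system *s, unsigned i)
{
    assert(i < MEM_WORDS);
    s->valid[i] = false;
    s->dirty[i] = false;
}

/* DMA accesses bypass the cache completely. */
uint32_t dma_read(const toy_system *s, unsigned i)
{
    assert(i < MEM_WORDS);
    return s->ram[i];
}

void dma_write(toy_system *s, unsigned i, uint32_t v)
{
    assert(i < MEM_WORDS);
    s->ram[i] = v;
}
```

Without `cache_writeback` the device reads stale RAM after a `cpu_write`, and without `cache_invalidate` the CPU keeps reading its stale cached copy after a `dma_write`. In a cache-coherent system the external agent does both jobs in hardware.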
There are MIPS4 implementations with 128 CPUs; they use NUMA (proprietary NUMAflex technology), but I can't find any documentation.
They certainly don't use simple hw mechanisms like the simple cache coherency agent, nor do they use trivial dual-port-ram.
On modern machines, at least those with up to 32 or 64 cores, the answer is that the RAM is shared equally between all CPUs, but caches are used to drastically decrease the number of times a CPU talks to RAM.
The problem is how the "shared ram" is implemented.
In my case, I don't use any cache. Instead I use a simple dual-port RAM paired with a coherency agent embedded into the dual-port RAM itself.
I call it "tr-mem", as it's a simple "transactional memory".
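
One way such an embedded agent could behave is sketched below as a C model. This is a guess for illustration, not a description of the actual tr-mem: the reservation scheme, the two-port limit and all names (`tr_mem`, `tr_load_linked`, `tr_write`, `tr_store_conditional`) are assumptions borrowed from how LL/SC works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TR_WORDS 16
#define TR_PORTS 2

/* Hypothetical model: the RAM cells plus one reservation per port. */
typedef struct {
    uint32_t cell[TR_WORDS];
    bool     reserved[TR_PORTS];
    unsigned res_addr[TR_PORTS];
} tr_mem;

void tr_init(tr_mem *m)
{
    memset(m, 0, sizeof *m);
}

/* Read a cell and remember that this port is watching it. */
uint32_t tr_load_linked(tr_mem *m, unsigned port, unsigned addr)
{
    assert(port < TR_PORTS && addr < TR_WORDS);
    m->reserved[port] = true;
    m->res_addr[port] = addr;
    return m->cell[addr];
}

/* Plain write: the agent cancels any other port's reservation on this cell. */
void tr_write(tr_mem *m, unsigned port, unsigned addr, uint32_t v)
{
    assert(port < TR_PORTS && addr < TR_WORDS);
    m->cell[addr] = v;
    for (unsigned p = 0; p < TR_PORTS; p++) {
        if (p != port && m->reserved[p] && m->res_addr[p] == addr) {
            m->reserved[p] = false;
        }
    }
}

/* Write only if this port's reservation on `addr` is still intact. */
bool tr_store_conditional(tr_mem *m, unsigned port, unsigned addr, uint32_t v)
{
    assert(port < TR_PORTS && addr < TR_WORDS);
    bool ok = m->reserved[port] && m->res_addr[port] == addr;
    m->reserved[port] = false;   /* a reservation is single-use */
    if (ok) {
        tr_write(m, port, addr, v);
    }
    return ok;
}
```

If both ports load-link the same cell and then both try to store, the first store succeeds and the second fails, so the losing port has to reload and retry. No cache is involved, which is the point of putting the agent inside the RAM.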
It's not difficult to make a dual-port RAM. A quad-port RAM is a little more difficult, but it also exists as an ASIC chip, so it's still feasible and sensible. I've never seen an octo-port RAM, though.
I think the 128-CPU system uses a ram model that leverages super fast packet networking and wired routers (cross-bar matrix?), coupled with cache systems with cache-coherence agents.
Dunno
