My limited understanding is that the HPC mob has standardised on message passing.
It is usable on any architecture (with appropriate tiny hand-crafted primitives) and is scalable to thousands of processors.
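As a rough idea of how tiny such a primitive can be, here is a minimal sketch in C of a single-producer/single-consumer mailbox. It is not taken from any HPC library or from the prototype: the names mbox_t, mbox_send, mbox_recv and the MBOX_SLOTS size are all hypothetical, and it assumes C11 atomics and a memory region that both sides can see.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MBOX_SLOTS 8u /* hypothetical size, must be a power of two */

/* One-way mailbox: exactly one sender and one receiver (assumption). */
typedef struct {
    uint32_t slot[MBOX_SLOTS];
    _Atomic uint32_t head; /* advanced only by the receiver */
    _Atomic uint32_t tail; /* advanced only by the sender   */
} mbox_t;

/* Returns false when the mailbox is full; the caller decides how to wait. */
static bool mbox_send(mbox_t *m, uint32_t msg)
{
    uint32_t tail = atomic_load_explicit(&m->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&m->head, memory_order_acquire);
    if (tail - head == MBOX_SLOTS)
        return false;
    m->slot[tail % MBOX_SLOTS] = msg;
    /* release: the payload is visible before the new tail is */
    atomic_store_explicit(&m->tail, tail + 1, memory_order_release);
    return true;
}

/* Returns false when the mailbox is empty. */
static bool mbox_recv(mbox_t *m, uint32_t *msg)
{
    uint32_t head = atomic_load_explicit(&m->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&m->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *msg = m->slot[head % MBOX_SLOTS];
    /* release: the slot is read before it is handed back to the sender */
    atomic_store_explicit(&m->head, head + 1, memory_order_release);
    return true;
}
```

Each side only ever writes its own index, so no lock and no read-modify-write instruction is needed, only ordered loads and stores.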
Caches, on the other hand...
Cache is evil but, worse still, "cache-coherent Non-Uniform Memory Architecture" is what I see in my MIPS5+ prototype:
- "ccNUMA" system architecture
- hw coherence manager
- by default it is disabled, you can enable it by grounding a pin ... the first thing I did was solder a jumper to ground, so as to be 100% sure that it is always enabled!
- local memories are shared in a Single-System Image (SSI)
- single shared address space (64bit)
- single copy of operating system (XINU in my case)
- designed to scale to very large processor counts (128-192 ... only 8 in my case)
- MIPS5+, 4-way super-scalar microprocessor
- not compatible with MIPS-4 Instruction Set Architecture -> there are no AS/C compilers!
- out-of-order execution -> 26 of the UM's 800 pages are about "bad side effects"
- two Floating-point Execution Units
- two Fixed-point Execution Units
- each queue can issue one instruction per cycle: { Add, Mul, MAC, Div }
- each unit is controlled by an 8-entry Flt.Pt. queue
- each unit can trap an exception
- large virtual and physical address spaces
- 52-bit virtual address (data)
- 48-bit physical address (CPU side, 64-bit address register)
- very large TLB page sizes: 64M, 256M and 1G!!!!
- very wild cop0: when it receives an interrupt/exception, it only sets a bit to say whether or not it has completed the current instruction, on the precise idea that the software will then work out what happened and therefore whether the return address is PC_next rather than PC_curr. In addition, if there is an exception, all the LL/SC instructions are cancelled (flushed), which is quite serious from the synchronization side because everything must be cancelled and resynchronized
- kseg0 is uncached, you can address it to an experimental tr-mem
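To make the LL/SC point concrete, here is a minimal sketch in C of the retry pattern that software has to use. It is not code for the MIPS5+ (there is no C compiler for it anyway). It assumes C11 atomics as a stand-in: atomic_compare_exchange_weak is allowed to fail spuriously, and on LL/SC machines compilers typically build it from an LL/SC pair, so a reservation lost to an exception shows up as exactly that kind of failure. The function name atomic_inc_retry and the retry counter are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomic increment written as an explicit retry loop.
 * A failed exchange plays the role of a failed SC: the update did not
 * happen, 'old' is refreshed with the current value, and we try again. */
static uint32_t atomic_inc_retry(_Atomic uint32_t *p, unsigned *retries)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) {
        if (retries != NULL)
            ++*retries; /* count how often the "reservation" was lost */
    }
    return old + 1; /* the value this call stored */
}
```

The loop is correct however often the exchange fails, but every flush costs at least one extra trip around it, which is why cancelling all LL/SC on every exception hurts.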
Being a prototype, it's a full debugging challenge. It has extra hw to trace some of the activity hidden on chip.
Deep sub-micron features are difficult to observe in testers and impossible to observe in a system, especially in large systems with many processors.
The failure point may be difficult to find, and the exact failure condition must be recreated.
It's dead, a project they decided not to continue.

Unfortunately I don't have all the documentation; worse still, it's a prototype that is not 100% working, and all the analysis tools are missing, which I'm sure they preferred to destroy rather than release.
One of the manuals cites external documents that are missing:
+ Architecture Verification Programs (AVP)
+ Micro-architecture Verification Programs (MVP)
+ Random diagnostics from programmable random code generators (missing testbench)
+ Diagnostics that are self-checked and/or compared with a reference machine (which is also missing)
...
Debugging equipment is mentioned for these, in particular one of the ICEs that my company was contacted about.
I literally saved a board and two DVDs from the hydraulic press, and I'm trying to understand how it works.
I'd like to recreate some of its features in HDL (the trmem, in particular, or any other good shared mem mechanism), obviously simplifying them.
The goal is to learn something and improve my own RISC-ish softcore toy.