Sorry, but I seem to be in "too long; do not read" mode.
As it happens I have the processor manuals for both the R2K and the 88K.
There are at least 3 editions of Patterson & Hennessy since the R2K was mentioned. Those editions came out not because the authors wanted to sell books, but because the world had changed so drastically that the previous edition was no longer relevant to current products. The discussion in the 1st ed is dominated by the branch delay slot. Then came speculative execution and branch prediction. I have no idea where we are now. I've got the 4th ed on my shelf, but haven't had an HPC project, so not a lot of point in wading through all that detail. It's probably entirely different now. And it will certainly change drastically after the discovery of the speculative execution vulnerabilities.
Linear algebra lives or dies on memory bandwidth. IIRC the Intel i860 was touted as an 80 MFLOPS part. I did an analysis of the cycle time to do a vector multiply-add and found that absolute peak performance was more like 10 MFLOPS for vectors of 12 KB, which was a typical seismic trace length at the time. Trace lengths are roughly double that now.
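To put numbers on that kind of analysis, here is a back-of-the-envelope sketch; the bandwidth figure is an assumption picked for illustration, not a measured i860 spec:

program bw_bound
  implicit none
  ! Illustrative figures only: the bandwidth number is an assumption,
  ! not a measured i860 spec.  A single-precision multiply-add over
  ! vectors (y = y + a*x) does 2 flops per element but moves 12 bytes:
  ! load x, load y, store y.
  real :: bw, flops_per_elem, bytes_per_elem, bound
  bw             = 80.0e6    ! assumed sustained memory bandwidth, bytes/s
  flops_per_elem = 2.0
  bytes_per_elem = 12.0
  bound = bw / bytes_per_elem * flops_per_elem
  print *, 'bandwidth-limited rate (MFLOPS):', bound / 1.0e6
end program bw_bound

With those assumed numbers the memory system caps you at roughly 13 MFLOPS no matter what the FPU can theoretically retire.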
You can have large memory or you can have fast memory, but you cannot have large, fast memory. It is physically impossible: as the memory gets larger, the capacitance of the address lines increases, the RC time constant gets larger, and the cycle time for an access gets longer. So I am *not* going overboard. This is the cold hard truth I have dealt with for 30 years in a work environment where running 10,000 cores for 7-10 days and feeding them 10-12 TB of data is routine. And power and cooling limitations prevent having more than about 50,000 cores at one site, so processing companies have multiple installations scattered all over Houston.
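A toy model of that scaling, under the simplifying assumption that address-line capacitance grows linearly with the number of cells hanging off the line (both constants are made up, only the trend matters):

program rc_scaling
  implicit none
  ! Toy model, not device physics: if the capacitance of an address line
  ! grows linearly with the number of cells attached to it, the RC delay,
  ! and with it the minimum access time, grows with memory size.
  ! Both constants below are assumed; only the trend matters.
  real, parameter :: r = 100.0            ! assumed driver resistance, ohms
  real, parameter :: c_per_cell = 1.0e-15 ! assumed capacitance per cell, farads
  integer :: k
  real :: cells, delay
  do k = 10, 30, 5
     cells = 2.0**k                       ! cells sharing the line
     delay = r * c_per_cell * cells       ! RC time constant, seconds
     print '(a,i3,a,es10.3,a)', '2**', k, ' cells -> RC ~ ', delay, ' s'
  end do
end program rc_scaling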
The number of cores allocated to a particular job depends on how urgent it is and how fast the person doing it can do the QC on a run and update the velocity model before running the whole thing again. Typically a seismic processor will be working on two of these, so that while one is on the machine they are doing the QC on the other.
The only people doing anything comparable face a long vacation in a federal prison for discussing such matters. So one never does. A statement as simple as "there is a mathematical identity that can be used at a certain point in the calculation" can land everyone involved a very long prison visit.
Generally this is not a big deal for small problems of a few million matrix entries. But when each of the 3 dimensions measures in the 10^6 to 10^8 range, it gets *really* serious.
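If those are the three dimensions of a matrix product, then even at the low end and with round assumed numbers the scale looks like this:

program gemm_scale
  implicit none
  ! Back-of-the-envelope with assumed round numbers: the three dimensions
  ! of a matrix product C = A*B, each taken at the low end, 10**6.
  real(kind=8) :: m, n, k, bytes_c, flops
  m = 1.0d6
  n = 1.0d6
  k = 1.0d6
  bytes_c = m * n * 4.0d0       ! one single-precision result matrix
  flops   = 2.0d0 * m * n * k   ! multiply-adds in the product
  print *, 'result matrix, TB:      ', bytes_c / 1.0d12
  print *, 'flops for the multiply: ', flops
end program gemm_scale

That is about 4 TB for a single result matrix and on the order of 10^18 flops for one multiply, which is why the memory system and not the FPU is the problem.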
I have no recommendation other than to do exactly what I would do, a lot of reading.
If you're not familiar with the issue, look at the implications of particular strides and cache associativities.
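A sketch of the worst case (the cache geometry numbers are assumptions; substitute the real ones for your part): a stride equal to the cache size divided by the associativity puts every access in the same set.

program stride_sets
  implicit none
  ! Cache geometry is assumed; substitute the real numbers for your part.
  integer, parameter :: cache_bytes = 32768  ! assumed 32 KB cache
  integer, parameter :: line_bytes  = 64     ! assumed 64-byte lines
  integer, parameter :: assoc       = 4      ! assumed 4-way set associative
  integer, parameter :: nsets       = cache_bytes / (line_bytes * assoc)
  integer :: stride_bytes, i, addr, set
  ! A stride that is a multiple of nsets*line_bytes (here 8192 bytes) puts
  ! every access in the same set; touch more than 'assoc' such lines
  ! repeatedly and they keep evicting each other.
  stride_bytes = nsets * line_bytes
  do i = 0, 7
     addr = i * stride_bytes
     set  = mod(addr / line_bytes, nsets)
     print '(a,i2,a,i6,a,i4)', 'access ', i, '  addr ', addr, '  -> set ', set
  end do
end program stride_sets

The usual cure is to pad the leading array dimension so the stride is no longer an exact multiple of that figure.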
I have written *one* computed GOTO in my life. On the Alpha, the code was 10% faster if I used a stride of 2 and two temporary variables. The killer in the code (Kirchhoff pre-stack time migration) was computing square roots. I observed that a linear approximation was good enough, so I only recomputed the square root at intervals and used a linear approximation in between. But it meant that I had to continually recompute where to calculate the square root and call a different subroutine to do that segment of the integration, which was most cleanly implemented with a computed GOTO.
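A minimal sketch of that square-root trick (not the original code; the names, the segment length, and the integrand are all illustrative, and the computed-GOTO dispatch to a separate routine is folded into a single loop here):

subroutine segmented_sum(n, nseg, t2, w, acc)
  ! Sketch only: take the exact sqrt at segment boundaries and use a
  ! linear approximation in between, so no sqrt appears in the hot loop.
  implicit none
  integer, intent(in)  :: n, nseg   ! samples, and samples per segment
  real,    intent(in)  :: t2(n)     ! values whose square roots are needed
  real,    intent(in)  :: w(n)      ! weights being summed
  real,    intent(out) :: acc
  integer :: i, j, k
  real :: s0, s1, ds, s
  acc = 0.0
  do i = 1, n - 1, nseg
     j  = min(i + nseg, n)
     s0 = sqrt(t2(i))               ! exact values at the segment ends
     s1 = sqrt(t2(j))
     ds = (s1 - s0) / real(j - i)   ! slope of the linear approximation
     s  = s0
     do k = i, j - 1
        acc = acc + w(k) * s        ! no sqrt inside the hot loop
        s   = s + ds
     end do
  end do
  acc = acc + w(n) * sqrt(t2(n))    ! last sample, exactly
end subroutine segmented_sum

The segment length trades accuracy against speed: shorter segments mean more exact square roots and less of the win.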
The point of all this is quite simple:
If you need to do high performance linear algebra you have to count cycles for *every* operation from the prefetch to the write and make use of whatever parallelism the hardware provides.
I had an interview in 2006 with a startup called SiCortex which had a nifty design for a very interesting processor based on the MIPS. After we had scheduled my interview I got a bunch of documentation to read. The interview turned into $2-3K of free consulting. However, I did get to spend the weekend with a friend who lived nearby.
I gave an hour-long talk in which I explained the data movement of all the migration algorithms and why their design could not handle it well. Later I had a cycle-by-cycle discussion with the chief architect and a discussion with the CEO in which I explained why what they had designed, despite being quite marvelous, was not suitable for seismic processing. The friend who had given my name to the head hunter was doing hands-on evaluations of a prototype and came to precisely the same conclusions after several months of testing. Needless to say, I did not get hired. The company went under a few years later.