Now, since implementing them costs time and effort, I wonder if they are worth it.
Surprisingly enough, it depends heavily on the number of registers available.
When there are only a few registers available (say, 80386 and compatibles), complex indirect addressing modes can keep the instruction count down, and avoid having to store temporary variables on the stack.
If you have lots of registers, and a fast "load-effective-address" instruction (again, similar to 80386), so that most of those registers can be used as pointers, it turns out that those complex indirect addressing modes aren't that useful anymore.
The next problem is finding a compiler that generates efficient code for either case. As an example, GCC is not particularly good at register use and reuse; even in optimized (-O2) code, you often see register moves and assignments that are completely ... stupid.
This means that the answer to the underlying question ("which EA modes ...") is something annoying like "it depends on what addressing modes your compiler needs to generate efficient code".
I know that GCC generates slightly better assembly from pointer-based C than from the equivalent expressions using array indexing. Fortran (say F95 and later) is obviously different, as it has an efficient way of expressing array slicing.
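To make the pointer-versus-indexing point concrete, here is a minimal sketch of the two equivalent loop forms (the function names are mine, just for illustration). Both compute the same sum; the pointer form gives the compiler an address that advances by a simple increment, instead of an index it must scale and add to the base on each access.

```c
#include <stddef.h>

/* Indexed form: the compiler has to form a + i*sizeof(long),
 * unless strength reduction rewrites it into the pointer form. */
long sum_indexed(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer form: the address is a loop variable of its own,
 * bumped by one element per iteration. */
long sum_pointer(const long *a, size_t n)
{
    long s = 0;
    for (const long *p = a; p < a + n; p++)
        s += *p;
    return s;
}
```

Whether the generated code actually differs depends on the compiler version and optimization level, which is rather the point being made above.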
All this waffle means that, in my opinion, you should consider writing e.g. a GCC port for the tentative instruction set architecture, and examining the code generated in each case for some typical constructs. Yeah, I know; I'm not being particularly helpful here, sorry.
(I need to go look at Arise-v2 in detail before commenting again, I guess.)
I am wondering if EA = reg0 + reg1 * scale (the scale being a 16-bit immediate constant) could be really useful for easily accessing a cell in a matrix.
The scale is not always a constant, so do consider EA = reg0 + reg1 * reg2 instead. (Not only is the instruction shorter, but it is much more versatile this way.) Also, all of those could be negative as well.
That said, I do believe that if you had a fast (whatever your minimum instruction duration is) fused multiply-add instruction (regN = regA + regB * regC) and/or a related load-effective-address instruction (regN = imm32 + regA<<imm5 + regB * regC) for signed integers, that would do just as well. If simplifying the addressing modes would allow that, I'd definitely go for it.
In general linear algebra (say, the same niche BLAS/LAPACK or GSL target), using a separate signed stride for both row and column advancement allows advanced features not supported even by GSL. I have an outline here on stackoverflow in C. Unlike in GSL, "views" are first-class matrices, indistinguishable from the matrix they view into; this not only simplifies code, but it makes it easier for "non-professional programmers" (scientists) to write efficient linear algebra code.
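The signed-stride idea can be sketched in a few lines of C (this is my own minimal descriptor, not GSL's API and not necessarily the layout in the outline referred to above): a matrix is just a data pointer plus signed row/column strides, so a view is another matrix of the same type over the same storage.

```c
#include <stddef.h>

/* A matrix is a pointer plus dimensions and SIGNED element strides.
 * Negative strides flip an axis; swapped strides transpose.  A view
 * is just another descriptor into the same storage. */
typedef struct {
    double   *data;
    size_t    rows, cols;
    ptrdiff_t rstride, cstride;
} matrix;

double *mat_at(matrix m, size_t r, size_t c)
{
    return m.data + (ptrdiff_t)r * m.rstride + (ptrdiff_t)c * m.cstride;
}

/* O(1) transpose: swap dimensions and strides; no data is copied,
 * and the result is a full-fledged matrix, not a special "view" type. */
matrix mat_transpose(matrix m)
{
    matrix t = { m.data, m.cols, m.rows, m.cstride, m.rstride };
    return t;
}
```

Every routine that takes a `matrix` works unchanged on transposes, submatrices, and reversed-axis views, which is the "first-class" property in question.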
(It's just that BLAS/LAPACK/ATLAS/GSL have such a devoted following that even showing practical examples of better, easier, and more efficient code is not enough to budge the set of scientists I've worked with. Meaning no funding, and therefore no push on my part.)
Another set of users work with very large sparse matrices, but the known implementations do not really need any special addressing modes for those. (These appear in e.g. materials physics, where the matrix contains the locations and velocities of each atom in the cluster or lattice, and the eigenvalues provide the phonon frequencies.)