Ya, I know, bit of a zombie thread, but the Mill has been of particular interest to me for some time. I'm not certain it will succeed, but it certainly seems like it has a shot. Just a few thoughts on some of the details being missed in this thread.
The problem is that you can't implement belt as a standard RAM.
The belt isn't RAM, it's registers.
It's neither RAM nor registers*. The closest equivalent in a modern CPU is the bypass network: the data is held on the output latches of the functional units, and there's a one or two stage NxM mux network moving it around. The belt labels just provide the semantics needed to route the right data. Larger members are slated to include a rename/tag network, while smaller members may track positions with some associative memory closer to the decode unit and a rotating index. Gate-level details can and will vary by member, as the trade-offs and the options change with scale.
*The 32-position belt version does require a fairly small register file to provide enough physical operand locations to let the spiller work correctly on function calls.
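To make the naming idea concrete, here's a minimal sketch of the belt's semantics only, not the hardware: results are addressed by temporal position, new drops push older values toward the tail, and the oldest fall off. The Belt class and the drop/read names are mine for illustration, not anything from the actual Mill ISA.

```python
# Toy model of belt semantics (illustrative only, not the hardware):
# a fixed number of positions, newest drop at position 0.
from collections import deque

class Belt:
    def __init__(self, belt_length=8):
        self.slots = deque(maxlen=belt_length)  # front = b0 = newest drop

    def drop(self, value):
        # A functional-unit result is "dropped" onto the front of the belt.
        # In hardware nothing moves; only the position-to-latch mapping changes.
        self.slots.appendleft(value)

    def read(self, pos):
        # Operands are addressed by belt position, newest first (b0, b1, ...).
        return self.slots[pos]

belt = Belt()
belt.drop(3)                              # b0 = 3
belt.drop(4)                              # b0 = 4, b1 = 3
belt.drop(belt.read(0) + belt.read(1))    # "add b0, b1" drops 7 at b0
print(belt.read(0))                       # 7
```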
I'm not at all convinced that the inherent fine-grained parallelism exists in most computing tasks to make "super-super scalar" CPU architecture worthwhile for general purpose computing. A different architecture isn't going to change that.
...
And for the cases where there is a lot of parallelism we have GPUs and SIMD CPU instructions, or specialist DSP processors which expose more of the pipeline...
Most code is in loops, and loops have unbounded ILP. If you are wide enough to pipeline and clever enough to vectorise most loops, there's a lot of ILP there.
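A quick illustration of the distinction, in plain Python rather than anything Mill-specific: an element-wise loop has no cross-iteration dependences, so a wide enough machine can have as many iterations in flight as it likes, while a loop-carried dependence caps the available ILP.

```python
# Why loop ILP is effectively unbounded for element-wise work, and not for
# dependence chains (toy example, nothing Mill-specific).
def saxpy(a, x, y):
    # No iteration reads anything a previous iteration wrote:
    # all n multiply-adds could issue in parallel.
    return [a * xi + yi for xi, yi in zip(x, y)]

def prefix_sum(x):
    # Each iteration needs the previous one's result, so ILP is bounded
    # by the dependence chain (unless the compiler re-associates it).
    out, acc = [], 0
    for xi in x:
        acc += xi
        out.append(acc)
    return out

print(saxpy(2.0, [1, 2, 3], [10, 20, 30]))  # [12.0, 24.0, 36.0]
print(prefix_sum([1, 2, 3, 4]))             # [1, 3, 6, 10]
```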
And it can also capture common data flows in a single instruction to approximate the advantage OoO actually provides (which they claim comes mostly down to getting better schedules). And though it won't beat an OoO core of equal width in this regard, it can go much wider without melting the die.
And of course it's not going to beat a DSP or GPU at what they already do well. Many of the core developers were on the Philips TriMedia team; they are trying to adapt DSP internals to general purpose code, which is where they get their 10x claim.
Those with long experience were pointing out in 1996 that the Itanic's performance required fundamental advances that very smart people had been searching for since the 60s, largely without success. To presume that HP had smarter people was hubris.
Itanium was also released before it was fully ready. It wasn't all that wide, and even then most code didn't use the width, leaving a lot of no-ops in the instruction stream. Cache misses were extremely painful, and dealing with control flow was awkward. The rotating registers were neat, but the compiler didn't really know how to leverage them, wasn't great at software-pipelining real code, and vectorization was also limited. And even when you could pipeline non-trivial loops, the prologue and epilogue were so big you'd choke the cache just loading the instruction stream. I also believe clean function calls/context switches with that number of architectural registers were fairly painful.
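Rough back-of-the-envelope for why the fill/drain code hurts (my own toy arithmetic, not Itanium's actual scheduler): with S pipeline stages, the prologue and epilogue each carry S(S-1)/2 stage instances of straight-line code wrapped around the steady-state kernel.

```python
# Toy model of software-pipelined loop code size (illustrative only):
# with `stages` pipeline stages, the prologue issues 1, 2, ..., stages-1
# stage instances while the pipe fills, and the epilogue mirrors it while
# the pipe drains; only the kernel is reused across iterations.
def pipelined_code_size(stages, ops_per_stage):
    kernel = stages * ops_per_stage
    prologue = sum(range(1, stages)) * ops_per_stage   # S(S-1)/2 stage instances
    epilogue = prologue
    return prologue, kernel, epilogue

pro, ker, epi = pipelined_code_size(stages=6, ops_per_stage=4)
print(pro, ker, epi)   # 60 24 60: 120 ops of fill/drain around a 24-op kernel
```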
I think the Mill has done pretty well with control flow and with being easy to pipeline/vectorise. I also think they've tackled code size fairly well with elided no-ops and very compact encodings.
I think it'll live or die on how well the split load actually works, how good the prefetch is, and what exactly their still-undisclosed stream mechanism is. Also up in the air is whether the multi-core coherence protocol is any good. They've hinted at the ability for cycle-accurate determinism between cores if you want it (like a DSP).
I've got a feeling that 8 or 16 positions applies to low-end models, and actual models that target performance will have a much longer belt. In that respect it is not that different from a typical stack-based machine, and you need quite a bit of stack to do any real calculations. Constantly saving things that are about to fall off is a hassle and a time sink.
Don't you actually need to save the whole belt contents?
No matter how ingenious your system is, you need to save the whole state; there is no working around that. I find that hard to believe without some evidence; it reads as hand-waving.
I think there are operations that would require massive simultaneous (from a programmer point of view) changes to the belt. IIRC, it was something related to function calls.
Generally you aren't using registers as long-term variable storage*. They're short-term intermediate storage, usually used fewer than two times. The compiler will try to schedule producers as close to consumers as possible, and the specializer will automatically insert spill/fill to the scratchpad as needed when this isn't possible. The scratchpad is on-die SRAM, but can be backed to RAM on interrupt/function call without OS intervention. And for very long-term intermediate results you can spill/fill to system RAM.
*Though a few mathematical libraries will do this, it's not the common case.
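As a rough sketch of what spill/fill looks like at the semantic level (a toy deque-based belt; spill, fill and the scratchpad dict are illustrative names, not the real ops):

```python
# Illustrative spill/fill: a value whose consumer is too far from its
# producer to survive on the belt is copied to scratch and re-dropped later.
from collections import deque

scratchpad = {}                 # stands in for the on-die SRAM scratchpad
belt = deque(maxlen=4)          # tiny belt: index 0 is the newest drop

def drop(value):
    belt.appendleft(value)

def spill(pos, slot):
    # Copy a belt operand out to scratch before it falls off the end.
    scratchpad[slot] = belt[pos]

def fill(slot):
    # Re-drop the saved operand, making it position 0 again.
    drop(scratchpad[slot])

drop(42)              # a long-lived intermediate result
spill(0, slot=0)      # its consumer is far away, so the specializer spills it
for v in range(10):   # many unrelated drops later, 42 has long fallen off
    drop(v)
fill(slot=0)          # fill brings it back to the front of the belt
print(belt[0])        # 42
```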
On a function call the state is saved, but it's done asynchronously. To make this work there are actually twice as many places to hold operands as there are belt positions, and, like the position names, each operand has a frame associated with it. Scratch is allocated for the frame stack, and operands can trickle out over the next few cycles. Likewise, in-flight operations retire normally but are moved into the frame stack with a note about their expected timing.
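Here's a toy software model of that behaviour as I understand it from the talks (SpillerModel, call() and cycle() are my own names; the real mechanism is hardware, not code):

```python
# Toy model of the asynchronous "spiller" on call: operands are tagged with a
# frame id, a call just bumps the current frame, and the old frame's operands
# trickle out to backing storage over the following cycles.
from collections import deque

class SpillerModel:
    def __init__(self):
        self.frame = 0
        self.latches = []            # (frame_id, value); roughly 2x belt positions of slots
        self.spill_queue = deque()   # operands waiting to trickle out
        self.saved = {}              # spiller / scratch backing store

    def drop(self, value):
        # Results are tagged with the frame they belong to.
        self.latches.append((self.frame, value))

    def call(self):
        # The caller's operands are marked for saving, not copied right now.
        self.spill_queue.extend(op for op in self.latches if op[0] == self.frame)
        self.frame += 1

    def cycle(self):
        # Each cycle the spiller drains a couple of pending operands to storage.
        for _ in range(2):
            if self.spill_queue:
                fid, val = self.spill_queue.popleft()
                self.saved.setdefault(fid, []).append(val)

m = SpillerModel()
m.drop(1)
m.drop(2)
m.call()           # enter the callee; the caller's belt state drains in the background
m.cycle()
print(m.saved)     # {0: [1, 2]}
```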
Function calls can be done by explicitly naming which belt positions to pass and the order to pass them in. Functions can also be called by passing the whole belt. You can run into issues on this with split-join program flow, e.g. if x, a->b->c, else a->c. The compiler will try to if-convert and use speculable ops to eliminate the branch. If a branch is unavoidable, then c can only expect one sort of belt configuration, and you need to issue a conform op on one of the transitions to c. Best case, you have profile info or the compiler can guess which path is more common and only conform on the less common case. Even then it's a fairly cheap operation, as it just reconfigures the mux network's control logic rather than actually moving data around all over the place.
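A small sketch of what a conform-style op does at the join (my own toy model; only the mapping of names to positions changes, no operand data is copied):

```python
# Toy conform at a split/join: block c was compiled expecting one belt layout,
# so the path that produced a different layout renumbers its positions.
from collections import deque

def conform(belt, order):
    # order[i] = the old position that becomes new position i.
    return deque([belt[i] for i in order], maxlen=belt.maxlen)

# Block c expects: b0 = x, b1 = y
path_a = deque(["x", "y"], maxlen=8)    # the common path already matches
path_b = deque(["y", "x"], maxlen=8)    # the other path dropped them in the other order
path_b = conform(path_b, order=[1, 0])  # cheap fix-up on the less common path only
assert list(path_a) == list(path_b) == ["x", "y"]
print(list(path_b))                     # ['x', 'y']
```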