Author Topic: Mill CPU Architecture  (Read 35658 times)


Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Mill CPU Architecture
« Reply #100 on: March 01, 2018, 07:54:34 pm »
Quote
I don't think so. I'm not saying that using Java today is somehow unjustified or bad, especially given that it is realistically cheaper to buy a better CPU and an additional 8 GB of memory than to redesign software to consume less.

All I'm saying is that changing conditions will push businesses to make different decisions. When it is no longer cheaper to add more memory, they will pay for developing better software.

But that's also a problem for the Mill. Right now it is more expensive to adopt the Mill, even if it comes out. For now, whatever benefits it brings can be overcome by throwing more money at existing hardware.

I agree!  Nobody suggests that the x86-64, based on the x86, based on the 8080, is a nice architecture.  But it IS a given, and it is worth the effort to write more or less optimal code for it, at least in the OS.

In that regard ARM is getting very popular.  It's a known architecture (at the core) to which vendors can add peripherals, most of which are also licensed from ARM.  It is worth the effort to write decent code for the platform.

If the Mill were released today, who would care?  We have processors, we have a stable platform (or two) to aim code at, so why change?  Now, if the chip just smokes everything on the planet, NASA and the NSA will probably care, as will Lawrence Livermore and Los Alamos.  The rest of the world will keep chugging away on x86-64.  There is simply limited utility to a more advanced architecture and, today, PCs are like toasters: they are just an appliance.  Nobody cares what's inside as long as the software from decades back still runs.

Look at all the superior architectures that are dying: SPARC, MIPS, <whatever>.  There are still machines around but the x86-64 is king.

A lack of software alone will keep the Mill from going very far.  For better or worse, the x86-64 has achieved critical mass for computers and the ARM has done the same for appliances like cell phones and so on.

The .gov agencies above have always been on the bleeding edge.  They have routinely written their own OS and compilers.  For the rest of us, it would be like starting over in '75.  They hand you an 8080 and there isn't a shred of software around (at least at the hobby level).  Then Bill Gates releases BASIC...
« Last Edit: March 01, 2018, 07:56:09 pm by rstofer »
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1638
  • Country: nl
Re: Mill CPU Architecture
« Reply #101 on: March 01, 2018, 09:29:51 pm »
x86 survives entirely on legacy applications. The software industry has many programs written for x86, many of them closed-source, with vendors having no interest in supporting alternative platforms and instruction sets. Twelve years ago Apple was still making PowerPC desktops; that has stopped, and pretty much everything is x86 now. Any performance increase has had to come from speeding up x86.

Architecturally speaking, x86 is a PITA: variable-length instructions (memory was expensive back in the day; now it isn't), a CISC instruction set, many data hazards in a typical program limiting instruction-level parallelism, and many layers of legacy and extensions kept for compatibility with existing systems. (And the reason why an x86-powered PS4 game console is not a PC.)
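To make the data-hazard point concrete, here's a toy C illustration (my own, nothing x86-specific): the first loop is one serial dependency chain, so no amount of superscalar width helps; the second has four independent chains a wide core can overlap.

Code: [Select]
#include <stdio.h>

int main(void) {
    double a = 1.0000001, x = 1.0;
    for (int i = 0; i < 1000000; i++)
        x *= a;                   /* hazard: each multiply waits on the last */

    double x0 = 1.0, x1 = 1.0, x2 = 1.0, x3 = 1.0;
    for (int i = 0; i < 250000; i++) {
        x0 *= a; x1 *= a;         /* four independent chains: up to 4x the   */
        x2 *= a; x3 *= a;         /* ILP on a wide enough machine            */
    }
    printf("%g %g\n", x, x0 * x1 * x2 * x3);
    return 0;
}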

These issues have been "solved" to the point where we haven't seen much single-thread performance benefit in the last few years, but basically just by throwing more silicon at the problem: deep pipelines and instruction decoders, speculative execution and register renaming. All this overhead costs a lot of power to make the CPU fast.

The claimed "single-thread 10x power/performance" is perhaps obtainable (atleast a good portion of it); but many RISC (ARM) chips are also far more competitive in this ratio than x86, because is just so much less overhead in just decoding the instruction set. Hence why you see no x86 mobile phones, and probably never will, as there will likely always be a more competitive technical solution available.

As for building a CPU with competitive desktop performance: good luck. Intel and AMD (and ARM in their respective domain) have invested billions to get the products they make today, and as said, it's pretty well accepted that we've more or less reached a maximum in single-thread performance. Most benefits come from more cores and specific instruction sets (e.g. AVX2); but even then, throwing a thousand slow, power-efficient cores at the problem won't compete; any PC still needs the flexible, high peak performance of a single core.
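The "thousand slow cores" intuition is just Amdahl's law. A minimal sketch (the 5% serial fraction is an assumption for illustration, and it's before accounting for each small core being slower than the big one):

Code: [Select]
#include <stdio.h>

/* Amdahl's law: speedup on n cores = 1 / (s + (1 - s) / n),
 * where s is the serial fraction of the program. */
int main(void) {
    double s = 0.05;              /* assumed: 5% of the work is serial */
    int cores[] = { 1, 4, 16, 1000 };
    for (int i = 0; i < 4; i++) {
        int n = cores[i];
        printf("%4d cores: %.1fx\n", n, 1.0 / (s + (1.0 - s) / n));
    }
    return 0;                     /* 1000 cores top out near 20x */
}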

In terms of performance/power: yeah, sure, that can be improved over x86. But I don't think this is the metric a lot of architectures are benchmarked against (unless you're Google and can significantly save on your power bill).

VLIW has been attempted, like the Intel Itanium series, which had explicit parallelism in the programs, but that's now EOL as well.

If you really need high performance/power in a particular application, then going FPGA/DSP is the best bet for now. But that is such a niche industry, relatively speaking. The majority of the industry looks at productivity [to solve a customer's IT problem] as a more direct measure, and to them programmers' time is more expensive than buying more computational power.
« Last Edit: March 01, 2018, 10:15:58 pm by hans »
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Mill CPU Architecture
« Reply #102 on: March 01, 2018, 10:00:24 pm »
I broadly agree with most of Hans' points, but I'll add a few riders...

Those with long experience were pointing out in 1996 that the Itanic's performance required fundamental advances that very smart people had been searching for since the 60s, largely without success. To presume that HP had smarter people was hubris.

Performance per watt has been a significant metric since ~2000 for server farms and high-performance computing application domains. It is worth noting that HPC stresses any existing technology to its limits - and beyond. Hence problems found by the HPC community will often hit other people later.

Don't neglect the Sun Niagara processors. Their philosophy was to aim at "embarrassingly parallel" server applications, so they used many simple SPARC cores in parallel to get good aggregate performance at the expense of single-thread performance. Basically, if a single SPARC core had performance P, then a 16-core Niagara chip had performance ~16P. They replaced obscenely expensive OOO control logic with more cores.

The Mill is inspired by DSP processors, and attempts to bring their philosophy to general-purpose computing coded in conventional languages.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 
The following users thanked this post: hans

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Mill CPU Architecture
« Reply #103 on: March 01, 2018, 10:30:31 pm »
GNAT/Ada is *ONLY* for x86, with a little (not so stable) port for ARM.

IBM is going to sell the POWER9 as an "x86 alternative!!!". But a POWER9 workstation will cost at least 6K USD.
(and Ada won't run there; you'd need to dedicate some of the eight..sixteen POWER9 cores to QEMU/x86)

edit:
oh, and PowerPC and POWER4,5,6,7 are big endian;
POWER8/POWER9 can run little endian, and Linux distros now do ... just to make things incompatible
« Last Edit: March 02, 2018, 11:19:46 am by legacy »
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Mill CPU Architecture
« Reply #104 on: March 01, 2018, 11:22:54 pm »
Quote
GNAT/Ada is *ONLY* for x86, with a little (not so stable) port for ARM.

And here I thought it was supposed to be easy to port the runtime to any processor.  It must be possible because the military is doing something with the language, and I suspect it's for a processor other than an x86.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Mill CPU Architecture
« Reply #105 on: March 02, 2018, 04:15:56 am »
I'm not at all convinced that the inherent fine-grained parallelism exists in most computing tasks to make a "super-super scalar" CPU architecture worthwhile for general-purpose computing. A different architecture isn't going to change that.

After all, isn't that why Hyperthreading came about? The dependencies in generating and consuming data were unable to keep all of the CPU core's functional units busy all of the time. So HT gives us two threads running on the same CPU core at the same time, competing for things like integer ALUs, cache and memory bandwidth. The per-thread performance can drop somewhat (due to contention for shared resources), but the system overall might get 30% more work done.
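To put rough numbers on that 30% (a back-of-envelope model with assumed figures, not measurements): if contention cuts each of the two threads to 65% of its solo speed, the pair still completes 1.3x the work; at 50% HT only breaks even.

Code: [Select]
#include <stdio.h>

/* Toy SMT throughput model: two threads each run at a fraction f of
 * solo speed due to contention; aggregate throughput is 2 * f. */
int main(void) {
    double f[] = { 0.50, 0.65, 0.80 };    /* assumed per-thread fractions */
    for (int i = 0; i < 3; i++)
        printf("per-thread %.0f%% -> aggregate %.0f%%\n",
               f[i] * 100, 2 * f[i] * 100);
    return 0;
}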

And for the cases where there is a lot of parallelism we have GPUs and SIMD CPU instructions, or specialist DSP processors which expose more of the pipeline...


Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Mill CPU Architecture
« Reply #106 on: March 02, 2018, 09:59:01 am »
Quote
I'm not at all convinced that the inherent fine-grained parallelism exists in most computing tasks to make a "super-super scalar" CPU architecture worthwhile for general-purpose computing. A different architecture isn't going to change that.

It does remain to be proven, but AIUI the Mill team has investigated it and found some useful parallelism.

Quote
After all, isn't that why Hyperthreading came about? The dependencies in generating and consuming data were unable to keep all of the CPU core's functional units busy all of the time. So HT gives us two threads running on the same CPU core at the same time, competing for things like integer ALUs, cache and memory bandwidth. The per-thread performance can drop somewhat (due to contention for shared resources), but the system overall might get 30% more work done.

Yes and no, but there are more differences than similarities.

The HPC mob, who traditionally stress whatever technology is available, found that HT slowed down their computations, so they disabled it.

HT used conventional OOO-SC (out-of-order superscalar) cores. If you want HT "done properly", look at Sun's Niagara processors. They really did speed up the aggregate throughput of server-class workloads. The single-thread performance was lower, but that was irrelevant in those workloads.

In the embedded arena, the XMOS xCOREs are similar (and give cycle-accurate hard realtime guarantees).

Quote
And for the cases where there is a lot of parallelism we have GPUs and SIMD CPU instructions, or specialist DSP processors which expose more of the pipeline...

SIMD is entirely different and in no way comparable, but you know that. DSP is the inspiration for the Mill, and they are trying to bring that "mentality" to general-purpose computing.

One point to beware of is that the concepts of "instruction" and "operation" are very fluid. AIUI the Mill's definition is closer to that of x86 internal micro-operations and to DSP instructions.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: Mill CPU Architecture
« Reply #107 on: March 02, 2018, 11:17:39 am »
Quote
And here I thought it was supposed to be easy to port the runtime to any processor.  It must be possible because the military is doing something with the language, and I suspect it's for a processor other than an x86.

Frankly, we have Ada for SPARC and POWER machines running SunOS and AIX, but ... they cost 20K euros per license

There is absolutely nothing in the opensource world of gcc && ada-core, and llvm doesn't have support either =(

So, yes, in theory .....
 

Offline WorBlux

  • Newbie
  • Posts: 2
  • Country: us
Re: Mill CPU Architecture
« Reply #108 on: February 01, 2020, 05:09:39 am »
Yeah, I know, bit of a zombie thread, but the Mill has been of particular interest to me for some time. I'm not certain it will succeed, but it certainly seems like it has a shot. Just a few thoughts about some of the details being missed in this thread.

Quote
The problem is that you can't implement the belt as a standard RAM.
The belt isn't RAM, it's registers.

It's neither RAM nor registers*. The equivalent in a modern CPU is the bypass network. The data is held on the output latches of the functional units, and there's a one- or two-stage NxM mux network moving data around. The belt labels just provide the semantics needed to route the right data. Larger members are slated to include a rename/tag network, while smaller members may track positions with some associative memory closer to the decode unit and a rotating index. Gate-level details can and will vary by member, as the trade-offs and the options change with scale.

*The 32-position belt version does require a fairly small register file to provide enough physical operand locations to let the spiller work correctly on function calls.
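For anyone who hasn't seen the belt before, a toy C model of the semantics only (the real hardware is the mux/bypass network described above, not an array that shifts data): results drop onto the front (b0 = newest) and old values fall off the far end.

Code: [Select]
#include <stdio.h>

#define BELT_LEN 8

/* Toy model of belt semantics: b0 is the newest result, bN counts
 * back in time, and values silently fall off after BELT_LEN drops. */
typedef struct {
    long slot[BELT_LEN];
    int  head;                            /* index of newest value  */
} Belt;

static void drop(Belt *b, long v) {       /* a result lands at b0   */
    b->head = (b->head + BELT_LEN - 1) % BELT_LEN;
    b->slot[b->head] = v;
}

static long get(const Belt *b, int pos) { /* read belt position bN  */
    return b->slot[(b->head + pos) % BELT_LEN];
}

int main(void) {
    Belt b = { { 0 }, 0 };
    drop(&b, 2);                          /* belt: 2                */
    drop(&b, 3);                          /* belt: 3 2              */
    drop(&b, get(&b, 0) + get(&b, 1));    /* add b0+b1, drop 5      */
    printf("b0 = %ld\n", get(&b, 0));     /* prints b0 = 5          */
    return 0;
}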

Quote
I'm not at all convinced that the inherent fine-grained parallelism exists in most computing tasks to make a "super-super scalar" CPU architecture worthwhile for general-purpose computing. A different architecture isn't going to change that.
...
And for the cases where there is a lot of parallelism we have GPUs and SIMD CPU instructions, or specialist DSP processors which expose more of the pipeline...

Most code is in loops, and loops have unbounded ILP. If you are wide enough to pipeline and clever enough to vectorise most loops, there's a lot of ILP there.
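A concrete example of what exposing loop ILP buys (my illustration, not Mill code): a naive reduction is one long add chain, but splitting it into four accumulators gives the machine four independent chains to pipeline or vectorise.

Code: [Select]
#include <stdio.h>

/* Naive version: s += a[i] is a single serial dependency chain.
 * This version exposes ILP with four independent accumulators. */
double sum4(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];        /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double a[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    printf("%g\n", sum4(a, 10));          /* prints 55 */
    return 0;
}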

And then it can also capture common data flows in a single instruction to approximate the advantage OoO actually provides (which they claim is mostly from better static schedules). And though it won't beat an OoO machine of equal width in this regard, it can go much wider without melting the die.

And of course it's not going to beat a DSP or GPU at what they already do well. Many of the core developers were on the Philips TriMedia team. They are trying to adapt DSP internals to general-purpose code, which is where they get their 10x claim.

Quote
Those with long experience were pointing out in 1996 that the Itanic's performance required fundamental advances that very smart people had been searching for since the 60s, largely without success. To presume that HP had smarter people was hubris.

Itanium was also released before it was fully ready. It wasn't all that wide, and even then most code didn't use the width, putting a lot of no-ops in the instruction stream. Cache misses were extremely painful, and dealing with control flow was awkward. The rotating registers were neat, but the compiler didn't really know how to leverage them and wasn't great at pipelining real code, and vectorization was also limited. And even when you could pipeline non-trivial loops, the prelude and postlude were so big you'd choke out the cache loading the instruction stream. I also believe clean function calls/context switches with that number of architectural registers were fairly painful as well.

I think the Mill has done pretty well with control flow and being easy to pipeline/vectorise. I also think they've tackled code length fairly well with elided no-ops and very compact encodings.

I think it'll live or die on how well the split load actually works, how good the prefetch is, and what exactly their still-undisclosed stream mechanism is. Also up in the air is whether the multi-core coherence protocol is any good. They hinted at the ability for cycle-accurate determinism between cores if you want it (like a DSP).

Quote
I've got a feeling that 8 or 16 positions applies to low-end models and actual models that target performance will have a much longer belt. In that respect it is not that different from a typical stack-based machine, and you need quite a bit of stack to do any real calculations. Constantly saving things that are about to fall off is a hassle and a time sink.

Don't you actually need to save the whole belt contents?
No matter how ingenious your system is, you need to save the whole state; there is no working around that. I find that hand-waving hard to believe without some evidence.

I think there are operations that would require massive simultaneous (from a programmer's point of view) changes to the belt. IIRC, it was something related to function calls.

Generally you aren't using registers as long-term variable storage*. They're short-term intermediate storage, usually used less than two times. The compiler will try to schedule producers as close to consumers as possible, and the specializer will automatically insert spill/fill to the scratchpad as needed when this isn't possible. This scratchpad is on-die SRAM, but can be backed to RAM on interrupt/function call without OS intervention. And for very long-term intermediate results you can spill/fill to system RAM.

*Though a few mathematical libraries will do this, it's not the common case.

On function call the state is saved, but it's done asynchronously. To do this there are actually twice as many places to hold operands as there are belt positions, and, like the names, a frame is associated with each operand. Scratch is allocated for the frame stack, and operands can trickle out over the next few cycles. Likewise, in-flight operations retire normally, but are moved into the frame stack with a note about expected timing.

Function calls can be done by explicitly naming belt positions to pass and the order to pass them. Functions can also be called by passing the whole belt. You can run into issues with this on split-join program flow, e.g. if x, a->b->c, else a->c. The compiler will try to if-convert and use speculable ops to eliminate the branch. If a branch is unavoidable, then c can only expect one sort of belt config, and you need to issue a conform op on one of the transitions to c.  Best case, you have profile info or the compiler can guess which path is more common and only conform for the less common case. Even then it's a fairly cheap operation, as it just reconfigures the mux network control logic and isn't actually moving data around all over the place.
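For readers unfamiliar with if-conversion, a quick C sketch of the general technique (not Mill syntax; the ternary is just how you spell a select in C, and compilers typically emit a conditional move for it):

Code: [Select]
#include <stdio.h>

int with_branch(int x) {            /* split-join flow: a->b->c / a->c   */
    if (x > 0)
        return x * 3 + 1;
    return x;
}

int if_converted(int x) {
    int t = x * 3 + 1;              /* speculable op: safe to run always */
    return (x > 0) ? t : x;         /* select replaces the branch        */
}

int main(void) {
    for (int x = -2; x <= 2; x++)
        printf("%d: %d %d\n", x, with_branch(x), if_converted(x));
    return 0;
}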

 

