Author Topic: Why not use interleaving technique on multi-core processor ?  (Read 6123 times)


Offline tonygetTopic starter

  • Contributor
  • Posts: 13
Why not use interleaving technique on multi-core processor ?
« on: January 05, 2015, 03:09:28 am »
From my understanding, current multi-core processors rely on software parallelisation: individual cores may execute separate tasks/threads concurrently. The disadvantage is that software which is not optimized for multi-threading can't fully utilize this feature; nor, for example, can a numerical integration algorithm.

Inspired by the ADC interleaved sampling technique employed by oscilloscopes, I'm wondering why not use the same technique on a CPU? For instance, the clock period of a 1GHz processor is 1ns. If you combined two cores with their individual clocks set 0.5ns apart, so that they take turns executing instructions, you would effectively get a 2GHz processor, wouldn't you?
« Last Edit: January 05, 2015, 03:13:49 am by tonyget »
 

Offline Marco

  • Super Contributor
  • ***
  • Posts: 6716
  • Country: nl
Re: Why not use interleaving technique on multi-core processor ?
« Reply #1 on: January 05, 2015, 04:04:43 am »
Instructions are not generally independent, AND their "execution" takes more than 1 cycle (it is spread through a pipeline). Superscalar processors already try to execute multiple instructions at the same time, and they fail more often than not ... throwing another core into the mix, with its large communication latencies, won't help.
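To make Marco's dependency point concrete, here is a toy model (not a simulation of any real CPU) of a fully dependent instruction chain. Because each "instruction" needs the previous one's result, a second core with an offset clock gains nothing:

```python
# Toy model: a chain of dependent "instructions", each taking 1 cycle.
# With a true data dependency, instruction i cannot start until i-1 has
# finished, so a second core offset by half a cycle just waits.

def serial_cycles(n_instructions, latency=1):
    """Cycles to run a fully dependent chain on one core."""
    return n_instructions * latency

def interleaved_cycles(n_instructions, latency=1, cores=2):
    """Cycles for the same dependent chain 'interleaved' across cores.
    `cores` is intentionally unused: the length of the dependency
    chain, not the number of cores, sets the total time."""
    finish = 0.0
    for i in range(n_instructions):
        start = finish            # must wait for the previous result
        finish = start + latency  # regardless of which core runs it
    return finish

# The offset-clock scheme buys exactly nothing on a dependent chain.
assert serial_cycles(1000) == interleaved_cycles(1000)
```

Only when instructions are independent can extra hardware issue them in parallel, which is exactly what superscalar designs already attempt.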

In fact the reverse of what you suggest is more frequently done: use the resources of a single core to execute two threads (Hyperthreading, for instance).
« Last Edit: January 05, 2015, 04:06:19 am by Marco »
 

Offline Psi

  • Super Contributor
  • ***
  • Posts: 9930
  • Country: nz
Re: Why not use interleaving technique on multi-core processor ?
« Reply #2 on: January 05, 2015, 05:27:31 am »
It's not uncommon for a CPU to "take a few guesses" at what the next microinstruction could be.
Or what operation might be done with the result of the current operation.

It then executes the most likely guesses and uses the correct one once that's known.

I forget what it's called and which CPUs do it.
"Branch prediction" comes to mind, but I think that's more to do with if/else jump prediction.
« Last Edit: January 05, 2015, 05:30:49 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3639
  • Country: us
Re: Why not use interleaving technique on multi-core processor ?
« Reply #3 on: January 05, 2015, 06:17:35 am »
The technique called Thread-Level Speculation can be used to apply multiple processors (or cores) to a program written as a single thread. The idea is that at (some subset of) conditional branches, the program forks a thread for each path. The threads are later joined and the result from the branch that was actually taken is used. There were many papers written on this subject in the '00s, and several research CPU designs to support it were tested.
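A minimal sketch of the fork-at-a-branch idea helius describes, using ordinary Python threads as stand-ins for speculative hardware threads (the function names are made up for illustration):

```python
# Thread-level speculation sketch: fork a worker for each branch path,
# join both, and commit only the result from the path actually taken.
import threading

def tls_branch(condition, then_fn, else_fn):
    results = {}
    t1 = threading.Thread(target=lambda: results.__setitem__("then", then_fn()))
    t2 = threading.Thread(target=lambda: results.__setitem__("else", else_fn()))
    t1.start(); t2.start()        # both paths run speculatively
    t1.join(); t2.join()
    # Commit the correct path; the other thread's work is discarded.
    return results["then"] if condition else results["else"]

assert tls_branch(True,  lambda: 2 * 21, lambda: -1) == 42
assert tls_branch(False, lambda: -1, lambda: sum(range(4))) == 6
```

The hard part the research papers wrestle with, which this sketch ignores, is speculative threads that write memory: those side effects must be buffered and rolled back if the path is squashed.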
 

Offline tszaboo

  • Super Contributor
  • ***
  • Posts: 7369
  • Country: nl
  • Current job: ATEX product design
Re: Why not use interleaving technique on multi-core processor ?
« Reply #4 on: January 05, 2015, 07:38:29 am »
Communication between cores is rather slow. They share only the L3 cache, which is slower than L2, so it can take quite a few clock cycles to send data between them.
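A back-of-envelope model of that cost. The latency numbers below are illustrative placeholders, not measured values for any real CPU:

```python
# Why bouncing a dependency chain between cores hurts: each cross-core
# handoff pays a shared-cache round trip instead of a private-L1 hit.
L1_HIT = 4    # cycles, private cache (illustrative)
L3_HIT = 40   # cycles, shared cache, where cross-core transfers land

def chain_cost(n_ops, handoffs, op_cost=1):
    """Cost of n dependent ops with `handoffs` cross-core transfers."""
    local  = (n_ops - handoffs) * (op_cost + L1_HIT)
    remote = handoffs * (op_cost + L3_HIT)
    return local + remote

one_core  = chain_cost(100, handoffs=0)    # all data stays in L1
ping_pong = chain_cost(100, handoffs=100)  # every result changes cores
assert ping_pong > one_core                # interleaving makes it worse
```

With these numbers the ping-ponged chain is roughly eight times slower than keeping it on one core, which is the opposite of the hoped-for 2x speedup.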
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19468
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Why not use interleaving technique on multi-core processor ?
« Reply #5 on: January 05, 2015, 12:37:08 pm »
In desktop CPUs nowadays, access to cache and main memory is the bottleneck. Doubly so when synchronisation is required.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11622
  • Country: my
  • reassessing directives...
Re: Why not use interleaving technique on multi-core processor ?
« Reply #6 on: January 05, 2015, 01:33:59 pm »
From my understanding, the current multi-core processor relies on software parallelisation
unless you can design cores as intelligent as a human, able to decide which operation goes first and which goes later, that will always be the case. you can make presumptions, do pre-emption, or whatever you want to call it in your thesis; it will not come close to "human made software parallelisation"

The disadvantage is that softwares which are not optimized for multi-threading can't fully utilize this feature, so does numerical integration algorithm.
whose fault?

Inspired from ADC interleaved sampling technique employed by oscilloscopes, I'm wondering why not use the same technique on CPU ? For instance, the time interval of 1GHz processor is 1ns, if you combine two cores with individual clock set 0.5ns apart, they take turns to execute lines of code, you would effectively get a 2GHz processor, isn't it ?
dont equate this with interleaved "sampling", which can do the job in half the clock because each converter has its own RAM and entirely separate chip/hardware. and still, what you see on the screen is heavily undersampled from the collected (or realtime) data before it gets painted onto a 2D plane.

instruction execution is a causal process; data collection is not. you can still collect today's data even if you missed yesterday's. but in computing/data processing you process yesterday's data to get today's: if yesterday is garbage, you get garbage today. what you are asking for is a processor that can execute an instruction, and process the data to feed the next instruction, in half the clock period. in the end what would you get? a 2GHz single processor, in each core. if it were feasible, some phd at intel would have come up with it long ago. people have struggled for things like speculative/pre-emptive computing for ages, and it is still not "good enough" compared to "human made software parallelisation". and it is only applicable in a multitasking environment: two or several tasks/programs doing entirely different things on separate data. dont be too excited by that mumbo jumbo; even that is not close to the real deal of parallelisation across hard cores.

but well, we all know it's easier said than done, and there's nothing wrong with just saying (dreaming) it. when we get the grant we'll be there ;)
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5317
  • Country: gb
Re: Why not use interleaving technique on multi-core processor ?
« Reply #7 on: January 05, 2015, 01:36:06 pm »
The DSP world, for example TI's TMS320C6xxx series, uses VLIW (very long instruction word) to allow concurrent execution.

It's been nearly ten years since I worked on them, but my recollection is that they have 256-bit instruction words and run from on-chip RAM, so as native 32-bit processors they can run up to 8 concurrent instructions. It's up to the compiler (or a super nerd) to present mutually independent concurrent instructions. DSP lends itself quite well to closely coupled parallelism like this, due to its mutually independent array computations.
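The packing job Howardlong describes can be sketched roughly like this. It is a simplified model of what a VLIW compiler does, ignoring functional-unit constraints and write-after-write hazards:

```python
# Sketch of VLIW instruction packing: independent ops share one
# (up to 8-slot) packet; an op that reads a register written earlier
# in the same packet must start a new packet.

def pack_vliw(ops, width=8):
    """ops: list of (dest_reg, src_regs) tuples, in program order."""
    packets, current, written = [], [], set()
    for dest, srcs in ops:
        # Packet full, or a source was produced inside this packet:
        if len(current) == width or any(s in written for s in srcs):
            packets.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        packets.append(current)
    return packets

# a and b are independent and share a packet; c reads both, so it
# must go in the next packet.
ops = [("a", ()), ("b", ()), ("c", ("a", "b"))]
assert len(pack_vliw(ops)) == 2
```

Array code packs well precisely because the per-element operations rarely read each other's results inside a packet.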
 

Offline rob77

  • Super Contributor
  • ***
  • Posts: 2085
  • Country: sk
Re: Why not use interleaving technique on multi-core processor ?
« Reply #8 on: January 05, 2015, 02:18:42 pm »
The DSP world, for example TI's TMS320C6xxx series, uses VLIW (very long instruction word) to allow concurrent execution.

It's been nearly ten years since I worked on them, but my recollection is that they have 256-bit instruction words and run from on-chip RAM, so as native 32-bit processors they can run up to 8 concurrent instructions. It's up to the compiler (or a super nerd) to present mutually independent concurrent instructions. DSP lends itself quite well to closely coupled parallelism like this, due to its mutually independent array computations.

The same is done on Itanium processors, but it's up to the compiler to make use of this advantage. If the compiler fails to optimize for VLIW, the extra horsepower is left unused.
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3639
  • Country: us
Re: Why not use interleaving technique on multi-core processor ?
« Reply #9 on: January 05, 2015, 02:37:10 pm »
Some notes:
The Itanium is not VLIW; its instruction words are wide, but they are not issued in lock-step in VLIW fashion. Instead, there are bits in each word that specify the data dependency of the instructions in the word, and these instructions are issued dynamically by the processor based on the available IUs according to those bits. So the instructions packed into each word can be issued either sequentially or in parallel. This approach was called "Explicit parallelism" by the HP team that designed the architecture.
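The group-delimiting idea helius describes can be sketched as follows. The encoding here is invented for clarity (real IA-64 bundles hold three 41-bit instruction slots plus a template field); the point is just that a stop bit, not the bundle boundary, marks where an independent group ends:

```python
# EPIC-style sketch: each instruction carries a "stop" bit marking the
# end of an independent group; everything inside one group may be
# issued in parallel, and groups may span bundle boundaries.

def groups_from_stream(instrs):
    """instrs: list of (name, stop_bit). Returns parallel-issue groups."""
    groups, current = [], []
    for name, stop in instrs:
        current.append(name)
        if stop:                  # stop bit = explicit dependency barrier
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

# Three independent ops, then a store that depends on them:
stream = [("add", 0), ("ld", 0), ("mul", 1), ("st", 1)]
assert groups_from_stream(stream) == [["add", "ld", "mul"], ["st"]]
```

So the compiler still finds the parallelism, but the hardware retains freedom in how many instructions of a group it issues per cycle.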

Marco gave a good explanation of why the TS's idea cannot work.
 

Offline rob77

  • Super Contributor
  • ***
  • Posts: 2085
  • Country: sk
Re: Why not use interleaving technique on multi-core processor ?
« Reply #10 on: January 05, 2015, 02:49:58 pm »
Some notes:
The Itanium is not VLIW; its instruction words are wide, but they are not issued in lock-step in VLIW fashion. Instead, there are bits in each word that specify the data dependency of the instructions in the word, and these instructions are issued dynamically by the processor based on the available IUs according to those bits. So the instructions packed into each word can be issued either sequentially or in parallel. This approach was called "Explicit parallelism" by the HP team that designed the architecture.

Marco gave a good explanation of why the TS's idea cannot work.

from wikipedia: http://en.wikipedia.org/wiki/Very_long_instruction_word
Quote
Outside embedded processing markets, Intel's Itanium IA-64 EPIC and Elbrus 2000 appear as the only examples of a widely used VLIW CPU architectures. However, EPIC architecture is sometimes distinguished from a pure VLIW architecture, since EPIC advocates full instruction predication, rotating register files, and a very long instruction word that can encode non-parallel instruction groups.

And many other sources mark Itanium as VLIW, despite the fact that it is not pure VLIW... because they had to add many more features (predication, multi-core, etc.) to compensate for the software's inability to fully utilize the architecture's advantages... (what is the extra horsepower good for on your high-end server if the DB software is not able to use it?)
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19468
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Why not use interleaving technique on multi-core processor ?
« Reply #11 on: January 05, 2015, 07:15:51 pm »
The DSP world, for example TI's TMS320C6xxx series, uses VLIW (very long instruction word) to allow concurrent execution.

It's been nearly ten years since I worked on them, but my recollection is that they have 256-bit instruction words and run from on-chip RAM, so as native 32-bit processors they can run up to 8 concurrent instructions. It's up to the compiler (or a super nerd) to present mutually independent concurrent instructions. DSP lends itself quite well to closely coupled parallelism like this, due to its mutually independent array computations.

The same is done on Itanium processors, but it's up to the compiler to make use of this advantage. If the compiler fails to optimize for VLIW, the extra horsepower is left unused.

When, not if, all but one of the predicated results are thrown away, that horsepower and the associated joules have been wasted. Note that large datacentres tend to be power limited nowadays, and are often sited next to large bodies of water.
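The wasted-work point can be shown in a few lines. This is a toy sketch of predicated execution (compute both arms, select one by predicate), with an invented counter to make the discarded work visible:

```python
# Predication sketch: both arms of the branch are executed
# unconditionally and a predicate selects one result; the other arm's
# work (and its joules) is simply thrown away.

executed_ops = 0

def op(x):
    """Stand-in for one executed machine operation."""
    global executed_ops
    executed_ops += 1
    return x

def predicated_select(p, a, b):
    then_result = op(a * 2)   # executed whether or not p is true
    else_result = op(b + 1)   # executed whether or not p is true
    return then_result if p else else_result

result = predicated_select(True, 10, 99)
assert result == 20
assert executed_ops == 2      # two ops ran; one result was discarded
```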

People have been trying to automatically extract such parallelism as the Itanium requires since the 70s, without success.

Itanic is dead.
 

Offline rob77

  • Super Contributor
  • ***
  • Posts: 2085
  • Country: sk
Re: Why not use interleaving technique on multi-core processor ?
« Reply #12 on: January 06, 2015, 05:30:55 am »
Itanic is dead.

Yes, it is dead ;) but not completely yet; for really huge servers it's still one of the best available solutions (e.g. HP Superdome). Furthermore, smaller Itanium hardware is still used for running IO-intensive SAP installations and databases, but it's being replaced by x86 servers in this area (since the x86 architecture made a giant leap forward and overcame the north-bridge IO and memory bottleneck).
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19468
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Why not use interleaving technique on multi-core processor ?
« Reply #13 on: January 06, 2015, 08:38:52 am »
Itanic is dead.

Yes, it is dead ;) but not completely yet; for really huge servers it's still one of the best available solutions (e.g. HP Superdome). Furthermore, smaller Itanium hardware is still used for running IO-intensive SAP installations and databases, but it's being replaced by x86 servers in this area (since the x86 architecture made a giant leap forward and overcame the north-bridge IO and memory bottleneck).

Back in 2001-3 I had difficulty persuading a PA-RISC house that they shouldn't believe the hype about Itanic, and should consider x86 machines for future products. Persuading them became much less difficult when AMD's x86-64 plus HyperTransport arrived in their Sledgehammer processors. Sun's T1 processor was also impressive on the right workloads.

The era of presuming that coherent shared memory can be scaled has fortunately passed.

I expect the long-term future to be a processor based around 1-8 cores closely coupled to shared memory. Scalability will be based on explicit application-level message passing between such processors.
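The message-passing model tggzzz describes can be sketched with thread-safe queues standing in for the interconnect between such processors (threads here are only a stand-in for separate machines):

```python
# Message-passing sketch: each "processor" owns its data and cooperates
# only by exchanging messages; no coherent shared memory is assumed.
import threading, queue

def worker(inbox, outbox):
    while True:
        msg = inbox.get()
        if msg is None:           # shutdown sentinel
            break
        outbox.put(msg * msg)     # do local work, send the result back

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for n in (2, 3, 4):               # send work as messages
    inbox.put(n)
results = [outbox.get() for _ in range(3)]

inbox.put(None)                   # tell the worker to stop
t.join()
assert results == [4, 9, 16]
```

Because the only coupling is the two queues, the same structure scales across sockets or machines by swapping the queues for a network transport, which is exactly what shared-memory coherence cannot do.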

It will be interesting to see which languages/libraries are best able to aid/hinder programmers.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5317
  • Country: gb
Re: Why not use interleaving technique on multi-core processor ?
« Reply #14 on: January 06, 2015, 09:52:06 pm »
What I would say is that trying to figure out what is going on at the instruction level when an optimising compiler has got its hands on a VLIW like the TI C6000 series is like nothing else I've ever seen before or since.

While supposedly we should just trust the compiler, occasionally, just occasionally, there will be a functional compiler bug. Not often, but it does happen. The last functional compiler bug I found was about five years ago, but that was on a PIC, so hardly difficult to find and characterise.
 

