Author Topic: 8-bit uC - is there even a point? (Read 66287 times)

brucehoult · « **Reply #350 on:** October 09, 2018, 08:57:25 am »

Quote from: Fungus on October 09, 2018, 08:47:56 am

Quote from: brucehoult on October 09, 2018, 12:33:27 am
The relevant thing is the address space a program can conveniently use.

Most of the world would disagree with you, including the people who manufacture the chips.

I work for a small company that designs and manufactures CPUs, mostly at the moment ones in the 32 or 64 bit "microcontroller" class. Mostly I write compilers for them, but I also get a little involved in design of new instructions and hardware implementing them.

Before that I worked for a gigantic company that also designs and manufactures CPUs, also mostly writing compilers, but also being sometimes involved in the CPU design.

Fungus · « **Reply #351 on:** October 09, 2018, 09:25:21 am »

Quote from: brucehoult on October 09, 2018, 08:57:25 am

Quote from: Fungus on October 09, 2018, 08:47:56 am
Quote from: brucehoult on October 09, 2018, 12:33:27 am
The relevant thing is the address space a program can conveniently use.
Most of the world would disagree with you, including the people who manufacture the chips.

I work for a small company that designs and manufactures CPUs, mostly at the moment ones in the 32 or 64 bit "microcontroller" class. Mostly I write compilers for them, but I also get a little involved in design of new instructions and hardware implementing them.

Before that I worked for a gigantic company that also designs and manufactures CPUs, also mostly writing compilers, but also being sometimes involved in the CPU design.

So?

At some point you have to put a label on things and put them in a category. You can't call every single chip a "hybrid 8/16/20/32/64-bit CPU" just because it has a single edge case for each of those somewhere in the instruction set.

You can 'conveniently use' CPUs with paged memory via BIOS calls, etc. All the swapping can be completely transparent to users. Many computers do this, e.g. swapping RAM for system ROM. Does that make the 6502-based BBC Micro or Z80-based MSX a 17 or 18-bit computer? Nope.

coppice · « **Reply #352 on:** October 09, 2018, 09:41:53 am »

Quote from: Fungus on October 09, 2018, 09:25:21 am

You can 'conveniently use' CPUs with paged memory via bios calls, etc.

I like the use of quote marks there. I don't think I've ever seen as much loathing of a computer design as you get when you push a paged solution, or as much relief as you get when you tell people you're going to stretch the address registers to solve their memory constraints.

brucehoult · « **Reply #353 on:** October 09, 2018, 09:47:33 am »

Quote from: coppice on October 09, 2018, 09:41:53 am

Quote from: Fungus on October 09, 2018, 09:25:21 am
You can 'conveniently use' CPUs with paged memory via bios calls, etc.
I like the use of quote marks there. I don't think I've ever seen as much loathing of a computer design as you get when you push a paged solution, or as much relief as you get when you tell people you're going to stretch the address registers to solve their memory constraints.

Exactly.

Using more memory than the natural hardware size of your memory addresses (16 bits in the case of the 6502, Z80 etc.) requires absolute contortions on the part of the programmer.
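To make the "contortions" concrete, here is a minimal sketch of bank switching on a CPU with 16-bit addresses. Everything in it is a hypothetical illustration: the 16 KB window at 0x8000, the bank latch and all the names are assumptions, not a model of the BBC Micro or MSX hardware.

```python
WINDOW_BASE = 0x8000  # hypothetical: the banked window sits at 0x8000-0xBFFF
WINDOW_SIZE = 0x4000  # 16 KB per bank


class BankedMemory:
    """64 KB CPU address space with one switchable 16 KB window."""

    def __init__(self, num_banks):
        self.fixed = bytearray(0x10000)  # memory seen outside the window
        self.banks = [bytearray(WINDOW_SIZE) for _ in range(num_banks)]
        self.bank = 0  # current value of the (hypothetical) bank latch

    def select(self, bank):
        self.bank = bank

    def _in_window(self, addr):
        return WINDOW_BASE <= addr < WINDOW_BASE + WINDOW_SIZE

    def read(self, addr):
        addr &= 0xFFFF  # the CPU can only form 16-bit addresses
        if self._in_window(addr):
            return self.banks[self.bank][addr - WINDOW_BASE]
        return self.fixed[addr]

    def write(self, addr, value):
        addr &= 0xFFFF
        if self._in_window(addr):
            self.banks[self.bank][addr - WINDOW_BASE] = value & 0xFF
        else:
            self.fixed[addr] = value & 0xFF


def far_read(mem, bank, offset):
    """Read one byte at (bank, offset): save latch, switch, read, restore."""
    saved = mem.bank
    mem.select(bank)
    value = mem.read(WINDOW_BASE + offset)
    mem.select(saved)
    return value
```

Every access to data outside the current bank goes through something like `far_read`, and no single object can straddle a bank boundary without extra code. That is the cost a flat, wider address register removes.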

Fungus · « **Reply #354 on:** October 09, 2018, 09:57:16 am »

Quote from: brucehoult on October 09, 2018, 09:47:33 am

Using more memory than the natural hardware size of your memory addresses (16 bits in the case of the 6502, z80 etc) requires absolute contortions on the part of the programmer.

Maybe true but it doesn't mean you get to change the commonly understood definition of "8-bit" or "16-bit" because of it.

Siwastaja · « **Reply #355 on:** October 09, 2018, 10:39:12 am »

Quote from: Fungus on October 09, 2018, 08:47:56 am

Quote from: brucehoult on October 09, 2018, 12:33:27 am
The relevant thing is the address space a program can conveniently use.

Most of the world would disagree with you, including the people who manufacture the chips.

I wouldn't; while classification into bit-number classes is never going to work perfectly, the "address space a program can conveniently use" sounds quite fair to me.

Of course, like most classification problems, this is only interesting on the "academic coffee table discussion" level.

In reality, an 8-bit CPU is what the manufacturer calls an "8-bit CPU", and a 32-bit CPU is what the manufacturer decides to call a "32-bit CPU". This can be more arbitrary than we want to think. It's up to marketing to decide.

And the coffee table discussion on how "pure" those numbers are goes on, and people try to come up with criteria to maximize the "pureness" of existing products. And others say that all CPUs are "hybrids". Nothing wrong with that either. Classification, after all, is always by arbitrary rules, and there is no need to come up with the "Official Truth" in coffee table discussions. It doesn't exist.

Fungus · « **Reply #356 on:** October 09, 2018, 11:18:13 am »

Quote from: Siwastaja on October 09, 2018, 10:39:12 am

Quote from: Fungus on October 09, 2018, 08:47:56 am
Quote from: brucehoult on October 09, 2018, 12:33:27 am
The relevant thing is the address space a program can conveniently use.
Most of the world would disagree with you
I wouldn't

So you'd say the chips on this page are more than 8-bits?

https://en.wikipedia.org/wiki/8-bit

Like I said, most of the world disagrees with that.

SiliconWizard · « **Reply #357 on:** October 09, 2018, 02:53:15 pm »

Quite frankly, isn't this bitness thing running in circles? As most of us have debated, there is no formal definition that I know of, and it tends to vary according to time, type of processor, market, etc. And who cares? As I said above, the most commonly seen use amongst a lot of different kinds and generations of processors seems to be the internal data bus width, at least IMO. There are so many counter-examples concerning the register width, address bus width or ALU width that those don't seem quite as relevant, all the more so since modern ALUs can be structurally pretty different from what they were 30 or 40 years ago. It may be common for most registers and the ALU to match the internal data bus width, but there can be variants/extensions of that.

Only a closer look at a specific processor will allow you to determine whether it fits your requirements, not a simple figure that turns into a marketing gimmick.

To get back to the topic, not all 8-bit processors are equal by any means. Some will even get you more effective performance and lower power draw than some 16-bit processors, whereas other 8-bit processors are usable only for the simplest tasks and actually draw more power than even a lot of more recent 32-bit ones. Then the memory models can vary substantially as well, leading to different performance levels and ease of programming.

Only if you target a very specific architecture (such as the ARM Cortex-M family) does it become possible to compare processors. It's easier in the 32-bit world for MCUs, since there are very few different architectures on the market. For 8-bitters and 16-bitters, that's a different story. There are many different architectures that are difficult to really compare.

SiliconWizard · « **Reply #358 on:** October 09, 2018, 03:08:50 pm »

Quote from: NorthGuy on October 09, 2018, 01:22:05 am

The SIMD registers are not really registers, but a combination of smaller largely independent registers. Even though they have 512-bit SIMD registers now, the longest addition or multiplication you can perform is still only 64-bit. And that's what determines the "bitness" of the CPU.

They are registers and defined as such in Intel's datasheets and manuals.
Since they can be fully accessed in one cycle as any other register, I don't see a reason not to consider them so. Of course there is no 256-bit or 512-bit ALU, so ALU operations on them are limited to operations on chunks of the registers, but that doesn't make them any less of registers. And you can still move arbitrary data around with them at their full size. Just because they are virtually organized as predefined chunks for some operations doesn't change much. There are operations on "regular" registers such as byte/word swapping that also use predefined chunks of the registers.

All of this is to say, again, that it seems futile to try to find a definitive definition.

NorthGuy · « **Reply #359 on:** October 09, 2018, 03:19:46 pm »

Quote from: SiliconWizard on October 09, 2018, 02:53:15 pm

To get back to the topic, all 8-bit processors are not equal by any means. Some will even get you more effective performance and lower power draw than some 16-bit processors, whereas other 8-bit processors are usable only for the simplest tasks and actually draw more power than even a lot of more recent 32-bit ones.

This is a question of technology. 32-bit processors typically use smaller transistors which are faster and consume less power. 8-bit processors are usually made of bigger transistors which consume more power and will be slower. If you build both on the same technology, there's no doubt that you can achieve lower power consumption and faster speeds with 8-bit processors compared to 16-bit processors, or with 16-bit processors compared to 32-bit processors.

SiliconWizard · « **Reply #360 on:** October 09, 2018, 07:31:46 pm »

Quote from: NorthGuy on October 09, 2018, 03:19:46 pm

Quote from: SiliconWizard on October 09, 2018, 02:53:15 pm
To get back to the topic, all 8-bit processors are not equal by any means. Some will even get you more effective performance and lower power draw than some 16-bit processors, whereas other 8-bit processors are usable only for the simplest tasks and actually draw more power than even a lot of more recent 32-bit ones.

This is a question of technology. 32-bit processors typically use smaller transistors which are faster and consume less power. 8-bit processors are usually made of bigger transistors which consume more power and will be slower. If you build both on the same technology, there's no doubt that you can achieve lower power consumption and faster speeds with 8-bit processors compared to 16-bit processors, or with 16-bit processors compared to 32-bit processors.

Sure. The point was not what is technologically possible, but rather what's actually available on the market.

As for speed, yes you can obviously achieve higher clock speeds on a given process node with a simpler architecture and smaller registers. As for the resulting performance, it's a trade-off. Higher clock speeds on a given process node for an equivalent "processing power" may actually draw more power than a more complex/wider architecture running at lower clock speeds. The sweet spot is not necessarily trivial to find IMO.

NorthGuy · « **Reply #361 on:** October 09, 2018, 08:03:45 pm »

Quote from: SiliconWizard on October 09, 2018, 03:08:50 pm

Of course there is no 256-bit or 512-bit ALU, so ALU operations on them are limited to operations on chunks of the registers, but that doesn't make them any less of registers.

Another interesting example is the 8087. When it was a separate chip, it was an 80-bit coprocessor attached to the 16-bit 8086 processor. It could operate on 80-bit floats and 64-bit integers.

Once the functionality of the x87 was taken into the processor itself (from the 80486 onwards), the 80-bit registers (with 80-bit operations) became part of the processor. Did that make the 80486 into a 64-bit (or 80-bit) processor?

BTW: The FPU is still there in the most modern Intel processors, although newer software doesn't use it. Does this make new Intel processors into 80-bit processors?

The point is moot. It's the same as pornography. You cannot define it, but you recognize it when you see it.

technix · « **Reply #362 on:** October 09, 2018, 08:03:54 pm »

Quote from: SiliconWizard on October 09, 2018, 07:31:46 pm

Quote from: NorthGuy on October 09, 2018, 03:19:46 pm
Quote from: SiliconWizard on October 09, 2018, 02:53:15 pm
To get back to the topic, all 8-bit processors are not equal by any means. Some will even get you more effective performance and lower power draw than some 16-bit processors, whereas other 8-bit processors are usable only for the simplest tasks and actually draw more power than even a lot of more recent 32-bit ones.

This is a question of technology. 32-bit processors typically use smaller transistors which are faster and consume less power. 8-bit processors are usually made of bigger transistors which consume more power and will be slower. If you build both on the same technology, there's no doubt that you can achieve lower power consumption and faster speeds with 8-bit processors compared to 16-bit processors, or with 16-bit processors compared to 32-bit processors.

Sure. The point was not what is technologically possible, but rather what's actually available on the market.

As for speed, yes you can obviously achieve higher clock speeds on a given process node with a simpler architecture and smaller registers. As for the resulting performance, it's a trade-off. Higher clock speeds on a given process node for an equivalent "processing power" may actually draw more power than a more complex/wider architecture running at lower clock speeds. The sweet spot is not necessarily trivial to find IMO.

Maybe this can be tested on an FPGA. Throw various different processor cores on the same FPGA platform and run the same test program. Using the same FPGA platform means there would be no variance in process node.

An example test would be RV64GC vs Cortex-M0 vs Zet the open source 80486 implementation vs that open source AVR implementation. RV64, Cortex-M0, 80486 and AVR all run code produced by mainline GCC, so we can even remove most of the compiler and optimizer differences if we use the same source tree to generate the four compilers used in the tests. And since we are using the same version of GCC across the platforms we can use the same test code (with platform-specific boilerplate though) at the same optimization level.

NorthGuy · « **Reply #363 on:** October 09, 2018, 09:35:11 pm »

Quote from: SiliconWizard on October 09, 2018, 07:31:46 pm

As for speed, yes you can obviously achieve higher clock speeds on a given process node with a simpler architecture and smaller registers. As for the resulting performance, it's a trade-off. Higher clock speeds on a given process node for an equivalent "processing power" may actually draw more power than a more complex/wider architecture running at lower clock speeds. The sweet spot is not necessarily trivial to find IMO.

Performance is another story. The performance depends on what you're doing. The current most popular benchmarks use different sorts of 32-bit number crunching, so 32-bit processors certainly come out as winners. But look at your embedded programs. Most of the time, you're doing something completely different. As a result, the 32-bit ALU crunches mostly zero bits, which could have been avoided if you used an 8-bit ALU.

A 32x32 multiplier takes 16 times more silicon than an 8x8 multiplier. What would you rather have: one 1-GHz 32-bit CPU, or 16 1-GHz 8-bit CPUs running independently and controlling the peripherals? IMHO, for the majority of tasks the second would be preferable.

brucehoult · « **Reply #364 on:** October 09, 2018, 11:17:08 pm »

Quote from: NorthGuy on October 09, 2018, 09:35:11 pm

A 32x32 multiplier takes 16 times more silicon than an 8x8 multiplier. What would you rather have: one 1-GHz 32-bit CPU, or 16 1-GHz 8-bit CPUs running independently and controlling the peripherals? IMHO, for the majority of tasks the second would be preferable.

The latter, obviously.

Unfortunately you don't get to make that trade-off as instruction decode and control takes a large proportion of the area of a CPU, and doesn't vary much depending on whether the registers and ALU are 8, 16, 32 or 64 bits. For example, the decode and control on a 64 bit RISC-V Rocket core is just about identical to that on a 32 bit RISC-V Rocket core.

So, it's more likely that you'd actually get to choose between one 1-GHz 32-bit CPU, or two or maybe three 1-GHz 8-bit CPUs.

Three points about the multiplier argument:

1) many useful CPUs don't have a multiplier or divider at all

2) the area of the rest of the ALU, the registers and the datapath is only proportional to the bit width (4x for 32 bits versus 8 bits, not 16x), which lowers the average ratio.

3) a 32x32 multiply can be done with three 16x16 multiplies. Each 16x16 multiply can be done with three 8x8 multiplies. So actually you need nine 8x8 multipliers, not sixteen.

Also, few CPUs do single-cycle multiply. It's more likely to use a 16x16 multiplier three times and take three clock cycles. So that's only three times the area of an 8x8 multiplier.
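The "nine, not sixteen" count in point 3 can be checked with a small software sketch. It uses the subtractive form of Karatsuba so that every sub-multiply stays within half the operand width (the more familiar additive form needs one extra bit in the middle product). This is only an illustration of the counting argument, with hypothetical function names; it is not a claim about how any real hardware multiplier is built.

```python
def karatsuba(a, b, bits, count):
    """Multiply two unsigned `bits`-wide integers.

    count[0] is incremented once per 8x8 multiply actually performed.
    """
    if bits == 8:
        count[0] += 1
        return a * b
    half = bits // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    z2 = karatsuba(a_hi, b_hi, half, count)  # high x high
    z0 = karatsuba(a_lo, b_lo, half, count)  # low x low

    # Third multiply: (a_hi - a_lo) * (b_hi - b_lo), done on magnitudes
    # so the operands still fit in `half` bits; the sign is fixed up after.
    da, db = a_hi - a_lo, b_hi - b_lo
    sign = -1 if (da < 0) != (db < 0) else 1
    zm = sign * karatsuba(abs(da), abs(db), half, count)

    z1 = z2 + z0 - zm  # equals a_hi*b_lo + a_lo*b_hi
    return (z2 << bits) + (z1 << half) + z0


def schoolbook_count(bits):
    """8x8 products needed when every byte of a meets every byte of b."""
    return (bits // 8) ** 2


count = [0]
a, b = 0xDEADBEEF, 0x12345678
assert karatsuba(a, b, 32, count) == a * b
print(count[0], schoolbook_count(32))  # 9 multiplies versus 16
```

The saving in multiplies is paid for with extra subtractions, additions and sign handling, so whether it is a net win in silicon is a separate question.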

NorthGuy · « **Reply #365 on:** October 10, 2018, 03:35:05 am »

Quote from: brucehoult on October 09, 2018, 11:17:08 pm

Unfortunately you don't get to make that trade-off as instruction decode and control takes a large proportion of the area of a CPU, and doesn't vary much depending on whether the registers and ALU are 8, 16, 32 or 64 bits. For example, the decode and control on a 64 bit RISC-V Rocket core is just about identical to that on a 32 bit RISC-V Rocket core.

So, it's more likely that you'd actually get to choose between one 1-GHz 32-bit CPU, or two or maybe three 1-GHz 8-bit CPUs.

Most of the control that you're referring to is not required in light-weight cores:

- There's no pipeline, so no need for pipeline control.
- There's no cache, so there's no need for cache and big memory controllers.
- We won't worry about code density (because the performance doesn't depend on decoding speed), so we can have really long instructions, such as 32-bit. This way the instruction will come out already decoded to a great extent.
- We'll also get rid of the bus with all the bus arbitration and collisions, and let our small cores communicate through dedicated FIFOs.

In the end, it'll be much less silicon for each core. Maybe not 16, but close. More importantly, you may be able to run the whole thing faster and will end up with 2 GHz 8-bit cores, or even 3 GHz 8-bit cores. This would certainly beat the current behemoths.

Quote from: brucehoult on October 09, 2018, 11:17:08 pm

Three points about the multiplier argument:

3) a 32x32 multiply can be done with three 16x16 multiplies. Each 16x16 multiply can be done with three 8x8 multiplies. So actually you need nine 8x8 multipliers, not sixteen.

Maybe Karatsuba gives some advantage, but it also requires lots of dancing around. When I implemented SSL a long time ago, I found out that it only produces benefits for really big numbers. Of course it's different in silicon, but I don't think it's advantageous.

Quote from: brucehoult on October 09, 2018, 11:17:08 pm

Also, few CPUs do single-cycle multiply. It's more likely to use a 16x16 multiplier three times and take three clock cycles. So that's only three times the area of an 8x8 multiplier.

You either pipeline it, in which case each stage requires its own multiplier, or you don't, in which case the following multiply instructions will sit there waiting for their turn.

brucehoult · « **Reply #366 on:** October 10, 2018, 05:48:15 am »

Quote from: NorthGuy on October 10, 2018, 03:35:05 am

Quote from: brucehoult on October 09, 2018, 11:17:08 pm
Unfortunately you don't get to make that trade-off as instruction decode and control takes a large proportion of the area of a CPU, and doesn't vary much depending on whether the registers and ALU are 8, 16, 32 or 64 bits. For example, the decode and control on a 64 bit RISC-V Rocket core is just about identical to that on a 32 bit RISC-V Rocket core.

So, it's more likely that you'd actually get to choose between one 1-GHz 32-bit CPU, or two or maybe three 1-GHz 8-bit CPUs.

Most of the control that you're referring to is not required in light-weight cores:

- There's no pipeline, so no need for pipeline control.
- There's no cache, so there's no need for cache and big memory controllers.
- We won't worry about code density (because the performance doesn't depend on decoding speed), so we can have really long instructions, such as 32-bit. This way the instruction will come out already decoded to a great extent.
- We'll also get rid of the bus with all the bus arbitration and collisions, and let our small cores communicate through dedicated FIFOs.

That turns out not to be the case. On simple processors the control is the majority of the chip.

For example here's a labelled photo of a Z80

You'll observe that the control logic is something like 60% of the chip, and the registers and ALU 40% (to be generous).

If you made a chip with the same instruction set with the same encoding, but just made the registers and ALU two or four times wider, the control part would stay the same size while the registers and ALU would get bigger, roughly in proportion to the bit width. So the control part would get relatively smaller. So a 32 bit Z80 might be 25% control and 75% registers and ALU.
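The arithmetic behind that estimate can be sketched as follows. The 60/40 split is the rough figure read off the die photo above; the assumptions that control area stays constant and that registers plus ALU scale linearly with width are the ones stated in this post, not measured data.

```python
def relative_area(width_scale, control=60.0, datapath=40.0):
    """Chip area in units where the original 8-bit chip is 100.

    Assumes control logic is fixed and the datapath (registers + ALU)
    grows linearly with the width multiplier.
    """
    return control + datapath * width_scale


def control_fraction(width_scale, control=60.0, datapath=40.0):
    """Share of the total area taken by control logic."""
    return control / relative_area(width_scale, control, datapath)


for scale, name in [(1, "8-bit"), (2, "16-bit"), (4, "32-bit")]:
    area = relative_area(scale)
    print(f"{name}: area {area:.0f}, control {control_fraction(scale):.0%}, "
          f"equivalent 8-bit cores {area / relative_area(1):.1f}")
```

Under these assumptions the 32-bit version comes out at about 27% control, and its area is worth a little over two of the 8-bit cores, which is where "two or maybe three" rather than sixteen comes from.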

Quote

In the end, it'll be much less silicon for each core. Maybe not 16, but close. More importantly, you may be able to run the whole thing faster and will end up with 2 GHz 8-bit cores, or even 3 GHz 8-bit cores. This would certainly beat the current behemoths.

There is no reason for a 2x or 4x wider data path to run 50% more slowly. The only place it will make any difference at all is carry propagation in adders, but that's logarithmic if done correctly. Basically, a 32 bit adder has two gates more delay than an 8 bit adder. And the adder isn't usually the clock-limiting factor anyway.
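As a rough illustration of the "logarithmic" point, here is a bit-level sketch of a parallel-prefix (Kogge-Stone style) adder that reports how many prefix levels the carry network needs. One prefix level is not exactly one gate delay, and real adders use various prefix topologies, so treat the level count as an approximation; the function name is hypothetical.

```python
def prefix_add(a, b, width):
    """Add two unsigned `width`-bit numbers with a Kogge-Stone style network.

    Returns (sum modulo 2**width, number of prefix levels used).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate bits
    G, P = g[:], p[:]
    levels = 0
    dist = 1
    while dist < width:
        # Each level combines every position with the group `dist` bits below.
        new_G = [G[i] | (P[i] & G[i - dist]) if i >= dist else G[i]
                 for i in range(width)]
        new_P = [P[i] & P[i - dist] if i >= dist else P[i]
                 for i in range(width)]
        G, P = new_G, new_P
        dist *= 2
        levels += 1
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0  # carry into bit i
        total |= (p[i] ^ carry_in) << i
    return total, levels


print(prefix_add(200, 100, 8))        # 8-bit add: 3 prefix levels
print(prefix_add(0xFFFFFFFF, 1, 32))  # 32-bit add: 5 prefix levels
```

Going from 8 to 32 bits adds two levels to the carry network (3 versus 5), which matches the "two gates more delay" estimate in spirit.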

Quote from: NorthGuy on October 10, 2018, 03:35:05 am

You either pipeline it, then each stage requires its own multiplier, or you don't, in which case next multiply commands will be sitting there waiting for its turn.

Multiplies are rare. Two multiplies in a row are *very* rare.

The Coremark and Dhrystone benchmarks, for example, are quite heavily influenced by multiply performance.

I just checked on my HiFive1 "Arduino" board (32 bit RISC-V "Rocket" microcontroller running at 256 MHz). It has a 4 cycle multiplier which allows following instructions to continue as long as they are not multiplies (or divides) and as long as they do not need the result of the multiply. I used gcc 7.2.0.

Coremark/MHz with hardware multiply: 2.68
Coremark/MHz with software multiply: 0.94

Dhrystone/MHz with hardware multiply: 1.56
Dhrystone/MHz with software multiply: 0.85

For comparison, for the Cortex M3/M4 ARM claims http://www.eeherald.com/section/design-guide/arm_cortex_m3_m4_mcu.html 2.17 Coremark/MHz and 1.25 Dhrystone/MHz.

ARM manuals say the M3/M4 has a single-cycle multiplier. But it's coming out considerably slower than the chip with the 4 cycle multiplier on benchmarks where multiplication is important (as seen by software emulation being much slower).

As in most real-case uses of multiply, Coremark (for example) is generally loading two values from memory, then multiplying them, then storing the result, and then looping. There are enough other instructions around the multiply that a four cycle latency is no big deal.
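For reference, the ratios implied by the figures quoted above can be worked out directly. The numbers are the ones from this post (HiFive1 measurements and the ARM-claimed M3/M4 scores); the dictionary layout and function name are just an illustration.

```python
# Scores per MHz as quoted in the post above.
results = {
    "Coremark/MHz": {"hw_mul": 2.68, "sw_mul": 0.94, "arm_claimed": 2.17},
    "Dhrystone/MHz": {"hw_mul": 1.56, "sw_mul": 0.85, "arm_claimed": 1.25},
}


def ratios(entry):
    """Return (hardware vs software multiply, 4-cycle-multiplier chip vs ARM claim)."""
    return (entry["hw_mul"] / entry["sw_mul"],
            entry["hw_mul"] / entry["arm_claimed"])


for name, entry in results.items():
    hw_vs_sw, vs_arm = ratios(entry)
    print(f"{name}: hw/sw multiply {hw_vs_sw:.2f}x, vs ARM claim {vs_arm:.2f}x")
```

Hardware multiply is worth roughly 2.85x on Coremark and 1.84x on Dhrystone here, while the chip with the 4 cycle multiplier still scores roughly 1.2x the claimed M3/M4 figures on both.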

legacy · « **Reply #367 on:** October 10, 2018, 10:54:26 am »

Quote from: brucehoult on October 10, 2018, 05:48:15 am

ARM manuals say the M3/M4 has a single-cycle multiplier

how is it implemented?

technix · « **Reply #368 on:** October 10, 2018, 11:36:25 am »

Quote from: legacy on October 10, 2018, 10:54:26 am

Quote from: brucehoult on October 10, 2018, 05:48:15 am
ARM manuals say the M3/M4 has a single-cycle multiplier

how is it implemented?

There is the matrix (array) adder design for integer multipliers: a lot of adders are involved, but it is a fully combinational unit.

GeorgeOfTheJungle · « **Reply #369 on:** October 10, 2018, 12:00:43 pm »

ARMs are the best, trash all your 8-bitters now!

Howardlong · « **Reply #370 on:** October 10, 2018, 12:04:23 pm »

Quote from: brucehoult on October 10, 2018, 05:48:15 am

ARM manuals say the M3/M4 has a single-cycle multiplier. But it's coming out considerably slower than the chip with the 4 cycle multiplier on benchmarks where multiplication is important (as seen by software emulation being much slower)

As in most real-case uses of multiply, Coremark (for example) is generally loading two values from memory, then multiplying them, then storing the result, and then looping. There are plenty of other instructions around the multiply that a four cycle latency is no big deal.

A quick note: how often a multiply instruction appears in object code isn't the same as how often it's executed.

In some number crunching applications, particularly DSP, multiplies, or more often MACs, are a significant proportion of the run time effort.

I totally agree that, particularly in ARM, and depending on the benchmark, loads and stores are an expensive overhead. ARM is optimised for in-register processing.

In many applications, this necessitates amalgamating several logically separate, chained algorithms into fewer more complex algorithms to be able to compute in-register, minimising loads and stores. This is on top of being acutely aware of processor pipeline stalls, which in itself demands a mental maze to be minimised.

While compiler optimisers are capable of helping out with things like loop unrolling, re-implementing multiple chained algorithms to reduce loads and stores is not something a compiler optimiser is capable of.

Once you've optimised out as many loads and stores as possible, you're left with a run time profile with a very high percentage of multiplies and/or MACs, even though the object code itself only has <1% multiply/MAC instructions.
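A toy calculation shows how far the static and dynamic figures can diverge. All the numbers below are hypothetical (a 2000-instruction program whose only MACs sit in one 16-instruction inner loop); they are chosen only to illustrate the effect, not taken from any real profile.

```python
def mac_fractions(static_total, loop_body, loop_macs, iterations):
    """Return (static MAC fraction, dynamic MAC fraction).

    static_total: instructions in the object code
    loop_body:    instructions in the hot inner loop (part of static_total)
    loop_macs:    multiply/MAC instructions inside that loop
    iterations:   how many times the loop body executes
    Everything outside the loop is assumed to execute exactly once.
    """
    static_fraction = loop_macs / static_total
    executed_total = (static_total - loop_body) + loop_body * iterations
    dynamic_fraction = (loop_macs * iterations) / executed_total
    return static_fraction, dynamic_fraction


static, dynamic = mac_fractions(static_total=2000, loop_body=16,
                                loop_macs=10, iterations=100_000)
print(f"static: {static:.1%}, dynamic: {dynamic:.1%}")
```

With these made-up figures the MACs are 0.5% of the object code but a little over 62% of the instructions actually executed.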

Siwastaja · « **Reply #371 on:** October 10, 2018, 01:44:13 pm »

Quote from: Howardlong on October 10, 2018, 12:04:23 pm

While compiler optimisers are capable of helping out with things like loop unrolling, re-implementing multiple chained algorithms to reduce loads and stores is not something a compiler optimiser is capable of.

Exactly.

When processing power is at a premium, mapping the memory correctly is a huge deal; not only to minimize the loads/stores, but also to utilize the available SIMD instruction set. DMAs that can increment the addresses by a value larger than the write access size help here by interleaving the datasets, if you have that available to you.

Some example code from an older project: https://pastebin.com/NJf5WF93 . Two ADCs are configured in dual mode to utilize 32-bit memory operations on DMA, halving the memory access time spent by DMA. Now the data for two independent but synchronous DC/DC converters are interleaved; their control ISRs also run synchronously, with a 180 degree phase shift between them.

The original code simply summed the eight current measurements to average them, but this only works with 16-bit loads, each taking the same time as a 32-bit load would. Both ISRs needed to do 8 such loads. In the manually optimized version, the 16-bit loads are replaced by 32-bit loads and, by using SIMD instructions, each ISR calculates half of what it needs and, as a free by-product, half of what the other ISR needs. One register is dedicated to storing the intermediate result, so we need to tell the compiler not to use that register elsewhere. (I needed to analyze how that limitation affects performance elsewhere. There was no performance drop, as the compiler never used the full set of registers at the same time anywhere except in non-critical initialization.)

The reason for this was that I was actually having random ADC data register overruns when the CPU was executing the loooong eight loads and, simultaneously, the DMA was doing a bunch of 16-bit writes all the time from the ADCs working in separate mode. This fix halved both usage patterns, reducing the memory operation load by 4x. This was, by the way, two ADCs converting at 5MSPS each (so 10MSPS total), with the CPU and the AHB bus running at 72MHz, on an STM32F334.
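The packed-halfword idea can be modelled in a few lines. This is not the pastebin code: it is a behavioural sketch in which `uadd16` imitates a dual 16-bit SIMD add (of the kind the Cortex-M4 offers as UADD16), and the buffer layout with one sample from each ADC per 32-bit word is an assumption made for the example.

```python
def pack(sample_a, sample_b):
    """Two 16-bit samples in one 32-bit word (ADC A low half, ADC B high half)."""
    return ((sample_b & 0xFFFF) << 16) | (sample_a & 0xFFFF)


def uadd16(x, y):
    """Lane-wise add of two packed halfword pairs; no carry crosses lanes."""
    lo = ((x & 0xFFFF) + (y & 0xFFFF)) & 0xFFFF
    hi = ((x >> 16) + (y >> 16)) & 0xFFFF
    return (hi << 16) | lo


def sum_both_channels(words):
    """One 32-bit load and one packed add per word sums both channels at once."""
    acc = 0
    for w in words:
        acc = uadd16(acc, w)
    return acc & 0xFFFF, acc >> 16


# Eight interleaved sample pairs, as a dual-mode DMA buffer might hold them.
buffer = [pack(i, 100 + i) for i in range(8)]
print(sum_both_channels(buffer))  # sums for channel A and channel B
```

Each loop iteration touches memory once yet advances both running sums, which is the effect described above: half as many loads for the same two results.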

legacy · « **Reply #372 on:** October 10, 2018, 01:50:41 pm »

Quote from: technix on October 10, 2018, 11:36:25 am

Quote from: legacy on October 10, 2018, 10:54:26 am
Quote from: brucehoult on October 10, 2018, 05:48:15 am
ARM manuals say the M3/M4 has a single-cycle multiplier

how is it implemented?
There is this matrix adder design for integer multiplier, a lot of adders are involved but it is a fully combinatorial unit.

But usually there is a limit on how many numbers you can sum at once. How can they perform 40 add operations in parallel while still respecting the timing for the data to be stable?

NorthGuy · « **Reply #373 on:** October 10, 2018, 02:17:53 pm »

Quote from: brucehoult on October 10, 2018, 05:48:15 am

That turns out not to be the case. On simple processors the control is the majority of the chip.

This can be reduced if you set it as a design goal.

Quote from: brucehoult on October 10, 2018, 05:48:15 am

For example here's a labelled photo of a Z80

If you want to use existing processors, look at this:

https://en.wikipedia.org/wiki/Transistor_count

The Z80 had 8500 transistors. The ARM Cortex A9 has 26 million transistors. You could have over 3000 Z80 cores instead of one ARM core, all running at the same clock speed or faster. This is a little bit more than 16, isn't it?

I couldn't find data for ARM Cortex M, but my guess would be 100000, which would give you roughly 12 Z80 cores.
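The division behind those two figures, using only the transistor counts quoted above (the Cortex M value is explicitly a guess in this post, not a published number):

```python
Z80_TRANSISTORS = 8_500
CORTEX_A9_TRANSISTORS = 26_000_000
CORTEX_M_GUESS = 100_000  # the guess made in this post, not a datasheet figure


def z80_equivalents(transistors):
    """How many Z80-sized cores fit in the given transistor budget."""
    return transistors / Z80_TRANSISTORS


print(f"Cortex A9: {z80_equivalents(CORTEX_A9_TRANSISTORS):.0f} Z80 cores")
print(f"Cortex M (guess): {z80_equivalents(CORTEX_M_GUESS):.1f} Z80 cores")
```

This compares raw transistor counts only; it says nothing about memory, interconnect or what those cores could do per clock.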

But this can be improved if you aim at pure performance and forget about code density.

Quote from: brucehoult on October 10, 2018, 05:48:15 am

There is no reason for a 2x or 4x wider data path to run 50% more slowly. The only place it will make any difference at all is carry propagation in adders, but that's logarithmic if done correctly. Basically, a 32 bit adder has two gates more delay than an 8 bit adder. And the adder isn't usually the clock-limiting factor anyway.

Of course, dealing with a wider datapath requires more logic and more pipelining at faster clocks. You need to arrange the gates and flops in space. This produces delays. I'm 100% confident that if you set the higher clock frequency as your design goal, you'll be able to build much faster 8-bit cores than 32-bit cores with the same design goal in mind.

Quote from: brucehoult on October 10, 2018, 05:48:15 am

Multiplies are rare. Two multiplies in a row are *very* rare.

Depends on what you're doing. If you want to do a FIR filter, you'd be better off with multiplications done every cycle. And DSP processors (such as dsPIC) can do it, and will give you much higher performance doing FIR. Although they will fall way behind the ARM M3/M4 cores in your Coremark/Dhrystone tests. These tests have nothing to do with performance.
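To show why a FIR filter is multiply-bound, here is a plain direct-form FIR that also counts its multiply-accumulate operations. It is a generic textbook sketch with hypothetical names, not dsPIC code: every output sample costs one MAC per tap, so the inner loop is essentially back-to-back multiplies.

```python
def fir(samples, taps):
    """Direct-form FIR filter.

    Returns (output samples, total multiply-accumulate operations).
    """
    history = [0] * len(taps)  # most recent input first
    outputs = []
    macs = 0
    for x in samples:
        history = [x] + history[:-1]  # shift the delay line
        acc = 0
        for h, s in zip(taps, history):
            acc += h * s  # one MAC per tap
            macs += 1
        outputs.append(acc)
    return outputs, macs


# An impulse in gives the tap values out: 4 samples x 3 taps = 12 MACs.
print(fir([1, 0, 0, 0], [1, 2, 3]))
```

A core that can issue one MAC per cycle finishes each output in about as many cycles as there are taps; a core whose multiplier blocks for several cycles pays that penalty on every tap.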

Of course, people look at benchmarks, so if you design a processor which is high on benchmarks, people will think of it as a high performance processor. Better yet, if you design a benchmark test to make your processor look good ...

langwadt · « **Reply #374 on:** October 10, 2018, 02:22:57 pm »

Quote from: legacy on October 10, 2018, 10:54:26 am

Quote from: brucehoult on October 10, 2018, 05:48:15 am
ARM manuals say the M3/M4 has a single-cycle multiplier

how is it implemented?

afair ARM use some kind of Booth multiplier, basically a bunch of adders and some trickery

might not be exactly the algo they use but close enough, http://www.ellab.physics.upatras.gr/~bakalis/Eudoxus/MBM.html
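For readers who have not met it, the "trickery" in a radix-4 (modified) Booth multiplier is to recode the multiplier into digits from -2 to +2, which halves the number of partial products the adders have to sum. The sketch below shows only that recoding idea in software; it is not a description of ARM's actual circuit, and the function names are hypothetical.

```python
def booth_radix4_digits(y, bits):
    """Recode a `bits`-bit two's complement multiplier into radix-4 Booth digits.

    `bits` must be even. Each digit is in {-2, -1, 0, 1, 2}, least
    significant first, and sum(d * 4**k) equals the signed value of y.
    """
    y &= (1 << bits) - 1
    digits = []
    prev = 0  # the bit just below the current pair (0 for the first pair)
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, bits):
    """Signed x * y, where y fits in `bits`-bit two's complement."""
    product = 0
    for k, d in enumerate(booth_radix4_digits(y, bits)):
        # Each partial product is 0, +/-x or +/-2x, shifted by 2k bits.
        product += (d * x) << (2 * k)
    return product


print(len(booth_radix4_digits(0x12345678, 32)))  # 16 partial products, not 32
assert booth_multiply(1234, -567, 16) == 1234 * -567
```

In hardware, the partial products would then be summed by an adder array or tree rather than a loop; the recoding just means there are half as many of them to sum.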

