Author Topic: 8-bit uC - is there even a point? (Read 58043 times)

Fungus · « **Reply #375 on:** October 10, 2018, 02:25:13 pm »

Quote from: legacy on October 10, 2018, 01:50:41 pm

but usually there is a limit for the number of numbers you can sum, how can they perform 40 add-operations in parallel and respecting the timing for data being stable?

The timing problem is O(logN). 64 bits isn't much more difficult than 16 bits.

How do they do it? Very carefully.

Siwastaja · « **Reply #376 on:** October 10, 2018, 02:28:50 pm »

Quote from: Fungus on October 10, 2018, 02:25:13 pm

How do they do it? Very carefully.

Yeah, I guess there is quite a lot of complexity hidden in the place&route (synthesis) and timing analysis tools, which is vital for the end result.

coppice · « **Reply #377 on:** October 10, 2018, 02:44:53 pm »

Quote from: langwadt on October 10, 2018, 02:22:57 pm

Quote from: legacy on October 10, 2018, 10:54:26 am
Quote from: brucehoult on October 10, 2018, 05:48:15 am
ARM manuals say the M3/M4 has a single-cycle multiplier

how is it implemented?

afair ARM use some kind of Booth multiplier, basically a bunch of adders and some trickery

might not be exactly the algo they use but close enough, http://www.ellab.physics.upatras.gr/~bakalis/Eudoxus/MBM.html

It's more likely they use a Wallace tree than a Booth tree. Booth is naturally 2's complement, and you need extra fudging to do unsigned multiplies. Wallace is naturally unsigned, and you need extra fudging to do 2's complement multiplies. When you need to do both types of multiply (which an ARM does), Wallace + fudging logic tends to work out better. For 2's-complement-only work (e.g. most DSP) Booth is the winner.
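To make the "naturally 2's complement" point concrete, here is a minimal Python sketch of textbook radix-4 Booth recoding. It is only an illustration of the general technique, not a description of what ARM actually implements; the function names and the `bits` parameter are assumptions made for this example.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2} and digit i has weight 4**i.
    `bits` must be even (hypothetical helper, for illustration only).
    """
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    m = multiplier & ((1 << bits) - 1)    # raw bit pattern of the multiplier
    digits = []
    prev = 0                              # implicit 0 to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # digit = -2*b[i+1] + b[i] + b[i-1]
        prev = b1
    return digits


def booth_multiply(a, b, bits=16):
    """Multiply by summing a * digit * 4**i over the recoded digits of b."""
    digits = booth_radix4_digits(b, bits)
    return sum((a * d) << (2 * i) for i, d in enumerate(digits))
```

With `bits=16`, the pattern `0xFFFF` is read as -1, so `booth_multiply(3, 0xFFFF, bits=16)` gives -3. Treating the same pattern as unsigned needs extra zero bits on top, e.g. `booth_multiply(3, 0xFFFF, bits=18)` gives 196605. That widening is one form of the "extra fudging" mentioned above.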

nctnico · « **Reply #378 on:** October 10, 2018, 02:47:26 pm »

Actually it is easier to implement 1's complement calculations than 2's complement. Someone I knew in a distant past wrote a PhD thesis on that.

legacy · « **Reply #379 on:** October 10, 2018, 02:50:29 pm »

Quote from: Fungus on October 10, 2018, 02:25:13 pm

The timing problem is O(logN). 64 bits isn't much more difficult than 16 bits.
How do they do it? Very carefully.

I have implemented a Booth 32x32 Multiplier and Adder (MAC) with saturating arithmetic, but ... it takes 4 clocks to compute; otherwise, the synthesizer complains that the timing (of the full adder units) is not met.

It means that, on ISE-v11/Spartan6, I can't sum more than 32/4=8 numbers in parallel with the same adder.

Anyway, I have a wait-flag in the finite state machine of the CPU (Arise-v2) that stalls the whole core until the MAC unit is ready with the results.

External interrupts are handled by a Cop0 unit which works in parallel, hence they are accepted during the CPU stall and saved to be handled on the next fetch.

legacy · « **Reply #380 on:** October 10, 2018, 02:53:54 pm »

Quote from: nctnico on October 10, 2018, 02:47:26 pm

Actually it is easier to implement 1's complement calculations than 2's complement. Someone I knew in a distant past wrote a PhD thesis on that.

with the software made able to handle "negative zero"

technix · « **Reply #381 on:** October 10, 2018, 03:03:24 pm »

Quote from: legacy on October 10, 2018, 01:50:41 pm

Quote from: technix on October 10, 2018, 11:36:25 am
Quote from: legacy on October 10, 2018, 10:54:26 am
Quote from: brucehoult on October 10, 2018, 05:48:15 am
ARM manuals say the M3/M4 has a single-cycle multiplier

how is it implemented?
There is this matrix adder design for integer multiplier, a lot of adders are involved but it is a fully combinatorial unit.

but usually there is a limit for the number of numbers you can sum, how can they perform 40 add-operations in parallel and respecting the timing for data being stable?

Instead of traditional adders they expand one of the operands into a sum of powers of two. Multiplying by a power of two is just a shift, which can be done without propagation delay using cross wiring, and the matrix multiplier uses a one-counter (a special type of encoder) to reduce the number of adders actually needed.
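The idea of shifted partial products plus one-counters can be sketched in Python as a carry-save reduction tree. This is a generic Wallace-style illustration under my own assumptions (3:2 compressors, i.e. full adders used as 3-input one-counters, with unsigned operands), not the actual Cortex-M design; all names here are hypothetical.

```python
def partial_products(a, b, bits):
    """One shifted copy of `a` per set bit of `b` (unsigned operands)."""
    return [a << i for i in range(bits) if (b >> i) & 1]


def carry_save_add(x, y, z):
    """3:2 compressor applied bitwise: no carry ripples between bit positions."""
    partial_sum = x ^ y ^ z
    carries = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carries


def tree_multiply(a, b, bits=32):
    """Return (product, number of 3:2 compressor levels used)."""
    rows = partial_products(a, b, bits)
    levels = 0
    while len(rows) > 2:
        next_rows = []
        # Every group of three rows is compressed to two within the same level.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that do not fill a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
        levels += 1
    # Only this final step needs a carry-propagating adder.
    return sum(rows), levels
```

For a 32-bit multiplier with all bits set, this sketch reduces 32 rows to 2 in 8 compressor levels, so the depth grows slowly with operand width even though the number of adders grows quickly.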

brucehoult · « **Reply #382 on:** October 11, 2018, 01:25:00 am »

Quote from: NorthGuy on October 10, 2018, 02:17:53 pm

Quote from: brucehoult on October 10, 2018, 05:48:15 am
That turns out not to be the case. On simple processors the control is the majority of the chip.

This can be reduced if you set it as a design goal.

I'm pretty sure the people designing the Z80 had simplicity of implementation as a primary goal. The ones designing the 6502 certainly did, and it's got a similar control-to-datapath ratio.

Quote

Quote from: brucehoult on October 10, 2018, 05:48:15 am
For example here's a labelled photo of a Z80

If you want to use existing processors, look at this:

https://en.wikipedia.org/wiki/Transistor_count

Z80 had 8500 transistors. ARM Cortex A9 has 26 million transistors. You could have over 3000 Z80 cores instead of one ARM core, all running at the same clock speed or faster. This is a little bit more than 16, isn't it?

The A9 has a lot of transistors because it's a complex out-of-order CPU, has a lot of cache, has FPU, has MMU.

Virtually nothing to do with being 32 bit vs 8 bit.

NorthGuy · « **Reply #383 on:** October 11, 2018, 02:03:10 am »

Quote from: brucehoult on October 11, 2018, 01:25:00 am

The A9 has a lot of transistors because it's a complex out-of-order CPU, has a lot of cache, has FPU, has MMU.

Virtually nothing to do with being 32 bit vs 8 bit.

You don't seem to like my examples. Would you pick a 32-bit processor which has something to do with being 32-bit?

SilverSolder · « **Reply #384 on:** October 11, 2018, 02:05:59 am »

Are there any 3 cent 32 bit microcontrollers?

SiliconWizard · « **Reply #385 on:** October 11, 2018, 02:26:06 am »

Quote from: legacy on October 10, 2018, 02:53:54 pm

with the software made able to handle "negative zero"

Is zero negative or positive?
You have 4 hours.

brucehoult · « **Reply #386 on:** October 11, 2018, 02:41:27 am »

Quote from: NorthGuy on October 11, 2018, 02:03:10 am

Quote from: brucehoult on October 11, 2018, 01:25:00 am
The A9 has a lot of transistors because it's a complex out-of-order CPU, has a lot of cache, has FPU, has MMU.

Virtually nothing to do with being 32 bit vs 8 bit.

You don't seem to like my examples. Would you pick a 32-bit processor which has something to do with being 32-bit?

Sure. An actually small 32 bit ARM such as the Cortex M0 or a simple 32 bit RISC-V would be appropriate to compare to the likes of Z80.

It's hard to know much about ARMs but in fully open RISC-V land you have for example https://github.com/SpinalHDL/VexRiscv which can be configured as RV32I at 346 MHz and 0.52 Dhrystone MIPS/MHz on an Artix 7 using 481 LUTs and 539 FFs.

LUTs don't convert conveniently to equivalent gates, but somewhere between 6 and 24 is about right, and probably 12 is a good average. So that's somewhere between 3000 and 12000 gates for the LUTs with 6000 probably being a good guess. D flip-flops are worth 4 gates each I guess, so that's 2000. Total maybe 8000.

When you convert gates to transistors that's bigger than a Z80. But it's *far* from 100000. Let alone millions.

Note that the 32 bit ARM1 is listed on the Wikipedia page you referenced as having 25000 transistors, about 3x the Z80.
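The back-of-envelope gate estimate in this post can be reproduced with a few lines of Python. The LUT and flip-flop counts and the gates-per-LUT and gates-per-flip-flop guesses are the ones quoted in the post; the variable and function names are just for this sketch.

```python
LUTS, FLIP_FLOPS = 481, 539      # VexRiscv RV32I figures quoted in the post
GATES_PER_LUT = {"low": 6, "typical": 12, "high": 24}   # the post's guesses
GATES_PER_FF = 4                 # the post's guess for a D flip-flop


def gate_estimate(gates_per_lut):
    """Rough equivalent-gate count for the whole core."""
    return LUTS * gates_per_lut + FLIP_FLOPS * GATES_PER_FF


for label, gates in GATES_PER_LUT.items():
    print(f"{label:8s} {gate_estimate(gates):6d} gates")
```

With 12 gates per LUT this gives 5772 + 2156 = 7928 gates, which matches the "maybe 8000" figure.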

legacy · « **Reply #387 on:** October 11, 2018, 12:43:25 pm »

Quote from: SiliconWizard on October 11, 2018, 02:26:06 am

Quote from: legacy on October 10, 2018, 02:53:54 pm
with the software made able to handle "negative zero"

Is zero negative or positive?

mathematically
zero+ = 0 + eps
zero- = 0 - eps

hence,
zero+ is positive
zero- is negative

numerically, in 1's complement
zero+=0x0000.0000
zero-=0xffff.ffff

a circuit evaluates if a number is positive or negative by checking the most significant bit: (boolean_t) is_positive(A) = (MSbit(A) isEqualTo '0');

hence
(MSbit(0x0000.0000) = '0') => zero+ is positive
(MSbit(0xffff.ffff) = '1') => zero- is negative

but the point is that a 1's-complement-ALU requires an additional circuit to handle the end-around carry: if the carry-out is set to '1', it has to be added back into the result

Code:

  0001 0110     22
+ 1111 1111     −0
===========   ====
1 0001 0101     21    - an end-around carry is produced (carry-out is '1')
+ 0000 0001      1    - add it back in
===========   ====
  0001 0110     22    - the correct result (22 + (−0) = 22)

and you also have to modify each compare-and-branch instruction in your hypothetical ISA to double-compare an operand, which can be zero+ or zero-, and they both must be handled as zero.

These look like the couple of software patches you have to apply if you are willing to use a 2's-complement-ALU for 1's complement math
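As a rough illustration of those two patches, here is a minimal Python sketch of 8-bit 1's complement addition done on an ordinary binary adder, with the end-around carry added back in and a zero test that accepts both zeros. The width and all function names are assumptions chosen for this example.

```python
BITS = 8
MASK = (1 << BITS) - 1           # 0xFF, which is also the pattern for zero-


def encode(value):
    """1's complement bit pattern of a small signed integer."""
    return value if value >= 0 else (~(-value)) & MASK


def decode(pattern):
    """Signed value of a bit pattern; both zeros decode to 0."""
    is_negative = (pattern >> (BITS - 1)) & 1      # check the MS bit
    return -((~pattern) & MASK) if is_negative else pattern


def ones_add(x, y):
    """Plain binary add, then add the carry-out back in (end-around carry)."""
    total = x + y
    return ((total & MASK) + (total >> BITS)) & MASK


def is_zero(pattern):
    """The double compare: zero+ and zero- both count as zero."""
    return pattern in (0, MASK)
```

For example, `ones_add(encode(22), MASK)` reproduces the 22 + (−0) = 22 case worked above, and `ones_add(encode(5), encode(-5))` returns `0xFF`, i.e. zero-, which `is_zero` still reports as zero.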

legacy · « **Reply #388 on:** October 11, 2018, 12:57:21 pm »

for sure 1's complement math gives less numerical error when used to implement a fixed-point DSP engine, simply because the weights of positive numbers and negative numbers are perfectly balanced

2's complement
bx00: positive, zero
bx01: positive, non zero
bx10: negative, non zero
bx11: negative, non zero

1 positive non zero vs 2 negative non zero: weights are not balanced, it may cause an additional numerical error if the algorithm needs to bounce between positive numbers and negative numbers (e.g. a trigonometric algorithm)

1's complement
bx00: positive, zero+
bx01: positive, non zero
bx10: negative, non zero
bx11: negative, zero-

1 positive non zero vs 1 negative non zero: weights are balanced. It offers the same truncation error for both positive and negative numbers. Which is good!
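The counting argument above can be checked by enumerating every bit pattern. The Python sketch below only demonstrates the range symmetry (how many positive and negative non-zero codes each representation has); it does not by itself measure the numerical error of any DSP algorithm. The function names are made up for this example.

```python
def twos_value(pattern, bits):
    """Signed value of a bit pattern in 2's complement."""
    sign = pattern >> (bits - 1)
    return pattern - (sign << bits)


def ones_value(pattern, bits):
    """Signed value of a bit pattern in 1's complement (zero- decodes to 0)."""
    sign = pattern >> (bits - 1)
    return pattern - sign * ((1 << bits) - 1)


def nonzero_counts(decode, bits):
    """Return (number of positive codes, number of negative codes)."""
    values = [decode(p, bits) for p in range(1 << bits)]
    return sum(v > 0 for v in values), sum(v < 0 for v in values)


print(nonzero_counts(twos_value, 2))   # 2's complement, 2 bits
print(nonzero_counts(ones_value, 2))   # 1's complement, 2 bits
```

For 2 bits this gives (1, 2) for 2's complement and (1, 1) for 1's complement, matching the tables above; for 8 bits the counts are (127, 128) and (127, 127).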

Fungus · « **Reply #389 on:** October 11, 2018, 01:55:07 pm »

Quote from: SiliconWizard on October 11, 2018, 02:26:06 am

Quote from: legacy on October 10, 2018, 02:53:54 pm
with the software made able to handle "negative zero"

Is zero negative or positive?
You have 4 hours.

That's a trick question. The answer is "Both", obviously.

Fungus · « **Reply #390 on:** October 11, 2018, 01:56:33 pm »

Quote from: brucehoult on October 11, 2018, 02:41:27 am

Sure. An actually small 32 bit ARM

I had an original Acorn Archimedes back in the day.

technix · « **Reply #391 on:** October 11, 2018, 02:11:13 pm »

Quote from: SilverSolder on October 11, 2018, 02:05:59 am

Are there any 3 cent 32 bit microcontrollers?

Maybe not quite three US cents...

SWM050I2P7, Cortex-M0 in TSOP-8, 8kB Flash, 1kB SRAM, US$0.2 in volume of 10
GD32F130F8P6, Cortex-M3 in TSOP-20, 32kB fast Flash + 32kB slow Flash, 8kB SRAM, US$0.5 in volume of 10.

NorthGuy · « **Reply #392 on:** October 11, 2018, 02:28:56 pm »

Quote from: brucehoult on October 11, 2018, 02:41:27 am

Sure. An actually small 32 bit ARM such as the Cortex M0 or a simple 32 bit RISC-V would be appropriate to compare to the likes of Z80.

It's hard to know much about ARMs but in fully open RISC-V land you have for example https://github.com/SpinalHDL/VexRiscv which can be configured as RV32I at 346 MHz and 0.52 Dhrystone MIPS/MHz on an Artix 7 using 481 LUTs and 539 FFs.

LUTs don't convert conveniently to equivalent gates, but somewhere between 6 and 24 is about right, and probably 12 is a good average. So that's somewhere between 3000 and 12000 gates for the LUTs with 6000 probably being a good guess. D flip-flops are worth 4 gates each I guess, so that's 2000. Total maybe 8000.

Physically a LUT consists of 64 config bits which are selected by 6 address lines. Thus it's 63 muxes, which is a lot more than 12 gates. You may be able to get the same effect with discrete gates, or you may not. It's like data compression - some data compresses well, some data doesn't compress at all. The only thing you can tell is that, if all 6 inputs are used, you need at least 6 gates. Therefore comparing LUTs to gates is not a good idea.
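The "64 config bits, 63 muxes" picture can be sketched in Python as a tree of 2:1 muxes. This is a behavioural model under my own naming, not vendor documentation of how any particular FPGA lays out its LUTs.

```python
def lut6(config, inputs):
    """Evaluate a 6-input LUT as a tree of 2:1 muxes.

    config: 64-bit truth table, bit i is the output for input value i.
    inputs: 6-bit input value.
    Returns (output bit, number of 2:1 muxes in the tree).
    """
    level = [(config >> i) & 1 for i in range(64)]
    muxes = 0
    for bit in range(6):
        select = (inputs >> bit) & 1
        # Each 2:1 mux picks one of two neighbouring entries.
        level = [level[i + select] for i in range(0, len(level), 2)]
        muxes += len(level)
    return level[0], muxes
```

The mux count is 32 + 16 + 8 + 4 + 2 + 1 = 63 whatever function is programmed. A 6-input AND (`config = 1 << 63`) and a 6-input XOR cost the same LUT, although they would need very different numbers of discrete gates.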

If you compare to FPGA based cores, such as Picoblaze, your basic RV32I is the equivalent of 5 Picoblazes (however I don't think Picoblaze can run at 350 MHz).

However, this is a very basic, very feeble processor. If you start adding features (look at the table you posted: https://github.com/SpinalHDL/VexRiscv ), by the time you add enough features to make it into a typical 32-bit processor, you get to 2000 LUTs and the speed decreases to 183 MHz - now it's equivalent to 20 Picoblazes and your RISC is now running slower than Picoblaze.

This is a very interesting table, by the way - adding features consumes lots of logic, but performance growth is not great - your fastest RISC is not even 50% faster than the feeble 346 MHz model.

And that's RISC - the best 32-bit CPU humanity could come up with. If you look at others (such as ARM or Microblaze), the pattern will be the same, but the performance will be even lower.

Quote from: brucehoult on October 11, 2018, 02:41:27 am

Note that the 32 bit ARM1 is listed on the Wikipedia page you referenced as having 25000 transistors, about 3x the Z80.

If we use the smallest 32-bit processor (ARM1), shouldn't we pick the smallest 8-bit processor for comparison? The table shows 3500 transistors for 6502. Your ARM1 is roughly 7 times bigger.

technix · « **Reply #393 on:** October 11, 2018, 05:32:12 pm »

Quote from: NorthGuy on October 11, 2018, 02:28:56 pm

Quote from: brucehoult on October 11, 2018, 02:41:27 am
Sure. An actually small 32 bit ARM such as the Cortex M0 or a simple 32 bit RISC-V would be appropriate to compare to the likes of Z80.

It's hard to know much about ARMs but in fully open RISC-V land you have for example https://github.com/SpinalHDL/VexRiscv which can be configured as RV32I at 346 MHz and 0.52 Dhrystone MIPS/MHz on an Artix 7 using 481 LUTs and 539 FFs.

LUTs don't convert conveniently to equivalent gates, but somewhere between 6 and 24 is about right, and probably 12 is a good average. So that's somewhere between 3000 and 12000 gates for the LUTs with 6000 probably being a good guess. D flip-flops are worth 4 gates each I guess, so that's 2000. Total maybe 8000.

Physically a LUT consists of 64 config bits which are selected by 6 address lines. Thus it's 63 muxes, which is a lot more than 12 gates. You may be able to get the same effect with discrete gates, or you may not. It's like data compression - some data compresses well, some data doesn't compress at all. The only thing you can tell is that, if all 6 inputs are used, you need at least 6 gates. Therefore comparing LUTs to gates is not a good idea.

If you compare to FPGA based cores, such as Picoblaze, your basic RV32I is the equivalent of 5 Picoblazes (however I don't think Picoblaze can run at 350 MHz).

However, this is a very basic, very feeble processor. If you start adding features (look at the table you posted: https://github.com/SpinalHDL/VexRiscv ), by the time you add enough features to make it into a typical 32-bit processor, you get to 2000 LUTs and the speed decreases to 183 MHz - now it's equivalent to 20 Picoblazes and your RISC is now running slower than Picoblaze.

This is a very interesting table, by the way - adding features consumes lots of logic, but performance growth is not great - your fastest RISC is not even 50% faster than the feeble 346 MHz model.

And that's RISC - the best 32-bit CPU humanity could come up with. If you look at others (such as ARM or Microblaze), the pattern will be the same, but the performance will be even lower.

There is one thing 8-bit processors suffer from: memory space. Certain programs eat RAM like candy (Google Chrome with more than a handful of tabs open), and 8-bit cores will quickly start to struggle even if they have 64-bit memory pointers.

Quote from: NorthGuy on October 11, 2018, 02:28:56 pm

Quote from: brucehoult on October 11, 2018, 02:41:27 am
Note that the 32 bit ARM1 is listed on the Wikipedia page you referenced as having 25000 transistors, about 3x the Z80.

If we use the smallest 32-bit processor (ARM1), shouldn't we pick the smallest 8-bit processor for comparison? The table shows 3500 transistors for 6502. Your ARM1 is roughly 7 times bigger.

Do keep in mind that one NMOS transistor can map to multiple CMOS transistors. The 25000 CMOS transistors in ARM1 and the 3500 NMOS transistors in 6502 map to ARM1 having about three times the gate count.

Fungus · « **Reply #394 on:** October 12, 2018, 07:23:52 am »

Quote from: SilverSolder on October 11, 2018, 02:05:59 am

Are there any 3 cent 32 bit microcontrollers?

There won't be any ARM chips, the ARM royalties will be more than 3 cents.

mikeselectricstuff · « **Reply #395 on:** October 12, 2018, 08:10:44 am »

Quote from: Fungus on October 12, 2018, 07:23:52 am

Quote from: SilverSolder on October 11, 2018, 02:05:59 am
Are there any 3 cent 32 bit microcontrollers?

There won't be any ARM chips, the ARM royalties will be more than 3 cents.

RISCV?

rjp · « **Reply #396 on:** October 12, 2018, 08:24:53 am »

Quote from: mikeselectricstuff on October 12, 2018, 08:10:44 am

Quote from: Fungus on October 12, 2018, 07:23:52 am
Quote from: SilverSolder on October 11, 2018, 02:05:59 am
Are there any 3 cent 32 bit microcontrollers?

There won't be any ARM chips, the ARM royalties will be more than 3 cents.
RISCV?

Not until someone is producing them in the numbers that get economies of scale.

I'd expect one of the Chinese mobs to be on to that sooner rather than later, but it's not today.

GeorgeOfTheJungle · « **Reply #397 on:** October 12, 2018, 10:07:29 am »

Quote from: SilverSolder on October 11, 2018, 02:05:59 am

Are there any 3 cent 32 bit microcontrollers?

How about 1 cent, and flexible?

$0.01 Flexible Plastic ARM Processor by PragmatIC

wraper · « **Reply #398 on:** October 12, 2018, 10:44:11 am »

Quote from: GeorgeOfTheJungle on October 12, 2018, 10:07:29 am

Quote from: SilverSolder on October 11, 2018, 02:05:59 am
Are there any 3 cent 32 bit microcontrollers?

How about 1 cent, and flexible?

$0.01 Flexible Plastic ARM Processor by PragmatIC

Except it's not 1 cent, not a real product, only a prototype, and even the ARM licence is more expensive than 1 cent. A few cents sometime in the future, maybe, just as a guy on the right said.

coppice · « **Reply #399 on:** October 12, 2018, 10:47:41 am »

Quote from: wraper on October 12, 2018, 10:44:11 am

Quote from: GeorgeOfTheJungle on October 12, 2018, 10:07:29 am
Quote from: SilverSolder on October 11, 2018, 02:05:59 am
Are there any 3 cent 32 bit microcontrollers?

How about 1 cent, and flexible?

$0.01 Flexible Plastic ARM Processor by PragmatIC
Except it's not 1 cent, not a real product, only a prototype, and even the ARM licence is more expensive than 1 cent. Sometime in the future, maybe, just as a guy on the right said.

They made an ARM1. That is licence free. To make a commercial product they have licence-free options, like RISC-V. People usually can't escape all royalties, because they will have to use some silicon IP, like a flash block. Since these people are not using any silicon technology, they will be creating 100% of their process-related IP.

