Just thought my opinion was interesting. Wasn't it? Ok. I'm out too.
If not already mentioned, consider the AVR Mega2560.
You can configure it to have 32K of external address/data bus space, because it has so many pins available.
It is 8 bit, and has an instruction set somewhat similar to early 8 bit processors.
It can go fairly fast if you want; I'd have to look it up, but something like 16 MHz (compared to early 8 bit CPUs such as the 6502, which originally ran at only 1 MHz).
It is a modern and very available chip.
It has lots of built in peripheral I/O devices, a lot more than the usual arduino offerings.
It is a fully fledged Arduino member, with the information sources and ready-built boards/hardware availability.
But you can buy just the chip.
Quite a bit of onboard flash.
Ohhh .. I hadn't realized that! And I own a few of them. Really should read the datasheet :-)
You can configure [atmega2560] to have, 32K of external address/data bus space
So, there used to be a 32K RAM expansion board that you could buy, but I heard they stopped selling it.
Not sure if the following link is the same or a different version:
https://hackaday.io/project/21561-arduino-mega-2560-32kb-ram-shield
But anyway, you can still buy relatively blank I/O shields for the Arduino Mega2560, for not much money (a few dollars, if I remember correctly), from China. These should allow you to build up your own 32K (or partial 64K; see the nice post about it above this one) board.
The cool feature of the 8 bit AVR at 16 MHz is that it is genuinely slow enough to just about count as a vintage-like, old-school CPU, combined with an 8 bit instruction set that wouldn't look too out of place if it came from the 1970s or 1980s.
There are some harsh facts, though:
- These chips, by themselves, are more expensive than newer, faster ARM or PIC32 chips with much more internal RAM (and the ability to execute from RAM).
- Using them with external RAM is outside of "normal". You will need to reconfigure the linker scripts and possibly the compiler, and the amount of help and example code you can expect to find "on the web" is small.
You can configure [atmega2560] to have, 32K of external address/data bus space
Not for storing instructions, though.
https://www.rugged-circuits.com/new-products/quadram
XMEGA chips support up to 16MB of external RAM (as well as it can be supported by an 8bit architecture with "natively 16bit" addresses. Bank registers. Eww!) The ATXMEGAA1U Xplained Pro Eval board has an Xmega with 512k bytes of external RAM...
I always figured that when I need to start messing with banking registers, it's time for a different (probably 32 bit) architecture.
Really? In what respect?
Since quantum mechanics, mathematics has taken more interest in differential algebra, tensor algebra, and so on.
These are the tools modern physics uses to describe new ideas, and computer science reflects part of this "new hype" with deep-learning AI.
For example, the Google Tensor processor uses rather modern math, both for training operators and for so-called "autonomous" machines; universities have new courses, and laboratories propose new challenges.
Just 20 years ago, if you had written a thesis on AI, you would have discussed a "Lisp-based machine".
Ok, how does this preclude the use of C on the majority of applications that don't require such heavy maths?
Not trying to offend you, but the speed thing is not especially relevant (if aiming for an old retro, vintage, slow type of computer). The idea is to make something broadly similar to a 1 MHz 6502 or 4 MHz Z80 system, so the (mostly) single-cycle instruction execution time at 16 MHz is actually something like 30x faster than those old/original processors, and there are many more registers.
Isn't CUDA a sort of dialect of C?
32x faster running NOP. Or INX, LDA # etc. But the average 6502 instruction execution time is probably at least 3 cycles -- the time to load / store / add / compare etc something from Zero Page. With roughly equal numbers of 4+ cycle instructions and 2 cycle instructions mixed in. So that makes 50x a better estimate. But also you need more instructions on 6502, even when dealing with 8 bit variables. If you've got more than 3 variables in a function and do A = B + C then you're looking at 2 instructions and 2 clock cycles on AVR (whether 8 bit or 16 bit) but 4 instructions and 11 clock cycles on 6502 for 8 bit, and 7 instructions and 20 clock cycles for 16 bit. So you can easily start to see 5 to 10 times more clock cycles on 6502 than AVR before even taking the 16 MHz vs 1 MHz into account.
I reckon 100x would be a good estimate of the speed ratio for skillful hand-written assembly language, 200x for compiled C.
Oh, sorry, I meant OpenCL not CUDA:
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html
Attempting to use 'standard' units, and measurements independent of us: that would be the 'MIPS' then. Because they are both 8 bit, the data size ambiguity doesn't matter so much.
6502 = 0.430 MIPS at 1.000 MHz, source: https://en.wikipedia.org/wiki/Instructions_per_second
Mega2560 = up to 1 MIPS per MHz, i.e. an 8 MHz processor can achieve up to 8 MIPS, hence 16 MIPS @ 16 MHz.
Source: https://en.wikipedia.org/wiki/AVR_microcontrollers
So that gives 16 / 0.430 = x37.21 times faster.
There isn't really an answer as such, because it varies so much with what exactly you are trying to do (the program), how well (efficiently) it is implemented, and how good the compilers are (if not assembler). There could be other factors as well, such as exactly how you measure it (e.g. what the test data is; some data patterns may favour one CPU over the other, etc).
It isn't only the MIPS: you've got only three (true) registers in the 6502 (ignoring zero page), and they're only 8 bits each. That's quite a handicap that makes simple things unnecessarily complicated, and too often turns one-liners into a multi-line register load-store-swap ugly mess.
As I said before, and I stand by it, 100x for carefully hand-written code, and 200x for compiled C code. (Dhrystone shows 250x)
Which is partly why SWEET16, a semi-bytecode-like 16 bit virtual machine, was created in/for the Apple by Steve Wozniak. It somewhat overcomes the 6502's limitations.
https://en.wikipedia.org/wiki/SWEET16
"runs at about one-tenth the speed of the equivalent native 6502 code", see?
It was removed very soon, with the AUTOSTART ROM, in 1978 (79?) IIRC. Not that anybody was using it anyways. The big loss if you ask me was the mini-assembler (F666G), also gone with that ROM "upgrade".
There was no floating point at all in the Apple ][, it came with the monitor, a 6502 mini-assembler, and Woz's Integer Basic. FP came years later ("Applesoft") with the II Plus.
The Apple II became one of several recognizable and successful computers during the 1980s and early 1990s, although this was mainly limited to the USA.
In real life though, the 6502 can be, let's say, x25 faster than the Mega2560, as regards hobby projects.
Assumptions:
The 6502 system has hardware acceleration (video/sound), and its code is hand-crafted machine code.
The Mega2560 has no hardware enhancements, and all its code runs via a poorly written BASIC interpreter someone found on the internet, which is especially slow. (Not to be confused with cheating to make a POINT on a forum.)
Here are a few examples:
You use an old-era 6502-based home computer, such as a Commodore 64 or Atari 800, or possibly another similar computer available at the same time.
The hobby project uses a simple, self-designed, memory-mapped video card interfaced to the Mega2560.
But the old-era home computer has hardware sprite chips and sound chips, potentially greatly speeding up games from that era.
But the Mega2560 doesn't.
Also, you compare an era-correct 6502 chess program, written by expert(s) in hand-crafted machine code, with the hobbyist's BASIC-interpreter version of a chess program on the Mega2560.
Again, the 6502 may have a x25 speed advantage.
Arguably, the 6502 is not really suited to C compilers, so your x200 is really just a way of saying its architecture is not well suited to C compilers. Whereas the Mega2560's rich set of registers and at least somewhat orthogonal instruction set make it well suited to C compilers.