Author Topic: Padauk: why use 16 bit pointers when 8 bit can address the whole RAM? (Read 5187 times)

pastaclub · « **on:** December 28, 2020, 11:16:38 am »

Is there any good reason why Padauk microcontrollers use 16 bit pointers (word)?

Apparently Padauk made 8-core chips in the past... not sure if they had more RAM. However, all current chips have between 80 and 256 bytes of RAM, so you don't need more than 8 bits to address any memory location. The datasheets even point out that in order to increment a pointer, you can simply increment the less significant byte and save one instruction. Then why do you need the more significant byte in the first place?

Circlotron · « **Reply #1 on:** December 28, 2020, 11:52:50 am »

Not familiar with these micros, but are I/O, flash, and various config registers also part of the addressable memory? That could easily total way more than 256 bytes.

mikerj · « **Reply #2 on:** December 28, 2020, 12:24:05 pm »

As said above, it's because RAM isn't the only thing you might want to access via a pointer. The old HTSoft PIC C compiler for the 12/14 bit cores had both 8 and 16 bit pointers, 8 bit for accessing only RAM or SFRs whilst saving memory and cycles, and 16 bit that could access anything.

pastaclub · « **Reply #3 on:** December 28, 2020, 02:18:13 pm »

Quote from: mikerj on December 28, 2020, 12:24:05 pm

RAM isn't the only thing you might want to access via a pointer

That's incorrect. In the Padauk architecture, those resources are not memory-mapped. Most of them are labeled "I/O", and I/O includes the pins, all special registers, the multiplier, the ADC etc. The addresses of I/O resources are in a different address space (which is also 8 bits wide) and they are accessed by different opcodes. So to speak, the 9th bit of the address is part of the opcode, not part of the pointer. In any case, a pointer can only point to RAM locations, not to I/O. And there is barely a handful of instructions that accept pointers (i.e. indirect addressing) and they don't even support offsets.

With regards to minimizing the die and the cost, it all makes a lot of sense. The only thing that doesn't make sense to me is having 16-bit pointers of which the upper 8 bits are unused.

retiredfeline · « **Reply #4 on:** December 28, 2020, 03:17:46 pm »

The instruction sets of the pdk13-16 variants can be seen at https://free-pdk.github.io/. Maybe you mean the 16 bit memory operations like IDXM, where M has to be word aligned. In pdk16 the memory address part of the instruction is 8 or 9 bits depending on whether you count the least significant 0 bit. The rest of the bits in the instruction are the opcode. I don't see a 16-bit pointer, just a 16-bit instruction, as the architecture name suggests.

If you mean that because of the alignment requirement, a whole word has to be used, that's true. It could be future-proofing the instruction set, or to handle pointers to code memory (for constants and literals) uniformly.

mikerj · « **Reply #5 on:** December 28, 2020, 06:46:29 pm »

Quote from: pastaclub on December 28, 2020, 02:18:13 pm

That's incorrect. In the Padauk architecture, those resources are not memory-mapped. Most of them are labeled "I/O", and I/O includes the pins, all special registers, the multiplier, the ADC etc. The addresses of I/O resources are in a different address space (which is also 8 bits wide) and they are accessed by different opcodes. So to speak, the 9th bit of the address is part of the opcode, not part of the pointer. In any case, a pointer can only point to RAM locations, not to I/O

So the Padauk micro is unable to read program memory through code? That seems very limiting.

westfw · « **Reply #6 on:** December 29, 2020, 03:16:54 am »

Quote

the Padauk micro is unable to read program memory through code? That seems very limiting.

Same as the early PICs. In fact, same as lots of the PIC16F series - you don't get an FSR capable of addressing Flash until the "enhanced midrange" architecture. Instead, you get to have lists of "retw" instructions (return immediate value in W). (With program memory less than 16 bits wide, it doesn't matter very much...)

From what I've read here, you won't be far from wrong if you assume that the Padauk chips are quite similar to the older PICs...

Mechatrommer · « **Reply #7 on:** December 29, 2020, 03:42:07 am »

Quote from: pastaclub on December 28, 2020, 02:18:13 pm

With regards to minimizing the die and the cost, it all makes a lot of sense. The only thing that doesn't make sense to me is having 16-bit pointers of which the upper 8 bits are unused.

you've answered your own question. forget about any other points about addressing extra memory, instruction length etc. your answer is fundamental: scalability and backward and forward compatibility. the latter point is the main thing most people mourn when it is broken. if the mcu is 8 bit addressable, you can leave the hi byte zero and forget about it.

retiredfeline · « **Reply #8 on:** December 29, 2020, 04:06:38 am »

This short paper from Oct 2020 by Philipp Klaus Krause, the developer who added support to SDCC for Padauk MCUs, is an interesting read: https://arxiv.org/pdf/2010.04633 In the pdk15 and pdk16 architectures there are LDTAB[HL] instructions for accessing constants in code memory; RET k is used in earlier architectures. You can see that SDCC takes advantage of the unused bits to implement universal pointers. This is also done for other targets like the MCS-51. The paper also has suggestions (directed to Padauk, I assume) for improving the instruction set, chiefly by better supporting stack variables, which C uses a lot.

mikerj · « **Reply #9 on:** December 29, 2020, 09:01:23 am »

Quote from: westfw on December 29, 2020, 03:16:54 am

Quote
the Padauk micro is unable to read program memory through code? That seems very limiting.
Same as the early PICs. In fact, same as lots of the PIC16F series - you don't get an FSR capable of addressing Flash until the "enhanced midrange" architecture. Instead, you get to have lists of "retw" instructions (return immediate value in W). (With program memory less than 16 bits wide, it doesn't matter very much...)

From what I've read here, you won't be far from wrong if you assume that the Padauk chips are quite similar to the older PICs...

Yes I realise how early PICs store constants in program memory, and this is why I asked. The HTSoft compiler would let you address RETLW constants in flash memory via a (16 bit) pointer, and the same pointer could also address RAM.

pastaclub · « **Reply #10 on:** December 29, 2020, 02:44:52 pm »

Thank you all who commented!

Quote from: retiredfeline on December 28, 2020, 03:17:46 pm

I don't see a 16-bit pointer, just a 16-bit instruction

Padauk's documentation is a bit scattered and that part is actually in the help of the IDE (which is where their C dialect is documented). The recommended way to implement pointers is (as weird as it may seem) by declaring a WORD and using it like a pointer:

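WORD ptr;         // a 16-bit WORD is used as the pointer
BYTE b, buf[16];
ptr = &buf[0];    // ptr now holds the RAM address of buf
ptr = ptr + 5;    // advance by five bytes
b = *ptr;         // buf[5] is assigned to b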

I had a look at the compiled code and indeed, a pointer assignment like ptr = &buf[0] compiles into four instructions, namely

1. moving the address of buf into the accumulator (the address at this point is an immediate operand, i.e. a constant; this is probably the reason why the compiler insists that arrays must be global variables, so it can allocate static memory addresses to them which are known at compile time)
2. moving the accumulator value into the low byte of the pointer
3. moving 0 into the accumulator
4. moving the accumulator value into the high byte of the pointer

Now the interesting part is: an indirect read using the pointer (i.e. b = *ptr) is in fact compiled into an IDXM instruction (as you correctly assumed), and just a single one. So clearly, the high byte of the pointer is just wasted. Any assignment will write 0 to it (which is also a waste of program memory and processing time) and the value will never even be read.

What if we try to use a BYTE as a pointer? -> Syntax error. Using it in this way *ptr is only permitted with words, not with bytes.

So those 16-bit pointers are related to Padauk's Mini-C compiler, not to the architecture itself. By using assembly, you can get by with just a byte as a pointer. And that makes no sense whatsoever. I almost believe they inherited the compiler from someone who wrote it for 16-bit architectures and either didn't bother adapting this part or ran into problems and decided not to bother. I consider this a bug. Sure it's fine to encourage code to be easily portable to chips with more than 256 bytes of RAM (which Padauk doesn't have), but on an architecture with 8-bit memory addresses, the compiler should really not insist on 16-bit pointers, especially since it compiles read operations with such pointers into single 8-bit operations.

So my conclusion is: as long as I have enough RAM, I can use a WORD as a pointer. When I have to optimize for every byte, I need to code these parts in assembly and use a byte.

Quote from: mikerj on December 28, 2020, 06:46:29 pm

So the Padauk micro is unable to read program memory through code? That seems very limiting.

Exactly. As others have already said, the Padauk has a RET instruction with an immediate operand. It loads the accumulator with the immediate operand and returns from a call. You can use it together with the PCADD instruction (which adds its immediate operand to the program counter) to implement a constant table in program memory. It's neither efficient nor convenient... but it has its use cases. I needed to implement a simple lookup table with 8 entries. Using an array in RAM was not an option due to limited RAM, and using a SWITCH statement worked but resulted in an endless sequence of loads, compares and jumps, so I did it with PCADD and RET Imm. Although you have to use a CALL first (to then return from) and it looks a bit ugly, it was still the best way to do it. And you are right, it's very limiting - but then again, that's the 3 cent microcontroller.

The dual-core Padauk devices have two instructions called LDTABL and LDTABH which allow reading a byte from program memory in the accumulator. I haven't looked more into it since the chips I am working with don't support it, but according to Free PDK, the address is stored in 9 bits of the opcode. Since there are two instructions, that would mean you can address up to 1KB. The program memory is up to 4KB, so 1KB is not enough and I assume that such tables must be in the last quarter of the ROM or something like that. Again, quirky compromises, but cheap silicon. In any case, this is not related to the question of 16-bit pointers, as LDTAB works only with constant addresses, not with pointer variables.

retiredfeline · « **Reply #11 on:** December 29, 2020, 03:01:31 pm »

Quote from: pastaclub on December 29, 2020, 02:44:52 pm

Now the interesting part is: an indirect read using the pointer (i.e. b = *ptr) is in fact compiled into an IDXM instruction (as you correctly assumed), and just a single one. So clearly, the high byte of the pointer is just wasted. Any assignment will write 0 to it (which is also a waste of program memory and processing time) and the value will never even be read.

What if we try to use a BYTE as a pointer? -> Syntax error. Using it in this way *ptr is only permitted with words, not with bytes.

So those 16-bit pointers are related to Padauk's Mini-C compiler, not to the architecture itself. By using assembly, you can get by with just a byte as a pointer. And that makes no sense whatsoever. I almost believe they inherited the compiler from someone who wrote it for 16-bit architectures and either didn't bother adapting this part or ran into problems and decided not to bother. I consider this a bug. Sure it's fine to encourage code to be easily portable to chips with more than 256 bytes of RAM (which Padauk doesn't have), but on an architecture with 8-bit memory addresses, the compiler should really not insist on 16-bit pointers, especially since it compiles read operations with such pointers into single 8-bit operations.

So my conclusion is: as long as I have enough RAM, I can use a WORD as a pointer. When I have to optimize for every byte, I need to code these parts in assembly and use a byte.

IIRC their C compiler, unless they improved it, is deficient in many other ways, such as not having real for loops, but something like loop unrolling.

Assuming the CPU doesn't actually reference the high byte of the pointer, a smarter compiler could on the lower members of the family stick a byte variable in the unused high byte while still allocating a pointer to be at an even address which would reduce the storage waste. I don't know if SDCC has done this.

There are lots of cut corners in this family. The main thing going for it is the cheapness.

However due to the hardware programming requirements, even though an open source programmer has been developed, I'm taking an interest in the Fremont Micro products which cost only a bit more but can be programmed with nothing more than an Arduino. There is another thread on this family on this forum.

bson · « **Reply #12 on:** December 30, 2020, 07:57:18 pm »

The Padauks have 1-4k OTP flash. To access data tables in flash you need more than 8-bit address calculations.

mikerj · « **Reply #13 on:** December 31, 2020, 12:25:16 pm »

Quote from: pastaclub on December 29, 2020, 02:44:52 pm

Quote from: mikerj on December 28, 2020, 06:46:29 pm
So the Padauk micro is unable to read program memory through code? That seems very limiting.

Exactly. As others have already said, the Padauk has a RET instruction with an immediate operand. It loads the accumulator with the immediate operand and returns from a call. You can use it together with the PCADD instruction (which adds its immediate operand to the program counter) to implement a constant table in program memory. It's neither efficient nor convenient... but it has its use cases.

Exactly like the older 12/14 bit PICs, which as I've said could be utilised via a 16 bit pointer in the HTSoft compiler. If the Padauk compiler isn't sophisticated enough to do this then I would agree a 16 bit pointer is wasteful.

bson · « **Reply #14 on:** December 31, 2020, 09:11:16 pm »

The problem I have understanding architectures like this is that while the CPU core is significantly simplified, the code footprint bloats by the additional overhead, easily by a factor of 2-3x over a traditional GPR ISA. But flash is the dominant user of die real estate, so this seems totally counterproductive: gain 20% on the core footprint, and lose 100% on the flash... Unless we're talking e.g. 64 words of flash or something really small, but it seems even the smallest Padauks have 1k. Which would be really big with a GPR ISA.

brucehoult · « **Reply #15 on:** December 31, 2020, 11:14:45 pm »

Quote from: bson on December 31, 2020, 09:11:16 pm

The problem I have understanding architectures like this is that while the CPU core is significantly simplified, the code footprint bloats by the additional overhead, easily by a factor of 2-3x over a traditional GPR ISA. But flash is the dominant user of die real estate, so this seems totally counterproductive: gain 20% on the core footprint, and lose 100% on the flash... Unless we're talking e.g. 64 words of flash or something really small, but it seems even the smallest Padauks have 1k. Which would be really big with a GPR ISA.

I agree. You wouldn't make that trade-off with a new design. But designs that start very small can paint themselves into a compatibility corner.

The original PIC 1650 had 32 bytes of "registers" and 512 instruction words of 12 bits each, which is half as much as that smallest Padauk. I'm actually surprised it's not smaller on the program storage side as many applications require a lot less than that.

All accumulator ISAs really suffer on code size, and 8 bit ones even more so because of the frequency of needing 16 bit integers and pointers.

A 16 bit datapath and ALU takes very little additional space (and 32 bit not much more) and 16 bit instructions easily allow nice 2-address register-to-register instructions with 8 registers (PDP11, MSP430) or even 16 with some trickery (m68k, SuperH, Thumb) or 32 (AVR, RISC-V C extension).

The benefits in reduced code size kick in at programs that are not all that big.

SiliconWizard · « **Reply #16 on:** January 01, 2021, 12:02:48 am »

bson, you make a good point, but it's hard to fully understand the design rationale without knowing all the context. Regarding code memory, I don't know how Padauk MCUs are made exactly, but I'm thinking a relatively common approach with Chinese designers of those low-cost MCUs is to embed Flash on a separate die, usually from another vendor. Here I don't know if that's what they did, but possibly if they only embed their CPU core on an extremely small die (with simple peripherals and a very low amount of data RAM), connected to a small Flash die (that they possibly could get for ultra ridiculously low prices by the million), then in the end it may prove cheaper than having a bigger CPU core even with half the Flash memory. Just a thought though, I don't know here if that applies for Padauk.

brucehoult · « **Reply #17 on:** January 01, 2021, 12:21:54 am »

The GD32F103 and GD32VF103 copy the off-die serial flash into an equal sized SRAM at startup, so that's still using a serious amount of die area on the MCU itself -- just more compatible with the processes used to make CPU logic.

I have no idea whether Padauk do that. At low clock speeds maybe executing directly from a flash chip would be ok. (They surely don't use an instruction cache.)

retiredfeline · « **Reply #18 on:** January 01, 2021, 12:45:05 am »

There's a photo of a decapped MCU in the paper I mentioned a few posts back.

SiliconWizard · « **Reply #19 on:** January 01, 2021, 12:50:00 am »

Quote from: brucehoult on January 01, 2021, 12:21:54 am

The GD32F103 and GD32VF103 copy the off-die serial flash into an equal sized SRAM at startup, so that's still using a serious amount of die area on the MCU itself -- just more compatible with the processes used to make CPU logic.

I have no idea whether Padauk do that. At low clock speeds maybe executing directly from a flash chip would be ok. (They surely don't use an instruction cache.)

Well, Padauk MCUs do run at low clock frequencies, don't they? I don't remember their max operating freq, but if it is in the order of a few MHz, then directly running off Flash shouldn't be a problem. That's what many MCUs already do. Now, making the Flash external and from another vendor could potentially (IMO) cut costs, with similar performance as if they embedded the Flash on the same die. If said Flash chips have a parallel bus interface, that's easy enough. Whether that would be cost effective all depends on the context, but I'm guessing that it *could* be cheaper.

westfw · « **Reply #20 on:** January 01, 2021, 01:25:26 am »

The architecture wasn't really designed to work with HLLs. Code is less bloated if you program them in assembly language, AFTER thoroughly understanding the architecture and its standard hacks.

(It'd be interesting to compare die sizes of one of those modern CPUs (CM0, RISC-V) with a Padauk or PIC, IMPLEMENTED AT SIMILAR CHIP GEOMETRIES. I get the impression that a lot of these low-end chips are cheap because they're using old fabs that have been paid for several times over. Not that CM0 is a great example, BTW. It has enough of the ARM architecture "omitted" that it doesn't do great with a lot of C code, either...)

(Yes, I know that current wisdom is that a good compiler should do as well as an assembly language programmer. But that assumes that the architecture more or less matches the HLL's model of the world, so that one CAN write a "good compiler." Ye old PIC microcontrollers (and actually a LOT of the older processors, including 8051 and Z80) not only didn't try to support HLLs, but included instructions and features specifically designed to appeal to Assembly Language coders, that were really difficult to get a compiler to generate.
A lot of the RISC movement was not so much about "Reduced" as "implement things that make it easier for compiler writers, rather than easier for assembly language programmers.")

That there are "sort-of OK" compilers for PIC/8051/etc is amazing, even if they do produce relatively bloated code.
(Let's not forget that C was pretty much designed to match the architecture of the PDP11.)

MIS42N · « **Reply #21 on:** January 01, 2021, 02:00:59 am »

Some of Padauk's products have 512 bytes RAM (the BLDC series). It is not clear from the doco if they use the same processor architecture as the others, but we can speculate that is the case. In which case more than an 8-bit pointer is needed. Maybe the internals access memory via a word aligned 16 bit pathway. The fact that loading and reading timer1 (16-bit) is done in 1 cycle would suggest this is so.

MIS42N · « **Reply #22 on:** January 01, 2021, 02:51:15 am »

Quote from: westfw on January 01, 2021, 01:25:26 am

The architecture wasn't really designed to work with HLLs. Code is less bloated if you program them in assembly language, AFTER thoroughly understanding the architecture and its standard hacks.

In the past, I used the tactic of looking at the output of a HLL and hand optimising it with a bit of in-line assembler. One time I intended to do this with a huge FORTRAN program that took a weekend to run. When I looked at the code generated by the compiler, it was about 20% slower than my estimate of what could be achieved with hand coding, so not worth touching. However, that didn't gel with the time taken to run, and I found the I/O routines were ridiculously inefficient. Changing a few lines of the original program to read and write arrays in chunks of 1000 elements at a time instead of 1 element at a time reduced the time to run from around 40 hours to less than 8, meaning it could run overnight. So knowledge of architecture and assembler was essential even though in this case no assembler was needed.

I've also reversed the process: I took an assembly language program written for a VAX and translated it to C. Some parts of the program were so convoluted I had no idea what they did. But the C code (also logically impenetrable) worked the same, only slower. This allowed the other programmers to maintain it, or at least the bits we understood, so everyone was happy.

A satisfying task was writing a sliding window error detecting and correcting communications driver for a Commodore 64. The protocol was called X.PC https://en.wikipedia.org/wiki/X.PC and the source code given to me was in C. It was written for an IBM PC and was around 50Kbyte. It clearly wasn't going to work in the C64. A complication was that the C64 had no UART and the O/S had a bit basher that sort of worked at 300 baud but was flaky at anything faster. I wrote the driver in 6502 assembler and 'hid' most of it behind the BASIC ROM. The C64 had 64K of RAM, but 12K was 'wasted' by overlaying part of it with the O/S and BASIC ROMs. Fortunately these could be switched in and out, giving access to the RAM. I wrote a bit basher that worked at 1200 baud full duplex (which was the speed of the modems we used) and the comms protocol in 6502 assembler. The main application was written in C64 BASIC, which thought it was dealing with the O/S serial driver. It all worked beautifully and users preferred the C64 application to the equivalent IBM PC one (at about 3 times the cost). However, the client didn't have anyone who was capable of maintaining the code, so after a year or so the users with C64s scored a free machine for home and were 'upgraded' to IBM PCs. I don't think there was any way this could be done in a HLL.

retiredfeline · « **Reply #23 on:** January 01, 2021, 03:40:50 am »

Quote from: westfw on January 01, 2021, 01:25:26 am

The architecture wasn't really designed to work with HLLs.

Yet Padauk acknowledged the advantage of a HLL enough to provide a cut down, very non-compliant C for the IDE. If it works fast enough and fits into the ROM who cares if the code is bloated? It's still 3¢.

brucehoult · « **Reply #24 on:** January 01, 2021, 04:02:48 am »

Quote from: westfw on January 01, 2021, 01:25:26 am

The architecture wasn't really designed to work with HLLs. Code is less bloated if you program them in assembly language, AFTER thoroughly understanding the architecture and its standard hacks.

Less bloated but still bloated. The output of C compilers for PIC, 6502 etc is simply obscene.

Simply put: for any fixed number of conveniently addressed short term variable locations (GP registers, zero page in 6502), code for any given construct is longer for an accumulator machine than for a 2-address or 3-address machine because the same number (or more) of addresses are needed in the code but more opcodes are needed.

e.g. let's assume 4 bit opcodes (pretty much the least you can ever get away with) and 16 registers or short-address memory locations, and writing code for "a = b + c"

Accumulator machine:

load b
add c
store a

Each operation code needs 4 bits, plus once each for a, b, c giving a total of 24 bits of code.

2-address:

mov a,b
add a,c

Again 24 bits, but this is the worst case and maybe 80% of the time a is the same as b or c and it reduces to one 12 bit instruction:

add a,b

3-address:

add a,b,c

16 bits. The smallest for the general case, but bigger than the 2-address special case.

Even worse for tight code are the machines such as PDP-11, VAX, 68000 that devote equal amounts of the instruction bits to the register number and the addressing mode -- that's pretty useful when you have a lot of operands in memory, but if you have enough registers that most operands can be in registers then it's a huge waste to specify "register mode" all the time.

68000 and PDP11 for example both have 16 bits in each instruction split as 4 bits operation, 3 bits dst addressing mode, 3 bits dst register, 3 bits src addressing mode, 3 bits src register. VAX has 8 bits operation then one byte for each operand split as 4 bit addressing mode, 4 bits register.

Of course you could examine many other code snippets, and accumulator machines do frequently manage to do multiple operations in succession without storing the accumulator and loading a new value into it (just as 2-address machines often avoid explicit mov), but averaged over any significant program, 2-address machines that don't waste bits on addressing modes in every instruction come out with smaller code than the other choices.

Quote

(It'd be interesting to compare die sizes of one of those modern CPUs (CM0, RISC-V) with a Padauk or PIC, IMPLEMENTED AT SIMILAR CHIP GEOMETRIES. I get the impression that a lot of these low-end chips are cheap because they're using old fabs that have been paid for several times over. Not that CM0 is a great example, BTW. It has enough of the ARM architecture "omitted" that it doesn't do great with a lot of C code, either...)

Yes it would.

Number of LUTs and FFs in an FPGA implementation would be a reasonable proxy.

You should count core size plus code size.

Quote

(Yes, I know that current wisdom is that a good compiler should do as well as an assembly language programmer. But that assumes that the architecture more or less matches the HLL's model of the world, so that one CAN write a "good compiler." Ye old PIC microcontrollers (and actually a LOT of the older processors, including 8051 and Z80) not only didn't try to support HLLs, but included instructions and features specifically designed to appeal to Assembly Language coders, that were really difficult to get a compiler to generate.

I think the major problem is such CPUs were optimized for fairly simple tasks such as block copy/compare or multi-precision arithmetic and de-optimized for functions with a lot of arguments or local variables.

