RISC-V assembly language programming tutorial on YouTube

#150 Reply
Posted by westfw on 16 Dec, 2018 10:45
Quote
Quote
I wish the ISRs in C code [on ARM] that the HW interrupt entry was quicker...
Doesn't the 'naked' attribute of the function definition remove the prolog and epilog?
Not for ARM Cortex. The NVIC hardware saves exactly the same registers that the C ABI says must be saved, so effectively there is NO extra prolog for ISRs. But the NVIC hardware stacks 8 words of context, so it's slower than it could be if the choice was left to the programmer.
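For reference, the 8 stacked words can be sketched as a C struct. This is the basic Cortex-M exception frame (no floating-point context), listed from the lowest stack address upward; the type and constant names are made up for this note.

```c
#include <stdint.h>

/* Basic Cortex-M exception frame: the 8 words the hardware pushes on
   exception entry. r0-r3, r12 and lr are the registers the C calling
   convention treats as caller-saved, which is why an ISR can be a
   plain C function. Names below are illustrative only. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* return address */
    uint32_t xpsr;  /* program status register */
} cortex_m_basic_frame_t;

/* 8 words of 4 bytes each. */
enum { CORTEX_M_BASIC_FRAME_BYTES = sizeof(cortex_m_basic_frame_t) };
```

A handler that needs only one or two scratch registers still pays for all 32 bytes, which is the cost being discussed.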

Quote
Pretty much everyone using ARC or Xtensa is likely to switch to RISC-V
Espressif too? Is there any indication that the "mostly China" manufacturers would switch?

Quote
[complaints about CM0 code] I guess there are two options: 1) let the C compiler figure
That's where I got the 4-register version. Offsets larger than 32 get converted into a MOV of an offset into the 4th register, and "LDR r1,[r2,r3]" addressing mode. In assembly language, I could presumably add/sub manually from the base register or something, at the expense of ... unpleasantness and cryptic code.

Computations with values that are already in registers are where 16 bit opcodes shine.
I think the big thing I was missing is that in simple assembly programs, arrays might be addressed as "[Rindex, #constantSymbolAddress]", while in an only slightly more complex program, they'll be passed around as pointers, and the double-index-register addressing modes will work just fine.
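A rough sketch of the offset limit in question, assuming the usual Thumb-1 "LDR Rt,[Rn,#imm]" word encoding: a 5-bit immediate scaled by 4, so word offsets 0 to 31, i.e. byte offsets 0 to 124. The function name is invented for illustration, and SP-relative and PC-relative loads have their own, larger ranges that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a byte offset can be encoded directly in the Thumb-1
   "LDR Rt,[Rn,#imm]" word form (5-bit immediate, scaled by 4).
   Anything else needs the offset in a register and the
   "LDR Rt,[Rn,Rm]" form instead. */
bool thumb1_ldr_word_imm_fits(uint32_t byte_offset)
{
    return (byte_offset % 4u == 0u) && (byte_offset <= 124u);
}
```

An offset of 148 fails this test, so it has to go through a register first.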

#151 Reply
Posted by legacy on 16 Dec, 2018 12:40
Quote from: brucehoult on 16 Dec, 2018 04:55
You could try the LoFive: https://store.groupgets.com/products/lofive-risc-v

Yup, of this size

A little MPU can handle the keyboard (the key-matrix is 9x10), interfacing serially to the CPU, and a small LCD is usually SPI. It sounds like something that can be done.

#152 Reply
Posted by brucehoult on 16 Dec, 2018 13:15
Quote from: westfw on 16 Dec, 2018 00:01
My surprises show up when initializing peripherals. I expected code like:
Code: [Select]
PORT->Group[0].PINCFG[12].reg |= PORT_PINCFG_DRVSTR;
PORT->Group[0].DIRSET.reg |= 1<<12;

Just for fun, I made a couple of definitions so your code would be compilable and tried it on a few things.

Code: [Select]
#include <stdint.h>

#define PORT_PINCFG_DRVSTR (1<<7)

struct {
    struct {
        struct {
            uint32_t foo;
            uint32_t reg;
            uint32_t bar;
        } PINCFG[16];
        struct {
            uint64_t baz;
            uint32_t reg;
        } DIRSET;
    } Group[10];
} *PORT = (void*)0xdecaf000;

void main(){
    PORT->Group[0].PINCFG[12].reg |= PORT_PINCFG_DRVSTR;
    PORT->Group[0].DIRSET.reg |= 1<<12;
}
And I checked it with for example:

Code: [Select]
arm-linux-gnueabihf-gcc -O initPorts.c -o initPorts -nostartfiles && \
arm-linux-gnueabihf-objdump -D initPorts | expand | less -p'<main>'
So ... ARMv7 (Thumb2):

Code: [Select]
000001c0 <main>:
 1c0:   4b07            ldr     r3, [pc, #28]   ; (1e0 <main+0x20>)
 1c2:   447b            add     r3, pc
 1c4:   681b            ldr     r3, [r3, #0]
 1c6:   f8d3 2094       ldr.w   r2, [r3, #148]  ; 0x94
 1ca:   f042 0280       orr.w   r2, r2, #128    ; 0x80
 1ce:   f8c3 2094       str.w   r2, [r3, #148]  ; 0x94
 1d2:   f8d3 20c8       ldr.w   r2, [r3, #200]  ; 0xc8
 1d6:   f442 5280       orr.w   r2, r2, #4096   ; 0x1000
 1da:   f8c3 20c8       str.w   r2, [r3, #200]  ; 0xc8
 1de:   4770            bx      lr
 1e0:   00010e3a        andeq   r0, r1, sl, lsr lr

00011000 <PORT>:
   11000:       decaf000        cdple   0, 12, cr15, cr10, cr0, {0}
Arm32:

Code: [Select]
000001c0 <main>:
 1c0:   e59f3020        ldr     r3, [pc, #32]   ; 1e8 <main+0x28>
 1c4:   e08f3003        add     r3, pc, r3
 1c8:   e5933000        ldr     r3, [r3]
 1cc:   e5932094        ldr     r2, [r3, #148]  ; 0x94
 1d0:   e3822080        orr     r2, r2, #128    ; 0x80
 1d4:   e5832094        str     r2, [r3, #148]  ; 0x94
 1d8:   e59320c8        ldr     r2, [r3, #200]  ; 0xc8
 1dc:   e3822a01        orr     r2, r2, #4096   ; 0x1000
 1e0:   e58320c8        str     r2, [r3, #200]  ; 0xc8
 1e4:   e12fff1e        bx      lr
 1e8:   00010e34        andeq   r0, r1, r4, lsr lr

00011000 <PORT>:
   11000:       decaf000        cdple   0, 12, cr15, cr10, cr0, {0}
Arm64

Code: [Select]
00000000000002ac <main>:
 2ac:   b0000080        adrp    x0, 11000 <PORT>
 2b0:   f9400000        ldr     x0, [x0]
 2b4:   b9409401        ldr     w1, [x0, #148]
 2b8:   32190021        orr     w1, w1, #0x80
 2bc:   b9009401        str     w1, [x0, #148]
 2c0:   b940c801        ldr     w1, [x0, #200]
 2c4:   32140021        orr     w1, w1, #0x1000
 2c8:   b900c801        str     w1, [x0, #200]
 2cc:   d65f03c0        ret

0000000000011000 <PORT>:
   11000:       decaf000        .word   0xdecaf000
   11004:       00000000        .word   0x00000000
Thumb1:

Code: [Select]
000001c0 <main>:
 1c0:   4b07            ldr     r3, [pc, #28]   ; (1e0 <main+0x20>)
 1c2:   447b            add     r3, pc
 1c4:   681b            ldr     r3, [r3, #0]
 1c6:   2194            movs    r1, #148        ; 0x94
 1c8:   2280            movs    r2, #128        ; 0x80
 1ca:   5858            ldr     r0, [r3, r1]
 1cc:   4302            orrs    r2, r0
 1ce:   505a            str     r2, [r3, r1]
 1d0:   3134            adds    r1, #52         ; 0x34
 1d2:   2280            movs    r2, #128        ; 0x80
 1d4:   0152            lsls    r2, r2, #5
 1d6:   5858            ldr     r0, [r3, r1]
 1d8:   4302            orrs    r2, r0
 1da:   505a            str     r2, [r3, r1]
 1dc:   4770            bx      lr
 1de:   46c0            nop                     ; (mov r8, r8)
 1e0:   00010e3a        andeq   r0, r1, sl, lsr lr

00011000 <PORT>:
   11000:       decaf000        cdple   0, 12, cr15, cr10, cr0, {0}
RISC-V rv32ic: (without c is identical except all instructions take 4 bytes. 64 bit is identical except for a "ld" to get <PORT> and the pointer is 8 bytes instead of 4)

Code: [Select]
00010074 <main>:
   10074:       67c5                    lui     a5,0x11
   10076:       0947a783                lw      a5,148(a5) # 11094 <PORT>
   1007a:       6685                    lui     a3,0x1
   1007c:       0947a703                lw      a4,148(a5)
   10080:       08076713                ori     a4,a4,128
   10084:       08e7aa23                sw      a4,148(a5)
   10088:       0c87a703                lw      a4,200(a5)
   1008c:       8f55                    or      a4,a4,a3
   1008e:       0ce7a423                sw      a4,200(a5)
   10092:       8082                    ret

00011094 <PORT>:
   11094:       f000                    fsw     fs0,32(s0)
   11096:       deca                    sw      s2,124(sp)
M68k:

Code: [Select]
800001ac <main>:
800001ac:       2079 8000 400c  moveal 8000400c <PORT>,%a0
800001b2:       0068 0080 0096  oriw #128,%a0@(150)
800001b8:       0068 1000 00ca  oriw #4096,%a0@(202)
800001be:       4e75            rts

8000400c <PORT>:
8000400c:       deca            addaw %a2,%sp
8000400e:       f000
i686:

Code: [Select]
000001b5 <main>:
 1b5:   e8 20 00 00 00          call   1da <__x86.get_pc_thunk.ax>
 1ba:   05 3a 1e 00 00          add    $0x1e3a,%eax
 1bf:   8b 80 0c 00 00 00       mov    0xc(%eax),%eax
 1c5:   81 88 94 00 00 00 80    orl    $0x80,0x94(%eax)
 1cc:   00 00 00
 1cf:   81 88 c8 00 00 00 00    orl    $0x1000,0xc8(%eax)
 1d6:   10 00 00
 1d9:   c3                      ret

000001da <__x86.get_pc_thunk.ax>:
 1da:   8b 04 24                mov    (%esp),%eax
 1dd:   c3                      ret

00002000 <PORT>:
    2000:       00 f0                   add    %dh,%al
    2002:       ca                      .byte 0xca
    2003:       de                      .byte 0xde
SH4:

Code: [Select]
004001b0 <main>:
  4001b0:       07 d1           mov.l   4001d0 <main+0x20>,r1   ! 411000 <PORT>
  4001b2:       12 61           mov.l   @r1,r1
  4001b4:       13 62           mov     r1,r2
  4001b6:       7c 72           add     #124,r2
  4001b8:       26 50           mov.l   @(24,r2),r0
  4001ba:       80 cb           or      #-128,r0
  4001bc:       06 12           mov.l   r0,@(24,r2)
  4001be:       05 92           mov.w   4001cc <main+0x1c>,r2   ! bc
  4001c0:       2c 31           add     r2,r1
  4001c2:       13 52           mov.l   @(12,r1),r2
  4001c4:       03 93           mov.w   4001ce <main+0x1e>,r3   ! 1000
  4001c6:       3b 22           or      r3,r2
  4001c8:       0b 00           rts
  4001ca:       23 11           mov.l   r2,@(12,r1)
  4001cc:       bc 00           mov.b   @(r0,r11),r0
  4001ce:       00 10           mov.l   r0,@(0,r0)
  4001d0:       00 10           mov.l   r0,@(0,r0)
  4001d2:       41 00           .word 0x0041

00411000 <PORT>:
  411000:       00 f0           .word 0xf000
  411002:       ca de           mov.l   41132c <__bss_start+0x31c>,r14
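#Instr  Code  Data  Total  ISA
10      32    8     40     Thumb2
10      40    8     48     Arm32
15      30    10    40     Thumb1
9       36    8     44     Arm64
10      32    8     40     RISC-V rv64ic
10      32    4     36     RISC-V rv32ic
10      40    4     44     RISC-V rv32i
4       20    4     24     M68k
8       41    4     45     i686
14      28    12    40     SH4

(Code and Data are in bytes. The SH4 row counts the store in the rts delay slot at 4001ca as an instruction, which gives 14 instructions and 28 bytes of code; the total is still 40.)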

Good old Motorola 68000 wins by miles on both number of instructions and total number of bytes!

Thumb1 and SH4 use a lot of instructions, but are the next smallest in code size after m68k. They're just middle of the pack once you include .data

rv32i is slightly smaller than Arm32 and rv32ic is slightly smaller than Thumb2 in total size. The number of instructions is identical for all of them and rv32i/Arm32 and rv32ic/Thumb2 have the same code size as each other.

rv64ic has one instruction more than Arm64, but the code is 4 bytes smaller. Both have to load a 64 bit pointer from the .data section, costing 4 bytes, but they don't need an intermediate pointer at the end of the function code, saving 4 bytes.
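The 148 and 200 byte offsets that show up in most of the listings above follow from the struct layout. Here is a small sketch that recomputes them with offsetof. The type names pincfg_t, dirset_t and group_t and the two helper functions are invented for this note, since the structs in the test program are anonymous.

```c
#include <stddef.h>
#include <stdint.h>

/* Named copies of the anonymous structs in the test program.
   The type names are made up for illustration. */
typedef struct { uint32_t foo; uint32_t reg; uint32_t bar; } pincfg_t;
typedef struct { uint64_t baz; uint32_t reg; } dirset_t;
typedef struct { pincfg_t PINCFG[16]; dirset_t DIRSET; } group_t;

/* Byte offset of Group[0].PINCFG[pin].reg from the start of the group. */
size_t pincfg_reg_offset(size_t pin)
{
    return offsetof(group_t, PINCFG)
         + pin * sizeof(pincfg_t)
         + offsetof(pincfg_t, reg);
}

/* Byte offset of Group[0].DIRSET.reg from the start of the group. */
size_t dirset_reg_offset(void)
{
    return offsetof(group_t, DIRSET) + offsetof(dirset_t, reg);
}
```

pincfg_reg_offset(12) is 12*12 + 4 = 148 and dirset_reg_offset() is 192 + 8 = 200. The M68k listing shows 150 and 202 instead, presumably because oriw only touches the low 16 bits of each big-endian 32-bit word.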

#153 Reply
Posted by NorthGuy on 16 Dec, 2018 16:06
Quote from: brucehoult on 16 Dec, 2018 13:15
Quote from: westfw on 16 Dec, 2018 00:01
My surprises show up when initializing peripherals. I expected code like:
Code: [Select]
PORT->Group[0].PINCFG[12].reg |= PORT_PINCFG_DRVSTR;
PORT->Group[0].DIRSET.reg |= 1<<12;

Just for fun, I made a couple of definitions so your code would be compilable and tried it on a few things.

Code: [Select]
#include <stdint.h>

#define PORT_PINCFG_DRVSTR (1<<7)

struct {
    struct {
        struct {
            uint32_t foo;
            uint32_t reg;
            uint32_t bar;
        } PINCFG[16];
        struct {
            uint64_t baz;
            uint32_t reg;
        } DIRSET;
    } Group[10];
} *PORT = (void*)0xdecaf000;

void main(){
    PORT->Group[0].PINCFG[12].reg |= PORT_PINCFG_DRVSTR;
    PORT->Group[0].DIRSET.reg |= 1<<12;
}

In SAM, "Group" represents a group of registers 128 bytes long and everything below is just unions. "PORT" would be a fixed location in memory space. So, what the code actually does is setting 2 bits at the fixed memory location.

There's no pointer loading (which takes whopping 50% in Motorola, and 49% in Intel which you decided to compile as position-independent code). Moreover, when someone builds an MCU with RISC-V, they will probably provide some way of setting bits without reading registers, as Atmel did here:

Code: [Select]
PORT->Group[0].DIRSET.reg = 1<<12; // no need for "|="
The register is called DIRSET because writing to it only sets the bits (and the bits which are written "0" remain unchanged), and there's an opposite register called DIRCLR which clears the bits, and also DIRTGL which xors.
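The set/clear/toggle behaviour can be modelled in plain C. This is only a software model of what the hardware does on a write; the port_model_t type and the write_* function names are hypothetical, not part of any vendor header.

```c
#include <stdint.h>

/* Software model of a direction register that is reached through
   separate set, clear and toggle aliases. On real hardware these are
   memory-mapped addresses; here they are ordinary functions. */
typedef struct {
    uint32_t dir;   /* the underlying direction bits */
} port_model_t;

/* Writing to DIRSET: 1 bits are set, 0 bits are left unchanged. */
void write_dirset(port_model_t *p, uint32_t value) { p->dir |= value; }

/* Writing to DIRCLR: 1 bits are cleared, 0 bits are left unchanged. */
void write_dirclr(port_model_t *p, uint32_t value) { p->dir &= ~value; }

/* Writing to DIRTGL: 1 bits are inverted, 0 bits are left unchanged. */
void write_dirtgl(port_model_t *p, uint32_t value) { p->dir ^= value; }
```

Because the read-modify-write happens inside the peripheral, the CPU side is a single store and no other bit of the register can be disturbed by an interrupt in between.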

The compiler may be clever enough to keep one of the registers permanently pointing to the IO registers area, so the whole thing boils down to this:

Code: [Select]
6685      lui a3,0x1
0ce7a423  sw a3,200(a5) ; replace "200" with correct offset from a5
<edit>Can't help it. In dsPIC33 you get:

Code: [Select]
bset LATA,#12
one instruction and 3 bytes (50% compared to RISC-V).

#154 Reply
Posted by lucazader on 16 Dec, 2018 18:25
Quote from: westfw on 16 Dec, 2018 10:45
Quote
Pretty much everyone using ARC or Xtensa is likely to switch to RISC-V
Espressif too? Is there any indication that the "mostly China" manufacturers would switch?

Yea they are a member of the RISC-V foundation, a "Founding Gold" member, whatever that means.
https://riscv.org/members-at-a-glance/

Judging from timing on when they would have started development on an ESP32 successor, I'd put it at about 50% chance of them switching over to RISC-V in the next chip, but a lot higher in the chip after that.

#155 Reply
Posted by rhodges on 16 Dec, 2018 18:45
Quote from: westfw on 16 Dec, 2018 10:45

Not for ARM Cortex. The NVIC hardware saves exactly the same registers that the C ABI says must be saved, so effectively there is NO extra prolog for ISRs. But the NVIC hardware stacks 8 words of context, so it's slower than it could be if the choice was left to the programmer.
I have really been enjoying this discussion

A decade and a half ago, I had the pleasure of working with a VLIW processor, the Trimedia/Philips PNX1302. It dispatched up to 5 operations per instruction word at 200 MHz. It had 128 32-bit registers, and the convention was that the bottom 64 belonged to user code and the top 64 could be used by the ISR. No saving required. Further, an interrupt only happens when the user code makes a jump. So user code could (with care) use the top 64 between jumps. An interesting and useful side-effect is that user code could assume no interrupts while doing code that needs to be atomic.

I just thought some might find this interesting.

#156 Reply
Posted by NorthGuy on 16 Dec, 2018 20:09
Quote from: rhodges on 16 Dec, 2018 18:45
It had 128 32-bit registers, and the convention was that the bottom 64 belonged to user code and the top 64 could be used by the ISR. No saving required.

Some modern MCUs have multiple register sets. When an interrupt happens, the new set gets loaded. When it quits, the old one gets restored. It doesn't take any additional time and thus decreases the interrupt latency by a lot. If you have a separate register set for every interrupt level, you never need to save anything.

However, I think in the future, as everything moves to multi-cores, things may get even better. If you assign a designated core to an interrupt, then the core can simply sit there waiting for the interrupt to happen. Then there's no latency except for the short period necessary to synchronize the interrupt signal to the CPU clock.

#157 Reply
Posted by langwadt on 16 Dec, 2018 20:40
Quote from: westfw on 16 Dec, 2018 10:45
Quote
Quote
I wish the ISRs in C code [on ARM] that the HW interrupt entry was quicker...
Doesn't the 'naked' attribute of the function definition remove the prolog and epilog?
Not for ARM Cortex. The NVIC hardware saves exactly the same registers that the C ABI says must be saved, so effectively there is NO extra prolog for ISRs. But the NVIC hardware stacks 8 words of context, so it's slower than it could be if the choice was left to the programmer.

slower in the rare case you need to do something in a few cycles with no registers, likely faster in the majority of cases

#158 Reply
Posted by ataradov on 16 Dec, 2018 20:51
Quote from: NorthGuy on 16 Dec, 2018 20:09
However, I think in the future, as everything moves to multi-cores, things may get even better. If you assign a designated core to an interrupt, then the core can simply sit there waiting for the interrupt to happen. Then there's no latency except for the short period necessary to synchronize the interrupt signal to the CPU clock.
The limiting factor here will be memory. You either need to have a dedicated memory per core, which will make the maximum size of the handler inflexible, or deal with concurrent access by multiple cores, which will slow down everything.

#159 Reply
Posted by andersm on 16 Dec, 2018 21:00
Quote from: NorthGuy on 16 Dec, 2018 20:09
Some modern MCUs have multiple register sets. When an interrupt happens, the new set gets loaded. When it quits, the old one gets restored. It doesn't take any additional time and thus decreases the interrupt latency by a lot. If you have a separate register set for every interrupt level, you never need to save anything.
Register banks do make code that needs to access registers across priority levels a whole lot messier (e.g. task switching using a low-priority interrupt, like is usually done on Cortex-M MCUs, or exception handlers). I guess with modern manufacturing processes the extra state required by the additional register banks isn't a big deal anymore (e.g. 31 32-bit registers by 8 banks is a bit less than 1000 bytes).

#160 Reply
Posted by David Hess on 16 Dec, 2018 21:20
Quote from: andersm on 16 Dec, 2018 21:00
Register banks do make code that needs to access registers across priority levels a whole lot messier (e.g. task switching using a low-priority interrupt, like is usually done on Cortex-M MCUs, or exception handlers). I guess with modern manufacturing processes the extra state required by the additional register banks isn't a big deal anymore (e.g. 31 32-bit registers by 8 banks is a bit less than 1000 bytes).

It does not cost as much due to area now but the register bank is within the critical timing path for the pipeline so it limits performance in an aggressive design.

#161 Reply
Posted by NorthGuy on 16 Dec, 2018 21:32
Quote from: ataradov on 16 Dec, 2018 20:51
Quote from: NorthGuy on 16 Dec, 2018 20:09
However, I think in the future, as everything moves to multi-cores, things may get even better. If you assign a designated core to an interrupt, then the core can simply sit there waiting for the interrupt to happen. Then there's no latency except for the short period necessary to synchronize the interrupt signal to the CPU clock.
The limiting factor here will be memory. You either need to have a dedicated memory per core, which will make the maximum size of the handler inflexible, or deal with concurrent access by multiple cores, which will slow down everything.

I have ideas for this too. Most of the cores should have very limited amount of dedicated regular memory, but they will have one or more deep hardware FIFOs. The other end of the FIFOs may be muxed to other cores, which provides wide address-less communication channels between cores. This removes bus congestion altogether. The central core (or cores), in contrast, will have bigger memory so they can process data.
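A software stand-in for such an address-less channel is a single-producer, single-consumer ring buffer. The sketch below is only a model of the idea in C11, not a description of any real hardware FIFO; fifo_t, fifo_push, fifo_pop and the depth of 16 are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* hypothetical depth; must be a power of two */

/* One producer core calls fifo_push, one consumer core calls fifo_pop.
   Neither side ever writes the other side's index. */
typedef struct {
    uint32_t    slots[FIFO_DEPTH];
    atomic_uint head;   /* advanced only by the consumer */
    atomic_uint tail;   /* advanced only by the producer */
} fifo_t;

/* Returns false if the FIFO is full; the word is not stored then. */
bool fifo_push(fifo_t *f, uint32_t word)
{
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (tail - head == FIFO_DEPTH) {
        return false;
    }
    f->slots[tail % FIFO_DEPTH] = word;
    /* Release so the consumer sees the slot contents before the new tail. */
    atomic_store_explicit(&f->tail, tail + 1u, memory_order_release);
    return true;
}

/* Returns false if the FIFO is empty; *word is left untouched then. */
bool fifo_pop(fifo_t *f, uint32_t *word)
{
    unsigned head = atomic_load_explicit(&f->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *word = f->slots[head % FIFO_DEPTH];
    /* Release so the producer sees the slot as free only after the read. */
    atomic_store_explicit(&f->head, head + 1u, memory_order_release);
    return true;
}
```

In this software model the slots still live in shared memory, so it does not remove bus traffic the way a dedicated hardware FIFO would. It only shows the programming model: words go in one end and come out the other, with no addresses exchanged.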

#162 Reply
Posted by ataradov on 16 Dec, 2018 21:36
Quote from: NorthGuy on 16 Dec, 2018 21:32
I have ideas for this too. Most of the cores should have very limited amount of dedicated regular memory, but they will have one or more deep hardware FIFOs. The other end of the FIFOs may be muxed to other cores, which provides wide address-less communication channels between cores. This removes bus congestion altogether. The central core (or cores), in contrast, will have bigger memory so they can process data.
That does not address code memory.

#163 Reply
Posted by NorthGuy on 16 Dec, 2018 22:01
Quote from: ataradov on 16 Dec, 2018 21:36
That does not address code memory.

Doesn't have to. Code memory can be made completely separate from data memory. Each peripheral core has its own limited amount of code memory which can be programmed by the central core as needed. Small memories can be made very fast. This ensures very fast deterministic execution for the peripheral cores. In contrast, the central core doesn't have to be deterministic - may have caches, pipelines - if it ever needs access to data, it all gets smoothed out by FIFOs.

#164 Reply
Posted by ataradov on 16 Dec, 2018 22:03
Quote from: NorthGuy on 16 Dec, 2018 22:01
Code memory can be made completely separate from data memory.
That's exactly what I'm talking about. You will essentially limit what your "interrupt" handler can do by defining the amount of code memory it has. I think this will be enough of a limitation to make this system impractical. At least for common microcontroller uses. It may be useful in an MPU environment. Kind of like ARM's big.LITTLE stuff.

#165 Reply
Posted by hamster_nz on 16 Dec, 2018 22:05
Quote from: rhodges on 16 Dec, 2018 18:45
... Further, an interrupt only happens when the user code makes a jump... An interesting and useful side-effect is that user code could assume no interrupts while doing code that needs to be atomic.

I just thought some might find this interesting.
I found that very interesting!

#166 Reply
Posted by westfw on 16 Dec, 2018 22:17
Quote
[ARM Cortex NVIC register stacking] likely faster in the majority of cases

I'm not convinced. We're talking register stacking, probably limited by memory speed, and taking all of 1 instruction (push multiple) in the ISR to save exactly which ones you need...

Quote
The register is called DIRSET because writing to it only sets the bits

Yeah, ....DIRSET |= bitmask; was not the best example.

Quote
The compiler may be clever enough to keep one of the registers permanently pointing to the IO registers area

Maybe. 32bit processors tend to really spread those IO registers out, perhaps occupying more than even a reasonable offset constant for indexed addressing. And constant-folding upper bits of an address might be too much to ask of a compiler. I remember looking at PIC32 code (MIPS), which loads 32bit constants half-at-a-time (LUI/ORI), and being disappointed that it kept re-loading the same upper value. OTOH, I think Microchip was defining those symbols at link time rather than in C source, so there wasn't much choice... (This was quite a while ago. Maybe now, with LTO and similar, it does better.)
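A minimal sketch of the half-at-a-time split, assuming the plain LUI/ORI pairing described above (upper 16 bits via LUI, lower 16 bits ORed in). The helper names are invented for illustration. Two register addresses can share one LUI only if their upper halves match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper 16 bits: the immediate a LUI would load. */
uint16_t upper_half(uint32_t addr) { return (uint16_t)(addr >> 16); }

/* Lower 16 bits: the immediate the following ORI would merge in. */
uint16_t lower_half(uint32_t addr) { return (uint16_t)(addr & 0xFFFFu); }

/* True if both addresses could reuse the same LUI result. */
bool can_share_upper(uint32_t a, uint32_t b)
{
    return upper_half(a) == upper_half(b);
}
```

With the 0xdecaf000 base used earlier in the thread, the upper half is 0xdeca and the lower half is 0xf000, and any register within the same 64 KB block could in principle reuse the LUI.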

#167 Reply
Posted by NorthGuy on 16 Dec, 2018 22:36
Quote from: ataradov on 16 Dec, 2018 22:03
Quote from: NorthGuy on 16 Dec, 2018 22:01
Code memory can be made completely separate from data memory.
That's exactly what I'm talking about. You will essentially limit what your "interrupt" handler can do by defining the amount of code memory it has. I think this will be enough of a limitation to make this system impractical. At least for common microcontroller uses.

You do not need a lot of memory for peripheral cores - you need speed and determinism. And that is what MCUs are lacking now. You always can have a central core with enormous amount of memory to do any kind of processing.

The approach where you have a single memory bus for both data and code which is accessed simultaneously by CPU and 15 DMA channels through the bus arbiter, is not very suitable for real-time applications.

#168 Reply
Posted by langwadt on 16 Dec, 2018 22:47
Quote from: westfw on 16 Dec, 2018 22:17
Quote
[ARM Cortex NVIC register stacking] likely faster in the majority of cases

I'm not convinced. We're talking register stacking, probably limited by memory speed, and taking all of 1 instruction (push multiple) in the ISR to save exactly which ones you need...

but before you get to your push multiple, the core first has to read the vector table and fetch the first instruction of the ISR (prolog)
done automatically, it can often be done in parallel

#169 Reply
Posted by NorthGuy on 16 Dec, 2018 23:16
Quote from: westfw on 16 Dec, 2018 22:17
Quote
The compiler may be clever enough to keep one of the registers permanently pointing to the IO registers area
Maybe. 32bit processors tend to really spread those IO registers out, perhaps occupying more than even a reasonable offset constant for indexed addressing. And constant-folding upper bits of an address might be too much to ask of a compiler. I remember looking at PIC32 code (MIPS), which loads 32bit constants half-at-a-time (LUI/ORI), and being disappointed that it kept re-loading the same upper value.

Microchip went overboard with spreading the registers all over the place in PIC32. There's no reason for that. In PIC24, everything fits into 2048 bytes quite nicely, even with space to spare. RISC-V has only 4096 reach, but I think this is Ok for hardware registers.

If you locate all your peripheral registers at the beginning of the memory space, you already have the zero register which creates free zero base for you. So, you have 2048 bytes which are easily accessible. Good place for hardware registers.

It would be full 4096 bytes, but RISC-V went the traditional sign-extended (instead of more reasonable zero-extended) road for offsets. Although on RV32 the addresses 0xfffff800 to 0xffffffff (the negative half of the offset range) may be used for peripheral registers too.
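As a quick check of which addresses are reachable from the zero register on RV32: the load/store immediate is 12 bits, sign-extended, so it covers -2048 to 2047. The function name below is made up for this note.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a 32-bit address can be formed as x0 + imm12, where imm12 is
   a sign-extended 12-bit immediate in the range -2048 .. 2047.
   Non-negative offsets give 0x00000000 .. 0x000007ff; negative offsets
   wrap around to 0xfffff800 .. 0xffffffff. */
bool reachable_from_zero(uint32_t addr)
{
    return addr <= 0x000007FFu || addr >= 0xFFFFF800u;
}
```

So there are two 2048-byte windows, one at the bottom and one at the top of the 32-bit address space, that need no base register at all.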

Quote from: westfw on 16 Dec, 2018 22:17
OTOH, I think Microchip was defining those symbols at link time rather than in C source, so there wasn't much choice...

That's true. Although it's not a very good idea. I remember I had to copy definitions from the linker scripts to the inc files when I was working with PIC24.

#170 Reply
Posted by brucehoult on 17 Dec, 2018 00:25
Quote from: NorthGuy on 16 Dec, 2018 16:06
In SAM, "Group" represents a group of registers 128 bytes long and everything below is just unions.

I don't suppose the exact sizes matter much, as long as you stay within what can be done with a simple offset.

Quote
"PORT" would be a fixed location in memory space. So, what the code actually does is setting 2 bits at the fixed memory location.

Setting two bits at fixed locations .. yep .. that's what I compiled.

Quote
There's no pointer loading (which takes whopping 50% in Motorola, and 49% in Intel which you decided to compile as position-independent code).

I compiled them the way they came. None of the other ISAs have problems using PC-relative addressing.

You need to get the address of the hardware registers *somehow*. Now, it's true that you'd probably get slightly smaller code using the address of "PORT" as a #define instead of as a global variable, but that's the same for all ISAs and doesn't favour one over another.

Quote
Moreover, when someone builds an MCU with RISC-V, they will probably provide some way of setting bits without reading registers, as Atmel did here:

Code: [Select]
PORT->Group[0].DIRSET.reg = 1<<12; // no need for "|="
The register is called DIRSET because writing to it only sets the bits (and the bits which are written "0" remain unchanged), and there's an opposite register called DIRCLR which clears the bits, and also DIRTGL which xors.

I took the C code exactly as given by westfw, which also matches the ARM assembly language he gave in loading, ORing, and storing.

Incidentally, RISC-V *does* have a way to change bits without bringing the data to the CPU and back, but it seemed unfair to use it. I'm concentrating here on compiled C code.

AMOOR.W res,addr,val

This sends a message with the address, value, and operation out over the TileLink bus. If all the channels of the bus go as far as the peripheral, then the peripheral itself will do the OR operation locally and report back the old value. If at some point on the way to the peripheral the bus narrows to just a simple read/write bus then the controller at that point will do the read/modify/write and report the result back to the CPU.
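From C, the closest portable spelling is a C11 atomic fetch-or. The expectation, not verified here, is that a compiler targeting a RISC-V core with the A extension lowers a relaxed 32-bit fetch-or to amoor.w. The function name is invented, and pointing such an object at a real peripheral address would need platform-specific care that this sketch leaves out.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically OR `mask` into *reg and return the value *reg held
   before the operation, which is what atomic_fetch_or is defined
   to return. */
uint32_t atomic_set_bits(_Atomic uint32_t *reg, uint32_t mask)
{
    return atomic_fetch_or_explicit(reg, mask, memory_order_relaxed);
}
```

Calling atomic_set_bits(&r, 1u << 12) on an object holding 0x80 leaves 0x1080 in it and returns 0x80.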

Quote
The compiler may be clever enough to keep one of the registers permanently pointing to the IO registers area, so the whole thing boils down to this:

Code: [Select]
6685      lui a3,0x1
0ce7a423  sw a3,200(a5) ; replace "200" with correct offset from a5

Sure, of course. But that value has to *get* into a5 somehow, and I showed that.

If I'd chosen to put the code into a function that took PORT as an argument then *all* of the ISAs would show shorter code.

#171 Reply
Posted by brucehoult on 17 Dec, 2018 00:39
Quote from: rhodges on 16 Dec, 2018 18:45
A decade and a half ago, I had the pleasure of working with a VLIW processor, the Trimedia/Philips PNX1302. It dispatched up to 5 operations per instruction word at 200 MHz. It had 128 32-bit registers, and the convention was that the bottom 64 belonged to user code and the top 64 could be used by the ISR. No saving required.

You can do this on any CPU with a reasonably large number of registers. It's just a matter of documenting it and making sure the compiler (and/or assembly language programmers) know about it.

Even three or four registers is enough for many interrupt routines, so you could reasonably do this on machines with 16 registers -- but 32 would be better.

Quote
Further, an interrupt only happens when the user code makes a jump. So user code could (with care) use the top 64 between jumps. An interesting and useful side-effect is that user code could assume no interrupts while doing code that needs to be atomic.

This is a nice property. I've worked on a machine that (potentially) switched threads after every "block" of code -- not quite a basic block as there was a way to do if/then/else and small loops within a block, but there was a limit on the number of instructions executed in the block. Once you were in a block you were guaranteed NOT to be interrupted. And there was a bank of 8 fast registers (1 cycle latency) that could be used within a block but went *poof* at the end of the block. The 256 global registers had several cycles more latency than that.

#172 Reply
Posted by brucehoult on 17 Dec, 2018 00:47
Quote from: NorthGuy on 16 Dec, 2018 20:09
Quote from: rhodges on 16 Dec, 2018 18:45
It had 128 32-bit registers, and the convention was that the bottom 64 belonged to user code and the top 64 could be used by the ISR. No saving required.

Some modern MCUs have multiple register sets. When an interrupt happens, the new set gets loaded. When it quits, the old one gets restored. It doesn't take any additional time and thus decreases the interrupt latency by a lot. If you have a separate register set for every interrupt level, you never need to save anything.

I don't know about "modern". The Z80 did this. Old ARM chips had a set of registers for every privilege level (not necessarily a whole set). And SPARC and Itanium had register windows that were used not only by interrupts, but by function calls.

There are two problems with this that explain why no one does it any more:

1) at some point you run out and want three sets instead of two, or seventeen sets instead of sixteen. And then you have a whole lot of delay while you swap stuff. And you have to swap the entire set of registers even if the function/task using them is only using a small proportion of them.

2) it's just a huge waste of hardware resources that, in the end, is not actually used all that effectively. You're better off spending those transistors on something else -- such as a cache or write buffer that can absorb manually saved registers quickly on interrupts, but also makes your normal code run faster the rest of the time as well.

#173 Reply
Posted by brucehoult on 17 Dec, 2018 00:58
Quote from: langwadt on 16 Dec, 2018 20:40
Quote from: westfw on 16 Dec, 2018 10:45
Quote
Quote
I wish the ISRs in C code [on ARM] that the HW interrupt entry was quicker...
Doesn't the 'naked' attribute of the function definition remove the prolog and epilog?
Not for ARM Cortex. The NVIC hardware saves exactly the same registers that the C ABI says must be saved, so effectively there is NO extra prolog for ISRs. But the NVIC hardware stacks 8 words of context, so it's slower than it could be if the choice was left to the programmer.

slower in the rare case you need to do something in a few cycles with no registers, likely faster in the majority of cases

Not faster. If the hardware managed to write those 8 words to memory (or at least to a write buffer or something) in one or two clock cycles then it would be faster. But it doesn't. Cortex M3, M4, M7 all have 12 cycle interrupt latency (M0 has 16). It's sitting there writing those eight registers out at one per clock cycle, exactly the same as you could do yourself in software.

#174 Reply
Posted by brucehoult on 17 Dec, 2018 01:00
Quote from: David Hess on 16 Dec, 2018 21:20
Quote from: andersm on 16 Dec, 2018 21:00
Register banks do make code that needs to access registers across priority levels a whole lot messier (e.g. task switching using a low-priority interrupt, like is usually done on Cortex-M MCUs, or exception handlers). I guess with modern manufacturing processes the extra state required by the additional register banks isn't a big deal anymore (e.g. 31 32-bit registers by 8 banks is a bit less than 1000 bytes).

It does not cost as much due to area now but the register bank is within the critical timing path for the pipeline so it limits performance in an aggressive design.

Yes.

Also, there are other ways to use that 1 KB worth of transistors that give more bang for the buck, more of the time.
