you have access to the single-cycle bit-manipulation instructions RBIT, BFI, BFX etc.
you need to use shifts and adds [using more cycles.]
I think the second poster missed that the bit manipulation instructions (on CM3 and CM4) theoretically allow you to "gather" scattered bits significantly more efficiently than a series of shift/mask/add instructions.
I regret that I was using loose language here. When I said "shifts and adds", I was not referring to the literal shift or add instructions of the CPU, but to the macroscopic task of combining results. I was calling all non-LUT masking, extracting, and merging "shift-and-add". I should've said "bit twiddling".
After correcting this terminology issue, I do not think I missed anything. I think the third poster missed that having a single-bank GPIO pinout (on the STM32H7) theoretically allows you to "gather" scattered bits in 2 clock cycles, significantly more efficiently than the bit manipulation instructions such as RBIT, BFI, BFX on the CM3 and CM4.
If the combinations are too many for SRAM LUT, you need to use shifts and adds, this rearranging wastes perhaps 10 cycles more, you lose another 20 ns even at 480 MHz. Bad news.
On the STM32H7, an LDR from ITCM/DTCM takes two cycles for a 32-bit word, with zero wait states, which means that if your GPIO combination fits in TCM and it's in a critical path, it only takes 2 CPU cycles per lookup. No computation, no wait. For example, this is perfect for un-swizzling 16 address bits if they're on the same GPIO bank, and it costs perhaps 4 ns (2 cycles at 480 MHz), not including the setup cost, such as loading the address.
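To make the single-bank case concrete, here is a minimal sketch of such an un-swizzling table. The wiring in `pin_to_addr` is a made-up example (it simply reverses the bit order), and the names `unswizzle_lut`, `build_unswizzle_lut` and `unswizzle` are hypothetical; the TCM placement attribute is left out because it depends on the linker script.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wiring: pin_to_addr[i] is the address bit wired to GPIO pin i.
 * This example reverses the bit order; a real board has its own permutation. */
static const uint8_t pin_to_addr[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
};

/* 64 Ki entries x 16 bits = 128 KiB. On the target this would live in TCM. */
static uint16_t unswizzle_lut[1u << 16];

/* Fill the table once at startup (or generate it offline). */
void build_unswizzle_lut(void)
{
    for (uint32_t raw = 0; raw < (1u << 16); raw++) {
        uint16_t addr = 0;
        for (unsigned pin = 0; pin < 16; pin++) {
            assert(pin_to_addr[pin] < 16);
            if (raw & (1u << pin))
                addr |= (uint16_t)(1u << pin_to_addr[pin]);
        }
        unswizzle_lut[raw] = addr;
    }
}

/* Critical path: one indexed load, no masking, shifting or merging. */
static inline uint16_t unswizzle(uint16_t raw_port_value)
{
    return unswizzle_lut[raw_port_value];
}
```

After calling `build_unswizzle_lut()` once, `unswizzle()` is the only thing left in the hot path, and it should compile down to a single indexed load.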
If they are not on the same GPIO bank, you have a 32-bit problem. So you need at least some shifts to reduce the problem space, and that's going to take more CPU cycles for three reasons:
1. The input needs shifting, or masking, or bitfield extraction, or sign extension, or whatever.
2. Multiple partial results must be computed or loaded.
3. Partial results must be merged.
If you're not convinced, I challenge you to accelerate the following GPIO decoder from my real project. This is in a critical path, and I did it to eliminate all signal vias in the PCB layout. I thought I had enough time, but I was wrong. I was short on time by 20-30 ns, and now I have to respin the board with sequential GPIOs to meet the timing. I think the following code is reasonably efficient; you can perhaps eliminate a few more cycles, but it's reaching the point of diminishing returns without a layout change.
My pin mapping (which the LUTs implement) is:
/*
* PB7 -> BIT07 -> DC
* PB6 -> BIT06 -> A0
* PB5 -> BIT05 -> D0
* PB4 -> BIT04 -> A1
* PB3 -> BIT03 -> D1
* PB2 -> BIT02 -> DC
* PB1 -> BIT01 -> DC
* PB0 -> BIT00 -> DC
*
* PD15 -> BIT07 -> D5
* PD14 -> BIT06 -> A6
* PD13 -> BIT05 -> D6
* PD12 -> BIT04 -> A7
* PD11 -> BIT03 -> D7
* PD10 -> BIT02 -> A8
* PD09 -> BIT01 -> A9
* PD08 -> BIT00 -> A10
*
* PD07 -> BIT07 -> A2
* PD06 -> BIT06 -> D2
* PD05 -> BIT05 -> A3
* PD04 -> BIT04 -> D3
* PD03 -> BIT03 -> A4
* PD02 -> BIT02 -> D4
* PD01 -> BIT01 -> A5
* PD00 -> BIT00 -> DC
*
* PB15 -> BIT7 -> A11
* PB14 -> BIT6 -> A12
* PB13 -> BIT5 -> A13
* PB12 -> BIT4 -> A14
* PB11 -> BIT3 -> A15
* PB10 -> BIT2 -> DC
* PB09 -> BIT1 -> DC
* PB08 -> BIT0 -> DC
*
* DC = Don't Care.
*/
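For illustration, here is a sketch of how per-byte address tables for this mapping could be generated and checked against a slow bit-by-bit decoder. This is not the generator from my project: the names (`pin_map`, `decode_slow`, `lut_pd_low` and so on) are hypothetical, and A0/A1 on the low byte of PB are left out to keep it to three tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (pin, address bit) pair from the mapping comment above. */
struct pin_map { uint8_t pin; uint8_t addr_bit; };

/* Port D pins that carry address bits A2..A10. */
static const struct pin_map pd_addr[] = {
    {7, 2}, {5, 3}, {3, 4}, {1, 5},
    {14, 6}, {12, 7}, {10, 8}, {9, 9}, {8, 10},
};
/* Port B high-byte pins that carry address bits A11..A15. */
static const struct pin_map pb_addr[] = {
    {15, 11}, {14, 12}, {13, 13}, {12, 14}, {11, 15},
};
#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Reference decoder: one test-and-set per pin. Slow, but easy to verify. */
uint16_t decode_slow(uint16_t port, const struct pin_map *map, size_t n)
{
    uint16_t addr = 0;
    for (size_t i = 0; i < n; i++)
        if (port & (1u << map[i].pin))
            addr |= (uint16_t)(1u << map[i].addr_bit);
    return addr;
}

static uint16_t lut_pd_low[256], lut_pd_high[256], lut_pb_high[256];

/* Each table covers one input byte; entries never overlap in output bits. */
void build_addr_luts(void)
{
    for (unsigned i = 0; i < 256; i++) {
        lut_pd_low[i]  = decode_slow((uint16_t)i, pd_addr, COUNT(pd_addr));
        lut_pd_high[i] = decode_slow((uint16_t)(i << 8), pd_addr, COUNT(pd_addr));
        lut_pb_high[i] = decode_slow((uint16_t)(i << 8), pb_addr, COUNT(pb_addr));
    }
}

/* Split, look up, merge: the three steps listed earlier. */
uint16_t decode_fast(uint16_t pb, uint16_t pd)
{
    return lut_pd_low[pd & 0xFF] | lut_pd_high[pd >> 8] | lut_pb_high[pb >> 8];
}

/* Sweep every port value and compare against the reference decoder. */
void check_addr_luts(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        assert(decode_fast(0, (uint16_t)v) ==
               decode_slow((uint16_t)v, pd_addr, COUNT(pd_addr)));
        assert(decode_fast((uint16_t)v, 0) ==
               decode_slow((uint16_t)v, pb_addr, COUNT(pb_addr)));
    }
}
```

Because the tables are OR-ed together and share no output bits, checking each port on its own is enough to cover every combination.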
To decode these bits, I'm using:
/*
* LOW16 OUTPUT:
*
* PD07 -> BIT07 -> A2
* PD05 -> BIT05 -> A3
* PD03 -> BIT03 -> A4
* PD01 -> BIT01 -> A5
*
* MID16 OUTPUT:
*
* PD14 -> BIT06 -> A6
* PD12 -> BIT04 -> A7
* PD10 -> BIT02 -> A8
* PD09 -> BIT01 -> A9
* PD08 -> BIT00 -> A10
*
* HIGH16 OUTPUT:
*
* PB15 -> BIT7 -> A11
* PB14 -> BIT6 -> A12
* PB13 -> BIT5 -> A13
* PB12 -> BIT4 -> A14
* PB11 -> BIT3 -> A15
* PB10 -> BIT2 -> DC
* PB09 -> BIT1 -> DC
* PB08 -> BIT0 -> DC
*
*/
__attribute__((section (".itcm_text"), noinline))
static inline uint16_t
cpu_extract_bus_address(uint16_t pb, uint16_t pd)
{
uint16_t low = addr_lut_low[pd & 0x00FF];
uint16_t mid = addr_lut_mid[pd >> 8];
uint16_t high = addr_lut_high[pb >> 8];
return low | mid | high;
}
/*
* LOW8 OUTPUT:
*
* PD06 -> BIT06 -> D2
* PD04 -> BIT04 -> D3
* PD02 -> BIT02 -> D4
*
* HIGH8 OUTPUT:
*
* PD15 -> BIT07 -> D5
* PD13 -> BIT05 -> D6
* PD11 -> BIT03 -> D7
*/
__attribute__((section (".itcm_text"), noinline))
static inline uint8_t
cpu_extract_bus_data(uint16_t pb, uint16_t pd)
{
uint16_t retval = input_data_lut_low[pd];
retval |= input_data_lut_high[pb];
retval |= read_bit(pb, 3) << 1 | read_bit(pb, 5);
return retval;
}
/*
* LOW16 OUTPUT:
*
* D2 -> PD06 -> BIT06
* D3 -> PD04 -> BIT04
* D4 -> PD02 -> BIT02
* D5 -> PD15 -> BIT15
* D6 -> PD13 -> BIT13
* D7 -> PD11 -> BIT11
*
* HIGH16 OUTPUT:
*
* D0 -> PB5 -> BIT05
* D1 -> PB3 -> BIT03
*/
__attribute__((section (".itcm_text"), always_inline))
static inline void
cpu_write_bus_data(uint8_t data)
{
uint32_t packed = output_data_lut[data];
uint16_t pb = packed >> 16;
uint16_t pd = packed & 0x0000FFFF;
LL_GPIO_WriteOutputPort(GPIOB, pb);
LL_GPIO_WriteOutputPort(GPIOD, pd);
}
There are a lot of unwanted operations here.
1. Extracting the lower bits of pd via & 0x00FF.
2. Extracting the higher bits of pd via >> 8.
3. Extracting the higher bits of pb via >> 8.
4. Looking up 3 partial results for addresses and 2 partial results for data.
5. Adding (or ORing, it doesn't matter, because there are no overlapping bits) the partial results.
This compiles to the following (inlining disabled for clarity; otherwise partial results can be reused between these functions by the optimizer). If anyone accepts this challenge, note that you're not allowed to merge the address and data extraction functions, because the data is not valid when the address has just arrived, so both functions must be callable independently.
00000298 <cpu_extract_bus_address>:
298: b2ca uxtb r2, r1
29a: f8df c020 ldr.w ip, [pc, #32] @ 2bc <cpu_extract_bus_address+0x24>
29e: 4b08 ldr r3, [pc, #32] @ (2c0 <cpu_extract_bus_address+0x28>)
2a0: 0a09 lsrs r1, r1, #8
2a2: 0a00 lsrs r0, r0, #8
2a4: f833 3012 ldrh.w r3, [r3, r2, lsl #1]
2a8: f83c 1011 ldrh.w r1, [ip, r1, lsl #1]
2ac: 4a05 ldr r2, [pc, #20] @ (2c4 <cpu_extract_bus_address+0x2c>)
2ae: f832 2010 ldrh.w r2, [r2, r0, lsl #1]
2b2: ea43 0001 orr.w r0, r3, r1
2b6: 4310 orrs r0, r2
2b8: b280 uxth r0, r0
2ba: 4770 bx lr
2bc: 20000968 .word 0x20000968
2c0: 20000768 .word 0x20000768
2c4: 20000568 .word 0x20000568
000002c8 <cpu_extract_bus_data>:
2c8: 4b06 ldr r3, [pc, #24] @ (2e4 <cpu_extract_bus_data+0x1c>)
2ca: f3c0 0cc0 ubfx ip, r0, #3, #1
2ce: 4a06 ldr r2, [pc, #24] @ (2e8 <cpu_extract_bus_data+0x20>)
2d0: 5c5b ldrb r3, [r3, r1]
2d2: 5c12 ldrb r2, [r2, r0]
2d4: f3c0 1040 ubfx r0, r0, #5, #1
2d8: 4313 orrs r3, r2
2da: ea40 004c orr.w r0, r0, ip, lsl #1
2de: 4318 orrs r0, r3
2e0: 4770 bx lr
2e2: bf00 nop
2e4: 20000c68 .word 0x20000c68
2e8: 20000b68 .word 0x20000b68
000002ec <cpu_write_bus_data>:
2ec: 4b04 ldr r3, [pc, #16] @ (300 <cpu_write_bus_data+0x14>)
2ee: 4905 ldr r1, [pc, #20] @ (304 <cpu_write_bus_data+0x18>)
2f0: f853 3020 ldr.w r3, [r3, r0, lsl #2]
2f4: 4a04 ldr r2, [pc, #16] @ (308 <cpu_write_bus_data+0x1c>)
2f6: 0c18 lsrs r0, r3, #16
2f8: b29b uxth r3, r3
2fa: 6148 str r0, [r1, #20]
2fc: 6153 str r3, [r2, #20]
2fe: 4770 bx lr
300: 20002d68 .word 0x20002d68
304: 58020400 .word 0x58020400
308: 58020c00 .word 0x58020c00
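For anyone who wants to reproduce the write path, here is a sketch of how a packed output table like the one indexed in cpu_write_bus_data could be generated from the D0..D7 pin assignments in the comment above it. The names `data_pin`, `out_lut` and `build_out_lut` are hypothetical, and this is an illustration under that mapping, not the generator from my project.

```c
#include <assert.h>
#include <stdint.h>

/* Pin index driven by each data bit D0..D7, per the mapping comment:
 * D0 and D1 go to port B, D2..D7 go to port D. */
static const uint8_t data_pin[8] = { 5, 3, 6, 4, 2, 15, 13, 11 };
#define FIRST_PORT_D_BIT 2u   /* D2 and above live on port D */

/* Packed layout: bits [31:16] = port B word, bits [15:0] = port D word. */
static uint32_t out_lut[256];

void build_out_lut(void)
{
    for (unsigned data = 0; data < 256; data++) {
        uint32_t pb = 0, pd = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (!(data & (1u << bit)))
                continue;
            if (bit < FIRST_PORT_D_BIT)
                pb |= 1u << data_pin[bit];
            else
                pd |= 1u << data_pin[bit];
        }
        /* Both halves must fit in one 16-bit port word. */
        assert(pb <= 0xFFFFu && pd <= 0xFFFFu);
        out_lut[data] = (pb << 16) | pd;
    }
}
```

Packing both port words into one 32-bit entry keeps the lookup to a single load; the split back into two 16-bit halves is the lsrs/uxth pair visible in the disassembly.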
I call everything, including "uxtb", "lsrs", and "orrs", "shifts-and-adds". "uxtb" is a single-cycle 8-bit mask, but it's just a fast shift-and-add to me. Eyeballing it, the overhead of this entire operation (address in, data in, data out), excluding the LUT loads themselves, is approximately 10 cycles (the Cortex-M7's dual-issue timing details are not entirely clear).
But if you have no cross-bank buses and only need to read one GPIO bank, then all of the operations above (the "uxtb", "lsrs", "orrs", the "shifts-and-adds") can be deleted from the source when this is in a critical path. The entire address read collapses into a single 64 K x 16-bit LUT (128 KiB) accessed with one instruction, an LDR in 2 cycles, if you want to take extreme measures for the absolute best performance. A waste of TCM, but at least it's fast. For 8-bit LUTs it's ideal: on the STM32H7, DTCM1 and DTCM2 can even dual-issue, meaning you can read two buses in just 2 cycles, which is impossible to beat.
So I believe my original point stands:
1. one bank + in-order = great, use results directly.
2. one bank + out-of-order = good, theoretically allows a 2-cycle LUT, no ALU needed.
3. two banks = fair, needs multiple reads, multiple LUTs, or multiple bitwise operations, or all of them at the same time.
4. multiple banks = bad, more instructions.
But doing a custom parallel bus protocol is a niche application. It means little if you only use SPI and I2C.
I spent the last week figuring out a new way to save another 30 nanoseconds from /CS to data valid. But this means using the DMA, which cannot tolerate additional table lookups without ruining the bus cycles.
Also, in my original words,
GPIO itself is very expensive. reading or writing the register of each GPIO bank stalls the CPU for 28 ns. [...] you get a 2x slowdown (56 ns) if it spans over 2 ports, and a 3x slow down (84 ns) [...] shifts and adds, this rearranging wastes perhaps 10 cycles more, you lose another 20 ns even at 480 MHz.
So I clearly emphasized that the GPIO, not the ALU, was the bottleneck in the extreme example, which was the H7. I only mentioned bitwise operations in the sense of "adding insult to injury", or "the last straw that breaks the bus timing", so I was not paying serious attention to the bit operations. I don't understand why you thought that a minor remark was a good use of your time to refute.
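To put rough numbers on that split between GPIO cost and ALU cost, the arithmetic can be sketched as below. It assumes the 28 ns per-bank figure and the 480 MHz clock quoted above, and it ignores dual-issue and any overlap between accesses; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_HZ          480000000ull  /* core clock quoted above */
#define GPIO_ACCESS_PS  28000u        /* ~28 ns stall per GPIO bank access */

/* Convert CPU cycles to picoseconds, rounded to nearest. */
uint32_t cycles_to_ps(uint32_t cycles)
{
    /* Keeps the 64-bit product and the 32-bit result in range. */
    assert(cycles <= 1000000u);
    return (uint32_t)((cycles * 1000000000000ull + CPU_HZ / 2) / CPU_HZ);
}

/* Rough cost of sampling `banks` GPIO ports, then spending `alu_cycles`
 * on rearranging the bits. */
uint32_t read_cost_ps(uint32_t banks, uint32_t alu_cycles)
{
    return banks * GPIO_ACCESS_PS + cycles_to_ps(alu_cycles);
}
```

With these assumptions, `cycles_to_ps(10)` is about 20.8 ns and `cycles_to_ps(2)` about 4.2 ns, so one bank plus a 2-cycle LUT load comes to roughly 32 ns, while three banks plus 10 cycles of bit twiddling comes to roughly 105 ns, of which 84 ns is GPIO.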