Author Topic: Does GPIO order matter on a microcontroller development board? (Read 2422 times)

light655 · « **on:** May 09, 2026, 02:28:09 pm »

Hello everyone.

I am currently designing an STM32G474VC development board for my projects. The microcontroller has 100 pins and 86 GPIOs.

I've previously used Arduinos and Raspberry Pi Picos in my projects and I am used to the GPIOs on the development board being sequentially numbered.
For example, the Raspberry Pi Pico has GPIOs 0~15 on the left hand side, in order. The GPIOs are also in order on the RP chips.
The Arduinos have D0~D13 on the right hand side, in order. (I know that it is "reordered" in software.)

The GPIO ports are not in order on the STM32 chips. So, I tried to reorder them on the PCB so that the GPIOs on the pin headers are in order. I quickly found that it is quite a challenge for a 100 pin microcontroller.

Therefore, I have a question.
Do you guys care about the order of the GPIOs on a microcontroller development board? Is having them in order a nice feature of the development board, or does it not matter to you?

SpacedCowboy · « **Reply #1 on:** May 09, 2026, 02:47:35 pm »

Every time I have cursed ST for their QWERTY-like approach to laying out their ports, I have always assumed it was so some peripheral or other actually is laid out logically. I have yet to find this mythical peripheral, but I'm sure there must be some reason...

They're almost as bad as XMOS.

But to your point, if you're going to lay out ports in a logical manner (which is A Good Thing^TM), make sure you don't make any fast-peripheral paths have wildly different trace lengths. There's a lot of pearl-clutching about just how many picoseconds a trace can be "out" and still work - in my experience they work far outside the supposed boundaries - but I still endeavour to keep things within the specified tolerances.

niconiconi · « **Reply #2 on:** May 09, 2026, 07:30:34 pm »

Quote from: light655 on May 09, 2026, 02:28:09 pm

Therefore, I have a question.
Do you guys care about the order of the GPIOs on a microcontroller development board? Is having them in order a nice feature of the development board, or does it not matter to you?

It depends on what the firmware needs to do. If the firmware needs to output 8-bit or 16-bit parallel data via GPIO, it's highly desirable to keep the GPIO ports in logical order. It makes connection to the system bus or a logic analyzer easier, with a single ribbon cable. Otherwise, the board either becomes difficult to connect, or the firmware becomes x times slower (where x is the number of GPIO groups a bus needs to cross).
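
To make the "x times slower" point concrete, here is a minimal host-side sketch. The `sim_port_t` struct is only a stand-in for a memory-mapped output register, and the split layout (low nibble on one port, high nibble on another) is a made-up example, not any particular board.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a GPIO port's output data register. On real hardware this
 * would be a memory-mapped register, not a plain struct. */
typedef struct { uint16_t odr; } sim_port_t;

/* Case 1: the 8-bit bus sits on bits 0..7 of one port, so a single
 * masked register update is enough. */
static void bus_write_contiguous(sim_port_t *port, uint8_t value)
{
    port->odr = (uint16_t)((port->odr & 0xFF00u) | value);
}

/* Case 2 (hypothetical layout): low nibble on port A bits 4..7, high nibble
 * on port B bits 0..3. Two register updates plus extra shifting and masking. */
static void bus_write_split(sim_port_t *a, sim_port_t *b, uint8_t value)
{
    assert(a != b); /* the two halves are meant to live on different ports */
    a->odr = (uint16_t)((a->odr & ~0x00F0u) | ((value & 0x0Fu) << 4));
    b->odr = (uint16_t)((b->odr & ~0x000Fu) | (value >> 4));
}
```

Every extra port the bus crosses adds another register access of this kind, which is where the multiplier comes from.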

If it only outputs serial data or a few control signals, I don't care.

Also, when using high-speed peripherals, the hardware pinout they require can conflict with the GPIO order, so there's a tradeoff.

light655 · « **Reply #3 on:** May 10, 2026, 01:50:06 am »

Great to hear from you guys.
I think I will still try to put the GPIOs in order; it will be a challenge for me.

Quote from: SpacedCowboy on May 09, 2026, 02:47:35 pm

But to your point, if you're going to lay out ports in a logical manner (which is A Good Thing^TM), make sure you don't make any fast-peripheral paths have wildly different trace lengths. There's a lot of pearl-clutching about just how many picoseconds a trace can be "out" and still work - in my experience they work far outside the supposed boundaries - but I still endeavour to keep things within the specified tolerances.

The signals will probably "jump" off the development board with wires. Maybe a few millimetres of difference on the PCB will only be a small portion of the total delay mismatch?
At least the fast peripheral pins that I am interested in (HRTIM) are logically arranged on the STM32, so I can make the trace length very similar.

Quote from: niconiconi on May 09, 2026, 07:30:34 pm

If the firmware needs to output 8-bit or 16-bit parallel data via GPIO, it's highly desirable to keep the GPIO ports in logical order.

Is this kind of application common? I've thought about this when I was designing the pinout of the development board, but I couldn't recall when I ever used this technique.

chilternview · « **Reply #4 on:** May 10, 2026, 07:28:21 am »

Quote from: light655 on May 10, 2026, 01:50:06 am

Quote from: niconiconi on May 09, 2026, 07:30:34 pm
If the firmware needs to output 8-bit or 16-bit parallel data via GPIO, it's highly desirable to keep the GPIO ports in logical order.

Is this kind of application common? I've thought about this when I was designing the pinout of the development board, but I couldn't recall when I ever used this technique.

Well, if you're driving e.g. a 1602 LCD, it makes it easier to connect the 8 bit data lines.

cfbsoftware · « **Reply #5 on:** May 10, 2026, 10:22:25 pm »

Quote from: light655 on May 10, 2026, 01:50:06 am

Quote from: niconiconi on May 09, 2026, 07:30:34 pm
If the firmware needs to output 8-bit or 16-bit parallel data via GPIO, it's highly desirable to keep the GPIO ports in logical order.

Is this kind of application common? I've thought about this when I was designing the pinout of the development board, but I couldn't recall when I ever used this technique.

We used this technique to map the Signetics 2650 CPU address pins A0..A12 to 13 GPIO pins on the RP2350 Olimex PICO2-XL board:

https://github.com/cfbsoftware/S2650-IcePi/blob/main/S2650IcePiRevA-sch.pdf

A0 to A12 are contiguous on the 2650 but we ended up having to split them into four groups. It would have been so much easier if all pins had been contiguous on the PICO2-XL rather than having to map them in software.

We are using the Olimex GateMateA1-EVB FPGA boards in our latest project. One of the advantages of working with FPGAs is that they give you so much more freedom than MCUs when determining the position and functionality of each pin.

westfw · « **Reply #6 on:** May 10, 2026, 11:06:15 pm »

I like to have a full "port" (at least 8 or 16 bits) of contiguous bits, somewhere on a development board.
Board vendors and even chip manufacturers aren't particularly cooperative (I once posted a flame about some of the low-pin-count Atmel SAMD chips - no more than 2 contiguous bits in the 12 GPIO pins...) But I don't need ALL of the pins to be in order.

I blame Arduino.

OTOH, I also tend to believe that Arduino's API emphasizing single pins rather than ports was one of the most important steps toward acceptance by their target audience (to whom "bit manipulation" might as well have been tensor calculus.)

Kleinstein · « **Reply #7 on:** May 11, 2026, 06:34:09 am »

For a development board it really helps to have the ports sorted. For a board for a specific purpose one can often get away with a little more software effort and a simpler PCB, but it depends.

cfbsoftware · « **Reply #8 on:** May 11, 2026, 09:05:10 pm »

All is not lost if you can at least make some groups of the pins contiguous. The STM32G474VC is a Cortex-M4 MCU so you have access to the single-cycle bit-manipulation instructions RBIT, BFI, UBFX/SBFX etc. which allow you to remap the GPIO pins in software very efficiently.
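
As a rough illustration of the remapping idea, the sketch below gathers two contiguous groups of pins into one byte using plain shift-and-mask helpers. With constant arguments, compilers targeting Cortex-M3/M4 can often reduce these patterns to the bit-field extract and insert instructions (UBFX and BFI), but that depends on the compiler and flags. The helper names and the pin layout (bus bits 0..3 on pins 4..7 of one port, bus bits 4..7 on pins 10..13 of another) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits of `word` starting at bit `lsb` (UBFX-style). */
static inline uint32_t bits_extract(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width < 32 && lsb + width <= 32);
    return (word >> lsb) & ((1u << width) - 1u);
}

/* Insert the low `width` bits of `field` into `word` at bit `lsb` (BFI-style). */
static inline uint32_t bits_insert(uint32_t word, uint32_t field,
                                   unsigned lsb, unsigned width)
{
    assert(width >= 1 && width < 32 && lsb + width <= 32);
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (word & ~mask) | ((field << lsb) & mask);
}

/* Hypothetical layout: two input-register snapshots, one per port.
 * Each contiguous group is moved as a unit, not bit by bit. */
static uint8_t bus_gather(uint32_t idr_x, uint32_t idr_y)
{
    uint32_t value = bits_extract(idr_x, 4, 4);                   /* bus bits 0..3 */
    value = bits_insert(value, bits_extract(idr_y, 10, 4), 4, 4); /* bus bits 4..7 */
    return (uint8_t)value;
}
```

The more pins that can be kept in contiguous groups, the fewer extract/insert pairs are needed.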

cv007 · « **Reply #9 on:** May 13, 2026, 07:48:56 pm »

You will have all gpio pins available in either case. I would go with easy routing which still leaves you with something like 6 ports with 8 consecutive physical pins (and probably rarely used as 8bits). Dealing with the physical pin order when using the dev board may make you choose pins during development that are also easier to route when it comes time to lay out the mcu on its own specific board. Pins can be changed later in the software, but it's nicer if your dev board matches up.

As a simple example, let's say you chose to use 8 LEDs as hex code error output, and since your logical layout had pb0-7 in order you use them (no real need to use consecutive pins for this, but at first glance it looks like it will be easier so you do it). Now you put the mcu on its own board and realize during layout that some of those pb pins are on the other side of the mcu, which makes routing harder than it needs to be. That can be dealt with by using different pins, but it also means the dev board setup you most likely used initially does not match the new board, and you now need more software to make the same code work on both boards. Now you start to juggle pins and also need to avoid alternate function pins in use, and pretty soon you wish you had chosen pins better in the first place.

Since most of the time pins will be used as individual pins, or as alternate function pins which will probably be scattered around anyway, about the only useful thing the logical order gets you is a little easier time finding a pin on the dev board.

westfw · « **Reply #10 on:** May 14, 2026, 03:14:53 am »

Quote

you have access to the single-cycle bit-manipulation instructions RBIT, BFI, BFX etc. which allow you to remap the GPIO pins in software very efficiently.

"Very efficiently" may be an overstatement :-(

(Although, lots better than

Code: [Select]

   for (int i = 0; i<8; i++) {
      /* val += val + bit is val = 2*val + bit: shift left, then append the next bit */
      val += val + digitalRead(bitGather[i]);
   }

)

cv007 · « **Reply #11 on:** May 14, 2026, 06:25:32 am »

Quote

Well, if you're driving e.g. a 1602 LCD, it makes it easier to connect the 8 bit data lines.

For something like a character lcd even when using 8 bit mode it may make little difference for the relatively slow update rate, so you can use any combination of pins that are easy to route and/or lets you make better choices for other pins such as alternate function pins. With 85 pins available it would not be hard to come up with 8 consecutive pins, but when the pin count is lower you may have other pin priorities that do not allow for these consecutive pins (and you would then most likely use the lcd 4 bit mode). Not a big deal one way or the other, but by not tying yourself to consecutive pins when not needed you allow for easy pin reassignment and the code is already in place.

Simple example below, obviously not complete, but it gives the idea that any group of pins can be used for the lcd data pins and changing them is only a matter of changing the definition of the pin array. The code will of course be slower than writing out a single 8bit value, but it's also not as bad as one would think, and for something where the speed is not required the ability to use any set of pins is nice.

There may be only a limited number of devices like the character lcd where speed is not really needed, but I also do this for multiplexing 7/14 segment led digits where speed is also not really needed (and it makes routing quite a bit easier). There are also ways to speed up the loop a little (no loop) but usually better off keeping the code simple/readable and just let the compiler do its job.

Code: [Select]

                GpioPin lcdDataPins[]{ 
                    {MCU::PA5, GpioPin::OUTPUT}, //bit 0
                    {MCU::PB7, GpioPin::OUTPUT}, 
                    {MCU::PC1, GpioPin::OUTPUT},
                    {MCU::PA1, GpioPin::OUTPUT},
                    {MCU::PB2, GpioPin::OUTPUT},
                    {MCU::PD4, GpioPin::OUTPUT},
                    {MCU::PD1, GpioPin::OUTPUT},
                    {MCU::PB5, GpioPin::OUTPUT}, //bit 7
                    }; 

class 
Lcd             {

                GpioPin (&pins_)[8];

public:

Lcd             (GpioPin (&pins)[8])
                : pins_{ pins }
                {                
                }

                auto
data            (u8 data)
                {
                for( auto& p : pins_ ){
                    p.on( data bitand 1 );
                    data >>= 1;
                    }
                }

                };

                Lcd lcd{ lcdDataPins };

                int
main            ()
                {

                u8 data{ 0 };
                while(1){
                    lcd.data( data++ );
                    }

                }
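
One way to reduce the per-pin cost of this approach is to accumulate a set mask and a clear mask per port first, then touch each port only once. The sketch below is a plain C host-side model of that idea, not the `GpioPin` class used above: `sim_gpio_t`, `pin_ref_t` and `lcd_write` are hypothetical names, and the simulated `odr` array stands in for real port registers. The pin list is the same one as in the array above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS 4u /* hypothetical: ports A..D */

/* One bus bit: which port it lives on (0 = A, 1 = B, ...) and which pin. */
typedef struct { uint8_t port; uint8_t pin; } pin_ref_t;

/* Simulated output registers, one per port. */
typedef struct { uint16_t odr[NUM_PORTS]; } sim_gpio_t;

/* bit 0 = PA5, PB7, PC1, PA1, PB2, PD4, PD1, bit 7 = PB5 */
static const pin_ref_t lcd_pins[8] = {
    {0, 5}, {1, 7}, {2, 1}, {0, 1}, {1, 2}, {3, 4}, {3, 1}, {1, 5},
};

static void lcd_write(sim_gpio_t *gpio, uint8_t data)
{
    uint16_t set[NUM_PORTS] = {0};
    uint16_t clr[NUM_PORTS] = {0};

    /* Sort each data bit into the set or clear mask of its port. */
    for (unsigned i = 0; i < 8; i++) {
        assert(lcd_pins[i].port < NUM_PORTS && lcd_pins[i].pin < 16);
        uint16_t mask = (uint16_t)(1u << lcd_pins[i].pin);
        if (data & (1u << i))
            set[lcd_pins[i].port] |= mask;
        else
            clr[lcd_pins[i].port] |= mask;
    }

    /* One update per port. On an STM32 this could be a single write to the
     * port's set/reset register instead of a read-modify-write. */
    for (unsigned p = 0; p < NUM_PORTS; p++)
        gpio->odr[p] = (uint16_t)((gpio->odr[p] & ~clr[p]) | set[p]);
}
```

Pins that are not part of the bus keep their state, and changing the wiring is still only a matter of editing `lcd_pins`.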

sparkydonkey · « **Reply #12 on:** May 14, 2026, 09:49:17 am »

I don't see any real merit in arranged GPIOs either.

light655 · « **Reply #13 on:** May 15, 2026, 07:38:32 am »

Looks like it has become a 50-50 split among you guys.

I think the benefit of having sequentially ordered GPIOs is not efficiency of operation but efficiency of development.
For me, being able to find the pins/ports instantly on the development board is important, especially if it's going to be reused in multiple projects.

Quote from: westfw on May 10, 2026, 11:06:15 pm

I like to have a full "port" (at least 8 or 16 bits) of contiguous bits, somewhere on a development board.

I think this is a great idea. Breaking the 16-bit port into two 8-bit ports strikes a balance between PCB layout complexity and the ease of finding GPIO pins. The only downside is that it doesn't look as satisfying.

niconiconi · « **Reply #14 on:** May 15, 2026, 01:27:59 pm »

Quote from: westfw on May 14, 2026, 03:14:53 am

Quote
you have access to the single-cycle bit-manipulation instructions RBIT, BFI, BFX etc. which allow you to remap the GPIO pins in software very efficiently.
"Very efficiently" may be an overstatement :-(

It's indeed an overstatement.

For an extreme example, on the STM32H7, GPIO itself is very expensive. Reading or writing the register of each GPIO bank stalls the CPU for 28 ns. Even if you can use SRAM LUTs to rearrange bits, you get a 2x slowdown (56 ns) if the bus spans 2 ports, and a 3x slowdown (84 ns) if it spans three ports. Your entire bus timing margin can be killed just by reading non-contiguous GPIOs. If the combinations are too many for an SRAM LUT, you need to use shifts and adds. This rearranging wastes perhaps 10 more cycles, so you lose another 20 ns even at 480 MHz. Bad news.
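
The arithmetic behind those numbers can be written down as a tiny cost model. The constants are the figures quoted in this post, taken as given rather than measured, and `bus_read_cost_ns` is a hypothetical helper name.

```c
#include <assert.h>

/* Figures quoted in the post, treated as assumptions. */
#define GPIO_ACCESS_NS  28.0  /* stall per GPIO bank access on the STM32H7 */
#define CPU_CLOCK_MHZ  480.0
#define TWIDDLE_CYCLES  10.0  /* rough extra cycles spent rearranging bits */

/* Estimated cost of one bus read that spans `ports` GPIO banks. */
static double bus_read_cost_ns(unsigned ports, int needs_twiddling)
{
    assert(ports >= 1);
    double cost = ports * GPIO_ACCESS_NS;
    if (needs_twiddling)
        cost += TWIDDLE_CYCLES * 1000.0 / CPU_CLOCK_MHZ; /* cycles to ns */
    return cost;
}
```

With these inputs, one port costs 28 ns, three ports cost 84 ns, and the 10 extra cycles add roughly 20.8 ns on top.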

The STM32H7 may be an extreme example due to bad I/O on a high-latency bus; a free-running DMA state machine can be a more representative example. If you have sequential GPIOs, it would be possible to stream address and data directly to the GPIO or SRAM. If they're not in sequential order, you suddenly need to do all kinds of gymnastics to emulate bit shifts and AND/OR by chaining multiple DMA transfers together. Now the firmware looks like an arbitrary-code-execution exploit running on the DMA engine that nobody understands, and with a performance hit.

So again, I repeat my previous post: it depends on what the firmware needs to do. If the firmware needs to emulate an 8-bit or 16-bit bus with a custom protocol, sequential GPIOs can be critical for performance. On the other hand, while this is an important application, it's relatively uncommon. Most boards are just running I2C or SPI. This is why I said I either care or I don't, depending on what the firmware needs to do.

Alien Brother · « **Reply #15 on:** May 17, 2026, 01:24:15 am »

Alternatively, maybe order of pins on the board could somehow correspond to the order on the chip?

Quote from: SpacedCowboy on May 09, 2026, 02:47:35 pm

I have yet to find this mythical peripheral, but I'm sure there must be some reason...

What if it's the Arduino headers?

unikeyname · « **Reply #16 on:** May 17, 2026, 03:27:07 am »

Quote from: niconiconi on May 15, 2026, 01:27:59 pm

...it depends on what the firmware needs to do. If the firmware needs to emulate an 8-bit or 16-bit bus with a custom protocol, sequential GPIOs can be critical for performance...

Yeah, it is. There are some bit-bang tasks which need to DMA straight to ODR/BSRR. I once had to do three 16-bit buses, which required a very sequential layout (it's PB0-PB15 -> C, D), so if you do highly parallel work, a sequential layout is preferable. Plus, it is quite comprehensible for a layman to look at.

Given that 4-layer PCBs are relatively cheap, you should try sorting pins in a certain order if that is your niche. Otherwise, do it based on the interface (SPI, PCI, UART, display).
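
For the DMA-to-BSRR case, a sequential layout means the buffer can be prepared with one trivial transform per sample. The sketch below assumes an 8-bit bus on pins 0..7 of a single port and uses the STM32 BSRR convention (low half-word sets pins, high half-word resets them). The function names are hypothetical, and the DMA and timer setup that would actually stream the buffer to the register is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turn bytes into BSRR words: set the 1 bits, reset the 0 bits, and leave
 * pins 8..15 of the port untouched. */
static void bsrr_encode(const uint8_t *src, uint32_t *dst, size_t n)
{
    assert(src && dst);
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint32_t)src[i] | ((uint32_t)(uint8_t)~src[i] << 16);
}

/* Host-side model of what one BSRR write does to the output register. */
static uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t set = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & ~reset) | set);
}
```

If the eight bus pins were scattered, each sample would need a per-bit remap (or a lookup table) before it could go into the buffer, and a bus spanning several ports would need one buffer and one DMA stream per port.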

SpacedCowboy · « **Reply #17 on:** May 17, 2026, 05:16:57 am »

I'm usually using BGA devices, the "order" depends on coffee intake...

But my usual use of these chips is together with 16/8-bit busses, not with SPI, or individual pins, maybe an 80/20 split, so for me at least, having the balls at least *somewhat* spatially coherent would be nice.

Example: Use 24-bit LTDC, and your R,G,B selection is all over the place. That's to a 70MHz bus! Displays don't really care if a bit flips now and then, but it's amazing how the eye can catch it as "something's not quite right".

Or: the FMC with a split A/D bus. Those are also high-frequency, also split semi-randomly, and this time I really do care about data-integrity.

Even HyperRAM, they seem to have actually tried with this one, and it's only 8 bits wide, but if you're using LTDC or FMC, you'll probably find there's some pins that are shared in the section that *are* clustered, and you'll have (say) D2 a relative mile away on the other side of the chip...

What would be nice to fix this is just have a bus-matrix where you can set pins to ones you want. They do it for Rx/TX on uarts in most of their modern chips, and yes, I realise that's a lot easier, but continuing to squeeze more features into a fixed-size (in terms of ball-count) device is just making things constantly worse. The alternative is to supply *far* more fixed pin-muxing than they currently do.

westfw · « **Reply #18 on:** May 18, 2026, 08:14:50 pm »

Quote

you have access to the single-cycle bit-manipulation instructions RBIT, BFI, BFX etc.

Quote

you need to use shifts and adds [using more cycles.]

I think the second poster missed that the bit manipulation instructions (on CM3 and CM4) theoretically allow you to "gather" scattered bits significantly more efficiently than a series of shift/mask/add instructions. At worst, I think you can get rid of the "mask" instructions. If you're lucky and have "some" contiguous bits, they can be handled as a group instead of one-at-a-time.

I've played around with a couple of designs for a "read multiple bits" or "digitalReadAll()" function for the Arduino-sphere that would be faster and "more atomic" than the usual series of single-bit reads, but I keep getting torn between assembling the bits "in order" in the function, and requiring the "sketch" to use names for each bit. And then there is the C vs C++ question - C++ would solve the variable number of pins on a board issue more cleanly...

bson · « **Reply #19 on:** May 18, 2026, 10:34:05 pm »

Can you use the FSMC controller to access the shield? If so, consider routing the FSMC bus pins to the headers. With it you get DMA, code execution in its address space, wait state and other timing management, CS# generation, easy library use (even just basic memcpy) etc.

niconiconi · « **Reply #20 on:** May 19, 2026, 09:17:40 am »

Quote from: westfw on May 18, 2026, 08:14:50 pm

Quote
you have access to the single-cycle bit-manipulation instructions RBIT, BFI, BFX etc.
Quote
you need to use shifts and adds [using more cycles.]

I think the second poster missed that the bit manipulation instructions (on CM3 and CM4) theoretically allow you to "gather" scattered bits significantly more efficiently than a series of shift/mask/add instructions.

I regret that I was using loose language here. When I said "shifts and adds", I was not referring to the literal shift or add operations of the CPU, but to the macroscopic task of combining results. I was calling all non-LUT masking, extracting, and merging "shift-and-add". I should've said "bit twiddling".

After correcting this terminology issue, I do not think I missed anything. I think the third poster missed that having a single-bank GPIO pinout (on the STM32H7) theoretically allows you to "gather" scattered bits in 2 clock cycles, significantly more efficiently than the bit manipulation instructions such as RBIT, BFI, BFX on the CM3 and CM4.

Quote

If the combinations are too many for SRAM LUT, you need to use shifts and adds, this rearranging wastes perhaps 10 cycles more, you lose another 20 ns even at 480 MHz. Bad news.

On the STM32H7, an LDR from ITCM/DTCM takes two cycles for a 32-bit word, with zero wait states, which means if your GPIO combination fits in TCM and it's in a critical path, it only takes 2 CPU cycles per lookup. No computation, no wait. For example, this is perfect for un-swizzling 16 address bits if they're on the same GPIO bank, and costs perhaps 4 ns, not including the setup cost, such as loading the address.
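
A minimal host-side sketch of the single-bank lookup idea: a 64 K entry table maps every possible 16-bit port reading to its un-swizzled value, so the hot path is one indexed load. The permutation in `pin_of_bit` is made up for illustration, and all names are hypothetical. On target the table would have to be placed in TCM by the linker, which is not shown, and whether 128 KiB fits depends on the part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wiring: pin_of_bit[i] is the port pin carrying logical bit i. */
static const uint8_t pin_of_bit[16] = {
    3, 0, 7, 12, 1, 15, 9, 4, 10, 2, 14, 6, 11, 5, 13, 8,
};

static uint16_t unswizzle_lut[65536]; /* 64 K x 16 bit = 128 KiB */

/* Reference implementation: move one bit at a time. */
static uint16_t unswizzle_slow(uint16_t idr)
{
    uint16_t value = 0;
    for (unsigned bit = 0; bit < 16; bit++) {
        assert(pin_of_bit[bit] < 16);
        value |= (uint16_t)(((idr >> pin_of_bit[bit]) & 1u) << bit);
    }
    return value;
}

/* One-time setup: precompute the answer for every possible port reading. */
static void unswizzle_lut_init(void)
{
    for (uint32_t idr = 0; idr < 65536u; idr++)
        unswizzle_lut[idr] = unswizzle_slow((uint16_t)idr);
}

/* Hot path: a single indexed load, no ALU work. */
static inline uint16_t unswizzle_fast(uint16_t idr)
{
    return unswizzle_lut[idr];
}
```

This only works as a single table because all 16 bits come from one register; once the bits come from two ports, the index would be up to 32 bits wide.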

If they are not on the same GPIO bank, you have a 32-bit problem. So you need at least some shifts to reduce the problem space, and that's going to take more CPU cycles for three reasons:

1. The input needs shifting, or masking, or bitfield extraction, or sign extension, or whatever.
2. Multiple partial results must be computed or loaded.
3. Partial results must be merged.

If you're not convinced, I challenge you to accelerate the following GPIO decoder from my real project. This is in a critical path, and I did it to eliminate all vias in the PCB layout; the PCB used no signal vias. I thought I had enough time, but I was wrong. I was short by 20-30 ns, and now I have to respin the board with sequential GPIOs to meet the timing. I think the following code is reasonably efficient. You can perhaps eliminate a few more cycles, but it's reaching the point of diminishing returns without a layout change.

My LUT is:

Code: [Select]

/*
 * PB7 -> BIT07 -> DC
 * PB6 -> BIT06 -> A0
 * PB5 -> BIT05 -> D0
 * PB4 -> BIT04 -> A1
 * PB3 -> BIT03 -> D1
 * PB2 -> BIT02 -> DC
 * PB1 -> BIT01 -> DC
 * PB0 -> BIT00 -> DC
 *
 * PD15 -> BIT07 -> D5
 * PD14 -> BIT06 -> A6
 * PD13 -> BIT05 -> D6
 * PD12 -> BIT04 -> A7
 * PD11 -> BIT03 -> D7
 * PD10 -> BIT02 -> A8
 * PD09 -> BIT01 -> A9
 * PD08 -> BIT00 -> A10
 *
 * PD07 -> BIT07 -> A2
 * PD06 -> BIT06 -> D2
 * PD05 -> BIT05 -> A3
 * PD04 -> BIT04 -> D3
 * PD03 -> BIT03 -> A4
 * PD02 -> BIT02 -> D4
 * PD01 -> BIT01 -> A5
 * PD00 -> BIT00 -> DC
 *
 * PB15 -> BIT7 -> A11
 * PB14 -> BIT6 -> A12
 * PB13 -> BIT5 -> A13
 * PB12 -> BIT4 -> A14
 * PB11 -> BIT3 -> A15
 * PB10 -> BIT2 -> DC
 * PB09 -> BIT1 -> DC
 * PB08 -> BIT0 -> DC
 *
 * DC = Don't Care.
 */
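
For readers who want to experiment with this pinout, the sketch below shows one way three 256-entry address tables could be generated from the pin table above. The contents of the actual tables in the project are not shown in this post, so this is an assumption. All names are hypothetical, and A0/A1 (on PB6/PB4) are left out because the three-table address decode later in this post only indexes PD[7:0], PD[15:8] and PB[15:8].

```c
#include <assert.h>
#include <stdint.h>

/* src_bit: bit position within the 8-bit table index.
 * addr_bit: address line that bit carries. */
typedef struct { uint8_t src_bit; uint8_t addr_bit; } bit_map_t;

/* Taken from the pin table above. */
static const bit_map_t map_low[]  = { {7, 2}, {5, 3}, {3, 4}, {1, 5} };              /* PD7..PD0   */
static const bit_map_t map_mid[]  = { {6, 6}, {4, 7}, {2, 8}, {1, 9}, {0, 10} };     /* PD15..PD8  */
static const bit_map_t map_high[] = { {7, 11}, {6, 12}, {5, 13}, {4, 14}, {3, 15} }; /* PB15..PB8  */

static uint16_t lut_low[256], lut_mid[256], lut_high[256];

/* Fill one table: for every index, place each mapped bit on its address line. */
static void lut_fill(uint16_t lut[256], const bit_map_t *map, unsigned n)
{
    for (unsigned idx = 0; idx < 256; idx++) {
        uint16_t out = 0;
        for (unsigned i = 0; i < n; i++) {
            assert(map[i].src_bit < 8 && map[i].addr_bit < 16);
            if (idx & (1u << map[i].src_bit))
                out |= (uint16_t)(1u << map[i].addr_bit);
        }
        lut[idx] = out;
    }
}

static void luts_init(void)
{
    lut_fill(lut_low,  map_low,  sizeof map_low  / sizeof map_low[0]);
    lut_fill(lut_mid,  map_mid,  sizeof map_mid  / sizeof map_mid[0]);
    lut_fill(lut_high, map_high, sizeof map_high / sizeof map_high[0]);
}

/* Three lookups merged with OR; the partial results never overlap. */
static uint16_t decode_address(uint16_t pb, uint16_t pd)
{
    return (uint16_t)(lut_low[pd & 0xFFu] | lut_mid[pd >> 8] | lut_high[pb >> 8]);
}
```

Data and don't-care pins simply have no entry in the maps, so they drop out of the result without any explicit masking.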

To decode these bits, I'm using:

Code: [Select]

/*
 * LOW16 OUTPUT:
 *
 * PD07 -> BIT07 -> A2
 * PD05 -> BIT05 -> A3
 * PD03 -> BIT03 -> A4
 * PD01 -> BIT01 -> A5
 *
 * MID16 OUTPUT:
 *
 * PD14 -> BIT06 -> A6
 * PD12 -> BIT04 -> A7
 * PD10 -> BIT02 -> A8
 * PD09 -> BIT01 -> A9
 * PD08 -> BIT00 -> A10
 *
 * HIGH16 OUTPUT:
 *
 * PB15 -> BIT7 -> A11
 * PB14 -> BIT6 -> A12
 * PB13 -> BIT5 -> A13
 * PB12 -> BIT4 -> A14
 * PB11 -> BIT3 -> A15
 * PB10 -> BIT2 -> DC
 * PB09 -> BIT1 -> DC
 * PB08 -> BIT0 -> DC
 *
 */
__attribute__((section (".itcm_text"), noinline))
static inline uint16_t                                                          
cpu_extract_bus_address(uint16_t pb, uint16_t pd)                               
{                                                                               
    uint16_t low = addr_lut_low[pd & 0x00FF];                                   
    uint16_t mid = addr_lut_mid[pd >> 8];                                       
    uint16_t high = addr_lut_high[pb >> 8];                                     
    return low | mid | high;                                                    
}

/*
 * LOW8 OUTPUT:
 *
 * PD06 -> BIT06 -> D2
 * PD04 -> BIT04 -> D3
 * PD02 -> BIT02 -> D4
 *
 * HIGH8 OUTPUT:
 *
 * PD15 -> BIT07 -> D5
 * PD13 -> BIT05 -> D6
 * PD11 -> BIT03 -> D7
 */
__attribute__((section (".itcm_text"), noinline))                               
static inline uint8_t                                                           
cpu_extract_bus_data(uint16_t pb, uint16_t pd)                                  
{                                                                               
    uint16_t retval = input_data_lut_low[pd];                                   
    retval |= input_data_lut_high[pb];                                          
    retval |= read_bit(pb,  3) << 1 | read_bit(pb, 5);                          
    return retval;                                                              
}

/*
 * LOW16 OUTPUT:
 *
 * D2 -> PD06 -> BIT06
 * D3 -> PD04 -> BIT04 
 * D4 -> PD02 -> BIT02 
 * D5 -> PD15 -> BIT15
 * D6 -> PD13 -> BIT13
 * D7 -> PD11 -> BIT11
 *
 * HIGH16 OUTPUT:
 *
 * D0 -> PB5 -> BIT05
 * D1 -> PB3 -> BIT03
 */
__attribute__((section (".itcm_text"), always_inline))
static inline void
cpu_write_bus_data(uint8_t data)
{
	uint32_t packed = output_data_lut[data];
	uint16_t pb = packed >> 16;
	uint16_t pd = packed & 0x0000FFFF;

	LL_GPIO_WriteOutputPort(GPIOB, pb);
	LL_GPIO_WriteOutputPort(GPIOD, pd);
}

There are a lot of unwanted operations here.

1. Extracting the lower bits of PD via 0x00FF.
2. Extracting the higher bits of PD via >> 8.
3. Extracting the higher bits of PB via >> 8.
4. Finding 3 partial results for addresses, 2 partial results for data.
5. Adding (or ORing, it doesn't matter, because there are no overlapping bits) three partial results.

This compiles to the following (inlining disabled for clarity; otherwise the optimizer can reuse partial results between these functions). If anyone accepts this challenge, note that you're not allowed to merge these two functions: the data is not yet valid when the address has just arrived, so both functions must be callable independently.

Code: [Select]

00000298 <cpu_extract_bus_address>:
 298:   b2ca            uxtb    r2, r1
 29a:   f8df c020       ldr.w   ip, [pc, #32]   @ 2bc <cpu_extract_bus_address+0x24>
 29e:   4b08            ldr     r3, [pc, #32]   @ (2c0 <cpu_extract_bus_address+0x28>)
 2a0:   0a09            lsrs    r1, r1, #8
 2a2:   0a00            lsrs    r0, r0, #8
 2a4:   f833 3012       ldrh.w  r3, [r3, r2, lsl #1]
 2a8:   f83c 1011       ldrh.w  r1, [ip, r1, lsl #1]
 2ac:   4a05            ldr     r2, [pc, #20]   @ (2c4 <cpu_extract_bus_address+0x2c>)
 2ae:   f832 2010       ldrh.w  r2, [r2, r0, lsl #1]
 2b2:   ea43 0001       orr.w   r0, r3, r1
 2b6:   4310            orrs    r0, r2
 2b8:   b280            uxth    r0, r0
 2ba:   4770            bx      lr
 2bc:   20000968        .word   0x20000968
 2c0:   20000768        .word   0x20000768
 2c4:   20000568        .word   0x20000568

000002c8 <cpu_extract_bus_data>:
 2c8:   4b06            ldr     r3, [pc, #24]   @ (2e4 <cpu_extract_bus_data+0x1c>)
 2ca:   f3c0 0cc0       ubfx    ip, r0, #3, #1
 2ce:   4a06            ldr     r2, [pc, #24]   @ (2e8 <cpu_extract_bus_data+0x20>)
 2d0:   5c5b            ldrb    r3, [r3, r1]
 2d2:   5c12            ldrb    r2, [r2, r0]
 2d4:   f3c0 1040       ubfx    r0, r0, #5, #1
 2d8:   4313            orrs    r3, r2
 2da:   ea40 004c       orr.w   r0, r0, ip, lsl #1
 2de:   4318            orrs    r0, r3
 2e0:   4770            bx      lr
 2e2:   bf00            nop
 2e4:   20000c68        .word   0x20000c68
 2e8:   20000b68        .word   0x20000b68

000002ec <cpu_write_bus_data>:
 2ec:   4b04            ldr     r3, [pc, #16]   @ (300 <cpu_write_bus_data+0x14>)
 2ee:   4905            ldr     r1, [pc, #20]   @ (304 <cpu_write_bus_data+0x18>)
 2f0:   f853 3020       ldr.w   r3, [r3, r0, lsl #2]
 2f4:   4a04            ldr     r2, [pc, #16]   @ (308 <cpu_write_bus_data+0x1c>)
 2f6:   0c18            lsrs    r0, r3, #16
 2f8:   b29b            uxth    r3, r3
 2fa:   6148            str     r0, [r1, #20]
 2fc:   6153            str     r3, [r2, #20]
 2fe:   4770            bx      lr
 300:   20002d68        .word   0x20002d68
 304:   58020400        .word   0x58020400
 308:   58020c00        .word   0x58020c00

I call everything including "uxtb", "lsrs", "orrs", the "shifts-and-adds". "uxtb" is a single-cycle 8-bit mask, but it's just a fast shift and add to me. Eyeballing it, the overhead of this entire operation (address in, data in, data out), excluding the LUT loads themselves, is approximately 10 cycles (the Cortex-M7's dual-issue timing details are not entirely clear).

But if you have no cross-bank buses and only need to read one GPIO bank, all of the operations above (the "uxtb", "lsrs", "orrs", the "shifts-and-adds") can be deleted from the source if they're in a critical path. The entire address read would collapse into a single 64 K x 16-bit LUT with one instruction, an LDR in 2 cycles, if you want to take extreme measures for absolutely the best performance. A waste of TCM, but at least it's fast. For 8-bit LUTs it's ideal: on the STM32H7, DTCM1 and DTCM2 can even dual-issue, meaning you can read two buses in just 2 cycles. Impossible to beat.

So I believe my original point stands:
1. one bank + in-order = great, use results directly.
2. one bank + out-of-order = good, theoretically allows a 2-cycle LUT, no ALU needed.
3. two banks = fair, multiple reads, LUTs, multiple bitwise operations needed, or all at the same time.
4. multiple banks = bad, more instructions.

But doing a custom parallel bus protocol is a niche application. It means little if you only use SPI and I2C.

I spent the last week figuring out a new way to save 30 more nanoseconds from /CS to data valid. But this means using the DMA, which cannot tolerate additional table lookups without ruining the bus cycles.

Also, in my original words,

Quote

GPIO itself is very expensive. reading or writing the register of each GPIO bank stalls the CPU for 28 ns. [...] you get a 2x slowdown (56 ns) if it spans over 2 ports, and a 3x slow down (84 ns) [...] shifts and adds, this rearranging wastes perhaps 10 cycles more, you lose another 20 ns even at 480 MHz.

So I clearly emphasized that GPIO, not the ALU, was the bottleneck in the extreme example, which was the H7. I only mentioned bitwise operations in the sense of "adding insult to injury", or "the last straw that breaks the bus timing". I was not paying serious attention to the bit operations, so I don't understand why you thought a minor remark was a good use of your time to refute.

josip · « **Reply #21 on:** May 19, 2026, 11:01:29 am »

I prefer to have ordered GPIO on development boards. For example on my blue pill port A is on one side, port B on another and 3 C pins (13 - 15) between.

DavidAlfa · « **Reply #22 on:** June 13, 2026, 08:12:27 am »

You HAD to use that nasty micro-usb!

(I have so much hate to them)

