On many of the 32-bit ARM microcontrollers one can program in the Arduino environment (my favourites being the Teensies by PJRC.com), each GPIO bank tends to be either 16 or 32 bits/pins wide, and usually has output, set, clear, and toggle registers.
Writing a value to the output register sets all pins in that bank to the specified states.
Writing a value to the other three registers only affects the pins corresponding to bits set in the written value. Writing to the set register sets those pins high. Writing to the clear register sets those pins low. Writing to the toggle register flips the state of those pins between high and low.
(This assumes the pins whose state will change are outputs. Some microcontrollers do use the output state for input pull-up enable or similar, when the affected pin is an input.)
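To make the register semantics concrete, here is a minimal sketch that simulates one bank in plain C. The gpio_bank struct and the gpio_write_*() names are made up for illustration; on real hardware these would be memory-mapped registers with vendor-specific names, so treat this as a model of the behaviour, not as any particular chip's API.

```c
#include <stdint.h>

/* Simulated GPIO bank (hypothetical): 'out' stands in for the output
   latch that the four registers act on. */
typedef struct {
    uint32_t out;
} gpio_bank;

/* Output register: every pin in the bank takes the written state. */
static void gpio_write_out(gpio_bank *b, uint32_t v)    { b->out  = v;  }

/* Set register: pins with a 1 bit go high, the rest are untouched. */
static void gpio_write_set(gpio_bank *b, uint32_t v)    { b->out |= v;  }

/* Clear register: pins with a 1 bit go low, the rest are untouched. */
static void gpio_write_clear(gpio_bank *b, uint32_t v)  { b->out &= ~v; }

/* Toggle register: pins with a 1 bit flip, the rest are untouched. */
static void gpio_write_toggle(gpio_bank *b, uint32_t v) { b->out ^= v;  }
```

Note that only the output register needs to know the state of every pin in the bank; the other three can be written without reading anything back first.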
The toggle is most interesting to me, because it allows one to treat any subset of a bank as a parallel bus without affecting any of the other pins in the same bank. The key to setting the "bus" to a specific state is remembering the state it has been previously set to, since exclusive-OR between the current state and the desired state yields the toggle set. Since many of these ARM MCUs have DMA capabilities, and the aforementioned operation is trivial to do when filling in a buffer of (future) bus states, it makes for an excellent parallel bus DMA mechanism.
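As a rough sketch of that buffer-filling step (the function name and parameters are my own invention, and the desired states are assumed to be already expressed as GPIO bank bits), the exclusive-OR could look something like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a sequence of desired bus states into the words to write to
   the toggle register, one per bus update.
     current: the state the bus was last set to
     mask:    the bank bits that belong to the bus
   Returns the final bus state, which the caller must remember for the
   next buffer. Bits outside 'mask' never appear in the toggle words,
   so the other pins in the bank are left alone. */
static uint32_t fill_toggle_buffer(uint32_t current, uint32_t mask,
                                   const uint32_t *desired,
                                   uint32_t *toggles, size_t count)
{
    current &= mask;
    for (size_t i = 0; i < count; i++) {
        const uint32_t next = desired[i] & mask;
        toggles[i] = current ^ next;   /* bits that must change */
        current = next;
    }
    return current;
}
```

The toggles array is then what a DMA channel would copy, word by word, into the toggle register. How that DMA transfer is set up depends entirely on the microcontroller, so it is left out here.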
Of course, the annoying thing is that the pins you can use for the parallel bus rarely correspond to bits 0..N-1 in the bank. Every GPIO pin tends to have a set of alternative functions (like, say, the SCK pin of an SPI bus, SDA or SCL of an I2C bus, PWM from a specific timer, and so on). One solution to this is to use a lookup array with 2^N 32-bit words, with the index corresponding to the bus state, and the value corresponding to the GPIO bank bits. For N>8, the arrays become too large, but instead of a single lookup, one can use multiple arrays, each dealing with a subset of the bus width. For example, if you have a 16-bit parallel bus, you can pick any 16 pins in a single bank, in any order, and do one lookup each into two separate 256-entry lookup arrays that only need 2048 bytes of memory total. (A single lookup array would need 262144 bytes of memory.) With four lookups and a 16-bit parallel bus, you only need 256 bytes for the four tables, 16 entries in each.
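Here is a minimal sketch of the two-table variant for a 16-bit bus. The bus_pin wiring is completely made up (any 16 distinct pins of a 32-bit bank would do), and the function names are mine, so take it as an illustration of the idea only.

```c
#include <stdint.h>

/* Hypothetical wiring: bus bit i is connected to bank bit bus_pin[i]. */
static const uint8_t bus_pin[16] = {
    3, 2, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 12, 13, 30, 31
};

static uint32_t lookup_lo[256];  /* bank bits for bus bits 0..7  */
static uint32_t lookup_hi[256];  /* bank bits for bus bits 8..15 */

/* Fill both tables once at startup: 2 * 256 * 4 = 2048 bytes. */
static void build_lookups(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint32_t lo = 0, hi = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (v & (1u << bit)) {
                lo |= UINT32_C(1) << bus_pin[bit];
                hi |= UINT32_C(1) << bus_pin[bit + 8];
            }
        }
        lookup_lo[v] = lo;
        lookup_hi[v] = hi;
    }
}

/* Map a 16-bit bus value to GPIO bank bits: two lookups and an OR. */
static inline uint32_t bus_to_bank(uint16_t value)
{
    return lookup_lo[value & 0xFF] | lookup_hi[value >> 8];
}
```

For example, with this wiring bus_to_bank(0x0101) has bank bits 3 and 24 set. The result is a desired state in bank bits, which can then be exclusive-ORed with the previous state to get the word for the toggle register.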
Right now, I'm playing with a BuyDisplay ER-TFT028A2-4 display module, which provides both parallel and serial interfaces to its ILI9341 controller (controlling the 2.8" 240×320-pixel 18-bit color IPS panel on it), with a Teensy 4.0 microcontroller. Since I will be doing framebuffery pixel-manipulation stuff, splitting the 18-bit parallel bus into three separate 6-bit buses (RGB, each with 2⁶=64 intensity levels), I only need 3 lookup tables with 64 entries of 32-bit words each. Except that none of the banks really have enough GPIO pins for me to use, so I'll have to split the buses across different banks.
(The framebuffery pixel-manipulation stuff means things where I will be manipulating the pixel color in a way that I will end up with some kind of value where I can easily pick off the 6 most significant bits for each color component. Mapping them through a lookup table is a minimal extra cost, since I'll be doing something like four multiplications per pixel (for color blending) anyway. All I want to be able to do, is do all this faster than the refresh on the display module, so I won't see annoying tearing on it.)
However, one must also send 8-bit commands and data to the display, and those cover one 6-bit bus, plus two least significant bits from another bus. So, if I could put 12 pins in the same bank, I could have both the three RGB-lookup tables (768 bytes total) and the 256-entry (1024-byte) command/data lookup table making things easy.
So, what I have been doing is experimenting with the pin selection! (And also the backlight for the display module. I do code well, but on the electronics side, I'm rather a beginner hobbyist.) Of course, the pins on the microcontroller evaluation/development boards are not laid out according to the internal bank system, so one ends up with spreadsheets with pin mappings and their alternate functions, playing with the numerous options on which pins to use for which purpose.
Often, the microcontroller processors (especially those using BGA footprints) have many more possible outputs than are exposed on the cheap evaluation/development boards one uses with Arduino, making that kind of selection annoying and frustrating, which is why people doing anything like this will very soon move on to designing their own boards, so they have more options in picking which pin to use for which purpose.
The above means that if you want your Arduino code to be portable, you must stick to the Arduino (library) features, and not use the hardware directly, as the direct hardware access (through ports and registers) will vary from microcontroller to microcontroller. On the other hand, some of the more interesting stuff (like the parallel bus to a display module above) really needs hardware access, and is not nearly as easily portable.
As a result, I would say it is very important to play with all kinds of peripherals and sensors and breakout boards, and understand how they are controlled and communicated with. Not just "I used this library; that's all I know", because that only teaches you how to use libraries written by others. You do need to be interested in finding out exactly how stuff happens, too.
While I do use Arduino for experiments and for stuff I want other Arduino users to be able to use too, I do kinda-sorta like bare-metal development more. I don't see them as one excluding the other; they're just different environments, and I use both. Meanwhile, there are others like PlatformIO that can use Arduino libraries, but do not rely on Arduino or its IDE, and that one should probably also experiment with, if they support your microcontroller.
As in all programming, documentation should never be overlooked, either. Always write comments that describe your intent, not what the code does. (Anyone can read the code and see what it does, but the code does not tell what it is trying to accomplish, how, nor why.) And documentation showing wiring and real-life test cases, rationale for the design choices, and so on, is always useful. So practice that, too, from the get go. It is not something you can just pick up later on, after you "for now concentrate on the code": it becomes much, much harder to learn to do afterwards! I know from hard-won experience. Make good documentation a habit from the get go, and you'll end up being very happy you did so, guaranteed.