The simplest case is DMAing to a whole GPIO port at once; if you're picking and choosing pins it gets more complicated - though the "bit banding" on many M3s is one way to simplify that.
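For anyone who hasn't used bit banding: here's a rough sketch of how the alias address for a single bit is worked out on a Cortex-M3. The region and alias base addresses are the standard Cortex-M3 peripheral ones; the helper name `bitband_alias` and the register address in the usage note are made up for illustration. Whether a DMA master can actually reach the alias region varies by part (on some chips only the CPU can), so treat this as an assumption to check against the reference manual.

```c
#include <stdint.h>

/* Cortex-M3 peripheral bit-band region and the alias region that maps onto it. */
#define PERIPH_BB_BASE   0x40000000u
#define PERIPH_BB_ALIAS  0x42000000u

/* Hypothetical helper: alias word address for bit `bit` of the register
 * at `reg_addr`. Each bit in the bit-band region gets one 32-bit word
 * in the alias region, hence the *32 and *4. */
static inline uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    return PERIPH_BB_ALIAS
         + ((reg_addr - PERIPH_BB_BASE) * 32u)
         + (bit * 4u);
}
```

Writing 1 or 0 to the returned word address changes just that one pin, e.g. `*(volatile uint32_t *)bitband_alias(0x4001080Cu, 5) = 1;` for bit 5 of a (hypothetical) output data register at that address.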
Most, if not all, manufacturers have ways to set individual GPIO pins. E.g. ST have bit set/reset registers, and NXP have a very unusual method where the LSBs of the address used to access the port become a mask.
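A small sketch of what that address-as-mask scheme looks like in practice. This assumes the LPC111x-style GPIO block, where (as far as I recall) address bits [13:2] of the port's data window select which of the 12 pins an access touches; other NXP families differ, so check the user manual. The helper names and the base address in the usage note are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: LPC111x-style GPIO data window, 12 pins per port,
 * address bits [13:2] act as the pin mask for the access. */
#define GPIO_PIN_MASK 0x0FFFu

/* Hypothetical helper: byte offset into the port's data window that
 * exposes only the pins set in `pin_mask`. */
static inline uint32_t gpio_masked_offset(uint16_t pin_mask)
{
    return ((uint32_t)pin_mask & GPIO_PIN_MASK) << 2;
}

/* Hypothetical helper: pointer to the masked view of the port. Writes
 * through it should only affect the pins selected by `pin_mask`. */
static inline volatile uint32_t *gpio_masked_reg(uintptr_t port_base,
                                                 uint16_t pin_mask)
{
    return (volatile uint32_t *)(port_base + gpio_masked_offset(pin_mask));
}
```

For DMA this means the destination address itself encodes the pin selection, e.g. `gpio_masked_reg(port_base, 0x00F)` as a fixed destination so each transfer only ever touches pins 0-3.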
Sure, but like I said, it gets more complicated... not insanely so, but you need to dig in and match the DMA approach to the register(s) available (for example, if you need to access separate set and reset registers, that can require 2 DMA channels and kills atomicity).
The low-end ST parts, for example, have a register that lets you do bit sets and resets as one atomic operation across arbitrary pins on a port, but to use it properly with DMA you'd need to precompute 32 bits per transfer rather than the 16 you would otherwise write directly to the ODR. Not a hard thing to do at all - it just takes an extra XOR per value and a larger data type - but it does make for a more complicated setup, and in some cases even small amounts of data munging can break throughput requirements.
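To make the precompute step concrete, here's a minimal sketch of building those 32-bit words, assuming the usual ST BSRR layout (low half-word sets pins, high half-word resets them). The function names and the `mask` parameter are my own for illustration, not anything from ST's headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Build one BSRR-style word: bits 15:0 set pins, bits 31:16 reset pins.
 * Only pins selected by `mask` are touched; the rest of the port is
 * left alone when the word is written. */
static inline uint32_t bsrr_word(uint16_t value, uint16_t mask)
{
    uint32_t set   = (uint32_t)(value & mask);
    /* The "extra XOR": masked pins that are 0 in `value` must be reset. */
    uint32_t reset = (uint32_t)((value ^ mask) & mask);
    return set | (reset << 16);
}

/* Expand a buffer of 16-bit pin values into the 32-bit words a DMA
 * channel would then stream to the set/reset register. */
static void bsrr_fill(uint32_t *dst, const uint16_t *src, size_t n,
                      uint16_t mask)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = bsrr_word(src[i], mask);
    }
}
```

The cost is exactly what's described above: the DMA buffer doubles in size, and the expansion loop has to run before (or keep ahead of) the transfer.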
That NXP approach, IIRC, is basically bit banding, except it lives in the peripheral's own address space. I seem to recall NXP also doing some things in that GPIO implementation - at least if it's the same one they used on the LPC11U24 - that made writing decent, low-latency GPIO abstractions a real pain (ugliness from throwing assorted registers from multiple GPIO ports together into a single block).