The problem with a roll-it-yourself setup is all the detail problems you bump into when something changes (a software version here, an install directory there...).
Unfortunately, you don't get rid of this problem just by using a vendor's pre-packaged toolset. Even with Atmel Studio, you run into gcc, CMSIS, and ASF versioning issues.
In general, Cortex-M series chips can all be programmed using SWD.
This seems to be the general mechanism at the physical layer, but the "flash memory controller" is usually a vendor-proprietary thing, so I think that actually programming the flash usually requires some sort of chip-specific manipulation.
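To make "chip-specific manipulation" a bit more concrete, here is a rough sketch of the kind of sequence a flash controller typically wants: load a page buffer, write a command word that carries an unlock key, then poll a ready bit. Everything here is made up for illustration (the `fake_flash_ctrl_t` layout, the key value, the command encoding), and the "hardware" is simulated in RAM so it runs on a PC. It also ignores erase and the fact that real flash can usually only clear bits. A real chip's datasheet defines its own registers and sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names and values below are hypothetical, not from any real chip. */
#define FLASH_PAGE_SIZE      64u
#define FLASH_PAGES          4u
#define FLASH_UNLOCK_KEY     0x5Au   /* must sit in command bits 31:24 */
#define FLASH_CMD_WRITE_PAGE 0x01u   /* opcode in command bits 7:0 */
#define STATUS_READY         0x1u
#define STATUS_ERROR         0x2u

typedef struct {
    uint32_t command;                              /* key | page | opcode */
    uint32_t status;                               /* READY / ERROR bits */
    uint8_t  page_buffer[FLASH_PAGE_SIZE];         /* staging latch */
    uint8_t  array[FLASH_PAGES][FLASH_PAGE_SIZE];  /* simulated flash */
} fake_flash_ctrl_t;

/* Host-only stand-in for what the controller hardware would do. */
static void sim_execute(fake_flash_ctrl_t *f)
{
    uint32_t key  = f->command >> 24;
    uint32_t page = (f->command >> 8) & 0xFFFFu;
    uint32_t op   = f->command & 0xFFu;

    if (key != FLASH_UNLOCK_KEY || op != FLASH_CMD_WRITE_PAGE ||
        page >= FLASH_PAGES) {
        f->status = STATUS_READY | STATUS_ERROR;   /* reject the command */
        return;
    }
    memcpy(f->array[page], f->page_buffer, FLASH_PAGE_SIZE);
    f->status = STATUS_READY;
}

/* "Driver" side: the sequence a programmer or bootloader would perform.
 * Returns 0 on success, -1 if the controller flags an error. */
static int flash_write_page(fake_flash_ctrl_t *f, uint32_t page,
                            const uint8_t *data)
{
    assert(data != NULL);
    memcpy(f->page_buffer, data, FLASH_PAGE_SIZE);       /* 1. fill latch */
    f->status  = 0;
    f->command = ((uint32_t)FLASH_UNLOCK_KEY << 24) |    /* 2. keyed cmd  */
                 ((page & 0xFFFFu) << 8) | FLASH_CMD_WRITE_PAGE;
    sim_execute(f);               /* on real hardware this happens by itself */
    while ((f->status & STATUS_READY) == 0) {            /* 3. poll ready */
    }
    return (f->status & STATUS_ERROR) ? -1 : 0;
}
```

The SWD part is the same on every chip; it is steps like these, driven over SWD, that differ from vendor to vendor.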
As someone about to try and jump into the world of 32-bit ARMs, any pointers towards why the transition is hard?
The peripherals tend to be especially complex compared to an 8-bit microcontroller. As someone said, that IS the differentiating feature for ARM-core chips. Given a 32-bit address space, vendors have caught "feature-itis" in a bad way; the GPIO interface that on an 8-bit chip ranges from maybe 1 (8051?) to 4 (AVR) registers might now have 512 bytes (128 32-bit registers) allocated to it (though it probably only uses 40 or so of those registers, as on the SAM3X). Those "complex peripherals" include the clock system and memory controller, so it feels like you can't even get started until you've mastered several hundred pages of datasheet.
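A rough illustration of the size difference, as a sketch only: the two structs below are hypothetical layouts (the field names are mine, not taken from any datasheet), and the "hardware" is emulated by `sim_step()` so the thing runs on a PC. The point is the shape: the small port is a few bytes, while the big one reserves 128 word-sized slots and uses separate set/clear/status registers for each function.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical 8-bit-style port: a handful of one-byte registers. */
typedef struct {
    volatile uint8_t in;    /* read pin levels */
    volatile uint8_t dir;   /* 1 = output */
    volatile uint8_t out;   /* output latch */
} small_port_t;

/* Hypothetical 32-bit-style port: 128 word slots = 512 bytes, most of
 * them unused here. Each function gets set / clear / status registers. */
typedef struct {
    volatile uint32_t out_enable_set;   /* 0x00: write 1s to enable output */
    volatile uint32_t out_enable_clr;   /* 0x04: write 1s to disable       */
    volatile uint32_t out_enable_stat;  /* 0x08: current output enables    */
    volatile uint32_t reserved0;        /* 0x0C                            */
    volatile uint32_t out_data_set;     /* 0x10: write 1s to drive high    */
    volatile uint32_t out_data_clr;     /* 0x14: write 1s to drive low     */
    volatile uint32_t out_data_stat;    /* 0x18: current output levels     */
    volatile uint32_t reserved1[121];   /* 0x1C .. 0x1FF                   */
} big_port_t;

static_assert(sizeof(big_port_t) == 512, "128 registers of 4 bytes each");

/* Driver-style helpers: note there is no read-modify-write involved. */
static void port_set_high(big_port_t *p, uint32_t mask) { p->out_data_set = mask; }
static void port_set_low(big_port_t *p, uint32_t mask)  { p->out_data_clr = mask; }

/* Host-only emulation of the hardware folding set/clear writes into the
 * status registers; a real chip does this by itself. */
static void sim_step(big_port_t *p)
{
    p->out_enable_stat = (p->out_enable_stat | p->out_enable_set) & ~p->out_enable_clr;
    p->out_data_stat   = (p->out_data_stat | p->out_data_set) & ~p->out_data_clr;
    p->out_enable_set = 0;
    p->out_enable_clr = 0;
    p->out_data_set = 0;
    p->out_data_clr = 0;
}
```

On a real part you would point a `big_port_t *` at the peripheral's base address from the datasheet instead of at a variable in RAM, and `sim_step()` would go away.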
ARMs are programmed mostly in C, so the vendors provide libraries to do a lot of stuff. The libraries, of course, expose nearly ALL of the complexities of those fancy peripherals. But they're documented separately, and perhaps not so well, and they're an excuse for the datasheet to be less well done. So now, instead of a 500 page datasheet to understand, you have a 1200 page datasheet and 1000 pages of library documentation (which is probably some sort of weird cross-index web document), each of which has garnered only partial attention.
My advice is ... probably don't start from "bare metal" the way you would on an 8bit. Go ahead and use one of the "toy" environments (MBED, Arduino, etc), or a boilerplate vendor-library skeleton. Pay no attention to the depressing code size (it doesn't matter - you have a lot more flash than on an 8bit), questions of runtime efficiency (doesn't matter either, for startup code), or general incomprehensibility of what happens before your part of the code starts to execute. Go ahead and explore from there - either upward into peripherals and features that your "environment" doesn't support, or downward into how it provides the services that it does provide...