[TCWilliamson], I have not worked with Arduino. But as I understand it, it works as a Hardware Abstraction Layer (HAL) that hides the complexity of the hardware, and differences between platforms.
One way you might approach this is by starting to write your own HAL. You might even choose to start this on the Arduino. That offers the advantage of being a familiar platform to you, and allows you to gradually replace Arduino calls with calls to your own HAL.
Start with plain old digital I/O. As you write this, you'll obviously learn how to directly access the AVR's digital I/O registers. But perhaps more importantly, you'll learn some general coding practices that the Arduino experience might have deprived you of.
For example, even though I haven't used an Arduino, I know that:
With Wiring (which Arduino is based on), digitalWrite(13,HIGH) compiles to a single sbi instruction (which is 2 clock cycles).
This type of optimization is only possible when (at the very least) the pin number is a constant. In this example, the address of the hardware register, the bit position within the register, and the value are all known at compile time; therefore the compiler has the opportunity to condense it to a single instruction.
Otherwise, the digitalWrite function is called, and even I know it's notoriously slow. So is the equivalent function in my HAL. Every time it's called, it must:
1) Check that the pin number is within the valid range of physical pins, and that the pin supports digital I/O.
2) Convert the pin number to both a register address and bit number within the address.
3) Modify the bit.
It's #1 and #2 that consume most of the time! In the process of writing my HAL, I realized that there was a better way. (Oddly, I have never seen this approach mentioned in connection with the Arduino.)
Suppose you want to write to a digital pin whose number is not a compile-time constant, so the access cannot be optimized at compile time. That pin will typically be written repeatedly, and that's where the slowness of a function like digitalWrite really hurts: it re-checks and re-converts the pin on every call, when that work only needs to be done once. So when we know we're going to be writing a pin repeatedly, we could do something like this:
1) Define a structure in the HAL (FASTPIN) that will store the register address and bit number.
2) Create a function in the HAL (FastPinInit) that when passed an instance of the structure and a pin number, checks the pin for validity, converts the pin to register address and bit number, and stores this in the structure.
3) Create macros that when passed the initialized structure, will set (FastPinSet), clear (FastPinClear), or toggle (FastPinToggle) the pin based on the info stored in the structure.
And in your code:
1) Create an instance of FASTPIN.
2) Initialize it by calling FastPinInit.
3) Repeatedly write it by calling FastPinSet, FastPinClear, or FastPinToggle.
This accelerates writing the pin greatly! It's not as fast as a single instruction, obviously, but it's still an order of magnitude faster than calling digitalWrite.
Assume you've implemented something like this on the Arduino, handling both writes and reads, along with some bit-banged (software) I2C and SPI routines that use your FastPin implementation.
Later you want to convert it to ARM. Digital I/O is a little different there, but not much. You might need to change what information is stored in FASTPIN, but that's no big deal; you only need to change the structure definition in one place, and all your code that uses FASTPIN will use the new definition. Then rewrite FastPinInit, and the FastPin* macros, and...
Voila. In a few hours, you have reasonably fast digital I/O on ARM, as well as I2C and SPI on any pin. Eventually you may implement hardware I2C and SPI, perhaps even driven by DMA. But you might not need that level of performance right away, so you can do that later - when needed, or at your leisure. This helps make moving to a new platform faster and less intimidating.
(I took that concept even further in my own HAL. Once I get a single timer, its associated interrupt, and a few other things working on the new platform, I use it to drive a C state engine that functions as a scheduler. This lets me simulate additional timers, PWMs, and other "virtual" peripherals. Not high performance by any means, but it gets me up and running on new platforms really fast. After just a week on PIC32 I have all this running, and can port useful code to it; I'm just taking some time to learn and tweak before moving on to gradually implementing the real peripherals. Of course, writing the original scheduler was difficult, but at least for me, it was fun. You need not take things this far!)
Sorry for the length of this post, but I think it demonstrates how writing your own HAL can be both educational and useful.