OK, but what was the code? PORT is reasonably fast, but it still takes several instructions to write the value and run the loop.
Standard header files include those definitions. The mapping is the same, so in the worst case you need to add a single #define.
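If a definition really is missing, the extra #define could look roughly like the sketch below. The names REG8, MISSING_REG and fake_register_storage are made up for illustration, and the backing variable is only a host-side stand-in; on the real part the address would come from the datasheet, which I am not going to guess here.

```c
#include <stdint.h>

/* Hypothetical helper: treat an address as an 8-bit volatile register. */
#define REG8(addr) (*(volatile uint8_t *)(addr))

/* Host-side stand-in so the sketch compiles and runs off-target.
   On hardware this would be the register address from the datasheet
   instead of the address of a variable. */
static uint8_t fake_register_storage;

/* The one #define you would add if the header lacked it (hypothetical name). */
#define MISSING_REG REG8(&fake_register_storage)

/* Example use: write a value and read it back through the same mapping. */
static uint8_t write_and_read_back(uint8_t value)
{
    MISSING_REG = value;
    return MISSING_REG;
}
```

After that, MISSING_REG behaves like any register name from the vendor header: you can assign to it and read from it directly.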
I'm not sure what you mean by stages. The device uses a shared bus, so the final behavior may depend on what else is going on in the system.
Compilers optimize quite well; you just need to write code in a way that can be optimized. Compilers are not magic: they can only perform optimizations that the standard allows.
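To illustrate what I mean, here is a minimal sketch, not your actual code. It assumes the register is accessed through a volatile object; fake_port is a hypothetical stand-in so the example runs on a PC. Every access to a volatile has to be performed as written, so the compiler cannot merge or drop the reads in the first version. The second version moves the work into locals and leaves only plain writes in the loop.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped output register. */
static volatile uint8_t fake_port;

/* Read-modify-write inside the loop: each |= and &= is a volatile read
   followed by a volatile write, and the compiler must keep all of them. */
static void pulse_rmw(uint8_t mask, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        fake_port |= mask;
        fake_port &= (uint8_t)~mask;
    }
}

/* Read once, precompute both values in locals, then do only writes.
   Caveat: this assumes nothing else (an interrupt, for example) changes
   the other bits of the register while the loop runs. */
static void pulse_precomputed(uint8_t mask, unsigned count)
{
    uint8_t low  = (uint8_t)(fake_port & (uint8_t)~mask);
    uint8_t high = (uint8_t)(low | mask);

    for (unsigned i = 0; i < count; i++) {
        fake_port = high;
        fake_port = low;
    }
}
```

Both functions leave the register in the same final state; the difference is how many accesses the compiler is forced to emit per iteration. Whether that matters for your timing depends on the actual device and the generated code, so check the disassembly.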