Yes, I agree: definitely get the algorithms working on the PC first, then port them as-is, and only then optimise.
There is plenty of tweaking that can be done, and depending on the processor it will be done in different ways; almost all solutions will be implementation specific and require a good knowledge of the target device. This often doesn't mean digging into assembly, but it occasionally does. By understanding the on-chip registers, cache, wait states, pipeline stalls, VLIW scheduling, bus arbitration, loop unrolling, compiler idioms, and intrinsic functions, it's not uncommon to gain an order of magnitude in speed over an unoptimised vanilla algorithm.
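As a concrete (and deliberately generic) illustration of the loop-unrolling point: a minimal C sketch of a dot product, first the readable baseline and then a version unrolled by four with independent accumulators. The function names are mine, and whether unrolling actually helps (and by what factor) is entirely target-specific.

```c
#include <stddef.h>

/* Plain dot product: the readable, obviously-correct baseline. */
float dot_plain(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Unrolled by 4 with independent accumulators: the separate acc0..acc3
 * chains give the pipeline (or VLIW scheduler) more instruction-level
 * parallelism to exploit. The unroll factor is a tuning parameter. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)   /* handle the leftover tail elements */
        acc0 += a[i] * b[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note the changed summation order can alter floating-point rounding slightly, which is one more reason to keep the plain version around for cross-checking.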
One recommendation if you're a newbie: go for a processor that offers single-cycle floating point operations, although there is a power consumption tax on that. Some of the TI VLIW processors will do several floating point operations simultaneously if you can structure and code your algorithm around it; typically this involves pipelining mutually independent operations, which in itself often makes for pretty unreadable code. I almost always keep my readable, working but unoptimised code in there, commented or #ifdef'd out, for documentation and sanity/cross-checking purposes.
Using pre-baked, optimised, vendor-provided libraries is usually a good start, but keep in mind that some aren't very good; for example, parts of the ARM CMSIS-DSP FIR filter code aren't well optimised at the algorithm level.
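This is another place where a readable baseline earns its keep: a plain direct-form FIR like the sketch below (my own naming; it zero-pads the start rather than managing a state buffer the way a library routine such as CMSIS-DSP's would) gives you something to verify a vendor routine's output against, and a floor to benchmark it against before trusting that it's actually faster.

```c
#include <stddef.h>

/* Direct-form FIR: y[i] = sum_k h[k] * x[i-k], treating x[j] as 0
 * for j < 0 (i.e. the filter starts from zero history). */
void fir_direct(const float *h, size_t taps,
                const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= i; ++k)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```

It is O(n * taps) and makes no attempt at block processing or coefficient reuse, which is exactly the point: if a "hand-optimised" library call can't beat it comfortably, that tells you something.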