Fun fact: I bought myself a new Nucleo board, the H743ZI. It's a 400 MHz Cortex-M7. The same test that took 145 us on the 180 MHz Nucleo F4 and 50 us on the 1 GHz Cortex-A8 PocketBeagle now takes... 45 us.
So without bothering with an operating system, boot-up time etc., I get the same calculation speed as on a processor clocked 2.5 times faster. And this is with standard libraries and generic code, without even considering highly optimized custom code for a specific processor. So to me it looks like the Cortex-A8 is not that good, but the Cortex-M7 is absolutely great.
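To put that comparison in terms of clock cycles rather than microseconds, here is a quick sketch. It only multiplies the three timings quoted above by the nominal clock speeds; the function names are made up for illustration, and it says nothing about why the cores differ (caches, memory, FPU design and so on).

```c
#include <stdio.h>

/* Convert a measured time in microseconds into CPU clock cycles.
 * A clock in MHz is cycles per microsecond, so the product is a cycle count. */
double cycles_for(double time_us, double clock_mhz)
{
    return time_us * clock_mhz;
}

/* Print the cycle counts for the three boards, using the timings from the post. */
void print_cycle_comparison(void)
{
    printf("Nucleo F4,   180 MHz: %.0f cycles\n", cycles_for(145.0, 180.0));
    printf("PocketBeagle, 1 GHz:  %.0f cycles\n", cycles_for(50.0, 1000.0));
    printf("Nucleo H743, 400 MHz: %.0f cycles\n", cycles_for(45.0, 400.0));
}
```

That works out to roughly 26 100 cycles on the F4, 50 000 on the A8 and 18 000 on the M7 for the same test, so per clock the M7 comes out well ahead of both.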
As for using lookup tables or some other faster method instead of the standard trig functions, I don't think it's really an option. The calculations are for the kinematics of a 6-DOF robotic arm, and I need precision, especially around singularities. If anything, I'm considering whether I should switch to double-precision calculations. That's one of the advantages of the M7 over the M4: its FPU handles double precision in hardware, while the M4's FPU is single precision only. Still, I'm not rushing into it, since the double-precision FPU is slower than the single-precision one. Anyway, with the current configuration I think I could easily get 10 000 kinematics calculations (i.e. 10k points in space processed) per second in the final application, which I consider a very good result.