ARMs (at least the microcontroller cores) suck at DSP big time, especially when relatively simple operations have to be done on a relatively large set of data. That's because they don't have any hardware support for implementing loops, and the code overhead for looping is pretty bad. What is a 1-cycle MAC good for if going to the next loop iteration takes 9 cycles or something like that? Plus most ARMs (if not all) have only a single ALU and a single data bus. Also, the ALU has no direct connection to memory, as it does on some dedicated architectures. 32-bit SIMD on microcontrollers is nice, but it's of quite limited use if you're aiming at any sensible system resolution.
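To make the loop-overhead point concrete, here's a rough C sketch of a Q15 dot product (the core of a FIR filter) written two ways. The function names are made up for illustration, and the actual cycle counts depend on the core and the compiler, so treat it as a picture of the problem rather than a benchmark. Without hardware loops, the usual workaround is to unroll so the counter update and branch are paid once per several MACs:

```c
#include <stdint.h>
#include <stddef.h>

/* Plain MAC loop: one multiply-accumulate per iteration, so the counter
   update, compare and branch are paid for every single sample. */
int64_t dot_q15_simple(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}

/* Same result, unrolled by 4: the loop bookkeeping is paid once per
   four MACs instead of once per MAC. */
int64_t dot_q15_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)x[i]     * h[i];
        acc += (int32_t)x[i + 1] * h[i + 1];
        acc += (int32_t)x[i + 2] * h[i + 2];
        acc += (int32_t)x[i + 3] * h[i + 3];
    }
    /* Leftover samples when n is not a multiple of 4. */
    for (; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}
```

Both functions return the same value; the unrolled one just trades code size for less branching. A DSP with zero-overhead hardware loops doesn't need that trade at all.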
Actually, while I'm not a fan (to say the least) of Microchip microcontrollers, their dsPIC33 series is pretty neat. It's 16-bit and runs at up to 70 MIPS. It can execute up to 8 instructions per cycle IIRC, which kind of alleviates the problem of the architecture being 2 clk/instruction. Those chips have a DSP engine with a hardware multiplier and a barrel shifter, they support saturating and rounding logic in hardware, have two specialized memory regions for DSP instructions, support hardware looping, etc. On the downside, the built-in memory is rather modest in comparison to mid- and high-end CM4 microcontrollers.
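For anyone wondering what "saturating and rounding logic in hardware" saves you, here's a minimal portable C sketch of the same kind of work done in software. This is not how the dsPIC engine is implemented (the accumulator width and rounding mode here are my own assumptions, and the function names are hypothetical); it only shows the extra compares and adds a general-purpose core has to spend per sample:

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp a 64-bit value to the int32 range instead of letting it wrap. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Q15 dot product with a saturating Q30 accumulator.
   The result is rounded to Q15 and saturated to 16 bits. */
int16_t mac_q15_sat_round(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;                               /* Q30 accumulator */
    for (size_t i = 0; i < n; i++) {
        int32_t prod = (int32_t)x[i] * h[i];       /* Q15 * Q15 = Q30 */
        acc = sat32((int64_t)acc + prod);          /* saturating add  */
    }
    /* Round: add half an LSB of the Q15 result, then drop 15 bits.
       Assumes >> on a negative value is an arithmetic shift, which is
       implementation-defined in C but true on common compilers. */
    int64_t rounded = ((int64_t)acc + (1 << 14)) >> 15;
    if (rounded > INT16_MAX) rounded = INT16_MAX;
    if (rounded < INT16_MIN) rounded = INT16_MIN;
    return (int16_t)rounded;
}
```

For example, 0.5 * 0.5 in Q15 (16384 * 16384) comes out as 8192, i.e. 0.25, while -1.0 * -1.0 (-32768 * -32768) clamps to 32767 instead of wrapping around to a negative number.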
dsPICs are not on par with BF, TMS320 and such, but they are relatively cheap, so you might find that if you have, say, 3 channels of data to process, it's cheaper to put an individual dsPIC on every channel than to use a single full-blown DSP.