Seems like ARM describes a standardized NPU peripheral:
https://developer.arm.com/documentation/102023/0000/Introduction-to-the-NPU/Description-of-the-neural-processing-unit

I imagine it to be a binary bitstream of some kind that you have to load into the peripheral, containing the weights and a computational program (the neural layers unfolded into matrix multiplies, adds, etc.). The introduction says there is an open-source tool that can produce a command stream, which I suspect takes a neural network (or trains one) and then outputs this bitstream for the NPU.
From the diagrams it looks like the NPU will pull input buffers from the main bus via its own(?) DMA engine into a scratchpad, do the matrix computations on them internally, and publish the final result via DMA, or maybe through a register set (e.g. if the neural network only outputs a few values).
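Purely to make that flow concrete, driving such a block from the host might look something like the sketch below. Every register name, offset and bit here is invented for illustration; none of it comes from the STM32N6 or Ethos-U documentation.

```c
#include <stdint.h>

/* Hypothetical register map -- made up for illustration only. */
#define NPU_BASE        0x50000000u
#define NPU_CMDSTREAM   (*(volatile uint32_t *)(NPU_BASE + 0x00)) /* compiled command stream */
#define NPU_INBUF       (*(volatile uint32_t *)(NPU_BASE + 0x04)) /* input tensor address    */
#define NPU_OUTBUF      (*(volatile uint32_t *)(NPU_BASE + 0x08)) /* output tensor address   */
#define NPU_CTRL        (*(volatile uint32_t *)(NPU_BASE + 0x0C))
#define NPU_STATUS      (*(volatile uint32_t *)(NPU_BASE + 0x10))
#define NPU_CTRL_START  (1u << 0)
#define NPU_STATUS_DONE (1u << 0)

/* Point the NPU's DMA at the command stream (weights + "program")
 * and the I/O buffers, kick it off, and wait for completion. */
void npu_run(const void *cmd_stream, const void *input, void *output)
{
    NPU_CMDSTREAM = (uint32_t)(uintptr_t)cmd_stream;
    NPU_INBUF     = (uint32_t)(uintptr_t)input;
    NPU_OUTBUF    = (uint32_t)(uintptr_t)output;
    NPU_CTRL      = NPU_CTRL_START;
    while (!(NPU_STATUS & NPU_STATUS_DONE))
        ;  /* real code would sleep on the NPU interrupt instead */
}
```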
ST's marketing slides talk about a 75x speed-up over, presumably, a standard C implementation. Even if the CPU manages only 1 FP16 MAC per 6 cycles (2 loads, 1 MAC, 1 store, 1 decrement, 1 branch-if-not-zero), a 75x speed-up means the NPU has to sustain roughly 75/6 ≈ 12.5 MACs per cycle. So that makes me guess the NPU has some 256-bit vector ALU inside, maybe even more.
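For reference, this is the kind of naive scalar loop that 6-cycles-per-MAC budget corresponds to (plain C with float, since standard C has no portable FP16 type):

```c
#include <stdio.h>

/* Naive scalar MAC loop: per element roughly 2 loads, 1 multiply-
 * accumulate, 1 decrement and 1 branch-if-not-zero (plus a store if
 * the compiler spills the accumulator) -- about 6 cycles per MAC. */
static float mac_reference(const float *w, const float *x, int n)
{
    float acc = 0.0f;
    while (n--)                 /* decrement + branch */
        acc += *w++ * *x++;     /* 2 loads + 1 MAC */
    return acc;
}

int main(void)
{
    const float w[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float x[4] = { 0.5f, 0.5f, 0.5f, 0.5f };

    /* 75x faster than 1 MAC / 6 cycles = 75/6 = 12.5 MACs per cycle,
     * which is what suggests a wide parallel MAC array in the NPU. */
    printf("dot = %f\n", mac_reference(w, x, 4));
    return 0;
}
```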
TinyYoloV2 needs about 6.97 billion FLOPs per 416x416 frame (the figure published alongside its 207 fps desktop-GPU benchmark). STM32N6 can do about 18 fps => roughly 125 GFLOPS of effective throughput.
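Spelling the arithmetic out, taking the published numbers at face value:

```c
#include <stdio.h>

int main(void)
{
    double flop_per_frame = 6.97e9; /* TinyYoloV2 cost per 416x416 frame */
    double gpu_fps        = 207.0;  /* published desktop-GPU benchmark   */
    double n6_fps         = 18.0;   /* reported STM32N6 frame rate       */

    /* Effective throughput = per-frame cost x frame rate. */
    printf("GPU:     %7.0f GFLOPS\n", flop_per_frame * gpu_fps / 1e9);
    printf("STM32N6: %7.1f GFLOPS\n", flop_per_frame * n6_fps  / 1e9);
    return 0;
}
```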
I suppose there is some overhead involved with swapping out matrix weights for large networks.
I wonder how long it will take until someone abuses the NPU for DSP too. We have had SIMD on the Cortex-M4, but since the registers are still only 32 bits wide, its use is a bit limited. I suppose you could load a weight matrix that turns the "matrix" operation into a mostly vectored one (weights only on, or banded around, the diagonal) and implement FIR, IIR, matched filters, mixers, etc. like that.
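A minimal sketch of that trick, assuming nothing of the NPU beyond a plain matrix-vector multiply: an FIR filter is just a banded Toeplitz weight matrix, and the inner loop below stands in for what the NPU's MAC array would execute in one go.

```c
#include <stdio.h>

#define N_TAPS 4
#define N_IN   16
#define N_OUT  (N_IN - N_TAPS + 1)

int main(void)
{
    /* 4-tap moving average as the FIR to implement. */
    const float taps[N_TAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };
    float a[N_OUT][N_IN] = { { 0 } };
    float x[N_IN], y[N_OUT];

    for (int i = 0; i < N_IN; i++)
        x[i] = (float)i;                 /* toy input: a ramp */

    /* Build the banded Toeplitz "weight matrix": each row is the tap
     * vector shifted one column to the right, i.e. the weights sit on
     * and just off the diagonal. */
    for (int r = 0; r < N_OUT; r++)
        for (int t = 0; t < N_TAPS; t++)
            a[r][r + t] = taps[t];

    /* y = A * x -- on the NPU this would be a single matrix multiply. */
    for (int r = 0; r < N_OUT; r++) {
        y[r] = 0.0f;
        for (int c = 0; c < N_IN; c++)
            y[r] += a[r][c] * x[c];
        printf("y[%2d] = %6.2f\n", r, y[r]);
    }
    return 0;
}
```

A purely diagonal matrix gives you the element-wise multiply you'd want for a mixer; the FIR needs the band around the diagonal.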