[ Disclosure: I worked for Tensilica 2000-2005 before it was acquired by Cadence. By far, the best company I ever worked for. ]
The way the Tensilica Xtensa Processor Generator works is that licensees use a GUI to select among a bunch of "canned" processor features: multipliers, MMU, cache size/type/none, special instructions for this or that. Additionally, using a special language called "TIE", fully custom instructions can be created that use custom IO, custom register files, etc. When the user clicks "go", the processor RTL is created on the fly, along with a verification suite, scripts for synthesis and layout, and a full suite of GCC-based software tools that understand all the customizations, including the new instructions.
Regarding the compiler, at a minimum, intrinsics are created so that you can instantiate variables matching the register file entries and call the instructions directly on them without dropping to assembler. But when the register and instruction types created result in vector operations on "normal" types (float, int32, uint32, etc), the compiler is generally smart enough to use them on its own, unrolling loops and such.
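To make the two styles concrete, here is a rough sketch in plain C. The type vec4i32 and the functions vec4_load, vec4_store and vec4_add are hypothetical names I made up for illustration; they are not real Xtensa or TIE identifiers, and they are emulated in portable C so the example runs anywhere. On a configured core, each would presumably map to a single custom instruction operating on a custom register file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL: stands in for a type the generated tools would map onto a
   custom 4 x int32 register file. Not a real Xtensa/TIE name. */
typedef struct { int32_t lane[4]; } vec4i32;

/* HYPOTHETICAL intrinsics, emulated in portable C for illustration only. */
static inline vec4i32 vec4_load(const int32_t *p) {
    vec4i32 v;
    for (int i = 0; i < 4; i++) v.lane[i] = p[i];
    return v;
}

static inline void vec4_store(int32_t *p, vec4i32 v) {
    for (int i = 0; i < 4; i++) p[i] = v.lane[i];
}

static inline vec4i32 vec4_add(vec4i32 a, vec4i32 b) {
    vec4i32 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Style 1: the programmer calls the intrinsics directly, four lanes at a
   time, without dropping to assembler. */
void add_explicit(int32_t *out, const int32_t *a, const int32_t *b, size_t n) {
    assert(n % 4 == 0); /* this sketch skips the leftover-element case */
    for (size_t i = 0; i < n; i += 4)
        vec4_store(out + i, vec4_add(vec4_load(a + i), vec4_load(b + i)));
}

/* Style 2: an ordinary loop over a "normal" type. This is the kind of code
   a vectorizing compiler can map onto the vector instructions by itself. */
void add_plain(int32_t *out, const int32_t *a, const int32_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Both functions compute the same result; the difference is only in who picks the instructions, you or the compiler.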
It's really an incredible platform, and was far ahead of its time. It was basically everything RISC-V claims to be regarding extensibility, except better and more automatic, but, of course, not free or open.
Anyway, the Xtensa HiFi engine 4 is essentially a fully canned audio DSP where all these choices have been made in advance by Tensilica, but conceptually it works the same as any Xtensa processor. Tensilica definitely provided NXP with a compiler and all the regular tools.
Is there an optimized FFT? I am absolutely sure there is and it is part of the basic support for the DSP.
Will NXP be providing these tools to users for free? Don't know. Will they provide the FFT library to users? Don't know.
Someone asked whether the ESP32 tools could be used to build binaries for this machine. The answer is a definite maybe, if the ESP32's instruction set is a strict subset of whatever is in this processor. Things like the linker map would probably have to be adjusted, and of course the compiler would not use the DSP instructions on its own, but you could create intrinsics for them. A far better path would be to use the compiler that was generated for the DSP.
-- dave j