TBH, this doesn't sound too hard as stated. It should be obvious that the microcontroller core can't do this in real time, so you will need a GPIO DMA with enough speed and bits to handle your ADC. You also need the clock to work, so either you need a GPIO peripheral that can be externally clocked or you need the sample clock to be synchronous with your master clock. These requirements, plus sufficient SRAM to hold the whole capture, should filter you down to a handful of candidate parts, and you can easily compare the cost and power budget vs. an external SRAM-based FIFO or a small FPGA.
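To make that filtering step concrete, here is a rough sketch of the arithmetic. The 12-bit ADC width, the 50 MSPS sample rate, and both parts in the table are made-up numbers for illustration; plug in your own ADC and datasheet values. It also assumes each sample is padded to whole bytes in SRAM.

```python
def capture_requirements(num_samples, adc_bits, sample_rate_hz):
    """Return (buffer_bytes, dma_bytes_per_second) for one capture."""
    # Assumption: each sample is padded up to a whole number of bytes in SRAM.
    bytes_per_sample = (adc_bits + 7) // 8
    buffer_bytes = num_samples * bytes_per_sample
    dma_bytes_per_second = sample_rate_hz * bytes_per_sample
    return buffer_bytes, dma_bytes_per_second


def part_is_candidate(part, num_samples, adc_bits, sample_rate_hz):
    """Apply the checks from the text: DMA width, DMA speed, clocking, SRAM."""
    buffer_bytes, dma_rate = capture_requirements(
        num_samples, adc_bits, sample_rate_hz
    )
    clock_ok = part["gpio_ext_clock"] or part["sample_clock_synchronous"]
    return (
        part["gpio_dma_bits"] >= adc_bits
        and part["gpio_dma_bytes_per_s"] >= dma_rate
        and part["free_sram_bytes"] >= buffer_bytes
        and clock_ok
    )


# Hypothetical parts; none of these numbers come from a real datasheet.
PARTS = [
    {
        "name": "mcu_a",
        "gpio_dma_bits": 16,
        "gpio_dma_bytes_per_s": 200_000_000,
        "free_sram_bytes": 128 * 1024,
        "gpio_ext_clock": True,
        "sample_clock_synchronous": False,
    },
    {
        "name": "mcu_b",
        "gpio_dma_bits": 16,
        "gpio_dma_bytes_per_s": 200_000_000,
        "free_sram_bytes": 64 * 1024,
        "gpio_ext_clock": True,
        "sample_clock_synchronous": False,
    },
]

# Assumed requirements: 40,000 samples, 12-bit ADC, 50 MSPS.
candidates = [
    p["name"] for p in PARTS if part_is_candidate(p, 40_000, 12, 50_000_000)
]
print(candidates)
```

With these assumed numbers the capture needs 80,000 bytes of SRAM and 100 MB/s of sustained DMA throughput, so the part with only 64 KiB free drops out on SRAM alone.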
The thing to emphasize in your comparison is flexibility and extensibility. 40,000 samples is maybe fine, but what happens if your requirements change to 80,000 samples? With an SRAM-based FIFO you just pick a bigger SRAM chip and add one address line. With an FPGA you may not need to do anything if it already has enough SRAM built in; otherwise maybe you migrate to a larger pin-compatible part in the same family. With an MCU you probably have to scrap the project and start over with an FPGA, or at the very least do a complete redesign with multiple interleaved MCUs. If there is any chance you will need to process the data in real time (for instance, to detect some error/out-of-range condition and trigger a halt in deterministic time), you want an FPGA.
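The "add one address line" point for the external SRAM option is easy to check. This is a small sketch, assuming one sample per SRAM word and a power-of-two memory depth; `address_lines` is just a helper name made up for the example.

```python
def address_lines(depth_words):
    """Address lines needed to index depth_words locations (one sample per word)."""
    if depth_words < 1:
        raise ValueError("depth_words must be at least 1")
    # ceil(log2(depth_words)) using integer arithmetic only.
    return (depth_words - 1).bit_length()


for samples in (40_000, 80_000):
    lines = address_lines(samples)
    print(samples, "samples:", lines, "address lines,", 2 ** lines, "word SRAM")
```

40,000 samples fit in a 64K-word SRAM behind 16 address lines, and 80,000 samples need a 128K-word SRAM behind 17, so doubling the depth really does cost just one extra line on the address counter.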