With some loop unrolling and instruction reordering to hide the latency of the load you'd have code that looks like:
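[code]LDR samples, [buf], #4       ; load 4 byte-wide samples, then advance buf by 4
CBNZ res, found              ; branch if the previous comparison result is non-zero
USUB8 res, samples, triggers ; per-byte subtract: res = samples - triggers
LDR samples, [buf], #4
CBNZ res, found
USUB8 res, samples, triggers
LDR samples, [buf], #4
CBNZ res, found
USUB8 res, samples, triggers
[/code]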
That would check for a trigger occurring in 4 samples every 3 processor clock cycles, i.e. 224MSPS with the MCU's 168MHz clock.
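To make that arithmetic easy to repeat for other clock speeds, here is a minimal C sketch. The function name is made up for illustration, and it assumes every instruction in the LDR/CBNZ/USUB8 group really does take one cycle, which is the ideal case.

```c
#include <assert.h>
#include <stdint.h>

/* Peak trigger-check rate in samples per second (hypothetical helper).
 * clock_hz:          core clock, e.g. 168 MHz
 * samples_per_group: samples tested per unrolled group (4 bytes per LDR)
 * cycles_per_group:  cycles per LDR/CBNZ/USUB8 group (3 if each takes 1 cycle)
 * Integer division truncates any fractional sample per second. */
static uint64_t peak_sample_rate_hz(uint64_t clock_hz,
                                    unsigned samples_per_group,
                                    unsigned cycles_per_group)
{
    assert(cycles_per_group > 0);
    return clock_hz * samples_per_group / cycles_per_group;
}
```

Calling `peak_sample_rate_hz(168000000, 4, 3)` gives 224000000, matching the 224MSPS figure.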
Obviously the whole thing wouldn't go that fast, but I think that 100MSPS might be realistically achievable.
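For anyone who wants to try the comparison logic on a PC first, here is a rough C model of what the listing does. It is only a sketch: `usub8` models the byte-lane subtraction of the USUB8 instruction (the GE flags are ignored), and `first_nonzero_word` is a made-up name for a loop that mirrors the LDR/USUB8/CBNZ sequence without the reordering. What a non-zero result means for triggering depends on how `triggers` is set up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of USUB8: four independent unsigned byte subtractions,
 * each wrapping modulo 256. The GE flags are not modelled. */
static uint32_t usub8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= ((x - y) & 0xFFu) << (8 * lane);
    }
    return r;
}

/* Mirrors the listing: returns the index of the first word whose USUB8
 * result is non-zero (where CBNZ would branch), or n if there is none. */
static size_t first_nonzero_word(const uint32_t *buf, size_t n,
                                 uint32_t triggers)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (usub8(buf[i], triggers) != 0)
            return i;
    }
    return n;
}
```

A word-at-a-time model like this is handy for checking trigger patterns against captured data before committing to hand-scheduled assembly.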
The Cortex-M4 load instructions actually take 2 cycles to execute, but back-to-back loads are pipelined so only the first one takes 2 cycles. So it is best to load a block of registers first to minimize the extra cycle overhead of the first load.
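A quick way to see what blocking the loads buys you is to count cycles under that timing model. The sketch below is just my reading of it: n back-to-back loads cost n + 1 cycles, a load separated from other loads costs 2, and CBNZ (not taken) and USUB8 are assumed to cost 1 cycle each. The function names are invented for illustration.

```c
#include <assert.h>

/* n back-to-back loads: the first takes 2 cycles, the rest 1 each. */
static unsigned back_to_back_load_cycles(unsigned n_loads)
{
    return n_loads == 0 ? 0 : n_loads + 1;
}

/* Load n words as one block, then do one USUB8 and one CBNZ per word
 * (assumed 1 cycle each when the branch is not taken). */
static unsigned blocked_cycles(unsigned n_words)
{
    assert(n_words > 0);
    return back_to_back_load_cycles(n_words) + 2 * n_words;
}

/* Loads interleaved with other instructions: each load pays the full
 * 2 cycles, plus 1 cycle each for USUB8 and CBNZ. */
static unsigned interleaved_cycles(unsigned n_words)
{
    return 4 * n_words;
}
```

With 4 words per block this gives 13 cycles against 16 for the interleaved ordering, so the extra cycle shrinks as the block gets longer, limited by how many registers are free.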
Also the memory reads have to compete with the DMA writes of the ADC data to RAM so some loads may stall by an extra cycle. They might not of course - it depends on the relative priorities of the DMA and the M4 core and the implementation of the arbitration. Since DMA implementation is vendor specific I guess the behaviour will differ between the STM32F407 and the GD32 version.
The good news though is that, from the other thread, the GD32F407 must be overclocked to 250MHz, otherwise it wouldn't be able to read the ADC data at 125MSPS. Note that the STM32F407 requires 4 clock cycles for each GPIO to memory DMA transfer, so this approach only works on the GD32.
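The numbers behind that can be checked with a couple of lines of C. This is only a back-of-envelope sketch with made-up function names, and it assumes one sample per DMA transfer and a DMA clocked at the core clock.

```c
#include <assert.h>
#include <stdint.h>

/* Highest DMA transfer rate if every transfer costs a fixed number
 * of clock cycles (assumption: one sample per transfer). */
static uint64_t max_dma_rate_hz(uint64_t clock_hz,
                                unsigned cycles_per_transfer)
{
    assert(cycles_per_transfer > 0);
    return clock_hz / cycles_per_transfer;
}

/* Clock cycles available per sample at a given sample rate. */
static uint64_t cycles_per_sample(uint64_t clock_hz,
                                  uint64_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;
}
```

Under those assumptions the STM32F407 tops out at `max_dma_rate_hz(168000000, 4)`, i.e. 42MSPS, while 125MSPS at 250MHz leaves `cycles_per_sample(250000000, 125000000)`, i.e. 2 cycles per transfer.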
Conclusion: hopefully it will be just possible to handle the triggering without the DMA overrunning, depending on the internal bus contention impact. There will also be some additional overhead in swapping the DMA buffers over, but this will be amortized over a whole buffer (64k?) so should be minimal.
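To put a rough number on "minimal", here is a tiny C sketch of the amortization. The 100-cycle swap cost in the usage note is purely a placeholder I picked for illustration, and the 64k buffer size is the guess from above, so treat the result as an order-of-magnitude figure only.

```c
#include <assert.h>

/* Extra cycles per sample caused by swapping DMA buffers, when the
 * swap cost is spread over every sample in the buffer. */
static double swap_cycles_per_sample(double swap_overhead_cycles,
                                     double buffer_samples)
{
    assert(buffer_samples > 0.0);
    return swap_overhead_cycles / buffer_samples;
}
```

For example, `swap_cycles_per_sample(100.0, 65536.0)` is roughly 0.0015 cycles per sample, which is negligible next to a budget of 2 cycles per sample.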
It may be possible to avoid having to check, and reset, the 'DMA buffer half full' flags, which you would normally do to know when to supply the DMA with the address of the next buffer once the current one is full. This might be avoidable by padding the trigger detect code with enough NOPs that its execution time exactly matches the incoming data rate.
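Working out how many NOPs that would take is simple arithmetic, sketched below in C. The function name is hypothetical, and the sketch assumes each NOP costs exactly one cycle. That is not guaranteed: ARM's documentation notes that the processor may remove a NOP from the pipeline before it executes, so the padding would need to be verified on real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 1-cycle padding instructions needed per 32-bit word so the
 * detect code consumes data exactly as fast as it arrives.
 * Returns -1 if the cycle budget per word is not a whole number, or if
 * the code is already slower than the incoming data. */
static int nop_padding(uint64_t clock_hz, uint64_t sample_rate_hz,
                       unsigned samples_per_word, unsigned busy_cycles)
{
    assert(sample_rate_hz > 0);
    uint64_t numerator = clock_hz * samples_per_word;
    if (numerator % sample_rate_hz != 0)
        return -1;                /* no whole number of cycles per word */
    uint64_t budget = numerator / sample_rate_hz;
    if (budget < busy_cycles)
        return -1;                /* detect code is already too slow */
    return (int)(budget - busy_cycles);
}
```

At 250MHz and 125MSPS with 4 samples per word the budget is 8 cycles per word, so an ideal 3-cycle group would need `nop_padding(250000000, 125000000, 4, 3)`, i.e. 5 NOPs. Any bus stall from DMA contention would break the match, so this only works if the stalls are absent or perfectly regular.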
[EDIT] Using a 'load multiple' (LDM) instruction to load a block of ADC samples into registers might not work. I don't know whether it would stall a DMA write until the LDM instruction completes, which would cause a DMA overrun.