You could try to compute the S-box result instead of using a lookup table; that could be faster on a GPU.
On a GPU, that will most likely be faster indeed.
Sounds definitely doable with CUDA. Learning how to use CUDA properly is not completely trivial though.
Yeah, not trivial at all
The S-box unfortunately does not have a formula behind the layout.
It's more like it has been constructed by a game of darts.
I have been running different software to convert the lookup table to logic. I also tried a pure-logic approach on an FPGA versus the BRAM (ROM) and let the FPGA compiler squeeze the logic, but it always ends up using LUTs equivalent to a full decode.
The BRAM has an fmax limit well below the logic fmax, and even to run the BRAM at its maximum fmax you need to pipeline all the surrounding logic, which eats a lot of registers (reducing the number of cores).
Also, FPGAs get expensive at the larger sizes and are not as easily accessible as a CUDA GPU.
That's why I'm trying to use a hammer to get some better results on the GPU.
The Visual Profiler is a good tool for seeing a bit of what is happening while the code runs.
I will do some more experiments on the GPU with the local "shared" memory to see if I can get it to run more smoothly; at least I'm learning some new tricks.
Thank you for all the ideas.