I'm working on an algorithm project where each round has an S-box with 8-bit input/output (256 bytes). It has to read the S-box on every round and do some XORs on every round as well.
So in software you typically have an array holding the S-box and read out the result for the input you need.
I want to run as many parallel instances of this algorithm as possible to get the highest possible throughput.
So I have been looking at CUDA on a GPU, which is a logical way to get many cores. The problem is that memory access is a bottleneck for a CUDA core, since the cores share the RAM (in various ways); I guess the "texture" memory is best suited for this kind of job.
But the RAM access makes it slow, since 50% of the code is S-box lookups, and I guess there is a lot of wait time due to collisions when the cores want to read data from the shared RAM.
So now for the question:
On Microchip PIC chips it's quite normal to do a lookup this way:
movf count, w ;put value in W
call Table1
Table1 ADDWF PCL, f ;data table for bit pattern
retlw b'10000000'
retlw b'01000000'
retlw b'00100000'
retlw b'00010000'
retlw b'00001000'
retlw b'00000100'
retlw b'00000010'
retlw b'00000001'
retlw b'00000010'
retlw b'00000100'
retlw b'00001000'
retlw b'00010000'
retlw b'00100000'
retlw b'01000000'
..
..
This method doesn't use any RAM; it just adds a little to the program ROM.
Is it possible to do something similar in CUDA on a GPU?
Use an FPGA. Nothing else will give you as many independent parallel execution units as an FPGA.
You could try to compute the S-box result instead of using a lookup table; that could be faster on a GPU.
On a GPU, it will most likely be faster indeed.
Sounds definitely doable with CUDA. Learning how to use CUDA properly is not completely trivial though.
Yeah, not trivial at all
Unfortunately the S-box doesn't have a formula behind its layout.
It's more like it was constructed by a game of darts.
I have been running different software to convert the lookup table to logic, and I also tried a pure-logic approach in the FPGA vs. the BRAM (ROM), letting the FPGA compiler squeeze the logic; it always uses LUTs equivalent to full decoding.
The BRAM has an fmax limit way lower than the logic fmax, and to even run the BRAM at its max fmax you need to pipeline all the surrounding logic, which eats a lot of registers (reducing the number of cores).
Also, FPGAs get expensive when you go for big sizes, and they are not as easily accessible as a CUDA GPU.
That's why I'm trying to use a hammer to get some better results on the GPU.
The Visual Profiler is a good tool to see a bit of what is happening while the code runs.
I will do some more experiments on the GPU and the local "shared" memory to see if I can get it to run more smoothly; at least I'm learning some new tricks.
Thank you for all the ideas
The $15 Xilinx Spartan-7 XC7S6 has 900 slices. That's enough for 100+ S-boxes. If you manage to run them at 400 MHz, this gives you 40G S-boxes per second. 500 MHz is doable on -2 speed grades, and even 600 MHz may be feasible, depending on your other logic. That's hard to beat with a GPU.
BRAM, of course, is slower, but a 36k block can be made into two 18k blocks, each of which is dual-ported, so you can host 4 S-boxes per BRAM. If you pipeline it and run it at 300 MHz (faster may be possible), that's 1.2G S-boxes per second for each 36k BRAM block. The smallest S-7 has only 5 BRAM blocks though. But if you go bigger, the XC7S25 (about $30) has 45 BRAM blocks: 54G S-boxes per second in BRAM alone, with 3500 slices left for other needs; though if you use all of the logic you may get cooling problems.
How fast do you need it?
Buy a CUDA board like the JK1 and forget FPGAs; they're not appropriate in this case.
Perhaps you're right. Modern GPUs may have become dramatically more capable since the last time I looked.
GPUs are made in such a way that the individual processors don't have to compete for data access.
They are made for graphics work, which has high locality of reference in the vertex and texture caches for the bulk of the data. Despite all the MIPS of modern GPUs, I'd still choose an FPGA to implement an encryption accelerator; it's so inelegant to use a chip designed for floating-point operations to do nothing but LUTs.
This sounds very fun, but it really depends on exactly what you want to achieve... To get good CUDA performance you need to ensure there is always sufficient data flowing, which in turn means optimising cache access and the movement of data from the host to the card. For applications like graphics transformations, where the same algorithm operates across the whole image, this is easier to achieve than when access is more ad hoc. As you have pointed out, the cost of moving data to/from the card is high (although it is decreasing as architectures evolve).
There is tons of literature out there for say AES on GPU. But also some more recent FPGA stuff...
https://www.nallatech.com/40gbit-aes-encryption-using-opencl-and-fpgas/
If it is for fun, the GPU is the way to go, as it is super cheap to get started... if you are trying to break NSA encryption then perhaps it's worth looking at FPGAs.
PS
It's been a while since I wrote any CUDA, but at the time the memory hierarchy looked like this:
https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/
It is vital to get memory access right, otherwise you won't achieve the occupancy goals you seek.
I have been using the Altera Cyclone V E A9 with 301K LEs and 1220 M10K blocks. Used as dual-port ROM that would give 2440 S-boxes, with an fmax of 240 MHz for the M10K (if registered on both input and output). I have managed to fit 32 cores of 56 S-boxes = 1792 S-boxes, plus a lot of LUT logic... the chip is about 93% full.
All of this can run at 200 MHz (yes, I have cooling on the FPGA) = 358G S-boxes/s, which is very fine for an FPGA.
My goal was to test 2^48 keys in less than 10 seconds; right now it takes 12 hours, so it would need a lot of FPGAs.
This is a project for fun and learning so that's why I try to see if there are other solutions like GPU
I will do some more reading / testing
For AES encryption on Xilinx's 7-series, if you fully pipeline everything, you get 2 pipeline stages per round. With the 14 rounds of AES-256 that is 28 pipeline stages, each requiring 128 slices, so 3584 slices plus 16 slices for the initial key XOR = 3600 slices. I'm pretty confident you can run this at 400 MHz, most likely faster, which gives you 400M blocks/s per line. If you use one block per try, to get under 10 seconds you would need roughly 70,000 lines working in parallel.
If you do decryption, it is more complex and will take even more fabric. And if you also do key expansion...
Good luck with the GPU. Maybe they have really fast ones now; they need to deal with 8K displays after all. However, I doubt it can get you anywhere close to under 10 seconds.
You can't use a jump table because each warp lane will jump to a different address (which is inefficient). NVIDIA GPUs use a SIMD size of 32 lanes.
Store the lookup table in shared memory with each entry padded to 4 bytes and then duplicated 32 times in sequence (32 KB total). Declare the shared memory as a 4-byte array, and when reading, multiply the index by 32 and add the lane ID.
The shared memory has 32 banks of 4 bytes each and this will avoid bank conflicts (each bank can perform one 4-byte read or write per cycle per SM). The bank number for an address is the address divided by 4 modulo 32. All memory operations are SIMD with a size of 32 lanes.
The kernel launch parameters should have two blocks per SM and as many threads per block as will fit without register spilling (threads per block should be a multiple of 32).
On a 1080 Ti at 1500 MHz (it has 28 SMs), you should be able to execute 1500e6 * 28 * 32 = 1.34 trillion lookups per second. The 2080 Ti should have double that performance (assuming shared memory is the bottleneck).
You can view the actual code the GPU is executing with the nvdisasm utility. The "CUDA Binary Utilities" section of the CUDA documentation has a description of the instruction set.
I'd still chose a FPGA to implement an encryption accelerator, it's so inelegant to use a chip designed for floating point operations to do nothing but LUTs.
Bitcoin miners don't agree with you.
Thank you for all your good feedback
I've done some tests in CUDA and I get better results on the S-box part in small tests, but I still have to investigate further. However, I also have to do a 64-bit in/out permutation in each S-box round. In the FPGA world a permutation is just crossed wires = essentially free.
Are there any good tricks to do that in CUDA code?
Right now I test each bit and set the output bit = 64 steps.
I know it's possible to do some bit-slicing and do 64 tests per step, but that's still not so easy.
It would be great if there were some kind of instruction that let you program a mini custom crossbar matrix (like the FPGA does with routing).
Bitcoin mining evolution was CPU mining => GPU mining => FPGA mining => ASIC mining.
There's a byte permutation instruction, but nothing for permuting bits:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-prmt
There is also a bit-reverse instruction (brev), and instructions for inserting and extracting bit fields (bfi and bfe).
There might be some way to optimize the specific permutation that you're using.
Nice instructions, thank you
It would still be great to see one with a user-defined crossbar, wiring any output bit to any input bit, but I guess that's what the FPGA does way better than anything else.