I'm working on an algorithm project where each round has an S-box with 8-bit input/output (256 bytes). It has to read the S-box on every round and do some XORs on every round as well.
So in software you typically have an array holding the S-box and read out the result for the input you need.
I want to run as many parallel instances of this algorithm as possible to get the highest possible throughput.
So I have been looking at CUDA on a GPU, which is a logical way to get many cores. The problem is that memory access is a bottleneck for a CUDA core, since the cores share the RAM (in various ways); I guess the "texture" memory is best suited for this kind of job.
But the RAM access makes it slow, since 50% of the code is S-box lookups, and I guess there is a lot of wait time due to collisions when the cores want to read data from the shared RAM.
So now for the question:
On Microchip PIC chips it's quite normal to do a lookup this way:
movf count, w ;put value in W
call Table1
Table1 ADDWF PCL, f ;data table for bit pattern
retlw b'10000000'
retlw b'01000000'
retlw b'00100000'
retlw b'00010000'
retlw b'00001000'
retlw b'00000100'
retlw b'00000010'
retlw b'00000001'
retlw b'00000010'
retlw b'00000100'
retlw b'00001000'
retlw b'00010000'
retlw b'00100000'
retlw b'01000000'
..
..
This method doesn't use any RAM; it just adds a little to the program ROM.
Is it possible to do something similar in CUDA on a GPU?
Use an FPGA. Nothing else will give you as many independent parallel execution units as an FPGA.
You could try to compute the S-box result instead of using a lookup table; that could be faster on a GPU.
On a GPU, it will most likely be faster indeed.
Sounds definitely doable with CUDA. Learning how to use CUDA properly is not completely trivial though.
Yeah, not trivial at all
Unfortunately the S-box doesn't have a formula behind its layout.
It's more like it was constructed by a game of darts.
I have been running different software to convert the lookup table to logic, and I also tried a pure-logic approach in the FPGA vs. the BRAM (ROM), letting the FPGA compiler squeeze the logic; it always uses LUTs equivalent to full decoding.
The BRAM has an fmax limit way lower than the logic fmax, and to even run the BRAM at its max fmax you need to pipeline all the surrounding logic, which eats a lot of registers (reducing the number of cores).
Also, FPGAs get expensive when you go for big sizes, and they are not as easily accessible as a CUDA GPU.
That's why I'm trying to use a hammer to get some better results on the GPU.
The Visual Profiler is a good tool to see a bit of what is happening while the code runs.
I will do some more experiments on the GPU and the local "shared" memory to see if I can get it to run more smoothly; at least I'm learning some new tricks.
Thank you for all the ideas
The $15 Xilinx Spartan-7 XC7S6 has 900 slices. That's enough for 100+ S-boxes. If you manage to run them at 400 MHz, this gives you 40G S-boxes per second. 500 MHz is doable on -2 speed grades, and even 600 MHz may be feasible, depending on your other logic. That's hard to beat with a GPU.
BRAM, of course, is slower, but a 36k block can be made into two 18k blocks, each of which is dual-ported, so you can host 4 S-boxes per BRAM. If you pipeline it and run it at 300 MHz (faster may be possible), that's 1.2G S-boxes per second for each 36k BRAM block. The smallest S-7 has only 5 BRAM blocks though. But if you go bigger, the XC7S25 (about $30) has 45 BRAM blocks: 54G S-boxes per second in BRAM alone, with 3500 slices left for other needs; though if you use all of the logic you may get cooling problems.
How fast do you need it?
Buy a CUDA board like the JK1 and forget FPGAs; they're not appropriate in this case.
Perhaps you're right. Modern GPUs may have become dramatically more capable since the last time I looked.
GPUs are made in such a way that the individual processors don't have to compete for data access.
They are made for graphics work, which has high locality of reference in the vertex and texture caches for the bulk of the data. Despite all the MIPS of modern GPUs, I'd still choose an FPGA to implement an encryption accelerator; it's so inelegant to use a chip designed for floating-point operations to do nothing but LUTs.
This sounds very fun, but it really depends on exactly what you want to achieve... To get good CUDA performance you need to ensure there is always sufficient data flowing, which in turn means optimising cache access and the movement of data from the host to the card. For applications like graphics transformations, where the same algorithm operates across the whole image, this is easier to achieve than when access is more ad hoc. As you have pointed out, the cost of moving data to/from the card is high (although it is decreasing as architectures evolve).
There is tons of literature out there for say AES on GPU. But also some more recent FPGA stuff...
https://www.nallatech.com/40gbit-aes-encryption-using-opencl-and-fpgas/
If it is for fun, the GPU is the way to go, as it is super cheap to get started... if you are trying to break NSA encryption then perhaps it's worth looking at FPGAs.
PS
It's been a while since I wrote any CUDA, but at the time the memory hierarchy looked like this:
https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/
It is vital to get memory access right, otherwise you won't achieve the occupancy goals you seek.
I have been using the Altera Cyclone V E A9 with 301K LEs and 1220 M10K blocks. Used as dual-port ROM that would give 2440 S-boxes, with an fmax of 240 MHz for the M10K (if registered on both input and output). I have managed to fit 32 cores of 56 S-boxes = 1792 S-boxes, plus a lot of LUT logic... the chip is about 93% full.
All of this can run at 200 MHz (yes, I have cooling on the FPGA) = 358G S-boxes/s, which is very fine for an FPGA.
My goal was to test 2^48 keys in less than 10 seconds; right now it takes 12 hours, so it would need a lot of FPGAs.
This is a project for fun and learning so that's why I try to see if there are other solutions like GPU
I will do some more reading / testing
For AES encryption on Xilinx's 7-series, if you fully pipeline everything, you get 2 pipeline stages per round. With the 14 rounds of AES-256 that is 28 pipeline stages, each requiring 128 slices, so 3584 slices plus 16 slices for the initial key XOR = 3600 slices. I'm pretty confident you can run this at 400 MHz, most likely faster, which gives you 400M blocks/s per line. If you use one block per try, to get under 10 seconds you would need roughly 70,000 lines working in parallel.
If you do decryption, it is more complex and will take even more fabric. And if you also do key expansion...
Good luck with the GPU. Maybe they have really fast ones now; they need to deal with 8K displays after all. However, I doubt it can get you anywhere close to under 10 seconds.
You can't use a jump table because each warp lane will jump to a different address (which is inefficient). NVIDIA GPUs use a SIMD size of 32 lanes.
Store the lookup table in shared memory with each entry padded to 4 bytes and then duplicated 32 times in sequence (32 KB total). Declare the shared memory as a 4-byte array, and when reading, multiply the index by 32 and add the lane ID.
The shared memory has 32 banks of 4 bytes each and this will avoid bank conflicts (each bank can perform one 4-byte read or write per cycle per SM). The bank number for an address is the address divided by 4 modulo 32. All memory operations are SIMD with a size of 32 lanes.
The kernel launch parameters should have two blocks per SM and as many threads per block as will fit without register spilling (threads per block should be a multiple of 32).
On a 1080 Ti at 1500 MHz (it has 28 SMs), you should be able to execute 1500e6 * 28 * 32 = 1.34 trillion lookups per second. The 2080 Ti should have double that performance (assuming shared memory is the bottleneck).
You can view the actual code the GPU is executing with the nvdisasm utility. The "CUDA Binary Utilities" section of the CUDA documentation has a description of the instruction set.
I'd still chose a FPGA to implement an encryption accelerator, it's so inelegant to use a chip designed for floating point operations to do nothing but LUTs.
Bitcoin miners don't agree with you.
Thank you for all your good feedback
I've done some tests in CUDA and I get better results on the S-box part in small tests, but I still have to investigate further. However, I also have to do a 64-bit in/out permutation in each S-box round. In the FPGA world a permutation is just crossed wires = essentially free.
Are there any good tricks to do that in CUDA code?
Right now I test each bit and set the output bit = 64 steps.
I know it's possible to do some bit-slicing and do 64 tests per step, but that's still not so easy.
It would be great if there were some kind of instruction that let you program a mini custom crossbar matrix (like the FPGA does with routing).
Bitcoin mining evolution was CPU mining => GPU mining => FPGA mining => ASIC mining.
There's a byte permutation instruction, but nothing for permuting bits:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-prmt
There is also a bit-reverse instruction (brev), and instructions for inserting and extracting bit fields (bfi and bfe).
There might be some way to optimize the specific permutation that you're using.
Nice instructions, thank you
It would still be great to see one with a user-defined crossbar, wiring any output bit to any input bit, but I guess that's what the FPGA does way better than anything else.