To speed this one up with pipelining, here is how I would do it:
(a) Compare all inputs with each other and generate 4 sets of 2-bit selection flags/words, and in parallel store all 4 inputs in D flip-flop registers.
(b) The outputs of all 4 D flip-flop registers would feed 4 x 4:1 mux selection units, each unit receiving its 2-bit selection flags, generating 4 sorted outputs.
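Steps (a) and (b) can be modelled in software to check the logic before writing any HDL. This is only a behavioural sketch in Python, not the poster's actual implementation: the function names are made up for illustration, and deriving each 2-bit selection word from a per-input rank is just one assumed way of turning the comparison results into MUX selects.

```python
from itertools import combinations

def compare_all(inputs):
    """Step (a): the 6 pairwise comparison flags for 4 inputs.

    flags[(i, j)] is True when inputs[i] > inputs[j], for i < j.
    """
    return {(i, j): inputs[i] > inputs[j] for i, j in combinations(range(4), 2)}

def selection_words(flags):
    """Turn the comparison flags into four selection words (each fits in 2 bits).

    rank[i]   = number of inputs that sort before input i (ties broken by index).
    select[k] = index of the input that belongs on sorted output k.
    """
    rank = [0] * 4
    for (i, j), i_greater in flags.items():
        if i_greater:
            rank[i] += 1  # input j sorts before input i
        else:
            rank[j] += 1  # input i sorts before input j (also on ties)
    select = [0] * 4
    for i, r in enumerate(rank):
        select[r] = i
    return select

def sort4(inputs):
    """Step (b): four 4:1 muxes, each steered by its own selection word."""
    select = selection_words(compare_all(inputs))
    return [inputs[s] for s in select]

print(sort4([7, 2, 9, 4]))  # [2, 4, 7, 9]
```

Because ties are broken by input index, the four ranks are always a permutation of 0..3, so every output MUX gets exactly one input.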
About pipelining.
The speed of combinatorial logic depends on the number of layers. The simplest design has only one layer:
IN->LUT->OUT
Of course, there may be many parallel paths like that, but all LUTs are fed directly from the input. This makes it the fastest.
Then you introduce LUTs which depend on the values produced by other LUTs, like this:
IN->LUT->LUT->OUT
Here you have two layers of LUTs. You need to wait until the LUTs of the first layer settle and provide stable outputs to the LUTs of the second layer. Then you must wait for the LUTs of the second layer. Therefore, it takes longer. Each layer adds roughly 0.7ns on Xilinx.
The design we're discussing has 4 layers:
IN->LUT->LUT->LUT->LUT->OUT
Only the longest path affects the overall speed. For example, in this design there's a shorter path which goes from the input to the final MUX. It only has one LUT. It could be done faster, but the presence of longer paths doesn't let the design run faster. The speed is roughly determined by the number of layers on the longest path.
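As a rough back-of-the-envelope model of that rule, the minimum clock period is the depth of the deepest path times the per-layer delay. The sketch below uses the 0.7ns-per-layer figure quoted above and deliberately ignores routing delay, flip-flop setup and clock-to-output times, so the numbers are only indicative. The function names and the list of path depths are made up for illustration.

```python
LAYER_DELAY_PS = 700  # roughly 0.7 ns per LUT layer, the figure quoted above

def critical_depth(path_depths):
    """Number of LUT layers on the longest path."""
    return max(path_depths)

def min_period_ps(path_depths, layer_delay_ps=LAYER_DELAY_PS):
    """Idealized minimum clock period: only the deepest path matters."""
    return critical_depth(path_depths) * layer_delay_ps

# Hypothetical path list for the design discussed here: the 4-layer path
# through the comparisons, and the 1-layer path from input to the final MUX.
paths = [4, 1]
print(min_period_ps(paths))  # 2800
```

Making the 1-layer path any faster changes nothing; the 4-layer path still sets the period.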
Any combinatorial design can be pipelined.
You don't do it as AndyC suggested by splitting things which already can run in parallel. You do it by inserting flip-flops between combinatorial layers:
IN->LUT->LUT->FF->LUT->LUT->OUT
Now the clock doesn't need to wait for all four layers to complete. Once two layers are done, the flip-flop can clock and remember the intermediate result. On the next clock, the next two layers of LUTs will finish the job. You turned a 4-layer design into a 2-layer design, but now there's one clock of delay.
One flip-flop must be inserted in every path, be it a simple wire or a LUT.
To maximize the clock speed, you need to minimize the number of layers in the deepest stage. This is done by inserting the flip-flops exactly in the middle of the LUT chain. In the example above, two layers go before the flip-flop and two layers go after it.
You don't do it as BrianHG did:
IN->LUT->LUT->LUT->FF->LUT->OUT
In his design, he put 3 layers (2 layers of comparison and one layer to generate MUX inputs) before the flip-flops, and only one layer (MUX) after the flip-flops. If you do this, the first stage will be a 3-layer design and the second stage will be a 1-layer design. Since both stages are driven by the same clock, the overall design is still 3-layer. It is faster than the 4-layer design, but it is slower than a 2-layer design.
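The effect of flip-flop placement can be checked with a few lines of Python. This is a minimal sketch under the same idealized "count the LUT layers" model used in this post; the function names and the way flip-flop positions are encoded are assumptions made for illustration.

```python
def stage_depths(total_layers, ff_positions):
    """Split a chain of LUT layers into pipeline stages.

    ff_positions lists, in increasing order, how many LUT layers precede
    each row of flip-flops. Returns the number of layers in each stage.
    """
    bounds = [0] + list(ff_positions) + [total_layers]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def limiting_depth(total_layers, ff_positions):
    """All stages share one clock, so the deepest stage sets the speed."""
    return max(stage_depths(total_layers, ff_positions))

print(stage_depths(4, [3]), limiting_depth(4, [3]))  # [3, 1] 3
print(stage_depths(4, [2]), limiting_depth(4, [2]))  # [2, 2] 2
```

The 3+1 split and the 2+2 split use the same number of pipeline stages and the same one clock of delay; only the balanced one brings the deepest stage down to 2 layers.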
To get a 2-layer design you need this:
IN->LUT->LUT->FF->LUT->LUT->OUT
Which means the 2 layers of comparisons go before the flip-flops, and everything else goes after, like this:
Stage 1. 6 bits of comparison results are saved using 6 flip-flops. Since flip-flops must go into every path, we also need 32 flip-flops to save the original inputs.
Stage 2. MUX input is generated from comparison results (one layer) and MUX selects the appropriate input (second layer).
This produces a fast 2-layer design.
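A clock-by-clock software model of this two-stage split might look like the following. It is a behavioural Python sketch, not HDL and not anyone's actual design from this thread: the class name, the rank-based way of generating the MUX selects, and the use of None for the power-up register contents are all assumptions made for illustration.

```python
from itertools import combinations

PAIRS = list(combinations(range(4), 2))  # the 6 input pairs to compare

class PipelinedSort4:
    """Stage 1 = comparisons, pipeline registers, stage 2 = select generation + MUX."""

    def __init__(self):
        # Pipeline registers between the stages (None models "nothing captured yet").
        self.cmp_reg = None   # the 6 comparison bits
        self.data_reg = None  # copy of the 4 original inputs

    def clock(self, inputs):
        """Model one clock cycle with `inputs` applied.

        Returns what stage 2 outputs during this cycle (the result for the
        previous cycle's inputs), then captures stage 1's results in the registers.
        """
        out = self._stage2()
        # Stage 1: all six comparisons, plus a registered copy of the inputs.
        self.cmp_reg = [inputs[i] > inputs[j] for i, j in PAIRS]
        self.data_reg = list(inputs)
        return out

    def _stage2(self):
        if self.data_reg is None:
            return None
        # First layer of stage 2: MUX selects from the comparison bits.
        rank = [0] * 4
        for (i, j), i_greater in zip(PAIRS, self.cmp_reg):
            if i_greater:
                rank[i] += 1
            else:
                rank[j] += 1
        select = [0] * 4
        for i, r in enumerate(rank):
            select[r] = i
        # Second layer of stage 2: the four 4:1 MUXes.
        return [self.data_reg[s] for s in select]

p = PipelinedSort4()
print(p.clock([7, 2, 9, 4]))  # None (nothing in the pipeline yet)
print(p.clock([1, 0, 3, 2]))  # [2, 4, 7, 9]
print(p.clock([0, 0, 0, 0]))  # [0, 1, 2, 3]
```

Each call accepts a new set of inputs, and the sorted result for a given set appears one call later, which is the one clock of delay described above.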
We can pipeline even further:
IN->LUT->FF->LUT->FF->LUT->FF->LUT->OUT
Now we've got a one-layer design, which is as fast as it gets, but you need to wait 3 extra clocks to get the result. Also, this will be tedious to program - you'll have to pipeline the comparison operations.
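To put rough numbers on that trade-off, here is a small Python sketch that tabulates stage depth, idealized clock period, and extra clocks of delay for the 4-layer design. It assumes the 0.7ns-per-layer figure quoted earlier, an even split of layers between stages, and zero flip-flop and routing overhead, none of which holds exactly on real hardware. The function name is made up for illustration.

```python
LAYER_DELAY_PS = 700  # ~0.7 ns per LUT layer, as quoted above; an approximation

def pipeline_figures(total_layers, ff_rows, layer_delay_ps=LAYER_DELAY_PS):
    """Return (deepest stage in layers, clock period in ps, extra clocks of delay).

    Assumes the flip-flop rows split the LUT chain as evenly as possible.
    """
    stages = ff_rows + 1
    deepest = -(-total_layers // stages)  # ceiling division
    period_ps = deepest * layer_delay_ps
    return deepest, period_ps, ff_rows

for ff_rows in (0, 1, 3):
    print(ff_rows, pipeline_figures(4, ff_rows))
# 0 (4, 2800, 0)
# 1 (2, 1400, 1)
# 3 (1, 700, 3)
```

In this idealized model the time from input to result stays at about 2800ps in all three cases; what improves with deeper pipelining is how often a new set of inputs can be accepted.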