Hey, fun exercise!
...
This is using ISE 14.7 and targeting a spartan-6: xc6slx45-2csg324. The design is pipelined, 3 stages.
Fun indeed. I have access to fully licensed tools, so I might have a slight edge here (possibly some extra options/strategies unlocked), but I'm not running SmartXplorer to squeeze the last few % out of the design, and yet there appears to be a lot of slack left in the attempts so far.
ISE 14.7, xc6slx45-2csg324:
  Minimum area, combinatorial only: 58 LUTs
  Logical pipeline of 3 stages: 106 LUTs, >440 MHz
  Fully pipelined with 4 stages: 118 LUTs, >540 MHz
  (requires using both edges of clock)

ISE 14.7, xc7a100t-2csg324:
  Minimum area, combinatorial only: 58 LUTs
  Logical pipeline of 3 stages: 132 LUTs, >580 MHz
  Fully pipelined with 4 stages: 141 LUTs, >580 MHz
  (both switching limited)

Vivado X.X, xc7a35t-2csg324:
  Minimum area, combinatorial only: 76 LUTs
  Logical pipeline of 3 stages: 114 LUTs, >380 MHz
  Fully pipelined with 4 stages: 147 LUTs, >400 MHz

Vivado X.X, xc7a100t-2csg324:
  Minimum area, combinatorial only: 76 LUTs
  Logical pipeline of 3 stages: 114 LUTs, >380 MHz
  Fully pipelined with 4 stages: 147 LUTs, >410 MHz

It's known that ISE can do a better synthesis job on many designs, but it's orphaned for device support now and will be harder to use going forward. 7 series parts are easily 50-100% faster than Spartan 6, though, so many designs need to be reassessed for area/speed tradeoff anyway and can be adapted to the new Vivado synthesis at the same time. The results above use a sort algorithm better suited to FPGA implementation, but it is still written as a high level functional description in VHDL. So it's not necessary to get down to gate level descriptions; rather, knowing how the code maps to resources allows you to design for minimum area while still using high level constructs.
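
As a rough illustration of what "high level" means here, a registered compare-and-swap element could look something like the sketch below. This is not the code behind the numbers above. The entity name, port names, and the WIDTH generic are made up for the example, and the data width of the original exercise is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: one registered compare-and-swap element.
-- Entity name, port names and WIDTH are assumptions for illustration only.
entity cmp_swap is
  generic (WIDTH : positive := 8);
  port (
    clk    : in  std_logic;
    a, b   : in  unsigned(WIDTH-1 downto 0);
    lo, hi : out unsigned(WIDTH-1 downto 0)
  );
end entity cmp_swap;

architecture rtl of cmp_swap is
begin
  -- One pipeline stage: compare the inputs and register them in sorted order.
  process (clk)
  begin
    if rising_edge(clk) then
      if a > b then
        lo <= b;  -- swap: smaller value to lo
        hi <= a;
      else
        lo <= a;  -- already in order
        hi <= b;
      end if;
    end if;
  end process;
end architecture rtl;
```

Nothing here is gate level: the comparison and the two multiplexers are left to the synthesis tool, which will typically build them from LUTs and possibly the carry chain, depending on the tool and its settings. Chaining elements like this, one per pipeline stage, is one way to trade LUTs and registers against clock rate.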