This is probably a bit 'after the fact' in looking at the dates on replies, but here's my $0.02.
I actually needed this very thing (counting ones in a reg) some time back, only for an actual ASIC. You hate to use a LUT, as that takes real estate. A shift reg is the most obvious answer (small logic, easy to implement)... until you're counting 32 or 64 bit regs. Then it gets very expensive in latency, since you burn a clock per bit, and I couldn't afford that. So that set me on a hunt. I think I ran across the very same link someone else pointed out earlier about how many algorithms there are to count 1's. I played around with a couple of variants and synthesized each to see what they really looked like. (Yes, I'm working in an HDL... Verilog, to be specific.)
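For anyone who hasn't seen the shift reg version, here's a rough sketch of what I mean, just to show where the latency comes from. The module name, port names and the start/done handshake are all made up for illustration (not from any real design), and I'm assuming an active-low async reset.

```systemverilog
// Hypothetical sketch: serial ones counter, one bit examined per clock.
// All names here (popcount_serial, start, done, ...) are illustrative assumptions.
module popcount_serial #(
    parameter  int WIDTH = 32,
    localparam int CW    = $clog2(WIDTH + 1)  // enough bits to hold 0..WIDTH
) (
    input  logic             clk,
    input  logic             rst_n,   // assumed active-low async reset
    input  logic             start,   // pulse: load data and begin counting
    input  logic [WIDTH-1:0] data,
    output logic [CW-1:0]    count,
    output logic             done     // goes high once count is valid
);
    logic [WIDTH-1:0] shreg;       // working copy that gets shifted out
    logic [CW-1:0]    remaining;   // number of bits still to examine

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            shreg     <= '0;
            count     <= '0;
            remaining <= '0;
            done      <= 1'b0;
        end else if (start) begin
            shreg     <= data;
            count     <= '0;
            remaining <= CW'(WIDTH);
            done      <= 1'b0;
        end else if (remaining != 0) begin
            count     <= count + shreg[0];   // add the LSB, then shift it away
            shreg     <= shreg >> 1;
            remaining <= remaining - 1'b1;
            done      <= (remaining == 1);   // last bit is being consumed now
        end
    end
endmodule
```

Tiny logic, but the result shows up WIDTH clocks after start, which is exactly the cost I was trying to avoid.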
The one that gave me the best solution, trading off area while doing it in a single clock cycle and still meeting timing, was a simple for loop. (Yes, for loops are a synthesizable construct... if used properly.)
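// sorry, I'll probably wind up mixing in some SystemVerilog syntax, but you'll get the gist
// assumes my_reg is declared elsewhere as a 32 bit reg and count as a 6 bit reg
always_comb
begin
    count = 6'b0; // counter is this big because I'm counting a 32 bit reg (result can be 0 to 32)
    for (integer i=0; i<32; i=i+1)
        count = count + my_reg[i]; // add each bit of my_reg into the running total
end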
This is basically the same code as someone else wrote out from a Xilinx example. It's just that this one stays neater and tighter from a syntax perspective when the number of bits you're counting grows.
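To show what I mean about it staying tidy as the width grows, here's a minimal sketch of the same loop wrapped in a parameterized module. The module and port names are made up for illustration, and sizing the counter with $clog2 is my own choice here, not something from the original block.

```systemverilog
// Hypothetical sketch: the same for loop, with the width as a parameter.
// popcount_comb, data and count are illustrative names.
module popcount_comb #(
    parameter int WIDTH = 32
) (
    input  logic [WIDTH-1:0]             data,
    output logic [$clog2(WIDTH + 1)-1:0] count  // holds 0..WIDTH
);
    always_comb begin
        count = '0;
        // The loop bound is a constant, so synthesis can unroll it
        // into a purely combinational chain of adders.
        for (int i = 0; i < WIDTH; i++)
            count = count + data[i];
    end
endmodule
```

Going from 32 to 64 bits is just a matter of instantiating it with #(.WIDTH(64)); the counter grows to 7 bits on its own. SystemVerilog also defines a $countones system function, but whether your synthesis tool accepts it is something you'd have to check.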
Soooo, what does that give me? Well, it actually winds up looking very similar to what you built by hand.

I can't be certain of the exact logic constructed, but it was similar enough that I'd say you're spot on (lots of XOR blocks staged serially). The difference here is that coding it in HDL lessens the strain on the grey matter. And when your block is a few hundred thousand gates, you can't afford to be drawing stuff out by hand.
A few other tidbits to take note of: if you haven't experienced it already, the actual clock speed you can get the FPGA to run at depends on how much utilization you have and on the congestion / routing issues (as others have mentioned).
Yes, you can very likely get the IO / serdes / whatever to run at speed, but the internal logic can sometimes be a challenge to get up to speed. I worked on a project that we stuffed into a very large, very expensive FPGA that had the serdes running at 1.5 Gbps, but it was a stretch to get the internal logic to run at anything faster than 35 MHz.
EDIT: DOH! hit post and realized the html conflicts with the Verilog syntax I was trying to type.