I would have thought that the optimal way would be to look at the underlying FPGA architecture, find the widest input function that fits in a single slice (8 inputs on Xilinx 7-series logic), then just compress 8 bits down to two (00 = no bits set, 01 = exactly one bit, 11 = more than one bit; apologies for the VHDL):

    case testvalue_8_bits is
        when "00000000" => result <= "00";
        when "00000001" => result <= "01";
        when "00000010" => result <= "01";
        when "00000100" => result <= "01";
        when "00001000" => result <= "01";
        when "00010000" => result <= "01";
        when "00100000" => result <= "01";
        when "01000000" => result <= "01";
        when "10000000" => result <= "01";
        when others     => result <= "11";
    end case;

Have as many of these as you need for the width of your input, concatenate the "result" outputs together and repeat, until you get down to two bits. The same case statement works at every level, because the codes 00, 01 and 11 have zero, one and two bits set respectively, so counting the bits of the concatenated results still answers the original question. A bit like numerology, where you keep summing digits to find your birth number.
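As a sanity check of the reduction tree, here is a software model (Python rather than VHDL, so it can be run directly; compress8 and classify are names I made up for this sketch, not anything in the hardware):

```python
def compress8(byte: int) -> int:
    """Model of the 8-input, 2-output LUT block:
    0b00 = no bits set, 0b01 = exactly one, 0b11 = more than one."""
    ones = bin(byte & 0xFF).count("1")
    if ones == 0:
        return 0b00
    if ones == 1:
        return 0b01
    return 0b11

def classify(value: int, width: int) -> int:
    """Apply compress8 to each 8-bit chunk, concatenate the 2-bit
    results, and repeat until a single 2-bit code remains."""
    bits, n = value, width
    while True:
        chunks = (n + 7) // 8          # number of LUT blocks this level
        out = 0
        for i in range(chunks):
            out |= compress8((bits >> (8 * i)) & 0xFF) << (2 * i)
        bits, n = out, 2 * chunks      # each level compresses 4:1
        if n == 2:
            return bits
```

Note that a "10" code can never appear at any level, which is why the one case statement covers both the raw input and the intermediate results.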

That would be one slice and one level of logic for values up to 8 bits, two levels of logic and five slices for 32-bit values (four first-level blocks plus one to combine them), and three levels of logic for values up to 128 bits.

I would have guessed it to be about the same resource usage as the original (x & (x-1)) == 0 approach, but much faster, and without the corner cases around 0 and 1.
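For reference, the corner case I mean is that (x & (x-1)) == 0 is also true for x == 0, so on its own it answers "at most one bit set" rather than "exactly one bit set" (a quick Python illustration; the helper names here are just for this sketch):

```python
def at_most_one_bit(x: int) -> bool:
    # True for 0 and for every power of two.
    return x & (x - 1) == 0

def exactly_one_bit(x: int) -> bool:
    # The extra zero test is the corner case the trick needs.
    return x != 0 and x & (x - 1) == 0
```

The LUT tree above gives both answers for free, since it distinguishes 00 (no bits) from 01 (one bit) directly.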