A lot depends on your application - what do you want to count, and why?
The quickest way to count 'something' is to combine techniques: a very fast front end to prescale the input, followed by a longer (slower) counter to give you lots of digits. For example, you might use a high-speed transceiver to capture the input 40 bits at a time, and then have logic behind it count the number of 0-to-1 transitions in each captured word.
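To make the transceiver idea concrete, here is a rough software model in Python (not HDL) of the logic that would sit behind the transceiver. It assumes the earliest-received bit lands in bit 0 of each 40-bit word, which depends on how your transceiver is configured; the function name and constants are made up for illustration.

```python
WORD_BITS = 40                      # width of each captured word (from the example above)
MASK = (1 << WORD_BITS) - 1

def count_rising_edges(words, prev_bit=0):
    """Count 0-to-1 transitions in a stream of captured words.

    Assumption: bit 0 of each word is the earliest sample, bit 39 the latest.
    prev_bit is the last sample of the word before the first one (0 if unknown).
    """
    total = 0
    for w in words:
        w &= MASK
        # Line up each sample with the sample taken just before it.
        earlier = ((w << 1) | prev_bit) & MASK
        # A rising edge is a 1 whose previous sample was 0.
        rising = w & ~earlier & MASK
        total += bin(rising).count("1")
        # Carry the last sample into the next word so edges on the
        # word boundary are neither missed nor counted twice.
        prev_bit = (w >> (WORD_BITS - 1)) & 1
    return total
```

In hardware the same thing would be a 40-bit AND/NOT stage, a population count, and an accumulator running at the (much slower) parallel word clock. Note that this only sees edges that are at least one sample period apart.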
The CLB logic has to be quicker than the clocking network that supports it, so it is usually the clocking network's performance that is the limiting factor - if it is something else, like carry chain length, then you are doing it wrong. This also hints that if you are only using CLBs for your counter, you should try to avoid the logic's clocking networks at the very fast part of the counter - for example, maybe route the input signal directly onto a flip-flop's CLK input, rather than through clock buffers.
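As a sketch of how the fast and slow parts fit together, here is a behavioural Python model (again not HDL, and it says nothing about timing). The class name, the 4-bit prescaler width, and the method names are all assumptions chosen for illustration: a few flip-flops clocked straight from the input form the fast stage, and a wide counter only advances when that stage rolls over.

```python
class PrescaledCounter:
    """Behavioural model: small fast prescaler feeding a wide slow counter."""

    def __init__(self, prescale_bits=4):
        self.prescale_bits = prescale_bits   # hypothetical width of the fast stage
        self.fast = 0   # fast stage, clocked directly by the input signal
        self.slow = 0   # wide counter, advances at 1/2**prescale_bits of the input rate

    def input_edge(self):
        """Model one rising edge on the input."""
        self.fast = (self.fast + 1) & ((1 << self.prescale_bits) - 1)
        if self.fast == 0:          # fast stage rolled over
            self.slow += 1          # only now does the slow logic have to move

    def value(self):
        """Full count: slow counter supplies the high bits, prescaler the low bits."""
        return (self.slow << self.prescale_bits) | self.fast
```

With a 4-bit prescaler the wide counter only has to keep up with one sixteenth of the input rate, so its carry chain and clock distribution are no longer the bottleneck. In real hardware you would still need to read the two parts out consistently (for instance by gating the input before sampling), which this model ignores.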