You can use your own clock. Given that internal clocks in FPGAs can be pushed up to 400 MHz ... even if you need 10 clock ticks to calculate...
Anyway, what you need can be solved in a different way. You are not interested in the absolute bit count; you only need to know whether there are more 1s than 0s, so you can use a modified XOR tree.
Look at two adjacent bits. If they are 01 or 10, eliminate them: they balance each other out.
You need to build a block that has 2 outputs. One output tells you 'BALANCED'; the other tells you whether the pair is 0-heavy or 1-heavy.
You need 4 of these blocks. This reduces the 8 input bits to 4 pair results.
Now build a block that looks at two adjacent pair results. If both are balanced, set an output 'balanced'. If they are not balanced, check whether one is 0-heavy and the other 1-heavy, and strike them out as balanced (an XOR gate). If the result is still imbalanced, it will be 0-heavy or 1-heavy.
This output essentially tells you the result for 4 input bits (no need to look at all 8).
The outputs tell you that the block is either internally balanced, leaning to 1, or leaning to 0.
At the top level you only need to compare 2 blocks. If both yield 'balanced', or one is leaning 1 and the other leaning 0, you decide based on D0. In all other cases you take the 'leaning' result.
I'd have to work it out, but I think all you need is about 6 XOR gates and some AND and OR gates.
You can reduce the logic because you do not need the absolute bit count as a numerical value. You are only interested in whether the 1s and 0s are balanced, leaning towards 1, or leaning towards 0.
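To try the idea out before drawing gates, here is a rough Python model of the three-way reduction described above. The names (`pair`, `merge`, `tree_lean`, `exact_lean`, `mismatches`) and the -1/0/+1 encoding are made up for this sketch. Since each stage keeps only a three-way summary and not how far a block leans, it should be checked against a plain count over all 256 input bytes; `exact_lean` and `mismatches` are there for that.

```python
# Three-way states; the signed encoding is an assumption of this sketch.
BALANCED, LEAN0, LEAN1 = 0, -1, 1

def pair(a, b):
    """First-level block: 01 or 10 cancel out, 00 leans 0, 11 leans 1."""
    if a != b:
        return BALANCED
    return LEAN1 if a else LEAN0

def merge(x, y):
    """Higher-level block: opposite leans cancel, otherwise keep the lean."""
    s = x + y
    return (s > 0) - (s < 0)

def tree_lean(bits):
    """Reduce 8 bits to BALANCED / LEAN0 / LEAN1 using the block tree."""
    p = [pair(bits[i], bits[i + 1]) for i in range(0, 8, 2)]
    return merge(merge(p[0], p[1]), merge(p[2], p[3]))

def exact_lean(bits):
    """Reference answer from a plain count of ones versus zeros."""
    s = 2 * sum(bits) - len(bits)
    return (s > 0) - (s < 0)

def mismatches():
    """Byte values where the tree disagrees with the plain count."""
    out = []
    for v in range(256):
        bits = [(v >> i) & 1 for i in range(8)]
        if tree_lean(bits) != exact_lean(bits):
            out.append(v)
    return out
```

Running `mismatches()` shows where the three-way summary falls short. For example, the bits 0 0 0 0 1 1 1 0 come out of the tree as balanced even though zeros outnumber ones 5 to 3, so the real blocks probably need to carry a bit more than a plain lean.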
Now, here is another thing. This is for digital video. The bits are transmitted serially, so you can do away with the shifter altogether. All you need is a counter. Drive the enable with the incoming bitstream, synchronize with the bit clock, and by the time the last bit is received you have your answer.
Sync a state machine to D7, and the moment D0 is in you know the answer (plus some propagation time, but well before the next clock tick).
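The serial counter idea can be modelled in a few lines of Python. This is only a behavioural sketch, not HDL: `serial_majority` is a made-up name, the loop stands in for the bit clock, and the tie case is just reported so that the caller can decide it from D0 as described earlier.

```python
def serial_majority(bits):
    """Model a counter whose enable is driven by the incoming bitstream.

    bits: the 8 data bits (0 or 1) in the order they arrive.
    Returns "ones", "zeros" or "tie"; a tie is left for the caller
    to resolve (for example from D0).
    """
    ones = 0
    received = 0
    for b in bits:          # one loop pass per bit clock tick
        if b:               # the bit itself acts as the count enable
            ones += 1
        received += 1
    zeros = received - ones
    if ones > zeros:
        return "ones"
    if ones < zeros:
        return "zeros"
    return "tie"

# Example: five ones and three zeros
print(serial_majority([1, 1, 0, 1, 0, 1, 1, 0]))
```

The answer is available as soon as the last bit has been clocked in, with no shift register and no adder tree.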
Now, that being said: how are you going to capture this stream? You will need a bloody fast FPGA to handle the bitstream. TMDS uses a 165 MHz I/O clock. Guess what: it won't be a PLCC44 on a double-sided board. FPGAs that can handle that are BGA and need 6- to 12-layer boards, matched impedance and tuned line lengths. And they need differential transceiver cells. And you are trying to design this as a schematic (you don't know an HDL, apparently). Pardon my French, but: FAIL!
Second problem: Silicon Image holds the rights to TMDS. If you build a decoder and publish it, you will have to fork over some serious dough to them.