The output indicating whether there is a zero or not is available after 5 LUT delays. However, its location takes 10 LUT delays.
The hierarchical process would yield 682 LUTs (assuming 6x1 or 5x2 Xilinx LUTs), 224 4:1 MUXes and 6 layers of combinational logic.
Layer 1
Group all 1024 bits into 256 groups, each consisting of 4 consecutive bits. For each group you calculate:
1-bit layer1_found - '1' if the group has a good bit. This takes 1 6x1 LUT - 4 original bits as inputs
2-bit layer1_index - index of the first good bit (if any). This takes 1 5x2 LUT - same 4 inputs - 2 bits of outputs
Combined all together, we've got 256 of each. 256 x 2 = 512 LUTs for the first layer
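As a rough software model of this first layer, here is a minimal Python sketch. It assumes a "good" bit is passed in as a truthy flag and that "first" means the lowest position within the group; the names group4 and layer1 are made up for illustration and are not part of any FPGA toolchain.

```python
def group4(flags):
    """Model one found/index LUT pair for a group of 4 inputs.

    flags: 4 values, truthy meaning "good" (assumption).
    Returns (found, index): found is 1 if any input is good,
    index is the position 0..3 of the first good input (0 if none).
    """
    for i, flag in enumerate(flags):
        if flag:
            return 1, i
    return 0, 0


def layer1(good):
    """Split 1024 flags into 256 groups of 4 consecutive bits and evaluate each."""
    assert len(good) == 1024
    results = [group4(good[i:i + 4]) for i in range(0, 1024, 4)]
    layer1_found = [f for f, _ in results]
    layer1_index = [ix for _, ix in results]
    return layer1_found, layer1_index
```

Each call to group4 stands for one 6x1 LUT (found) plus one 5x2 LUT (index), which is where the 256 x 2 count comes from.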
Layer 2
We now form groups, each of which consists of 4 consecutive layer-1 groups (64 groups total - each represents 4 groups from layer 1 or 16 original bits). For each group we calculate:
1-bit layer2_found - '1' if any of the 4 layer1_found is set. This takes 1 6x1 LUT - 4 layer1_found as inputs
2-bit layer2_index - index of the first layer1_found which is set (if any). This takes 1 5x2 LUT - same 4 inputs
LUTs on layer 2: 64 x 2 = 128 LUTs
Note that layer2_index must eventually be 4 bits wide (it addresses one of 16 original bits), but so far it is only 2 bits.
The existing 2 bits of layer2_index should be used to select the appropriate layer1_index (out of 4), and the selected layer1_index becomes the 2 LS bits of layer2_index.
We cannot do it right now, but we can do it on the next layer.
Layer 3
We now form 16 4-element groups from the layer-2 groups. Acting the same way as before we produce
1-bit layer3_found - 1 LUT
2-bit layer3_index - 1 LUT
This gives us 16 x 2 = 32 LUTs
Now we continue with selecting indices. The 2 bits of layer2_index are used as a MUX selector to select the
appropriate layer1_index for each group of the second layer. Since layer1_index is 2-bit, we need 2 MUXes per group.
These can be implemented as LUTs. Xilinx also has built-in 4:1 MUXes which can be used instead. Assume
we use MUXes.
64 x 2 = 128 MUXes
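The index-extension step can be modelled in a few lines of Python. This is only a sketch under the assumption that the upper layer's 2 bits become the MS bits and the selected lower index becomes the LS bits; extend_index is a hypothetical helper name.

```python
def extend_index(upper_index, lower_indices, lower_width):
    """Model a bank of 4:1 MUXes, one MUX per bit of the lower index.

    upper_index:   2-bit selector produced by the current layer (0..3)
    lower_indices: the 4 candidate indices from the layer below
    lower_width:   width in bits of each lower index
    """
    selected = lower_indices[upper_index]            # what the MUX bank outputs
    return (upper_index << lower_width) | selected   # selector on top, selection below


# layer2_index = 2 picks the third layer1_index (value 1): 0b10_01 = 9
layer2_index_full = extend_index(2, [0, 3, 1, 2], lower_width=2)
```

With lower_width=2 this corresponds to the 2 MUXes per layer-2 group counted above; on later layers the same function applies with lower_width of 4, 6 and 8.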
Layer 4
Same thing. Create layer4_found and layer4_index
4 x 2 = 8 LUTs
Also, as in the previous layer, we use layer3_index to select appropriate layer2_index values. layer2_index is
now 4-bit, so we need 4 MUXes for each group.
16 x 4 = 64 MUXes
Layer 5
We now get only one group and produce layer5_found and layer5_index = 2 LUTs
Also, use layer4_index to select the appropriate layer3_index (which is now 6-bit long)
4 x 6 = 24 MUXes
Layer 6
We still need to use layer5_index to select the correct layer4_index = 8 MUXes (one for each bit of layer4_index)
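To sanity-check the whole tree, here is a behavioural Python sketch of the six layers taken together. It assumes the input is a list of 1024 truthy/falsy "good" flags with position 0 being "first", and it folds each MUX stage into the same loop iteration as the found/index stage that feeds it, whereas the hardware described above does the selection one layer later. first_good is a made-up name for illustration.

```python
def first_good(good):
    """Return (found, index) of the first truthy flag using a 4-ary reduction."""
    found = [1 if g else 0 for g in good]
    index = [0] * len(found)   # index accumulated so far for each node
    width = 0                  # width in bits of the accumulated index

    while len(found) > 1:
        next_found, next_index = [], []
        for base in range(0, len(found), 4):
            group = found[base:base + 4]
            # 2-bit index of the first set "found" flag in this group (0 if none)
            sel = next((i for i, f in enumerate(group) if f), 0)
            next_found.append(1 if any(group) else 0)
            # MUX step: keep the lower index of the selected child as LS bits
            next_index.append((sel << width) | index[base + sel])
        found, index = next_found, next_index
        width += 2

    return found[0], index[0]
```

For 1024 inputs the loop runs 5 times (1024, 256, 64, 16, 4, 1 nodes), and the returned index is 10 bits wide.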
Total count is:
512 + 128 + 32 + 8 + 2 = 682 LUTs
128 + 64 + 24 + 8 = 224 MUXes
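The totals can be re-derived with a short Python check. The only assumptions are the ones already stated: 2 LUTs per group on every layer, and one 4:1 MUX per bit of the index being selected.

```python
# Number of groups on layers 1..5
groups = [256, 64, 16, 4, 1]

# 2 LUTs (found + index) per group on every layer
luts_per_layer = [2 * g for g in groups]

# Groups on layers 2..5 each select a lower index of width 2, 4, 6, 8 bits
muxes_per_stage = [groups[k] * 2 * k for k in range(1, 5)]

total_luts = sum(luts_per_layer)    # 512 + 128 + 32 + 8 + 2
total_muxes = sum(muxes_per_stage)  # 128 + 64 + 24 + 8
```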
However, producing the signal that uses the 10-bit layer5_index to select a new value for each of the 1024 bits will take 2 LUTs per bit = 2048 LUTs, but I guess there might be a better way to do the modification.
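For reference, the per-bit modification amounts to the following behavioural Python sketch. It assumes the modification replaces the located bit with some new value; modify and new_value are illustrative names only. Each output bit depends on the 10 index bits, the found flag and the old bit, which is presumably why it does not fit into a single 6-input LUT.

```python
def modify(bits, found, index, new_value):
    """Return a copy of bits with bits[index] replaced, but only if found is set."""
    return [new_value if (found and i == index) else b
            for i, b in enumerate(bits)]
```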