Author Topic: HDL problem: find the first bit equal to zero, and set it to one (Read 11160 times)

legacy · « **Reply #25 on:** July 06, 2017, 08:42:28 pm »

Quote from: Gribo on July 06, 2017, 06:01:36 pm

So, let me understand it, one process is filling the vector with bits at an unknown rate, the other is emptying it, again, at an unknown rate?

Exactly.

brucehoult · « **Reply #26 on:** July 06, 2017, 10:34:16 pm »

Quote from: hamster_nz on July 04, 2017, 10:35:58 pm

There is no way getting around it, that the new state for bit one depends on the state of all other bits in the vector. Somehow the state from the 1023 other bits must fan in to allow it to be updated.

If you need to do that in one cycle the best you can do is celling(log(1024)/log(LUT width)) levels of LUTs...

That's just purely to find out whether there is at least one 0 or not. Basically implementing AND gates with all the LUTs. But you won't know where any zero is, let alone the first one.

I've never owned or used an FPGA (though I plan to soon) but I had a crack at solving this, designing LUT contents directly as I don't know VHDL or Verilog.

Code: [Select]

git clone https://github.com/brucehoult/fpga_ffz.git
cd fpga_ffz
make

It will generate a bunch of LUTs as C code, then simulate them through a million random test cases.

Examine e.g. luts.h

I get 1247 LUTs, of which 224 are 6 input LUTs used as 4:1 muxes. The other 1023 only require four inputs, with groups of three LUTs sharing the same four inputs. If you can use your 6-input LUTs as two 5-input LUTs (as I know Xilinx can) or even into four 4-input LUTs (anyone?) then the number of LUTs can be reduced to 906 or 565.

In theory you could cover 1024 inputs with 4 layers of 6-input LUTs, but presumably you want the output of the location of the first zero in binary, not some wacky form. It's only 5 layers with 4-input LUTs.

The output of whether there is a zero of not is available in 5 LUT delays and the location of the first zero in 6 LUT delays.

Here's the start and end of luts.h. I have no idea how people really do this, so I just made something up on the spot

Of course since the simulation is serial not parallel, there is a constraint that all the input wires to a LUT must have smaller indexes than the output of the LUT. The original input bits are in wires 0 .. 1023.

Code: [Select]

#define AllSet4 0x8000800080008000
#define FFZ4_0 0x22a222a2a2a222a2
#define FFZ4_1 0x8888088808880888
#define Mux4to1 0xfedcba9876543210
// make_ffz_layers 4 0
// make_ffz_layers 3 0
// make_ffz_layers 2 0
// make_ffz_layers 1 0
// make_ffz_layers 0 0
  {1024, AllSet4, 0, 1, 2, 3, 0, 0},
  {1025, FFZ4_1, 0, 1, 2, 3, 0, 0},
  {1026, FFZ4_0, 0, 1, 2, 3, 0, 0},
// make_ffz_layers 0 4
  {1027, AllSet4, 4, 5, 6, 7, 0, 0},
  {1028, FFZ4_1, 4, 5, 6, 7, 0, 0},
  {1029, FFZ4_0, 4, 5, 6, 7, 0, 0},
// make_ffz_layers 0 8
  {1030, AllSet4, 8, 9, 10, 11, 0, 0},
  {1031, FFZ4_1, 8, 9, 10, 11, 0, 0},
  {1032, FFZ4_0, 8, 9, 10, 11, 0, 0},
// make_ffz_layers 0 12
  {1033, AllSet4, 12, 13, 14, 15, 0, 0},
  {1034, FFZ4_1, 12, 13, 14, 15, 0, 0},
  {1035, FFZ4_0, 12, 13, 14, 15, 0, 0},
  {1036, AllSet4, 1024, 1027, 1030, 1033, 0, 0},
  {1037, FFZ4_1, 1024, 1027, 1030, 1033, 0, 0},
  {1038, FFZ4_0, 1024, 1027, 1030, 1033, 0, 0},
  {1039, Mux4to1, 1038, 1037, 1025, 1028, 1031, 1034},
  {1040, Mux4to1, 1038, 1037, 1026, 1029, 1032, 1035},
// make_ffz_layers 1 16
// make_ffz_layers 0 16
  {1041, AllSet4, 16, 17, 18, 19, 0, 0},
  {1042, FFZ4_1, 16, 17, 18, 19, 0, 0},
  {1043, FFZ4_0, 16, 17, 18, 19, 0, 0},
// make_ffz_layers 0 20
  {1044, AllSet4, 20, 21, 22, 23, 0, 0},
  {1045, FFZ4_1, 20, 21, 22, 23, 0, 0},
  {1046, FFZ4_0, 20, 21, 22, 23, 0, 0},
// make_ffz_layers 0 24
  {1047, AllSet4, 24, 25, 26, 27, 0, 0},
  {1048, FFZ4_1, 24, 25, 26, 27, 0, 0},
  {1049, FFZ4_0, 24, 25, 26, 27, 0, 0},
// make_ffz_layers 0 28
  {1050, AllSet4, 28, 29, 30, 31, 0, 0},
  {1051, FFZ4_1, 28, 29, 30, 31, 0, 0},
  {1052, FFZ4_0, 28, 29, 30, 31, 0, 0},
  {1053, AllSet4, 1041, 1044, 1047, 1050, 0, 0},
  {1054, FFZ4_1, 1041, 1044, 1047, 1050, 0, 0},
  {1055, FFZ4_0, 1041, 1044, 1047, 1050, 0, 0},
  {1056, Mux4to1, 1055, 1054, 1042, 1045, 1048, 1051},
  {1057, Mux4to1, 1055, 1054, 1043, 1046, 1049, 1052},
:
:
// make_ffz_layers 1 1008
// make_ffz_layers 0 1008
  {2227, AllSet4, 1008, 1009, 1010, 1011, 0, 0},
  {2228, FFZ4_1, 1008, 1009, 1010, 1011, 0, 0},
  {2229, FFZ4_0, 1008, 1009, 1010, 1011, 0, 0},
// make_ffz_layers 0 1012
  {2230, AllSet4, 1012, 1013, 1014, 1015, 0, 0},
  {2231, FFZ4_1, 1012, 1013, 1014, 1015, 0, 0},
  {2232, FFZ4_0, 1012, 1013, 1014, 1015, 0, 0},
// make_ffz_layers 0 1016
  {2233, AllSet4, 1016, 1017, 1018, 1019, 0, 0},
  {2234, FFZ4_1, 1016, 1017, 1018, 1019, 0, 0},
  {2235, FFZ4_0, 1016, 1017, 1018, 1019, 0, 0},
// make_ffz_layers 0 1020
  {2236, AllSet4, 1020, 1021, 1022, 1023, 0, 0},
  {2237, FFZ4_1, 1020, 1021, 1022, 1023, 0, 0},
  {2238, FFZ4_0, 1020, 1021, 1022, 1023, 0, 0},
  {2239, AllSet4, 2227, 2230, 2233, 2236, 0, 0},
  {2240, FFZ4_1, 2227, 2230, 2233, 2236, 0, 0},
  {2241, FFZ4_0, 2227, 2230, 2233, 2236, 0, 0},
  {2242, Mux4to1, 2241, 2240, 2228, 2231, 2234, 2237},
  {2243, Mux4to1, 2241, 2240, 2229, 2232, 2235, 2238},
  {2244, AllSet4, 2188, 2205, 2222, 2239, 0, 0},
  {2245, FFZ4_1, 2188, 2205, 2222, 2239, 0, 0},
  {2246, FFZ4_0, 2188, 2205, 2222, 2239, 0, 0},
  {2247, Mux4to1, 2246, 2245, 2189, 2206, 2223, 2240},
  {2248, Mux4to1, 2246, 2245, 2190, 2207, 2224, 2241},
  {2249, Mux4to1, 2246, 2245, 2191, 2208, 2225, 2242},
  {2250, Mux4to1, 2246, 2245, 2192, 2209, 2226, 2243},
  {2251, AllSet4, 2019, 2094, 2169, 2244, 0, 0},
  {2252, FFZ4_1, 2019, 2094, 2169, 2244, 0, 0},
  {2253, FFZ4_0, 2019, 2094, 2169, 2244, 0, 0},
  {2254, Mux4to1, 2253, 2252, 2020, 2095, 2170, 2245},
  {2255, Mux4to1, 2253, 2252, 2021, 2096, 2171, 2246},
  {2256, Mux4to1, 2253, 2252, 2022, 2097, 2172, 2247},
  {2257, Mux4to1, 2253, 2252, 2023, 2098, 2173, 2248},
  {2258, Mux4to1, 2253, 2252, 2024, 2099, 2174, 2249},
  {2259, Mux4to1, 2253, 2252, 2025, 2100, 2175, 2250},
  {2260, AllSet4, 1324, 1633, 1942, 2251, 0, 0},
  {2261, FFZ4_1, 1324, 1633, 1942, 2251, 0, 0},
  {2262, FFZ4_0, 1324, 1633, 1942, 2251, 0, 0},
  {2263, Mux4to1, 2262, 2261, 1325, 1634, 1943, 2252},
  {2264, Mux4to1, 2262, 2261, 1326, 1635, 1944, 2253},
  {2265, Mux4to1, 2262, 2261, 1327, 1636, 1945, 2254},
  {2266, Mux4to1, 2262, 2261, 1328, 1637, 1946, 2255},
  {2267, Mux4to1, 2262, 2261, 1329, 1638, 1947, 2256},
  {2268, Mux4to1, 2262, 2261, 1330, 1639, 1948, 2257},
  {2269, Mux4to1, 2262, 2261, 1331, 1640, 1949, 2258},
  {2270, Mux4to1, 2262, 2261, 1332, 1641, 1950, 2259},
#define NUM_WIRES 2271
#define NUM_INPUTS 1024
#define NUM_LUTS 1247
#define RES_NO_ZEROS 2260
#define RES_POS_9 2261
#define RES_POS_8 2262
#define RES_POS_7 2263
#define RES_POS_6 2264
#define RES_POS_5 2265
#define RES_POS_4 2266
#define RES_POS_3 2267
#define RES_POS_2 2268
#define RES_POS_1 2269
#define RES_POS_0 2270

NorthGuy · « **Reply #27 on:** July 07, 2017, 12:22:06 am »

Quote from: brucehoult on July 06, 2017, 10:34:16 pm

The output of whether there is a zero of not is available in 5 LUT delays. However the location of it has 10 LUT delays.

The hierarchical process would yield 682 LUTs (assuming 6x1 or 5x2 Xilinx LUTs), 224 4:1 MUXes and 6 layers of combinatory logic.

Layer 1

You group all bits into 256 groups, each consists of consecutive 4-bit elements. For each group you calculate:

1-bit layer1_found - '1' if if the group has a good bit. This takes 1 6x1 LUT - 4 original bits as inputs
2-bit layer1_index - index of the first good bit (if any). This takes 1 5x2 LUT - same 4 inputs - 2 bits of outputs

Combined all together, we've got 256 of each. 256 x 2 = 512 LUTs for the first layer

Layer 2

We now form groups, each of which consists of 4 consecutive layer-1 groups (64 groups total - each represents 4 groups from layer 1 or 16 original bits). For each group we calculate

1-bit layer2_found - '1' if any of the 4 layer1_found is set. This take 1 6x1 LUT - 4 group1_found as inputs
2-bit layer2_index - index of the first layer1_found which is set (if any). This takes 1 5x2 LUT - same 4 inputs

LUTs on layer 2: 64 x 2 = 128 LUTs

Note that layer2_index must be 4-bit, but it is only 2-bit. The existing 2 bits of the layer2_index should be used to
select the appropriate layer1_index (out of 4) and use the selected layer1_index as 2 LS bits of layer2_index.
We cannot do it right now, but we can do it on the next layer.

Layer 3

We now form 16 4-element groups from the layer-2 group. Acting the same way as before we produce

1-bit layer3_found - 1 LUT
2-bit layer3_index - 1 LUT

This gives us 16 x 2 = 32 LUTs

Now we continue with selecting indices. 2 bits of layer2_index are used as a MUX selector to select the
appropriate layer1_index for each group of the 2-nd layer. Since layer1_index is 2-bit, we need 2 MUXes.
These can be implemented as LUTs. Xilinx also has built-in 4:1 MUXes which can be used instead. Assume
we use MUXes.

64 x 2 = 128 MUXes

Layer 4

Same thing. Create layer4_found and layer4_index

4 x 2 = 8 LUTs

Also, as in the previous layer, we use layer3_index to select appropriate layer2_index values. layer2_index is
now 4-bit, so we need 4 MUXes for each group.

16 x 4 = 64 MUXes

Layer 5

We now get only one group and produce layer5_found and layer5_index = 2 LUTs

Also, use layer4_index to select the appropriate layer3_index (which is now 6-bit long)

4 x 6 = 24 MUXes

Layer 6

We still need to use layer5_index to select the correct layer4_index = 8 MUXes (one for each bit of layer4_index)

Total count is:

512 + 128 + 32 + 8 + 2 = 682 LUTs
128 + 64 + 24 + 8 = 224 MUXes

However, producing the signal which uses 10-bit layer5_index to select new value for each of 1024 bits will take 2 LUTs for each = 2048 LUTs, but I guess there might be a better way to do the modification.

hamster_nz · « **Reply #28 on:** July 07, 2017, 04:34:53 am »

Based @Flip-Flop's idea from earlier in this thread.... it actually works very well!

Code: [Select]

initial   0000000000000000000001000000000000000000000000000000000000000000
SetNext   0000000000000000000011000000000000000000000000000000000000000000
SetNext   0000000000000000000111000000000000000000000000000000000000000000
Extra     0000000000001000000111000000000000000000000000000000000000000000
SetNext   0000000000011000000111000000000000000000000000000000000000000000
SetNext   0000000000111000000111000000000000000000000000000000000000000000

It uses the carry chain to get information between bits, so should be relatively fast - much faster than LUTs

Here is the C code used to test the idea - NOTE - The bits in vector are in the wrong order (e.g the leftmost bit above is bit 0, the rightmost is bit 63).

Code: [Select]

unsigned long long vector = 0;

void setNext(void) {
  unsigned long long t;
  unsigned long long mask;

  t = ~vector;              // Flip the 0s and 1s,
  t++;                         // Cause a carry to ripple up through the carry chain
  mask = t & vector;     // Detect where the carry stopped
  vector |= mask >> 1; // pull bit back and set the next bit
}

There are a few boundary conditions that need to be checked carefully (when all zeros, or when the last bit is set). You could maybe use this as a basis for a shorter block, and chain them (much like a carry-lookahead adder). But using the carry chain in between bits is the way to go!

Here is the entire test code:

Code: [Select]

#include <stdio.h>

unsigned long long vector = 0;

void print_vector(char *desc) {
  printf("%-10.10s",desc);
  unsigned long long mask = 1;
  while(mask) {
    putchar( (mask&vector) ? '1' : '0');
    mask <<= 1;
  }
  printf("\n");
}

void setNext(void) {
  unsigned long long t;
  unsigned long long mask;

  t = ~vector;
  t++;
  mask = t & vector;
  vector |= mask >> 1;
}

int main(int c, char *v[])
{
   /* Set the initial value */
   vector = 0x200000;
   print_vector("initial");

   setNext();
   print_vector("SetNext");

   setNext();
   print_vector("SetNext");

   /* Note this is bit 48 in the vector */
   vector |= 0x1000;
   print_vector("Extra  ");

   setNext();
   print_vector("SetNext");

   setNext();
   print_vector("SetNext");
}

Bruce Abbott · « **Reply #29 on:** July 07, 2017, 07:29:43 am »

Quote from: legacy on July 06, 2017, 04:56:59 pm

yes it's a status register, but only two processes are involved with it.

A 4096 bit status register? What is its purpose?

brucehoult · « **Reply #30 on:** July 07, 2017, 12:05:47 pm »

Quote from: NorthGuy on July 07, 2017, 12:22:06 am

Quote from: brucehoult on July 06, 2017, 10:34:16 pm
The output of whether there is a zero of not is available in 5 LUT delays. However the location of it has 10 LUT delays.

The hierarchical process would yield 682 LUTs (assuming 6x1 or 5x2 Xilinx LUTs), 224 4:1 MUXes and 6 layers of combinatory logic.

... and you go on to excellently describe precisely the same solution as I posted working code for in the message you replied to :-) With the same results I quoted there: 906 LUTs total (224 as 4:1 MUXes), if both 6x1 and 5x2 are available.

I guess that means I wasn't too far off for a first crack at FPGA design :-) :-) :-) :-)

legacy · « **Reply #31 on:** July 07, 2017, 12:10:36 pm »

Quote from: brucehoult on July 07, 2017, 12:05:47 pm

I guess that means I wasn't too far off for a first crack at FPGA design :-) :-) :-) :-)

LOL

NorthGuy · « **Reply #32 on:** July 07, 2017, 12:55:48 pm »

Quote from: hamster_nz on July 07, 2017, 04:34:53 am

It uses the carry chain to get information between bits, so should be relatively fast - much faster than LUTs

I don't think so. Carry chain is consecutive and hence very slow. For example, propagation through a CARRY4 element in Xilinx-7 series takes roughly 0.5 ns. So, to go through 1024 elements will take about 128 ns. This is about 30 times slower than the brucehoult's solution.

NorthGuy · « **Reply #33 on:** July 07, 2017, 01:25:48 pm »

Quote from: brucehoult on July 07, 2017, 12:05:47 pm

... and you go on to excellently describe precisely the same solution as I posted working code for in the message you replied to :-) With the same results I quoted there: 906 LUTs total (224 as 4:1 MUXes), if both 6x1 and 5x2 are available.

Amazing. After posting this, I looked at your post again and the numbers looked suspiciously similar. This was very surprising, but I'm not familiar with fpga_ffz, so I didn't go through your code.

nctnico · « **Reply #34 on:** July 08, 2017, 09:19:09 am »

Quote from: legacy on July 06, 2017, 08:42:28 pm

Quote from: Gribo on July 06, 2017, 06:01:36 pm
So, let me understand it, one process is filling the vector with bits at an unknown rate, the other is emptying it, again, at an unknown rate?
Exactly.

Isn't there some way to avoid having to check all the bits every time by using a different way to get the same functionality? I'm not talking about optimisation but creating an entirely different approach to the problem you are trying to solve which avoids needing to check & set bits at all.

legacy · « **Reply #35 on:** July 08, 2017, 01:58:02 pm »

Quote from: nctnico on July 08, 2017, 09:19:09 am

Isn't there some way to avoid having to check all the bits every time by using a different way to get the same functionality?

Since I need to write a few notes to make to remember what I am doing, I am going to write a laTek paragraph under the title "getting stuck in a rut" since alternatives to 'get & set' actually exist, but ... they all block other functionalities.

Get&Set works, but it's hard to be implemented (and might consume area). I have to thank people for the help

Oh, in the theory, relaxing other circuits and removing other functionalities, I could simply end in replacing the 'get&set' with a simple 32 bit counter which increments and decrement on demand

Gribo · « **Reply #36 on:** July 08, 2017, 02:52:33 pm »

You practically have a 4096 bit deep, 1 bit wide, dual port FIFO.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: HDL problem: find the first bit equal to zero, and set it to one (Read 11160 times)

legacy

Re: HDL problem: find the first bit equal to zero, and set it to one

brucehoult

Re: HDL problem: find the first bit equal to zero, and set it to one

NorthGuy

Re: HDL problem: find the first bit equal to zero, and set it to one

hamster_nz

Re: HDL problem: find the first bit equal to zero, and set it to one

Bruce Abbott

Re: HDL problem: find the first bit equal to zero, and set it to one

brucehoult

Re: HDL problem: find the first bit equal to zero, and set it to one

legacy

Re: HDL problem: find the first bit equal to zero, and set it to one

NorthGuy

Re: HDL problem: find the first bit equal to zero, and set it to one

NorthGuy

Re: HDL problem: find the first bit equal to zero, and set it to one

nctnico

Re: HDL problem: find the first bit equal to zero, and set it to one

legacy

Re: HDL problem: find the first bit equal to zero, and set it to one

Gribo

Re: HDL problem: find the first bit equal to zero, and set it to one

Share me