Author Topic: From bit position to array index  (Read 5461 times)

0 Members and 1 Guest are viewing this topic.

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11713
  • Country: my
  • reassessing directives...
Re: From bit position to array index
« Reply #25 on: October 13, 2020, 05:14:45 am »
well things happened. i just happened to have lot of them since many years. its an experience, now its time to upgrade. i guess at this age, its why some youngsters never know about what binary search or bit manipulation is, or even working with unsigned data type. those tiny13 will never go to waste for small projects cheers ;)
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline RenThraysk

  • Regular Contributor
  • *
  • Posts: 107
  • Country: gb
Re: From bit position to array index
« Reply #26 on: October 13, 2020, 11:03:54 am »



If you need to conserve flash, use
Code: [Select]
#include <avr/io.h>
#include <avr/pgmspace.h>

const unsigned char highest_bit_table[256] PROGMEM = {
    8,0,1,8,2,8,8,8,3,8,8,8,8,8,8,8,4,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,5,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
    6,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
};

unsigned char highest_bit_set(unsigned char c)
{
    if (c & 128)
        return 7;
    else
        return pgm_read_byte_near(highest_bit_table + c);
}
which takes 148 bytes of flash, and 5 (if highest bit is set) or 9 cycles (otherwise).

Note that if the inputs are uniformly random, the average is the same (7 cycles).



To trade speed for less flash use, use
Code: [Select]
#include <avr/io.h>
#include <avr/pgmspace.h>

const unsigned char highest_bit_table_3[16] PROGMEM = {
    0x88, 0x40, 0x51, 0x88, 0x62, 0x88, 0x88, 0x88, 0x73, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88,
};

unsigned char highest_bit_set_3(unsigned char c)
{
    if (c & 0xF0) {
        if (c & 0x0F)
            return 8;
        else
            return pgm_read_byte_near(highest_bit_table_3 + (c >> 4)) >> 4;
    } else
        return pgm_read_byte_near(highest_bit_table_3 + c) & 0x0F;
}
which takes 65 bytes of flash, and between 8 and 18 cycles.

If you don't have even that much flash available, use the loop approach, but with a nibble swap since AVR's support that in hardware:
Code: [Select]
unsigned char  highest_bit_set(unsigned char c)
{
    unsigned char  result;

    if (c & 0xF0) {
        c >>= 4;
        result = 4;
    } else {
        result = 0;
    }

    while (c) {
        c >>= 1;
        result++;
    }

    return result;
}

Pretty sure the first conditions on these 3 solutions can be changed to comparisons (c > 0x0F), which is non destructive (doesn't require a temporary register to compute) so would save a move instruction.

And a 16 byte lookup table on the last solution would be neat.
 
« Last Edit: October 13, 2020, 11:31:21 am by RenThraysk »
 

Offline RenThraysk

  • Regular Contributor
  • *
  • Posts: 107
  • Country: gb
Re: From bit position to array index
« Reply #27 on: October 13, 2020, 12:50:12 pm »
Something like this.

Code: [Select]
#include <avr/io.h>
#include <avr/pgmspace.h>

const unsigned char t[16] PROGMEM = {
0,
1,
2, 2,
3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4,
};

/* returns 0 when no bits set, 1 for least significant bit set, 8 when most */
unsigned char highest_bit_set(unsigned char x) {
    unsigned char y = x / 16;
    if (y) {
        x = y;
        y = 4;
    }
    return y + pgm_read_byte_near(t + x);
};
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11713
  • Country: my
  • reassessing directives...
Re: From bit position to array index
« Reply #28 on: October 13, 2020, 03:10:26 pm »
Something like this.

Code: [Select]
#include <avr/io.h>
#include <avr/pgmspace.h>

const unsigned char t[16] PROGMEM = {
0,
1,
2, 2,
3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4,
};

/* returns 0 when no bits set, 1 for least significant bit set, 8 when most */
unsigned char highest_bit_set(unsigned char x) {
    unsigned char y = x / 16;
    if (y) {
        x = y;
        y = 4;
    }
    return y + pgm_read_byte_near(t + x);
};
this is what i was talking about "a bit of both" saving us time from some exercise.. but just looking at your table i can see some problem and cant handle unexpected errors as brucehoult highlighted, for example if x = 0 (no bit set) or x = 3 (more than one bits are set) your code still return valid index value. since you are using lookup table and includes invalid bits value, why dont you use as brucehoult's table some using -1, or just put value greater than 7 (unsigned operations) to flag error? maybe something like this? (unverified, just eyeballing) please also note the OP is using 0-based indexing. and i prefer x >> 4 with comment x / 16 because i dont trust the compiler is clever enough to switch division into bit shift :P, or through its infinite wisdom, do a more clever rounding off (iirc i've seen bigger machine did that) but your table is good for fast index rounding off when the program expects the behaviour. ymmv

Code: [Select]
#include <avr/io.h>
#include <avr/pgmspace.h>

const unsigned char t[16] PROGMEM = {
8,
0,
1, 8,
2, 8, 8, 8,
3, 8, 8, 8, 8, 8, 8, 8
};

/* returns 8 for invalid bits value, 0 for least significant bit set, 7 when most */
unsigned char highest_bit_set(unsigned char x) {
    unsigned char y = x >> 4; // (x / 16)
    if (y) {
        x = y;
        y = 4;
    }
    return y + pgm_read_byte_near(t + x);
};
« Last Edit: October 13, 2020, 03:23:17 pm by Mechatrommer »
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline RenThraysk

  • Regular Contributor
  • *
  • Posts: 107
  • Country: gb
Re: From bit position to array index
« Reply #29 on: October 13, 2020, 03:30:26 pm »

/* returns 8 for invalid bits value, 0 for least significant bit set, 7 when most */
unsigned char highest_bit_set(unsigned char x) {
    unsigned char y = x >> 4; // (x / 16)
    if (y) {
        x = y;
        y = 4;
    }
    return y + pgm_read_byte_near(t + x);
};
[/code]

AVR GCC 5.4.0 does something strange with x >> 4 hence why used x / 16.

unsigned char y = x >> 4; // (x / 16)
Code: [Select]
     mov r30,r24
        ldi r31,0
        movw r18,r30
        asr r19
        ror r18
        asr r19
        ror r18
        asr r19
        ror r18
        asr r19
        ror r18
        mov r24,r18
        tst r18
        breq .L2

unsigned char y = x / 16;
Code: [Select]
        mov r25,r24
        swap r25
        andi r25,lo8(15)
        breq .L2

Code here: https://godbolt.org/z/Ghffaj
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11713
  • Country: my
  • reassessing directives...
Re: From bit position to array index
« Reply #30 on: October 13, 2020, 03:43:28 pm »
AVR GCC 5.4.0 does something strange with x >> 4 hence why used x / 16.

unsigned char y = x >> 4; // (x / 16)
Code: [Select]
     mov r30,r24
        ldi r31,0
        movw r18,r30
        asr r19
        ror r18
        asr r19
        ror r18
        asr r19
        ror r18
        asr r19
        ror r18
        mov r24,r18
        tst r18
        breq .L2

unsigned char y = x / 16;
Code: [Select]
        mov r25,r24
        swap r25
        andi r25,lo8(15)
        breq .L2

Code here: https://godbolt.org/z/Ghffaj
yup apparantly avr dont support single op multiple bit shift, and the compiler is not clever enough to translate it into more efficient divide by 16 integer logic ;D but this is machine specific i usually only becomes aware when tuning space and time constraint. there is not entirely wrong or entirely right answer, cheers.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6850
  • Country: fi
    • My home page and email address
Re: From bit position to array index
« Reply #31 on: October 13, 2020, 03:56:24 pm »
Like Mechatrommer pointed out, there are many ways to define the desired behaviour.  For example:

Using avr-gcc 5.4.0 (it's what I have installed right now),
Code: [Select]
#include <avr/pgmspace.h>

#define  UNLIKELY(expr)  __builtin_expect(expr, 0)
#define  LIKELY(expr)    __builtin_expect(expr, 1)

/* ((BAD | 4) == BAD) must be true. */
#define  BAD  15

static const unsigned char one_bit_table[16] PROGMEM = {
    BAD, 0, 1, BAD, 2, BAD, BAD, BAD, 3, BAD, BAD, BAD, BAD, BAD, BAD, BAD,
};

unsigned char one_bit_set(unsigned char value)
{
    unsigned char  result = 0;

    if (value > 15) {
        value = __builtin_avr_swap(value);
        if (UNLIKELY(value > 15))
            return BAD;
        else
            result = 4;
    }

    return result | pgm_read_byte_near(one_bit_table + value);
}
which needs 36+16 = 52 bytes of flash, and takes 7 cycles if bad (argument is zero or has more than one bit set), and 13 or 17 cycles otherwise.

(You can choose any value for BAD, as long as ((BAD|4)==BAD).  On AVRs, this includes 15 and 255.  Note that as usual, on AVR, casting 255 to signed char, yields -1.)

The following implementation finds the highest bit set in a byte:
Code: [Select]
#include <avr/io.h>
#include <avr/pgmspace.h>

#define  LIKELY(expr)  __builtin_expect(expr, 1)

#define  NONE  8

const unsigned char high_bit_table[16] PROGMEM = {
    NONE, 0, 1,1, 2,2,2,2, 3,3,3,3,3,3,3,3,
};

unsigned char highest_bit_set(unsigned char c)
{
    if (LIKELY(c > 15))
        return pgm_read_byte_near(high_bit_table + (c >> 4)) + 4;

    if (LIKELY(c > 0))
        return pgm_read_byte_near(high_bit_table + c);

    return NONE;
}
using 44+16 = 60 bytes, and runs in 7 (zero) or 12/13 cycles.  You can freely choose the value of NONE ; and the first byte in the array (set to NONE) is never accessed, so you can reuse it for something else.

The difference here is that if you have multiple bits set, this one returns the index of the highest bit set instead of an error.

The two functions assembly code is
Code: [Select]
        .text
one_bit_set:
        cpi  r24, lo8(16)
        brlo .lownibble
        swap r24
        cpi  r24, lo8(16)
        brsh .notonebit
        ldi  r25, lo8(4)
        rjmp .lookup
.lownibble:
        ldi  r25, 0
.lookup:
        mov  r30, r24
        ldi  r31, 0
        subi r30, lo8(-(one_bit_table))
        sbci r31, hi8(-(one_bit_table))
        lpm  r30, Z
        mov  r24, r25
        or   r24, r30
        ret
.notonebit:
        ldi  r24, lo8(15)
        ret

highest_bit_set:
        cpi  r24, lo8(16)
        brlo .lowhalf
        mov  r30, r24
        swap r30
        andi r30, lo8(15)
        ldi  r31, 0
        subi r30, lo8(-(high_bit_table))
        sbci r31, hi8(-(high_bit_table))
        lpm  r30, Z
        ldi  r24, lo8(4)
        add  r24, r30
        ret
.lowhalf:
        tst  r24
        breq .zero
        mov  r30, r24
        ldi  r31, 0
        subi r30, lo8(-(high_bit_table))
        sbci r31, hi8(-(high_bit_table))
        lpm  r24, Z
        ret
.zero:
        ldi  r24, lo8(8)
        ret

high_bit_table:
        .byte   8
        .byte   0
        .byte   1
        .byte   1
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3

one_bit_table:
        .byte   15
        .byte   0
        .byte   1
        .byte   15
        .byte   2
        .byte   15
        .byte   15
        .byte   15
        .byte   3
        .byte   15
        .byte   15
        .byte   15
        .byte   15
        .byte   15
        .byte   15
        .byte   15
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11713
  • Country: my
  • reassessing directives...
Re: From bit position to array index
« Reply #32 on: October 13, 2020, 03:56:55 pm »
maybe you can try what it generates if x >> 3 vs x / 8 and x >> 6 vs x / 64.. when there is no high and low nibble swap, sometime compiler does wonderfull things...
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6850
  • Country: fi
    • My home page and email address
Re: From bit position to array index
« Reply #33 on: October 13, 2020, 05:08:43 pm »
When optimizations are enabled (-O1, -O2, -O3, -Os), avr-gcc 5.4.0 generates logical shifts of unsigned chars (Rn) at most four instructions long:
Rn << 1:   LSL  Rn
Rn << 2:   LSL  Rn ; LSL  Rn
Rn << 3:   LSL  Rn ; LSL  Rn ; LSL Rn
Rn << 4:   SWAP Rn ; ANDI Rn, -16
Rn << 5:   SWAP Rn ; LSL  Rn ; ANDI Rn, -32
Rn << 6:   SWAP Rn ; LSL  Rn ; LSL Rn ; ANDI Rn, -64
Rn << 7:   ROR  Rn ; CLR  Rn ; ROR  Rn

Rn >> 1:   LSR  Rn
Rn >> 2:   LSR  Rn ; LSR  Rn
Rn >> 3:   LSR  Rn ; LSR  Rn ; LSR  Rn
Rn >> 4:   SWAP Rn ; ANDI Rn, 15
Rn >> 5:   SWAP Rn ; LSR  Rn ; ANDI Rn, 7
Rn >> 6:   SWAP Rn ; LSR  Rn ; LSR  Rn ; ANDI Rn, 3
Rn >> 7:   ROL  Rn ; CLR  Rn ; ROL  Rn

Similarly, both -1-Rn and -(Rn+1) are compiled to a single instruction COM Rn.

(Each of the instructions used above takes only one cycle.)

While AVR does have a two-cycle signed and unsigned multiplication, the 16-bit result is always in register pair R1:R0.  For AVR-GCC, R1 will need to be cleared to zero.  So, technically, one could implement e.g.
Rn << 6:   LDI  R1, 64 ; MUL  Rn, R1 ; MOV  Rn, R1 ; CLR  R1
but it would be just as long, and take an extra cycle to execute.
 

Offline RenThraysk

  • Regular Contributor
  • *
  • Posts: 107
  • Country: gb
Re: From bit position to array index
« Reply #34 on: October 13, 2020, 05:21:49 pm »
Seems the AVR GCC 5.4.0 compiler version on godbolt is generating the strange code for >> 4.
Local compiler is correctly recognising the optimisation of the shift to swap & and.
Other versions 4.6.4 & 4.5.4 on godbolt also recognise the optimisation.
Bizzare.

Edit: Further investigation

avr-g++ 5.4.0 generates the strange code for y = x >> 4, and is what godbolt uses when says it's using AVR gcc 5.4.0.
avr-gcc 5.4.0 generates expected optimised code.
« Last Edit: October 13, 2020, 11:02:15 pm by RenThraysk »
 

Offline RenThraysk

  • Regular Contributor
  • *
  • Posts: 107
  • Country: gb
Re: From bit position to array index
« Reply #35 on: October 13, 2020, 05:50:29 pm »
this is what i was talking about "a bit of both" saving us time from some exercise.. but just looking at your table i can see some problem and cant handle unexpected errors as brucehoult highlighted, for example if x = 0 (no bit set) or x = 3 (more than one bits are set) your code still return valid index value. since you are using lookup table and includes invalid bits value, why dont you use as brucehoult's table some using -1, or just put value greater than 7 (unsigned operations) to flag error?

Using 0 is just what used to, main language is Go these days and it's intrinsic function bits.Len(x) returns minimum number of bits required to represent x. If x = 0 then it returns 0.

« Last Edit: October 13, 2020, 05:54:18 pm by RenThraysk »
 

Online radiolistener

  • Super Contributor
  • ***
  • Posts: 3999
  • Country: ua
Re: From bit position to array index
« Reply #36 on: October 14, 2020, 10:11:09 am »
I have similar question, I need solution for verilog, but I will be glad for solutions in any language.

The problem is the following, there is about 1024 bit register which consists waveform fingerprint, and I need to convert it into binary representation for period of 1. Actually this is output of TDC (time to digital converter) ADC implemented in FPGA. And I want to convert it into usable format.

For example:
input: 0000000010000000000001000000000...
output: 1

input: 0000000011000000000001100000000...
output: 2

input: 0000000011110001111000000000000...
output: 4

etc...

The problem here is that register is very long and I need to transform it at high speed rate (about 100-200 MHz), so there is no way to serialize it with counting, because it will take at least 1024 clock cycles. There is also no way to predict where 1 starts.

Any idea?  :)
« Last Edit: October 14, 2020, 10:15:09 am by radiolistener »
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11713
  • Country: my
  • reassessing directives...
Re: From bit position to array index
« Reply #37 on: October 14, 2020, 10:30:35 am »
with 128 bytes binary search for nonzero, thats log2(128)=7 steps, and then some housekeeping job to see how much adjacent 1's.. no experience in verilog programming though, not sure how to access array's elements.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4468
  • Country: nz
Re: From bit position to array index
« Reply #38 on: October 14, 2020, 12:06:32 pm »
with 128 bytes binary search for nonzero, thats log2(128)=7 steps, and then some housekeeping job to see how much adjacent 1's.. no experience in verilog programming though, not sure how to access array's elements.

You can't binary search for nonzero because if you don't find 1s at the midpoint you have no way to know whether you shoudl search the first half or the second half.
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11713
  • Country: my
  • reassessing directives...
Re: From bit position to array index
« Reply #39 on: October 14, 2020, 12:33:16 pm »
right with no particular ordering or clue, you could end up worst case 128 steps :palm:
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4468
  • Country: nz
Re: From bit position to array index
« Reply #40 on: October 14, 2020, 12:54:02 pm »
I have similar question, I need solution for verilog, but I will be glad for solutions in any language.

The problem is the following, there is about 1024 bit register which consists waveform fingerprint, and I need to convert it into binary representation for period of 1. Actually this is output of TDC (time to digital converter) ADC implemented in FPGA. And I want to convert it into usable format.

For example:
input: 0000000010000000000001000000000...
output: 1

input: 0000000011000000000001100000000...
output: 2

input: 0000000011110001111000000000000...
output: 4

etc...

The problem here is that register is very long and I need to transform it at high speed rate (about 100-200 MHz), so there is no way to serialize it with counting, because it will take at least 1024 clock cycles. There is also no way to predict where 1 starts.

Any idea?  :)

What do you want to do if there are multiple runs of 1s in the input but they have different lengths? Average them? Also, what if the register starts with 1s and/or ends with 1s, in which case you may have only a partial run there. For maximum accuracy you should ignore them, but then you have to hope there are other runs in the middle.

Anyway the problem reduces to counting how many 1s there are in total (ignoring any at the very start or very end), and how many transitions from 0 to 1 (or from 1 to 0, which will be the same if you ignore runs at the ends)

You can do this very simply with a state machine linked to three counters:

Initialize runs=0, bits=0, partial=0, state=0

state 0: 0 -> state=1; 1 -> state=0; // skip initial 1s
state 1: 0 -> state=1; 1 -> ++runs, ++bits, ++partial, state=2; // skip a run of 0s
state 2: 0 -> partial=0, state=1; 1 -> ++bits, ++partial, state=2 // count run length

End of input encountered (in any state):
if (partial != 0){bits -= partial; --runs;}
output = bits/runs;


Translating that directly to verilog implemented in an FPGA shouldn't have any problems keeping up.

If you want to run it in software then you can process (for example) a byte at a time by having several 256 element lookup tables for each state (at the start of that input byte), telling which state is next (0..2), how much to add to runs (0..4), how much to add to bits (0..8), and how much to add to or copy to partial (1..8 or 0..7).

If you're on a 32 bit CPU you could encode all that into a single 32 bit value and update everything in parallel with branchless code at I think just under one instruction per input bit.
 

Online radiolistener

  • Super Contributor
  • ***
  • Posts: 3999
  • Country: ua
Re: From bit position to array index
« Reply #41 on: October 14, 2020, 05:25:06 pm »
What do you want to do if there are multiple runs of 1s in the input but they have different lengths? Average them?

I never worked with processing of raw TDC results, so I don't know what exactly I need to count, this is my first experience with TDC. At a glance there is need to calculate signal period or positive and negative pulse period. But I'm not sure, because I never worked with such problem before. So, it needs to perform experiments.

You can do this very simply with a state machine linked to three counters:

The problem here is that I need to get result after max 2-4 steps.

If you're on a 32 bit CPU you could encode all that into a single 32 bit value and update everything in parallel with branchless code at I think just under one instruction per input bit.

This is not CPU. This is logic hardware circuit. So, technically this is 1024 bit register. I cannot split this operation into many cycles, because there is limit for the frequency. I need to do it within 2-4 steps, but no more. On the other hand I can do many logic operations simultaneously. I can use many LUT tables to implement decoders, but there is limited memory (let's say 128 KB for all things).

Is it possible to count length of 1 series with logic and small LUT tables?

For 8 bit it can be easily done with 256 byte LUT. 
But I need to do it for 1024 bit, so it requires 2^1024= 1.79^308 memory cells  :-BROKE
« Last Edit: October 14, 2020, 05:48:42 pm by radiolistener »
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6850
  • Country: fi
    • My home page and email address
Re: From bit position to array index
« Reply #42 on: October 14, 2020, 11:18:26 pm »
Edited:  The two previous versions were incorrect!

I would use a Game-Of-Life approach for this.

Essentially, you have a 1024×1024 cell/bit matrix, where the sample data is at the very top, and changes every sample cycle.  Each bit is handled by a cell that has three inputs, and two outputs:

    ┼─┼─┼─┼   A  0 0 0 0 1 1 1 1
    │A│B│C│   B  0 0 1 1 0 0 1 1
    ┼─┼─┼─┼   C  0 1 0 1 0 1 0 1
      │X│     X  0 0 0 0 0 0 1 1    X = B & A
      ┼─┼     Y  0 0 1 0 0 0 0 0    Y = B & !A & !C

The X output will be an input to three cells on the next level (down left, down, and down right), and the Y output will go to a summer on each row (producing the number of set Y's on that row).  The number of set Y's on row R means that the sample R cycles before had sum(Y) sequences of length R.

Essentially, when the data percolates down, each run of bits set shrinks from the left, and when they vanish, the Y output gets set on the row matching their length.
Therefore, on the top row of cells, the number of Y set is the number of 1-bit sequences in the data; on the second row of cells, the number of 2-bit sequences, and so on.

Note that on top row (of processing cells), at most one of each consecutive pair of Y's can be set at a time, so pairs can be OR'd together.  There are 512 of Y inputs to count on that row, and the count is between 0 and 256, inclusive.
On the second row, at most one of each set of three Y's can be set at a time, so triplets can be OR'd together.  The count is between 0 and 171, inclusive.
On the seventh row, each consecutive set of eight Y's are OR'd together.  The count is between 0 and 32, inclusive.

Also note that any contiguous sequence of zeros will grow by one when percolating down.  This means some sort of compression for long runs,  making the cell grid a triangle rather than a square, and reducing the total number of cells, is possible.

The total latency is 1024 cycles (or more), but each cycle you get new results.  There is no delay, because each stage works on data in parallel, just data from a different input sample set!

To count the Y signals on each row, a triangular cell map could be used; the top row would be the Y inputs on each sample cycle, and the actual count pops out at the end. 

Yes, it adds up very, very fast.  I'm not sure if this is feasible given current FPGA sizes – I'm just a hobbyist with zero experience with FPGAs, but lots of experience in programming (distributed simulations) –, but this is what I'd try.

Because the 1024×1024 bit map is simply a rectangular region of trivial logic gates, it might need an ASIC.  The Y signal summers are just counting the number of set inputs for each sample cycle, so they will need multiple stages and more complex circuitry, but each row works in parallel, and the more there is to do, the more stages you can use before the result is needed (assuming the counts of set bit sequence lengths are provided at the same time for each input, although a staggered mode is easy to un-stagger in software).  So, I'm absolutely certain this is doable, even if the circuitry only runs at twice or thrice the sample frequency.
« Last Edit: October 15, 2020, 05:48:40 am by Nominal Animal »
 
The following users thanked this post: radiolistener

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4468
  • Country: nz
Re: From bit position to array index
« Reply #43 on: October 14, 2020, 11:42:51 pm »
It's a bit difficult if you don't know what you are actually supposed to be calculating :-)

If you need it to be fast and you don't care how much hardware it takes then you basically just do a binary tree structured reduction of what I showed a linear reduction of.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4468
  • Country: nz
Re: From bit position to array index
« Reply #44 on: October 14, 2020, 11:51:34 pm »
I would use an FPGA and a Game-Of-Life approach for this.

Essentially, you have a 1024×1024 bit matrix, where the sample data is at the top, and changes every sample cycle.  Each bit is handled by a cell that has three inputs, and two outputs:

    ┼─┼─┼─┼   A   0 0 0 0 1 1 1 1
    │A│B│C│   B   0 0 1 1 0 0 1 1
    ┼─┼─┼─┼   C   0 1 0 1 0 1 0 1
        │X│   X   0 0 0 1 0 0 0 1    X = C & B & (!A)
        ┼─┼   Y   0 1 0 0 0 1 0 0    Y = C & (!B)

The X output will be an input to three cells on the next level, and the Y output will go to a summer on each row (producing the number of set Y's on that row).

I agree with your general idea, but don't quite follow your detailed logic.

In particular, the last value in your X row doesn't agree with the formula -- when A, B, C are all 1 then C & B & !A is 0. Only 011 can produce a 1 for X.
 
The following users thanked this post: Nominal Animal

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6850
  • Country: fi
    • My home page and email address
Re: From bit position to array index
« Reply #45 on: October 15, 2020, 08:52:11 am »
Okay, I had a bit of problems explaining my logic above, but it is now fixed.

I know, because I wrote a simulator in C, that takes a single input on the command line (only ones and zeros in arguments are considered, others are ignored), and simulates one time step of each (identical) row in the matrix.  It assumes there are implicit leading and trailing zeros in the input.
Code: [Select]
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Each input word contains 62 bits, aligned to the least significant bit.  Two highest bits are unused. */
typedef  uint64_t  u62;

static inline unsigned int  bits_set(const u62 u)
{
#if defined(__GNUC__)
    if (sizeof u >= sizeof (unsigned long long))
        return __builtin_popcountll(u);
    if (sizeof u >= sizeof (unsigned long))
        return __builtin_popcountl(u);
    if (sizeof u >= sizeof (unsigned int))
        return __builtin_popcount(u);
#endif
    uint64_t  m = u;

    m = (m & UINT64_C(0x5555555555555555)) + ((m >>  1) & UINT64_C(0x5555555555555555));
    m = (m & UINT64_C(0x3333333333333333)) + ((m >>  2) & UINT64_C(0x3333333333333333));
    m = (m & UINT64_C(0x0F0F0F0F0F0F0F0F)) + ((m >>  4) & UINT64_C(0x0F0F0F0F0F0F0F0F));
    m = (m & UINT64_C(0x00FF00FF00FF00FF)) + ((m >>  8) & UINT64_C(0x00FF00FF00FF00FF));  /* 0x001F001F001F001F */
    m = (m & UINT64_C(0x0000FFFF0000FFFF)) + ((m >> 16) & UINT64_C(0x0000FFFF0000FFFF));  /* 0x0000003F0000003F */
    m = (m & UINT64_C(0x00000000FFFFFFFF)) + ((m >> 32) & UINT64_C(0x00000000FFFFFFFF));  /* 0x000000000000007F */

    return m;
}

static unsigned int  process_row(u62 state[], const size_t words)
{
    size_t        i;
    unsigned int  n = 0;

    /* Clear the extra bits. */
    for (i = 0; i < words; ++i)
        state[i] &= UINT64_C(0x7FFFFFFFFFFFFFFE);

    /* Copy the LSB down, and MSB up. */
    for (i = 1; i < words; ++i) {
        state[i-1] |= (state[i] & 2) << 62;
        state[i] |= (state[i-1] >> 62) & 1;
    }

    /* Process each word. */
    for (i = 0; i < words; ++i) {
        const u62  s = state[i];

        /* Calculate the outputs, shifting the word back down.
           Ensure the extra bits will be zero. */
        state[i] = ((s >> 1) & UINT64_C(0x7FFFFFFFFFFFFFFF) & s);

        /* Calculate the number of dying sequences, and add to the sum. */
        n += bits_set(s & (~(s >> 1)) & (~(s << 1)) & UINT64_C(0x7FFFFFFFFFFFFFFE));
    }

    return n;
}

int main(int argc, char *argv[])
{
    u62    *word;
    size_t  w, words;
    size_t  bit, bits;
    int     arg;

    if (argc < 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help") || !strcmp(argv[1], "/?")) {
        const char *argv0 = (argc > 0 && argv[0] && argv[0][0]) ? argv[0] : "(this)";

        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv0);
        fprintf(stderr, "       %s BIT STRING ...\n", argv0);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program returns the counts of consecutive runs of ones, in order of length.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "All characters except 0 and 1 in input parameters are ignored.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    /* First, count the number of bits in input. */
    for (bits = 0, arg = 1; arg < argc; arg++) {
        const char *a = argv[arg];
        while (*a) {
            if (*a == '0' || *a == '1')
                bits++;
            a++;
        }
    }
    if (bits < 1) {
        fprintf(stderr, "Invalid input!\n");
        return EXIT_FAILURE;
    }

    /* Number of words needed. Round UP, and add an extra word for paranoia. */
    words = (bits + 61) / 62;
    word = calloc(words + 1, sizeof (u62));
    if (!word) {
        fprintf(stderr, "Not enough memory for %zu bits (%zu 62-bit words).\n", bits, words);
        return EXIT_FAILURE;
    }

    /* We store the bits in the middle bits of each word, 1..62, inclusive. */
    for (w = 0, bit = 0, arg = 1; arg < argc; arg++) {
        const char *a = argv[arg];
        while (*a) {
            if (*a == '0' || *a == '1') {
                /* Advance to next word? */
                if (bit >= 62) {
                    w++;
                    bit = 0;
                }

                /* Set bit. */
                word[w] |= ((uint64_t)(*a == '1')) << (++bit);
            }
            a++;
        }
    }

    /* Paranoid safety check. */
    if (bits != 62*w + bit || w >= words) {
        fprintf(stderr, "Internal bug parsing input.  Counted %zu bits, but parsed %zu.\n", bits, 62*w+bit);
        return EXIT_FAILURE;
    }

    /* We simulate each row exactly once.  Because of this, we only need one row buffer. */
    for (size_t row = 1; row <= bits; row++) {
        const unsigned int  n = process_row(word, words);
        if (n == 1)
            printf("%9u sequence  of length %zu\n", n, row);
        else
        if (n > 0)
            printf("%9u sequences of length %zu\n", n, row);
    }

    return EXIT_SUCCESS;
}

For example, if you compile and run the above using e.g. ./above 0101 11 111010111, the output is
        2 sequences of length 1
        1 sequence  of length 3
        1 sequence  of length 6

Because the simulator simulates each row for only one step, you get the counts in order of sequence length.

Since each row is dependent on the output of the row above, all rows can be implemented in parallel.  And, as the above code shows, they can be identical – the same process_row() function is called for each row; it modifies the state in-place.

Essentially, this implementation uses 64-bit units, for each input of 62 unique bits (the two extra bits are the extra A and C inputs from the preceding and following units), producing 62-bit result, and a single count (number of Y bits set on the line).

As you can see, the logic for output X with inputs B (above), A (above left), and C (above right) is
    X = B & A
and for output Y (sequence "died"),
    Y = B & !A & !C

This means there are a huge number of options on how to implement this.  Indeed, the 1024×1024 matrix itself is just 1,048,576 AND gates (running at the input sample rate, if the output of each gate is latched, storing one bit of the matrix).  (That does not include the logic needed to count the number of Y set on each row, though.)
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4468
  • Country: nz
Re: From bit position to array index
« Reply #46 on: October 15, 2020, 11:03:35 am »
I applaud the effort of reducing the concept to practice and proving it works. Well done that dude.
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf