Author Topic: Bitfield compression using FPGA  (Read 3365 times)


Offline blueskullTopic starter

  • Supporter
  • ****
  • Posts: 367
  • Country: cn
  • BA7LKP
Bitfield compression using FPGA
« on: January 09, 2018, 11:58:14 am »
I'm working on a tricky hobby project that involves bitfield compression.
The input data is a 4-bit value that ranges from 0 to 14; 15 is not possible. Therefore, there are only 15 states rather than 16, which means the 4-bit integer carries only about 3.91 bits of entropy (97.67% encoding efficiency).

Now my question is: is there an algorithm that lets me aggregate a bunch of these 3.91-bit integers and pack them into fewer 4-bit codes at higher coding efficiency?
Say, to encode 128 of the 3.91-bit values into 126 4-bit codes, and spare two 4-bit slots for other use?

Any suggestions? I will be using Verilog to implement this.
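A quick back-of-the-envelope check of that packing budget (sketched in Python rather than Verilog):

```python
import math

# Each sample has 15 possible states, so it carries log2(15) bits of entropy.
BITS_PER_SAMPLE = math.log2(15)                  # ~3.9069 bits

# 128 such samples need this many bits in total...
needed_bits = math.ceil(128 * BITS_PER_SAMPLE)   # 501 bits

# ...which fits in 126 4-bit slots (504 bits), sparing 2 of the
# original 128 slots, exactly as the question hopes.
available_bits = 126 * 4                         # 504 bits
```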
 

Offline fourtytwo42

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: gb
  • Interested in all things green/ECO NOT political
Re: Bitfield compression using FPGA
« Reply #1 on: January 09, 2018, 01:15:44 pm »
I am amazed you have any communication bandwidth or storage space issue that justifies the resources involved :)
 

Offline xaxaxa

  • Regular Contributor
  • *
  • Posts: 248
  • Country: ca
Re: Bitfield compression using FPGA
« Reply #2 on: January 09, 2018, 01:26:13 pm »
Solved. Did it the stupid way -- multiply and add. Input: 16 samples of 3.91 bits; output: 1 sample of 63 bits.
Synplify says 52MHz at 194 LUTs on iCE40LP, not bad considering I only need 10MHz throughput.

Not stupid at all -- this is called arithmetic coding, which is used in JPEG compression amongst other things:

https://en.wikipedia.org/wiki/Arithmetic_coding
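The multiply-and-add packer is easy to model in Python (a behavioural sketch of the arithmetic, not the poster's actual Verilog):

```python
def pack15(samples):
    """Pack 16 base-15 digits (each 0..14) into one 63-bit word by
    repeated multiply-and-add, i.e. a base-15 to binary conversion."""
    assert len(samples) == 16 and all(0 <= s <= 14 for s in samples)
    word = 0
    for s in samples:
        word = word * 15 + s     # shift one base-15 digit in
    return word

# The largest possible packed value fits in 63 bits, as reported:
assert (15 ** 16 - 1).bit_length() == 63
```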
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: Bitfield compression using FPGA
« Reply #3 on: January 09, 2018, 03:02:58 pm »
Solved. Did it the stupid way -- multiply and add. Input: 16 samples of 3.91 bits; output: 1 sample of 63 bits.
Synplify says 52MHz at 194 LUTs on iCE40LP, not bad considering I only need 10MHz throughput.

This is not so stupid. It's actually a base conversion: you're converting from base-15 to base-16 (or base-2; it's equivalent here), which is a modulo-15 computation. The downside is that it costs a 4-bit x 63-bit multiplier, but that doesn't seem to matter in your case (I'm not sure the iCE40LP has hardware multipliers, but performance and area seem fine).

Just a couple of remarks based on the code you posted (I'm not very proficient in Verilog -- I'm more of a VHDL guy -- so I may miss something, especially on the "reset" point):
  • I think you should multiply by 15 instead of 14. From what I gathered, 'data_in' is a modulo-15 value.
  • There are two instances of the multiply-and-add block in your code. Since it's the same expression, it's probably optimized down to a single multiply-and-add with a MUX at the synthesis stage, but you may want to factor this "costly" part out just in case. It could be done in a separate process, since it's computed at each rising clock edge. The result would be stored in an additional register, which in turn would be assigned to 'dbuf' or 'data_out', depending on 'cnt', in your main process.
  • There is no 'reset' handling, so the initial state of your module seems unknown. It may work on most FPGAs, since they usually have internal global resets (and your expected initial values here are zero), but I recommend using proper reset signals anyway.
  • I'm guessing 'clk_out' signals when the 'data_out' value can/should be read (on the rising edge, I suppose). That's fine except for the very first value: since 'clk_out' is raised at half count, you would read an incomplete value. You could add some means of handling the first value in another part of your code. Up to you.
  • As for performance and area, it seems you checked those at the synthesis stage, but I suggest (maybe you did so already) going all the way through place-and-route; the result may not be as favorable (though most likely still largely good enough for a 10 MHz goal).
« Last Edit: January 09, 2018, 04:11:35 pm by SiliconWizard »
 
The following users thanked this post: blueskull

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Bitfield compression using FPGA
« Reply #4 on: January 09, 2018, 04:26:22 pm »
Perhaps giving 15 some special meaning, such as "repeat the last code 2 times", "repeat the last code 3 times", "emit 2 zero codes", or even "treat the next code as an RLE count", may lead to better compression and simpler calculations. It all depends on the nature of the data you're transmitting.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: Bitfield compression using FPGA
« Reply #5 on: January 10, 2018, 01:28:31 pm »
I don't use process or tasks, I only use flat code inside a module. But yes, I should combine them.

As I said, I don't know Verilog that well. In VHDL, that would be a separate process, which would ensure there is only one instance of the multiply-and-add (not relying on particular optimizations) and that it gets registered, probably letting the design reach higher speeds. But since it would be registered, you'd get a one-clock lag, which would have to be taken into account.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Bitfield compression using FPGA
« Reply #6 on: January 10, 2018, 03:25:06 pm »
So, naturally I have the second question: how to decode the bit stream without expensive % and / operations?
Since 15 is very close to 16, I think there has to be a simple way to do /15 and %15; I just don't know how to do it.
Any suggestions? Thanks in advance.

This decoding process will surely stomp on your guaranteed compression, because it's really hard to make it fast enough to keep up with the transmission. Perhaps impossible, if the transmission speed is in the Gb/s range.

You can speed it up with a binary division tree: if you have n base-15 digits, divide by 15^(n/2), then deal with each half dividing by 15^(n/4), etc.

You can make it even faster if you do all the divisions in parallel (by 15, 225, 3375, etc.), but it'll take a lot of fabric.
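The division tree can be sketched in Python like so (a software model; in fabric, each level would be a constant divider and the halves can run in parallel):

```python
def unpack15_tree(word, n=16):
    """Recover n base-15 digits from a packed word using a binary division
    tree: one divmod by 15**(n/2) per level instead of peeling off a
    single digit at a time (log2(n) levels instead of n steps)."""
    if n == 1:
        return [word]
    half = n // 2
    hi, lo = divmod(word, 15 ** half)    # split into two half-width problems
    return unpack15_tree(hi, n - half) + unpack15_tree(lo, half)
```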

Perhaps overclocking the system by ~3% instead of doing the compression is the better option?
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Bitfield compression using FPGA
« Reply #7 on: January 10, 2018, 04:05:47 pm »
So, naturally I have the second question: how to decode the bit stream without expensive % and / operations?
Since 15 is very close to 16, I think there has to be a simple way to do /15 and %15; I just don't know how to do it.
Any suggestions? Thanks in advance.

It does not make sense to care about "4 bits carrying only 3.91 bits of entropy" while using 8b/10b in a higher layer. Better to implement 64b/66b or 128b/130b and drop the compression idea.
« Last Edit: January 10, 2018, 04:09:53 pm by ogden »
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4531
  • Country: au
    • send complaints here
Re: Bitfield compression using FPGA
« Reply #8 on: January 10, 2018, 08:44:14 pm »
So, naturally I have the second question: how to decode the bit stream without expensive % and / operations?
Since 15 is very close to 16, I think there has to be a simple way to do /15 and %15; I just don't know how to do it.
Any suggestions? Thanks in advance.
No advice, but the target to aim for is several dozen LUTs. Decoding maps to FPGA logic well, but the code to produce it is either complicated (self-generating) or long-winded if written by hand.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Bitfield compression using FPGA
« Reply #9 on: January 10, 2018, 10:08:32 pm »
It does not make sense to care about "4 bits carrying only 3.91 bits of entropy" while using 8b/10b in a higher layer. Better to implement 64b/66b or 128b/130b and drop the compression idea.

This is a HiFi audio DAC: the 4-bit mixed-signal engine for a sigma-delta DAC implemented in FPGA and DSP.
The low-jitter master clock is 122.88MHz, and I want to clock everything based on this.
In my particular case, I want to transfer 6.144Msps of 4-bit samples to the mixed-signal engine through a transformer-isolated channel. This calls for a DC-free encoding.
Since my master clock is 122.88MHz, and I want to oversample the input bit stream to get rid of phase alignment dependency, I need at least 4x the input bit stream clock to reliably sample the input data stream.
Therefore, my maximum allowable encoded bitstream is 30.72Mbps, exactly 1.25x the 24.576Mbps data rate. This allows me to use 8b/10b.
64b/66b is great, but it seems fairly complicated, and I'd rather reuse my existing IPs.

Where will you find the bandwidth to include the comma symbols, so you can pick out the framing of the 8b/10b coding?

I guess you are going to map one of the data symbols onto a comma and just hope it shows up in the stream?

Or is this the original reason why you want to pack the samples -- so you can glue eighty-odd 0.1-of-a-bit leftovers together to get the space for a comma without changing the data rate?

Can't help but feel that time spent designing a 15-value-to-5-bit coding scheme would make all your problems go away...
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Bitfield compression using FPGA
« Reply #10 on: January 10, 2018, 10:24:35 pm »
Therefore, my maximum allowable encoded bitstream is 30.72Mbps, exactly 1.25x the 24.576Mbps data rate. This allows me to use 8b/10b.
64b/66b is great, but it seems fairly complicated, and I'd rather reuse my existing IPs.

You just said that you have plenty of bandwidth. Then why bother compressing away that 1 state out of 16?
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Bitfield compression using FPGA
« Reply #11 on: January 10, 2018, 11:23:49 pm »
Therefore, my maximum allowable encoded bitstream is 30.72Mbps, exactly 1.25x the 24.576Mbps data rate. This allows me to use 8b/10b.
64b/66b is great, but it seems fairly complicated, and I'd rather reuse my existing IPs.
How is this for a coding scheme? It isn't as nice as 8b/10b, but it seems to meet your needs (simple, 15 values in 5 bits, can be AC coupled, small to implement)

A 4b/5b coding
Start with a running disparity (DP) of -1.

Map your fifteen raw values onto fifteen 5-bit symbols, depending on the current value of the running disparity DP:

Code: [Select]
       DP is + / DP is -   Change in DP
S0   => 00000   / 11111     -5 / +5
S1   => 00001   / 11110     -3 / +3
S2   => 00010   / 11101     -3 / +3
S3   => 00011   / 11100     -1 / +1
S4   => 00100   / 11011     -3 / +3
S5   => 00101   / 11010     -1 / +1
S6   => 00110   / 11001     -1 / +1
S7   => 11000   / 00111     -1 / +1
S8   => 01000   / 10111     -3 / +3
S9   => 01001   / 10110     -1 / +1
S10  => 10100   / 01011     -1 / +1
S11  => 01100   / 10011     -1 / +1
S12  => 10010   / 01101     -1 / +1
S13  => 10001   / 01110     -1 / +1
S14  => 10000   / 01111     -3 / +3

It helps to make sure that S0 is the most common symbol in the data stream.

Note: Just found this is flawed, so don't use it - framing can be ambiguous

You will note that one pair of 5-bit symbols is missing:
Code: [Select]
SERR => 01010   / 10101    -1/+1  INVALID

How to frame the data on receiving end
- When you see 0000011111 or 1111100000 (two S0s in a row), you have confirmed framing; this bit pattern can only occur when two S0 symbols are in a row.
- When you see no SERR symbols, it is pretty likely (but not certain) that you have the correct framing.
- When you see an SERR, you should bitslip by one bit.

Framing is lost
- When the disparity falls out of bounds (it should stay between -4 and +4, though the RX should allow +/-8, as it doesn't know where the stream started)
- That can be simplified to "when you see 00000-000xx or 11111-111xx", to avoid keeping track of disparity
- When you see any SERR symbols

Benefits
- DC balanced
- Data stream can be inverted by TX and/or RX without any issues
- Run length limited (to eight bits at a glance), so clock recovery possible.
- Encoding is very simple: a four-bit register for the DP value, a 4-to-5-bit LUT for the symbol, a 4-to-4-bit LUT for the change in disparity, and a four-bit signed adder.
- Basic decoding is very simple, ideal for the small lookup tables (a 5-to-4-bit LUT for decode, a 5-to-1-bit LUT for the "valid symbol" signal, and a few bits of state).
- Only a small FSM and shift register are needed to control bit-slipping and get the framing sorted: lock if you see 00000-11111 or 11111-00000 (two S0 symbols in a row), and maybe if no SERR is seen for a few thousand symbols; unlock if SERR, 00000-000xx, or 11111-111xx is seen; and bitslip if SERR is seen.
- No multiplications or divisions needed in encoding or decoding
- Any unexpected glitches in the encoder or decoder will flush out very quickly, as there is minimal state being tracked
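For illustration, here is a Python model of the encoding rule (a sketch only, and keep in mind the note above that the scheme as posted has a framing flaw):

```python
# Zero-heavy 5-bit codes for S0..S14, from the table above; the one-heavy
# alternative for each symbol is its bitwise complement.
CODES = ["00000", "00001", "00010", "00011", "00100", "00101", "00110",
         "11000", "01000", "01001", "10100", "01100", "10010", "10001",
         "10000"]

def encode_4b5b(samples, dp=-1):
    """Encode values 0..14 as 5-bit symbols, picking each symbol or its
    complement so the running disparity (ones minus zeros) is driven
    back toward zero. Starts at DP = -1 as in the post."""
    out = []
    for s in samples:
        code = CODES[s]
        if dp < 0:                                   # need more ones on the wire
            code = code.translate(str.maketrans("01", "10"))
        out.append(code)
        dp += code.count("1") - code.count("0")      # update running disparity
    return out, dp
```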
« Last Edit: January 11, 2018, 03:09:11 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: blueskull

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Bitfield compression using FPGA
« Reply #12 on: January 11, 2018, 12:46:18 am »
How is this for a coding scheme? It isn't as nice as 8b/10b, but it seems to meet your needs (simple, 15 values in 5 bits, can be AC coupled, small to implement)

OP wanted to compress 4 bits into less than 4 (closer to 3.91 bits), but you are offering 5 :-DD  Just kidding.

I don't see why 4b/5b is better than 8b/10b, except perhaps a smaller FSM. 4b/5b does not have control symbols that could be used for framing or signalling.

[edit] I suggest using 8b/10b for coding and forgetting about compression.
« Last Edit: January 11, 2018, 12:48:25 am by ogden »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Bitfield compression using FPGA
« Reply #13 on: January 11, 2018, 01:57:42 am »
How is this for a coding scheme? It isn't as nice as 8b/10b, but it seems to meet your needs (simple, 15 values in 5 bits, can be AC coupled, small to implement)

OP wanted to compress 4 bits into less than 4 (closer to 3.91 bits), but you are offering 5 :-DD  Just kidding.

I don't see why 4b/5b is better than 8b/10b, except perhaps a smaller FSM. 4b/5b does not have control symbols that could be used for framing or signalling.

[edit] I suggest using 8b/10b for coding and forgetting about compression.
My rough understanding is as follows:

Blueskull wants to transfer 6.144M samples per second, each with a range of 0 to 14, using a 122.88MHz master clock. That gives 20 cycles per sample, and allowing for 4x oversampling on the RX, it gives a budget of 5 bits to encode each sample (or 10 bits per 2 samples, if you like).

You can almost do that with 8b/10b by using each symbol to encode two four-bit samples, except you don't have the bandwidth to include the comma symbols that are needed every now and then to recover framing. You can't do that on a link that is always full of data symbols. That is what started Blueskull's quest for compression: the need to squish together a lot of 0.093-bit leftovers to get the 8 bits of space required to insert a comma, which can be done once every 11 symbols or so.
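The clocking arithmetic works out as follows (a Python sketch of the numbers quoted above):

```python
MASTER_HZ  = 122_880_000   # low-jitter master clock
SAMPLE_HZ  = 6_144_000     # 4-bit samples per second
OVERSAMPLE = 4             # RX oversamples the line 4x for phase independence

cycles_per_sample = MASTER_HZ // SAMPLE_HZ              # 20 master cycles
line_bits_per_sample = cycles_per_sample // OVERSAMPLE  # 5-bit budget per sample
payload_rate = SAMPLE_HZ * 4                            # 24.576 Mb/s raw data
max_line_rate = SAMPLE_HZ * line_bits_per_sample        # 30.72 Mb/s encoded
```

The 30.72/24.576 ratio is exactly 1.25, which is why 8b/10b just barely fits, with nothing left over for commas.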

That 4b/5b coding scheme might work fine, as long as the data going over it contains, every now and then, two of the S0 symbols in a row, allowing framing to be reliably determined.

The other way would be some sort of feedback, so the TX could transmit 8b/10b commas until an "rx synced" signal is asserted, then start sending data frames. This could completely avoid the need for compression...
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline mariush

  • Super Contributor
  • ***
  • Posts: 5029
  • Country: ro
  • .
Re: Bitfield compression using FPGA
« Reply #14 on: January 11, 2018, 02:10:14 am »
Does it have to be a 1-2 wire transfer, sending a bit at a time? Can't you just use 4 or 8 wires/pairs and send 4 bits at a time? I'm thinking for very short distances it may not be so difficult.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Bitfield compression using FPGA
« Reply #15 on: January 11, 2018, 06:15:17 pm »
I'm sure audio can be compressed quite a bit. The slew rate is probably limited, so adjacent codes must depend on each other, and the frequency content is likely limited too. Therefore, the real information content is nowhere close to 100%, and if you try to compress, some compression is practically guaranteed -- a similar effect to MP3 files always being smaller than WAV.

Another idea is to make your control bits opportunistic. You select the most common code -- say it is 14. When a 14 comes along, you encode it as either 14 or 15: if the next control bit is 1, you send 15; if it is 0, you send 14. This way you can transmit control information without disturbing the flow. Although the timing of the control information is unpredictable, in the long run it'll work. In the worst case, you can switch the magic code dynamically: say, you start from 14, but if 14 didn't occur in the next 1024 codes, you switch to the most frequent code of the last streak, or similar.
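A minimal Python sketch of this opportunistic encoder (the choice of 14 as the magic code is just an assumption for illustration):

```python
MAGIC = 14  # assumed most common sample value; picked statically here

def encode_opportunistic(samples, ctrl_bits):
    """Piggyback control bits on the data stream: each time the magic code
    shows up, transmit 15 for a control '1' and the magic code itself for
    a control '0'. All other samples pass through untouched."""
    ctrl = iter(ctrl_bits)
    out = []
    for s in samples:
        if s == MAGIC:
            out.append(15 if next(ctrl, 0) else s)   # default to '0' when idle
        else:
            out.append(s)
    return out
```

The decoder is symmetric: a received 15 means "magic code, control bit 1" and a received 14 means "magic code, control bit 0".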
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 7738
  • Country: ca
Re: Bitfield compression using FPGA
« Reply #16 on: January 13, 2018, 04:31:45 am »
4. iCE40LP doesn't have an IO digital delay line or metastability detection, so I have to use oversampling to implement clock-phase-independent data lines. Therefore, Manchester code is out of scope. I want to keep the design's clock tree strictly synchronized, so if I don't absolutely have to, I won't use a PLL. Also, I want to avoid using negedge and posedge together, as a general design rule.

Just out of curiosity, based on something I've done in the past with one transformer carrying both data and clock:
I reserved the positive edge, running at 2x, as my clock, and used one delay line on the falling edge to separate the data bits, using a simple shift on the transmit side. It was a clean trick that let the RX side be nothing more than (at the time) an Altera MAX7128 decoding a 2-bit high-speed serial bus, dividing the pos-edge by 2 for a clean clock reconstruction. The delay line was nothing more than an inductor/resistor between 2 adjacent inputs. No PLLs. Clock reconstruction was what mattered most at the time.

I take it you are trying to avoid digital isolators, both for the super-kV isolation of a split-bobbin transformer and to avoid the inherent jitter noise of the data isolators?
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: Bitfield compression using FPGA
« Reply #17 on: January 13, 2018, 03:58:46 pm »
So, naturally I have the second question: how to decode the bit stream without expensive % and / operations?
Since 15 is very close to 16, I think there has to be a simple way to do /15 and %15; I just don't know how to do it.
Any suggestions? Thanks in advance.

There may be other ways of doing this, but I liked the challenge.

Well, I've thought of one way to compute %15 [or %(2^n-1) in general].
It's based on residues (as in number theory): add the 4-bit segments of the number together until it reduces to just one 4-bit segment, and that will be your number modulo 15 (with just the particular case that a result of 15 must be treated as 0).

In your case, you start with 16 4-bit segments that you add together. That reduces to an 8-bit number (because the worst case is 15x15+7 = 232, fifteen full nibbles plus the 3-bit top segment). Then add the two 4-bit segments of this; that in turn reduces to a 5-bit number (because it can't be greater than 0xF+0xF = 0x1E). A last addition of the 1-bit and 4-bit parts reduces it to a 4-bit number (because the worst cases are 1+0xE = 0xF and 0+0xF = 0xF). Don't forget that if you eventually get 0xF, it must be treated as 0.

Those additions can all be computed in just one clock cycle, and since they are all on just a few bits, that shouldn't be very expensive in terms of area or delay. (It's 3 levels of logic, though; you could always pipeline them if it doesn't meet your timing requirements, which I doubt. Either way, it's probably a lot lighter than a true integer division.)
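A Python model of the residue trick (iterative rather than the fixed three-level adder tree described above, but the same arithmetic, since 16 ≡ 1 mod 15):

```python
def mod15(x):
    """Compute x % 15 by repeatedly summing 4-bit segments ("casting out
    fifteens"). The single special case: a final sum of 15 means the
    residue is 0."""
    while x > 15:
        x = (x >> 4) + (x & 0xF)   # fold the bottom nibble into the rest
    return 0 if x == 15 else x
```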

As for the divide by 15, it needs some more thinking. No idea for now.

As a side note, the multiplication by 15 for the compression side can be done with a 4-bit left shift (x16) followed by subtracting the number (16-1 = 15), which would be much less expensive than a real multiplication. (The synthesizer may already infer that for you, but I'm not sure, so it's worth a try.)
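The shift-and-subtract identity, as a one-line Python sketch:

```python
def mul15(x):
    """15*x computed as a shift and subtract: 15*x = 16*x - x = (x << 4) - x.
    In fabric, this replaces the packer's multiplier with one subtractor."""
    return (x << 4) - x
```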

« Last Edit: January 13, 2018, 04:04:20 pm by SiliconWizard »
 

