Author Topic: Divide/Mod a 32-bit value by 1000. (Read 3375 times)

hamster_nz · « **on:** July 03, 2021, 04:58:03 am »

A Block RAM based LUT is a great way to convert the range of 0-999 to BCD and/or ASCII digits - but what's an efficient way to divide & mod a 32-bit value by 1000 in an FPGA?

I have found a nice solution, and want to see if it is well known...

peteru · « **Reply #1 on:** July 03, 2021, 06:01:09 am »

Use 1024 as the divisor and call it close enough.

hamster_nz · « **Reply #2 on:** July 03, 2021, 06:10:17 am »

Quote from: peteru on July 03, 2021, 06:01:09 am

Use 1024 as the divisor and call it close enough.

Very, very close...

Whales · « **Reply #3 on:** July 03, 2021, 06:19:28 am »

x/1000 = x/8/125

Implement 8 as bitshift then only have to divide the now smaller number (24 bits long) by 125?

EDIT: possibly then avoid subtract-compare-subtract-compare iterations by instead doing a x/8/5/5/5 where the divide-by-5 can be done using a LUT applied once per y bits? That could probably be pipelined too?

ataradov · « **Reply #4 on:** July 03, 2021, 06:48:54 am »

A very common way to divide is to multiply by a reciprocal value. There is no way to get a mod this way though. And it uses a hardware multiplier, of course.

hamster_nz · « **Reply #5 on:** July 03, 2021, 07:37:08 am »

Here's the algorithm, ready to implement in the HDL of your choice...

Code: [Select]

uint32_t thousands(uint32_t original) {
   uint32_t result = 0;
   uint32_t units = original;

   // Shifts & binary AND-by-constant is free in FPGA - only addition costs
   result += units>>10;
   units  = (units&0x3FF) + ((units>>10)<<4) + ((units>>10)<<3);

   result += units>>10;
   units  = (units&0x3FF) + ((units>>10)<<4) + ((units>>10)<<3);

   result += units>>10;
   units  = (units&0x3FF) + ((units>>10)<<4) + ((units>>10)<<3);

   result += units>>10;
   units  = (units&0x3FF) + ((units>>10)<<4) + ((units>>10)<<3);

   result += units>>10;
   units  = (units&0x3FF) + ((units>>10)<<4) + ((units>>10)<<3);

   if(units > 999) {
      result++;
      units -= 1000;
   }
   // result = original / 1000
   // units = original % 1000
   if(units > 999 || original != result*1000+units) {
     printf("Error:%u %u %u\n", original, result, units);
     return 0;
   }
   return 1;
}

T3sl4co1l · « **Reply #6 on:** July 03, 2021, 11:32:37 am »

Quote from: ataradov on July 03, 2021, 06:48:54 am

A very common way to divide is to multiply by a reciprocal value. There is no way to get a mod this way though. And it uses a hardware multiplier, of course.

Sure there is!

Here's an AVR assembly version I wrote a while ago:
https://www.seventransistorlabs.com/uDivTen.asm
The main downside is you need a couple more bits in the accumulator than the starting number, to prevent rounding.

Tim

StillTrying · « **Reply #7 on:** July 03, 2021, 11:52:30 am »

Quote from: hamster_nz on July 03, 2021, 06:10:17 am

Very, very close...

Multiplying by 1.024 first.

SiliconWizard · « **Reply #8 on:** July 03, 2021, 05:44:16 pm »

Quote from: peteru on July 03, 2021, 06:01:09 am

Use 1024 as the divisor and call it close enough.

Are you looking for a job at SpaceX?

Anyway, hamster_nz's algorithm is based on this idea: dividing by 1024. But he carefully adjusts the result so that it becomes correct.

hamster_nz · « **Reply #9 on:** July 03, 2021, 08:10:21 pm »

Is is not so careful...

For exevy 1024 you remove, call it 1000 and add 24 back to the remainder

Repeat until remainder is guarrenteed to be < 1024, then remove the last 1900, if present.

SiliconWizard · « **Reply #10 on:** July 03, 2021, 08:52:26 pm »

Quote from: StillTrying on July 03, 2021, 11:52:30 am

Quote from: hamster_nz on July 03, 2021, 06:10:17 am
Very, very close...

Multiplying by 1.024 first.

Well, that's related to what ataradov suggested, in the end.

If you have multipliers available, a common approach is to multiply by the reciprocal. You express the reciprocal in fixed point, multiply by the operand, and then right shift by the number of positions depending on the fixed point you used. The trick is to choose a large enough fixed-point representation of the reciprocal so that you don't lose accuracy. Depending on the initial divisor, that can lead to a pretty large reciprocal in fixed point, thus requiring a pretty large multiplier. But it's very simple otherwise. The problem with this approach is that it easily gives a off-by-one result however large the representation of your reciprocal may be, so you usually need an extra adjusting step.

Example: you choose to represent 1/1000 on 32 bits as:
2^32 / 1000 = 4294967 (integer part)

Now let's compute 10000/1000:
4294967 * 10000 / 2^32 = 9 (integer multiplication and 32-bit right shift)

of course you get something very close to 10 if you divide by 2^32 exactly, but if you use an integer division (done with a right shift here, but that's the same), you'll get 9. You might correct this with various tricks.

Now this approach can have much less latency in terms of clock cycles than hamster_nz's one (you'll probably implement this pipelining each step of his algorithm), but it will take a lot more resources. (Note that if you do this on FPGA and there are available multipliers on it, this might not "cost" that much either, but the cost will be hidden by the hardware multipliers. It's still there! Also, if you're using large multipliers like this, you might need to pipeline the multiplication as well to reach your target frequency, so all in all...)

cruff · « **Reply #11 on:** July 04, 2021, 01:53:27 am »

The book Hacker's Delight by Henry S Warren, Jr. has an entire chapter describing integer division by constants including getting correct results.

hamster_nz · « **Reply #12 on:** July 04, 2021, 05:02:54 am »

Quote from: SiliconWizard on July 03, 2021, 08:52:26 pm

Example: you choose to represent 1/1000 on 32 bits as:
2^32 / 1000 = 4294967 (integer part)

Now let's compute 10000/1000:
4294967 * 10000 / 2^32 = 9 (integer multiplication and 32-bit right shift)

I've was looking at 32-bit unsigned values, so to "divide by multiply by inverse" it will need a 32-bit multiplier - and doing that in a single cycle isn't cheap and fast on common architectures with smaller multipliers.

This shortened version is good to up to 43,000 (so ideal for 15-bit numbers, and the 10,000 used in the above example), and would be fine for implementing division in a single cycle at moderate clock rates:

Code: [Select]

   result = input_val>>10;
   remainder  = (input_val&0x3FF) + ((input_val>>10)<<4) + ((input_val>>10)<<3);

   if(remainder >= 1000) {
      result++;
      remainder -= 1000;
   }

There is nothing special about 1000 - it can be reformulated for other divisors, but the closer to the power of two it is the fewer repetitions of the first two lines are needed ensure an accurate result for a given input width.

hamster_nz · « **Reply #13 on:** July 04, 2021, 09:46:41 am »

Oh, did a bit more thinking, and now here is a more optimized version, which would use about 80% of the resources of the original, and one less clock cycle if pipelined:

Code: [Select]

uint32_t thousands(uint32_t original) {
   uint32_t result = 0, temp = original;

   result += ((temp>>17)<<7)    + ((temp>>17)<<1) + ((temp>>17)<<0);
   temp    = (temp&((1<<17)-1)) + ((temp>>17)<<6) + ((temp>>17)<<3);

   result += (temp>>10);
   temp    = (temp&((1<<10)-1)) + ((temp>>10)<<4)+((temp>>10)<<3);

   result += (temp>>10);
   temp    = (temp&((1<<10)-1)) + ((temp>>10)<<4)+((temp>>10)<<3);

   result += (temp>>10);
   temp    = (temp&((1<<10)-1)) + ((temp>>10)<<4)+((temp>>10)<<3);

   if(temp > 999) {
      result++;
      temp -= 1000;
   }
   if(temp < 1000 && original == result*1000+temp) return 1;
   printf("%u %u %u\n", original, result, temp);
   return 0;
}

(Once again, implement in whichever HDL you hate the least...)

radiolistener · « **Reply #14 on:** July 04, 2021, 11:29:11 am »

you're don't needs to divide. Just extend your BCD decoder from 3 to 6 digits and ignore low 3 digits.

I'm using such approach, it allows to swap between low 3 digits and hi 3 digits with a user button and see 6-digit value on 3 digit display.

Here is my BCD decoder with parametrized input/output width in verilog.
By default it use 30 bit binary input value and 10 bcd digit (4 bit x 10 = 40 bit) output value.

Code: [Select]

// binary to BCD decoder
// (c) radiolistener
module HEX2BCD(
    input [R-1:0] data,
    output reg [DIGITS-1:0][3:0] bcd 
);
    parameter R = 30;      // 2^31-1 = 2'147'483'647 / 10 digit
    parameter DIGITS = 10; // 10 x 4 = 40

    integer i,j;
    always @(data) begin
        for (j=DIGITS-1; j >= 0; j=j-1)
            bcd[j] = 4'd0;
        for (i=R-1; i >= 0; i=i-1) begin
            for (j=DIGITS-1; j >= 0; j=j-1) begin
                if (bcd[j] >= 4'd5)
                    bcd[j] = bcd[j] + 4'd3;
            end
            for (j=DIGITS-1; j >= 0; j=j-1) begin
                bcd[j] = bcd[j] << 1;
                bcd[j][0] = j > 0 ? bcd[j-1][3] : data[i];
            end
        end
    end
endmodule

SiliconWizard · « **Reply #15 on:** July 04, 2021, 05:19:30 pm »

I think hamster_nz 's approach is close in principle to the typical shift-and-add algorithm often used for BCD conversion.

bson · « **Reply #16 on:** July 04, 2021, 06:19:37 pm »

In case you're still interested, here's a small script to factor any divisor.

Code: [Select]

#! /usr/bin/env python3
import math as m
import sys

WANT=(int)(sys.argv[1])

first=round(m.log(WANT,2))
error=1./WANT - 1./(2**first)
terms=[first]

while m.fabs(error) > 1./(1<<63):
  sign = m.copysign(1, error)
  term = -round(m.log(m.fabs(error), 2.))
  terms.append(sign*term)
  error -= sign*(2**(-term))

print(terms)

And if you run it for 1000,

Code: [Select]

[10, 15.0, -17.0, 21.0, 24.0, 26.0, -29.0, -33.0, -34.0, 36.0, -38.0, -42.0, 45.0, -48.0, -50.0, -51.0, 53.0, -60.0]

This says

Code: [Select]

x/1000 ~= (x >> 10)
  + (x >> 15)
  - (x >> 17)
  + (x >> 21)
  + (x >> 24)
  + ...

Obviously, for a 16-bit integer, terms past 1>>15 don't matter. For 32-bit integers, the largest term is -(x >> 29).

A cursory check shows this yields a 1-bit error for values 1000-65535. Noticeable though is that 1000-1023 yield 0 for 16-bit integers.

(BTW, it's probably not accurate to the full 64 bits due to lack of precision in the 64-bit floating point error variable. I think it's something like 53 bits.)

Edit: here's a variant to output a C++ function for up to 32 bit integers.

Code: [Select]

#! /usr/bin/env python3
import math as m
import sys

WANT=(int)(sys.argv[1])

first=round(m.log(WANT,2))
error=1./WANT - 1./(2**first)
terms=[first]

print("template <typename T>")
print("static inline T div_%s(const T& val) {" % WANT)
print("  return (val >> %s)" % first, end='')

while m.fabs(error) > 1./(1<<32):
  sign = m.copysign(1, error)
  term = -round(m.log(m.fabs(error), 2.))
  print(" %s (val >> %s)" % ("-" if sign == -1.0 else "+", (int)(term)), end='')
  error -= sign*(2**(-term))

print(";\n}")

Which produces:

Code: [Select]

template <typename T>
static inline T div_1000(const T& val) {
  return (val >> 10) + (val >> 15) - (val >> 17) + (val >> 21) + (val >> 24) + (val >> 26) - (val >> 29);
}

A #pragma to suppress warnings about integer shifts always being zero might be useful for integers smaller than uint32_t.

radiolistener · « **Reply #17 on:** July 04, 2021, 07:03:33 pm »

Quote from: SiliconWizard on July 04, 2021, 05:19:30 pm

I think hamster_nz 's approach is close in principle to the typical shift-and-add algorithm often used for BCD conversion.

my hex2bcd also use shift and add operations for BCD conversion.

Well, the synthesis result of my hex2bcd module is really hard to analyze on a RTL viewer, but it works very well

And such approach don't require hardware multiplier or divide operation.

bson · « **Reply #18 on:** July 04, 2021, 09:58:51 pm »

I had to implement the shift divider in Verilog and synthesize it for a A7-100T just to see what it would produce. Works like a charm! All LUT, unclocked (I think).

Code: [Select]

module div_1000(
  input [31:0] word,
  output wire [31:0] result
    );
    
    assign result = 
        (word >> 10) 
        + (word >> 15)
        - (word >> 17) 
        + (word >> 21) 
        + (word >> 24) 
        + (word >> 26) 
        - (word >> 29);
    
endmodule

And to test it on a Digilent Nexys A7 board:

Code: [Select]

module div_test(
    input CLK,
    input [15:0] sw_i,
    output wire [15:0] led_o
    );

    wire [15:0] hi = 16'b0;
    wire [31:0] r;
    div_1000 (
        .word({hi, sw_i}),
        .result(r)
    );
    assign led_o[15:0] = r[15:0];
    
endmodule

I can toggle the 16 switches and watch to the divider output on the 16 LEDs.

5 LUTs used, including the test wrapper (though I don't think it needs anything to set the high half to 0, that goes straight in the LUTs).
63400 LUTs available.

SiliconWizard · « **Reply #19 on:** July 05, 2021, 01:30:13 am »

That's 6 cascaded adders. What's the Fmax? (Just curious!)

bson · « **Reply #20 on:** July 05, 2021, 05:05:50 pm »

Don't know, and since it's not clocked there's no timing info. In real usage of course timing depends on how it can be routed given global resource availability. For just this example I imagine each step could be latched, so it should be no worse than seven clocks?

That's of course assuming six cascaded adders. But in reality you only need ceil(log2(6)) = 3 stages, assuming 2-adders. But maybe there's a smarter way to synthesize a 6-adder than 3 deep tree.
Edit: and, on the likely possibility that the synthesizer actually does something as dumb as produce 6 cascaded adders, it's pretty easy to spell out. So in reality probably 3 clocks. Fmax would depend on WNS though, which depends on resource availability.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Divide/Mod a 32-bit value by 1000. (Read 3375 times)

Share me