### Author Topic: Saturating counters  (Read 9110 times)

0 Members and 2 Guests are viewing this topic.

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Saturating counters
« on: March 26, 2020, 05:20:02 pm »
Small puzzle.

How would you implement a saturating counter (let's say here for an unsigned counter - so saturating downwards to 0, and upwards to max value it can hold, so all 1's) without adding a logic level (for testing for saturation)?

In other words, do you know of any means of doing this with a structure that has no added delay compared to a simple adder?

Any ideas welcome.

#### mrflibble

• Super Contributor
• Posts: 2051
• Country:
##### Re: Saturating counters
« Reply #1 on: March 27, 2020, 01:19:08 am »
Mmmh, interesting. As a first stab at it ... For example for an unsigned saturating 4-bit adder on xilinx, something like this:
1 - add two 4-bit numbers using a CARRY4
2 - the 4 result bits go straight to the relevant 4 FFs in same CLB
3 - take the MSB CARRY OUT signal on the CARRY4, and use it as a synchronous reset for the above 4 FFs.
4 - those 4 FFs should all reset to 1 (HIGH)

Note that this is an adder, so more general solution. For upcount, use a=+1 and b=prev_value., where prev_value is the previous value on the FFs.

« Last Edit: March 27, 2020, 01:24:57 am by mrflibble »

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #2 on: March 27, 2020, 04:45:20 am »
Maybe something like (Verilog):

Code: [Select]
module limit_cnt ( input  wire       clk, input  wire       increment, input  wire       decrement, output reg  [3:0] counter  ) ;always @ (posedge clk) begin        if  ((counter != 4'hF ) && (increment )) counter <= counter +1;else if  ((counter != 4'h0 ) && (decrement )) counter <= counter -1;endendmodule
This should work way in the multi hundreds of megahertz to GHz, all the way up to 32 bit counters as the ' != ' is a simple xor equality compare function.  No addition math there.  And the +/-1 would be optimized by the compiler as a simple 1 bit add/subtract.  Adding a reset, or set/load inputs will also not decrease the performance.
« Last Edit: March 27, 2020, 02:59:00 pm by BrianHG »

#### OwO

• Super Contributor
• Posts: 1250
• Country:
• RF Engineer.
##### Re: Saturating counters
« Reply #3 on: March 27, 2020, 07:15:11 am »
The fastest way involves a dumb counter with enable (note that doIncrement should be driven by a FF):
Quote
cnt <= cnt+1 when doIncrement='1' and rising_edge(clk);

Then predicting ahead of time when the counter will reach saturation value:
Quote
cntWillSaturate <= '1' when cnt = unsigned(cnt'range=>'1') - 2 else '0';
cntWillSaturate1 <= cntWillSaturate when rising_edge(clk);
doIncrement <= not cntWillSaturate1 when rising_edge(clk);

You will need to make changes to support control inputs but the basic idea is the same. Since you can't predict control inputs ahead of time the next best you can do is speculate-and-fix: let the counter overflow for a few clock cycles until it notices it should saturate, but follow this up with pipelined logic to fix the output value.
Email: OwOwOwOwO123@outlook.com

#### OwO

• Super Contributor
• Posts: 1250
• Country:
• RF Engineer.
##### Re: Saturating counters
« Reply #4 on: March 27, 2020, 07:22:16 am »
But actually in an FPGA testing for saturation doesn't actually add a logic level if it ends up using the CE input of the flipflops - the MUX before a flipflop is always present in hardware, and if the CE path is no longer than the D path (e.g. the comparison is as fast as the addition) it doesn't affect timing. The synthesizer should be able to infer the use of CE as long as you didn't do something stupid (like a global clock enable) or hide the MUXing in long spaghetti behavioral code.

Now this code:
Quote
if  (counter != 8h'FF ) && (increment ) counter <= counter +1;
else if  (counter != 8h'00 ) && (decrement ) counter <= counter -1;
Is likely to not synthesize optimally because in my experience the tools won't see that it can be simplified to a adder/subtractor and a CE flipflop. It'll see a 3-input MUX and conclude it can't fit the CE flipflop template and end up using a LUT MUX.

I would separate the add/subtract part from the CE part:

Quote
counterNext <= counter+1 when addSubSel='1' else counter-1;
Quote
counterCE <= '1' when (increment='1' and counter != X"FF") or (decrement='1' and counter != X"00") else '0';
counter <= counterNext when counterCE='1' and rising_edge(clk);
« Last Edit: March 27, 2020, 07:29:05 am by OwO »
Email: OwOwOwOwO123@outlook.com

#### OwO

• Super Contributor
• Posts: 1250
• Country:
• RF Engineer.
##### Re: Saturating counters
« Reply #5 on: March 27, 2020, 01:56:23 pm »
I tried synthesizing the above behavioral code and compared it to my dataflow version.
The results are just as I guessed - the behavioral version synthesizes into crap logic that only runs at 228MHz (compared to 315MHz for the dataflow version).

(leftmost red column is WNS, followed by TNS. The clock was constrained to 2.4ns)

Looking at the failing timing path of the behavioral version:

There is a bunch of LUTs before the adder (an adder by itself should show only one LUT before the CARRY chain).

I looked closely at the logic to figure out why the behavioral version has such a complex data path, and it turns out the answer lies in the corner case, which is increment='1' and decrement='1' at the same time. For optimization purposes we don't care about what happens to the counter under this condition.

It turns out the behavioral code does something baroque - if increment and decrement are both high, it will prefer to increment, but if the counter is saturated high it will fall through to the else branch, which decrements. In other words, the type of arithmetic taken (add or subtract) depends on the value of the counter!

What does this mean? It means better tools and smarter synthesis will not solve the problem; the logic itself has to be changed.
This enforces my belief that behavioral code should be avoided at all costs; writing a dataflow description forces you to think what the datapath and clock enable path really looks like. The behavioral version hides that from you, and you end up with suboptimal logic in very non-obvious ways.

Behavioral code:
Code: [Select]
library IEEE;use IEEE.STD_LOGIC_1164.ALL;use IEEE.NUMERIC_STD.ALL;entity counterTestBehav is    port(clk: in std_logic;    increment, decrement: in std_logic;    dout: out unsigned(31 downto 0));end counterTestBehav;architecture a of counterTestBehav is signal count, count1: unsigned(31 downto 0);begin process(clk) is begin if rising_edge(clk) then if increment='1' and count /= X"FFFFFFFF" then count <= count+1; elsif decrement='1' and count /= X"00000000" then count <= count-1; end if; end if; end process; count1 <= count when rising_edge(clk); dout <= count1 when rising_edge(clk);end a;
Dataflow code:
Code: [Select]
library IEEE;use IEEE.STD_LOGIC_1164.ALL;use IEEE.NUMERIC_STD.ALL;entity counterTestDataflow is    port(clk: in std_logic;    increment, decrement: in std_logic;    dout: out unsigned(31 downto 0));end counterTestDataflow;architecture a of counterTestDataflow is signal countCE: std_logic; signal count, countNext, count1: unsigned(31 downto 0);begin countNext <= count+1 when increment='1' else count-1; countCE <= '1' when (increment='1' and count /= X"FFFFFFFF") or (decrement='1' and count /= X"00000000") else '0'; count <= countNext when countCE='1' and rising_edge(clk); count1 <= count when rising_edge(clk); dout <= count1 when rising_edge(clk);end a;
« Last Edit: March 27, 2020, 01:59:01 pm by OwO »
Email: OwOwOwOwO123@outlook.com

The following users thanked this post: Yansi, iMo, FenTiger

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #6 on: March 27, 2020, 02:57:33 pm »
I tried synthesizing the above behavioral code and compared it to my dataflow version.
The results are just as I guessed - the behavioral version synthesizes into crap logic that only runs at 228MHz (compared to 315MHz for the dataflow version).

Design methodology aside, it is a sad state of affairs that a 4 bit limit counter of either design here con only reach 228Mhz.  Why does Quartus give a FMAX of 333MHz on their oldest obsolete cheapest FPGA with my behavioral version using only 4 registers, 9 combinational elements total?

« Last Edit: March 27, 2020, 03:04:21 pm by BrianHG »

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Re: Saturating counters
« Reply #7 on: March 27, 2020, 03:01:44 pm »
Thanks OwO for the detailed analysis - that's interesting. My experience with just plain behavioral code for this was consistent with what you got, hence my question to find better ways of implementing this.

One thing to note (that I already pointed out a while ago in another thread) is how "dumb" synthesis tools can be sometimes when dealing with "if" constructs. This can be pretty surprising even to seasoned engineers, as those tools are otherwise pretty good at optimizing logic for a whole range of structures. A common pitfall with "if" is that it will often be translated as a priority encoder of some sort, even when it's completely obvious from behavior only that there is no priority needed.

@OwO: isn't your proposed counter a pipelined one? Not that it's necessarily an issue, it can be adequate for a number of applications, but in some cases the added latency may not be what we want.

#### OwO

• Super Contributor
• Posts: 1250
• Country:
• RF Engineer.
##### Re: Saturating counters
« Reply #8 on: March 27, 2020, 03:07:56 pm »
The two versions I compared above behave identically except in the case of increment='1' and decrement='1'. I added two pipeline registers at the end so that the tools are free to place logic whereever rather than close to the IOs; there is no pipeline delay otherwise.

But yes both of the above versions aren't pipelined and so aren't the fastest I can do; I'll probably design a pipelined and speculative version later and see if I can get above 400MHz (slowest Artix-100T).

The counters I compared are 32 bits; I'm sure you can do much faster with less bits.
« Last Edit: March 27, 2020, 03:09:40 pm by OwO »
Email: OwOwOwOwO123@outlook.com

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Re: Saturating counters
« Reply #9 on: March 27, 2020, 03:11:19 pm »
The two versions I compared above behave identically except in the case of increment='1' and decrement='1'. I added two pipeline registers at the end so that the tools are free to place logic whereever rather than close to the IOs; there is no pipeline delay otherwise.

OK, I see the rationale.

But yes both of the above versions aren't pipelined and so aren't the fastest I can do; I'll probably design a pipelined and speculative version later and see if I can get above 400MHz (slowest Artix-100T).

Yes, I thought of a pipelined approach in which for instance, the first stage would just count (and let the counter roll over), and the second stage would select the right output depending on previous state, something like this.

But I was ultimately really curious about finding out if there was a modified counter structure that would saturate by itself while being structurally as simple as a basic counter that rolls over.

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #10 on: March 27, 2020, 03:26:49 pm »
Yes there is, but unfortunately, it would run slower on FPGA.  In a full blown ASIC cell, it would not.

You may be thinking something like:

up_counter <= up_counter + ~(up_counter == 4'hF);

Gate wise, this literally does what you are asking for...  Make an adder whose addition input is 1 when the output bits are not equal to 1111.

This unfortunately makes adder whose 1 bit 0 or 1 input directly dependent on the state of the adder's outputs.
Advanced preparation of such a 'up_counter' total and a similar 'down_counter' total, then select which result you want depending on up/down inputs would be the pipeline method.
« Last Edit: March 27, 2020, 04:28:36 pm by BrianHG »

#### Bassman59

• Super Contributor
• Posts: 2501
• Country:
• Yes, I do this for a living
##### Re: Saturating counters
« Reply #11 on: March 27, 2020, 08:22:03 pm »
Small puzzle.

How would you implement a saturating counter (let's say here for an unsigned counter - so saturating downwards to 0, and upwards to max value it can hold, so all 1's) without adding a logic level (for testing for saturation)?

In other words, do you know of any means of doing this with a structure that has no added delay compared to a simple adder?

I would write it simply:

Code: [Select]
    signal reset : std_logic;    signal operation : std_logic_vector(1 downto 0);    alias inc : std_logic is operation(0);    alias dec : std_logic is operation(1);    signal counter : natural range 0 to CNTMAX; -- define CNTMAX as whateverSaturatingCounter : process (clk) isbegin    if rising_edge(clk) then        if reset = '1' then            counter <= 0;        else            OpDecode : case operation is                 when "01" => -- increment                     if counter < CNTMAX then                         counter <= counter + 1;                     end if;                 when "10" => -- decrement                     if count > 0 then                        counter <= counter - 1;                     end if;                 when others =>                      -- do nothing if both inc and dec are asserted, or if nothing.                     null;             end case OpDecode;        end if;    end if;end process SaturatingCounter;
If this meets my timing constraints, I'm done, as further optimization isn't necessary. If it doesn't, I look to see why.

#### aheid

• Regular Contributor
• Posts: 245
• Country:
##### Re: Saturating counters
« Reply #12 on: March 28, 2020, 03:56:33 pm »
Dumb question, why use separate inc/dec flags. Why not single? Inc=1, dec=0.

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #13 on: March 28, 2020, 07:36:15 pm »
Dumb question, why use separate inc/dec flags. Why not single? Inc=1, dec=0.
You may want the counter to remain stationary.
Like a volume control know with a min and max, you may want it to turn up, or down, or stay still.

#### ralphrmartin

• Frequent Contributor
• Posts: 486
• Country:
##### Re: Saturating counters
« Reply #14 on: March 28, 2020, 07:58:57 pm »
Then the correct way to encode your inputs is, in my opinion,
one variable for which 0 means  Increment, 1 means Decrement
one variable for which 0 means do nothing, 1 means do something (increment or decrement as specified)

That way you avoid the issue of what to do if Increment = 1 and Decrement = 1 at the same time.

I think the problem here is not behavioural logic per se, but proper specification of what the behaviour should be.
« Last Edit: March 28, 2020, 08:00:29 pm by ralphrmartin »

#### aheid

• Regular Contributor
• Posts: 245
• Country:
##### Re: Saturating counters
« Reply #15 on: March 28, 2020, 09:42:18 pm »
Dumb question, why use separate inc/dec flags. Why not single? Inc=1, dec=0.
You may want the counter to remain stationary.
Like a volume control know with a min and max, you may want it to turn up, or down, or stay still.

Sure, but as a programmer my natural reaction would be to have a single inc/dec flag and then a counter/clock enable flag. So why two seemingly clearly ambiguous flags instead? Again, I'm a FPGA n00b so just trying to understand.

#### Someone

• Super Contributor
• Posts: 4845
• Country:
##### Re: Saturating counters
« Reply #16 on: March 28, 2020, 09:46:11 pm »
Then the correct way to encode your inputs is, in my opinion,
one variable for which 0 means  Increment, 1 means Decrement
one variable for which 0 means do nothing, 1 means do something (increment or decrement as specified)

That way you avoid the issue of what to do if Increment = 1 and Decrement = 1 at the same time.

I think the problem here is not behavioural logic per se, but proper specification of what the behaviour should be.
It is not always the "correct" encoding, a FIFO controller is a common example of a counter that has separate increment and decrement inputs. No change for inc == dec is entirely correct there.

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #17 on: March 28, 2020, 09:49:26 pm »
Dumb question, why use separate inc/dec flags. Why not single? Inc=1, dec=0.
You may want the counter to remain stationary.
Like a volume control know with a min and max, you may want it to turn up, or down, or stay still.

Sure, but as a programmer my natural reaction would be to have a single inc/dec flag and then a counter/clock enable flag. So why two seemingly clearly ambiguous flags instead? Again, I'm a FPGA n00b so just trying to understand.
What if you want an up and down button connected to 2 inputs in the FPGA.  Press 1 to increment, or the other to decrement.

Arguing about the 2 controls for this counter is meaningless.  It can be anything your project requires, including 3 inputs or even more.  The whole issue here was about methods to implement the 'Saturation' itself.

#### aheid

• Regular Contributor
• Posts: 245
• Country:
##### Re: Saturating counters
« Reply #18 on: March 28, 2020, 10:24:33 pm »
Arguing about the 2 controls for this counter is meaningless.  It can be anything your project requires, including 3 inputs or even more.  The whole issue here was about methods to implement the 'Saturation' itself.

It just seemed the two separate flags was prone to poor synthesis, so I wanted to see if there was a reason to chose that way that I was missing.

#### hamster_nz

• Super Contributor
• Posts: 2812
• Country:
##### Re: Saturating counters
« Reply #19 on: March 29, 2020, 10:54:11 am »
For smaller, performance-critical saturation counters this pattern might be of use, as it doesn't use the carry chain and only has a single level of logic:

Code: [Select]
library IEEE;use IEEE.STD_LOGIC_1164.ALL;use IEEE.NUMERIC_STD.ALL;entity sat_counter is    Port ( clk : in STD_LOGIC;           inc : in STD_LOGIC;           dec : in STD_LOGIC;           count : out STD_LOGIC_VECTOR (3 downto 0));end sat_counter;architecture Behavioral of sat_counter is  type a_mem is array(0 to 63) of std_logic_vector(3 downto 0);  signal mem : a_mem := (x"0",x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",x"F",  -- Nnee                         x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",x"F",x"F",  -- Inc                         x"0",x"0",x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",  -- Dec                         x"0",x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",x"F"); -- inc & dec  signal index : std_logic_vector(5 downto 0);  signal state : std_logic_vector(3 downto 0) := (others => '0');begin   count <= std_logic_vector(state);   index <= dec & inc & state;process(clk)   begin      if rising_edge(clk) then         state <= mem(to_integer(unsigned(index)));      end if;   end process;end Behavioral;
For example, this four bit counter needs four 6-input LUTs and 4FFs.

BRAM blocks can also be used for counters of moderate widths (8 or so bits).

You can sometimes add add an "enable" and reset input at no extra cost (see second schematic):
Code: [Select]
library IEEE;use IEEE.STD_LOGIC_1164.ALL;use IEEE.NUMERIC_STD.ALL;entity sat_counter is    Port ( clk : in STD_LOGIC;           inc : in STD_LOGIC;           dec : in STD_LOGIC;           enable : in STD_LOGIC;           reset : in STD_LOGIC;           count : out STD_LOGIC_VECTOR (3 downto 0));end sat_counter;architecture Behavioral of sat_counter is  type a_mem is array(0 to 63) of std_logic_vector(3 downto 0);  signal mem : a_mem := (x"0",x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",x"F",  -- Nnee                         x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",x"F",x"F",  -- Inc                         x"0",x"0",x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",  -- Dec                         x"0",x"1",x"2",x"3",x"4",x"5",x"6",x"7",x"8",x"9",x"A",x"B",x"C",x"D",x"E",x"F"); -- inc & dec  signal index : std_logic_vector(5 downto 0);  signal state : std_logic_vector(3 downto 0) := (others => '0');begin   count <= std_logic_vector(state);   index <= dec & inc & state;process(clk)   begin      if rising_edge(clk) then         if reset = '1' then            state <= (others => '0');         elsif enable = '1' then            state <= mem(to_integer(unsigned(index)));         end if;      end if;   end process;end Behavioral;
You can push the ROM+state register into a Block RAM, making longer fast counters possible, and with Dual-port RAMs you can implement two counters using the same memory block.
« Last Edit: March 29, 2020, 10:58:22 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.

#### hamster_nz

• Super Contributor
• Posts: 2812
• Country:
##### Re: Saturating counters
« Reply #20 on: March 29, 2020, 11:11:09 am »
Oh, and if you push the ROM into Block RAM with at least this on Xilinx parts:

Code: [Select]
  attribute rom_style : string;    attribute rom_style of mem : signal is "block";
here's what the implemented design looks like.

If you have an 'enable' and a 'reset' then a LUT2 is needed to control the BRAM's clock enable, possibly slowing timing. You can sometimes merge the reset into the block RAM.
« Last Edit: March 29, 2020, 11:13:24 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Re: Saturating counters
« Reply #21 on: March 29, 2020, 03:38:17 pm »
Using memory blocks to implement this is a creative approach. Of course, it's a waste of area, but on FPGAs, that doesn't really matter as long as the counter is small enough.

I'd be curious to see a comparison of Fmax between your proposed memory-based approach, and a purely behavioral/naive approach, for say a 4-bit counter.

#### NorthGuy

• Super Contributor
• Posts: 3237
• Country:
##### Re: Saturating counters
« Reply #22 on: March 29, 2020, 06:05:50 pm »
32-bit counter will use 8 CARRY4 elements on Xilinx, which is very slow already. I'm not sure it can run at 400 MHz.

Calculating the "will saturate up" and "will saturate down" conditions parallel to the carry chain is not difficult. Then these can be used by one extra logic level in front of the carry chain, which would get "will saturated up", "will saturate down", "increment" and "decrement" signals and figure out the inputs to the carry chain. This is probably better than adding muxes to the results. Such method will be a touch slower than the counter itself.

If you want faster, you need to pipeline the counter first. Since the free running counter is fully predictable, you can pipeline it as much as you can. This is not hard and maximum clock speed most likely can be achieved.

It's much harder with the saturating counter where the stream of "increment" or "decrement" pulses is unpredictable. But you still can maintain current values of "timer -2", "timer-1","timer+1","timer+2" and then get the result by muxing the correct values depending on the incoming "increment" and "decrement" values. But that's a lot of writing and a lot of logic. But anyway, you may be able to get to one logic level, so high speed certainly can be achieved. But there's no way the tools may do this for you automatically.

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #23 on: March 29, 2020, 07:44:53 pm »
I'd be curious to see a comparison of Fmax between your proposed memory-based approach, and a purely behavioral/naive approach, for say a 4-bit counter.
We all provided you with source code examples.  All the top FPGA vendors offer free versions of their compilers which can compile all of our provided codes.  It is trivial to read the final FMAX product after a test compile.

In the case of these 4 bit counters, only 1 issue you might get is 2 different FMAX as in Quartus,  you will see 1 higher speed, the maximum error free operational clock frequency of the counter, and a 'restricted FMAX' which may be a slower limited frequency due to the counter being fed by or it feeding IO pins where this lower FMAX may be the speed limit of the IO pins with their set IO voltage standard, like LVTTL for example.  Only using the counter buried internally will allow the higher listed FMAX and there are no coding tricks which would allow you to go above that restricted FMAX if the counter being used is directly reliant on the IOs with their selected IO standard.  In other words, if you are trying to exceed this 'restricted' FMAX with simple coding tricks, I sorry to say that you are wasting time and you should upgrading your FPGA speed rating unless you don't care about the tight synchronous clocked inputs and allow for a sliping-async style inputs driving the direction of your counter.  Though, the clock still needs to be clean.
« Last Edit: March 29, 2020, 08:17:15 pm by BrianHG »

#### hamster_nz

• Super Contributor
• Posts: 2812
• Country:
##### Re: Saturating counters
« Reply #24 on: March 29, 2020, 09:34:55 pm »

4-bit saturating counter implemented using Block RAM = 4.001ns = ~250 MHz (constrained to 4ns clock, with -0.001 WNS)
4-bit saturating counter implemented using LUT6+FFs =  2.2ns = 454 MHz (constrained to 2.2ns with 0.687ns  slack).

When I constrained the later case to 2.0ns, it fails timing on "Total Pulse Width Negative Slack" .

I'm somewhat surprised by how slow the Block RAM version was. I was expecting it to go to the speed of the Block RAM - 450MHz. That external LUT2 (to combine the reset and enable signals) slows things down a bit
« Last Edit: March 29, 2020, 09:41:47 pm by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.

The following users thanked this post: SiliconWizard

#### NorthGuy

• Super Contributor
• Posts: 3237
• Country:
##### Re: Saturating counters
« Reply #25 on: March 29, 2020, 09:58:22 pm »
4-bit saturating counter implemented using LUT6+FFs =  2.2ns = 454 MHz (constrained to 2.2ns with 0.687ns  slack).

When I constrained the later case to 2.0ns, it fails timing on "Total Pulse Width Negative Slack" .

This is just 4 LUTs (one for each output line), so it should run at max clock speed - 464 MHz for grade -1. If you switch to grade 2, it should give you 628 MHz.

I'm somewhat surprised by how slow the Block RAM version was. I was expecting it to go to the speed of the Block RAM - 450MHz. That external LUT2 (to combine the reset and enable signals) slows things down a bit

It's 388 MHz for grade -1. You probably can get rid of the external LUT to get to the this speed.

BRAM is only slower than LUT RAM for small sizes. If you go to bigger counters, the LUT RAM will be slower and slower, but BRAM will keep the same speed.

#### Someone

• Super Contributor
• Posts: 4845
• Country:
##### Re: Saturating counters
« Reply #26 on: March 29, 2020, 10:24:35 pm »
I'm somewhat surprised by how slow the Block RAM version was. I was expecting it to go to the speed of the Block RAM - 450MHz. That external LUT2 (to combine the reset and enable signals) slows things down a bit
It's 388 MHz for grade -1. You probably can get rid of the external LUT to get to the this speed.

BRAM is only slower than LUT RAM for small sizes. If you go to bigger counters, the LUT RAM will be slower and slower, but BRAM will keep the same speed.
BRAM is pipelined and positioned for data flow designs, anything like this with single cycle feedback will be a long way off the Fmax. Once the size grows and you start having to tie multiple BRAMS together the fabric approaches will likely catch up/overtake in speed when you utilise the fast carry chain.

#### Bassman59

• Super Contributor
• Posts: 2501
• Country:
• Yes, I do this for a living
##### Re: Saturating counters
« Reply #27 on: March 30, 2020, 02:32:26 am »
Arguing about the 2 controls for this counter is meaningless.  It can be anything your project requires, including 3 inputs or even more.  The whole issue here was about methods to implement the 'Saturation' itself.

It just seemed the two separate flags was prone to poor synthesis, so I wanted to see if there was a reason to chose that way that I was missing.

Why would it be "prone to poor synthesis?"

I chose the two-flags approach arbitrarily. Consider a case where the logic that controls the increment is separate from the logic that controls the decrement. As has been noted, some FIFO implementations do that.

One could also use the enable and direction approach.  You might use this when counting ticks from a rotary encoder which provides those two signals.

I suppose it depends more on what is controlling the counter.

In any case, as I said, I'm rarely interested in the maximum speed for a given logic block. My designs are usually the "most obvious." I always have clock constraints which drive design requirements. If a particular piece of logic doesn't meet the timing constraints, then I look for why that's the case and what I need to do to fix the problem.

#### OwO

• Super Contributor
• Posts: 1250
• Country:
• RF Engineer.
##### Re: Saturating counters
« Reply #28 on: March 30, 2020, 03:52:28 am »
So to summarize, the two-flag control input only causes synthesis difficulties because there exists the corner case of increment = decrement = '1', and in behavioral code corner cases are not obvious and easy to be seen; it's very easy to create complex and unnecessary logic by the falling through of if-statements, which in this case resulted in additional logic in the datapath rather than just a simple adder/subtracter.

TL;DR: use dataflow syntax and explicitly spell out the structure of the hardware, the datapath, and the clock enable path. Understand that in an FPGA the registers have a built in MUX that allows clock enables without additional delays to the data path. In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.
Email: OwOwOwOwO123@outlook.com

#### hamster_nz

• Super Contributor
• Posts: 2812
• Country:
##### Re: Saturating counters
« Reply #29 on: March 30, 2020, 04:00:08 am »
Dumb question, why use separate inc/dec flags. Why not single? Inc=1, dec=0.
You may want the counter to remain stationary.
Like a volume control know with a min and max, you may want it to turn up, or down, or stay still.

Sure, but as a programmer my natural reaction would be to have a single inc/dec flag and then a counter/clock enable flag. So why two seemingly clearly ambiguous flags instead? Again, I'm a FPGA n00b so just trying to understand.

If separate logic generates the increment and decrements signals then separate signals makes sense, and you can define the behavior when both is asserted at the same time.

If you want a minimal interface to a single piece of logic, then an "enable" and "direction" signal might be more appropriate.

You can always transform one interface to the other, but it might add levels of logic and might reduce timing margins if it becomes the critical timing path.

enable <= up XOR down;
direction <= up;

or

up  <= enable AND direction;
down <= enable AND NOT direction;

But the 'up' and 'down' interface is more generic (i.e. applicable to more situations). You can't assert 'up' and 'down' at the same time with an 'enable' and 'direction' interface
« Last Edit: March 30, 2020, 04:02:04 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.

#### hamster_nz

• Super Contributor
• Posts: 2812
• Country:
##### Re: Saturating counters
« Reply #30 on: March 30, 2020, 04:22:20 am »
So to summarize, the two-flag control input only causes synthesis difficulties because there exists the corner case of increment = decrement = '1', and in behavioral code corner cases are not obvious and easy to be seen; it's very easy to create complex and unnecessary logic by the falling through of if-statements, which in this case resulted in additional logic in the datapath rather than just a simple adder/subtracter.

TL;DR: use dataflow syntax and explicitly spell out the structure of the hardware, the datapath, and the clock enable path. Understand that in an FPGA the registers have a built in MUX that allows clock enables without additional delays to the data path. In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.

If you do run into timing problems this is where the whole "small/cheap/fast" trade off can be seen.

- If you want to use minimal resources, then using the control signals to select what the operand is on the adder (-1, 0 or 1) will be slowest.

- If you need good performance then calculating both counter_plus_one and counter_sub_one then using the control signals to select between those values will add minimal delay.

- For best performance, where the counter arithmetic is the critical path then avoiding the addition or subtraction completely by converting it to a memory-driven state machine might be a workable solution.

You usually want to start with the first, simple solution, and only use the more advanced solutions if you know this will be in the critical path.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.

The following users thanked this post: Someone

#### Bassman59

• Super Contributor
• Posts: 2501
• Country:
• Yes, I do this for a living
##### Re: Saturating counters
« Reply #31 on: March 30, 2020, 04:45:36 am »
So to summarize, the two-flag control input only causes synthesis difficulties because there exists the corner case of increment = decrement = '1', and in behavioral code corner cases are not obvious and easy to be seen; it's very easy to create complex and unnecessary logic by the falling through of if-statements, which in this case resulted in additional logic in the datapath rather than just a simple adder/subtracter.

Note that I used a case statement instead of cascaded if statements. This takes care of priority. The case statement only checks for “01” and “10”, which increment or decrement. The case when both increment and decrement are asserted is treated as if neither were asserted — in the default, where nothing happens. In other words, that case is ignored. It’s not a corner case.

Quote
TL;DR: use dataflow syntax and explicitly spell out the structure of the hardware, the datapath, and the clock enable path. Understand that in an FPGA the registers have a built in MUX that allows clock enables without additional delays to the data path.

The synthesizer looks at the whole design. ISE and Vivado will create non-obvious logic. The CE input can be driven by logic more complex than you might expect if that is the most efficient implementation. The flip-flop in effect has two inputs.

Quote
In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.

Again, I say: who cares about that efficiency, unless you’re running at clock frequencies near the fabric limit.

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #32 on: March 30, 2020, 07:10:25 am »
In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.
Quartus automatically uses them when writing behavioral code.  How do you think my 2 line Verilog (IF/ELSE IF) example compiled to produce a 333MHz design when using the slowest Cyclone II FPGA from 20 years ago.

Rewriting my code to this 1 line:
Code: [Select]
counter <= counter + ( increment && (counter != 4'hF) ) - ( decrement && (counter != 4'h0) ) ;Prevents the use of the Sync or Async load inputs and results in a FMAX of only 204MHz.

If Quartus didn't do so, mos of my designs would never reach their FMAX as I'm always under pressure to squeeze the most into the cheapest FPGA, running with good wide margins, and my designs would take almost twice as long to develop to work around the issue.

#### OwO

• Super Contributor
• Posts: 1250
• Country:
• RF Engineer.
##### Re: Saturating counters
« Reply #33 on: March 30, 2020, 07:46:04 am »
It does when you have very simple conditionals with 1 or 2 branches, but anything slightly more complex and you risk generating spaghetti logic. In this example (increment/decrement counters) there isn't even a need for the synchronous load (a clock enable is sufficient), so if that's what the synthesizer used then it's not optimizing as well as it could. I've shown above that the if/else example doesn't work well on Xilinx devices and ends up adding logic before the adder.

Code: [Select]
counter <= counter + ( increment && (counter != 4'hF) ) - ( decrement && (counter != 4'h0) ) ;I do NOT recommend writing code like this at all. VHDL forbids that kind of complex expression and for good reason. This is not software where you can take shortcuts by using the result of a comparison as an integer, so it is not helpful to think that way. It also does not save any resources to use bitwise operations instead of "branching", unlike in software. Again, forget any and all optimization techniques you learned in software - none are applicable to FPGA design.

Instead, think in terms of basic building blocks - adder/subtractors, comparators, MUXs, and flipflops. In the increment/decrement counter case you know instantly that you need an adder/subtractor, so infer that first:

Code: [Select]
countNext <= a + b when addSubtractSel='1' else a - b;
Then, compose the clock enable expression by looking at when the counter should "advance". It is a function of control inputs and current counter value.

Code: [Select]
ce <= '1' when (increment='1' and count /= X"ffffffff") or (decrement='1' and count /= 0) else '0';
Infer a flipflop and tie it all together:

Code: [Select]
count <= countNext when ce='1' and rising_edge(clk);addSubtractSel <= '1' when increment='1' else '0';a <= count;b <= X"00000001";
In the end you have logic that behaves identical to the behavioral version, but runs to >300MHz on Artix speed grade 1 (compared to 228MHz).
« Last Edit: March 30, 2020, 07:48:05 am by OwO »
Email: OwOwOwOwO123@outlook.com

The following users thanked this post: ralphrmartin, SiliconWizard

#### BrianHG

• Super Contributor
• Posts: 7999
• Country:
##### Re: Saturating counters
« Reply #34 on: March 30, 2020, 08:17:38 am »
It does when you have very simple conditionals with 1 or 2 branches, but anything slightly more complex and you risk generating spaghetti logic.
I know you can get away with at least 2 branches, with the addition of an automatically interpreted set of logic gates which also utilize a dedicated 0 penalty 'clock enable' for each logic cell.

Depending on the setup and logic functions, usually better interpreted by Quartus when using the 'CASE' statement, Quartus can squeeze between 3 to 8 select-able MUXs by adapting them part in the dedicated hardware & part in the regular gate fabric.  4 seems to be a tolerable sweet spot as designing a miniature CASE selectable 1 to 4 function logic based 8x8 bit ALUs usually always hits the top FMAX.  Just adding 1 'CASE' making a total of 5 can easily cripple the FMAX almost 2 fold.

#### NorthGuy

• Super Contributor
• Posts: 3237
• Country:
##### Re: Saturating counters
« Reply #35 on: March 30, 2020, 01:30:25 pm »
For a counter of any reasonable size (e.g. 8+ bits), the critical path is the carry chain. Aside of obvious mistake of making saturation logic and carry chain sequential (rather than running them in parallel), it doesn't matter much how the saturation logic is implemented. Yet, when trying to make things faster, the participants of this thread discuss the details of saturation logic, and nobody discusses the carry chain. The only way to make things faster is to work on the carry chain.

The following users thanked this post: Someone

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Re: Saturating counters
« Reply #36 on: April 10, 2020, 01:34:34 pm »
As a wrap-up, a quick word.

Note that the main point in my query was not just to find ways of making a saturating counter of a given size more efficient speed-wise, although this alone is obviously of practical importance. As far as speed is concerned, we can either go for a pipelined approach, or do nothing special, as the overhead of the saturation logic may be negligible compared to the carry chain, indeed, beyond a certain counter width (but the break-even point may be more surprising that some may think.)

The main puzzle was to find out if we could devise any counter than would inherently saturate with ZERO overhead compared to a roll-over counter. There doesn't seem to really be a solution to that, although memory-based approaches are candidates (but would be costly past a certain width obviously...)

#### NorthGuy

• Super Contributor
• Posts: 3237
• Country:
##### Re: Saturating counters
« Reply #37 on: April 10, 2020, 02:42:24 pm »
The main puzzle was to find out if we could devise any counter than would inherently saturate with ZERO overhead compared to a roll-over counter.

The OwO's approach with saturation logic feeding the CE pins of the final flip-flops is zero overhead when applied to the carry chain solution. This is regardless of how long the carry chain is.

I would even say that nearly any sort of roll-over counter may be converted to a saturating one without adding any overhead. The exception would be hardware counters, such as built-in DSP counters, although DSPs may have built-in saturation logic as well.

<edit>To clarify, I'm referring to the speed. The saturating counter is likely to require extra logic.

<edit again>If the space needs to be saved, then you can use the CARRYOUT pin of the carry chain. If you just route inverted CARRYOUT to the CE pins of the flip-flops, then you'll get the saturating counter without any logic overhead. This doesn't require any extra logic (compared to non-saturating counter), but it'll be slower because of the extra time necessary to propagate CARRYOUT to flip-flops.
« Last Edit: April 10, 2020, 04:06:41 pm by NorthGuy »

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Re: Saturating counters
« Reply #38 on: April 10, 2020, 04:18:29 pm »
The main puzzle was to find out if we could devise any counter than would inherently saturate with ZERO overhead compared to a roll-over counter.

The OwO's approach with saturation logic feeding the CE pins of the final flip-flops is zero overhead when applied to the carry chain solution. This is regardless of how long the carry chain is.

I would even say that nearly any sort of roll-over counter may be converted to a saturating one without adding any overhead. The exception would be hardware counters, such as built-in DSP counters, although DSPs may have built-in saturation logic as well.

<edit>To clarify, I'm referring to the speed. The saturating counter is likely to require extra logic.

Yep. So you get the puzzle. It was not just a speed matter as I said.

<edit again>If the space needs to be saved, then you can use the CARRYOUT pin of the carry chain. If you just route inverted CARRYOUT to the CE pins of the flip-flops, then you'll get the saturating counter without any logic overhead. This doesn't require any extra logic (compared to non-saturating counter), but it'll be slower because of the extra time necessary to propagate CARRYOUT to flip-flops.

Indeed. So as I said, there doesn't really seem to be any free lunch here. You'll either lose on speed or area, or both.
For the approach above (using the carry out of the carry chain), further pipelining it should solve the slowdown. But then you'll get the latency which is not necessarily desirable.

Edit: note that taking the carry out of the carry chain actually takes extra logic. For a roll-over counter, you can just ignore it, and there will be less logic involved then. Getting the carry out of the whole carry chain requires... propagating it. That's physically more logic. On some particular physical implementations, such as on some FPGAs, that may not make a difference, but I've started experimenting a bit now that I have a little time for this - right now with Lattice Diamond (for a MachXO2 target). It turns out the "carry out" version seems to take more area, but is significantly faster. It tried both with LSE and Synplify pro, with similar results. For a 16-bit counter, I get 202MHz max (and MachXO2's are not speed daemons!), 24 LUT4s, for the "naive" version ("if counter /= counter_max ..."), and 270MHz, 35 LUT4s for the carry out version. I'll do more experiments though.
« Last Edit: April 10, 2020, 04:53:45 pm by SiliconWizard »

#### NorthGuy

• Super Contributor
• Posts: 3237
• Country:
##### Re: Saturating counters
« Reply #39 on: April 10, 2020, 05:06:33 pm »
... note that taking the carry out of the carry chain actually takes extra logic. For a roll-over counter, you can just ignore it, and there will be less logic involved then. Getting the carry out of the whole carry chain requires... propagating it. That's physically more logic.

On Xilinx, it's just a matter of connecting CARRYOUT to CE (using built-in CE inverter). It only goes through switch boxes and doesn't take any extra logic.

On other FPGAs there may be  no built-in carry chains, or different kinds of carry chains. Xilinx's internals are very well thought out. It's amazing how the people who designed Xilinx's CLBs may coexist with people who designed Vivado in the same company.

#### SiliconWizard

• Super Contributor
• Posts: 15118
• Country:
##### Re: Saturating counters
« Reply #40 on: April 12, 2020, 03:56:17 pm »
Yeah of course it all depends on physical implementation.
On Lattice MachXO2, it takes *2* extra LUTs to get the carry out signal out of any adder. Don't ask me why!
And yes, Xilinx FPGAs in general are pretty well designed.

Of course, on a purely gate level, getting the carry out of the carry chain will always take at least an extra gate.

Smf