Author Topic: Saturating counters  (Read 9107 times)

0 Members and 1 Guest are viewing this topic.

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: Saturating counters
« Reply #25 on: March 29, 2020, 09:58:22 pm »
4-bit saturating counter implemented using LUT6+FFs =  2.2ns = 454 MHz (constrained to 2.2ns with 0.687ns  slack).

When I constrained the later case to 2.0ns, it fails timing on "Total Pulse Width Negative Slack" .

This is just 4 LUTs (one for each output line), so it should run at max clock speed - 464 MHz for grade -1. If you switch to grade 2, it should give you 628 MHz.

I'm somewhat surprised by how slow the Block RAM version was. I was expecting it to go to the speed of the Block RAM - 450MHz. That external LUT2 (to combine the reset and enable signals) slows things down a bit

It's 388 MHz for grade -1. You probably can get rid of the external LUT to get to the this speed.

BRAM is only slower than LUT RAM for small sizes. If you go to bigger counters, the LUT RAM will be slower and slower, but BRAM will keep the same speed.
 

Online Someone

  • Super Contributor
  • ***
  • Posts: 4844
  • Country: au
    • send complaints here
Re: Saturating counters
« Reply #26 on: March 29, 2020, 10:24:35 pm »
I'm somewhat surprised by how slow the Block RAM version was. I was expecting it to go to the speed of the Block RAM - 450MHz. That external LUT2 (to combine the reset and enable signals) slows things down a bit
It's 388 MHz for grade -1. You probably can get rid of the external LUT to get to the this speed.

BRAM is only slower than LUT RAM for small sizes. If you go to bigger counters, the LUT RAM will be slower and slower, but BRAM will keep the same speed.
BRAM is pipelined and positioned for data flow designs, anything like this with single cycle feedback will be a long way off the Fmax. Once the size grows and you start having to tie multiple BRAMS together the fabric approaches will likely catch up/overtake in speed when you utilise the fast carry chain.
 

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: Saturating counters
« Reply #27 on: March 30, 2020, 02:32:26 am »
Arguing about the 2 controls for this counter is meaningless.  It can be anything your project requires, including 3 inputs or even more.  The whole issue here was about methods to implement the 'Saturation' itself.

It just seemed the two separate flags was prone to poor synthesis, so I wanted to see if there was a reason to chose that way that I was missing.

Why would it be "prone to poor synthesis?"

I chose the two-flags approach arbitrarily. Consider a case where the logic that controls the increment is separate from the logic that controls the decrement. As has been noted, some FIFO implementations do that.

One could also use the enable and direction approach.  You might use this when counting ticks from a rotary encoder which provides those two signals.

I suppose it depends more on what is controlling the counter.

In any case, as I said, I'm rarely interested in the maximum speed for a given logic block. My designs are usually the "most obvious." I always have clock constraints which drive design requirements. If a particular piece of logic doesn't meet the timing constraints, then I look for why that's the case and what I need to do to fix the problem.
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
Re: Saturating counters
« Reply #28 on: March 30, 2020, 03:52:28 am »
So to summarize, the two-flag control input only causes synthesis difficulties because there exists the corner case of increment = decrement = '1', and in behavioral code corner cases are not obvious and easy to be seen; it's very easy to create complex and unnecessary logic by the falling through of if-statements, which in this case resulted in additional logic in the datapath rather than just a simple adder/subtracter.

TL;DR: use dataflow syntax and explicitly spell out the structure of the hardware, the datapath, and the clock enable path. Understand that in an FPGA the registers have a built in MUX that allows clock enables without additional delays to the data path. In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.
Email: OwOwOwOwO123@outlook.com
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Saturating counters
« Reply #29 on: March 30, 2020, 04:00:08 am »
Dumb question, why use separate inc/dec flags. Why not single? Inc=1, dec=0.
You may want the counter to remain stationary.
Like a volume control know with a min and max, you may want it to turn up, or down, or stay still.

Sure, but as a programmer my natural reaction would be to have a single inc/dec flag and then a counter/clock enable flag. So why two seemingly clearly ambiguous flags instead? Again, I'm a FPGA n00b so just trying to understand.

If separate logic generates the increment and decrements signals then separate signals makes sense, and you can define the behavior when both is asserted at the same time.

If you want a minimal interface to a single piece of logic, then an "enable" and "direction" signal might be more appropriate.

You can always transform one interface to the other, but it might add levels of logic and might reduce timing margins if it becomes the critical timing path.

   enable <= up XOR down;
   direction <= up;

or

   up  <= enable AND direction;
   down <= enable AND NOT direction;

But the 'up' and 'down' interface is more generic (i.e. applicable to more situations). You can't assert 'up' and 'down' at the same time with an 'enable' and 'direction' interface
« Last Edit: March 30, 2020, 04:02:04 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Saturating counters
« Reply #30 on: March 30, 2020, 04:22:20 am »
So to summarize, the two-flag control input only causes synthesis difficulties because there exists the corner case of increment = decrement = '1', and in behavioral code corner cases are not obvious and easy to be seen; it's very easy to create complex and unnecessary logic by the falling through of if-statements, which in this case resulted in additional logic in the datapath rather than just a simple adder/subtracter.

TL;DR: use dataflow syntax and explicitly spell out the structure of the hardware, the datapath, and the clock enable path. Understand that in an FPGA the registers have a built in MUX that allows clock enables without additional delays to the data path. In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.

If you do run into timing problems this is where the whole "small/cheap/fast" trade off can be seen.

- If you want to use minimal resources, then using the control signals to select what the operand is on the adder (-1, 0 or 1) will be slowest.

- If you need good performance then calculating both counter_plus_one and counter_sub_one then using the control signals to select between those values will add minimal delay.

- For best performance, where the counter arithmetic is the critical path then avoiding the addition or subtraction completely by converting it to a memory-driven state machine might be a workable solution.

You usually want to start with the first, simple solution, and only use the more advanced solutions if you know this will be in the critical path.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: Someone

Offline Bassman59

  • Super Contributor
  • ***
  • Posts: 2501
  • Country: us
  • Yes, I do this for a living
Re: Saturating counters
« Reply #31 on: March 30, 2020, 04:45:36 am »
So to summarize, the two-flag control input only causes synthesis difficulties because there exists the corner case of increment = decrement = '1', and in behavioral code corner cases are not obvious and easy to be seen; it's very easy to create complex and unnecessary logic by the falling through of if-statements, which in this case resulted in additional logic in the datapath rather than just a simple adder/subtracter.

Note that I used a case statement instead of cascaded if statements. This takes care of priority. The case statement only checks for “01” and “10”, which increment or decrement. The case when both increment and decrement are asserted is treated as if neither were asserted — in the default, where nothing happens. In other words, that case is ignored. It’s not a corner case.

Quote
TL;DR: use dataflow syntax and explicitly spell out the structure of the hardware, the datapath, and the clock enable path. Understand that in an FPGA the registers have a built in MUX that allows clock enables without additional delays to the data path.

The synthesizer looks at the whole design. ISE and Vivado will create non-obvious logic. The CE input can be driven by logic more complex than you might expect if that is the most efficient implementation. The flip-flop in effect has two inputs.

Quote
In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.

Again, I say: who cares about that efficiency, unless you’re running at clock frequencies near the fabric limit.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7999
  • Country: ca
Re: Saturating counters
« Reply #32 on: March 30, 2020, 07:10:25 am »
In Altera FPGAs the registers have an additional "synchronous load" input (a load enable and a load data pin) which is another free MUX that can be used without increasing delays. No way you can make use of this efficiently writing behavioral code.
Quartus automatically uses them when writing behavioral code.  How do you think my 2 line Verilog (IF/ELSE IF) example compiled to produce a 333MHz design when using the slowest Cyclone II FPGA from 20 years ago.

Rewriting my code to this 1 line:
Code: [Select]
counter <= counter + ( increment && (counter != 4'hF) ) - ( decrement && (counter != 4'h0) ) ;
Prevents the use of the Sync or Async load inputs and results in a FMAX of only 204MHz.

If Quartus didn't do so, mos of my designs would never reach their FMAX as I'm always under pressure to squeeze the most into the cheapest FPGA, running with good wide margins, and my designs would take almost twice as long to develop to work around the issue.
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
Re: Saturating counters
« Reply #33 on: March 30, 2020, 07:46:04 am »
It does when you have very simple conditionals with 1 or 2 branches, but anything slightly more complex and you risk generating spaghetti logic. In this example (increment/decrement counters) there isn't even a need for the synchronous load (a clock enable is sufficient), so if that's what the synthesizer used then it's not optimizing as well as it could. I've shown above that the if/else example doesn't work well on Xilinx devices and ends up adding logic before the adder.

Code: [Select]
counter <= counter + ( increment && (counter != 4'hF) ) - ( decrement && (counter != 4'h0) ) ;
I do NOT recommend writing code like this at all. VHDL forbids that kind of complex expression and for good reason. This is not software where you can take shortcuts by using the result of a comparison as an integer, so it is not helpful to think that way. It also does not save any resources to use bitwise operations instead of "branching", unlike in software. Again, forget any and all optimization techniques you learned in software - none are applicable to FPGA design.

Instead, think in terms of basic building blocks - adder/subtractors, comparators, MUXs, and flipflops. In the increment/decrement counter case you know instantly that you need an adder/subtractor, so infer that first:

Code: [Select]
countNext <= a + b when addSubtractSel='1' else a - b;
Then, compose the clock enable expression by looking at when the counter should "advance". It is a function of control inputs and current counter value.

Code: [Select]
ce <= '1' when (increment='1' and count /= X"ffffffff") or (decrement='1' and count /= 0) else '0';
Infer a flipflop and tie it all together:

Code: [Select]
count <= countNext when ce='1' and rising_edge(clk);
addSubtractSel <= '1' when increment='1' else '0';
a <= count;
b <= X"00000001";

In the end you have logic that behaves identical to the behavioral version, but runs to >300MHz on Artix speed grade 1 (compared to 228MHz).
« Last Edit: March 30, 2020, 07:48:05 am by OwO »
Email: OwOwOwOwO123@outlook.com
 
The following users thanked this post: ralphrmartin, SiliconWizard

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7999
  • Country: ca
Re: Saturating counters
« Reply #34 on: March 30, 2020, 08:17:38 am »
It does when you have very simple conditionals with 1 or 2 branches, but anything slightly more complex and you risk generating spaghetti logic.
I know you can get away with at least 2 branches, with the addition of an automatically interpreted set of logic gates which also utilize a dedicated 0 penalty 'clock enable' for each logic cell.

Depending on the setup and logic functions, usually better interpreted by Quartus when using the 'CASE' statement, Quartus can squeeze between 3 to 8 select-able MUXs by adapting them part in the dedicated hardware & part in the regular gate fabric.  4 seems to be a tolerable sweet spot as designing a miniature CASE selectable 1 to 4 function logic based 8x8 bit ALUs usually always hits the top FMAX.  Just adding 1 'CASE' making a total of 5 can easily cripple the FMAX almost 2 fold.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: Saturating counters
« Reply #35 on: March 30, 2020, 01:30:25 pm »
For a counter of any reasonable size (e.g. 8+ bits), the critical path is the carry chain. Aside of obvious mistake of making saturation logic and carry chain sequential (rather than running them in parallel), it doesn't matter much how the saturation logic is implemented. Yet, when trying to make things faster, the participants of this thread discuss the details of saturation logic, and nobody discusses the carry chain. The only way to make things faster is to work on the carry chain.
 
The following users thanked this post: Someone

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15118
  • Country: fr
Re: Saturating counters
« Reply #36 on: April 10, 2020, 01:34:34 pm »
As a wrap-up, a quick word.

Note that the main point in my query was not just to find ways of making a saturating counter of a given size more efficient speed-wise, although this alone is obviously of practical importance. As far as speed is concerned, we can either go for a pipelined approach, or do nothing special, as the overhead of the saturation logic may be negligible compared to the carry chain, indeed, beyond a certain counter width (but the break-even point may be more surprising that some may think.)

The main puzzle was to find out if we could devise any counter than would inherently saturate with ZERO overhead compared to a roll-over counter. There doesn't seem to really be a solution to that, although memory-based approaches are candidates (but would be costly past a certain width obviously...)

 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: Saturating counters
« Reply #37 on: April 10, 2020, 02:42:24 pm »
The main puzzle was to find out if we could devise any counter than would inherently saturate with ZERO overhead compared to a roll-over counter.

The OwO's approach with saturation logic feeding the CE pins of the final flip-flops is zero overhead when applied to the carry chain solution. This is regardless of how long the carry chain is.

I would even say that nearly any sort of roll-over counter may be converted to a saturating one without adding any overhead. The exception would be hardware counters, such as built-in DSP counters, although DSPs may have built-in saturation logic as well.

<edit>To clarify, I'm referring to the speed. The saturating counter is likely to require extra logic.

<edit again>If the space needs to be saved, then you can use the CARRYOUT pin of the carry chain. If you just route inverted CARRYOUT to the CE pins of the flip-flops, then you'll get the saturating counter without any logic overhead. This doesn't require any extra logic (compared to non-saturating counter), but it'll be slower because of the extra time necessary to propagate CARRYOUT to flip-flops.
« Last Edit: April 10, 2020, 04:06:41 pm by NorthGuy »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15118
  • Country: fr
Re: Saturating counters
« Reply #38 on: April 10, 2020, 04:18:29 pm »
The main puzzle was to find out if we could devise any counter than would inherently saturate with ZERO overhead compared to a roll-over counter.

The OwO's approach with saturation logic feeding the CE pins of the final flip-flops is zero overhead when applied to the carry chain solution. This is regardless of how long the carry chain is.

I would even say that nearly any sort of roll-over counter may be converted to a saturating one without adding any overhead. The exception would be hardware counters, such as built-in DSP counters, although DSPs may have built-in saturation logic as well.

<edit>To clarify, I'm referring to the speed. The saturating counter is likely to require extra logic.

Yep. So you get the puzzle. It was not just a speed matter as I said.

<edit again>If the space needs to be saved, then you can use the CARRYOUT pin of the carry chain. If you just route inverted CARRYOUT to the CE pins of the flip-flops, then you'll get the saturating counter without any logic overhead. This doesn't require any extra logic (compared to non-saturating counter), but it'll be slower because of the extra time necessary to propagate CARRYOUT to flip-flops.

Indeed. So as I said, there doesn't really seem to be any free lunch here. You'll either lose on speed or area, or both.
For the approach above (using the carry out of the carry chain), further pipelining it should solve the slowdown. But then you'll get the latency which is not necessarily desirable.

Edit: note that taking the carry out of the carry chain actually takes extra logic. For a roll-over counter, you can just ignore it, and there will be less logic involved then. Getting the carry out of the whole carry chain requires... propagating it. That's physically more logic. On some particular physical implementations, such as on some FPGAs, that may not make a difference, but I've started experimenting a bit now that I have a little time for this - right now with Lattice Diamond (for a MachXO2 target). It turns out the "carry out" version seems to take more area, but is significantly faster. It tried both with LSE and Synplify pro, with similar results. For a 16-bit counter, I get 202MHz max (and MachXO2's are not speed daemons!), 24 LUT4s, for the "naive" version ("if counter /= counter_max ..."), and 270MHz, 35 LUT4s for the carry out version. I'll do more experiments though.
« Last Edit: April 10, 2020, 04:53:45 pm by SiliconWizard »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3237
  • Country: ca
Re: Saturating counters
« Reply #39 on: April 10, 2020, 05:06:33 pm »
... note that taking the carry out of the carry chain actually takes extra logic. For a roll-over counter, you can just ignore it, and there will be less logic involved then. Getting the carry out of the whole carry chain requires... propagating it. That's physically more logic.

On Xilinx, it's just a matter of connecting CARRYOUT to CE (using built-in CE inverter). It only goes through switch boxes and doesn't take any extra logic.

On other FPGAs there may be  no built-in carry chains, or different kinds of carry chains. Xilinx's internals are very well thought out. It's amazing how the people who designed Xilinx's CLBs may coexist with people who designed Vivado in the same company.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15118
  • Country: fr
Re: Saturating counters
« Reply #40 on: April 12, 2020, 03:56:17 pm »
Yeah of course it all depends on physical implementation.
On Lattice MachXO2, it takes *2* extra LUTs to get the carry out signal out of any adder. Don't ask me why!
And yes, Xilinx FPGAs in general are pretty well designed.

Of course, on a purely gate level, getting the carry out of the carry chain will always take at least an extra gate.
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf