Author Topic: FPGA: More elegant (and less timing violating) way of doing simple register map?  (Read 1054 times)

Offline daqq

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: sk
    • My site
Hi guys,

I'm doing a block in an FPGA project that is pretty much a configuration register map - if NReset is set to zero, all of the registers gain their default values (a fixed parameter, but different for different registers), otherwise assuming that DataValid is one, the appropriate register (32 bit), defined by Address (7 bit) will be set to DataIn. These are then connected outside of the module to control whatever blocks are needed.

This approach synthesized well enough, but after adding more registers, it generates timing violations...

First of all I would like to ask whether there is any more elegant way to do something like this? Since Verilog can't use an array of registers as a module port, I can't just use something like
Code: [Select]
output reg [31:0] Registers [31:0];
As such I have to write out a LOT of lines.

Second: Is there any more elegant way to do this that does not create an unholy synthesized mass of logic? The system runs @ 250MHz internally, so I can understand why it creates problems, since the combinational logic has a very high fanout (at least to my understanding, since I'm very new at this). It does not have to work in a single clock cycle - the fastest the control SPI can feed this is once every several thousand clock cycles.

Code: [Select]
always @(posedge Clock) begin
if(NReset == 0) begin
Reg00 <= Reg00_DefaultValue;
Reg01 <= Reg01_DefaultValue;
Reg02 <= Reg02_DefaultValue;
Reg03 <= Reg03_DefaultValue;
Reg04 <= Reg04_DefaultValue;
Reg05 <= Reg05_DefaultValue;
Reg06 <= Reg06_DefaultValue;
Reg07 <= Reg07_DefaultValue;
Reg08 <= Reg08_DefaultValue;
Reg09 <= Reg09_DefaultValue;
Reg0A <= Reg0A_DefaultValue;
Reg0B <= Reg0B_DefaultValue;
Reg0C <= Reg0C_DefaultValue;
Reg0D <= Reg0D_DefaultValue;
Reg0E <= Reg0E_DefaultValue;
Reg0F <= Reg0F_DefaultValue;

Reg10 <= Reg10_DefaultValue;
Reg11 <= Reg11_DefaultValue;
Reg12 <= Reg12_DefaultValue;
Reg13 <= Reg13_DefaultValue;
Reg14 <= Reg14_DefaultValue;
Reg15 <= Reg15_DefaultValue;
Reg16 <= Reg16_DefaultValue;
Reg17 <= Reg17_DefaultValue;
Reg18 <= Reg18_DefaultValue;
Reg19 <= Reg19_DefaultValue;
Reg1A <= Reg1A_DefaultValue;
Reg1B <= Reg1B_DefaultValue;
Reg1C <= Reg1C_DefaultValue;
Reg1D <= Reg1D_DefaultValue;
Reg1E <= Reg1E_DefaultValue;
Reg1F <= Reg1F_DefaultValue;
end else if(DataValid == 1'b1) begin
case (Address)
7'h00: Reg00 <= DataIn;
7'h01: Reg01 <= DataIn;
7'h02: Reg02 <= DataIn;
7'h03: Reg03 <= DataIn;
7'h04: Reg04 <= DataIn;
7'h05: Reg05 <= DataIn;
7'h06: Reg06 <= DataIn;
7'h07: Reg07 <= DataIn;
7'h08: Reg08 <= DataIn;
7'h09: Reg09 <= DataIn;
7'h0A: Reg0A <= DataIn;
7'h0B: Reg0B <= DataIn;
7'h0C: Reg0C <= DataIn;
7'h0D: Reg0D <= DataIn;
7'h0E: Reg0E <= DataIn;
7'h0F: Reg0F <= DataIn;

7'h10: Reg10 <= DataIn;
7'h11: Reg11 <= DataIn;
7'h12: Reg12 <= DataIn;
7'h13: Reg13 <= DataIn;
7'h14: Reg14 <= DataIn;
7'h15: Reg15 <= DataIn;
7'h16: Reg16 <= DataIn;
7'h17: Reg17 <= DataIn;
7'h18: Reg18 <= DataIn;
7'h19: Reg19 <= DataIn;
7'h1A: Reg1A <= DataIn;
7'h1B: Reg1B <= DataIn;
7'h1C: Reg1C <= DataIn;
7'h1D: Reg1D <= DataIn;
7'h1E: Reg1E <= DataIn;
7'h1F: Reg1F <= DataIn;
endcase
end
end

Best regards,

David
Believe it or not, pointy haired people do exist!
+++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 3693
  • Country: us
  • I informed you thusly
    • Personal site
The only reason this may not meet timings is the case statement doing something funky. Otherwise, this is a pretty straightforward thing where write enable is generated by a combinatorial logic based on the Address. There is no reason why this would get significantly slower with number of registers.

Try something like this:
Code: [Select]
reg [31:0] Reg [31:0];

always @(posedge Clock) begin
 if(NReset == 0) begin
   Reg[0] <= Reg00_DefaultValue;
   ........
   Reg[31] <= Reg1F_DefaultValue;
 end else if(DataValid == 1'b1) begin
   Reg[Address] <= DataIn;
 end
end

assign Reg00_o = Reg[0];
........
assign Reg1F_o = Reg[31];

Also, SystemVerilog supports arrays as ports.
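If you have to stay in plain Verilog-2001, a common workaround is to flatten the array into one wide vector on the port and slice it where it is used. A minimal sketch of that idea follows; the names RegMapFlat, RegsFlat and DEFAULTS are made up for illustration, and the reset values are assumed to be packed into a single parameter:

```verilog
// Sketch only: RegMapFlat, RegsFlat and DEFAULTS are hypothetical names.
module RegMapFlat #(
    parameter NREGS = 32,
    // Reset values packed as {last register default, ..., register 0 default}.
    // All zero here; override when instantiating.
    parameter [NREGS*32-1:0] DEFAULTS = {(NREGS*32){1'b0}}
) (
    input  wire                Clock,
    input  wire                NReset,
    input  wire                DataValid,
    input  wire [6:0]          Address,
    input  wire [31:0]         DataIn,
    output wire [NREGS*32-1:0] RegsFlat   // register i is bits [32*i +: 32]
);
    reg [31:0] Reg [0:NREGS-1];
    integer i;

    always @(posedge Clock) begin
        if (NReset == 1'b0) begin
            // Load every register with its slice of the packed defaults
            for (i = 0; i < NREGS; i = i + 1)
                Reg[i] <= DEFAULTS[32*i +: 32];
        end else if (DataValid && (Address < NREGS)) begin
            // Writes to addresses beyond the map are ignored
            Reg[Address] <= DataIn;
        end
    end

    // Flatten the internal array onto the single wide output port
    genvar g;
    generate
        for (g = 0; g < NREGS; g = g + 1) begin : flatten
            assign RegsFlat[32*g +: 32] = Reg[g];
        end
    endgenerate
endmodule
```

On the consuming side, register 3 would then be `RegsFlat[32*3 +: 32]`. This only shortens the source; the synthesized logic should be about the same as the written-out version.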
Alex
 
The following users thanked this post: daqq, newbrain

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 1264
  • Country: nz
It all depends on your needs. For example, imagine a 256x8 write only register file, 8 bits address, 8 bits data, 1 bit write enable, and 256 8-bit outputs.

If all of the outputs are going to be used directly, and you want them to update instantly you need 2048 flip-flops, lots of decoding logic, and very high fan outs on the data signal.  Basically the mess you already have.

The only option is to increase the latency to allow you to manage things better.

Let's take this idea to the extreme. If you only need one update every 20 cycles you can do the following:

When the data come in, write a 1, the address and the data to a 17 bit shift register.

Shift those bits out, starting with the 1. Into a single wire.

Where you want your register, receive those bits into an 8 bit shift register, and add an FSM triggered by the leading one. This FSM inspects the address, and if it matches a constant address it then captures the next 8 bits into registers, then assigns them to the register's outputs.

You can then chain these stages together so your write-only register file can be implemented in very small parts all over the FPGA die, with very low fan-outs, so running at very high speed. It can also use pipelining as required to meet timing.

Basically it becomes a simple network on a chip, the cost being maybe 2x as many registers, and latency between the write and the register changing.

Taking this to the extreme, routing this network can be implemented as an overlay, where you chain these blocks together after the main place and route, using whatever leftover resources you have to hand.
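One node of such a serial chain could look roughly like the sketch below. The frame format is an assumption made for the example: the line idles at 0, then a leading 1, then 8 address bits and 8 data bits, both MSB first. CfgNode, MY_ADDR, SerIn, SerOut and Value are hypothetical names.

```verilog
// Sketch only: one write-only register on a serial configuration chain.
// Assumed frame: idle 0, start bit 1, 8 address bits, 8 data bits (MSB first).
module CfgNode #(
    parameter [7:0] MY_ADDR = 8'h00    // constant address this node answers to
) (
    input  wire       Clock,
    input  wire       SerIn,           // serial stream from the previous node
    output reg        SerOut,          // registered copy for the next node
    output reg  [7:0] Value            // the register contents
);
    localparam IDLE = 2'd0, ADDR = 2'd1, DATA = 2'd2;

    reg [1:0] state = IDLE;
    reg [2:0] count = 3'd0;
    reg [7:0] shift = 8'd0;
    reg       match = 1'b0;

    initial Value = 8'h00;

    always @(posedge Clock) begin
        SerOut <= SerIn;                    // one flip-flop per hop, fan-out of two
        shift  <= {shift[6:0], SerIn};      // 8 bit receive shift register

        case (state)
            IDLE: if (SerIn) begin          // leading 1 starts a frame
                state <= ADDR;
                count <= 3'd0;
            end
            ADDR: begin
                count <= count + 3'd1;      // wraps to 0 after the 8th bit
                if (count == 3'd7) begin
                    // 7 bits already shifted in plus the bit arriving now
                    match <= ({shift[6:0], SerIn} == MY_ADDR);
                    state <= DATA;
                end
            end
            DATA: begin
                count <= count + 3'd1;
                if (count == 3'd7) begin
                    if (match)
                        Value <= {shift[6:0], SerIn};
                    state <= IDLE;
                end
            end
            default: state <= IDLE;
        endcase
    end
endmodule
```

Chaining is just SerOut of one node into SerIn of the next, so each hop adds one clock of latency and every net stays short.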

Oh, the other option for things like a 3x3 matrix used in colour conversion is, rather than having 9 registers, to have a 9-deep shift register and just have the CPU stream the values in. This gets rid of a lot of the decode logic.
 
The following users thanked this post: daqq

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 3693
  • Country: us
  • I informed you thusly
    • Personal site
Quote
and very high fan outs on the data signal.  Basically the mess you already have.
Yeah, I totally missed data line fan-out.
Alex
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 3012
  • Country: gb
  • Will design for cookies
What exactly are (say) the top three timing violations?

A very common cause of timing errors in a register map is having registers which are updated in one clock domain but read back in another. If the two clocks are not synchronised, a situation can exist where a register is read at a time when it's not yet stabilised from the previous write. The existence of the problem doesn't depend on the frequency of the two clocks.

Possible fixes include:

- treat SCK as an asynchronous signal, and sample it in the same (fast) clock domain as the one in which the registers are read.

- use a dual clock FIFO. SPI writes go into the FIFO under control of the 'write' clock, and are read out from the other side (into your register file) under control of a 'read' clock
 
The following users thanked this post: daqq

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 1264
  • Country: nz
Quote
What exactly are (say) the top three timing violations?

A very common cause of timing errors in a register map is having registers which are updated in one clock domain but read back in another. If the two clocks are not synchronised, a situation can exist where a register is read at a time when it's not yet stabilised from the previous write. The existence of the problem doesn't depend on the frequency of the two clocks.

Possible fixes include:

- treat SCK as an asynchronous signal, and sample it in the same (fast) clock domain as the one in which the registers are read.

- use a dual clock FIFO. SPI writes go into the FIFO under control of the 'write' clock, and are read out from the other side (into your register file) under control of a 'read' clock

Even if it is all in one clock domain (which would be a good thing, IMO), the high fanouts will be killing this - esp at 200+MHz.  Some pipelining might help a little (as it allows for high-fanout registers to get duplicated), but it will end up being a pretty large mesh of wires. The placer will most likely try and place this all in one spot (as it is so interconnected with high fanouts) so running the register outputs out across the die to the functional blocks of the design will be problematic. Upshot is that the "command and control" aspect will be very constraining on the design (which at 250MHz is shooting pretty high anyway).

Another option is to run the register file at an integer fraction of the main design's clock (e.g. 50.0MHz or 25MHz). That way you have plenty of slack for place and route, and don't get any timing violations because the clock domains are synchronous. Actually that sounds like quite a nice idea. I should charge for this quality of advice!  :D
« Last Edit: December 05, 2017, 08:00:34 PM by hamster_nz »
 
The following users thanked this post: daqq

Offline daqq

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: sk
    • My site
First of all: Thanks guys for the advice! I have already eliminated the problem - it was in one of the controlled "peripherals" that connected to one of the control registers and compared it - basically did a horrible amount of combinatorial logic (comparison, then mux, then comparison) before the next register stage. I've added intermediate registers between the two areas and it works OK. So it's all a problem of me not identifying the offending source properly  :palm:
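For anyone finding this later, the kind of fix I mean looks roughly like this. It is not my actual peripheral, just a sketch with made-up names (PipelinedCompare, Threshold, SampleA, SampleB, Limit, OverLimit) of splitting a compare, mux, compare chain with an intermediate register:

```verilog
// Sketch only: breaking a long combinatorial path with a pipeline register.
module PipelinedCompare (
    input  wire        Clock,
    input  wire [31:0] Threshold,   // would come from a control register
    input  wire [31:0] SampleA,
    input  wire [31:0] SampleB,
    input  wire [31:0] Limit,
    output reg         OverLimit
);
    reg [31:0] selected;            // intermediate register between the stages

    always @(posedge Clock) begin
        // Stage 1: first comparison and the mux, result is registered
        selected  <= (SampleA > Threshold) ? SampleA : SampleB;
        // Stage 2: second comparison only sees the registered value
        OverLimit <= (selected > Limit);
    end
endmodule
```

The cost is one extra clock of latency on the result, which does not matter for slowly changing control values.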

Quote
The only reason this may not meet timings is the case statement doing something funky. Otherwise, this is a pretty straightforward thing where write enable is generated by a combinatorial logic based on the Address. There is no reason why this would get significantly slower with number of registers.
I've tried synthesizing both (before finding out the bug), both gave roughly the same result and there is no noteworthy difference in how the actual logic was created. Attached are both outputs of both cases, they seem pretty similar, even when looked at close.

Quote
Try something like this:
Thanks! Looks great!

hamster_nz: Those are some very interesting ideas. For now I'll stay clear of them, but thanks! The asynchronous bit might work, but at the moment the SPI is sampled and processed - the receiving system does not run on the SPI SCK clock, but rather on the internal 250MHz clock and the signals are sampled. Converting the SPI into a slower system could lessen the problems in the future, I'll keep that door open.

AndyC_772: The SPI is treated asynchronously and is sampled - the SCK does not provide any real clocking, just moving forward in the receiving/transmitting FSM.

So, problem solved... for now...
 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 1768
  • Country: au
Quote
Even if it is all in one clock domain (which would be a good thing, IMO), the high fanouts will be killing this - esp at 200+MHz... (which at 250MHz is shooting pretty high anyway).
Speed is relative to the part being used: this would breeze in on a modern node but be fiendishly difficult on a 20 year old part. The tools should be properly handling fan out at every stage with techniques appropriate for the part; unless utilisation is already very high, replication of the nets is routine in even basic tools.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 1264
  • Country: nz
Quote
Even if it is all in one clock domain (which would be a good thing, IMO), the high fanouts will be killing this - esp at 200+MHz... (which at 250MHz is shooting pretty high anyway).
Speed is relative to the part being used: this would breeze in on a modern node but be fiendishly difficult on a 20 year old part. The tools should be properly handling fan out at every stage with techniques appropriate for the part; unless utilisation is already very high, replication of the nets is routine in even basic tools.

When doing commercial video work - (3D LUTs, focus aids, peaking, zooming and so on), we found that 148.5MHz was quite achievable without too many issues in lower-end Zynqs and Cyclone V SoCs, but going to 4k at 295MHz was almost asking too much for the parts, unless we paid very, very careful attention to every detail (and then there was the power issues with lots of logic running at that speed...).

A former pet project of mine, a Mandelbrot fractal generator on a Kintex, started getting tricky around 230MHz when I implemented complex multiplication in a mix of DSP48s and LUT logic.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 12938
  • Country: nl
    • NCT Developments
Using a 250MHz clock inside an FPGA for generic logic is a very bad idea and this is the root cause of the problem. It is better to have 2 or 3 clock domains with related clocks (like 250MHz, 125MHz and 31.25MHz) using the FPGA's internal clock generator(s). The lowest clock frequency can be used for house keeping stuff like configuration registers. The advantage is that only the parts which need to be fast cause routing problems, whereas the other (slower) stuff can be placed anywhere and still meets timing.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: daqq

Offline daqq

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: sk
    • My site
Quote
The lowest clock frequency can be used for house keeping stuff like configuration registers.
Wouldn't this cause a lot of extra registers for syncing between the two, or issues when reading out the counterpart of the configuration registers - the status registers? The counterpart for the provided code is a status register muxing system.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 12938
  • Country: nl
    • NCT Developments
Quote
The lowest clock frequency can be used for house keeping stuff like configuration registers.
Wouldn't this cause a lot of extra registers for syncing between the two, or issues when reading out the counterpart of the configuration registers - the status registers? The counterpart for the provided code is a status register muxing system.
No, because when the clocks are related their edges are still aligned, which in turn makes it very easy to go from a domain with a lower frequency to a domain with a higher frequency. The reverse is a bit more tricky because you need to keep the signal stable for a full clock cycle of the lower frequency, but you can still use the fact that the clock edges are aligned.
 

Offline daqq

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: sk
    • My site
Hmmm... I'll think about it - it certainly would solve some problems, since all of the configuration/status reading is slow compared to the main logic.

The configuration registers (writing them) would be fairly trivial (slow clock domain feeding a fast clock domain), but the status registers readout would cause a lot of objections from the compiler unless handled with a lot of nastiness...
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 12938
  • Country: nl
    • NCT Developments
Quote
Hmmm... I'll think about it - it certainly would solve some problems, since all of the configuration/status reading is slow compared to the main logic.

The configuration registers (writing them) would be fairly trivial (slow clock domain feeding a fast clock domain), but the status registers readout would cause a lot of objections from the compiler unless handled with a lot of nastiness...
Not if you clock the readout signals into flipflops with a lower frequency clock OR you specifically tell the compiler the signals have more relaxed timing constraints.
 

Offline NorthGuy

  • Frequent Contributor
  • **
  • Posts: 510
  • Country: ca
Quote
The counterpart for the provided code is a status register muxing system.

If you're muxing your registers afterwards (that is do not need all the outputs at the same time), then the whole thing looks very similar to a RAM block. Your FPGA may have built-in RAM blocks which might be able to replace both of your systems, but it all depends on the details.
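As a rough illustration, a simple dual-port memory written in the style below is what synthesis tools typically map onto a RAM primitive. The module and port names (RegRam, WriteEnable, ReadAddr and so on) are made up for the sketch, and whether it really lands in block RAM or in LUT RAM depends on the part and the tool:

```verilog
// Sketch only: one write port, one registered read port, 32 words of 32 bits.
module RegRam (
    input  wire        Clock,
    input  wire        WriteEnable,
    input  wire [4:0]  WriteAddr,
    input  wire [31:0] WriteData,
    input  wire [4:0]  ReadAddr,
    output reg  [31:0] ReadData
);
    reg [31:0] mem [0:31];

    always @(posedge Clock) begin
        if (WriteEnable)
            mem[WriteAddr] <= WriteData;
        // Registered read: the data appears one clock after the address
        ReadData <= mem[ReadAddr];
    end
endmodule
```

Note what is lost compared to the flip-flop version: there is no reset branch, because a RAM cannot be loaded with per-register defaults in a single cycle, and only one word is visible at a time.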
 

Offline aandrew

  • Regular Contributor
  • *
  • Posts: 140
  • Country: ca
Quote
Using a 250MHz clock inside an FPGA for generic logic is a very bad idea and this is the root cause of the problem. It is better to have 2 or 3 clock domains with related clocks (like 250MHz, 125MHz and 31.25MHz) using the FPGA's internal clock generator(s).

I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.
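A minimal sketch of the clock enable approach, assuming a divide ratio of 5 (250MHz down to a 50MHz update rate); SlowEnable, DIV, D, Q and Enable are names invented for the example:

```verilog
// Sketch only: "slow" logic that stays on the fast clock and uses a clock enable.
module SlowEnable #(
    parameter DIV = 5                 // assumed ratio: 250 MHz / 5 = 50 MHz
) (
    input  wire        Clock,         // the single fast clock
    input  wire [31:0] D,
    output reg  [31:0] Q,
    output reg         Enable         // high for one fast clock out of every DIV
);
    reg [7:0] div_count = 8'd0;

    initial begin
        Enable = 1'b0;
        Q      = 32'd0;
    end

    always @(posedge Clock) begin
        // Generate a one-cycle-wide enable pulse every DIV fast clocks
        if (div_count == DIV - 1) begin
            div_count <= 8'd0;
            Enable    <= 1'b1;
        end else begin
            div_count <= div_count + 8'd1;
            Enable    <= 1'b0;
        end

        // The slow register is clocked by Clock but only updates when enabled
        if (Enable)
            Q <= D;
    end
endmodule
```

Everything stays in one clock domain, so there is no crossing to worry about. The timing analyser will still treat the enabled paths as full-rate paths unless it is told otherwise with constraints.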

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 12938
  • Country: nl
    • NCT Developments
Quote
Using a 250MHz clock inside an FPGA for generic logic is a very bad idea and this is the root cause of the problem. It is better to have 2 or 3 clock domains with related clocks (like 250MHz, 125MHz and 31.25MHz) using the FPGA's internal clock generator(s).
I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.
If the synthesizer is able to deal with that and detects the clock rate reduction properly then it can be a solution as well. Otherwise you'd need a mess of timing constraints and really think about what you are doing. Having multiple clock domains with clocks as slow as possible has been serving me well to get to short place & route times and excellent use of FPGA resources. Of course you'd need enough clock distribution nets and PLLs available, which is why I wrote to have 2 or 3 different clocks.
Quote
And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
I've seen people do that as well and it is awful indeed.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 1264
  • Country: nz
Quote
And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.

Amen!
 

Offline daqq

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: sk
    • My site
Quote
I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
If not flip flops or DCMs/PLLs then how? If the clocks must be related to one another (be direct multiples of one another) then how do I generate such a clock besides those two options?

Could you give me some keywords that I can feed into google for this kind of thing? Let's say that I want to go with the dual/multiple clock domains, one for the high speed nasty (250MHz) and one, say, 50MHz for housekeeping (SPI interface with config/status registers, misc.). Is there any example? In particular the clock crossing and what constraints should I look for?
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 1264
  • Country: nz
Quote
I disagree; run all/most of the core logic at your main rate (250MHz) and use the built in clock enables to create slower/lower speed logic. There's no need to use DCMs/PLLs to generate integer fractions of your main rate, so long as it doesn't have to be 50% duty cycle.

And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
If not flip flops or DCMs/PLLs then how? If the clocks must be related to one another (be direct multiples of one another) then how do I generate such a clock besides those two options?

Could you give me some keywords that I can feed into google for this kind of thing? Let's say that I want to go with the dual/multiple clock domains, one for the high speed nasty (250MHz) and one, say, 50MHz for housekeeping (SPI interface with config/status registers, misc.). Is there any example? In particular the clock crossing and what constraints should I look for?

I think the suggested design is to use the same DCM or PLL to generate both the fast and slow clocks.

At least for Xilinx, there is no need to synchronise going from slow to fast: just have a value registered in the slow domain and consume it in the fast domain - the derived clock constraints cover it and make it happen magically. Just be aware that any control signals (e.g. a write enable) will have stretched pulses when used in the fast domain.

Going the other way can be trickier; the best bet is to use a FIFO for any data streams, or to act on control signals sourced from the slow domain only once every 'n' cycles.
 

Offline NorthGuy

  • Frequent Contributor
  • **
  • Posts: 510
  • Country: ca
Quote
Going the other way can be trickier

But if you derive slow clock domain using clock enable (e.g. on BUFG) then your fast clock domain will always have a counter which tells you the phase of your slow clock. This makes any form of synchronization very easy.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 1264
  • Country: nz
Quote
Going the other way can be trickier

But if you derive slow clock domain using clock enable (e.g. on BUFG) then your fast clock domain will always have a counter which tells you the phase of your slow clock. This makes any form of synchronization very easy.

When needed, I tended to sample a flipflop toggling in the slow domain to allow the relative phase to be deduced locally... but using a BUFGCE is a nice idea too.
 

Offline Bassman59

  • Frequent Contributor
  • **
  • Posts: 482
  • Country: us
  • Yes, I do this for a living
Quote
And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.

Except if you’re using an FPGA which does not have a PLL, or for Real Good Reasons you can’t use one that it might have, then you have little choice but to divide the clock using flip-flops. This is the boat I’m in now.

In this case, you have to ensure that the divided clock ends up on a global net, and realize that the divided clock is asynchronous to its source clock. The divided clock will always switch after its source, so you have to treat registers and signals generated in the source domain carefully. That means double- or triple-flip-flop synchronizers, asserting strobes for a “long” time, and all of the metastability hardening you don’t need with Xilinx parts.
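For reference, the double flip-flop synchronizer mentioned above is small. This is a generic sketch for a single-bit signal, with made-up names (Sync2FF, DestClock, AsyncIn, SyncOut):

```verilog
// Sketch only: two-stage synchronizer for ONE bit crossing into DestClock.
module Sync2FF (
    input  wire DestClock,   // clock of the receiving domain
    input  wire AsyncIn,     // signal generated in the other domain
    output wire SyncOut      // safe to use in the DestClock domain
);
    reg [1:0] sync = 2'b00;

    always @(posedge DestClock)
        // First stage may go metastable; second stage gives it a clock to settle
        sync <= {sync[0], AsyncIn};

    assign SyncOut = sync[1];
endmodule
```

It is only valid for single bits or strobes that are held long enough to be sampled. A multi-bit bus needs something else (a handshake, gray coding or a FIFO), because each bit may resolve on a different clock.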

 

Offline daqq

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: sk
    • My site
Thanks guys for the tips. I've done the switchover to a slow (50MHz) housekeeping domain and a fast (250MHz) number crunchy domain. It seems to be working so far.

There seems to be no problem so far, even without extra syncing registers between the two domains in both ways.
 

Offline Dubbie

  • Supporter
  • ****
  • Posts: 493
Quote
And not a comment directly for you, but for God's sake, don't use FFs to manually divide a clock down and feed that into other logic! I see this so often and it's such a bad practise.
I've seen people do that as well and it is awful indeed.

Have you guys been reading my Verilog files?

Ah well, you learn something new every day!
 

