Author Topic: FPGA text overlay on HDMI - can't meet timing  (Read 1848 times)

Offline AlessandroAU

  • Regular Contributor
  • *
  • Posts: 168
  • Country: au
FPGA text overlay on HDMI - can't meet timing
« on: January 21, 2019, 04:11:00 am »
Hi all,

I am a beginner with FPGAs and am trying to do a simple project on an Artix-7. I want to output an HDMI signal at 720p with a text overlay.
I have the HDMI generator and output working fine.

Next I tried to implement a simple 'text engine' that just takes the xpos/ypos of the current pixel and grabs the font data from a block ROM to determine if it should be modified to draw the text.

Problem is looking up the data from the block ROM takes wayyyy too much time and I can't meet timing, any ideas? What is the usual way this is solved?

Cheers,
Alessandro
« Last Edit: January 21, 2019, 04:13:42 am by AlessandroAU »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 1900
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #1 on: January 21, 2019, 04:43:47 am »
Pipeline. You are doing everything in one clock. Do it in little pieces, a little bit on each clock, so that it all gets calculated exactly by the time you need to output the pixel.
 

Offline apblog

  • Regular Contributor
  • *
  • Posts: 95
  • Country: us
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #2 on: January 21, 2019, 05:05:07 am »
Yes, pipeline.  To expand on that a bit, you are trying to do a lot of math in one clock cycle.

Remember that variables are essentially just like C macros, in the way that you are using them here.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 3535
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #3 on: January 21, 2019, 06:46:08 am »
Though I could give you my Verilog from a decade ago, which has self-contained display memory, font memory, and palette memory with alpha, 4 colors and translucency, it's been so long that I could not give you the best possible support, and you wouldn't learn much about thinking out a design and stage pipelining, since I've already done all the work for you.

Note, my design is an Altera Quartus II design, for a Cyclone device.  All my memory instances, rom and ram would need to be re-written for Xilinx.

Basically, I have an X&Y counter stage with programmable horizontal and vertical set and clear points.  This makes a programmable horizontal and vertical enable window box, so I can dynamically position/scroll the OSD window anywhere on the display as well as enlarge or shrink it.  (Basically an X-on and X-off / Y-on and Y-off register I can set via my internal controller MCU.)

Now, those 2 generated enables (X-ena and Y-ena) go next to the reset of a synchronous X&Y counter for the OSD text memory reader, with a programmable increment speed to scale the on-screen font's X&Y size.  The upper bits of those counters go into the read port of the character display memory, which has a latched address in and latched data out for maximum speed.  The least significant bits of those 2 X&Y counters go into 2 sets of D-flip-flop registers so that their contents arrive in parallel with the character display memory output.  Next, the character display memory output and the register-delayed LSBs of the X&Y counters are re-bundled to form the address of the font memory (2 bit in my case, since I had 4 colors wired to a palette memory), which has its address clocked in and its data clocked out.  That output went into a palette memory RAM holding 16 bits of data in total: a 4-bit superimpose blend setting plus 12-bit color (4096 colors).  Its output went into the final MUX, which selects how much of the OSD palette memory's 12-bit RGB color to show versus how much of the background video.  That final MUX had only a 1-clock delay, meaning I would also pass the main video HS, VS, and video enable through a single D-flip-flop to keep the output picture centered.

All taken into account, from the beginning of this whole chain, the horizontal enable position needs to begin 9 pixels to the left of where you really want your text output.  With a clocked layout like this, the Artix-7 design should compile and its FMAX will be at Xilinx's maximum possible speed because of the 9 sequential steps used in creating the OSD text display, though the background picture has only a 1 pixel clock delay in the final output MUX.

(This means that when setting the X-on and X-off OSD window registers, subtract 8, presumably the 9 OSD stages less the 1-clock delay on the background video, and add the number of pixels between the HS and the active picture region if you want the OSD to start right at the leftmost edge of the picture...)

I know this is a big step with all my stages, but if you want up to 500MHz (yikes, that's roughly 6.7x the pixel clock of 720p) video capability, all these isolated steps are needed.  Removing the latched addresses and data outs on the memory, or feeding the X&Y counters directly into the memory's address inputs, cuts down on the 9 setup clocks for the OSD generator, but it lowers your FMAX, and to achieve your FMAX you might need additional spare room as you fill your FPGA so that Xilinx's compiler/fitter can best optimize the layout.
« Last Edit: January 21, 2019, 07:18:49 am by BrianHG »
__________
BrianHG.
 

Online scatha

  • Regular Contributor
  • *
  • Posts: 61
  • Country: au
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #4 on: January 21, 2019, 10:20:35 am »
Pipelining the maths will help, and making sure that the output from the block RAM is registered (assuming that is where the ROM ends up) is mandatory for any high-speed stuff. Ultimately, though, this is all speculation until you check out the failing path(s) of the post-route design in Vivado. What is the contribution of routing and logic to the delay of the worst-case negative slack path, for instance?


 

Offline Rasz

  • Super Contributor
  • ***
  • Posts: 2314
  • Country: 00
    • My random blog.
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #5 on: January 21, 2019, 04:44:56 pm »
Quote from: AlessandroAU
Next I tried to implement a simple 'text engine' that just takes the xpos/ypos of the current pixel and grabs the font data from a block ROM to determine if it should be modified to draw the text.

It's not like you need low-latency random access.

Quote from: AlessandroAU
Problem is looking up the data from the block ROM takes wayyyy too much time and I can't meet timing, any ideas? What is the usual way this is solved?

Like BrianHG already said, start computing way ahead. Your character generator will never be asked for random parts of fonts; you will pretty much always do whole lines, and you will always know in advance what to generate, so there is no need to wait until the last moment/pixel.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 4318
  • Country: fr
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #6 on: January 21, 2019, 05:36:35 pm »
As said above. Pipeline. I'm not really convinced the issue is that reading from "block ROM" takes too long. It's usually pretty fast in FPGAs. The problem lies in that you're doing too much at each clock cycle, with too many dependencies, which will make the logic path too long. Inspecting the timing analyzer will help spot the issue.
 

Online scatha

  • Regular Contributor
  • *
  • Posts: 61
  • Country: au
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #7 on: January 21, 2019, 10:01:23 pm »
Quote from: SiliconWizard
As said above. Pipeline. I'm not really convinced the issue is that reading from "block ROM" takes too long. It's usually pretty fast in FPGAs.

The read itself is pretty fast, the problem is that in Xilinx FPGAs the block ram resources are physically so far from the slice logic you incur a large routing delay when you try to do anything with the block ram output. Registering at least reduces the combined logic+routing delay - that's why 'always register your block ram outputs' is recommended in the Xilinx high-speed design guides.

Ultimately all this is spit-balling until the OP fires up the Vivado timing analysis and floor planning tools to work out what *actually* is wrong.
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 3535
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #8 on: January 21, 2019, 11:13:48 pm »
Quote from: SiliconWizard
As said above. Pipeline. I'm not really convinced the issue is that reading from "block ROM" takes too long. It's usually pretty fast in FPGAs.

Quote from: scatha
The read itself is pretty fast, the problem is that in Xilinx FPGAs the block ram resources are physically so far from the slice logic you incur a large routing delay when you try to do anything with the block ram output. Registering at least reduces the combined logic+routing delay - that's why 'always register your block ram outputs' is recommended in the Xilinx high-speed design guides.

Ultimately all this is spit-balling until the OP fires up the Vivado timing analysis and floor planning tools to work out what *actually* is wrong.

Hence in my instruction: "character display memory which has latched address in and latched data out".
And I said this for every ram instance in my OSD routine algorithm description, which gives you that 2 clock delay from address to data out at the display ram, font ram, and color palette ram.

The issue is identical for Altera's dual-port RAM and ROM instances, with one caveat: on some of the larger Altera FPGAs, you can place an address latch with an increment/decrement counter within the address input logic cells of that RAM/ROM instance, saving 1 clock cycle for things like a very fast FIFO with an async address load feature.  Xilinx may well have the same capability, but as with Altera, to gain this feature you might need to instantiate a custom module function supplied by the vendor's development suite.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 1900
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #9 on: January 21, 2019, 11:41:45 pm »
Quote from: BrianHG
And I said this for every ram instance in my OSD routine algorithm description, which gives you that 2 clock delay from address to data out at the display ram, font ram, and color palette ram.

On Xilinx it all depends on the clock speed. Up to about 250 MHz-ish you can get away with unregistered BRAM (data on the next clock after address). Up to about 450 MHz you can use BRAM with internal registers (data after 2 clocks). If you want faster, you need to use tricks, such as interleaved reading at the slower clock.

You can also use LUT RAM which is faster than BRAM and doesn't even require a clock to read, but its performance deteriorates as you increase address width.
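
As a rough illustration of the two BRAM read latencies mentioned above, here is a minimal Verilog sketch. The module name, the parameters and the font.hex init file are assumptions made up for the example, not taken from anyone's design, and whether the tools map the second register onto the BRAM's built-in output register depends on the device and synthesis settings.

```verilog
// Sketch only: sync_rom, its parameters and "font.hex" are hypothetical names.
module sync_rom #(
    parameter ADDR_W  = 11,          // 2048 entries (assumed size)
    parameter DATA_W  = 8,
    parameter OUT_REG = 1            // 0: data 1 clock after addr, 1: data 2 clocks after addr
) (
    input  wire              clk,
    input  wire [ADDR_W-1:0] addr,
    output wire [DATA_W-1:0] q
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];
    initial $readmemh("font.hex", mem);   // hypothetical init file

    reg [DATA_W-1:0] q_mem;   // synchronous read: first clock of latency
    reg [DATA_W-1:0] q_out;   // extra output register: second clock of latency

    always @(posedge clk) begin
        q_mem <= mem[addr];
        q_out <= q_mem;
    end

    // Pick the latency at elaboration time; the unused register is trimmed away.
    assign q = OUT_REG ? q_out : q_mem;
endmodule
```

With OUT_REG = 0 the data appears one clock after the address; with OUT_REG = 1 it appears after two, and everything that travels alongside the data has to be delayed by the same amount.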
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 3535
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #10 on: January 22, 2019, 12:06:25 am »
Quote from: NorthGuy
Up to about 250 MHz-ish you can get away with unregistered BRAM (data on the next clock after address).
Yes, if the fitter has an easy time and you have the right space free on the FPGA.  We cannot know the OP's complete use of the FPGA, or how much more he will want to place on it.  I say eat the extra 24-32 data latches and give the fitter the easiest possible time routing a design which only requires 75MHz to operate.  That way, 6 months down the line when you fill the FPGA to 95%, you will not be going back to your OSD section to mess with it all over again, or going from the cheapest -1 to a -2 or -3 or a larger FPGA, or manually aiding the fitter to meet that clock rate.

Do it once, do it right, and when you next need 1080p support, or higher scan rates, you have already maxed out what the FPGA can deliver if need be.
 

Offline AlessandroAU

  • Regular Contributor
  • *
  • Posts: 168
  • Country: au
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #11 on: January 22, 2019, 04:49:59 am »
Hi everyone,

Thanks for your help, particularly BrianHG! It will take me a while to digest your post. I actually managed to get this working with some 'simple' pipelining(?). I split the process into 2 halves and read an 'A' set of variables while processing the 'B' set, and vice versa. Is this actually how to pipeline correctly? It feels extremely clumsy.


 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 1900
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #12 on: January 22, 2019, 05:28:20 am »
Quote from: AlessandroAU
Is this actually how to pipeline correctly?

Sort of, except you don't need two sets of variables. Pipelining simply means inserting flip-flops into your logic chain. For example, this does everything in one clock:

Code:
if rising_edge(clk) then
  R <= A + B + C + D;
end if;

In contrast, this pipelines it:

Code:
if rising_edge(clk) then
  AB <= A + B;
  CD <= C + D;
  R <= AB + CD;
end if;

While R is being calculated for time (t), AB and CD are calculated for time (t+1). At the next clock, you take AB and CD you have just calculated and calculate R for time (t+1).

Now it takes two clocks to complete, but the amount of work per clock cycle is less, so you can run your clock faster.
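
For anyone following along in Verilog, the same two-stage structure might look like the sketch below. The module name, the widths and the valid flags are assumptions added for illustration; the valid bit simply travels alongside the data so that downstream logic knows when R corresponds to a real input.

```verilog
// Sketch only: sum4_pipelined and the valid signals are hypothetical additions.
module sum4_pipelined #(
    parameter W = 16
) (
    input  wire         clk,
    input  wire         in_valid,    // high when A, B, C, D hold real data
    input  wire [W-1:0] A, B, C, D,
    output reg  [W+1:0] R,           // two extra bits so the sum cannot overflow
    output reg          out_valid    // high when R holds a real result
);
    reg [W:0] AB, CD;                // stage-1 partial sums, one extra bit each
    reg       valid_s1;

    initial begin
        valid_s1  = 1'b0;
        out_valid = 1'b0;
    end

    always @(posedge clk) begin
        // Stage 1: two short additions in parallel
        AB       <= A + B;
        CD       <= C + D;
        valid_s1 <= in_valid;

        // Stage 2: uses the AB and CD registered on the previous clock
        R         <= AB + CD;
        out_valid <= valid_s1;
    end
endmodule
```

A new set of inputs can still be accepted on every clock; only the latency from input to R grows to two clocks.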

 
Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 3535
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #13 on: January 22, 2019, 06:42:29 am »
That is not the way I would do it.  Since I cannot see the rest of your code, I cannot help much.  It looks as if you are feeding in a character at a time, but where is it coming from?
Anyway, here is my example, elaborated:
-------------------------------------------------------
Step1:
I have an OSD_raster_X&Y pixel position counter which runs only where I want my OSD text.
My text display memory in my old designs was a dual-port RAM in the FPGA.
Port A had an address, WE and data in, so I could write text to my display memory.
Port B of that RAM was read-only; its address was wired to the OSD_raster_X position shifted down by 4 bits and the OSD_raster_Y position shifted down by 4 bits (my font was 16x16 pixels).  The data output was called 'Display_Character', and it arrived on the next pixel clock since the RAM module was registered.
--------------------------------------------------------------
Step2:
I make a copy of the 4 LSBs of the OSD_raster_X&Y pixel counters, now delayed by 1 pixel clock and called DLY_OSD_raster_X&Y.
My font memory is also a dual-port RAM (just so I have a port to edit the font in software).
Port A (address, WE, and data inputs) allows software modification of the font.
Port B's read address was DLY_OSD_raster_X&Y, which came from the OSD position counters, plus the 'Display_Character' from the display RAM's Port B data out.  My font's Port B output was 2 bits, which then went to the palette memory, then the display.
---------------------------------------------------------------
Now, in this example, it is not that my code runs only Step1, then only runs Step2.  Both run simultaneously, all the time.  Every pixel clock, the screen position counters only increment, or reset at HS and VS.  These counters feed the display memory's Port B read address in real time.  At the same time, Step2 still clocks in whatever is in front of it, even though on the very first pixel clock that is garbage: stale DLY_OSD_raster_X&Y coordinates and a stale 'Display_Character' from the display memory.

At the next pixel clock, the display RAM's 'Display_Character' and its DLY_OSD_raster_X&Y coordinates hold the right data for the previous pixel clock's position, which now feeds the font memory's address input.

At the next clock, the display ram's address input is now on the third pixel position.  The font's address input is on the second pixel position and the font output finally has the first pixel position valid pixel.  This pipe stream goes on and on.  This is a valid pipeline for maximum speed. 

What's important here is that no math is being done at all.  All you need to feed into this is a reset for the OSD X&Y counters; always increment the X when not in reset, and increment the Y once on each HS unless it's in reset.  Everything else is just shifting blindly through registers or RAM blocks.
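
A minimal Verilog sketch of Step1 and Step2 running side by side, assuming 8-bit character codes, a 64x32 character window, one clock of latency per memory, and hypothetical text.hex / font.hex init files. The write ports (Port A) are left out to keep it short, and the module and parameter names are made up for the example.

```verilog
// Sketch only: sizes, widths, file names and the module name are assumptions.
module osd_char_pipeline #(
    parameter COLS_W = 6,   // 64 character columns (assumed)
    parameter ROWS_W = 5    // 32 character rows (assumed)
) (
    input  wire              clk,            // pixel clock
    input  wire [COLS_W+3:0] OSD_raster_x,   // low 4 bits: column inside the 16x16 cell
    input  wire [ROWS_W+3:0] OSD_raster_y,   // low 4 bits: row inside the 16x16 cell
    output reg  [1:0]        font_pixel      // valid 2 clocks after the counters
);
    reg [7:0] display_ram [0:(1<<(COLS_W+ROWS_W))-1];  // one character code per cell
    reg [1:0] font_rom    [0:(1<<16)-1];               // 256 chars x 16 rows x 16 columns
    initial begin
        $readmemh("text.hex", display_ram);  // hypothetical init files
        $readmemh("font.hex", font_rom);
    end

    reg [7:0] Display_Character;
    reg [3:0] DLY_OSD_raster_x, DLY_OSD_raster_y;

    always @(posedge clk) begin
        // Step1: upper counter bits address the display memory, while the
        // lower 4 bits are delayed one clock so they arrive with the character.
        Display_Character <= display_ram[{OSD_raster_y[ROWS_W+3:4], OSD_raster_x[COLS_W+3:4]}];
        DLY_OSD_raster_x  <= OSD_raster_x[3:0];
        DLY_OSD_raster_y  <= OSD_raster_y[3:0];

        // Step2: character code plus the delayed row/column address the font memory.
        font_pixel <= font_rom[{Display_Character, DLY_OSD_raster_y, DLY_OSD_raster_x}];
    end
endmodule
```

Both assignments execute on every clock, so while the font memory is looking up pixel N, the display memory is already fetching the character for pixel N+1.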

Now, I left out the part about when to reset the OSD's internal X&Y counters.  This is done with 2 flags, or registers, which are calculated in advance of Step1.  The FPGA can increment or reset these 2 X&Y position counters without doing any add or subtract on your master reference raster generator's X&Y counters to position the OSD on the display, using something simple like:

---------------------------------------------------------------------------------------------------------------
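// Runs on every pixel clock; 'clk' is an assumed name for that clock.
always @(posedge clk) begin
    if (raster_position_X == X_OSD_POSITION_LEFT_PARAMETER)
        x_osd_reset <= 1'b0;                     // entering the OSD window: release reset
    else if (raster_position_X == X_OSD_POSITION_RIGHT_PARAMETER)
        x_osd_reset <= 1'b1;                     // leaving the OSD window: hold in reset

    if (raster_position_Y == Y_OSD_POSITION_TOP_PARAMETER)
        y_osd_reset <= 1'b0;
    else if (raster_position_Y == Y_OSD_POSITION_BOTTOM_PARAMETER)
        y_osd_reset <= 1'b1;
end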
-----------------------------------------------------------------------------------------------------------------
(Your parameter may also be a register if you want a software programmable OSD position window and programmable X&Y size.)
Notice here, there are only 4 equality compares, which generate the x&y_osd_reset registers.  These registers would control my 'OSD raster X&Y' counters, and also be passed through (register delayed) to the display output as an inverted enable OSD for the MUX.

Basically:
---------------------------------------------------------------------------------
// Runs on every pixel clock; 'clk' is an assumed name for that clock.
always @(posedge clk) begin
    if (x_osd_reset) OSD_raster_x <= 0;
    else             OSD_raster_x <= OSD_raster_x + 1'b1;

    // This assumes HS is 1 pixel wide, otherwise you may need to adjust this code.
    // (I can give you a simple foolproof trick, but you'll need to ask.)
    if (y_osd_reset) OSD_raster_y <= 0;
    else if (HS)     OSD_raster_y <= OSD_raster_y + 1'b1;
end
----------------------------------------------------------------------------------

Think this through, as you may be able to adapt some of your code without going all the way to my design.

And don't forget I have a:
 -----------------------------------------------------------------------
OSD_output_enable_early2 <= ~x_osd_reset && ~y_osd_reset;
OSD_output_enable_early1 <= OSD_output_enable_early2;
OSD_output_enable_early0 <= OSD_output_enable_early1;
OSD_output_enable             <= OSD_output_enable_early0;
------------------------------------------------------------------------
This generates OSD_output_enable in sync with the 3 register-delayed memory clock cycles.
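
The same kind of delay chain can be written once and reused for HS, VS and the video enable. The sketch below is a generic N-stage delay in Verilog; the module and parameter names are assumptions, and STAGES = 4 reproduces the four-register chain above.

```verilog
// Sketch only: sync_delay and its parameters are hypothetical names.
module sync_delay #(
    parameter WIDTH  = 1,   // number of signals delayed together
    parameter STAGES = 4    // delay in clocks, must be at least 1
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);
    reg [WIDTH-1:0] pipe [0:STAGES-1];
    integer i;

    always @(posedge clk) begin
        pipe[0] <= d;                        // first register stage
        for (i = 1; i < STAGES; i = i + 1)
            pipe[i] <= pipe[i-1];            // each stage copies the previous one
    end

    assign q = pipe[STAGES-1];               // output is d delayed by STAGES clocks
endmodule

// Possible use (signal names from the post, instance name made up):
//   sync_delay #(.WIDTH(1), .STAGES(4)) osd_en_dly
//       (.clk(clk), .d(~x_osd_reset & ~y_osd_reset), .q(OSD_output_enable));
```

Keeping the stage count in one parameter makes it easier to keep the enable, HS and VS aligned if a memory later gains or loses a register.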
« Last Edit: January 22, 2019, 07:06:09 am by BrianHG »
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 3535
  • Country: ca
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #14 on: January 22, 2019, 07:33:39 am »
Nowhere in my code am I doing something like:
------------------------------------------------------------------------------------------
 if  hc_a >= posx and hc_a < posx + (font_width * displaytext^length) then ...
------------------------------------------------------------------------------------------

Remember, every single clock, the entire expression above needs to be solved, whether each variable/register is a fixed number or a dynamic number, while you are also doing a full greater-than and less-than magnitude comparison of those sums.

Do you know how much faster an FPGA could run your code like this:
-------------------------------------------------------------------------------------------------
temp_brian <= posx + (font_width * displayText^length) +1 ;
 if  hc_a >= posx and hc_a < temp_brian  then ...
------------------------------------------------------------------------------------------------
(Remember, temp_brian is a latched register here, not an integer!)
I separated the less-than magnitude compare from all that math, which bundled together requires so many gates that it slows everything down a ton.  I was guessing that posx increments every clock, and since 'temp_brian' is delayed by a clock, I added 1 when computing that register.  Preparing an enable/disable register using == tests instead, one test for enable and one for disable, would be even faster for the FPGA.

EG:
----------------------------------------------------------------------------------------
if (hc_a == posx)            inXrange <= 1'b1;
else if (hc_a == temp_brian) inXrange <= 1'b0;
----------------------------------------------------------------------------------------
This only requires 1 XOR gate per bit, times 2, and 2 big AND gates to trigger inXrange on and off.
In your code, using >= and < requires many more gates to evaluate the magnitude of 2 dynamic quantities while also computing an additional sum and a potential multiply on one side.  Even worse, both compares are grouped together, and the logic gates must evaluate everything at once before deciding what to do with inXrange.
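
Put together as a small Verilog module, the registered end position and the two equality tests might look like this. It is only a sketch: the module name, the 12-bit width and the text_width input (standing in for font_width times the string length, assumed to be computed elsewhere) are assumptions, and end_x plays the role of temp_brian.

```verilog
// Sketch only: x_range_flag, text_width and end_x are hypothetical names.
module x_range_flag #(
    parameter W = 12
) (
    input  wire         clk,
    input  wire [W-1:0] hc_a,        // horizontal pixel counter
    input  wire [W-1:0] posx,        // left edge of the text
    input  wire [W-1:0] text_width,  // width of the text in pixels (assumed precomputed)
    output reg          inXrange
);
    reg [W-1:0] end_x;               // registered right edge, so the add is off the compare path

    initial inXrange = 1'b0;

    always @(posedge clk) begin
        end_x <= posx + text_width;

        // Two equality compares instead of a >= / < pair fused with the sum.
        if (hc_a == posx)       inXrange <= 1'b1;
        else if (hc_a == end_x) inXrange <= 1'b0;
    end
endmodule
```

Because inXrange is itself a register, it goes high one clock after hc_a equals posx and low one clock after hc_a equals end_x, so the window keeps its width but is shifted right by one pixel; posx can be set one lower to compensate. The flag also relies on hc_a actually passing through both values on every line.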

These tricks only matter if you are trying to go so far as to get your pixel clock to run up to 600 MHz with your Artix, or to run your project at 1080p (148.5 MHz) in a slow, small $5 FPGA.

I'm not saying that your coding is bad, it's just with FPGA compilers, there are some limits to their speed and squishing all that math and compare together may work fast with a hardwired ASIC, but we want to shrink and simplify the amount of math going into each register or compare to keep an FPGA going quick.
« Last Edit: January 22, 2019, 07:57:23 am by BrianHG »
 
Offline vaualbus

  • Frequent Contributor
  • **
  • Posts: 302
  • Country: it
Re: FPGA text overlay on HDMI - can't meet timing
« Reply #15 on: January 23, 2019, 02:41:17 pm »
Also, I don't know if anyone has pointed it out, but you are using VHDL as a programming language, and it is not one! It is a language for describing hardware, so you should first have worked out the timing, then developed the datapath, and then written in VHDL the ASM that controls it.
Best regards, Alberto
 

