FPGA VGA Controller for 8-bit computer

#2250 Reply
Posted by nockieboy on 17 Dec, 2020 22:33
Quote from: BrianHG on 17 Dec, 2020 20:56
Did you not take a look at my above attached code (2 posts up) with the 2 versions and 3 errors and work from there???

No, I missed that post somehow - I just did what you asked (below). Will have a look at those tomorrow.

Quote from: BrianHG on 17 Dec, 2020 09:44
Send me a working Freebasic version for testing...
(Region 1&2 are your old 'Inv' function - green / red lines of the complex geoarc function...)

#2251 Reply
Posted by BrianHG on 18 Dec, 2020 00:34
I guess you could rename my previous post's geoarc.bas to geoarc_boxstyle.bas & copy & rename the geoarc_complex.bas to geoarc.bas & add in this new code only drawing that 1 quadrant like the other 2.

#2252 Reply
Posted by nockieboy on 18 Dec, 2020 15:26
I think this is what you want?

Looks like another issue with the 2nd region - the red line jumps off on a tangent when the ellipse becomes very wide.

Geoarc.bas

#2253 Reply
Posted by BrianHG on 18 Dec, 2020 22:28
Ok, much better. The code generating the vertical region 1 is perfect like you said. Region 2 is missing the final few pixels when the ellipse is almost flat and it's missing all the pixels when it is flat. (You can say that this is region 1 not finishing the last flat part of the line. This isn't a problem as we add code to finish the flat portion of the ellipse like what we have in the _complex version.)

Let me play with this code tonight. I think getting rid of region 2 and drawing region 1 twice with flipped X&Y & A&B, like we did in the _complex version, would make that identical 'mirror' version of the vertical arc.

The good news is that region 1 of this generator looks almost perfect, and if that 45-degree mirror & flip trick works, we should have the Verilog in a day.

Also, look at what happens to the region 2 with really large ellipses, the math breaks and the line goes flat. We are exceeding 32 bits. Let's see what happens when doubling region 1.
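As a rough software model of the mirror idea described above (a sketch only, not the contents of the attached geoarc.bas): region 1 of a midpoint ellipse is generated once with (rx, ry), then again with the radii swapped and each point's X and Y exchanged. The names `region1` and `quadrant_arc` are hypothetical, and Python is used so that the intermediate values are not limited to 32 bits.

```python
def region1(rx, ry):
    """Midpoint-ellipse points from (0, ry), stepping in x while px <= py."""
    if rx < 1 or ry < 1:
        raise ValueError("this sketch assumes both radii are at least 1")
    rx2, ry2 = rx * rx, ry * ry
    x, y = 0, ry
    px, py = 0, 2 * rx2 * ry              # running 2*ry2*x and 2*rx2*y
    p = ry2 - rx2 * ry + (rx2 + 2) // 4   # decision value, rx2/4 rounded
    points = []
    while px <= py:
        points.append((x, y))
        x += 1
        px += 2 * ry2
        if p <= 0:
            p += ry2 + px
        else:
            y -= 1
            py -= 2 * rx2
            p += ry2 + px - py
    return points


def quadrant_arc(rx, ry):
    """One quadrant: region 1, plus region 1 of the swapped ellipse mirrored."""
    first = region1(rx, ry)
    mirrored = [(x, y) for (y, x) in region1(ry, rx)]  # swap X and Y back
    return first + mirrored


print(sorted(set(quadrant_arc(3, 3))))
```

This sketch makes no attempt to finish a very flat ellipse or to guarantee that the two halves always meet, which is the part the thread handles with separate code. For radii near 2047 the starting value of `py` is already larger than a signed 32-bit integer can hold, which is consistent with the breakage seen on very large ellipses.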

#2254 Reply
Posted by BrianHG on 19 Dec, 2020 06:10
Ok, please scrutinize the crap out of this FreeBasic version. If it is good, I will code it in Verilog for you.

When I say scrutinize, I mean check that the graphics are rendered perfectly.

Also, make sure the code looks good.

geoarc_simplest.zip

#2255 Reply
Posted by nockieboy on 19 Dec, 2020 10:32
Quote from: BrianHG on 19 Dec, 2020 06:10
Ok, please scrutinize the crap out of this FreeBasic version. If it is good, I will code it in Verilog for you.

When I say scrutinize, I mean check that the graphics are rendered perfectly.

Also, make sure the code looks good.

The code is looking sweet - MUCH simpler than the complex iteration we had before. I've just formatted the code a little and added-in the quadrant drawing, so the function will now draw all four quadrants as specified by the 'quadrant' parameter. Probably completely unnecessary, but I feel like I've done something!

As far as my testing can go, it would appear the arcs are spot on.

Geoarc.bas

#2256 Reply
Posted by nockieboy on 19 Dec, 2020 14:20
Whilst looking at the DVI specification and spending half an hour on the fpga4fun HDMI page before I realised the example is for a Xilinx FPGA, I came across this repo on github which purports to implement an HDMI interface with embedded audio... https://github.com/hdl-util/hdmi.git.

I've only spent five minutes with it - have cloned the repo and attempted to build it in Quartus II, but it's throwing errors in the audio_clock_regeneration_packet.sv file (it doesn't seem to like this next line):
Code: [Select]
line 41: localparam int CYCLE_TIME_STAMP_COUNTER_WIDTH = $clog2(20'(int'(real'(CYCLE_TIME_STAMP_COUNTER_IDEAL) * 1.1))); // Account for 10% deviation in audio clock

... but then I never expect to clone a github repo and it just 'work' straight away. They normally take more time to get working than it would take to build the thing from scratch, but I was hoping that I might be able to save some time and also use a more powerful interface (the inclusion of audio in the output is very desirable).

Anyway, I've gone back to the fpga4fun example and am trying to convert it for my test-case. It seems nice and simple, but it looks like the OBUFDS, DCM_SP and BUFG components are created by the Xilinx IDE, like Altera's megafunctions. OBUFDS looks simple enough - it just drives the two differential-pair output pins with the data and an inverted copy of the data, from what I can gather from the equivalent file in the github repo above (although it does some funny stuff with IO pins which complicates the module somewhat).

But can I replace DCM_SP and BUFG with a PLL clock multiplier to get the required 250MHz clock?

Here's my DVI_test code below - I've commented it for the most part, but just past half-way down the code you'll see the instantiation of OBUFDS, DCM_SP and BUFG components. These are what I'm having issues with understanding, currently.

Code: [Select]
module video_source(

    // inputs
    input        clk,        // 25MHz pixel clock

    // outputs
    output [2:0] TMDSp,      // TMDS data out
    output [2:0] TMDSn,
    output       TMDS_CLKp,  // pixel clock out
    output       TMDS_CLKn

);

reg [9:0] x, y ;  // horizontal & vertical pixel counters
reg       hSync ;
reg       vSync ;
reg       D_EN ;  // Display ENable

/* Create timing signals for a valid 640x480 display.
 *
 * This requires D_EN to be enabled when the raster is
 * within the visible display area and for the raster
 * counters (X and Y) to be updated and reset according
 * to their position on the screen.
 *
 * hSync and vSync should also go high according to the
 * specifications outlined for the 640x480 video mode.
 */
always @(posedge clk) begin

    D_EN <= ( x < 640 ) && ( y < 480 ) ;  // enable display if pixel counters are in visible display area
    x    <= ( x == 799 ) ? 0 : x + 1 ;    // increment horizontal pixel counter, or reset if at end of line

    if( x == 799 ) begin  // horizontal pixel counter has reached end of row
        if ( y == 524 ) begin
            y <= 0 ;      // reset vertical pixel counter
        end else begin
            y <= y + 1 ;  // increment vertical pixel counter
        end
    end

    hSync <= ( x >= 656 ) && ( x < 752 ) ;  // hSync goes HIGH when horizontal pixel counter is between 655 and 752
    vSync <= ( y >= 490 ) && ( y < 492 ) ;  // vSync goes HIGH when vertical pixel counter is between 489 and 492

end

wire [7:0] W = {8{x[7:0]==y[7:0]}} ;
wire [7:0] A = {8{x[7:5]==3'h2 && y[7:5]==3'h2}} ;
reg  [7:0] red ;
reg  [7:0] green ;
reg  [7:0] blue ;

// Create a display pattern
always @(posedge clk) begin

    red   <= ( { x[5:0] & { 6 { y[4:3] == ~x[4:3] } }, 2'b00 } | W ) & ~A ;
    green <= ( x[7:0] & { 8{ y[6] } } | W ) & ~A ;
    blue  <= y[7:0] | W | A ;

end

//
// Create three TMDS_encoder instances to handle the Red, Green, Blue and Control signals
//
wire [9:0] TMDS_red ;
wire [9:0] TMDS_green ;
wire [9:0] TMDS_blue ;

TMDS_encoder encode_R(
    .clk ( clk ),
    .VD  ( red ),
    .CD  ( 2'b00 ),
    .VDE ( D_EN ),
    .TMDS( TMDS_red )
);

TMDS_encoder encode_G(
    .clk ( clk ),
    .VD  ( green ),
    .CD  ( 2'b00 ),
    .VDE ( D_EN ),
    .TMDS( TMDS_green )
);

TMDS_encoder encode_B(
    .clk ( clk ),
    .VD  ( blue ),
    .CD  ( { vSync, hSync } ),
    .VDE ( D_EN ),
    .TMDS( TMDS_blue )
);

//
// Multiply 25MHz clock by 10 to generate a 250MHz clock
//
wire clk_TMDS ;
wire DCM_TMDS_CLKFX ;  // 25MHz x 10 = 250MHz

DCM_SP #(.CLKFX_MULTIPLY(10)) DCM_TMDS_inst(.CLKIN(clk), .CLKFX(DCM_TMDS_CLKFX), .RST(1'b0) ) ;
BUFG BUFG_TMDSp(.I(DCM_TMDS_CLKFX), .O(clk_TMDS)) ;

//
// Create three 10-bit shift registers running at 250MHz
//
reg [3:0] TMDS_mod10       = 0 ;  // modulus 10 counter
reg [9:0] TMDS_shift_red   = 0 ;
reg [9:0] TMDS_shift_green = 0 ;
reg [9:0] TMDS_shift_blue  = 0 ;
reg       TMDS_shift_load  = 0 ;

always @(posedge clk_TMDS) begin

    TMDS_shift_load  <= ( TMDS_mod10 == 4'd9 ) ;
    TMDS_shift_red   <= TMDS_shift_load ? TMDS_red   : TMDS_shift_red  [9:1] ;
    TMDS_shift_green <= TMDS_shift_load ? TMDS_green : TMDS_shift_green[9:1] ;
    TMDS_shift_blue  <= TMDS_shift_load ? TMDS_blue  : TMDS_shift_blue [9:1] ;
    TMDS_mod10       <= ( TMDS_mod10 == 4'd9 ) ? 4'd0 : TMDS_mod10 + 4'd1 ;

end

OBUFDS OBUFDS_red  ( .I( TMDS_shift_red  [0] ), .O( TMDSp[2] ),  .OB( TMDSn[2] ) ) ;
OBUFDS OBUFDS_green( .I( TMDS_shift_green[0] ), .O( TMDSp[1] ),  .OB( TMDSn[1] ) ) ;
OBUFDS OBUFDS_blue ( .I( TMDS_shift_blue [0] ), .O( TMDSp[0] ),  .OB( TMDSn[0] ) ) ;
OBUFDS OBUFDS_clock( .I( clk ),                 .O( TMDS_CLKp ), .OB( TMDS_CLKn ) ) ;

endmodule

//*********************************************************************************************************
//
// TMDS Encoder Module
//
//*********************************************************************************************************
module TMDS_encoder(

    input            clk,
    input      [7:0] VD,   // video data (red, green or blue)
    input      [1:0] CD,   // control data
    input            VDE,  // video data enable, to choose between CD (when VDE=0) and VD (when VDE=1)
    output reg [9:0] TMDS = 0

);

wire [3:0] Nb1s = VD[0] + VD[1] + VD[2] + VD[3] + VD[4] + VD[5] + VD[6] + VD[7] ;
wire       XNOR = ( Nb1s > 4'd4 ) || ( Nb1s == 4'd4 && VD[0] == 1'b0 ) ;
wire [8:0] q_m  = { ~XNOR, q_m[6:0] ^ VD[7:1] ^ { 7{ XNOR } }, VD[0] } ;

reg  [3:0] balance_acc     = 0 ;
wire [3:0] balance         = q_m[0] + q_m[1] + q_m[2] + q_m[3] + q_m[4] + q_m[5] + q_m[6] + q_m[7] - 4'd4 ;
wire       balance_sign_eq = ( balance[3] == balance_acc[3] ) ;
wire       invert_q_m      = ( balance == 0 || balance_acc == 0 ) ? ~q_m[8] : balance_sign_eq ;
wire [3:0] balance_acc_inc = balance - ( { q_m[8] ^ ~balance_sign_eq } & ~( balance == 0 || balance_acc == 0 ) ) ;
wire [3:0] balance_acc_new = invert_q_m ? balance_acc-balance_acc_inc : balance_acc + balance_acc_inc ;
wire [9:0] TMDS_data       = { invert_q_m, q_m[8], q_m[7:0] ^ { 8{ invert_q_m } } } ;
wire [9:0] TMDS_code       = CD[1] ? ( CD[0] ? 10'b1010101011 : 10'b0101010100 )
                                   : ( CD[0] ? 10'b0010101011 : 10'b1101010100 ) ;

always @(posedge clk) begin

    TMDS        <= VDE ? TMDS_data       : TMDS_code ;
    balance_acc <= VDE ? balance_acc_new : 4'h0 ;

end

endmodule
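For anyone trying to follow the 250MHz shift-register stage in the listing above, here is a very small software model of it. It is only a sketch: the function name `serialize_lsb_first` is made up, and it ignores the one-clock delay of the registered load flag. It only shows that each 10-bit TMDS word leaves the chip bit 0 first.

```python
def serialize_lsb_first(words, width=10):
    """Yield the bits of each word, bit 0 first, as a right-shifting register would."""
    mask = (1 << width) - 1
    for word in words:
        shift = word & mask          # parallel load of one encoded word
        for _ in range(width):
            yield shift & 1          # bit 0 is what drives the output buffer
            shift >>= 1              # same effect as taking shift[9:1]


# One of the control codes from the encoder, sent out LSB first.
print(list(serialize_lsb_first([0b1101010100])))
```

Ten output bits are produced per input word, which is why the bit clock in the listing is ten times the pixel clock.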

#2257 Reply
Posted by asmi on 19 Dec, 2020 15:07
The reason there are so many code examples for Xilinx FPGA is because it's very trivial to implement HDMI on that platform.

OBUFDS - Output BUFfer with Differential Signalling
DCM - Digital Clock Manager (think of it as advanced PLL with additional functionality)
BUFG - global clock buffer (entry into low-skew lines designed to distribute clock signals across the die)

The code you've posted does not use SERDES, which is fine for lower frequencies, but won't work for things like 720p, which run at 742.5 MHz.
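To put numbers on the point above: TMDS sends 10 bits per pixel on each data channel, so the serial rate is ten times the pixel clock. The sketch below is plain arithmetic with made-up function names. The 800x525 total raster matches the counters in the posted code, while the 1650x750 total for 720p60 is the standard CEA-861 figure and is an assumption here, not something stated in this thread.

```python
def pixel_clock_hz(h_total, v_total, refresh_hz):
    """Pixel clock needed to scan the whole raster (including blanking)."""
    return h_total * v_total * refresh_hz


def tmds_bit_rate_hz(pixel_clock):
    """Each pixel becomes one 10-bit TMDS word per channel."""
    return pixel_clock * 10


vga_clock = pixel_clock_hz(800, 525, 60)      # 640x480 visible
hd720_clock = pixel_clock_hz(1650, 750, 60)   # 1280x720 visible

print(vga_clock, tmds_bit_rate_hz(vga_clock))       # 25200000 252000000
print(hd720_clock, tmds_bit_rate_hz(hd720_clock))   # 74250000 742500000
```

A plain fabric shift register clocked at around 250MHz is plausible on many FPGAs, whereas 742.5 Mbit/s per pair is where dedicated serializer hardware is normally needed.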

#2258 Reply
Posted by BrianHG on 19 Dec, 2020 21:42
Quote from: nockieboy on 19 Dec, 2020 14:20
... but then I never expect to clone a github repo and it just 'work' straight away. They normally take more time to get working than it would take to build the thing from scratch, but I was hoping that I might be able to save some time and also use a more powerful interface (the inclusion of audio in the output is very desirable).

#2259 Reply
Posted by BrianHG on 19 Dec, 2020 21:49
Quote from: nockieboy on 18 Dec, 2020 15:26
I think this is what you want?

Looks like another issue with the 2nd region - the red line jumps off on a tangent when the ellipse becomes very wide.
There was no need to add the 4 quadrants. They are implied, but thanks.
You only did a half job, you forgot to add them during the 'finish-flat ellipse' portion.

Anyways, I'll code it tonight & I decided to do it with one single 24-bit or 32-bit A*B=Y multiplier using the existing setup.
At least if you read the code, you will see how I went about doing that. 24 bit should have no problems passing 125MHz, maybe even the 150 point, but 32 bit x 32 bit with 32 bit out, I never tried. However, the design will allow for a 2-clock piped multiplier which will negate the problem. The new code should render 125 Mpps no problem, as there are no such heavy-duty multiple multiplies during the iteration loop.

#2260 Reply
Posted by SiliconWizard on 19 Dec, 2020 22:13
Quote from: asmi on 19 Dec, 2020 15:07
The reason there are so many code examples for Xilinx FPGA is because it's very trivial to implement HDMI on that platform.

OBUFDS - Output BUFfer with Differential Signalling
DCM - Digital Clock Manager (think of it as advanced PLL with additional functionality)
BUFG - global clock buffer (entry into low-skew lines designed to distribute clock signals across the die)

The code you've posted does not use SERDES, which is fine for lower frequencies, but won't work for things like 720p, which run at 742.5 MHz.

Yep. Out of curiosity, do you think there is any possibility that SERDES could be inferred from pure HDL (maybe with a particular coding style), or is it never going to be inferred, so you need to explicitly instantiate it?

#2261 Reply
Posted by BrianHG on 19 Dec, 2020 22:46
Quote from: SiliconWizard on 19 Dec, 2020 22:13
Quote from: asmi on 19 Dec, 2020 15:07
The reason there are so many code examples for Xilinx FPGA is because it's very trivial to implement HDMI on that platform.

OBUFDS - Output BUFfer with Differential Signalling
DCM - Digital Clock Manager (think of it as advanced PLL with additional functionality)
BUFG - global clock buffer (entry into low-skew lines designed to distribute clock signals across the die)

The code you've posted does not use SERDES, which is fine for lower frequencies, but won't work for things like 720p, which run at 742.5 MHz.

Yep. Out of curiosity, do you think there is any possibility that SERDES could be inferred from pure HDL (maybe with a particular coding style), or is it never going to be inferred, so you need to explicitly instantiate it?
Well, instantiating 2 different SERDES, like Intel/Xilinx, with a compiler directive to select between the 2 shouldn't be a problem, as both SERDES would eventually have the same configuration selected unless you are targeting a vendor-specific feature. Software SERDES also shouldn't be a problem, though you won't hit the super-high speeds of the dedicated HW. Selecting the IO buffer type and configuring its drive and termination characteristics/features does become much more vendor specific, and even with a compiler directive to select between the two, their configuration would probably be totally different.

For example, in Quartus, you can define the outputs coming from your SERDES logic in the pin-mapping floor-plan, separate from any HDL code. If you assign your output 'DVI_OUT[3..0]' as LVDS or Differential SSTL in the compiler's assignment editor, it will automatically force those onto 4 output pairs (8 output pins); you will need to specify the +output pin and the compiler will force its -output onto the corresponding differential IO pair.
You can also obviously instantiate the LVDS/Differential IOBUF in your HDL with whatever configuration you like, bypassing the need to configure the IO pin specifics in the pin-mapping floor-plan editor.

#2262 Reply
Posted by asmi on 20 Dec, 2020 00:42
Quote from: SiliconWizard on 19 Dec, 2020 22:13
Yep. Out of curiosity, do you think there is any possibility that SERDES could be inferred from pure HDL (maybe with a particular coding style), or is it never going to be inferred, so you need to explicitly instantiate it?
No chance. If you remember, you can't actually achieve a 10:1 DDR serialization ratio with a single SERDES; you will need to use a cascaded pair. Which is of course how it was designed to be - since you have a SERDES per pin, a differential pair gives you two SERDES'es. And they will have to run at DDR, so the bit clock will be 5x the parallel one. That means, if you were to write this code manually, there has got to be a CDC circuit from the parallel clock to the bit clock, which in reality is implemented in the SERDES in silicon. Obviously all of that is highly device-specific. The nice thing about 7 series is that the entire series has identical HW blocks, so migrating from Spartan-7 to Artix-7 to Kintex-7 to Virtex-7 is absolutely painless, but if you were to go to Ultrascale, for example, you will have to make changes in the code.
The same deal is with MGTs. These are device family specific such that even GTPs in Artix-7 are quite a bit different from GTXes in Kintex-7.

#2263 Reply
Posted by BrianHG on 20 Dec, 2020 11:44
Here it is, the new ellipse generator.
Currently, the FMAX is 129MHz.
It now uses only 4 9-bit multiplier elements instead of 24.
However, the logic element count has more than doubled.

The only problem is that to get 129MHz, I had to limit the core to 24 bits. At the preferred 32 bits, we only get 116MHz. This is due to having 2 alternate 4-way 32-bit signed parallel adds for every iteration of the arc. I will think of something, as we need 32 bits for ellipses with a radius of 2047x2047. However, this will make a circle 4095x4095, outside our signed 12-bit coordinate system. I will try shaving a few bits and removing the sign where it is not needed since, in the timing report, it is only the last 2-3 MSBs of the 'p' accumulator which cannot meet the 125MHz timing with that 4-way parallel add.

For now, the core is set to 24 bits. Perform your tests. It should match the FreeBasic output, other than that the 'quadrant' is set up in a different order. Let me know how it works. Test all you like.

Get the attached simulation setup. If it works, next I will try to enlarge the 24-bit core without sacrificing FMAX and then we integrate it into the GPU.

Geo_Writer_V8_129mhz.zip

#2264 Reply
Posted by nockieboy on 20 Dec, 2020 16:13
I've done some simulations using various dimensions for the ellipse - have attached some typical variations, and have tested permutations of X and Y from 0-40 pixels (and other sizes without checking ALL the pixels) - all are pixel-perfect in the simulation output compared with the expected output provided by the FreeBasic code.

I can't get it to break.

#2265 Reply
Posted by BrianHG on 20 Dec, 2020 17:53
Quote from: nockieboy on 20 Dec, 2020 16:13
I've done some simulations using various dimensions for the ellipse - have attached some typical variations, and have tested permutations of X and Y from 0-40 pixels (and other sizes without checking ALL the pixels) - all are pixel-perfect in the simulation output compared with the expected output provided by the FreeBasic code.

I can't get it to break.
Those final 1x20 simulation results are wrong. Are you sure you didn't change the inputs and forget to run/update the results?
Try at least 1 shape other than perfect circles or near vertical lines...

#2266 Reply
Posted by nockieboy on 20 Dec, 2020 18:02
Uh, looks like I messed up the images or something - the output is fine, it matches geo.bas with everything I've thrown at it so far.

#2267 Reply
Posted by BrianHG on 21 Dec, 2020 04:13
Tutorial for Nockieboy in improving FMAX. Part 1.

Ok, for the new 'ellipse_generator.sv', we have a problem passing our required 125MHz when running the core at 32bits. Right now, our limit is 117 MHz. Looking at the timing report, we see that a lot of signals (From Node) p[ # ],px[ # ],ry2[ # ],px[ # ] do not make it to the register p[ # ] (To Node) in time. The worst signal arrives 0.54ns late (Slack in -xxx ns). See:

(Unfortunately, in Quartus Prime when compiling for CV, I think they only provide the 1 single worst-case timing signal. I'm sure there is a way to increase the size of the timing report so you get a better overview.)

Ok, so we need to look at the code to see what P is equal to and why these signals feeding it come in too late and how we may be able to improve the situation.
Here is how I begin to approach the problem and the technique I used here is middle of the road and there are other ways, but this was my first approach trying to maintain the current structure. First look to see everywhere I make register 'p' = to something. Here:

Code: [Select]
When sub_function == 3
    p <= (alu_mult_y + 2) >> 2 ;
When sub_function == 6
    p <= p + ry2 - alu_mult_y ;
When sub_function == 7 && (px <= py) && (p <= 0)
    p <= p + ry2 + (px + (ry2<<1)) ;
When sub_function == 7 && (px <= py) && !(p <= 0)
    p <= p + ry2 + (px + (ry2<<1)) - (py - (rx2<<1)) ;
Below, I'm showing you how the compiler constructs the logic for calculating 'p' (approximately). Remember, the FPGA is not a CPU passing memory variable to and from a single ALU, all the above instructions need to be combined into a single set of gates to make the 32 bit register 'p' equal the following function at the core clock of 125MHz.
(Yes, I tried to get this right, so analyze it...)

Code: [Select]
p <= ( p * (sub_function != 3) )
   + ( ((alu_mult_y + 2) >> 2) * (sub_function == 3) )
   - ( alu_mult_y * (sub_function == 6) )
   + ( ry2 * ( sub_function == 6 || (sub_function == 7 && (px <= py)) ) )
   + ( (px + (ry2<<1)) * ((sub_function == 7) && (px <= py)) )
   - ( (py - (rx2<<1)) * ((sub_function == 7) && (px <= py) && !(p <= 0)) ) ;
YES, all that shit... Though, the compiler will simplify the algebra as much as possible, this is the mess that 32 bit register 'p' must equal with all those other variables being 32 bits which feed a mass of gates to compute for the D-flipflop 32 bit data input. Apparently, the necessary entire mass of gates will fail to guarantee the correct solution when register 'p' is clocked (with everything else of course) above 117MHz.
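One way to check that the single flattened expression really is the same thing as the separate 'When' rules is to model both in software and compare them on random inputs. This is a sketch under assumptions: the function names are hypothetical, each comparison is treated as a 1/0 multiplier (with the comparisons parenthesised so they are evaluated before the multiply), and a sub_function value that matches no rule is assumed to leave 'p' unchanged.

```python
import random


def p_next_cases(sub_function, p, px, py, rx2, ry2, alu_mult_y):
    """The separate rules: one assignment per case, otherwise hold."""
    if sub_function == 3:
        return (alu_mult_y + 2) >> 2
    if sub_function == 6:
        return p + ry2 - alu_mult_y
    if sub_function == 7 and px <= py:
        if p <= 0:
            return p + ry2 + (px + (ry2 << 1))
        return p + ry2 + (px + (ry2 << 1)) - (py - (rx2 << 1))
    return p


def p_next_flat(sub_function, p, px, py, rx2, ry2, alu_mult_y):
    """The same logic as one sum of terms, each gated by a 1/0 condition."""
    iterate = sub_function == 7 and px <= py
    return (p * (sub_function != 3)
            + ((alu_mult_y + 2) >> 2) * (sub_function == 3)
            - alu_mult_y * (sub_function == 6)
            + ry2 * (sub_function == 6 or iterate)
            + (px + (ry2 << 1)) * iterate
            - (py - (rx2 << 1)) * (iterate and not (p <= 0)))


rng = random.Random(1)
for _ in range(1000):
    args = (rng.randrange(16), rng.randrange(-10**9, 10**9),
            rng.randrange(10**9), rng.randrange(10**9),
            rng.randrange(2**22), rng.randrange(2**22), rng.randrange(10**9))
    assert p_next_cases(*args) == p_next_flat(*args)
print("case-by-case and flattened forms agree")
```

Every gated term in the flattened form corresponds to signals that must reach the 'p' register within one clock, which is the point of the exercise.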

'p' is dependent on the sub_function[3:0] number, (px <= py), !(p <= 0), plus the 32 bit registers 'p' itself since it is being added to itself, then alu_mult_y both added by 2 and shifted and again natively, rx2, ry2, px and py.

Code: [Select]
sub_function    = 4 bits
(px<=py)        = 32+32 bits
(p<=0)          = 32 bits
'p'             = 32 bits
alu_mult_y      = 32 bits * 2 = 64 (shifted and non-shifted)
rx2,ry2,px,py   = 32 * 4 bits = 128
TOTAL: 324 bits / 324 wires/signals to generate the result 'p'.
Here is a test I performed. What I did was make p<=0 at sub_function==3 and getting rid of sub_function 6 which changes the equation to:

Code: [Select]
p <= ( p * (sub_function != 3) )
   + ( (ry2 + (px + (ry2<<1))) * ((sub_function == 7) && (px <= py)) )
   - ( (py - (rx2<<1)) * ((sub_function == 7) && (px <= py) && !(p <= 0)) ) ;
We got rid of the 2 * alu_mult_y cutting 64 wires from the 324 needed from the first equation.
Now the compiler gives us an FMAX of 132MHz and, looking at the worst-case timing paths, 'p' (To Node) is actually third down on the list, which means there is something else above it which is limiting the system to 132MHz.

So, we have a goal, how to incorporate these 2 setup actions:
Code: [Select]
p <= (alu_mult_y + 2) >> 2 ;
p <= p + ry2 - alu_mult_y ;
And not add any complexity/dependencies to the above test 132MHz FMAX equation.

The trick I decided to use was to temporarily store '(alu_mult_y + 2) >> 2' in ry2 and then just use the beginning of what already exists in the equation:
(remove the red part and just keep the beginning...)
p <= p + ry2 + (px + (ry2<<1)) - (py - (rx2<<1)) ;

Ok, trick 1, with the repeat of reusing ry2.
In the above sub_function==3, I made 'ry2 <= (alu_mult_y + 2) >> 2'. Since ry2 only has 1 = alu_mult_y, adding this here doesn't really slow down that register.
Now during sub_function==4, I made 'p <= ry2'. Since 'ry2' is added to 'p' everywhere else, this doesn't add additional signal dependencies to the master 'p <= 'blahhh blahhh blahhh' ' equation.

Sub_function==5, since ry2 will now have been updated to the next value, I just added:
p <= p + ry2;
Again, no new dependencies to calculate the master equation 'P'.

Sub_function==6, ok, there was no choice, I had to make it:
p <= p - alu_mult_y;
This added 32 new bits of dependency. Let's try a compile and see the results.

As you can see, the new FMAX is 121MHz and there are only 4 signals too slow to make the cut.
Now, we know getting rid of that '- alu_mult_y' will allow us to clear the hurdle with spades, but without doing back-flips let's try 1 thing first.

The rx2 & ry2 are the square of the Xr & Yr 12 bit numbers which I have forced to 32 bits.
Since they are only positive integers from 0 to 2047, the result will always be an unsigned 22 bit number. Let's see what happens if I force these to 'UNSIGNED 22 bits', since rx2 & ry2 are used so often everywhere in that gigantic ' p <= blahhh blahhh blahhh '.
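The 22-bit figure is easy to confirm with a couple of lines of arithmetic. The helper name `unsigned_bits` below is made up for illustration.

```python
def unsigned_bits(value):
    """Number of bits needed to hold a non-negative integer."""
    return max(1, value.bit_length())


MAX_RADIUS = 2047                       # largest positive 12-bit signed value
largest_square = MAX_RADIUS * MAX_RADIUS

print(largest_square, unsigned_bits(largest_square))   # 4190209 22
```

So the top 10 bits of a 32-bit rx2 or ry2 register are always zero, and declaring the registers as 22-bit unsigned lets the tools drop that logic from the adders feeding 'p'.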

Ok, talk about just clearing the hurdle, 126MHz...
We also went from 926 Logic elements to 888.

Question: can we do better?
Another solution may be making a temporary register hold:
p <= (alu_mult_y + 2) >> 2 ;
p <= p + ry2 - alu_mult_y ;
Then at the last step, make p <= that 1 register.
Optimization attempt #2 as well as testing only making the Rx2 & Ry2 = 22bits with the original code for tomorrow.

Test V9 attached spaghetti code.
Snapshots not necessary unless there are errors...
(I also found out I'm doing 2 sub_functions uselessly identically twice, the correction will be done next.)

Geo_Writer_V9_optimize_v1_126mhz.zip

#2268 Reply
Posted by nockieboy on 21 Dec, 2020 11:58
Quote from: BrianHG on 21 Dec, 2020 04:13
Tutorial for Nockieboy in improving FMAX. Part 1.

...

Below, I'm showing you how the compiler constructs the logic for calculating 'p' (approximately). Remember, the FPGA is not a CPU passing memory variable to and from a single ALU, all the above instructions need to be combined into a single set of gates to make the 32 bit register 'p' equal the following function at the core clock of 125MHz.
(Yes, I tried to get this right, so analyze it...)

Code: [Select]
p <= ( p * (sub_function != 3) )
   + ( ((alu_mult_y + 2) >> 2) * (sub_function == 3) )
   - ( alu_mult_y * (sub_function == 6) )
   + ( ry2 * ( sub_function == 6 || (sub_function == 7 && (px <= py)) ) )
   + ( (px + (ry2<<1)) * ((sub_function == 7) && (px <= py)) )
   - ( (py - (rx2<<1)) * ((sub_function == 7) && (px <= py) && !(p <= 0)) ) ;
YES, all that shit... Though, the compiler will simplify the algebra as much as possible, this is the mess that 32 bit register 'p' must equal with all those other variables being 32 bits which feed a mass of gates to compute for the D-flipflop 32 bit data input. Apparently, the necessary entire mass of gates will fail to guarantee the correct solution when register 'p' is clocked (with everything else of course) above 117MHz.

Whoah. Well, firstly thank you for that post - it was immensely useful in understanding what's going on 'under the hood' of the compiler and in identifying Fmax bottlenecks in the HDL. I mean... damn!! And I think I understood most of it, too!

Okay, here's what I've got from the latest code for the p register since you've amended the code - pulling out any lines where the p-register is having a value assigned to it:

Code: [Select]
if (sub_function == 4)
    p <= p + ry2 ;
if (sub_function == 5)
    p <= p + ry2 ;
if (sub_function == 6)
    p <= p - alu_mult_y ;
if (sub_function == 7) && (px <= py) && (p <= 0)
    p <= p + ry2 + (px + (ry2<<1)) ;
if (sub_function == 7) && (px <= py) && !(p <= 0)
    p <= p + ry2 + (px + (ry2<<1)) - (py - (rx2<<1)) ;
And from that, my attempt at what the compiler is doing to get the HDL-equivalent of those rules:

Code: [Select]
p <= p + ( ( ( sub_function > 3 ) && ( sub_function != 6 ) ) * ry2 )
       - ( ( sub_function == 6 ) * alu_mult_y )
       + ( ( ( sub_function == 7 ) && ( px <= py ) ) * ( px + ( ry2<<1) ) )
       - ( ( ( sub_function == 7 ) && ( px <= py ) && ( p > 0) ) * ( py - ( rx2<<1) ) )
If I've understood and worked it out correctly, that looks a LOT simpler than the mess that was there before you'd worked on the Fmax performance last night.

Quote from: BrianHG on 21 Dec, 2020 04:13
'p' is dependent on the sub_function[3:0] number, (px <= py), !(p <= 0), plus the 32 bit registers 'p' itself since it is being added to itself, then alu_mult_y both added by 2 and shifted and again natively, rx2, ry2, px and py.

Code: [Select]
sub_function    = 4 bits
(px<=py)        = 32+32 bits
(p<=0)          = 32 bits
'p'             = 32 bits
alu_mult_y      = 32 bits * 2 = 64 (shifted and non-shifted)
rx2,ry2,px,py   = 32 * 4 bits = 128
TOTAL: 324 bits / 324 wires/signals to generate the result 'p'.

I know it's nothing compared to far more complex projects that people do on FPGAs, but it blows my mind to think that the fitter is trying to route 324 lanes of data around the die just for this one single function.

Quote from: BrianHG on 21 Dec, 2020 04:13
Test V9 attached spaghetti code.
Snapshots not necessary unless there are errors...
(I also found out I'm doing 2 sub_functions uselessly identically twice, the correction will be done next.)

Yes, sub_function 4 and 5 can be merged into one with the line below, right?

Code: [Select]
p <= p + ( ry2 << 1 ) ;
OR, would it be better to do this instead and remove step 6 entirely by merging it into step 5? EDIT: Have just realised - the alu takes 2 clocks for its result to be valid, so this may not be a valid solution.

Code: [Select]
if (sub_function == 4) p <= p + ry2 ;
if (sub_function == 5) p <= p + ry2 - alu_mult_y ;
That would remove one step from sub_function but also simplify the logic for the entire system, as this:

p <= p + ( ( ( sub_function > 3 ) && ( sub_function != 6 ) ) * ry2 )
- ( ( sub_function == 6 ) * alu_mult_y )
     + ( ( ( sub_function == 7 ) && ( px <= py ) ) * ( px + ( ry2<<1) ) )
     - ( ( ( sub_function == 7 ) && ( px <= py ) && ( p > 0) ) * ( py - ( rx2<<1) ) )

...could become this:

p <= p + ( ( sub_function > 3 ) * ry2 ) <<-- simplifies this line by one dependency
- ( ( sub_function == 5 ) * alu_mult_y )
     + ( ( ( sub_function == 6 ) && ( px <= py ) ) * ( px + ( ry2<<1) ) )
     - ( ( ( sub_function == 6 ) && ( px <= py ) && ( p > 0) ) * ( py - ( rx2<<1) ) )

#2269 Reply
Posted by nockieboy on 21 Dec, 2020 15:48
Just as an aside, I've got the attached project building now, but it appears to be using some sort of megafunction to set up the TMDS outputs and I cannot assign pins in the Pin Planner to get the project to work on my EasyFPGA board with the DVI Tester. The pins are already assigned and read only when I go to the Pin Planner.

I'm thinking I should just remove the OBUFDS elements entirely from hdmi.sv (lines 319-334) and just connect tmds_current[] to the output pins whilst inverting the _n pin signal, allowing me to assign pins in the Pin Planner as usual. Hopefully. Seems OBUFDS uses the altera_gpio_lite megafunction, but I can't find it in the IP Catalog so I can't seem to find where it's assigning the IO pins to the TMDS signals.

I'm probably going to get a lot of flak for not just doing it from scratch, but I like the promise of being able to output an HDMI signal and include audio in the bit stream - that's a major benefit over straight DVI and is why I'm plugging away at this example project.

HDMI_test.zip

#2270 Reply
Posted by BrianHG on 22 Dec, 2020 10:28
Quote from: nockieboy on 21 Dec, 2020 11:58
OR, would it be better to do this instead and remove step 6 entirely by merging it into step 5? EDIT: Have just realised - the alu takes 2 clocks for its result to be valid, so this may not be a valid solution.

Code: [Select]
if (sub_function == 4) p <= p + ry2 ;
if (sub_function == 5) p <= p + ry2 - alu_mult_y ;
That would remove one step from sub_function but also simplify the logic for the entire system, as this:

p <= p + ( ( ( sub_function > 3 ) && ( sub_function != 6 ) ) * ry2 )
- ( ( sub_function == 6 ) * alu_mult_y )
     + ( ( ( sub_function == 7 ) && ( px <= py ) ) * ( px + ( ry2<<1) ) )
     - ( ( ( sub_function == 7 ) && ( px <= py ) && ( p > 0) ) * ( py - ( rx2<<1) ) )

...could become this:

p <= p + ( ( sub_function > 3 ) * ry2 ) <<-- simplifies this line by one dependency
- ( ( sub_function == 5 ) * alu_mult_y )
     + ( ( ( sub_function == 6 ) && ( px <= py ) ) * ( px + ( ry2<<1) ) )
     - ( ( ( sub_function == 6 ) && ( px <= py ) && ( p > 0) ) * ( py - ( rx2<<1) ) )

This wouldn't do any optimization, since '( sub_function > 3 ) * ry2' cannot be allowed when sub_function is >6. Also, what happens when sub_function is 6 and ( px <= py ) is not valid? The ry2 should not be added in that case. Also, remember that the compiler simplifies at the boolean gate level, i.e. and, nand, or, nor, xor, as it constructs addition and subtraction functions using the digital gates.
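The objection can be made concrete with a small model. This is only a sketch: `intended` is an assumed reading of the proposed renumbered rules (steps 4 and 5 for setup, step 6 for the arc iteration, hold otherwise), `proposed_flat` is the merged expression with each condition treated as a 1/0 multiplier, and the input values are arbitrary.

```python
def intended(sub_function, p, px, py, rx2, ry2, alu_mult_y):
    """Assumed per-step behaviour of the proposal; any other step holds p."""
    if sub_function == 4:
        return p + ry2
    if sub_function == 5:
        return p + ry2 - alu_mult_y
    if sub_function == 6 and px <= py:
        step = p + ry2 + (px + (ry2 << 1))
        return step if p <= 0 else step - (py - (rx2 << 1))
    return p


def proposed_flat(sub_function, p, px, py, rx2, ry2, alu_mult_y):
    """The merged expression, with ry2 gated only by (sub_function > 3)."""
    iterate = sub_function == 6 and px <= py
    return (p + (sub_function > 3) * ry2
            - (sub_function == 5) * alu_mult_y
            + iterate * (px + (ry2 << 1))
            - (iterate and p > 0) * (py - (rx2 << 1)))


finished = dict(p=5, px=10, py=3, rx2=4, ry2=9, alu_mult_y=7)   # px > py
print(intended(6, **finished), proposed_flat(6, **finished))    # 5 14
print(intended(7, **finished), proposed_flat(7, **finished))    # 5 14
```

In both printed cases the merged form keeps adding ry2 after 'p' should have stopped changing, so the looser gating is not equivalent to the rules it was meant to replace.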

I've attached a new version where I added 1 clock to the setup. I also discovered if you want a radius of 2047, because of this required setup function:

p = ( (Ry2 - (Rx2 * Ry) + (0.25 * Rx2)) ) + 0.5

if Ry = 2047 & Rx=2047, and Ry2=Ry^2, same for Rx2, then

p= (( 2047 * 2047 - (2047*2047*2047) + (0.25*2047*2047)) ) +0.5

The result is ' -8572120061 '. 8572120061 is a 33-bit number, plus, we want negative, so this means 'P' needs to be 34 bits to support a radius of 2047. (Note: I do realize that this is a circle which is 4095 pixels wide, but you might be doing 1080p @30Hz when moving to the new CV board with DDR3 RAM, and display screens can be much larger than the display resolution where you may be scrolling oversized backgrounds.)
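The arithmetic above can be reproduced exactly with rational numbers, so no floating-point rounding gets in the way. This is a sketch; `signed_bits` is a hypothetical helper that returns the two's-complement width needed for a value.

```python
from fractions import Fraction


def signed_bits(value):
    """Two's-complement width needed to hold an integer."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


rx = ry = 2047
rx2, ry2 = rx * rx, ry * ry

# p = (Ry2 - (Rx2 * Ry) + (0.25 * Rx2)) + 0.5, evaluated exactly
p_exact = ry2 - rx2 * ry + Fraction(rx2, 4) + Fraction(1, 2)
p_start = int(p_exact)                   # int() drops the fraction toward zero

print(p_start, signed_bits(p_start))     # -8572120061 34
```

The magnitude alone needs 33 bits, and the sign adds one more.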

Anyways, test the new attached code and only report errors. I will document the enhancements and reasoning tomorrow, but it now calls Altera's LPM_MULT megafunction instead of using the simple Verilog ' Y <= A * B '. The other enhancement better ensures reaching the FMAX at 34 bit even though the old one just managed to do it at 125MHz on the dot.

Geo_Writer_V9_34bit_Opt_V2_127Mhz.zip

#2271 Reply
Posted by nockieboy on 22 Dec, 2020 15:21
All looks good - haven't found any errors.

#2272 Reply
Posted by BrianHG on 22 Dec, 2020 17:14
Arrrrgggg, found a bug...

Looking at the ellipse generator code, in the setup, we have the following:

py = rx2 * ry * 2.

That's 2047*2047*2047*2 = 17154715646, a 34-bit number. But py is also a signed integer. This means we need 35 bits for the core (3 registers p,py,px), not 34.
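The same check can be turned into a small function of the maximum radius. This is a sketch that looks only at the starting value of py (2 * rx2 * ry with both radii at the maximum); the name `core_width_for` is hypothetical, and the function says nothing about the other registers.

```python
def signed_bits(value):
    """Two's-complement width needed to hold an integer."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def core_width_for(max_radius):
    """Signed width needed for py's starting value, 2 * rx2 * ry."""
    py_start = 2 * max_radius * max_radius * max_radius
    return signed_bits(py_start)


print(2 * 2047 ** 3, core_width_for(2047))   # 17154715646 35
print(2 * 1023 ** 3, core_width_for(1023))   # 2141198334 32
```

By this measure a 32-bit signed core would top out at a radius of 1023, which fits with the decision to widen the core for a radius of 2047.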

Ok, here is the final code. I really trimmed the fat and implicitly set the depths of all the registers and multiplier to exactly what's needed. I used register px to build the timing-problematic register p, since px only needs to begin with 0 when beginning to draw the ellipse. Since during the arc iterations px is added to itself and ry2*2, I made ry2 compute the rounded rx2/4 and added it to px. Then again with the authentic ry2, then subtracted that nasty huge (rx2*ry) from the multiplier's Y output, then made p<=px, finishing the setup of p. Then if you look at the possible formula for p during the arc iterations, you will see we cut out 64 signals from the original 1st formula. This is why, even with an increase of another bit, we can still achieve such a good FMAX.

Our embedded 9-bit element multiplier count went from 6 down to 4 units and our FMAX is now 133MHz. Next step, integrate into the full geo_writer, then update your GPU for testing. I'll do that tonight. This means if we are lucky, I can take a look at your HDMI core in another day.

The comments in the source code have been corrected and now the setup for generating the arc has dropped back down to 7 steps. Please test.

Geo_Writer_V9_Ellipse_35bit_final_V3_133MHz.zip