Author Topic: FPGA VGA Controller for 8-bit computer (Read 425438 times)

nockieboy · « **Reply #3650 on:** December 22, 2022, 11:36:01 am »

I've started another thread to discuss the design of a PCB for a Xilinx FPGA for this project here.

I'm starting to reach the limits of the 10M50 with LEs and want to move up to a larger FPGA. With issues finding a reasonable Intel upgrade path from the 10M50 on Mouser, I figure it's a good time to switch to Xilinx perhaps - we can test out the GPU HDL on a different platform as well and perhaps fine-tune the DDR3 controller there as well. Plus I have the itch to design a new PCB!

nockieboy · « **Reply #3651 on:** December 22, 2022, 11:28:37 pm »

Thought I'd reply to this post in this thread as it's more relevant to the GPU discussion.

Quote from: BrianHG on December 22, 2022, 11:16:36 pm

Note that the 48$-39$ 85KLE fpgas from Lattice are finally available once again.
Though, you need to convert my DDR3 controller as Lattice's DDR3 costs money.

Note that in your current design, Parallel layers in my video controller eats a few K-LE per layer while the Sequential layers are near free in LE usage. Obvious using the 2 layer types together is what gives you the 16-64 layers you may be using.

The way your current pixel writer is designed is slow and excessive in gate count because it was engineered to run on the old FPGA dual-port ram and support backward compatibility. This is how things end up when you begin with engineering with a design for a 6KLE FPGA, then just recklessly add layers ontop.

As you know, starting with a 6,000 LE FPGA was a conscious decision as I'd never used FPGAs before and wanted to start small whilst I was learning about them and developing my very basic soldering skills and confidence. The pixel writer was designed to access screen memory in the FPGA's dual-port RAM because that was all we had available to us. Since we've had DECA boards to play with, and I will be including DDR3 RAM on the next custom board I build, the screen RAM has logically moved to DDR3.

So the next question is, what would be involved in fine-tuning the pixel writer to work with DDR3 via your controller more efficiently? Knowing what is required of the memory interface could be useful in guiding my DDR3 component selection and design for the next board, although I guess just 'faster is better' generally?

BrianHG · « **Reply #3652 on:** December 23, 2022, 01:24:43 am »

Ohhh boy, Ohhh boy...

Ok, currently, you pixel writer and geometry unit runs at 100MHz. Let's call this a limit of the lower end FPGAs and they coding style architecture for now. Right now, because we support less the 8bpp, every time we want to write a pixel, we first must read a byte (or in our case we read 32bits), edit the bits we want to draw on, then write the newly edited byte.

Cheap optimization option 1, pipeline the read pixel through one of my DDR3 read ports, stuffing the write pixel command inside my 'read vector' feature, taking that vector and data output generate a new write data in a new DDR3 write channel. This piping of the write-pixel command means our writ pixel module isn't waiting for a read to come back to edit, then write data out once ready. This would probably increase pixel write speed by 2-3x and still maintain 100% backwards compatibility and still support the existing write pixel collision counter.

Cheap optimization option #2, get rid of 1bpp, 2bpp, 4bpp support all together for writing pixels. This means we loose the ability to paint pictures on anything less than 8bpp, 16bpp, or 32bpp screens. This should increase pixel write speed by 3-5x, but, we will hit a hard limit of 100 million pixels a second but generally achieve only around 75-50 million pixels a second. Since we no longer pre-read the memory address where we are painting pixels, we no longer have a write pixel collision counter.

Proper optimization method #1. Our DDR3 right now runs at full speed only at 128bits wide in the 100MHz clock domain. In each 128bits, we have 16 8ppp pixels, 8 16bpp pixels, or 4 32bpp pixels. This means to get the highest pixel writing speed in 8bpp mode, we need to fill 16 sequential pixels each 100MHz clock. This means our geometry unit needs to change how it works. The best way to describe this is when we request like a triangle, we first need to generate a rectangle box address area which the triangle fits inside containing the width rows padded to 128 bits and the columns padded to 128 bits. Next, for every 100MHz clock, we go through each 128bit chunk, and with 16 parallel running pixel shaders, we decide which pixels get filled or not. Compared to our old drawing routine, we get the ~4x of optimization technique #2 multiplied by 16 pixels, meaning we get a 64x speed increase over your current pixel writer. We are beginning to enter the realm of a simple 3D accelerator and with an added texture reader prior to this writer, with proper design, and maybe a second DDR3 chip for a 256bit wide bus & 128x speed, we will pass the first Sony Playstation in rendering capability.

Now, as you can guess, though we might use the same geometry coordinate to initiate the drawings for this optimization, the geometry unit core guts will look completely different for this proper optimization method #1 & #2. It is more akin to us having a bounding box to draw inside and at each 8 or 16 pixel chunk, our 8 or 16 pixel shaders all in parallel will be answering the same question: At this point on the screen, does my pixel fit on or inside the line's / triangle's / rectangle's coordinates? (Yes/No) This will produce 8-16 parallel sequential pixels in one shot, every 100MHz clock. (Note that if we are paint 1 pixel thin vertical line, yes 7 of the 8 pixel writers will say no, but the DDR3 cannot draw vertical line any faster anyways. And if we want antialiasing, then the question is how much so and if it isn't 0% or 100%, then we need to do a pixel read and do a pixel blend.

There are others ways to handle the vertical 1 pixel wide line issue by rendering multiple objects by cached screen square blocks, but, a lot of what is needed to process that is still technically needs the engineering step of designing multiple parallel pixel shaders.

BrianHG · « **Reply #3653 on:** December 23, 2022, 02:42:00 am »

Quote from: nockieboy on December 22, 2022, 11:36:01 am

I'm starting to reach the limits of the 10M50 with LEs and want to move up to a larger FPGA.

What kind of limits?
Can you show us the utilization by entity in your compiler report...
Which main entities are utilizing most of the LEs?
How much more do you need and for what?

nockieboy · « **Reply #3654 on:** December 23, 2022, 03:43:31 pm »

Quote from: BrianHG on December 23, 2022, 01:24:43 am

Ohhh boy, Ohhh boy...

Don't get me wrong, I didn't think it'd be a walk in the park.

Quote from: BrianHG on December 23, 2022, 01:24:43 am

Cheap optimization option 1, pipeline the read pixel through one of my DDR3 read ports, stuffing the write pixel command inside my 'read vector' feature, taking that vector and data output generate a new write data in a new DDR3 write channel. This piping of the write-pixel command means our writ pixel module isn't waiting for a read to come back to edit, then write data out once ready. This would probably increase pixel write speed by 2-3x and still maintain 100% backwards compatibility and still support the existing write pixel collision counter.

I'm going to need to go away and read that a few times over to understand exactly what you mean.

Quote from: BrianHG on December 23, 2022, 01:24:43 am

Cheap optimization option #2, get rid of 1bpp, 2bpp, 4bpp support all together for writing pixels. This means we loose the ability to paint pictures on anything less than 8bpp, 16bpp, or 32bpp screens. This should increase pixel write speed by 3-5x, but, we will hit a hard limit of 100 million pixels a second but generally achieve only around 75-50 million pixels a second. Since we no longer pre-read the memory address where we are painting pixels, we no longer have a write pixel collision counter.

At the moment at least, I can't see the loss of <8bpp support being an issue. The loss of the pixel counter may be an issue later down the line, but so far I haven't needed it. I know originally I needed support for <8bpp because of the limited block RAM in the base Cyclone IV models. Since the move to DDR3 though, I haven't used <8bpp at all and can't see any reason why I'd want to other than to reduce workload on the host, but the host doesn't have to do much anyway with all the work you've done on the GPU.

Quote from: BrianHG on December 23, 2022, 01:24:43 am

Proper optimization method #1. Our DDR3 right now runs at full speed only at 128bits wide in the 100MHz clock domain. In each 128bits, we have 16 8ppp pixels, 8 16bpp pixels, or 4 32bpp pixels. This means to get the highest pixel writing speed in 8bpp mode, we need to fill 16 sequential pixels each 100MHz clock. This means our geometry unit needs to change how it works. The best way to describe this is when we request like a triangle, we first need to generate a rectangle box address area which the triangle fits inside containing the width rows padded to 128 bits and the columns padded to 128 bits. Next, for every 100MHz clock, we go through each 128bit chunk, and with 16 parallel running pixel shaders, we decide which pixels get filled or not. Compared to our old drawing routine, we get the ~4x of optimization technique #2 multiplied by 16 pixels, meaning we get a 64x speed increase over your current pixel writer.

Right, so we're 'quantising' the geometry unit's access to the DDR3 RAM into 128-bit chunks, and thus making it operate all of its drawing functions within boxes to make the memory access as efficient as possible? I wouldn't have a clue where to begin making changes to GEOFF for that. Each rendering function, be it line, triangle, quad, ellipse, will need to be aware of the bounding box it's working within I guess and when it hits a boundary, either drops to the next line down within the box and carries on (remembering to continue the shape in the next bounding box along later) or waits for the DDR3 to update before moving into the next bounding box?

Quote from: BrianHG on December 23, 2022, 01:24:43 am

We are beginning to enter the realm of a simple 3D accelerator and with an added texture reader prior to this writer, with proper design, and maybe a second DDR3 chip for a 256bit wide bus & 128x speed, we will pass the first Sony Playstation in rendering capability.

Okay, that's what I'm talking about when I mentioned guiding my component selection for the external memory.

Two DDR3 chips, check.

Quote from: BrianHG on December 23, 2022, 01:24:43 am

Now, as you can guess, though we might use the same geometry coordinate to initiate the drawings for this optimization, the geometry unit core guts will look completely different for this proper optimization method #1 & #2. It is more akin to us having a bounding box to draw inside and at each 8 or 16 pixel chunk, our 8 or 16 pixel shaders all in parallel will be answering the same question: At this point on the screen, does my pixel fit on or inside the line's / triangle's / rectangle's coordinates? (Yes/No) This will produce 8-16 parallel sequential pixels in one shot, every 100MHz clock. (Note that if we are paint 1 pixel thin vertical line, yes 7 of the 8 pixel writers will say no, but the DDR3 cannot draw vertical line any faster anyways. And if we want antialiasing, then the question is how much so and if it isn't 0% or 100%, then we need to do a pixel read and do a pixel blend.

There are others ways to handle the vertical 1 pixel wide line issue by rendering multiple objects by cached screen square blocks, but, a lot of what is needed to process that is still technically needs the engineering step of designing multiple parallel pixel shaders.

This is all very exciting after the recent lull while I've been muddling through the SD card interface and FPU. The question is, do you have the time/desire to help out any further?

nockieboy · « **Reply #3655 on:** December 23, 2022, 03:52:23 pm »

Quote from: BrianHG on December 23, 2022, 02:42:00 am

Quote from: nockieboy on December 22, 2022, 11:36:01 am
I'm starting to reach the limits of the 10M50 with LEs and want to move up to a larger FPGA.
What kind of limits?
Can you show us the utilization by entity in your compiler report...
Which main entities are utilizing most of the LEs?
How much more do you need and for what?

As you can see in the image above, this is a build of the latest HDL with 4 PDI and 4 SDI layers, signal tap instance for HDL debugging, 2 PSGs, FPU and SD Interface. I'm looking at the total utilisation of logic elements - sitting at 96% currently. Obviously the signal tap instance some resources, but it's a very basic instance with only 16 or so waveforms being monitored. I can obviously reduce the layers from 16 down to 4 with no impact on any software projects I'm working on currently, but what I'm thinking about is the future. What if I want to add-in a softcore processor, or we decide to ramp-up the GPU's capabilities?

I'm also starting to think about producing a custom board again for this GPU and wouldn't want to hamstring future development by limiting myself to a 50k LE FPGA which is already at 96% usage. I know you'd be the first to tell me I should have thought bigger at the design stage.

BrianHG · « **Reply #3656 on:** December 24, 2022, 12:01:59 am »

Quote from: nockieboy on December 23, 2022, 03:52:23 pm

Quote from: BrianHG on December 23, 2022, 02:42:00 am
Quote from: nockieboy on December 22, 2022, 11:36:01 am
I'm starting to reach the limits of the 10M50 with LEs and want to move up to a larger FPGA.
What kind of limits?
Can you show us the utilization by entity in your compiler report...
Which main entities are utilizing most of the LEs?
How much more do you need and for what?

As you can see in the image above, this is a build of the latest HDL with 4 PDI and 4 SDI layers, signal tap instance for HDL debugging, 2 PSGs, FPU and SD Interface. I'm looking at the total utilisation of logic elements - sitting at 96% currently. Obviously the signal tap instance some resources, but it's a very basic instance with only 16 or so waveforms being monitored. I can obviously reduce the layers from 16 down to 4 with no impact on any software projects I'm working on currently, but what I'm thinking about is the future. What if I want to add-in a softcore processor, or we decide to ramp-up the GPU's capabilities?

I'm also starting to think about producing a custom board again for this GPU and wouldn't want to hamstring future development by limiting myself to a 50k LE FPGA which is already at 96% usage. I know you'd be the first to tell me I should have thought bigger at the design stage.

Well, as you can see, the largest chunk used is the VGA system at 18.3kle. Either go to a larger FPGA, or, go down to 1 PDI layer and cut that by ~66%, then use those extra gates for the 64x speed geo unit and run software sprite layers. Removing tile text support and going down to 1 or 2 SDI layers only you will save over 15kle. However, there is quite a bit to discuss about what to do and still, a larger FPGA can still offer both systems simultaneously.

The multiport DDR3 (12.5kle) is large because of all those simultaneous RW ports, but everything else is chump change. RS232 debug / signaltap / bridget = ~0.7kle each, stereo sound = 2kle, ALU = 4.6kle (though we can do a lot more with this with regards to 3D, the FP divide is the largest single consumer here), GEO = 5.1kle.

As for the SD card, well I didn't program it but 2kle seems a lot. Maybe a bunch of regs are stored as regs instead of being stored in a memory block.

nockieboy · « **Reply #3657 on:** December 24, 2022, 12:24:31 pm »

Quote from: BrianHG on December 24, 2022, 12:01:59 am

Well, as you can see, the largest chunk used is the VGA system at 18.3kle. Either go to a larger FPGA, or, go down to 1 PDI layer and cut that by ~66%, then use those extra gates for the 64x speed geo unit and run software sprite layers. Removing tile text support and going down to 1 or 2 SDI layers only you will save over 15kle. However, there is quite a bit to discuss about what to do and still, a larger FPGA can still offer both systems simultaneously.

As you know, I'm going to be making a development board to replace the DECA with a Xilinx XC7A100T, so once that's done we're going to be swimming in spare LEs anyway.

Quote from: BrianHG on December 24, 2022, 12:01:59 am

As for the SD card, well I didn't program it but 2kle seems a lot. Maybe a bunch of regs are stored as regs instead of being stored in a memory block.

Yes, it does seem a lot. I'll add a to-do list entry for a closer examination of that module to see what's going on with the regs. I'm pretty sure you're right though, from what I remember it was using registers for the regs instead of memory.

BrianHG · « **Reply #3658 on:** January 29, 2023, 03:03:32 pm »

Hi Nockieboy,

Please test my new modified code for my VGA processor. If it works, check with asmi to see if all the way at the bottom of the code, my `ifdef / `else - home made dual port dual clock ram will be accepted by Xilinx. Specifically line 1359 where I have a byte enable for the write data. Quartus gets stuck in a loop here when compiling and generates an over 1 million logic gate design, yet the original 'altsyncram' function I have further above tied to the ' `ifdef ALTERA_RESERVED_QIS ' should compile fine.

tchiwam · « **Reply #3659 on:** February 03, 2023, 12:05:42 am »

I was thinking of a 6845 6847 drop in replacement that fits in a 40pin dip600 with a HDMI/DP/Dlinkk output

nockieboy · « **Reply #3660 on:** February 26, 2023, 02:44:34 pm »

Quote from: BrianHG on January 29, 2023, 03:03:32 pm

Hi Nockieboy,

Please test my new modified code for my VGA processor. If it works, check with asmi to see if all the way at the bottom of the code, my `ifdef / `else - home made dual port dual clock ram will be accepted by Xilinx. Specifically line 1359 where I have a byte enable for the write data. Quartus gets stuck in a loop here when compiling and generates an over 1 million logic gate design, yet the original 'altsyncram' function I have further above tied to the ' `ifdef ALTERA_RESERVED_QIS ' should compile fine.

Hi BrianHG,

Sorry for the delay replying, it's been hectic. I've compiled the project with your new VGA processor HDL and it worked fine - no issues.

I'll check with asmi regarding your conditional code at the bottom of the file and get back to you.

asmi · « **Reply #3661 on:** February 26, 2023, 09:02:20 pm »

Quote from: BrianHG on January 29, 2023, 03:03:32 pm

Hi Nockieboy,

Please test my new modified code for my VGA processor. If it works, check with asmi to see if all the way at the bottom of the code, my `ifdef / `else - home made dual port dual clock ram will be accepted by Xilinx. Specifically line 1359 where I have a byte enable for the write data. Quartus gets stuck in a loop here when compiling and generates an over 1 million logic gate design, yet the original 'altsyncram' function I have further above tied to the ' `ifdef ALTERA_RESERVED_QIS ' should compile fine.

Xilinx has it's own primitives for true dual port RAM, you can either generate a primitive via IP wizard, you can use one of Xilinx-provided HDL primitives: XPM_MEMORY_TDPRAM or BRAM_TDP_MACRO First one is more generic, and can use various underlined structures to implement it, the second one explicitly instantiates a BRAM.

BrianHG · « **Reply #3662 on:** February 26, 2023, 09:24:32 pm »

Quote from: asmi on February 26, 2023, 09:02:20 pm

Quote from: BrianHG on January 29, 2023, 03:03:32 pm
Hi Nockieboy,

Please test my new modified code for my VGA processor. If it works, check with asmi to see if all the way at the bottom of the code, my `ifdef / `else - home made dual port dual clock ram will be accepted by Xilinx. Specifically line 1359 where I have a byte enable for the write data. Quartus gets stuck in a loop here when compiling and generates an over 1 million logic gate design, yet the original 'altsyncram' function I have further above tied to the ' `ifdef ALTERA_RESERVED_QIS ' should compile fine.
Xilinx has it's own primitives for true dual port RAM, you can either generate a primitive via IP wizard, you can use one of Xilinx-provided HDL primitives: XPM_MEMORY_TDPRAM or BRAM_TDP_MACRO First one is more generic, and can use various underlined structures to implement it, the second one explicitly instantiates a BRAM.

The question was is it even necessary. Take a look at my code. I created a generic verilog array with byte write enable in place of 'not defined ALTERA_RESERVED_QIS '. The question is will Xilinx eat the code as is?

asmi · « **Reply #3663 on:** February 26, 2023, 10:15:45 pm »

Quote from: BrianHG on February 26, 2023, 09:24:32 pm

The question was is it even necessary. Take a look at my code. I created a generic verilog array with byte write enable in place of 'not defined ALTERA_RESERVED_QIS '. The question is will Xilinx eat the code as is?

Can you post a code snippet here? I was under impression that it's not possible to model a true dual port RAM in pure Verilog because it's not possible to write to the same registers from multiple processes.

BrianHG · « **Reply #3664 on:** February 26, 2023, 10:35:02 pm »

Quote from: asmi on February 26, 2023, 10:15:45 pm

Quote from: BrianHG on February 26, 2023, 09:24:32 pm
The question was is it even necessary. Take a look at my code. I created a generic verilog array with byte write enable in place of 'not defined ALTERA_RESERVED_QIS '. The question is will Xilinx eat the code as is?
Can you post a code snippet here? I was under impression that it's not possible to model a true dual port RAM in pure Verilog because it's not possible to write to the same registers from multiple processes.

Already done...
https://www.eevblog.com/forum/fpga/fpga-vga-controller-for-8-bit-computer/msg4668115/#msg4668115
It's the last module 'BHG_VIDEO_DPDC_BRAM', starting at line 1319, right at the bottom.

asmi · « **Reply #3665 on:** February 26, 2023, 11:01:13 pm »

Quote from: BrianHG on February 26, 2023, 10:35:02 pm

Already done...
https://www.eevblog.com/forum/fpga/fpga-vga-controller-for-8-bit-computer/msg4668115/#msg4668115
It's the last module 'BHG_VIDEO_DPDC_BRAM', starting at line 1319, right at the bottom.

That's not a true dual port RAM, as it only writes into one port and reads off another one. Xilinx calls that Simple Dual Port RAM, and there is a module for that too: XPM_MEMORY_SDPRAM, but your code might just work as-is. Leave it like that for now, once nockieboy completes the port, he can verify if it works or not.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: FPGA VGA Controller for 8-bit computer (Read 425438 times)

nockieboy

Re: FPGA VGA Controller for 8-bit computer

nockieboy

Re: FPGA VGA Controller for 8-bit computer

BrianHG

Re: FPGA VGA Controller for 8-bit computer

BrianHG

Re: FPGA VGA Controller for 8-bit computer

nockieboy

Re: FPGA VGA Controller for 8-bit computer

nockieboy

Re: FPGA VGA Controller for 8-bit computer

BrianHG

Re: FPGA VGA Controller for 8-bit computer

nockieboy

Re: FPGA VGA Controller for 8-bit computer

BrianHG

Re: FPGA VGA Controller for 8-bit computer

tchiwam

Re: FPGA VGA Controller for 8-bit computer

nockieboy

Re: FPGA VGA Controller for 8-bit computer

asmi

Re: FPGA VGA Controller for 8-bit computer

BrianHG

Re: FPGA VGA Controller for 8-bit computer

asmi

Re: FPGA VGA Controller for 8-bit computer

BrianHG

Re: FPGA VGA Controller for 8-bit computer

asmi

Re: FPGA VGA Controller for 8-bit computer

Share me