Author Topic: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.  (Read 36817 times)


Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #150 on: January 20, 2023, 03:12:40 pm »
Hmm - I can't.  The highest value allowed is 3,300 ps.  If I switch to DDR2 SDRAM in the memory selection, it will allow a value of 5,000 ps, but I thought we were looking at DDR3 SODIMMs?
You mixed up two screens - the one which sets the memory frequency and selects a memory device (title is "Options for Controller 0 - DDR3 SDRAM"), with the one which sets the input clock frequency (title is "Memory Options C0 - DDR3 SDRAM"). On the former you set 2500 ps (400 MHz), on the latter - 5000 ps (200 MHz). See attached screenshots.
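For anyone following along, the two settings are just two different clocks expressed as periods. A quick sanity check of the period-to-frequency conversion (the helper function is purely illustrative, not part of MIG):

```python
def period_ps_to_mhz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in MHz."""
    return 1_000_000 / period_ps  # 1e6 ps per microsecond

# Memory clock: 2500 ps -> 400 MHz; input clock: 5000 ps -> 200 MHz
print(period_ps_to_mhz(2500))  # 400.0
print(period_ps_to_mhz(5000))  # 200.0
```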

Also, I just realized that there are Micron's MT16KTF1G64HZ-1G6E1 8GB SODIMMs on Amazon for like $20. Tempted to order one. Or two ::)
 
The following users thanked this post: nockieboy

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #151 on: January 20, 2023, 11:04:25 pm »
That seems to have worked.  Vivado completed its synthesis run on the SODIMM module, with these settings:

Code: [Select]
Vivado Project Options:
   Target Device                   : xc7a100t-fgg484
   Speed Grade                     : -2
   HDL                             : verilog
   Synthesis Tool                  : VIVADO

If any of the above options are incorrect,   please click on "Cancel", change the CORE Generator Project Options, and restart MIG.

MIG Output Options:
   Module Name                     : mig_7series_SODIMM
   No of Controllers               : 1
   Selected Compatible Device(s)   : xc7a35t-fgg484, xc7a50t-fgg484, xc7a75t-fgg484, xc7a15t-fgg484

FPGA Options:
   System Clock Type               : Differential
   Reference Clock Type            : Use System Clock
   Debug Port                      : OFF
   Internal Vref                   : disabled
   IO Power Reduction              : ON
   XADC instantiation in MIG       : Enabled

Extended FPGA Options:
   DCI for DQ,DQS/DQS#,DM          : enabled
   Internal Termination (HR Banks) : 50 Ohms

/*******************************************************/
/*                  Controller 0                       */
/*******************************************************/

Controller Options :
   Memory                        : DDR3_SDRAM
   Interface                     : NATIVE
   Design Clock Frequency        : 2500 ps (400.00 MHz)
   Phy to Controller Clock Ratio : 4:1
   Input Clock Period            : 4999 ps
   CLKFBOUT_MULT (PLL)           : 4
   DIVCLK_DIVIDE (PLL)           : 1
   VCC_AUX IO                    : 1.8V
   Memory Type                   : SODIMMs
   Memory Part                   : MT16KTF1G64HZ-1G6
   Equivalent Part(s)            : --
   Data Width                    : 64
   ECC                           : Disabled
   Data Mask                     : enabled
   ORDERING                      : Strict

AXI Parameters :
   Data Width                    : 512
   Arbitration Scheme            : RD_PRI_REG
   Narrow Burst Support          : 0
   ID Width                      : 4

Memory Options:
   Burst Length (MR0[1:0])          : 8 - Fixed
   Read Burst Type (MR0[3])         : Sequential
   CAS Latency (MR0[6:4])           : 6
   Output Drive Strength (MR1[5,1]) : RZQ/7
   Rtt_NOM - ODT (MR1[9,6,2])       : RZQ/4
   Rtt_WR - Dynamic ODT (MR2[10:9]) : Dynamic ODT off
   Memory Address Mapping           : BANK_ROW_COLUMN

Bank Selections:
Bank: 14
Byte Group T1: DQ[40-47]
Byte Group T2: DQ[48-55]
Byte Group T3: DQ[56-63]
Bank: 15
Byte Group T0: Address/Ctrl-0
Byte Group T1: Address/Ctrl-1
Byte Group T2: Address/Ctrl-2
Byte Group T3: DQ[32-39]
Bank: 16
Byte Group T0: DQ[0-7]
Byte Group T1: DQ[8-15]
Byte Group T2: DQ[16-23]
Byte Group T3: DQ[24-31]

System_Clock:
SignalName: sys_clk_p/n
PadLocation: K18/K19(CC_P/N)  Bank: 15

System_Control:
SignalName: sys_rst
PadLocation: No connect  Bank: Select Bank
SignalName: init_calib_complete
PadLocation: No connect  Bank: Select Bank
SignalName: tg_compare_error
PadLocation: No connect  Bank: Select Bank

Hopefully that's all good.

Time to start working on learning more Vivado and also how to start porting BrianHG's multi-port adaptor across - or is there anything else I should be thinking about before that?
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #152 on: January 21, 2023, 07:15:40 am »
Indeed.  Well, it's all this talk from BrianHG about what we could do with a 16-bit (or better) CPU at the helm.  I've had a very enjoyable stroll down memory lane designing, building and even teaching myself assembly for my Z80 DIY computer, but it seems this GPU project has taken on a life of its own, with an awful lot of potential.  My biggest concern at the moment is that its capabilities are already outstripping my programming skills - and certainly my free time - to do it justice.  So switching to a more generic 'development board' PCB and eliminating the need for specialist hardware (my uCOM Z80 host) means anyone could create a board, download the project and have a working games machine of some description.
Well with 8GB SODIMM you will need to design a 64bit CPU so that you can address that much RAM and still have some address space for memory mapped I/O - unless you want to go the way Intel went with PAE and similar hacks to get around 4GB address space limit on a 32 bit system.
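The arithmetic behind that claim is easy to check: byte-addressing 8GB needs 33 address bits, one more than a flat 32-bit space provides, before any address range is carved out for memory-mapped I/O.

```python
# 8 GB of byte-addressable RAM needs 33 address bits; a 32-bit
# address space tops out at 4 GB, hence hacks like Intel's PAE.
GiB = 2**30
sodimm_bytes = 8 * GiB

addr_bits = sodimm_bytes.bit_length() - 1  # exact log2 for a power of two
print(addr_bits)             # 33
print(2**32 < sodimm_bytes)  # True: exceeds the 4 GB limit
```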
Another advantage of using some sort of softcore is that if you pick some well-known architecture (like RISC-V), you can take advantage of existing gcc toolchain and write code in C as opposed to assembly, which will speed up development tremendously. Also this will allow others to join in because they can actually replicate a hardware setup on their own - not everyone wants to mess with all that ancient stuff like Z80. I for one prefer more-or-less modern tech (maybe because I was too young when the likes of Z80 reigned supreme), and since I happen to be a professional software developer (with hardware being a side hustle of mine which began as a hobby and passion from my university days), using C/C++ will make me that much more likely to be willing to invest what little spare time I have into contributing to the project.

Hopefully that's all good.
Looks good to me.

Time to start working on learning more Vivado and also how to start porting BrianHG's multi-port adaptor across - or is there anything else I should be thinking about before that?
I would create a simple testbench just to practice talking to controller first. I can throw together a quick one for you to help you get started in a day or two if you don't mind waiting a bit. Once you feel confident enough with the interface, I would implement a simple simulation-only BFM (Bus Functional Model) of controller to speed up simulation. That's how I typically do sims - replace modules not relevant to component I'm focusing on with their simplified models, and only do a full-up high fidelity simulation as the final step before moving onto the real hardware.
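The BFM idea - swap the real controller for a fast behavioural stand-in during simulation - can be sketched like this (in Python rather than HDL, purely to show the concept; all names here are made up):

```python
class MemoryBFM:
    """Minimal bus-functional model of a memory controller: it honours
    reads and writes instantly instead of simulating DDR3 timing, so
    testbenches built on top of it run orders of magnitude faster."""
    def __init__(self):
        self.mem = {}  # sparse backing store: address -> data word

    def write(self, addr, data):
        self.mem[addr] = data

    def read(self, addr):
        return self.mem.get(addr, 0)  # uninitialised memory reads as 0

bfm = MemoryBFM()
bfm.write(0x20000000, 0xDEADBEEF)   # e.g. a location in the second rank
assert bfm.read(0x20000000) == 0xDEADBEEF
```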

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #153 on: January 21, 2023, 09:09:20 am »
Well with 8GB SODIMM you will need to design a 64bit CPU so that you can address that much RAM and still have some address space for memory mapped I/O - unless you want to go the way Intel went with PAE and similar hacks to get around 4GB address space limit on a 32 bit system.

For an 8-bit processor like the Z80, it's phenomenal cosmic power...



... and itty-bitty 64KB memory space is easily expanded to 4MB in my uCOM with a very simple MMU.
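A banking MMU of that sort is simple enough to sketch. The bank size and register layout below are assumptions for illustration, not the actual uCOM design: the 64KB logical space is split into four 16KB windows, each mapped by an 8-bit bank register into a 4MB (256-bank) physical space.

```python
# Hypothetical Z80-style banking MMU: 4 x 16 KB windows into 4 MB.
bank_regs = [0, 1, 2, 3]  # one 8-bit bank number per 16 KB window

def translate(logical_addr: int) -> int:
    """Map a 16-bit logical address to a 22-bit physical address."""
    window = logical_addr >> 14      # which 16 KB window (0..3)
    offset = logical_addr & 0x3FFF   # offset within the window
    return (bank_regs[window] << 14) | offset

bank_regs[3] = 0xFF                  # map the top window to the last bank
print(hex(translate(0xC000)))        # 0x3fc000: top of the 4 MB space
```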

I presume you mean Physical Address Extension, and not Prostate Artery Embolisation, which was my first hit on Google when I went to find out what it was? :-DD  That sounds like an MMU to me.

Another advantage of using some sort of softcore is that if you pick some well-known architecture (like RISC-V), you can take advantage of existing gcc toolchain and write code in C as opposed to assembly, which will speed up development tremendously. Also this will allow others to join in because they can actually replicate a hardware setup on their own - not everyone wants to mess with all that ancient stuff like Z80.

Exactly, and that's another very strong reason for me to consider moving on.  Plus, I've realised I just don't have the time to design a 16-bit computer and write the bootstrap (or should I say Kickstart?  ;) ) ROM software for it.

And yes, 'ancient stuff like Z80', well that was for me and me alone.  When I started the project, I wanted to go back and really understand something that, as a kid, was a bit of a black-box for me and I literally just wanted to see if I could make a working Z80 computer on breadboard.  Now I'm feeling the real learning (and potential) is in the FPGA itself.  Designing the board for this device is going to take me waaaaay further than anything I'd have made for a 16-bit system.

And opening the project up so that any other interested parties can get involved with a minimum of hurdles to jump is a bonus.  It's also been on my mind for the last few days regarding the design.  What am I actually looking to make, here?

Originally, I wanted a core board that I can plug into a carrier that will act as an interface to a host system (i.e. my uCOM, or anything else) and be sufficiently generic to allow any variety of carrier boards to be used with it.  I'd design a carrier board to fit my uCOM, with HDMI and USB ports, port the GPU HDL and I'd be more than happy.  However, now there's a major conceptual change brewing in the design.  I have, after all, met all my stated objectives with the uCOM by plugging a DECA card into the stack and using the GPU HDL as it stands.  There's nowhere really to go with that project, other than designing a dedicated card for it (which is what the core/carrier combo of this new project would be).  But it's not for general release and so what's the point?

However, if I'm going to dispense with the hardware host and move that into the FPGA itself, that releases a LOT of constraints.  For one, it frees up around 60 I/O on the FPGA, as I won't need full Address, Data and Control bus access via the carrier card.  The majority will be internal to the core card.

It also raises the question of the core/carrier combination entirely.  Is it needed?  The flexibility of having a carrier was so that I could have a very niche, one-off card for my uCOM and everyone else could do what they wanted.  If I don't need that niche one-off carrier anymore, why not just design a single board large enough to accommodate the SODIMM and optional FPGA cooling comfortably?  The remainder of the peripherals for my implementation of an 8-, 16- or more-bit computer will be sufficiently generic that they'll be useful to anyone else anyway (i.e. HDMI port, USB port, SD card).

I for one prefer more-or-less modern tech (maybe because I was too young when the likes of Z80 reigned supreme), and since I happen to be a professional software developer (with hardware being a side hustle of mine which began as a hobby and passion from my university days), using C/C++ will make me that much more likely to be willing to invest what little spare time I have into contributing to the project.

I'm certainly no professional programmer and what I do know is all self-taught.  I'm a hobbyist from start to finish, hence all the silly questions and silly mistakes. ;)

I would create a simple testbench just to practice talking to controller first. I can throw together a quick one for you to help you get started in a day or two if you don't mind waiting a bit. Once you feel confident enough with the interface, I would implement a simple simulation-only BFM (Bus Functional Model) of controller to speed up simulation. That's how I typically do sims - replace modules not relevant to component I'm focusing on with their simplified models, and only do a full-up high fidelity simulation as the final step before moving onto the real hardware.

Okay, that sounds great, thank you. :D  I'm going to think about this design in more detail.  SODIMM does seem to offer more benefits than downsides, plus some huge memory capacities at cheap prices.  With a SODIMM and an XC7A100T, you could just add a 32-bit soft core processor and probably install Linux on the thing... :o
« Last Edit: January 21, 2023, 09:11:18 am by nockieboy »
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7789
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #154 on: January 21, 2023, 09:53:23 am »
@nockieboy, it is OK to wire for 8gb as it will just add 1-2 more IOs.  However, some of the 8gb sticks are dual rank (16 ram chips on a module), IE same addresses as 4gb, but using a second CKE pin so you may interleave commands to both banks to hide the RAS-CAS row setup in the first 8 ram chips while the alternate 8 ram chips are in the middle of a transfer.  (Not for you to know, this is the job of the ram controller to handle behind your back, streamlines ram access if your data processing pipe is aware of the extended pipeline delay and is designed to make use of it.)  Single rank 8gb sticks with 8x ddr3 ram chips also exist.  Be careful with your choice of ram.  Modules with 4x ddr3 chips will have the least electrical load on the command and clock lines.
« Last Edit: January 21, 2023, 09:56:03 am by BrianHG »
 

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #155 on: January 21, 2023, 10:14:46 am »
@nockieboy, it is OK to wire for 8gb as it will just add 1-2 more IOs.  However, some of the 8gb sticks are dual rank (16 ram chips on a module), IE same addresses as 4gb, but using a second CKE pin so you may interleave commands to both banks to hide the RAS-CAS row setup in the first 8 ram chips while the alternate 8 ram chips are in the middle of a transfer.  (Not for you to know, this is the job of the ram controller to handle behind your back, streamlines ram access if your data processing pipe is aware of the extended pipeline delay and is designed to make use of it.)  Single rank 8gb sticks with 8x ddr3 ram chips also exist.  Be careful with your choice of ram.  Modules with 4x ddr3 chips will have the least electrical load on the command and clock lines.

That makes sense.  So more care must be taken to select suitable sticks if I'm looking for an 8GB one. :-+

Maybe asmi is better able to answer this, but would it be possible to read the SODIMM and adjust the MIG's memory controller to the specific SODIMM's memory chip timings?  I don't know how complex or flexible the HDL generated by the MIG is (being a relative HDL simpleton in the Vivado/Xilinx environment, you understand), but if it's trivial to add that flexibility and allow someone to use a wider variety of SODIMMs before they have to edit the HDL and compile a new bitstream, that's got to be a good thing?

Even if it can't adjust for single/dual-rank sticks, but can adjust between certain families of chips on those sticks, any added flexibility like that would be a bonus.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #156 on: January 21, 2023, 02:57:07 pm »
@nockieboy, it is OK to wire for 8gb as it will just add 1-2 more IOs.  However, some of the 8gb sticks are dual rank (16 ram chips on a module), IE same addresses as 4gb, but using a second CKE pin so you may interleave commands to both banks to hide the RAS-CAS row setup in the first 8 ram chips while the alternate 8 ram chips are in the middle of a transfer.  (Not for you to know, this is the job of the ram controller to handle behind your back, streamlines ram access if your data processing pipe is aware of the extended pipeline delay and is designed to make use of it.)  Single rank 8gb sticks with 8x ddr3 ram chips also exist.  Be careful with your choice of ram.  Modules with 4x ddr3 chips will have the least electrical load on the command and clock lines.
If you use any of the modules supported by MIG, you won't have to worry about any of this. To you it's all going to look like a single contiguous address space, with rank being the MSB of the address, then either bank-row-column or row-bank-column (depending on the option you choose while generating the core). You also get to select the number of bank machines in the controller, with 4 being the default, but you can change it to any value between 2 and 8. This number is basically the number of rows in different banks which can be open at the same time, so depending on your access pattern, having more or fewer bank machines allows you to optimize the controller. Same goes for transaction ordering - it's set to "Strict" by default, but you can set it to "Normal" to allow transaction reordering for higher efficiency.

That makes sense.  So more care must be taken to select suitable sticks if I'm looking for an 8GB one. :-+
The best bet would be to pick a module from the list in the MIG. And since MT16KTF1G64HZ is only about $20, I don't see any reason to even bother with smaller modules.

Maybe asmi is better able to answer this, but would it be possible to read the SODIMM and adjust the MIG's memory controller to the specific SODIMM's memory chip timings?  I don't know how complex or flexible the HDL generated by the MIG is (being a relative HDL simpleton in the Vivado/Xilinx environment, you understand), but if it's trivial to add that flexibility and allow someone to use a wider variety of SODIMMs before they have to edit the HDL and compile a new bitstream, that's got to be a good thing?

Even if it can't adjust for single/dual-rank sticks, but can adjust between certain families of chips on those sticks, any added flexibility like that would be a bonus.
No, it's absolutely NOT flexible by design, but since adapting it to a different module only involves making a few changes in the GUI (as opposed to messing with HDL) and perhaps dealing with the resulting trivial HDL changes (like the UI address width becoming smaller if you use a module of smaller capacity), I don't really see it as a big deal. We might want to leave open the possibility of reading the SPD so that the software can adapt itself to different amounts of RAM; that's going to involve another level shifter if we want to use a few remaining pins from the DDR3 banks for that.

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #157 on: January 21, 2023, 10:25:34 pm »
@nockieboy, it is OK to wire for 8gb as it will just add 1-2 more IOs.  However, some of the 8gb sticks are dual rank (16 ram chips on a module), IE same addresses as 4gb, but using a second CKE pin so you may interleave commands to both banks to hide the RAS-CAS row setup in the first 8 ram chips while the alternate 8 ram chips are in the middle of a transfer.  (Not for you to know, this is the job of the ram controller to handle behind your back, streamlines ram access if your data processing pipe is aware of the extended pipeline delay and is designed to make use of it.)  Single rank 8gb sticks with 8x ddr3 ram chips also exist.  Be careful with your choice of ram.  Modules with 4x ddr3 chips will have the least electrical load on the command and clock lines.
If you use any of the modules supported by MIG, you won't have to worry about any of this. To you it's all going to look like a single contiguous address space, with rank being the MSB of the address, then either bank-row-column or row-bank-column (depending on the option you choose while generating the core). You also get to select the number of bank machines in the controller, with 4 being the default, but you can change it to any value between 2 and 8. This number is basically the number of rows in different banks which can be open at the same time, so depending on your access pattern, having more or fewer bank machines allows you to optimize the controller. Same goes for transaction ordering - it's set to "Strict" by default, but you can set it to "Normal" to allow transaction reordering for higher efficiency.

@BrianHG - do you have thoughts/preferences on these options at all for working best with your multi-port module?  I've defaulted to bank-row-column; from what asmi is saying, it'll be rank-bank-row-column (don't try saying that when you're drunk!), although the rank part is dealt with by the MIG interface.

What about bank machines?  More the merrier, or is 4 the sweet spot?  I guess you'd want as many bank machines as you have ports (up to max 8 ), as each port could be using a different row?  Depends on resource usage I guess?

What about transaction ordering?  Any preference there for maximum compatibility/performance?

The best bet would be to pick a module from the list in the MIG. And since MT16KTF1G64HZ is only about $20, I don't see any reason to even bother with smaller modules.

Righto. :-+

No, it's absolutely NOT flexible by design, but since adapting it to a different module only involves making a few changes in the GUI (as opposed to messing with HDL) and perhaps dealing with the resulting trivial HDL changes (like the UI address width becoming smaller if you use a module of smaller capacity), I don't really see it as a big deal. We might want to leave open the possibility of reading the SPD so that the software can adapt itself to different amounts of RAM; that's going to involve another level shifter if we want to use a few remaining pins from the DDR3 banks for that.

Yes, true, I just thought it was worth the ask in case we could make it even slightly tolerant of SODIMM variation without having to generate a new bitstream.
« Last Edit: January 22, 2023, 08:19:33 am by nockieboy »
 

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #158 on: January 22, 2023, 07:14:15 pm »
Whilst I'm thinking about it, what should I be looking for when I search for a commonly-used fan to cool an FPGA like the XC7A FGG484?  I'm not having much luck finding FPGA coolers, or just PCB-mounting fans, in EasyEDA or Mouser, for that matter.  My Google-fu is failing me on this... ::)

I just want a PCB footprint really, so I have an idea where to include mounting holes if a fan/heatsink combo is required.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #159 on: January 22, 2023, 10:31:20 pm »
Whilst I'm thinking about it, what should I be looking for when I search for a commonly-used fan to cool an FPGA like the XC7A FGG484?  I'm not having much luck finding FPGA coolers, or just PCB-mounting fans, in EasyEDA or Mouser, for that matter.  My Google-fu is failing me on this... ::)

I just want a PCB footprint really, so I have an idea where to include mounting holes if a fan/heatsink combo is required.
I'm thinking of something like this one: https://www.mouser.ca/ProductDetail/Wakefield-Vette/960-23-33-D-AB-0?qs=PqoDHHvF64%2FNvAsuJB%2Fzyw%3D%3D
Or we can use one a size up: https://www.mouser.ca/ProductDetail/Wakefield-Vette/960-27-33-D-AB-0?qs=PqoDHHvF64%2FM6UHqLS5KaQ%3D%3D

As for fans, since we're only going to have a 5 V rail, we'll need to use 5 V fans, and there aren't that many of them - which is why I'm wondering if it would make sense to use a larger heatsink (like the 27 mm one I listed above), as it would allow using 25 mm fans, of which there is a larger selection.

----
I've created a quick testbench for MIG; the code is here: https://github.com/asmi84/a100-484-sodimm It just waits until the controller completes initialization (signal "init_calib_complete" goes high), executes a bunch of writes to addresses 0, 0x8, 0x10 (to see back-to-back bursts), as well as 0x20000000 to test access to the second rank, and then performs reads from those addresses (in the same order as the writes). Just load the project, click "Run Simulation" on the left panel, and wait - it takes a while; on my PC it takes about 10-11 minutes to complete. That's why you will need to come up with some kind of BFM, as waiting that long is going to make development too slow and too annoying. The 32bit controller took about half as long.
« Last Edit: January 23, 2023, 12:01:00 am by asmi »
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7789
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #160 on: January 23, 2023, 09:44:03 am »
@BrianHG - do you have thoughts/preferences on these options at all for working best with your multi-port module?  I've defaulted to bank-row-column, from what asmi is saying it'll be rank-bank-row-column (don't try saying that when you're drunk!), although the rank part is dealt with by the MIG interface.

I'm thinking that 'rank-bank-row-column' gives you the opportunity to create a 2-rank controller for when an 8gig stick is installed, and it may still function with a 4gig stick except that everything above 4 gig will read and write as blank or error.  It depends on the flexibility of Vivado's ram controller when in operation.

This is what is meant by the 'order'

Say we have a 33bit address.  The rank-bank-row-column means that the address being wired to the ram modules will be:

Desired_33bit_address [32:0] = address_assigned_in_ram_chip { rank, bank[2:0] , row[14:0], column[9:0], null[3:0] } ;
(For Vivado's ram controller, just ignore the bottom 4 null bits)


The 'rank' controls the 2 separate CS# (S1#/S0# chip select on module) pins for the first 8 and second 8 ram chips, which are wired in parallel except for the CS and CKE pins.  Basically it is wired as an upper address line in the 'rank-bank-row-column' order.  The bank[2:0] selects 1 of 8 banks in each ram chip.  The row selects the row address.  The column[9:0] selects which column to read.  I placed the null[3:0] there since, even though the module is 64bit DDR / 128bit, I still point the address down to the theoretical byte even though those address lines in the command are ignored.  (Assuming Vivado's ram controller places these dummy address lines in its command like mine does.  Also, my controller ignores column[2:0] and forces 0s there as well, as its minimum read/write burst size of 8 (called a BL8) always aligns the read and write in a forward order.)

Now, why would we do this?  Say we went with 'rank-row-bank-column', ie:
Desired_33bit_address [32:0] = address_assigned_in_ram_chip { rank, row[14:0], bank[2:0], column[9:0], null[3:0] } ;

Now, every sequential 16384 bytes of ram, we will switch to a new bank.
bytes 0-16383 are in bank 0.
bytes 16384-32767 are in bank 1.
etc., until bank 7 ends at byte 131071.  After bank 7, we go to row 1, bank 0, etc...

With my selected preference rank-bank-row-column, ie:
Desired_33bit_address [32:0] = address_assigned_in_ram_chip { rank, bank[2:0] , row[14:0], column[9:0], null[3:0] } ;

Now the first sequential 536870912 bytes of ram are all in bank 0, though every 16384 bytes will switch to a new row for that bank 0.
Then we will go to bank 1, row 0 and up.

If all your code and loops usually fit into a block or two of 131072 bytes, then it should be advantageous to use 'rank-row-bank-column' or even 'row-rank-bank-column'.  However, with our large display buffers and future display textures, sound and network buffers, operating in 'rank-bank-row-column' can offer other future memory-access optimizations if you properly design to do so.  (IE: place display buffers 1,2,3,4 in banks 6 and 7; place textures in banks 5,4,3,2, line-interleave multiplexed (IE you can parallel-read alternate lines of a texture when filling, so you can do bi-linear filtering when zooming into a texture without all the additional precharge/activate/read every time you read a new Y coordinate on a texture to acquire the pixel shading blend); code and audio in banks 0,1.  In dual rank mode, make each bank size 1gb and potentially keep track of 16 banks.)
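The two orderings can be checked numerically. The sketch below packs the field widths given above (rank 1 bit, bank 3, row 15, column 10, null 4) and confirms the stride claims; it shows only the bit packing, not the BL8 alignment the real controller handles:

```python
# Decode a 33-bit address under the two orderings discussed.
def fields_rank_bank_row_col(addr):
    return {"rank": addr >> 32 & 1,
            "bank": addr >> 29 & 0x7,
            "row":  addr >> 14 & 0x7FFF,
            "col":  addr >> 4  & 0x3FF}

def fields_rank_row_bank_col(addr):
    return {"rank": addr >> 32 & 1,
            "row":  addr >> 17 & 0x7FFF,
            "bank": addr >> 14 & 0x7,
            "col":  addr >> 4  & 0x3FF}

# rank-row-bank-column: every sequential 16384 bytes lands in a new bank.
assert fields_rank_row_bank_col(0)["bank"] == 0
assert fields_rank_row_bank_col(16384)["bank"] == 1

# rank-bank-row-column: the first 2**29 bytes all stay in bank 0,
# stepping to a new row every 16384 bytes instead.
assert fields_rank_bank_row_col(16384)["bank"] == 0
assert fields_rank_bank_row_col(16384)["row"] == 1
assert fields_rank_bank_row_col(2**29)["bank"] == 1
```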


Quote
What about bank machines?  More the merrier, or is 4 the sweet spot?  I guess you'd want as many bank machines as you have ports (up to max 8 ), as each port could be using a different row?  Depends on resource usage I guess?

I do not know what Vivado's ram controller's bank machines are all about.  Ask asmi.  In my controller, since the DDR3 has 8 banks, my controller keeps track of and keeps open all 8 banks.  It will only close them as needed to optimize access.  My guess, and it is only a guess, is that you should set Vivado's ram controller's bank machines to 8 so that all 8 individual banks stay open if your memory access makes use of it.

Quote
What about transaction ordering?  Any preference there for maximum compatibility/performance?
I do not know what this setting does.

For my multiport controller, I have:
Code: [Select]
// ************************************************************
// *** Controls are received from the BrianHG_DDR3_PHY_SEQ. ***
// ************************************************************
input                                SEQ_CAL_PASS        ,    // Goes low after a reset, goes high if the read calibration passes.
input                                DDR3_READY          ,    // Goes low after a reset, goes high when the DDR3 is ready to go.

input                                SEQ_BUSY_t          ,    // (*** WARNING: THIS IS A TOGGLE INPUT when parameter 'USE_TOGGLE_OUTPUTS' is 1 ***) Commands will only be accepted when this output is equal to the SEQ_CMD_ENA_t toggle input.
input                                SEQ_RDATA_RDY_t     ,    // (*** WARNING: THIS IS A TOGGLE INPUT when parameter 'USE_TOGGLE_OUTPUTS' is 1 ***) This output will toggle from low to high or high to low once new read data is valid.
input        [PORT_CACHE_BITS-1:0]   SEQ_RDATA           ,    // 256 bit date read from ram, valid when SEQ_RDATA_RDY_t goes high.
input        [DDR3_VECTOR_SIZE-1:0]  SEQ_RDVEC_FROM_DDR3 ,    // A copy of the 'SEQ_RDVEC_FROM_DDR3' input during the read request.  Valid when SEQ_RDATA_RDY_t goes high.

// ******************************************************
// *** Controls are sent to the BrianHG_DDR3_PHY_SEQ. ***
// ******************************************************
output logic                         SEQ_CMD_ENA_t       ,  // (*** WARNING: THIS IS A TOGGLE CONTROL! when parameter 'USE_TOGGLE_OUTPUTS' is 1 *** ) Begin a read or write once this input toggles state from high to low, or low to high.
output logic                         SEQ_WRITE_ENA       ,  // When high, a 256 bit write will be done, when low, a 256 bit read will be done.
output logic [PORT_ADDR_SIZE-1:0]    SEQ_ADDR            ,  // Address of read and write.  Note that ADDR[4:0] are supposed to be hard wired to 0 or low, otherwise the bytes in the 256 bit word will be sorted incorrectly.
output logic [PORT_CACHE_BITS-1:0]   SEQ_WDATA           ,  // write data.
output logic [PORT_CACHE_BITS/8-1:0] SEQ_WMASK           ,  // write data mask.
output logic [DDR3_VECTOR_SIZE-1:0]  SEQ_RDVEC_TO_DDR3   ,  // Read destination vector input.
output logic                         SEQ_refresh_hold       // Prevent refresh.  Warning, if held too long, the SEQ_refresh_queue will max out.
);
(I have a parameter to change the '_toggle' controls to positive logic, i.e. normal high = on, low = off.)
After the DDR3 is ready...

While you keep my 'SEQ_BUSY_t' input low, I will set 'SEQ_CMD_ENA_t' high when I am sending out a command.  Otherwise it is low.

My SEQ_ADDR output will be 33 bits for an 8 GB module.  (For Vivado's ram controller, just ignore the bottom 4 bits.)
The bottom 7 bits will always be 0s since I am expecting 512-bit data.

My 'SEQ_WMASK' output will be 64 bits (512/8), and for every bit which is high, the 8 associated data bits are expected to be written.
(Warning: Vivado's ram controller may have this inverted, as that is how it is on the DDR3 ram chips.)

When I send a read command, my 'SEQ_RDVEC_TO_DDR3' output will have a 4 bit ID number.

My multiport will accept a read data word every clock while the SEQ_RDATA_RDY_t input is high.  While it is high, it expects a 4-bit ID on input 'SEQ_RDVEC_FROM_DDR3' from the ram controller along with the read data on input 'SEQ_RDATA'.

If Vivado's ram controller doesn't support such a feature, you will need to create your own.  So long as reads come back in the same order as the read requests, it is nothing more than a FIFO: push the ID on my read request, pop it on Vivado's ram controller's read-ready.  This FIFO just needs to be deep enough to cover the maximum number of queued read commands Vivado's ram controller allows before an actual read is returned.  It would be preferable if Vivado's ram controller supported some sort of read target address/pointer function, as that removes any possible synchronization error or bug.
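To illustrate the idea, here is a minimal behavioural sketch (in Python, not HDL) of that read-ID FIFO, assuming the MIG returns read data strictly in request order; the class and method names are made up for the example:

```python
from collections import deque

class ReadVectorFifo:
    """Models the ID FIFO described above: because the controller returns
    read data strictly in request order, pairing each returned word with
    the ID pushed at request time reproduces SEQ_RDVEC_FROM_DDR3."""
    def __init__(self, depth):
        self.depth = depth          # must cover max outstanding reads
        self.ids = deque()

    def read_request(self, rdvec_id):
        if len(self.ids) >= self.depth:
            raise RuntimeError("too many outstanding reads for FIFO depth")
        self.ids.append(rdvec_id)   # SEQ_RDVEC_TO_DDR3 captured here

    def read_data_ready(self, rdata):
        # Controller asserted data-valid: oldest outstanding ID owns this word.
        return self.ids.popleft(), rdata

fifo = ReadVectorFifo(depth=8)
for i in (3, 1, 7):                 # three reads issued with IDs 3, 1, 7
    fifo.read_request(i)
returned = [fifo.read_data_ready(f"word{n}") for n in range(3)]
print(returned)                     # IDs come back in request order
```

In HDL this is just a small distributed-RAM FIFO whose write strobe is the read request and whose read strobe is the controller's data-valid.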

As for the dual-rank settings in my multiport, we will just change my parameter 'DDR3_WIDTH_BANK' from 3 to 4, effectively treating the rank as another 8 banks.  Basically we will set the ram chips to 8x 4 Gb, 8-bit, but add an extra bit on the bank address, as the 'rank-bank-row-column' ordering will just operate as a 16-bank memory even though it really addresses 2 groups of ram chips tied in parallel.
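A small sketch of that rank-folding idea, in Python for clarity. The field widths below are the typical ones for a 4 Gb x8 DDR3 device (16 row bits, 10 column bits, 3 bank bits); they are assumptions for illustration, not BrianHG's actual parameter code:

```python
# Hypothetical field widths for a 4 Gb x8 DDR3 chip; the rank bit is
# folded in as a 4th bank bit (DDR3_WIDTH_BANK = 4) so that
# 'rank-bank-row-column' behaves like a 16-bank memory.
ROW_BITS, COL_BITS, BANK_BITS = 16, 10, 3

def pack_addr(rank, bank, row, col):
    bank16 = (rank << BANK_BITS) | bank      # rank becomes the extra bank bit
    return (bank16 << (ROW_BITS + COL_BITS)) | (row << COL_BITS) | col

def unpack_addr(addr):
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank16 = addr >> (ROW_BITS + COL_BITS)
    return bank16 >> BANK_BITS, bank16 & ((1 << BANK_BITS) - 1), row, col

a = pack_addr(rank=1, bank=5, row=0x1234, col=0x56)
print(unpack_addr(a))   # recovers (1, 5, 0x1234, 0x56)
```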
« Last Edit: January 23, 2023, 11:52:39 am by BrianHG »
 
The following users thanked this post: nockieboy

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7789
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #161 on: January 23, 2023, 10:56:28 am »
Whilst I'm thinking about it, what should I be looking for when I search for a commonly-used fan to cool an FPGA like the XC7A FGG484?  I'm not having much luck finding FPGA coolers, or just PCB-mounting fans, in EasyEDA or Mouser, for that matter.  My Google-fu is failing me on this... ::)

I just want a PCB footprint really, so I have an idea where to include mounting holes if a fan/heatsink combo is required.
I'm thinking of something like this one: https://www.mouser.ca/ProductDetail/Wakefield-Vette/960-23-33-D-AB-0?qs=PqoDHHvF64%2FNvAsuJB%2Fzyw%3D%3D

Or we can use one a size up: https://www.mouser.ca/ProductDetail/Wakefield-Vette/960-27-33-D-AB-0?qs=PqoDHHvF64%2FM6UHqLS5KaQ%3D%3D

As for fans, since we're only going to have a 5 V rail, we'll need to use 5 V fans, and there aren't that many of them - which is why I'm wondering if it would make sense to use a larger heatsink (like the 27 mm one I listed above), as it would allow using 25 mm fans, of which there is a larger selection.

For a chip called Artix, you would think it runs cool, or at least that a heat sink should be enough.
If you want a fan, you can also try looking for a fanned heatsink.  The MAX10 50kLE barely gets warm with no heatsink on it.  Though I am expecting that Nockieboy can move his core logic from 100 MHz (the damn ellipse and line generator was holding us back) to 200 MHz on the Artix, and with double the logic size, the heat generated could be up to quadruple.

25 mm with heatsink, 2 for $5

30 mm with heatsink, 1 for $4
« Last Edit: January 23, 2023, 11:07:14 am by BrianHG »
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #162 on: January 23, 2023, 03:12:50 pm »
For a chip called Artix, you would think it runs cool, or at least that a heat sink should be enough.
You can think that, or you can ask anyone who has actually used these chips how cool they are. I needed a heatsink even for an A35T device, though it depends on many factors, including board-level stuff like the presence of and proximity to other heat-generating components, the size of the board, layer composition, etc.

If you want a fan, you can also try looking for a fanned heatsink.  The MAX10 50kLE barely gets warm with no heatsink on it.  Though I am expecting that Nockieboy can move his core logic from 100 MHz (the damn ellipse and line generator was holding us back) to 200 MHz on the Artix, and with double the logic size, the heat generated could be up to quadruple.
Fan might not be required, but it's good to add a provision for it just in case.

25 mm with heatsink, 2 for $5

30 mm with heatsink, 1 for $4
If your sanity is worth anything to you, you will never buy a fan on Aliexpress.

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #163 on: January 23, 2023, 06:16:43 pm »
Whilst I'm thinking about it, what should I be looking for when I search for a commonly-used fan to cool an FPGA like the XC7A FGG484?  I'm not having much luck finding FPGA coolers, or just PCB-mounting fans, in EasyEDA or Mouser, for that matter.  My Google-fu is failing me on this... ::)

I just want a PCB footprint really, so I have an idea where to include mounting holes if a fan/heatsink combo is required.
I'm thinking of something like this one: https://www.mouser.ca/ProductDetail/Wakefield-Vette/960-23-33-D-AB-0?qs=PqoDHHvF64%2FNvAsuJB%2Fzyw%3D%3D
Or we can use one a size up: https://www.mouser.ca/ProductDetail/Wakefield-Vette/960-27-33-D-AB-0?qs=PqoDHHvF64%2FM6UHqLS5KaQ%3D%3D

As for fans, since we're only going to have a 5 V rail, we'll need to use 5 V fans, and there aren't that many of them - which is why I'm wondering if it would make sense to use a larger heatsink (like the 27 mm one I listed above), as it would allow using 25 mm fans, of which there is a larger selection.

Marvellous - thank you.  Funny how I couldn't find anything, but now that I've got the name/category I keep falling over them. ::)


I've created a quick testbench for MIG; the code is here: https://github.com/asmi84/a100-484-sodimm  It just waits until the controller completes initialization (signal "init_calib_complete" goes high), executes a bunch of writes to addresses 0, 0x8, 0x10 (to see back-to-back bursts), as well as 0x20000000 to test access to the second rank, and then performs reads from those addresses (in the same order as the writes). Just load the project, click "Run Simulation" on the left panel, and wait - it takes a while; on my PC it takes about 10-11 minutes to complete. That's why you will need to come up with some kind of BFM, as waiting that long is going to make development too slow and too annoying. The 32-bit controller took about half as long.

Brilliant, thanks for this.  I've cloned the repo and will give it a spin as soon as I have enough time to sit and start making sense of it all. :o



I'm thinking that 'rank-bank-row-column' gives you the opportunity to create a 2-rank controller for when an 8 gig stick is installed, and it may still function with a 4 gig stick except that everything above 4 gig will read and write as blank or error.  It depends on Vivado's ram controller's flexibility when in operation.

Thanks for the explanation BrianHG, it's making more sense to me now.  It really feels like I've a hell of a lot to learn about memory and controllers before I can truly understand everything you've written about the controller settings etc.

Quote
What about bank machines?  More the merrier, or is 4 the sweet spot?  I guess you'd want as many bank machines as you have ports (up to a max of 8), as each port could be using a different row?  Depends on resource usage I guess?
I do not know what Vivado's ram controller's bank machines are all about.  Ask asmi.  In my controller, since the DDR3 has 8 banks, my controller keeps track of and keeps open all 8 banks.  It will only close them as needed to optimize access.  My guess, and it is only a guess, is that you should set Vivado's ram controller's bank machines to 8 to keep all 8 individual banks open so that memory access makes use of them.

@asmi - is BrianHG on the money here?  Makes sense to me to have 8 bank machines, but you're the Xilinx expert here. ;)



Fan might not be required, but it's good to add a provision for it just in case.

I agree.  Whilst there are some designs that absolutely won't require one, that's not to say that the GPU project or some other project someone uses won't need one.  If I can include provision for one, that's a Good Thing™.

If your sanity is worth anything to you, you will never buy a fan on Aliexpress.

I have to agree with asmi here.  I don't know why, but something just doesn't sit right about a hydraulic-bearing 25x25 mm cooling fan, plus heatsink (albeit not a very tall one), for £2.94 (~$4)... for two.  It's going to cost more than that to post them from China to my house...

Anyhow, a cooler on its own from Mouser is over double that, but it's all about options I guess.

EDIT: Oh, nearly forgot.  I'm looking more and more seriously at scrubbing the core/carrier combination and just doing one PCB with everything on it - it's looking like a better idea all the time, not least from a cost and complexity perspective, with the preferred SODIMM memory.  This is even more important now that I'm thinking of moving to soft-core CPUs within the FPGA and not bothering to support a discrete host system.  I need to give some thought to required peripherals on the board:
  • USB OTG or USB HOST is a must
  • Ethernet
  • Audio codec
  • HDMI output
  • Sufficient IO for peripherals - PMOD, Beaglebone or generic pin-strip connectors
...plus (and I know we're coming back to the stub issue), a neat way to program/update the FPGA via USB, and USB serial comms.
« Last Edit: January 23, 2023, 06:25:50 pm by nockieboy »
 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7789
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #164 on: January 23, 2023, 07:50:46 pm »
I'm thinking that 'rank-bank-row-column' gives you the opportunity to create a 2 rank controller for when a 8gig stick is installed and it may still function with a 4gig stick except that everything above 4 gig will read and write as blank or error.  It depends on Vivado's ram controller's flexibility when in operation.

Thanks for the explanation BrianHG, it's making more sense to me now.  It really feels like I've a hell of a lot to learn about memory and controllers before I can truly understand everything you've written about the controller settings etc.

It is less about understanding Vivado's or my DDR3 controller.  It is about understanding how a DDR3 ram chip works.

Download this data sheet: https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr3/4gb_ddr3l.pdf?rev=c2e67409c8e145f7906967608a95069f

Take a look at page 16, figure 4.  This is one of the ram chips on your SODIMM module.
Your module has 2 groups of 8 of them in parallel giving you a 64 bit data bus.
The 2 different groups are the 2 different 'rank'.

Looking at the block diagram, you will see banks 0 through 7, 8 of them.
The chip appears as 8 smaller DDR3 ram chips inside a single chip.

To perform a read, you need to look at a few pages:

Step 1)  First, if the currently/previously activated row in the bank you want to access is the wrong one, you need to 'precharge' that bank to release the selected row.  There is an associated minimum amount of time you must wait before executing the 'precharge'.  See page 167, figure 75 for an example of the time from a final 'read' command to the earliest 'precharge' command.  As you can see, you have to wait all the way until clock T13 before you can activate a new row.

Step 2)  Second, if the data you want to read is in a row which hasn't already been activated, you need to activate that row; see page 161, figure 7.  They show an activate on clock T3, and you can only read from 'that bank's row' by clock T11.  Another great delay.

Step 3)  Once your bank and row have been set up, see page 165, figure 70: there is a delay until T5 before the data for the first read command comes out.  You will also notice that even though the first data had not yet come out, a second read was issued at T4, and the read stream continues unbroken, as seen on the DQ output.  This is permitted so long as the data you read or write is in the same row.  What is not illustrated is that if any of your other 8 banks also already have the correct row activated where you wish to read or write data, you can skip steps 1 and 2 and just read continuously, as in doing step 3 alone, without any break in the high-speed data stream.

Remember I said to place your video buffer and textures/painting bitmaps in different banks.  If you are reading a straight line of graphic data from bank 1 and writing that straight line of data to bank 2, then once the first access of each read and write address has 'precharged' and then 'activated' its row in the associated bank, the read and write commands can loop in step #3 without having to continuously perform the slow steps #1 and #2.  If your texture and display data were both in bank 1, but in different rows, you would need to do steps 1, 2, 3 to read a 512-bit block of texture data, then steps 1 and 2 to change the row in the same bank, then an equivalent step 3 to write the 512 bits.  Then for the next read, back to steps 1 and 2 to change the row again, and back to step 3.  The other choice, to radically decrease steps 1 & 2, is to have a large read and write cache: read the whole texture in one straight line into an FPGA BRAM buffer, then write it out in a straight line to display ram.
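The cost of those three steps can be played with in a toy model. This Python sketch counts command-slot overhead for the two placements described above; the cycle costs are illustrative round numbers, not datasheet timings:

```python
# Toy model of the precharge/activate/read-write steps: counts cycles
# spent per access for two placements of texture and display data.
# The per-step costs are illustrative, not real DDR3 timing values.
T_PRECHARGE, T_ACTIVATE, T_READWRITE = 4, 4, 4

def access_cost(open_rows, bank, row):
    """Cost of one access given which row each bank currently has open."""
    cost = 0
    if open_rows.get(bank) not in (None, row):
        cost += T_PRECHARGE            # step 1: close the wrong row
    if open_rows.get(bank) != row:
        cost += T_ACTIVATE             # step 2: open the right row
        open_rows[bank] = row
    return cost + T_READWRITE          # step 3: the actual burst

def total(accesses):
    open_rows = {}
    return sum(access_cost(open_rows, b, r) for b, r in accesses)

# Alternating texture reads and display writes, 8 transfers each.
same_bank = [(1, 0) if i % 2 == 0 else (1, 1) for i in range(16)]
two_banks = [(1, 0) if i % 2 == 0 else (2, 1) for i in range(16)]
print(total(same_bank), total(two_banks))
```

With both streams in one bank, every access pays for steps 1 and 2 again; with the streams in separate banks, only the first access to each bank pays them.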

I still say to read the above .pdf data sheet so you have an idea of what the bottlenecks in using DDR3 ram are and what the controllers are doing to read and write to them.  This can give you an idea in the future when you are trying to optimize the performance of your design, or your code / asset placement in ram, to help speed things along.

Quote
@asmi - is BrianHG on the money here?  Makes sense to me to have 8 bank machines, but you're the Xilinx expert here. ;)

Do not quote me on this.  I do not know if this is what is meant by Vivado's 'bank machine' setting.  I only told you that my ram controller keeps track of all 8 banks and allows all 8 to remain open.  If Vivado's 'bank machine' setting has to do with the number of banks activated in advance of a command stream, a setting of 4 may be optimal, as Vivado's controller may still keep all 8 banks open until it must close one due to a new row address in that bank, or a mandatory refresh.  I do not foresee this being any noticeable performance help either way until much later in the game, when you're really doing a 3D pixel shader with some really advanced stuff.
 
The following users thanked this post: nockieboy

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #165 on: January 23, 2023, 07:57:11 pm »
Brilliant, thanks for this.  I've cloned the repo and will give it a spin as soon as I have enough time to sit and start making sense of it all. :o
I tried to set up the project such that you only need to open it and press a single button. It will simulate 150 us; of that, about 120 us is taken by the initialization, memory training, etc., and then come my commands.

I'm thinking that 'rank-bank-row-column' gives you the opportunity to create a 2 rank controller for when a 8gig stick is installed and it may still function with a 4gig stick except that everything above 4 gig will read and write as blank or error.  It depends on Vivado's ram controller's flexibility when in operation.

Thanks for the explanation BrianHG, it's making more sense to me now.  It really feels like I've a hell of a lot to learn about memory and controllers before I can truly understand everything you've written about the controller settings etc.
That's actually not true for Micron modules - both MT16KTF51264HZ (4 GB) and MT16KTF1G64HZ (8 GB) are dual-rank modules with 16 x8 memory devices; they just use different chips - the 4 GB version uses 2 Gb devices, while the 8 GB version uses 4 Gb ones. I actually happen to have that exact 4 GB module (bought for a different project which didn't come to life for various reasons). So it's not that simple, and you have to look at the specs for a specific module to see if it's single rank or dual rank. There are some DDR3 RDIMMs which are quad rank as well.
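The capacity arithmetic behind those two part numbers is worth spelling out; a quick sanity check, assuming the 16 x8 chips per module stated above:

```python
# 16 x8 chips per module, arranged as 2 ranks of 8 chips each.
def module_gbytes(chips, gbit_per_chip):
    return chips * gbit_per_chip / 8   # 8 bits per byte

print(module_gbytes(16, 2))   # MT16KTF51264HZ: 2 Gb chips -> 4.0 GB
print(module_gbytes(16, 4))   # MT16KTF1G64HZ:  4 Gb chips -> 8.0 GB

# Each rank: 8 chips x 8 data bits = the module's 64-bit data bus.
print(8 * 8)                  # -> 64
```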

@asmi - is BrianHG on the money here?  Makes sense to me to have 8 bank machines, but you're the Xilinx expert here. ;)
I never actually needed more than 4 open rows at the same time because I always try to optimize my access patterns to minimize open rows, so I don't really know. One thing I do know is that bank machines cost logic resources which are finite, so add more only if you need them. Thankfully, all it takes to change that number is to run the GUI again and regenerate sources, so I'd say leave it as it is for now, and you can always change it later.

I agree.  Whilst there are some designs that absolutely won't require one, that's not to say that the GPU project or some other project someone uses won't need one.  If I can include provision for one, that's a Good ThingTM.
Artix devices actually have a thermal diode on the die, bonded out to pins DXP and DXN, so technically it's possible to implement some kind of external fan controller (you would need to add a temp-sense IC to perform the measurement and give you a temperature reading) and do PWM with a FET to reduce the noise when you don't need full blast. But that would require some kind of microcontroller, and I don't think it's worth adding an MCU just for that purpose - I just wanted to let you know the feature is available.
Also, these FPGAs have dual multichannel 12-bit 1 Msps ADCs, which can also measure things like die temperature and power supply voltages, so you can implement a PWM fan controller inside the FPGA too. In fact, MIG uses that very ADC to measure die temperature and perform thermal compensation of internal delays to make sure the DDR3 controller works over the entire range of temperatures (which can be as wide as -40 to +125 °C depending on the device grade).

I have to agree with asmi here.  I don't know why, but something just doesn't sit right about a hydraulic-bearing 25x25 mm cooling fan, plus heatsink (albeit not a very tall one), for £2.94 (~$4)... for two.  It's going to cost more than that to post them from China to my house...

Anyhow, a cooler on its own from Mouser is over double that, but it's all about options I guess.
There are two things I would never buy on Aliexpress (or ebay for that matter) - fans and computer storage (SSDs, SD cards, HDDs, etc.). The former because of noise (cheap stuff is cheap for a reason!), the latter because the data I tend to put on them costs much more than the device itself, so I prefer to have something reliable.

EDIT: Oh, nearly forgot.  I'm looking more and more seriously at scrubbing the core/carrier combination and just doing one PCB with everything on it - it's looking like a better idea all the time, not least from a cost and complexity perspective, with the preferred SODIMM memory.  This is even more important now that I'm thinking of moving to soft-core CPUs within the FPGA and not bothering to support a discrete host system.  I need to give some thought to required peripherals on the board:
  • USB OTG or USB HOST is a must
  • Ethernet
  • Audio codec
  • HDMI output
  • Sufficient IO for peripherals - PMOD, Beaglebone or generic pin-strip connectors
...plus (and I know we're coming back to the stub issue), a neat way to program/update the FPGA via USB, and USB serial comms.
If it's going to be a single board, then we'll need to place everything one would need for a typical SBC (single-board computer). So I would add a 4-port USB 2.0 hub (implemented using Microchip's USB2514 or USB2504A device) connected to a USB ULPI PHY, so that you can connect multiple devices at the same time (think keyboard and mouse + perhaps a thumb drive). This also has some advantages as far as implementation goes, because you won't have to implement USB 1.1 alongside USB 2.0, but can instead use special features of USB 2.0 to talk to slower devices - but those are small details not relevant here. Also, maybe implement PCIE, which can be switched with DisplayPort - either as an M.2 connector (for an NVMe SSD) or as a full-up x4 PCIE port? There are some affordable switches which are fast enough to switch PCIE 2 traffic. I'm just thinking aloud here about what you could possibly want in an SBC.

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #166 on: January 23, 2023, 08:44:19 pm »
In addition to what @BrianHG said, it will do you good to study the waveforms of activity on a DDR3 bus; attached is a good example of complex commanding for a dual-rank module.

Notes for the image:
1. "Precharge" command to rank 1.
2. "Write" command to column 0 of rank 0. Line A12 is high for a full 8-beat burst (not a 4-beat chop).
3. "Write" command to column 0x8 of rank 0. Line A12 is high for a full 8-beat burst (not a 4-beat chop). Data burst 0 for command (2) is about to begin.
4. "Refresh" command to rank 1. Data burst 0 is in progress.
5. "Write" command to column 0x10 of rank 0. Line A12 is high for a full 8-beat burst (not a 4-beat chop). Data burst 0 for command (2) is about to complete, followed by data burst 1 for command (3).

As you can see, write commands are scheduled in order to fully utilize the data bus for an uninterrupted stream of data into the memory - three 8-beat bursts with no breaks between them (there is a similar stream for reads later in the timeline) - while at the same time using the other command slots to issue commands to the second rank. You can also see that this controller has 1T commanding, meaning it can issue a command to the memory on every memory clock cycle, unlike some other controllers I've seen which have 2T commanding and can only issue a command on every other clock cycle, and so can't pack commands as efficiently as a 1T one can.
« Last Edit: January 23, 2023, 08:52:32 pm by asmi »
 

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #167 on: January 24, 2023, 06:57:04 pm »
That's actually not true for Micron modules - both MT16KTF51264HZ (4 GB) and MT16KTF1G64HZ (8 GB) are dual-rank modules with 16 x8 memory devices; they just use different chips - the 4 GB version uses 2 Gb devices, while the 8 GB version uses 4 Gb ones. I actually happen to have that exact 4 GB module (bought for a different project which didn't come to life for various reasons). So it's not that simple, and you have to look at the specs for a specific module to see if it's single rank or dual rank. There are some DDR3 RDIMMs which are quad rank as well.

I got an MT16KTF1G64HZ-1G6E1 in the post thanks to an impulse-buy on eBay for £9. :)

@asmi - is BrianHG on the money here?  Makes sense to me to have 8 bank machines, but you're the Xilinx expert here. ;)
I never actually needed more than 4 open rows at the same time because I always try to optimize my access patterns to minimize open rows, so I don't really know. One thing I do know is that bank machines cost logic resources which are finite, so add more only if you need them. Thankfully, all it takes to change that number is to run the GUI again and regenerate sources, so I'd say leave it as it is for now, and you can always change it later.

Okay, still with default 4 for the moment.

Artix devices actually have a thermal diode on die bonded out to pins DXP and DXN, so technically it's possible to implement some kind of external fan controller (you will need to add a temp sense IC which would perform the measurement and give you a temperature reading) and do a PWM with a FET to reduce the noise when you don't need a full blast, but that will require some kind of microcontroller, and I don't think it's worth it to add an MCU just for that purpose, but I just wanted to let you know that this feature is available.
Also these FPGA have dual multichannel 12bit 1 Msps ADCs, which can also measure things like die temperature and power supply voltage, so you can implement a PWM fan controller inside FPGA too. Infact MIG uses that very ADC to measure die temperature and perform thermal compensation of internal delays to make sure DDR3 controller work in the entire range of temperatures (which can be as wide as -40 - +125 °C depending on a device grade).

Interesting.  It would make sense to use the internal ADC and implement an HDL PWM with a pin out to a FET to control the fan, no?  Only one external part needed that way (and maybe a couple of resistors), and you get thermally-controlled fan regulation for little effort.
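The control law itself is trivial; here is a behavioural sketch (Python, not HDL) of a temperature-to-duty mapping. The thresholds and the full-blast fail-safe are illustrative choices, not Xilinx-specified values:

```python
# Sketch of the thermally-controlled PWM idea: map the XADC die
# temperature to a fan duty cycle.  Thresholds are made up for the
# example; failing to full blast when the reading is invalid is the
# safe default.
def fan_duty(temp_c, adc_valid):
    if not adc_valid:
        return 1.0                  # sensor unreadable: fail to full blast
    if temp_c < 40.0:
        return 0.0                  # cool enough: fan off
    if temp_c >= 80.0:
        return 1.0
    return (temp_c - 40.0) / 40.0   # linear ramp between 40 and 80 degC

print(fan_duty(60.0, True))         # mid-range temperature, half duty
```

In HDL this becomes a comparator chain feeding a counter-based PWM; one FET and a flyback diode drive the fan.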

If it's going to be a single board, then we'll need to place everything one would need for a typical SBC (single-board computer). So I would add a 4-port USB 2.0 hub (implemented using Microchip's USB2514 or USB2504A device) connected to a USB ULPI PHY, so that you can connect multiple devices at the same time (think keyboard and mouse + perhaps a thumb drive). This also has some advantages as far as implementation goes, because you won't have to implement USB 1.1 alongside USB 2.0, but can instead use special features of USB 2.0 to talk to slower devices - but those are small details not relevant here. Also, maybe implement PCIE, which can be switched with DisplayPort - either as an M.2 connector (for an NVMe SSD) or as a full-up x4 PCIE port? There are some affordable switches which are fast enough to switch PCIE 2 traffic. I'm just thinking aloud here about what you could possibly want in an SBC.

4-port USB 2.0 - yeah, I can see that being useful.
x4 PCIE? Interesting.  I can see the immediate potential of an M.2 connector, but what about PCIE?  Forgive my lack of imagination or knowledge, but what sort of peripherals would that open the board up to?
What about DisplayPort?  Is that a big thing/must-have?

Is there any way the 4-port USB hub could be made use of by a soft-core CPU running something that doesn't have a USB stack like Linux's?  I'd still like to be able to plug a USB keyboard and mouse in with, for example, a 16-bit Motorola soft-core CPU - or even the dreaded Z80 or other 8-bit processors.  This is a pretty key requirement for the board - it's up there with the SODIMM, really.  The only solution I could come up with myself was to use a dedicated chip like the CH559 to handle the USB peripherals and send codes to the host via serial.  I don't really want to include something as specific and niche as that on a more generic board that will likely be used for more powerful applications that can handle the USB ports natively.  Or will it be a case of running a MicroBlaze CPU to handle the USB in addition to the actual emulated CPU, whatever that may be, and passing coordinates/keypresses etc. to it internally?
 

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #168 on: January 24, 2023, 07:23:25 pm »
As you can see, write commands are scheduled in order to fully utilize the data bus for an uninterrupted stream of data into the memory - three 8-beat bursts with no breaks between them (there is a similar stream for reads later in the timeline) - while at the same time using the other command slots to issue commands to the second rank. You can also see that this controller has 1T commanding, meaning it can issue a command to the memory on every memory clock cycle, unlike some other controllers I've seen which have 2T commanding and can only issue a command on every other clock cycle, and so can't pack commands as efficiently as a 1T one can.

I got the simulation running in Vivado.  Like you said, a single click and off it went.  I have exactly the same output as your picture too, so it's looking good.  I just need to get a grasp of how it all fits together now and get comfortable using it like I am with Quartus.  Then I'll be of more use in trying to apply BrianHG's multi-port adapter to the simulation and getting the BFM set up?
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #169 on: January 24, 2023, 09:08:28 pm »
I got an MT16KTF1G64HZ-1G6E1 in the post thanks to an impulse-buy on eBay for £9. :)
I ordered one for myself as well. I've also learned that apparently there is an even larger-capacity version - MT16KTF2G64HZ - 16 GBytes! But it's quite expensive even on the likes of ebay, and in general it's not easy to find in stock. I'm also thinking about picking up some cheap "computer store"-sourced modules just to see if they work - but that is for later, when we actually have a board in our hands.

Interesting.  It would make sense to use the internal ADC and implement an HDL PWM with a pin out to a FET to control the fan, no?  Only one external part needed that way (and maybe a couple of resistors), and you get thermally-controlled fan regulation for little effort.
We've got to make sure it goes to full blast by default, as that seems to be the safest option. There are also some complications with using the XADC when you have the DDR3 controller, because MIG uses it internally for the same reason. It's nothing complex really - you just need to feed the current temperature measurement to the controller - and it can be done.

x4 PCIE? Interesting.  I can see the immediate potential of an M.2 connector, but what about PCIE?  Forgive my lack of imagination or knowledge, but what sort of peripherals would that open the board up to?
Since you can get Linux running on a MicroBlaze, you can get a lot of commercial PCIE devices to run there using Linux PCIE device drivers - from PCIE-to-M.2 storage cards to multi-gig (2.5G, 5G) network adapters. And perhaps at some point you (or someone else) will want to design an actual video card - with a dedicated FPGA and dedicated memory - and connecting it to the host via PCIE makes a lot of sense. Or if you run out of FPGA resources in the main FPGA and want to move some functionality to an external FPGA - again, PCIE 2 x4 provides 20 Gbps of bandwidth in each direction, or perhaps even more if some sort of custom protocol (like Aurora) is used instead of PCIE while still running over the same lanes, but at a higher bitrate (GTP transceivers can run at up to 6.25 Gbps per lane instead of PCIE 2's 5 Gbps, giving a total of 25 Gbps of bandwidth in each direction). I can see myself using that port with another FPGA board just to play around with different protocols over multi-gigabit serial lines - at the end of the day, a PCIE connector is just a connector; nothing in it says that you can't use other protocols over it.
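Checking those lane-rate figures (the 80% payload factor comes from the 8b/10b encoding both PCIe 2.0 and typical Aurora configurations use; it is added here for context, not quoted from the post):

```python
# Line rates per direction for an x4 link.  PCIe 2.0 runs 5.0 GT/s per
# lane; Artix-7 GTP transceivers top out at 6.25 Gbps.  Both use 8b/10b
# encoding, so usable payload is 80% of the line rate.
def link_gbps(lanes, gbps_per_lane, after_8b10b=False):
    raw = lanes * gbps_per_lane
    return raw * 0.8 if after_8b10b else raw

print(link_gbps(4, 5.0))                  # PCIe 2 x4 line rate
print(link_gbps(4, 6.25))                 # Aurora over GTP at max lane rate
print(link_gbps(4, 5.0, after_8b10b=True))  # PCIe 2 x4 payload bandwidth
```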

What about DisplayPort?  Is that a big thing/must-have?
To be honest I'm not really convinced of its utility given that we have HDMI 1080p@60. I frankly don't think an A100T will have enough resources for a CPU, all the peripherals AND a GPU powerful enough to generate an image of sufficient complexity at higher than FullHD resolution. The only reason I offered it for the carrier was that it was a relatively painless way to expose those transceiver lanes externally. But now I think a PCIE connector would be much better, because it can also provide power to a connected board, as well as having both transmitter lanes and receiver lanes going through the same connector.

Is there any way the 4-port USB hub could be made use of by a soft-core CPU running something that doesn't have a USB stack like Linux does?  I'd still like to be able to plug a USB keyboard and mouse in with, for example, a 16-bit Motorola soft-core CPU - or even the dreaded Z80 or other 8-bit processors.  This is a pretty key requirement for the board - it's up there with the SODIMM, really.  The only solution I could come up with myself was to use a dedicated chip like the CH559 to handle the USB peripheral and send codes to the host via serial.  I don't really want to include something as specific and niche as that on a more generic board that will likely be used for more powerful applications that can handle the USB ports natively.  Or will it be a case of running a MicroBlaze CPU to handle the USB in addition to the actual emulated CPU, whatever that may be, and passing coordinates/keypresses etc. to it internally?
I think using Microblaze with Linux just to handle a keyboard is nothing but a massive waste of resources. But since we are going to have some sort of GPIO header anyway, you can always design a small plug-in PCB with a CH559 and have your keyboard issue taken care of that way. Or have a few PMOD-like connectors and design a PMOD-compliant CH559 module. The point is, since that connection isn't going to be a particularly high-speed one, we can get away with using cheap PMOD-like headers. And as a bonus, you can connect some of the myriad of PMOD modules which are already on the market if you happen to have some (or buy some from, say, Digilent).
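To illustrate the FPGA side of such a CH559 module: if the chip is set up to spit out key codes over a plain asynchronous serial line, all the FPGA needs is a small UART receiver on a PMOD pin. Below is a minimal 8N1 receiver sketch - the module name, ports and CLKS_PER_BIT value are my own assumptions, not from any existing project file:

```systemverilog
// Minimal 8N1 UART receiver sketch for taking key codes from an
// external CH559 over a PMOD pin. CLKS_PER_BIT = clk_freq / baud
// (e.g. 100 MHz / 115200 baud ~= 868).
module uart_rx #(
    parameter int CLKS_PER_BIT = 868
)(
    input  logic       clk,
    input  logic       rx,          // serial line from the CH559
    output logic [7:0] data,        // received byte
    output logic       data_valid   // one-cycle strobe per byte
);
    typedef enum logic [1:0] {IDLE, START, BITS, STOP} state_t;
    state_t state = IDLE;
    int unsigned clk_cnt = 0;
    logic [2:0]  bit_idx = 0;

    always_ff @(posedge clk) begin
        data_valid <= 1'b0;
        case (state)
            IDLE:  if (!rx) begin state <= START; clk_cnt <= 0; end
            START: // sample mid-bit to confirm a real start bit
                if (clk_cnt == CLKS_PER_BIT/2) begin
                    if (!rx) begin state <= BITS; clk_cnt <= 0; bit_idx <= 0; end
                    else state <= IDLE;         // just a glitch, go back
                end else clk_cnt <= clk_cnt + 1;
            BITS:  // sample each data bit one full bit-time apart, LSB first
                if (clk_cnt == CLKS_PER_BIT-1) begin
                    clk_cnt <= 0;
                    data[bit_idx] <= rx;
                    if (bit_idx == 7) state <= STOP;
                    else bit_idx <= bit_idx + 1;
                end else clk_cnt <= clk_cnt + 1;
            STOP:  // wait out the stop bit, then strobe data_valid
                if (clk_cnt == CLKS_PER_BIT-1) begin
                    data_valid <= 1'b1;
                    state <= IDLE;
                end else clk_cnt <= clk_cnt + 1;
        endcase
    end
endmodule
```

Whatever sits behind it (a scancode-to-Z80-port latch, a FIFO, etc.) is then trivially simple compared to a real USB stack.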

I got the simulation running in Vivado.  Like you said, a single click and off it went.  Have exactly the same output as your picture too, so it's looking good.  I just need to get a grasp of how it all fits together now and get comfortable using it like I am with Quartus.  Then I'll be more use in trying to apply BrianHG's multi-port adapter to the simulation and get the BFM set up?
Cool. I think at first you need to get comfortable using the MIG UI to issue commands to it and write/read data. So, once you've read the relevant section of UG586 which explains how to use it, try playing with it, adding some more writes and reads, and see that it behaves like you think it should. It's important that you understand very well how to interact with it and how it does things, because the BFM is meant to replicate its outside behavior (meaning the "UI" would need to behave exactly like the real UI would in any circumstances). Once you have that familiarity, we can proceed with making a model for it.
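For reference, driving the MIG's native user interface from a testbench boils down to wiggling the app_* ports described in UG586. A sketch of a single write followed by a read might look like the fragment below - it assumes BL8 and a 512-bit UI data width so one app_wdf beat carries a whole burst; the exact app_addr width depends on the configured memory, and the surrounding signal declarations belong to the enclosing testbench:

```systemverilog
// Testbench fragment sketch: one write then one read through the MIG's
// native user interface, using the UG586 app_* port names.
logic [511:0] captured;

task automatic mig_write(input logic [27:0] addr, input logic [511:0] wdata);
    @(posedge ui_clk);
    app_en       <= 1'b1;
    app_cmd      <= 3'b000;        // 000 = write
    app_addr     <= addr;
    app_wdf_wren <= 1'b1;
    app_wdf_end  <= 1'b1;          // last (and only) data beat of the burst
    app_wdf_data <= wdata;
    // hold everything until the controller accepts both command and data
    do @(posedge ui_clk); while (!(app_rdy && app_wdf_rdy));
    app_en <= 1'b0; app_wdf_wren <= 1'b0; app_wdf_end <= 1'b0;
endtask

task automatic mig_read(input logic [27:0] addr, output logic [511:0] rdata);
    @(posedge ui_clk);
    app_en   <= 1'b1;
    app_cmd  <= 3'b001;            // 001 = read
    app_addr <= addr;
    do @(posedge ui_clk); while (!app_rdy);
    app_en <= 1'b0;
    // read data comes back some cycles later, flagged by app_rd_data_valid
    do @(posedge ui_clk); while (!app_rd_data_valid);
    rdata = app_rd_data;
endtask

initial begin
    wait (init_calib_complete);    // never issue commands before calibration
    mig_write(28'h000_0000, {16{32'hDEADBEEF}});
    mig_read (28'h000_0000, captured);
end
```

Adding a few more writes and reads to the example design's stimulus and watching the waveforms is exactly the kind of playing-around I mean.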

To be honest, that is not how I use the MIG, because I always use it through the AXI bus, and for that there is a free AXI Verification IP which can emulate memory (among other things). That's what I use for anything memory-related, and you can also pre-fill certain addresses of that "memory" with specific data to help with your testing - for example, if the component you are testing expects specific data at specific memory location(s), you can set that up as part of the testbench. And since all but the simplest of my designs use AXI, it's natural for me to prefer using that. Incidentally, if you decide to use Microblaze at some point, you will have to use AXI because that's the bus this CPU uses to talk to any external devices - be it memory or IO devices.

Now, as far as Vivado goes, here is how I do things. Let me preface it by saying that it's not necessarily the best way, or the only way, but merely what works for me.
1. I don't use Vivado's built in editor, because frankly it sucks. Instead I use VS Code with a plugin for SystemVerilog. I open "<project_name>.srcs" folder in VS Code, which makes all sources I'd be interested in available for me to edit.
2. Depending on the project, there will be one, two, or three subfolders - the one named "sources_1" is where your synthesizable HDL goes (if you click "Add Sources" option on the left panel and select "Add or create design sources" - that's where all these files will go), another "sim_1" is where your simulation-only HDL goes (option "Add or create simulation sources" in the same dialog will land files in that folder), and yet another one "constrs_1" (option "Add or create constraints"). There might be more folders created later if you have multiple simulation sets, but let's leave it aside for now.
3. Once you've made any changes in HDL and want to re-run the simulation, there is a button for that in the top toolbar, see the first screenshot in the attachment. By default, the simulation will run for the length of time configured in the settings dialog (see second screenshot). The other controls next to the "Restart simulation" button allow advancing the simulation by some more time, or restarting it from the beginning without recompiling sources.
4. Simulation waveform interface is broadly similar in form and function to what I remember of Modelsim (or any other HDL sim that I've ever seen for that matter). Some useful controls:
  • Left-right cursor keys move to the next transition of the currently-selected trace; the same keys while holding "Shift" allow measuring time intervals between the current cursor position and the next edge; up-down keys select the trace above/below the current one
  • The mouse scroll wheel scrolls the list of traces, the wheel while holding "Control" changes the timescale, and the wheel while holding "Shift" scrolls the waveforms horizontally (along the time axis)
  • You can assign different colors to traces, change the radix for multibit traces and other things via context menu on a trace. You can add a separator, create a new group or a virtual bus there as well. To rearrange traces, just drag them around.
  • If you want to add more traces (for example for internal signals which are not displayed by default), use the "Scope" panel to the left to browse through the module hierarchy, pick the instance you want traces to be added from, and drag traces from the "Objects" panel into the waveform viewer
That should get you started, but as always, feel free to experiment; if you somehow screw something up, you can always go back to the initial working state by simply getting rid of your changes using git. If you have any questions or problems, feel free to post them here and I will try to help.

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #170 on: January 25, 2023, 01:37:26 pm »
Interesting.  Would make sense to use the internal ADC and implement an HDL PWM with pin out to a FET to control the fan, no?  Only one external part needed that way (and maybe a couple of resistors), and you get thermally-controlled fan regulation for little effort.
We've got to make sure it goes on full blast by default, as that seems to be the safest option. Also there are some complications with using the XADC when you have the DDR3 controller as well, because the MIG uses it internally for the same reason (it wants the die temperature for calibration). It's nothing complex really - you just need to feed the current temperature measurement to the controller, but it can be done.

A pull-up on the gate will do that.  If the FPGA isn't pulling the gate low, the fan will default to full-on.
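Putting those two ideas together, the thermally-controlled PWM could be as simple as comparing the raw XADC temperature code against two thresholds and ramping the duty cycle in between. A sketch follows - the threshold codes are my own approximations derived from the UG480 transfer function T = code × 503.975/4096 − 273.15, and all names are assumptions; the same raw reading could also be fed to the MIG's device_temp_i port if the MIG is configured not to instantiate its own XADC:

```systemverilog
// Sketch only: map the raw XADC temperature code to a fan PWM duty cycle.
// The external pull-up on the FET gate gives full-on whenever the FPGA
// isn't driving the pin (unconfigured, in reset, etc.).
module fan_pwm (
    input  logic        clk,
    input  logic [11:0] temp_code,  // raw 12-bit XADC temperature reading
    output logic        fan_gate    // to the FET gate
);
    // Approximate codes from UG480: T = code*503.975/4096 - 273.15
    localparam [11:0] CODE_LOW  = 12'd2440;  // ~27 degC -> idle speed
    localparam [11:0] CODE_HIGH = 12'd2628;  // ~50 degC -> full speed

    logic [7:0] cnt = '0, duty = '1;         // duty defaults to full-on

    always_ff @(posedge clk) begin
        cnt <= cnt + 1'b1;
        if      (temp_code >= CODE_HIGH) duty <= 8'hFF;
        else if (temp_code <= CODE_LOW)  duty <= 8'd64;
        // linear ramp between the two thresholds (range fits in 8 bits)
        else duty <= 8'd64 + 8'(temp_code - CODE_LOW);
    end

    assign fan_gate = (cnt < duty);
endmodule
```

In a real design you'd clock the counter from a divided clock so the PWM frequency lands in the usual ~25 kHz fan range, but the structure stays the same.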

Since you can get Linux running on a Microblaze, you can get a lot of commercial PCIE devices to run there using Linux PCIE device drivers - from PCIE-to-M.2 storage cards to multi-gig (2.5G, 5G) network adapters. And perhaps at some point you (or someone else) will want to design an actual video card - with a dedicated FPGA and dedicated memory - and connecting it to a host via PCIE makes a lot of sense.

It'll be an SBC, like a Raspberry Pi, but with an FPGA instead of a dedicated processor and the ability to attach bona-fide PC expansion cards... :o

Or if you run out of FPGA resources in the main FPGA and want to move some functionality to an external FPGA - again, PCIE 2 x4 will provide 20 Gbps of bandwidth in each direction, or perhaps even more if some sort of custom protocol (like Aurora) is used instead of PCIE while still running over the same lanes, but at a higher bitrate (GTP transceivers can run at up to 6.25 Gbps per lane instead of PCIE 2's 5 Gbps, giving a total of 25 Gbps of bandwidth in each direction). I can see myself using that port with another FPGA board just to play around with different protocols over multi-gigabit serial lines - at the end of the day, a PCIE connector is just a connector; nothing in it says you can't use other protocols over it.

Just so I'm absolutely 100% certain we're thinking about the exact same thing, what sort of PCIE connector are you thinking of?  Up to this point, I have been thinking of a PCIE socket on the PCB, but would it be worth adding a PCIE edge connector to the PCB to allow the board to be a PCIE peripheral itself, like those expensive dev boards I see here and there?  Would that even be possible with a socket as well?  With the socket, am I going to need to provide a 12V source for it as well to be fully compliant and compatible with all peripherals?

I think using Microblaze with Linux just to handle a keyboard is nothing but a massive waste of resources.

I agree.

But since we are going to have some sort of GPIO header anyway, you can always design a small plug-in PCB with a CH559 and have your keyboard issue taken care of that way. Or have a few PMOD-like connectors and design a PMOD-compliant CH559 module. The point is, since that connection isn't going to be a particularly high-speed one, we can get away with using cheap PMOD-like headers. And as a bonus, you can connect some of the myriad of PMOD modules which are already on the market if you happen to have some (or buy some from, say, Digilent).

Well I mentioned PMOD connectors in a previous post to expose some of the free IO, as well as maybe a Raspberry Pi-compatible header etc., depending on how many free IO there will be.  I don't want to waste any.  But using a CH559 - or whatever other solution I want to use - is actually a really good idea and I wish I'd had it! :-DD

To be honest, that is not how I use the MIG, because I always use it through the AXI bus, and for that there is a free AXI Verification IP which can emulate memory (among other things). That's what I use for anything memory-related, and you can also pre-fill certain addresses of that "memory" with specific data to help with your testing - for example, if the component you are testing expects specific data at specific memory location(s), you can set that up as part of the testbench. And since all but the simplest of my designs use AXI, it's natural for me to prefer using that. Incidentally, if you decide to use Microblaze at some point, you will have to use AXI because that's the bus this CPU uses to talk to any external devices - be it memory or IO devices.

I'm not going to be using AXI for this memory controller, am I?  This is how I think it all fits together (in my mind, at least):

 

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7789
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #171 on: January 25, 2023, 03:37:31 pm »
Ok, here you go.  I've attached your MIG setup which will be backwards compatible with your existing GPU project.

There are 2 files:
BrianHG_CONTROLLER_v16_Xilinx_MIG_DDR3_top.sv :
  Replaces my BrianHG_DDR3_CONTROLLER_v16_top.sv.  Read inside carefully, the:
Code: [Select]
//xxxxxxxxx
//
//xxxxxxxxx
Denotes something you should pay attention to and may need changing.

You will see it now only requires 2 other files from my original DDR3 project:
Code: [Select]
//   - BrianHG_DDR3_COMMANDER_v16.sv                     -> v1.6 High FMAX speed multi-port read and write requests and cache, commands the BrianHG_DDR3_PHY_SEQ.sv sequencer.
//   - BrianHG_DDR3_FIFOs.sv                             -> Serial shifting logic FIFOs.

There is a second new file, BrianHG_Xilinx_MIG_DDR3.sv.
This file replaces my BrianHG_DDR3_PHY_SEQ_v16.sv for the new controller as seen in this diagram:


Again, in the code, read the:
Code: [Select]
//xxxxxxxxx
//
//xxxxxxxxx
Which denotes something you should pay attention to and may need changing.

You need to place the Xilinx Vivado MIG at the bottom of this source file and wire it to all of my 'SEQ_xxx' command control ports, with any additional logic, wires, clocks, or signals which may be required for the system to function.

I've commented out all the unused parameters.  The ones remaining are either used by existing modules in your GPU system, or may just be there as useful notes or functions which may be added in the future.

The only other change is that for the wide 128bit bus setting used for my VGA generator's display read channel, you will need to switch that read port from 128bits to 512bits, or whatever the MIG bus width becomes.  This will ensure unbroken full speed read bursting when the VGA generator reads data from DDR3 ram to fill the video output line buffer when displaying each line of video.  (IE: with 512bit, fill an entire line of video before the Z80 executes 2-3 instructions, then wait for the next H-Sync to make the next line.)

As for the VGA generator, I'm now patching the source file 'BrianHG_GFX_Video_Line_Buffer.sv' to remove its dependence on Quartus' altdpram megafunction.  I have already made the new working code (works in ModelSim), but due to a Quartus-related bug it can't infer a wide block-ram which contains 'byte-enable' without taking hours to compile (yes, it's a compiler bug).  I'll send you the patched code next.  Most likely Vivado doesn't have the same bug, so it should work.  But if it doesn't, I left space at the bottom of the code to allow you to manually insert Vivado's BRAM function.
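For the Vivado side, the usual way to get a byte-enable block RAM without any vendor megafunction is the inference template from the synthesis guide (UG901): a for-loop over byte lanes with a part-select write. A generic sketch of that pattern (the module and parameter names here are mine, not from the GPU project):

```systemverilog
// Simple-dual-port RAM with per-byte write enables, written in the
// style Vivado recognizes and maps onto block RAM with byte enables.
module bram_be #(
    parameter int DW = 128,          // data width (multiple of 8)
    parameter int AW = 9,            // address width
    localparam int NB = DW/8         // number of byte lanes
)(
    input  logic          clk,
    input  logic [NB-1:0] we,        // one write enable per byte lane
    input  logic [AW-1:0] waddr, raddr,
    input  logic [DW-1:0] din,
    output logic [DW-1:0] dout
);
    logic [DW-1:0] mem [0:(1<<AW)-1];

    always_ff @(posedge clk) begin
        // byte-lane write: each enabled lane updates its own 8-bit slice
        for (int i = 0; i < NB; i++)
            if (we[i]) mem[waddr][i*8 +: 8] <= din[i*8 +: 8];
        // registered read on the second port
        dout <= mem[raddr];
    end
endmodule
```

If Vivado infers this cleanly (it normally does), the space BrianHG left at the bottom of the line-buffer code shouldn't even be needed.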

(Also, remember there is a parameter for the ellipse generator for using the Quartus multiply megafunction in place of a standard Verilog 'C <= A * B'; you also need to turn that off.)


(Oops - in BrianHG_CONTROLLER_v16_Xilinx_MIG_DDR3_top.sv, I forgot to erase lines 497-499.  Those ports no longer exist.)

(I will also edit my testbench 'BrianHG_DDR3_CONTROLLER_v16_top_tb.sv' into 'BrianHG_CONTROLLER_v16_Xilinx_MIG_DDR3_top_tb.sv' so you may test that my multiport successfully generates unbroken sequential bursts with Xilinx's MIG controller.  It will basically be the same as the original, but, it will just call the new 'BrianHG_CONTROLLER_v16_Xilinx_MIG_DDR3_top.sv' demonstrating backwards compatibility.)
« Last Edit: January 25, 2023, 04:14:28 pm by BrianHG »
 
The following users thanked this post: nockieboy

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #172 on: January 26, 2023, 02:10:20 am »
Just so I'm absolutely 100% certain we're thinking about the exact same thing, what sort of PCIE connector are you thinking of?  Up to this point, I have been thinking of a PCIE socket on the PCB, but would it be worth adding a PCIE edge connector to the PCB to allow the board to be a PCIE peripheral itself, like those expensive dev boards I see here and there?  Would that even be possible with a socket as well?  With the socket, am I going to need to provide a 12V source for it as well to be fully compliant and compatible with all peripherals?
Yes, I'm talking about a PCIE x4 connector, kinda like what you have on PC motherboards. As for 12 V - well, we can power the system with that voltage (this is actually a good idea because it's going to reduce the current in the connecting cable), and have a DC-DC converter to convert 12 V to 5 V. This will also allow us to use 12 V fans, which are much more plentiful on the market.
As for connecting two boards - there are cables like this one: https://www.samtec.com/products/pciec-064-0050-ec-ec-cp which allows connecting two PCIE connectors without an edge connector. You can also find similar male-to-male PCIE cables on the likes of Aliexpress for less money - they are not as abundant as male-to-female extension cables, but they do exist and can be found. Or you can design another FPGA board with an edge connector so that you can plug it into this board.

I'm not going to be using AXI for this memory controller, am I?  This is how I think it all fits together (in my mind, at least):
I don't know what you are going to use - that's for you and BrianHG to decide. I use AXI because everything in the Xilinx world speaks AXI, so it was natural for me to adopt it too. Just to give you an idea, here is a system I designed for a board from my signature:


As you can see, there are two big crossbar interconnects:
1. The first (axi_smc) connects a MIG DDR2 controller and an execute-in-place (XIP) QSPI flash on the slave side with the CPU's instruction (M_AXI_IC) and data (M_AXI_DC) ports, so that the CPU can fetch commands and read/write data from any address in DDR2 and XIP (of course XIP is read-only, so you can't write anything there). There are also two more master ports for DMA-capable IPs - one of which (v_frmbuf_wr_0) writes video frames into memory (in this case those frames are generated by v_tpg_0 - a video test pattern generator), and another (v_frmbuf_rd_0) reads frames from the memory; eventually they are output via the HDMI video output.
2. The second (microblaze_0_axi_periph) connects the "peripheral" port of the CPU with all other modules via an AXI4-Lite bus - things like the debug module (mdm_1), interrupt controller (microblaze_0_axi_intc), I2C master module (axi_iic_0), GPIO module (axi_gpio_0), and the configuration ports of all other blocks.

Those crossbar interconnects handle address mapping and translation as per the address map:

Here each master has its own address space, and you can set up address windows for slave interfaces however you want or need. For example, you can see that Microblaze's data port can access everything, while the instruction port can only access DDR2 memory, XIP flash and local memory (not shown in the diagram because it uses a dedicated port of the Microblaze and is not routed through the main interconnect), and the video reader and writer can only access DDR2 memory (XIP is not mapped into their address spaces).

Oh, and BTW - the only part of this diagram that I designed myself (and wrote HDL for) is HDMI Video Out, all other components are provided by Xilinx for free! ;)

Now you will hopefully understand why I use AXI pretty much everywhere :)
« Last Edit: January 26, 2023, 02:23:45 am by asmi »
 
The following users thanked this post: nockieboy

Offline BrianHG

  • Super Contributor
  • ***
  • Posts: 7789
  • Country: ca
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #173 on: January 26, 2023, 03:15:31 am »
It's up to Nockieboy what he wants to do.  All I offered was a way of keeping backwards compatibility with his existing GPU code, as all it took was a few cut-and-pastes on my side.  I no longer have time to help him adapt to an entirely new layout and architecture.  That will now become your job, asmi.
 

Offline nockieboyTopic starter

  • Super Contributor
  • ***
  • Posts: 1812
  • Country: england
Re: Planning/design/review for a 6-layer Xilinx Artix-7 board for DIY computer.
« Reply #174 on: January 26, 2023, 09:03:15 am »
It's up to Nockieboy what he wants to do.  All I offered was a way of keeping backwards compatibility with his existing GPU code, as all it took was a few cut-and-pastes on my side.  I no longer have time to help him adapt to an entirely new layout and architecture.  That will now become your job, asmi.

There I was thinking switching FPGA wouldn't be that much of an upheaval to the project...  :-\

I'm going to have to go the path of least resistance, which for me at the moment is to retain the existing architecture as much as possible (as I understand it) and keep the interconnects within the GPU project close to the metal, rather than using something like AXI to make it all plug 'n' play.  I guess this means I'll have to write an AXI interface to use a MicroBlaze with the GPU, but I did that with Wishbone, so perhaps it's possible for me to do it with AXI too.  It still means any other soft-core CPU (I'm looking at you, 68000 et al.) will be straightforward to interface to the GPU via a bridge like the Z80_Bridge module.
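For what it's worth, an AXI4-Lite slave (which is all a MicroBlaze needs to poke registers in the GPU) isn't much more work than a Wishbone slave - it's just five small handshaked channels. Below is a deliberately minimal single-register sketch to show the shape of it; all names are mine, it ignores the address and always responds OKAY, and a real peripheral would of course decode awaddr/araddr:

```systemverilog
// Minimal single-register AXI4-Lite slave sketch (no address decode).
module axil_reg (
    input  logic        clk, rstn,
    // write address + data + response channels
    input  logic [31:0] s_axi_awaddr,
    input  logic        s_axi_awvalid,
    output logic        s_axi_awready,
    input  logic [31:0] s_axi_wdata,
    input  logic        s_axi_wvalid,
    output logic        s_axi_wready,
    output logic [1:0]  s_axi_bresp,
    output logic        s_axi_bvalid,
    input  logic        s_axi_bready,
    // read address + data channels
    input  logic [31:0] s_axi_araddr,
    input  logic        s_axi_arvalid,
    output logic        s_axi_arready,
    output logic [31:0] s_axi_rdata,
    output logic [1:0]  s_axi_rresp,
    output logic        s_axi_rvalid,
    input  logic        s_axi_rready
);
    logic [31:0] reg0;

    assign s_axi_bresp = 2'b00;   // OKAY
    assign s_axi_rresp = 2'b00;   // OKAY

    // accept a write only when address and data arrive together and
    // the previous response has been taken
    assign s_axi_awready = s_axi_awvalid && s_axi_wvalid && !s_axi_bvalid;
    assign s_axi_wready  = s_axi_awready;
    // accept a read address when no read response is pending
    assign s_axi_arready = s_axi_arvalid && !s_axi_rvalid;

    always_ff @(posedge clk) begin
        if (!rstn) begin
            s_axi_bvalid <= 1'b0;
            s_axi_rvalid <= 1'b0;
            reg0         <= '0;
        end else begin
            // write: latch data, raise BVALID until master takes it
            if (s_axi_awready) begin
                reg0         <= s_axi_wdata;
                s_axi_bvalid <= 1'b1;
            end else if (s_axi_bready)
                s_axi_bvalid <= 1'b0;
            // read: return the register, raise RVALID until taken
            if (s_axi_arready) begin
                s_axi_rdata  <= reg0;
                s_axi_rvalid <= 1'b1;
            end else if (s_axi_rready)
                s_axi_rvalid <= 1'b0;
        end
    end
endmodule
```

Wrap something like this around the GPU's existing register interface and the MicroBlaze side of the bargain is done, while the GPU internals stay close to the metal.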

In terms of 'what's next', aside from having a play with Vivado and the simulation software to get familiar with it and send some commands to the DDR3, I need to start looking at the MIG's HDL itself and working out what signals, ports, buses are exposed by it to look into wiring it to BrianHG_CONTROLLER_v16_Xilinx_MIG_DDR3_top.sv.

 

