SOLVED: Multi-tier LUT based round-robin arbiter, Modelsim=Ok, Quartus=1h compi

BrianHG:
I made this LUT selection generator, which works great and runs instantly in Modelsim, but in Quartus it takes forever to compile instead of the usual ~30 seconds.

Note that there are up to 16 read and 16 write channels with a priority strength parameter assigned to each between 0 and 7.

In Quartus, this code took about an hour in the Analysis & Synthesis stage:


--- Code: ---// Selection table size = req(16),last_accessed channel(4) = 20 bits...
localparam  LUT_BASE_last = 0          ; // Defines the LSB position for the 4 bit 'last_accessed'.
localparam  LUT_BASE_req  = 4          ; // Defines the LSB position for the 16 bit 'req'.

localparam  LUT_SIZE   = 1<<20         ; // Set the size of the lookup table: 2^20 entries, one per 20 bit {req,last_accessed} address.
logic [3:0] LUT_RC_SEL [0:LUT_SIZE-1]  ; // A lookup table which returns which read channel should be requested next.
logic [2:0] LUT_RC_PRI [0:LUT_SIZE-1]  ; // A lookup table which returns the priority of the requested read channel.
logic [3:0] LUT_WC_SEL [0:LUT_SIZE-1]  ; // A lookup table which returns which write channel should be requested next.
logic [2:0] LUT_WC_PRI [0:LUT_SIZE-1]  ; // A lookup table which the priority of the requested write channel.

// **************************************************************************************************************************
// Initialize a set of lookup tables which derives which read/write port is next granted access to the DDR3_PHY_SEQ.
// **************************************************************************************************************************
initial begin
// Clear tables.
for (int z=0 ; z<LUT_SIZE ; z++) begin
                            LUT_RC_SEL[z]  = 0 ;
                            LUT_RC_PRI[z]  = 0 ;
                            LUT_WC_SEL[z]  = 0 ;
                            LUT_WC_PRI[z]  = 0 ;
                            end // for z

// **************************************************
// Construct port selection priority decoding table.
// **************************************************
for (int z=0 ; z<LUT_SIZE ; z++) begin

// *******************************
// Construct read selection lookup table.
// *******************************
    loop_break     = 0;

    for (int p=7 ; p>=0 ; p-- ) begin                          // 'p' scans in order of highest priority to lowest.
        for (int i=0 ; i<16 ; i++ ) begin                      // 'i' scans channels 0-15.
   
        // round robin arbiter priority selection, ensure the previous accessed R/W channel of equal priority is considered last before
        // access is granted once again.
        RC_cs  = 4'(4'(i) + z[LUT_BASE_last+3:LUT_BASE_last] + 1) ;          // RC_cs begins the scan at 'i' + the last read channel + 1

            if ( RC_cs<PORT_R_TOTAL && !loop_break ) begin  // Only scan from the available read ports.

                if ( z[LUT_BASE_req+RC_cs] && ( PORT_R_PRIORITY[RC_cs] == p[2:0] ) ) begin
                                                                            LUT_RC_SEL[z]  = RC_cs;  // If there is a DDR3 read_req hit with port priority 'p', set that read channel,
                                                                            LUT_RC_PRI[z]  = PORT_R_PRIORITY[RC_cs];  // store that read channel's priority,
                                                                            loop_break     = 1 ;     // and ignore the rest of the scan loop.
                                                                            end
            end

        end // for i
    end // for p


// *******************************
// Construct write selection lookup table.
// *******************************
    loop_break     = 0;

    for (int p=7 ; p>=0 ; p-- ) begin                          // 'p' scans in order of highest priority to lowest.
        for (int i=0 ; i<16 ; i++ ) begin                      // 'i' scans channels 0-15.

        // round robin arbiter priority selection, ensure the previous accessed R/W channel of equal priority is considered last before
        // access is granted once again.
        WC_cs  = 4'(4'(i) + z[LUT_BASE_last+3:LUT_BASE_last] + 1) ;          // WC_cs begins the scan at 'i' + the last write channel + 1

            if ( WC_cs<PORT_W_TOTAL && !loop_break ) begin  // Only scan from the available write ports.

                if ( z[LUT_BASE_req+WC_cs] && ( PORT_W_PRIORITY[WC_cs] == p[2:0] ) ) begin
                                                                            LUT_WC_SEL[z]  = WC_cs;  // If there is a DDR3 write_req hit with port priority 'p', set that write channel,
                                                                            LUT_WC_PRI[z]  = PORT_W_PRIORITY[WC_cs];  // store that write channel's priority,
                                                                            loop_break     = 1 ;     // and ignore the rest of the scan loop.
                                                                            end
            end

        end // for i
    end // for p

   
  end // for z
end // Initial


// Later on...

read_req_sel   = LUT_RC_SEL[{RC_ddr3_read_req,  last_read_req_chan }];
write_req_sel  = LUT_WC_SEL[{WC_ddr3_write_req, last_write_req_chan}];


--- End code ---


Is there anything I can do?
My guess is that Quartus is attempting to reduce the table to the fewest possible logic gates.
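
For a sense of scale, here is a rough size estimate of the tables above, as a simulation-only sketch. It assumes the full 20 bit address (16 'req' bits plus 4 'last_accessed' bits); the module name 'lut_size_estimate' and its localparam names are made up for illustration.

--- Code: ---module lut_size_estimate;
    localparam int ADDR_BITS  = 16 + 4;                                 // req(16) + last_accessed(4)
    localparam int ENTRIES    = 1 << ADDR_BITS;                         // Entries per table.
    localparam int SEL_BITS   = 4;                                      // Width of each *_SEL table.
    localparam int PRI_BITS   = 3;                                      // Width of each *_PRI table.
    localparam int TOTAL_BITS = 2 * ENTRIES * (SEL_BITS + PRI_BITS);    // Read tables + write tables.

    initial begin
        $display("entries per table = %0d", ENTRIES);     // entries per table = 1048576
        $display("total table bits  = %0d", TOTAL_BITS);  // total table bits  = 14680064
    end
endmodule

--- End code ---

That is roughly 14.7 Mbit of constant data, and the initial block needs about 268 million inner-loop passes (2^20 entries x 2 scans x 8 priorities x 16 channels) to fill it. Synthesis has to evaluate all of that before it can minimize or map anything, which would be consistent with the long compile, though that remains a guess.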

I also tried driving 'read_req_sel' and 'write_req_sel' directly within an 'always_comb', feeding 'R/WC_ddr3_read/write_req' and 'last_read/write_req_chan' in place of the 'z' address.  That version compiles in Quartus, but it occasionally skips the correct bus selection.

Having ports with assigned priority parameters is too powerful a feature to give up for a dumb round-robin arbiter which just cycles through all the requesting ports.  Coding such a scheme would be dead simple, but I want some ports to have an optional overriding priority group which prevents the lower-priority request ports from being granted.

BrianHG:
Ok, I broke the code back up into the original all-combinational logic, no LUT.  I just had to split the loop-within-a-loop into 2 smaller sequential loops, and the original bug has disappeared.

Compiles fine in Quartus now.
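
One way to read "2 smaller sequential loops" is a two-pass scan: the first loop finds the highest priority among the requesting ports, and the second does the round-robin scan only among ports at that priority. The sketch below is an assumption about that structure, not the actual code; the module name 'rr_pri_arb_sketch', its ports, and the packed 'PRIORITY' parameter are hypothetical.

--- Code: ---module rr_pri_arb_sketch #(
    parameter int        PORTS    = 16,  // Number of ports in use (1-16).
    parameter bit [47:0] PRIORITY = '0   // Hypothetical packing: 3 bits per port, port 0 in bits [2:0].
)(
    input  logic [15:0] req,             // One request bit per port.
    input  logic [3:0]  last_chan,       // Channel which was granted last time.
    output logic        any_req,         // High if any port is requesting.
    output logic [3:0]  sel              // Channel to grant next.
);
    logic [2:0] top_pri;
    logic [3:0] cs;
    logic       found;

    always_comb begin
        // Pass 1: find the highest priority among the requesting ports.
        top_pri = 3'd0;
        any_req = 1'b0;
        for (int i = 0; i < PORTS; i++) begin
            if (req[i]) begin
                any_req = 1'b1;
                if (PRIORITY[i*3 +: 3] > top_pri) top_pri = PRIORITY[i*3 +: 3];
            end
        end

        // Pass 2: round robin among the ports at 'top_pri', starting after 'last_chan'.
        sel   = 4'd0;
        found = 1'b0;
        cs    = 4'd0;
        for (int i = 1; i <= 16; i++) begin
            cs = 4'(4'(i) + last_chan);  // Wraps modulo 16, so 'last_chan' itself is checked last.
            if (!found && cs < PORTS && req[cs] && PRIORITY[cs*3 +: 3] == top_pri) begin
                sel   = cs;
                found = 1'b1;
            end
        end
    end
endmodule

--- End code ---

Each pass is a single 16-iteration loop, so the unrolled logic is two 16-stage chains rather than one 128-stage chain (8 priorities x 16 channels).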

SiliconWizard:

--- Quote from: BrianHG on June 14, 2021, 03:20:22 am ---Ok, I broke the code back up into the original all-combinational logic, no LUT.  I just had to split the loop-within-a-loop into 2 smaller sequential loops, and the original bug has disappeared.

--- End quote ---

I'm curious what it looks like now?
Not sure about this particular case, but nested if's inside loops tend to yield particularly inefficient structures. Compilers tend to be pretty dumb with if's (VHDL or Verilog alike...)

BrianHG:

--- Quote from: SiliconWizard on June 14, 2021, 05:12:19 pm ---
I'm curious what it looks like now?
Not sure about this particular case, but nested if's inside loops tend to yield particularly inefficient structures. Compilers tend to be pretty dumb with if's (VHDL or Verilog alike...)

--- End quote ---

No problem.  Originally, this was the penultimate version, which simulated with a glitch and which Quartus also hated (old version):


--- Code: ---any_read_req = (RC_ddr3_read_req != 0) ; // Go high if there are any DDR3 reads required.

read_req_sel = 4'd0 ;
RC_break     = 0;

// Scan for the top priority DDR3 read req, scanning the previous read req channel last unless
// that channel is in a sequential burst within the same row or has a priority boost.
for (int p=31 ; p>=0 ; p-- ) begin                                                               // 'p' scans in order of highest priority to lowest.
    for (int i=(last_read_req_chan+1) ; i<(17+last_read_req_chan) ; i++ ) begin                  // 'i' scans starting at the next read channel port and counts up and around.
        if ( 4'(i)<PORT_R_TOTAL && !RC_break ) begin                                             // Only scan from the available read ports.

            // Add 1/2 to the read port's priority weight if the page_hit is true.
            // Add 8 to the priority if the priority boost is set.
            if ( RC_ddr3_read_req[4'(i)] && ( (CMD_R_priority_boost[4'(i)]<<4) || (PORT_R_PRIORITY[4'(i)]<<1) || RC_page_hit[4'(i)] ) == 5'(p) ) begin
                                                                          read_req_sel = 4'(i);  // If there is a DDR3 read_req hit with port priority 'p', set that read channel
                                                                          RC_break     = 1 ;     // and ignore the rest of the scan loop.
                                                                          end
        end
    end // for i
end // for p

--- End code ---
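
One thing worth checking in the old version, as a possible contributor to the glitch (this is an assumption, not a confirmed cause): the priority weight is combined with the logical OR '||', which yields a 1 bit true/false result, so the comparison against '5'(p)' can only ever match p=0 or p=1. A bitwise '|' keeps the boost, priority and page-hit fields separate. Here is a minimal simulation sketch with made-up values; 'or_demo' and its signal names are hypothetical.

--- Code: ---module or_demo;
    logic       boost    = 1'b0;   // Priority boost flag.
    logic [2:0] pri      = 3'd5;   // Port priority, 0-7.
    logic       page_hit = 1'b1;   // Page-hit flag.
    logic [4:0] w_logical, w_bitwise;

    initial begin
        // Logical OR: the result is a 1 bit true/false, zero-extended to 5 bits.
        w_logical = (5'(boost) << 4) || (5'(pri) << 1) || page_hit;
        // Bitwise OR: packs {boost, pri, page_hit} into a 5 bit weight.
        w_bitwise = (5'(boost) << 4) |  (5'(pri) << 1) |  5'(page_hit);
        $display("logical=%0d bitwise=%0d", w_logical, w_bitwise);  // logical=1 bitwise=11
    end
endmodule

--- End code ---

Building the weight with a concatenation such as '{boost, pri, page_hit}' avoids the issue altogether, since there is no operator to mix up.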

This new version shrinks everything down, fixes the range of 'i' in its for loop, and pre-computes the 16 channel 'CMD_R_priority_boost[ x ]' into a fixed array prior to the loop:


--- Code: ---// Generate flags which show which boosted requests are available.
RC_boost_hit   = (RC_boost & RC_ddr3_read_req);
WC_boost_hit   = (WC_boost & WC_ddr3_write_req);

any_boost      = (WC_boost_hit !=0) || (RC_boost_hit !=0) ;             // Make a flag which indicates that a boosted request exists.
read_preq      = (any_boost ? RC_boost_hit:RC_ddr3_read_req)  ;         // Filter select between boosted ports and no boosted ports.
write_preq     = (any_boost ? WC_boost_hit:WC_ddr3_write_req) ;         // Filter select between boosted ports and no boosted ports.

any_read_req   = (read_preq  != 0) ;  // Go high if there are any DDR3 reads required.
any_write_req  = (write_preq != 0) ;  // Go high if there are any DDR3 writes required.

// Scan inputs:   

    read_req_sel = 0 ;
    RC_break     = 0 ;
    for (int p=15 ; p>=0 ; p-- ) begin                         // 'p' scans in order of highest priority to lowest.
        for (int i=1 ; i<17 ; i++ ) begin                      // 'i' scans offsets 1-16 from the last read channel.
   
        // round robin arbiter priority selection, ensure the previous accessed R/W channel of equal priority is considered last before
        // access is granted once again.
        RC_cs  = 4'(4'(i) + last_read_req_chan ) ;             // RC_cs begins the scan at the last read channel + 1, wraps around, and ends on the last read channel itself.

            if ( RC_cs<PORT_R_TOTAL && !RC_break ) begin       // Only scan from the available read ports.

                // Add 1/2 to the read port's priority weight if the page_hit is true.
                if ( read_preq[RC_cs] && ( {PORT_R_PRIORITY[RC_cs],RC_page_hit[RC_cs]} == p[3:0] ) ) begin
                                                                            read_req_sel   = RC_cs;  // If there is a DDR3 read_req hit with port priority 'p', set that read channel
                                                                            RC_break       = 1 ;     // and ignore the rest of the scan loop.
                                                                            end
            end

        end // for i
    end // for p

--- End code ---


As you can see, the loop structure is half the size, and 'read_preq[RC_cs]' resolves the priority boost inputs prior to entering the loop (write channel not shown).  After the loop, I now have this segment to coalesce reads and writes into chunks:


--- Code: ---SEQ_READY     = ( SEQ_CAL_PASS && DDR3_READY && !SEQ_BUSY ) ; // High when a command is allowed to be sent to the DDR3_PHY_SEQ.

    if (SEQ_READY)                                      begin  // Make sure the DDR3 sequencer is ready to accept a command.

                     if (any_read_req && any_write_req && (PORT_W_PRIORITY[write_req_sel] > PORT_R_PRIORITY[read_req_sel]) ) begin
                     // There is a read and write, but, the write port's priority is higher than the read's.
                                                        DDR3_read_req  = 0;
                                                        DDR3_write_req = 1;

            end else if (any_read_req && any_write_req && (PORT_W_PRIORITY[write_req_sel] < PORT_R_PRIORITY[read_req_sel]) ) begin
                     // There is a read and write, but, the read port's priority is higher than the write's.
                                                        DDR3_read_req  = 1;
                                                        DDR3_write_req = 0;

            end else if (any_read_req  && !(last_req_write && any_write_req) )  begin
            // There is a read req and the port priorities are equal (or there is no write req): grant the read unless the last command was a write and another write req is pending.
            // This coalesces a number of consecutive reads before allowing a write, or, coalesce a number of consecutive writes before allowing a read.
            // This minimizes the read-write (RL + tCCD + 2tCK - WL) delay or write-read (tWTR) delays.
                                                        DDR3_read_req  = 1;
                                                        DDR3_write_req = 0;

            end else if (any_write_req )                                    begin  // If there is no valid read condition and there is a write req.

                                                        DDR3_read_req  = 0;
                                                        DDR3_write_req = 1;

            end else                                                        begin  // No DDR3 access required.
       
                                                        DDR3_read_req  = 0;
                                                        DDR3_write_req = 0;
            end
       
    end else begin // DDR3_PHY_SEQ isn't ready for a command.

                                                        DDR3_read_req  = 0;
                                                        DDR3_write_req = 0;
    end

--- End code ---
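
The 'last_req_write' flag used above is not shown in this post. One plausible way to maintain it, purely as an assumption, is a register which remembers the direction of the last issued command and holds its value while idle; the module name 'last_req_tracker_sketch' and the 'CLK' and 'RESET' inputs are hypothetical.

--- Code: ---module last_req_tracker_sketch (
    input  logic CLK,
    input  logic RESET,
    input  logic DDR3_read_req,    // A read command is being issued this cycle.
    input  logic DDR3_write_req,   // A write command is being issued this cycle.
    output logic last_req_write    // High if the most recent command was a write.
);
    always_ff @(posedge CLK) begin
        if      (RESET)          last_req_write <= 1'b0;
        else if (DDR3_write_req) last_req_write <= 1'b1;
        else if (DDR3_read_req)  last_req_write <= 1'b0;
        // Otherwise hold, so an idle cycle does not break up a run of reads or writes.
    end
endmodule

--- End code ---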

I just got a picture out of the DECA board and I should have a demo and fully documented source code coming in a few days.  The source code was designed to be ported between Xilinx and Lattice.
