Author Topic: DDR3 initialization sequence issue (Read 65587 times)

promach · « **Reply #225 on:** July 04, 2021, 02:49:52 pm »

As for DQS centering, is the IODELAY2 primitive delay-shifting the DQS strobe OR the parallel DQ bits ?

Note: It seems that a similar clock restriction might apply here?

NorthGuy · « **Reply #226 on:** July 04, 2021, 04:03:34 pm »

Quote from: promach on July 04, 2021, 02:49:52 pm

As for DQS centering, is the IODELAY2 primitive delay-shifting the DQS strobe OR the parallel DQ bits ?

Note: It seems that a similar clock restriction might apply here?

DQS and DQ are different pins.

When you write, you drive both DQS and DQ. You can either use different clocks (shifted by 90 degrees) to drive them, or use the same clock and shift the outputs to produce a 90-degree difference.

Whatever clock you used to drive DQS during writes, you must also use to read DQS (at least your table says so). Therefore, you must use IDELAY to shift the DQS input into alignment with that clock. You can use DQS to verify the "10101010" pattern, as BrianHG did. Or you can shift it by an extra 90 degrees and use it to calibrate the delays, then apply the calibration values to adjust the DQ delays in sync with DQS.

Similarly, when you read DQ, you must use the same clock as you have used to drive DQ during writes. Of course, the data won't be aligned to the clock, so you need to configure IDELAY to create correct alignment.

promach · « **Reply #227 on:** July 04, 2021, 05:10:21 pm »

1. For IODELAY , how exactly is CAL different from CE ?

2. On page 131, if I use "WRAPAROUND" for the attribute COUNTER_WRAPAROUND , it looks like it would be similar to bitslip operation ?

3. Besides, on page 132, IDELAY_VALUE ranges from 0 to 255. How shall I use these values during calibration? Should I increment or decrement by just '1' during initial MPR calibration?

NorthGuy · « **Reply #228 on:** July 04, 2021, 06:04:29 pm »

Quote from: promach on July 04, 2021, 05:10:21 pm

1. For IODELAY , how exactly is CAL different from CE ?

CAL performs calibration. CE increments/decrements the number of steps.

Quote from: promach on July 04, 2021, 05:10:21 pm

2. On page 131, if I use "WRAPAROUND" for the attribute COUNTER_WRAPAROUND , it looks like it would be similar to bitslip operation ?

Sort of. When the counter wraps, you can do a bitslip in the opposite direction to compensate.

Quote from: promach on July 04, 2021, 05:10:21 pm

3. Besides, on page 132, IDELAY_VALUE ranges from 0 to 255. How shall I use these values during calibration? Should I increment or decrement by just '1' during initial MPR calibration?

If you use dynamic calibration (for example, if you try to align the input DQS with CK), then you increment/decrement by 1. When the DQS and CK edges coincide, sampling DQS with CK produces roughly equal numbers of '0's and '1's. So you either increment by one or decrement by one until you reach this situation. At this point the calibration is complete. But if DQS drifts away from CK (say, because of a temperature change), you need to re-calibrate, which means you need to read periodically.

If you do static calibration, you just try all the values and select the best one for future use. Then you assume that changes in DQS timing are not big enough to derail your calibration.
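A minimal sketch of the static approach, assuming a hypothetical `read_ok(tap)` callback that programs the delay, reads the known MPR pattern back, and reports whether it matched. The 0 to 255 range comes from the IDELAY_VALUE range mentioned above; picking the centre of the widest passing window is one common choice, not something the data sheet mandates, and the sketch ignores a window that wraps around the end of the range.

```python
def widest_passing_window(passes):
    """Return (start, length) of the longest run of True values in `passes`."""
    best_start, best_len = 0, 0
    run_start = None
    # The trailing False is a sentinel that closes a run ending at the last tap.
    for i, ok in enumerate(list(passes) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len


def static_calibrate(read_ok, num_taps=256):
    """Try every tap value and return the centre of the widest passing window.

    `read_ok` is a hypothetical callback standing in for the real hardware read.
    """
    passes = [read_ok(tap) for tap in range(num_taps)]
    start, length = widest_passing_window(passes)
    if length == 0:
        raise RuntimeError("no tap value produced a clean read")
    return start + (length - 1) // 2


# Toy stand-in for hardware: reads succeed only for taps 40..71.
best = static_calibrate(lambda tap: 40 <= tap <= 71)
print(best)  # 55
```

The chosen value is then written once and left alone, which is exactly the assumption stated above: later drift in DQS timing must stay smaller than the margin on either side of the centre.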

promach · « **Reply #229 on:** July 05, 2021, 01:45:33 am »

Quote

CAL performs calibration. CE increments/decrements the number of steps.

@NorthGuy CE signal does not increment/decrement, that is the job of INC signal

So, what is the difference between CAL and CE ?

NorthGuy · « **Reply #230 on:** July 05, 2021, 02:51:25 am »

Quote from: promach on July 05, 2021, 01:45:33 am

Quote
CAL performs calibration. CE increments/decrements the number of steps.

@NorthGuy CE signal does not increment/decrement, that is the job of INC signal

INC selects whether the count increments or decrements.
CE selects whether the count changes (increments or decrements depending on INC) or doesn't change.
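The CE/INC relationship can be shown with a toy counter. This is a behavioural sketch only, not a model of the real IODELAY2: the class name, the 0 to 255 limit, and the `wraparound` flag are assumptions made for illustration.

```python
class TapCounter:
    """Toy model of a variable-delay tap counter (not the real primitive)."""

    def __init__(self, value=0, max_value=255, wraparound=True):
        self.value = value
        self.max_value = max_value      # assumed upper limit of the count
        self.wraparound = wraparound    # assumed: wrap vs. stay at the limit

    def clock(self, ce, inc):
        """Advance one clock: CE gates the change, INC picks the direction."""
        if not ce:
            return self.value           # CE low: count holds, INC is ignored
        nxt = self.value + (1 if inc else -1)
        if 0 <= nxt <= self.max_value:
            self.value = nxt
        elif self.wraparound:
            self.value = nxt % (self.max_value + 1)
        # otherwise the count stays at the limit
        return self.value


c = TapCounter(value=254)
c.clock(ce=False, inc=True)   # holds at 254
c.clock(ce=True, inc=True)    # 255
c.clock(ce=True, inc=True)    # wraps to 0
c.clock(ce=True, inc=False)   # wraps back to 255
```

With CE low, the INC level has no effect, which is why the two inputs are described separately.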

promach · « **Reply #231 on:** July 05, 2021, 02:59:42 am »

Quote

CAL Input : Initiate calibration input.
CE Input : Enable increment/decrement.

I am a bit confused about the actual purpose of the CAL and CE inputs.

What is the difference between CAL and CE?
How are they used together? Or should I just ignore CAL and use only CE and INC?

NorthGuy · « **Reply #232 on:** July 05, 2021, 03:23:46 am »

Quote from: promach on July 05, 2021, 02:59:42 am

Or should I just ignore CAL and use only CE and INC ?

CAL is for calibration. It calculates the number of taps in one clock period (MAX). I think you need to do it before you start using IODELAY. Read "I/O Delay Calibration and Reset" in ug381.

promach · « **Reply #233 on:** July 05, 2021, 07:45:07 am »

Quote

In this example, the delay taps have an average value of 40 ps under current operating conditions. An I/O clock of 250 MHz (4,000 ps period) is applied to the IODELAY2 via CLK0 for SDR mode. When the calibrate (CAL) command is issued, a value of 4,000/40 = 100 is returned internally. If the input delay is programmed to be VARIABLE_FROM_HALF_MAX, then, following a reset (RST) command, the input delay value is set to 50 taps, equivalent to approximately ½ the input clock period. As operating conditions change, the average value of the delay taps will also change, as will the result obtained from a CAL command.
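The arithmetic in the quoted example can be reproduced in a few lines. This is only a sketch: `taps_per_period` is a hypothetical helper, and on real hardware the CAL result comes from a measurement inside the primitive, not from a formula.

```python
def taps_per_period(clock_mhz, tap_ps):
    """Number of delay taps that fit in one clock period (what CAL measures)."""
    period_ps = 1_000_000 / clock_mhz   # a 1 MHz clock has a 1,000,000 ps period
    return round(period_ps / tap_ps)


max_taps = taps_per_period(250, 40)     # 4,000 ps / 40 ps per tap
half_max = max_taps // 2                # starting point for VARIABLE_FROM_HALF_MAX
print(max_taps, half_max)               # 100 50

# If the taps slow to an average of 50 ps, the same clock spans fewer taps.
print(taps_per_period(250, 50))         # 80
```

The second print illustrates the last sentence of the quote: the same clock period maps to a different tap count when operating conditions change the average tap delay.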

For CAL input port of IODELAY2 primitive , is this CAL calibration not related to DQS centering ? It seems to me that this CAL input is only for IODELAY2 internal calibration mechanism ?

But the explanation using the VARIABLE_FROM_HALF_MAX example does not seem to imply so.

Besides, the description of the CAL input port ("Invokes the IODELAY2 calibration sequence. The calibration sequence lasts between eight and 16 GCLK cycles. Drives BUSY Low when complete.") seems to imply that the CAL input has to be asserted together with the CE and INC inputs during the actual calibration for read DQS strobe centering?

NorthGuy · « **Reply #234 on:** July 05, 2021, 12:00:43 pm »

Quote from: promach on July 05, 2021, 07:45:07 am

For CAL input port of IODELAY2 primitive , is this CAL calibration not related to DQS centering ? It seems to me that this CAL input is only for IODELAY2 internal calibration mechanism ?

Yes, the CAL calibration is internal to IODELAY. It prepares IODELAY for use. This process is necessary because the number of taps which fit into a clock cycle varies between different FPGAs and also depends on temperature and voltage. This calibration process is different from calibrating delays to align signals in your design, which is done with CE and INC.

In 7-series CAL is not needed - the delays auto-calibrate - you only need to supply a reference clock.

promach · « **Reply #235 on:** July 05, 2021, 12:20:17 pm »

Quote

Yes, the CAL calibration is internal to IODELAY. It prepares IODELAY for use.

Thanks, I am now checking how the IODELAY2 primitive's CAL input port is being driven by this Xilinx demo example.

It seems to me that the Xilinx engineer issues the internal CAL command TWICE; note the cal_data_sint signal inside the FSM in both state 4'h1 and state 4'h6.

Besides, the busy_data_d logic does not seem to obey the IODELAY2 primitive requirement: "The calibration sequence lasts between eight and 16 GCLK cycles."

This is a bit confusing. Any idea ?

promach · « **Reply #236 on:** July 05, 2021, 01:17:37 pm »

@NorthGuy It seems that IODELAY2 primitive also needs some initial hardware warmup time ?

There is some explanation about Phase Detector Calibration Mechanisms , but I do not understand why SLAVE delay is always the MASTER delay minus half MAX ?

NorthGuy · « **Reply #237 on:** July 05, 2021, 01:21:35 pm »

Quote from: promach on July 05, 2021, 12:20:17 pm

Quote
Yes, the CAL calibration is internal to IODELAY. It prepares IODELAY for use.

Thanks, I am now checking how the IODELAY2 primitive's CAL input port is being driven by this Xilinx demo example.

It seems to me that Xilinx engineer issues internal CAL command TWICE, note the cal_data_sint signal inside the FSM for both state 4'h1 as well as state 4'h6

Besides, the busy_data_d logic does not seem to obey IODELAY2 primitive requirement : The calibration sequence lasts between eight and 16 GCLK cycles.

This is a bit confusing. Any idea ?

I don't see any discrepancy. They assert CAL for one clock then they wait until the calibration is done by monitoring BUSY.
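The pulse-then-wait handshake can be sketched as follows. `FakeIodelay` is a made-up stand-in that holds BUSY high for 8 to 16 cycles after CAL, matching the range quoted from the port description; it is not the Xilinx demo code and has no connection to its signal names.

```python
import random


class FakeIodelay:
    """Stand-in for the primitive: BUSY stays high for 8..16 cycles after CAL."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._remaining = 0

    @property
    def busy(self):
        return self._remaining > 0

    def clock(self, cal):
        """Advance one clock; a CAL pulse while idle starts a calibration."""
        if cal and not self.busy:
            self._remaining = self._rng.randint(8, 16)
        elif self.busy:
            self._remaining -= 1


def run_calibration(dev, timeout=64):
    """Pulse CAL for exactly one clock, then poll BUSY until it drops."""
    dev.clock(cal=True)                 # single-cycle CAL pulse
    cycles = 0
    while dev.busy:
        dev.clock(cal=False)            # CAL is low again while waiting
        cycles += 1
        if cycles > timeout:
            raise TimeoutError("BUSY never deasserted")
    return cycles


waited = run_calibration(FakeIodelay())
print(8 <= waited <= 16)  # True
```

The waiting logic never counts to a fixed number: because the duration is only specified as a range, watching BUSY is enough, and the 8 to 16 cycle figure does not need to appear in the state machine.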

promach · « **Reply #238 on:** July 05, 2021, 03:11:24 pm »

@NorthGuy What is the purpose of the use of mux signal in the phase detector module ?

NorthGuy · « **Reply #239 on:** July 05, 2021, 03:30:59 pm »

Quote from: promach on July 05, 2021, 03:11:24 pm

@NorthGuy What is the purpose of the use of mux signal in the phase detector module ?

I don't know. I haven't analyzed their code.

promach · « **Reply #240 on:** July 05, 2021, 03:34:30 pm »

By the way, I suppose the following early/late data sampling check mechanism could not be used during the initial MPR_Read_function calibration since my code only has a single IDELAY primitive to do DQS centering work?

NorthGuy · « **Reply #241 on:** July 05, 2021, 07:02:47 pm »

Quote from: promach on July 05, 2021, 03:34:30 pm

By the way, I suppose the following early/late data sampling check mechanism could not be used during the initial MPR_Read_function calibration since my code only has a single IDELAY primitive to do DQS centering work?

You have two IODELAY blocks because DQS is differential. I don't know if the routing would permit using the built-in mechanism in your case. If not, you can do it in fabric on your own.

Your DQS IO logic is clocked by a clock. You need to align DQS to this clock. If you sample DQS with the rising edge of the clock, you can get different responses:

1. If you always get '0', the clock rising edge has already happened but the DQS rising edge hasn't. DQS needs to be moved earlier by decreasing the DQS delay.

2. If you always get '1', the clock rising edge happens after the DQS edge. Therefore, the DQS delay must be increased.

3. If you're somewhere in the middle (in the jitter zone) then DQS and the clock are aligned.

Of course, you don't need the DQS data; you only need the DQ data. Therefore you adjust the DQ delays in step with DQS: every time you increase the DQS delay, you increase the DQ delay as well, and every time you decrease the DQS delay, you decrease the DQ delay. This way, if DQS shifts, the DQ sampling point shifts to follow it.

Regardless of this, DQ delays themselves must be adjusted so that you sample in the middle of the window.
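The three cases and the DQ tracking rule can be sketched as a simple early/late loop. The function names, the batch of samples, and the 0.9 threshold are assumptions for illustration; the step directions follow points 1 and 2 above.

```python
def dqs_adjust(samples, threshold=0.9):
    """Decide the delay step from DQS samples taken on the clock's rising edge.

    Returns -1 (decrease delay), +1 (increase delay) or 0 (aligned).
    `threshold` is an assumed cut-off for "always"; tune it to the jitter seen.
    """
    ones = sum(samples) / len(samples)
    if ones <= 1 - threshold:
        return -1   # (almost) always '0': DQS edge comes after the clock edge
    if ones >= threshold:
        return +1   # (almost) always '1': DQS edge comes before the clock edge
    return 0        # mixed '0'/'1': in the jitter zone, edges are aligned


def track(dqs_tap, dq_tap, samples):
    """Apply the same step to DQS and DQ so the DQ sampling point follows DQS."""
    step = dqs_adjust(samples)
    return dqs_tap + step, dq_tap + step


print(track(50, 70, [1] * 16))    # (51, 71)
print(track(50, 70, [0] * 16))    # (49, 69)
print(track(50, 70, [0, 1] * 8))  # (50, 70)
```

The fixed offset between the two tap values (20 in this toy run) is the separately calibrated DQ centring delay; the loop only moves both together.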

promach · « **Reply #242 on:** July 06, 2021, 01:41:14 am »

Quote

Your DQS IO logic is clocked by a clock. You need to align DQS to this clock. If you sample DQS with the rising edge of the clock, you can get different responses:

1. If you always get '0', the clock rising edge has already happened but the DQS rising edge hasn't. DQS needs to be moved earlier by decreasing the DQS delay.

2. If you always get '1', the clock rising edge happens after the DQS edge. Therefore, the DQS delay must be increased.

3. If you're somewhere in the middle (in the jitter zone) then DQS and the clock are aligned.

@NorthGuy

In the following waveform, we have incoming read DQS strobe as well as parallel DQ bits, and also 90-degree phase shifted DQ bits with respect to DQS strobe.

I do not quite understand your quoted points #1 and #2.

No matter how the DQS strobe is delayed within a single bit period, DQS strobe will always sample the values of DQ bits correctly, although it would not be in the center of DQ bits for all such phase shift delay choices.

Therefore, I am confused as to how your quoted points #1 and #2 actually make sure that the DQS strobe is delay-shifted to the CENTER of the DQ bits?

NorthGuy · « **Reply #243 on:** July 06, 2021, 02:59:41 am »

Quote from: promach on July 06, 2021, 01:41:14 am

No matter how the DQS strobe is delayed within a single bit period, DQS strobe will always sample the values of DQ bits correctly, although it would not be in the center of DQ bits for all such phase shift delay choices.

Except you cannot use DQS strobe to sample DQ because you must use the same clock for both input and output SERDES (At least that's what the table you have posted said). Thus you must sample DQ with the output clock. Therefore you need to align DQ so that the receiving clock edges are centered in DQ bits.

Even if you could sample DQ with DQS strobes (that is if you could route DQS to clock ILOGIC flip-flops), you then must transfer the results to a regular clock domain. To achieve this, DQS would have to be roughly aligned with the clock. So, the DQ/DQS groups would have to be delayed.

promach · « **Reply #244 on:** July 06, 2021, 03:10:43 am »

Quote

Except you cannot use DQS strobe to sample DQ because you must use the same clock for both input and output SERDES (At least that's what the table you have posted said). Thus you must sample DQ with the output clock. Therefore you need to align DQ so that the receiving clock edges are centered in DQ bits.

@NorthGuy

But how exactly are points #1 and #2 being applied to achieve such purpose ?

promach · « **Reply #245 on:** July 06, 2021, 03:44:34 am »

It seems that points #1 and #2 assume that incoming parallel DQ bits are length-matched with read DQS strobe ?

And I think you were implying the use of an XOR operation between the FPGA's PLL-generated clock and the read DQS strobe?

Please correct me if wrong.

NorthGuy · « **Reply #246 on:** July 06, 2021, 03:46:15 am »

Quote from: promach on July 06, 2021, 03:10:43 am

But how exactly are points #1 and #2 being applied to achieve such purpose ?

While you read the pattern from the DDR3 chip, you shift DQS to be in phase with the clock and keep it that way. You know how big the shift is, so you know how much you need to shift DQ to move it to the point where the sampling clock is centred in the DQ bit.

You don't have to do this. You can determine (test or guess) the necessary delays and hard-code them into your design.

NorthGuy · « **Reply #247 on:** July 06, 2021, 03:50:00 am »

Quote from: promach on July 06, 2021, 03:44:34 am

It seems that points #1 and #2 assume that incoming parallel DQ bits are length-matched with read DQS strobe ?

They must be length-matched. However, if there's a mismatch, you can add/remove taps to IODELAYs to compensate.

promach · « **Reply #248 on:** July 06, 2021, 04:38:41 am »

Quote

They must be length-matched. However, if there's a mismatch, you can add/remove taps to IODELAYs to compensate.

@NorthGuy

What if the following skew situation also happens between those parallel incoming READ DQ bits ?

And at such a high DDR3 RAM operating frequency, it might be impossible to calibrate the skew between incoming DQ bits. In other words, a COMBINATIONAL logic comparison between those DQ bits would not be feasible in this case.

Note: pairwise comparison between DQS and a particular single DQ bit would not work in all corner test cases, therefore a ^DQ xor operation is needed. However, my DDR3 RAM is of x16 configuration, which means DQ_BITWIDTH=16. So, this exacerbates the setup timing issue even further.

Not to mention that there would also be placement and routing issues, given the fixed number of IODELAY2 primitives in a given hardware block inside a Xilinx Spartan-6 chip.

What do you think ?

BrianHG · « **Reply #249 on:** July 06, 2021, 07:37:25 am »

Ok, you need to read the data sheet's worst case drift on the IO performance and do a little math.

In my example case, I have 16 tuning steps until my clock rotates 360 degrees.

MAX10-6 FPGA...

At 250 MHz, 7 of 16 steps give me true error-free data. (RAM underclocked.)
At 300 MHz, 7 of 16 steps give me true error-free data.
At 350 MHz, 7 of 16 steps give me true error-free data. (FPGA CK/DQS/DQ IO buffers overclocked.)
At 400 MHz, 6 of 16 steps give me true error-free data.
At 450 MHz, 5 of 16 steps give me true error-free data. (FPGA DDR3 core and write-data DQ SERDES overclocked.)
At 500 MHz, 5 of 16 steps give me true error-free data. (FPGA read-data DQ SERDES overclocked at this point.)

With this, at 500 MHz / 1 GT/s, it is possible to calculate how many picoseconds of valid-data margin I get when tuned to the middle, and how much each tuning step gives me.

Also, there is one additional tuning point at one end where around half of the 16 bits are correct and the rest are jiggling.

This single transitional tuning point should give you an idea of the timing errors between the 16 bits.

Sorry, I cannot test above 500 MHz; the MAX10 completely fails to do anything. The DECA board I used has a single 800 MHz / 1600 MT/s 16-bit DDR3 RAM chip.
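A rough version of that calculation, assuming the 16 tuning steps are evenly spaced across one full clock period at every tested frequency (the helper name and that uniform-step assumption are not from the measurements themselves). Since the resolution is one step, the true window could be up to a step wider or narrower than the figure printed.

```python
def window_estimate(clock_mhz, passing_steps, steps_per_cycle=16):
    """Return (step size in ps, approximate valid window in ps)."""
    period_ps = 1_000_000 / clock_mhz        # clock period in picoseconds
    step_ps = period_ps / steps_per_cycle    # assumes evenly spaced steps
    return step_ps, passing_steps * step_ps


# Passing step counts as listed above, keyed by clock frequency in MHz.
results = {250: 7, 300: 7, 350: 7, 400: 6, 450: 5, 500: 5}

for mhz, steps in results.items():
    step_ps, window_ps = window_estimate(mhz, steps)
    bit_ps = 1_000_000 / mhz / 2             # DDR: two bits per clock period
    print(f"{mhz} MHz: step {step_ps:.0f} ps, "
          f"window ~{window_ps:.0f} ps of a {bit_ps:.0f} ps bit")
```

At 500 MHz this gives a 125 ps step and roughly 625 ps of clean window inside a 1000 ps bit, so a setting in the middle leaves about 300 ps of margin on each side under these assumptions.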

