Depending on the DDR3 chip speed bin, DQS can arrive anywhere from 400 ps before CK to 400 ps after it; for the most commonly used DDR3L-1600 chips it's ±225 ps (tDQSCK). DQ can appear anytime from tLZDQ(min) (450 ps for DDR3L-1600) before DQS to tDQSQ(max) (225 ps) after it, and each transition can happen anywhere within this interval, so you cannot calibrate this out. The hold time for DQ is defined as 0.38*tCK minimum from the DQS edge (tQH), so your data valid window is tQH - tDQSQ(max). For DDR3L-1600 running at 400 MHz (tCK = 2500 ps) that is 0.38*2500 ps - 225 ps = 725 ps. For the same device running at 300 MHz it will be 1042 ps. For comparison's sake, that device running at its max frequency of 800 MHz would leave a window of only 250 ps.

Now, all of that assumes you sample using the actual DQS signal as a clock (the way it's designed and intended to work). If you use the main clock instead, you have to reduce your window by the tDQSCK time (225 ps) and compensate for possible DQS duty cycle variation (a DQS half-period can be as short as 0.4 tCK). So the worst case for 400 MHz can be 725 ps - 225 ps - 250 ps (0.1 tCK for duty cycle shortfall) - 128 ps (cumulative error derating) = 122 ps, and the window will be fully closed for 800 MHz.

I don't know if the DQS-to-CK offset is stable for a specific chip (so that it can be calibrated out). If it is, the window will be wider by tDQSCK, but this will only work for a single x8 chip and perfect routing (which is way beyond DDR3's routing guidelines). For anything other than x8, the CK-to-DQS time will almost always be different for different DQ groups, unless, again, you go that extra mile while routing to ensure perfect matching. For off-the-shelf boards that will likely NOT be the case, as DDR3 routing guidelines only call for matching within DQ groups, and there is no requirement for any of these groups to be matched to the ADDR/CTRL/CK group.
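If you want to redo this arithmetic for other clock rates, here is a minimal Python sketch. It only uses the figures quoted above (0.38 tCK for tQH, 225 ps for tDQSQ(max) and tDQSCK, 0.1 tCK duty cycle shortfall, 128 ps derating) and derives tCK from the clock frequency, so 400 MHz means tCK = 2500 ps. The function and constant names are mine, not from any datasheet or tool, and the values should be checked against your own part's timing tables.

```python
# Rough data-valid-window estimate for DDR3L-1600 read capture.
# All constants are the figures quoted in the post, in picoseconds;
# the names are illustrative only.
TQH_FACTOR = 0.38       # tQH(min) as a fraction of tCK
TDQSQ_MAX_PS = 225.0    # latest DQ transition after the DQS edge
TDQSCK_PS = 225.0       # DQS-to-CK uncertainty
DUTY_SHORTFALL = 0.1    # 0.5 tCK nominal minus 0.4 tCK minimum half-period
DERATING_PS = 128.0     # cumulative error derating used in the post


def tck_ps(freq_mhz: float) -> float:
    """Clock period in picoseconds for a clock frequency in MHz."""
    return 1e6 / freq_mhz


def window_dqs_ps(freq_mhz: float) -> float:
    """Window when sampling with DQS itself: tQH - tDQSQ(max)."""
    return TQH_FACTOR * tck_ps(freq_mhz) - TDQSQ_MAX_PS


def window_ck_ps(freq_mhz: float) -> float:
    """Worst-case window when sampling with the main clock instead of DQS.

    Subtracts tDQSCK, the duty cycle shortfall and the derating term,
    and clamps at zero (a negative result means the window is closed).
    """
    w = (window_dqs_ps(freq_mhz)
         - TDQSCK_PS
         - DUTY_SHORTFALL * tck_ps(freq_mhz)
         - DERATING_PS)
    return max(w, 0.0)


if __name__ == "__main__":
    for f in (300, 400, 800):
        print(f"{f} MHz: DQS-sampled {window_dqs_ps(f):.0f} ps, "
              f"CK-sampled {window_ck_ps(f):.0f} ps")
```

With these inputs the DQS-sampled window comes out at about 1042 ps for 300 MHz and 250 ps for 800 MHz, and the CK-sampled window clamps to zero at 800 MHz, i.e. it is fully closed.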