If ADC conversions are better without calibration than with it, then either the calibration was done incorrectly or, more probably, something is wrong in the way the conversions themselves are performed.
VREF+ instability is one candidate, as mentioned previously; too short a sampling time is another.
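To check the sampling-time suspicion, a rough sketch in C follows. It uses the usual RC settling estimate, t >= (R_source + R_ADC) * C_ADC * ln(2^(N+2)), which lets the sampling capacitor settle to within 1/4 LSB. The function names are made up for illustration, and every value you pass in (source impedance, internal switch resistance, sampling capacitance, ADC clock) is an assumption to be replaced with the figures from your device's datasheet and your circuit.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: minimum sampling time in seconds for the sampling
 * capacitor to settle to within 1/4 LSB, using a simple RC model.
 * All parameters must come from the datasheet and the actual circuit. */
double adc_min_sampling_time_s(double r_source_ohm, double r_adc_ohm,
                               double c_adc_farad, unsigned resolution_bits)
{
    assert(r_source_ohm >= 0.0 && r_adc_ohm >= 0.0 && c_adc_farad > 0.0);

    /* Time constant of the source resistance plus the internal switch
     * resistance charging the sampling capacitor. */
    const double tau = (r_source_ohm + r_adc_ohm) * c_adc_farad;

    /* ln(2^(N+2)) == (N+2) * ln(2) */
    return tau * (double)(resolution_bits + 2u) * log(2.0);
}

/* Hypothetical helper: the same time expressed in ADC clock cycles.
 * Pick the next longer sampling-time setting the ADC offers. */
double adc_min_sampling_cycles(double t_sample_s, double adc_clock_hz)
{
    assert(adc_clock_hz > 0.0);
    return t_sample_s * adc_clock_hz;
}
```

With purely illustrative numbers (10 kOhm source, 1 kOhm internal resistance, 8 pF sampling capacitor, 12 bits), this gives roughly 0.85 us, i.e. about 12 cycles at a 14 MHz ADC clock, so any shorter sampling-time setting would be suspect in that case.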
JW