Did Bob give you any indication as to the spread of drifts of the LTZ (or 399) after burn-in? Or the percentage that are rejected even before they are sold to someone like HP?
The more interesting question would be what the selection criteria are.
For an instrument specified at 35 ppm/year, the drift of the reference should be below 18 ppm/year (the remaining ppm are for resistors + tempco).
To guarantee that (3 sigma), the measured drift of the reference alone should be even lower: below 6 ppm/year, or 2 ppm/kHr (assuming drift scales with the square root of time, so that with 1 year = about 9 kHr there is a factor of sqrt(9) = 3 between 1 kHr and 1 year).
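The arithmetic behind those limits can be written out as a small sketch. The function name and parameter names are my own, and the square-root-of-time scaling is the assumption stated above, not a datasheet figure.

```python
import math

def reference_drift_limits(ref_budget_ppm_year=18.0, sigma=3.0, year_khr=9.0):
    """Return (ppm/year limit, ppm/kHr limit) for the reference alone."""
    # Divide the budget by the sigma level to get the typical drift allowed
    per_year = ref_budget_ppm_year / sigma
    # Assumption: drift grows with sqrt(time), so going from
    # 1 kHr to 1 year (about 9 kHr) is a factor of sqrt(9) = 3
    per_khr = per_year / math.sqrt(year_khr)
    return per_year, per_khr

per_year, per_khr = reference_drift_limits()
print(f"{per_year:.1f} ppm/year, {per_khr:.1f} ppm/kHr")  # 6.0 ppm/year, 2.0 ppm/kHr
```

Using the exact 8.76 kHr per year instead of the rounded 9 kHr gives a limit only slightly above 2 ppm/kHr, so the rounding does not matter here.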
From my measurements, about 50% of LM399s are below 1-2 ppm/kHr after 2000 hours of aging under power (15 hrs on/7 hrs off per day) without previous burn-in.
Now, after 4000 hrs, further devices seem to be stabilizing below this limit.
But you will probably need further tests, such as 0.1 Hz - 10 Hz noise or even lower-frequency noise, to filter out the bad references.
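For the drift part of such a screen, one possible way to get a ppm/kHr figure from logged readings is a least-squares slope over the observation window. This is only a sketch: the function name is hypothetical, the readings below are made-up illustration values and not my measurements, and a straight-line fit is a simplification over a short window.

```python
def drift_ppm_per_khr(hours, volts):
    """Least-squares slope of the readings, in ppm of the mean value per 1000 h."""
    n = len(hours)
    mean_h = sum(hours) / n
    mean_v = sum(volts) / n
    sxx = sum((h - mean_h) ** 2 for h in hours)
    sxy = sum((h - mean_h) * (v - mean_v) for h, v in zip(hours, volts))
    slope_v_per_h = sxy / sxx          # volts per hour
    return slope_v_per_h / mean_v * 1e6 * 1000.0

# Made-up readings for illustration only (not measured data)
hours = [2000, 2500, 3000, 3500, 4000]
volts = [7.000000, 7.000004, 7.000007, 7.000011, 7.000014]

rate = drift_ppm_per_khr(hours, volts)
print(f"{rate:.2f} ppm/kHr, below 2 ppm/kHr limit: {abs(rate) < 2.0}")
```

In practice the readings would also have to be corrected for the drift of whatever they are compared against, which this sketch ignores.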
I do not know if it is really a good idea to do a burn-in at 125 degrees C. Since the heater is off above 90 degrees C, the aging mechanism will probably change. And changes to the heater will create relatively large changes in the output voltage. E.g. changing the heater supply voltage will create an output change of several ppm/V, especially at low heater voltages (around 10 V).
This is despite the temperature, and so the heater power, being regulated on chip.
Of course it could be that the aging is mainly related not to the chip itself but to the die attach (epoxy) between chip and package. In that case a burn-in at higher temperatures would make sense.
Well-aged LM399-based instruments have about 1-2 ppm drift per year.
with best regards
Andreas