Maybe I'm smoking something but it seems to me that my simulation should, in the long run, pretty accurately tell me, for example, how many bits I can usefully get if the averaged references are used as the reference in a 24-bit A/D. I know it's less than 24 bits (these are not LTZ1000s) but how many? If the temperature range is restricted, what then?
Perhaps the questions I'm trying to answer are just obvious to you folks? Don't know.
Yes-- it is. There are theoretical bases for all of this, which shortcut a lot of prodding and guessing.

The central part is the Fourier Transform: take a signal as a function of time, and convert it to a function of frequency. Now it's upside-down: the average-for-all-time component is at frequency 0 (DC), while the variations over short periods of time lie at high frequencies.
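To make that concrete, here's a minimal numpy sketch (the signal is made up): bin 0 of the transform is just the sum of all the samples -- N times the average -- while the faster wiggles land in the higher bins.

```python
import numpy as np

# A made-up example signal: a DC offset plus a slow and a fast sine wave.
n = 1024
t = np.arange(n)
signal = 2.5 + 0.3 * np.sin(2 * np.pi * 3 * t / n) + 0.05 * np.sin(2 * np.pi * 200 * t / n)

spectrum = np.fft.rfft(signal)

# Bin 0 is the "average for all time" component: the sum of the samples.
print(spectrum[0].real / n)   # ~2.5, the DC offset
print(np.mean(signal))        # same thing, computed directly

# Bins 3 and 200 hold the slow and fast sine components respectively.
print(abs(spectrum[3]) * 2 / n, abs(spectrum[200]) * 2 / n)   # ~0.3 and ~0.05
```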
The most useful property of the FT is that per-element multiplication in the frequency domain is equivalent to convolution in the time domain.
You should be familiar with convolution, because it's responsible for many useful things in software: numerical multiplication (if you've ever written out a multiply algorithm), DSP (FIR filters), lots of simulated dynamics (game engine physics?), etc. In any case, it's taking two arrays and combining them into a new array of length 2*N - 1, where one array is read in reverse order and slid past the other, taking the sum of products of the two arrays. The "slide" position is the output index, and arrays of length N have 2*N - 1 overlapping positions, so you get that many output points (though in practice, it's often arranged so that the partial-overlap points at the ends are trimmed, keeping just N).
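In code, that sliding sum-of-products looks something like this (a bare-bones sketch, not a production routine):

```python
def convolve(a, b):
    """Sum-of-products convolution: two length-N inputs give 2*N - 1 outputs.
    Writing out[i + j] += a[i] * b[j] is the same as flipping one array and
    sliding it past the other, summing the products at each overlap position."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

print(convolve([1, 2, 3], [4, 5, 6]))   # [4, 13, 28, 27, 18] -- 2*3 - 1 = 5 points
```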
For numerical multiplication, this gives at most 2*N digits in the output from two N-digit inputs, so there you go.
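For instance, long multiplication really is just convolution of the digit arrays plus carry propagation (a sketch; numpy's convolve stands in for the loop above):

```python
import numpy as np

def multiply(x, y):
    """Multiply two integers by convolving their digit arrays, then carrying."""
    xd = [int(d) for d in str(x)][::-1]   # least-significant digit first
    yd = [int(d) for d in str(y)][::-1]
    prod = np.convolve(xd, yd)            # length = len(xd) + len(yd) - 1
    carry, digits = 0, []
    for p in prod:
        carry, digit = divmod(int(p) + carry, 10)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        digits.append(digit)
    return int("".join(str(d) for d in reversed(digits)))

print(multiply(1234, 5678), 1234 * 5678)   # both 7006652
```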
But in the Fourier domain, the sliding has already been done, in a sense: one definition of the FT is convolving the input with all possible sine and cosine waves. This selects the components of the input that are most similar to those sine waves. In other words, the frequency components, the spectrum. If you have the FT of a frequency response (a filter), you simply multiply that (per-element multiplication, no sliding or whatever) by your transformed input, and you've filtered your signal. Of course, to get a time-domain signal back, you need to inverse transform it. (A nice property of the Fourier transform is that it's very nearly its own inverse -- applying it twice just gives the signal back time-reversed and rescaled -- so you don't need to duplicate code: take FT(filter*FT(signal)) and fix up the reversal and scale factor.)
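A quick sanity check of that identity (a minimal sketch with numpy; the random signal and the 8-point boxcar are just stand-ins, and I use the library's inverse transform rather than the double-FT trick):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
weights = np.ones(8) / 8          # an 8-point sliding average

# Time domain: direct convolution.
direct = np.convolve(signal, weights)

# Frequency domain: multiply the transforms (zero-padded to the full output length),
# then inverse transform.
n = len(signal) + len(weights) - 1
via_fft = np.fft.ifft(np.fft.fft(signal, n) * np.fft.fft(weights, n)).real

print(np.allclose(direct, via_fft))   # True: same filtered signal either way
```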
So having introduced that: what's the FT of an averaging function? It's a lowpass filter!
There are several kinds of simple average: arithmetic average, weighted average (given some arbitrary array of weights), geometric average, sliding average, etc.
Note that a sliding average is a special case of a weighted average, where the weights are all 1 in a certain span, and zero elsewhere (i.e., we ignore any other samples). Likewise, the simple arithmetic average is equally weighted for all samples, so we don't need to make that special.
We do discard the geometric average, however: it is nonlinear. The FT approach only works when the operations obey linearity: FT(a*X+Y) = a*FT(X) + FT(Y) (for vectors X and Y and scalar a), and associativity and all that. The geometric average takes the Nth root of the product, so it cannot, in general, share this property. (It will be very close when X and Y are nearly constant; but then, why not save the computation and use the arithmetic average?)
So what's the FT of a general weighted average? Well, the weights are just a vector, and we can FT that. Indeed, if we do FT(FT(weights) * FT(signal)), we get the filtered signal just as if we did convolution(signal, weights) -- which, remember, is just the weighted average, computed at every slide position.
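And the transform of a run of equal weights does come out as a lowpass -- the classic sinc-shaped rolloff (sketch):

```python
import numpy as np

weights = np.ones(16) / 16                      # 16-point sliding average
response = np.abs(np.fft.rfft(weights, 1024))   # zero-pad to see a smooth curve

print(response[0])            # 1.0: DC (the long-term average) passes through untouched
print(response[64:].max())    # ~0.22: everything above the first null is attenuated
# The shape is the sinc-like rolloff, with nulls wherever a whole number of
# cycles fits exactly inside the 16-sample window.
```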
A weighted average is just an FIR filter (finite impulse response: when the input signal is a single point surrounded by zeroes -- an impulse -- the output is simply the series of weights, and when that vector is finite, as it is in real DSP systems, the response is also finite).
There's also the IIR filter, which works more like a capacitor charging and discharging. Add up successive input terms in an accumulator, "leak" some away by removing a percentage (10% lost each step, i.e., multiplying by 0.9, is roughly equivalent to taking the average over 1/0.1 = 10 samples), then adjust the output gain by that effective number (divide by 10 --> output). It's not a strict sliding average, because the weights decrease exponentially over time, never quite reaching zero.
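As a sketch (the 10% / 0.9 / 10 figures are just the example numbers above):

```python
def leaky_average(samples, leak=0.1):
    """One-pole IIR ("exponential") average: keep (1 - leak) of the accumulator
    each step, add the new sample, and scale the output by the effective
    averaging length 1/leak."""
    acc = 0.0
    out = []
    for x in samples:
        acc = acc * (1.0 - leak) + x      # leak 10%, add the new sample
        out.append(acc * leak)            # divide by 1/leak = 10 --> output
    return out

# A constant input settles to that constant; noise gets smoothed over ~1/leak samples.
print(leaky_average([5.0] * 50)[-1])      # ~5.0 (within about 0.5% after 50 samples)
```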
But what can signals teach us? Well, EEs have been making much, much better filters for over a century. If we take the transform of one of those filters, we will find we can get much sharper attenuation of high-frequency components.
And with FT in hand, we can synthesize them from scratch, too!
(IIR filters require more math -- you're solving for polynomial roots, just as network synthesis methods need to. Downside: they can be unstable, so you need more bits of accuracy in the computation.)
Back to the matter of references: suppose your error goes as sqrt(t) -- a diffusion mechanism. As t --> infty, error --> infty, so you can't hope for anything guaranteed. That's simply flat-out impossible, given that statistic. What about probability? What's the error bound that you need? What if there's a 0.1% chance it's outside your min/max bounds after, say, a decade? A spec like that gives us a finite time, and therefore a finite frequency response, which we can use to reason about the system.
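To put numbers on a spec like that (a sketch; the function name, drift rate and bound are made-up figures), treat the accumulated drift as zero-mean Gaussian with sigma growing as sqrt(t) and take the two-sided tail probability:

```python
from math import erfc, sqrt

def prob_outside(bound, drift_per_sqrt_year, years):
    """Probability a sqrt(t) diffusion drift is outside +/-bound after 'years'.
    Assumes a zero-mean Gaussian with sigma = drift_per_sqrt_year * sqrt(years)."""
    sigma = drift_per_sqrt_year * sqrt(years)
    return erfc(bound / (sigma * sqrt(2)))   # two-sided Gaussian tail

# Made-up example: 5 ppm/sqrt(year) drift, +/-50 ppm bound, one decade out.
print(prob_outside(50.0, 5.0, 10.0))   # ~0.0016, i.e. ~0.16% chance of being outside
```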
Suppose your noise density goes as 1/f. The integral of 1/f from a lower frequency f1 up to f2 (we must integrate to find the total noise -- a density is per-frequency only) is log(f2/f1). And as f1 approaches 0 Hz, i.e., DC, that diverges -- so again one cannot say anything about the long-term stability of the system.
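Numerically (sketch): pushing the lower cutoff toward DC -- i.e., waiting longer -- grows the total without bound, about 2.3 (one factor of ln 10) per decade of frequency.

```python
from math import log

def one_over_f_power(f_low, f_high, density=1.0):
    """Total power of a 1/f noise density integrated from f_low to f_high."""
    return density * log(f_high / f_low)

# Lower and lower cutoffs (longer and longer observation times):
for f_low in (1e-3, 1e-6, 1e-9, 1e-12):
    print(f_low, one_over_f_power(f_low, 1.0))   # 6.9, 13.8, 20.7, 27.6 -- no limit
```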
How many bits to expect, and how "good" are they? Of course if you keep adding up samples, as in the IIR or total-average case, your accumulator grows as lg(N) bits for N samples -- but only half of those are useful: averaging N samples of uncorrelated noise improves the SNR by sqrt(N), i.e., lg(N)/2 bits of ENOB. Each extra good bit costs 4x the samples, so doubling the ENOB takes on the order of 2^(2*ENOB) samples.
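Or as a sketch, assuming uncorrelated Gaussian noise:

```python
from math import log2

def enob_gain(n_samples):
    """Bits gained by averaging n samples of uncorrelated noise: SNR improves
    by sqrt(n), i.e. half a bit per doubling of the sample count."""
    return 0.5 * log2(n_samples)

for n in (4, 16, 256, 65536):
    print(n, enob_gain(n))   # 1, 2, 4, 8 extra bits -- 4x the samples per bit
```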
Can you do better? Is this a lower or upper bound, and do you need more to be sure?
You can do better, but not in the general case (uncorrelated noise). If the noise comes from a known source (usually an interference source with known frequency range -- like SMPS noise), it can be low-pass filtered, or notched out. If it comes from a known process, like ADC quantization noise, you can take some steps to average it out (dithering) or subtract it (subtractive dithering).
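Here's a sketch of why the dither matters, with a made-up ideal quantizer (1 LSB steps) and a constant input sitting between codes:

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 3.37          # sits between two ADC codes (LSB = 1)
n = 100_000

# Without dither: a noiseless quantizer returns the same code every time,
# so averaging gains nothing -- the 0.37 LSB residual never shows up.
plain = np.round(np.full(n, true_value))
print(plain.mean())                       # 3.0, stuck at the nearest code

# With dither: add known-statistics noise before quantizing; the codes now
# toggle in proportion to the residual, and the average converges on it.
dither = rng.uniform(-0.5, 0.5, n)        # 1 LSB of uniform dither
dithered = np.round(true_value + dither)
print(dithered.mean())                    # ~3.37

# Subtractive dither: if the exact dither values are known, subtract them
# back out afterward, which also removes the noise the dither itself added.
subtractive = np.round(true_value + dither) - dither
print(subtractive.mean())                 # ~3.37, with lower residual noise
```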
That sqrt(N) estimate is an upper bound, because systematic errors tend to dominate. That is, how do you know your ADC is perfectly linear at this level? If it has a small offset or gain error that varies as a function of input value, you'd have to measure that transfer function and calibrate it out. Same goes for any other circuitry in the way.
All the rest: signal noise, ADC noise, reference noise, we can treat as normal uncorrelated Gaussian noise, which adds vectorially (i.e., RMS), and averages out the same.
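For example (sketch, made-up numbers): three independent sources of 10, 20 and 30 units RMS combine to about 37, not 60.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
# Three independent (made-up) Gaussian noise sources, RMS 10, 20 and 30 units:
total = rng.normal(0, 10, n) + rng.normal(0, 20, n) + rng.normal(0, 30, n)

print(np.sqrt(10**2 + 20**2 + 30**2))   # 37.4: the RMS (vector) sum
print(total.std())                      # ~37.4 from the simulation, not 10+20+30 = 60
```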
At the very least, we don't have any motivation to suspect the noise sources are anything other than Gaussian, and if they do obey different statistics, they'll probably be close enough to be fine (i.e., the central limit theorem applies). In any case, if we want to obtain any better result, we'd need to really dig in and discover the statistics of those sources, which may not be a useful gain in the end (say the modified statistical approach only yields 1-2 dB of SNR; who cares?).
Most likely, your filtering (and therefore ENOB gain) will depend on how long a user is willing to wait. A DMM must read in a useful time frame (under a second, say), and that limits how far down the spectrum the filter can operate. (A 1 s gate period can't filter anything below 1 Hz, and so on.)
Sure, you could set the gate duration obscenely high (months?), but what good is it? If you're just wanking to the reference, why not just wire the display to a constant value?

Tim