I've got the core functionality of the design up and running, and everything seems reasonable at this point. The current draw at 16 V is 140 mA, so battery life with 21700 cells should be 24 h with margin. I am waiting on a trim resistor to bring the +10V reference to 10.48576 V, which gives a 10 uV LSB on the 10 V range. The low-voltage part is very quiet and has a range of +/- 100 mV (or less, depending on the reference SW position), so I should also be able to get INL tests for the low range of the 3458 and my NVM in the future.
I've included the results from the only real test I have run so far, including the raw data if anyone wants to play around with it. The DUT is my 3458A. I used 50 evenly spaced DAC codes in a random order. For each code, I measured the voltage at each of the eight possible switch settings (DAC,GND; CT,GND; DAC,CT; CT,CT, plus the reverse of each). For each setting, I took 20 samples at 100 NPLC, kept the last 16, and recorded the average and standard deviation of those 16. I ran ACAL right before the run, but it's been a while since I did a short calibration, so the null is a bit off. The total run time was about 7 hours.
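The per-setting reduction above (take 20 readings, drop the first 4 as settling, keep the mean and standard deviation of the last 16) could be sketched like this; the function name and synthetic readings are mine, not from the actual run:

```python
import numpy as np

def reduce_samples(samples, keep=16):
    """Drop the leading settling samples and return (mean, std) of the tail.

    `samples` is the list of raw DMM readings for one switch setting
    (20 readings at 100 NPLC in the run described above).
    """
    tail = np.asarray(samples[-keep:], dtype=float)
    return tail.mean(), tail.std(ddof=1)

# Illustrative only: 20 synthetic readings around 5 V with ~0.1 uV noise
rng = np.random.default_rng(0)
readings = 5.0 + 1e-7 * rng.standard_normal(20)
mean, sd = reduce_samples(readings)
```

Recording both the mean and the sample standard deviation per setting is what makes the confidence-interval estimates later on possible without keeping every raw sample.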
For future tests, I will see whether it is actually important to randomize the order of the codes. I suspect it is not, because gain drift between codes would be a second-order effect that gets lost in the noise with any reasonably accurate DUT. I am also going to try randomizing the switch-position order while keeping the CT,CT settings as bookends. I do think this will be more consequential.
Standard deviations were reasonably good, averaging about 40 ppb of full scale at the ends and 11 ppb of full scale in the middle. With 16 samples, the total width of the 95% confidence interval is roughly equal to one standard deviation, so getting sufficiently low scatter for moderate-order polynomial fits at the sub-ppm level should be feasible in 24 hours or less. There are reversal errors at the center-tap voltages that exceed those measured between DAC and GND by several-fold (the average is 3-4 uV). I don't know why these should exist, other than that they may come from parasitic thermocouples between the leads of the switch. I will try compensating for them in future runs by switching the DAC to bipolar references, running codes on either side of zero sequentially, and averaging the error terms for each. I think this will help avoid confounding non-idealities in the source with DUT nonlinearity.
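The "CI width equals one sigma" rule of thumb for n = 16 falls out of the Student-t interval; a minimal check (the function name is mine, and the critical value is the standard two-sided 95% t value for 15 degrees of freedom):

```python
import math

# Two-sided 95% Student-t critical value for 15 degrees of freedom
T_95_DF15 = 2.131

def ci_width_95(sample_std, n=16):
    """Total width of the 95% confidence interval on the sample mean."""
    return 2 * T_95_DF15 * sample_std / math.sqrt(n)

# With n = 16: 2 * 2.131 / sqrt(16) = 1.0655, i.e. the full CI width
# is about one sample standard deviation, as claimed in the text.
ratio = ci_width_95(1.0)
```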
Anyway, with the sum errors (corrected for the offset at CT,CT), there are a couple of points that I would probably discard or rerun before fitting, but there is a clear shape to the curve. I subtracted out the reversal errors for each voltage in the sum to get "corrected" sum errors. This makes the plot symmetric about the origin, though it's not far off from that uncorrected (less the offset, of course). This correction cancels the even-order errors in both source and DUT, which is fine because the even-order terms of the transfer function can be determined from a fit to the reversal errors themselves. Just by looking at the corrected sum error plot, the dominant term appears to be fifth order, with nonlinearity errors topping out around 50 ppb. It will be interesting to see whether this shape persists with the bipolar compensation scheme I mentioned. The magnitude of the reversal errors is smaller, and without doing the analysis, my gut feeling is that the experimental power is probably not sufficient to put any even-order error terms in an INL model. Overall, I am optimistic that this source could be used to characterize INL down to at least the 0.1 ppm level, provided the transfer function is well-described by a polynomial fit without too many terms.
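The reason the reversal correction cancels even-order terms is that an even-order polynomial term contributes identically at +V and -V, so the half-difference of the two error curves keeps only odd terms while the half-sum keeps only even terms. A sketch under that assumption (function name and the synthetic coefficients are mine):

```python
import numpy as np

def split_even_odd(err_pos, err_neg):
    """Split an error curve into odd and even parts using reversal data.

    err_pos[i] is the measured error at +v[i], err_neg[i] the error at
    -v[i]. Even-order polynomial terms are identical at +V and -V, so
    the half-difference keeps only odd terms and the half-sum keeps
    only the even terms (the reversal error).
    """
    err_pos = np.asarray(err_pos, dtype=float)
    err_neg = np.asarray(err_neg, dtype=float)
    odd = 0.5 * (err_pos - err_neg)
    even = 0.5 * (err_pos + err_neg)
    return odd, even

# Illustrative check with a known error polynomial e(V) = a*V^2 + b*V^3
v = np.linspace(0.1, 10, 50)
a, b = 2e-9, 5e-10
e_pos = a * v**2 + b * v**3          # error at +V
e_neg = a * v**2 - b * v**3          # error at -V
odd, even = split_even_odd(e_pos, e_neg)
```

Fitting a polynomial to `odd` then recovers the odd-order INL terms, while `even` carries whatever even-order content the reversal data can resolve.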
I also modeled the DAC's linearity error, but this isn't really a good way of measuring it because of the amount of time spent measuring other things and the greater opportunity for drift. Here I saw maximum deviations from linearity of about 22 uV, or 2.2 ppm of full scale. I did a quicker run before this, and it was within the DAC11001B's +/- 1 ppm spec.
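For reference, the ppm-of-full-scale linearity figures quoted above can be computed as the residual from a best-fit line through the code/voltage data; this is a sketch with a hypothetical function name, and the full-scale value assumes the trimmed 10.48576 V reference mentioned earlier:

```python
import numpy as np

def inl_ppm(codes, volts, full_scale=10.48576):
    """Deviation from the best-fit line, in ppm of full scale.

    `full_scale` matches the trimmed reference voltage; the endpoints
    are not forced, so this is a best-fit (not endpoint) INL.
    """
    codes = np.asarray(codes, dtype=float)
    volts = np.asarray(volts, dtype=float)
    slope, offset = np.polyfit(codes, volts, 1)
    residual = volts - (slope * codes + offset)
    return residual / full_scale * 1e6

# A perfectly linear code-to-voltage map gives ~zero INL everywhere
codes = np.arange(50)
volts = 1e-3 * codes + 0.5
```

A 22 uV peak residual on this scale comes out to 22e-6 / 10.48576 * 1e6, about 2.1 ppm, consistent with the 2.2 ppm figure above.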