The certificate or data sheet was an excerpt; I'm assuming there is more, but they didn't show it. If they did, I'm pretty sure that would be a topic for discussion as well, although as you say, a sufficiently accurate reference DMM can substitute for a lot of that equipment. I didn't catch that the tolerances were different, but as you figured out, the real issue is that they used an ad-hoc set of test parameters that don't match what the OEM specifies. The most egregious omission is the lack of negative voltage/current testing. The 5100B uses a separate negative reference circuit and so on--getting the positive side correct guarantees you nothing on the negative side! The calibrator could be completely broken and still pass this calibration test.
I can tell you from direct experience that the OEM-specified test points are a lot harder to meet than the ones they have listed. As you surmise, the reason is that the OEM test points catch the unit at its weakest points, testing how well the various ranges match up where they 'shift gears' and so on. And a lot of the test measurements are interdependent--there are a lot of iterative adjustments that force you to compromise between two measurements to minimize the error in both. Even the OEM-specced points don't fully address the linearity of the DAC, which is another issue altogether. Fluke has published a paper, which is really an ad for their cal services so I won't link it, but in it they point out that there is no explicit law or rule that says a cal lab must use the OEM procedures. As long as they document what they did, it's kosher. So whenever someone offers to sell you something with a 'fresh calibration certificate', the correct response is not "Ooh, great! Calibrated!" but rather "OK, let me see that". See this thread:
https://www.eevblog.com/forum/metrology/can-we-believe-a-calibration-certificate/msg3141840/#msg3141840

Now for the 8846A, its specs and its mess of a current measurement implementation. The answer can be found in the specs, but I didn't see it until I had this issue and looked. First, the meter is generally more accurate than the 1-year spec. If you look at the 24-hour, 90-day and 1-year specs, you'll notice that the much tighter 24-hour specs also specify +/- 1C, while the others are +/- 5C. When these meters are calibrated, they should be calibrated in laboratory conditions after an extended soak at 23C +/- 1C. A calibration certificate that uses the 1-year specs to determine the pass/fail limits is....um...not so good? So a good portion of the additional error in the 90-day and 1-year specs can be attributed to tempco. If you take the temperature coefficient listed for temps outside the 18-28C range and apply it to the 24-hour specs, you'll see that you get pretty close to the 90-day spec. These meters don't drift much, especially if they are powered down a lot. So, unofficially at least, if I stay near 23C, I can count on this meter to be near its 24-hour specifications.
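To put rough numbers on that tempco argument, here's a back-of-the-envelope sketch in Python. The spec and tempco figures are placeholders I picked for illustration, not Fluke's actual 8846A numbers--pull the real ones from the datasheet before leaning on this.

[code]
# Sketch: how tempco can account for most of the gap between the 24-hour
# and 90-day specs. All numbers below are placeholders, NOT real 8846A specs.

def error_with_tempco(base_ppm, tempco_ppm_per_c, degrees_outside_band):
    """Base spec plus temperature-coefficient error, in ppm of reading."""
    return base_ppm + tempco_ppm_per_c * degrees_outside_band

base_24h = 15   # ppm: hypothetical 24-hour spec (23C +/- 1C)
tempco   = 5    # ppm/C: hypothetical coefficient outside the tight band
drift    = 4    # C: how far a 23C +/- 5C lab can sit from the +/- 1C band

print(error_with_tempco(base_24h, tempco, drift))   # 35 ppm
# With these placeholder numbers, the 24-hour spec plus tempco lands in
# the neighborhood of a plausible 90-day spec, which is the point above.
[/code]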
However, if you look at the 24-hour specs, you will see there are two components as usual (usually listed as % of reading + counts, or % of reading + % of range), and the second one, the offset or residual component, is larger than the percent-of-reading error on some, but not all, of the ranges. Typically the lower ranges have larger offset or residual counts, and often the best accuracy is specified using the REL or ZERO function to zero them out. However, when attempting to read a current of 1.9mA (as specified for the 5100B test point), I found it impossible to get stable readings, and the ZERO function only worked for a second or two. At 1.0mA, the meter was golden--nice, steady and presumably accurate 6.5 digits. So I went back to the specs and calculated the total specified uncertainty for a reading of 1.9mA and found that it was 0.00295mA, or 0.16%--over 3X worse than what it would be at 1mA. And my readings, though noisy and unstable, did fall just within that range, assuming my other equipment was functioning properly. That's terrible! Why? Because the 8846A has three current shunts when it really ought to have six or seven, and they prioritized low burden voltage (and perhaps shunt heating) over uniform precision in current measurement.
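For anyone who wants to check that arithmetic, it's the usual two-term spec formula: a percent-of-reading term plus a percent-of-range term. The percentages below are placeholders rather than the real 8846A spec values (look those up in the manual); what matters is how the range-scaled term dominates when you read 1.9mA on the 10mA range.

[code]
# Sketch of a two-term DMM uncertainty calc: % of reading + % of range.
# The spec percentages are placeholders, NOT the actual 8846A values.

def uncertainty_ma(reading_ma, range_ma, pct_reading, pct_range):
    """Total spec uncertainty in mA for a given reading and range."""
    return (pct_reading / 100) * reading_ma + (pct_range / 100) * range_ma

# Hypothetical spec: 0.01% of reading + 0.025% of range on both ranges.
u_19 = uncertainty_ma(1.9, 10.0, 0.01, 0.025)  # 1.9mA forced onto the 10mA range
u_10 = uncertainty_ma(1.0, 1.0, 0.01, 0.025)   # 1.0mA on its own 1mA range

print(f"1.9mA on 10mA range: {u_19:.5f}mA = {100 * u_19 / 1.9:.3f}% of reading")
print(f"1.0mA on  1mA range: {u_10:.5f}mA = {100 * u_10 / 1.0:.3f}% of reading")
# The %-of-range term doesn't shrink with the reading, so the relative
# error at 1.9mA works out several times worse than at 1.0mA.
[/code]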
If you look at the current ranges and the shunt values, you'll see that they only match up with the meter's lowest voltage range of 100mV on three ranges--1mA for the 100R shunt, 100mA for 1R, and 10A for the 10mR shunt. For the oddball 400mA and 3A ranges, which were presumably put there in response to some real or perceived customer demand, the meter actually displays one less significant digit, which reflects the realities of matching up the shunts with a voltage range without an intermediate gain amplifier. But for the 100uA, 10mA and 1A ranges--the ones with the large residual or offset component in the uncertainty--they've done something different. Just truncating a digit would make it obvious that they weren't real ranges but rather just the bottom tenth of the next range up. Adding a gain amplifier, like the 8808A has, would be an additional expense, and looking inside the meter I didn't see one. So instead, as far as I can determine, they just exposed an additional digit from the ADC, a sort of digital gain. The 8846A is nominally a 1.2 million count meter, but on certain statistical displays it shows 10 million counts, and using the PC software I believe you can get 8 digits without any missing codes. Like many meters, the internal resolution is much higher than what they display, and for good reason--those hidden digits are generally just noise. So they just shift all the digits one to the left, hide the leading zero (since the reading is always less than 0.1 of the next range up) and expose an additional digit....of noise. If they did this on voltage by creating a new 10mV range, people would quickly figure out "hey--this is just noise". But precision current measurement on the 10mA range will lurk in the background much longer until it finally snags someone like me.
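A quick way to see that matchup is to multiply each range's full-scale current by its shunt. The shunt values are the ones given above; which shunt serves which range follows from the 'bottom tenth of the next range up' description, so treat this as a sketch of my reading of the design, not the schematic. (The oddball 400mA and 3A ranges are left out.)

[code]
# Full-scale shunt voltage per current range, per the description above.
shunt_ohms = {"100R": 100.0, "1R": 1.0, "10mR": 0.010}

# (range name, full scale in amps, shunt assumed to serve it)
ranges = [
    ("100uA", 100e-6, "100R"),  # rides on the 1mA range's shunt
    ("1mA",   1e-3,   "100R"),
    ("10mA",  10e-3,  "1R"),    # rides on the 100mA range's shunt
    ("100mA", 100e-3, "1R"),
    ("1A",    1.0,    "10mR"),  # rides on the 10A range's shunt
    ("10A",   10.0,   "10mR"),
]

for name, full_scale, shunt in ranges:
    v_fs_mv = full_scale * shunt_ohms[shunt] * 1000
    if round(v_fs_mv) == 100:
        note = "matches the 100mV range"
    else:
        note = "only 10mV -- the bottom tenth of the 100mV range"
    print(f"{name:>6}: {v_fs_mv:5.1f}mV full scale ({note})")
[/code]

The three ranges that come out at 10mV full scale are exactly the ones with the big offset/residual terms in the spec table, which is no coincidence.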
I know, RTFM. But more than that, know your test instruments! They all have characteristics--some of which can be called flaws, others limitations--and they all interact with your circuit to some extent. You can't ignore that with any sort of precision measurement. The notion that "hey, it's calibrated and that's all I need to know" only works when the instrument is much more accurate than you need and its interaction with your circuit is well below the precision you need.