"3) You cannot verify with any certainty, 20mV which meter is off using a reference from unknown sources without proof of calibration."
That's demonstrably untrue. Modern silicon comes factory-trimmed to better precision than that. You could pick up a $5 voltage reference (half the cost of the AliExpress one!) and with a little care (a couple of capacitors) end up with a result accurate to within single-digit millivolts.
So yes, it is certainly possible to identify which of two multimeters (or both, or any number of them) is faulty if their discrepancy is 20 mV.
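To make that concrete, here is a minimal sketch of the comparison in Python. The 5.000 V nominal value, the ±5 mV reference uncertainty, and the two readings are made-up numbers for illustration, and `meters_outside_band` is a hypothetical helper. It also ignores each meter's own accuracy spec, which a real check would add to the allowed band.

```python
def meters_outside_band(readings, nominal_v, ref_uncertainty_v):
    """Return the names of meters whose reading differs from the
    reference's nominal voltage by more than the reference's own
    uncertainty. Simplification: meter accuracy specs are ignored."""
    return [
        name
        for name, volts in readings.items()
        if abs(volts - nominal_v) > ref_uncertainty_v
    ]


# Hypothetical readings: two meters that disagree by 20 mV.
readings = {"DMM 1": 5.002, "DMM 2": 5.022}

# Assumed reference: 5.000 V nominal, known to within +/-5 mV.
print(meters_outside_band(readings, nominal_v=5.000, ref_uncertainty_v=0.005))
```

The whole argument rests on `ref_uncertainty_v` actually being known and being well below the 20 mV discrepancy; with a band wider than that, neither meter would be flagged.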
If, and only if, the reference has been measured with a trusted voltmeter!
Otherwise, you have three devices, DMM 1, DMM 2, and voltage reference, showing the same or different readings, and you have no way of telling which reading is true, because none of the devices have been verified against a known-good reference.
That's exactly the problem with the Chinese references. They aren't that bad in terms of design and parts. But how do we know whether their output voltages match what's written on the sticker, and if they do, with exactly what degree of uncertainty?
Oh, and regarding the modern silicon... What did you have in mind? If we take, for example, the LM399, its initial tolerance is specified at 2%. That's 0.139 V at its nominal 6.95 V output. Definitely not good enough for checking DMMs out of the box without measuring and adjusting. What's the highest-precision single-chip reference? Is there anything that has at least a 0.05% tolerance?
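For anyone who wants to redo the arithmetic, here is a rough sketch in Python. The function names are made up for illustration, and the 0.05% figure is just the threshold asked about above, not the spec of any particular part.

```python
def tolerance_volts(nominal_v, tolerance_pct):
    """Convert a percentage initial tolerance into an absolute
    error band in volts at the given nominal output."""
    return nominal_v * tolerance_pct / 100.0


def can_resolve(nominal_v, tolerance_pct, discrepancy_v):
    """True if the reference's error band is narrower than the
    discrepancy we are trying to attribute to one of the meters."""
    return tolerance_volts(nominal_v, tolerance_pct) < discrepancy_v


# 2% at 6.95 V nominal, the figures quoted above for the LM399.
print(tolerance_volts(6.95, 2.0))      # roughly 0.139 V
print(can_resolve(6.95, 2.0, 0.020))   # False: band is wider than 20 mV

# The 0.05% tolerance asked about, at the same nominal voltage.
print(can_resolve(6.95, 0.05, 0.020))  # True: band is about 3.5 mV
```

This only compares the untrimmed tolerance against the 20 mV discrepancy; drift, noise, and the meters' own specs are left out.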