The list owner has reported discrepancies of as much as 20% between instruments that had been sent out for calibration before use. He has raised the topic of a group effort to design a reliably accurate reference. Bill Schmidt has PhDs in EE and chemical engineering and owns an electronics company.
He insists that the only accurate measurements of RF power he has made were done in grad school with a load, a styrofoam cup of water, and a platinum RTD.
I have only ever seen errors like this with a broken power sensor that has been overloaded (assuming we are still talking about CW; modulated/pulsed power measurement can be trickier). Otherwise, I think you'd have to completely mess up the calibration, like an error in the algorithm or a wrong connection.
That being said, calorimeters are still the gold standard and the only proper way to get a calibration traceable to DC voltage. There are good reasons that people generally don't use them for anything else. First, they are not easy to get right; in particular, you need a feedline that has both low thermal conductivity and good RF properties. But most importantly, they are very slow. If it takes 15 minutes to reach equilibrium, that isn't just cumbersome; it also adds uncertainty through drift elsewhere in the system.
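For a sense of the scales involved, here is a back-of-envelope sketch of the cup-of-water approach. The water mass and RF power are my own assumed numbers for illustration, not anything from the thread:

```python
# Crude water-load calorimeter estimate: P = m * c * dT/dt
# (losses and the heat capacity of the load itself ignored).

M_WATER_KG = 0.10    # assumed: ~100 g of water in the cup
C_WATER = 4186.0     # specific heat of water, J/(kg*K)
P_RF_W = 1.0         # assumed: 1 W of RF dissipated in the load

# Ideal heating rate if all the RF power goes into the water:
dT_dt = P_RF_W / (M_WATER_KG * C_WATER)  # K/s

print(f"heating rate: {dT_dt * 1000:.2f} mK/s")
print(f"rise over 15 min: {dT_dt * 900:.2f} K")
```

A couple of mK/s is why the measurement is so slow, and why stray heat leaks and ambient drift over those minutes matter so much.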
Of course, I definitely don't want to put you off trying to build one. I'm sure it would be a very interesting project. If you want a good starting point, you can try to find a copy of "Radio frequency & microwave power measurement" by Alan Fantom. It isn't exactly up-to-date (late 80s?) but the basics are all still valid and it has a lot of references to classic papers which may be more useful for DIY than the latest advances in mm-wave metrology.
I want to calibrate a Weinschel 0.1 dB step attenuator so I need *at least* 0.01 dB accuracy. And 0.001 dB to cal my HP 438A, 8481D & 8482A power meter and sensors.
You may want to do some preliminary uncertainty budgets before you jump in.
The attenuator, I guess, could be doable (or close enough; maybe drop the "at least"), since it is a relative measurement. I'd probably start by looking at (temperature) stability and repeatability first, to make sure it is even worth the effort. Anyway, it doesn't require accurate power measurement at all, just good linearity. Thermistor sensors with DC substitution (like an old 432A + 478A + 6.5-digit multimeter, as suggested above) are also very linear, and you don't even need the correction factors in this case, so no need for a traceable calibration or a calorimeter.

It still needs a lot of care. If you don't have a very stable source, you probably need a second reference power meter to track the incident power. At <0.1 dB accuracy, mismatch uncertainty is likely a major limiting factor as well. You might have to characterise the S-parameters of your DUT and all the components of your measurement system with a VNA first.
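To see why mismatch dominates at this level, a quick sketch using the standard 20*log10(1 ± |Γs||Γl|) ripple limits. The SWR values are assumed round numbers for illustration, not measured data:

```python
import math

def gamma_from_swr(swr: float) -> float:
    """Reflection-coefficient magnitude from SWR."""
    return (swr - 1.0) / (swr + 1.0)

def mismatch_limits_db(swr_source: float, swr_load: float):
    """Worst-case mismatch ripple limits in dB for two cascaded ports."""
    g = gamma_from_swr(swr_source) * gamma_from_swr(swr_load)
    return 20 * math.log10(1 - g), 20 * math.log10(1 + g)

# Even two quite good ports (assumed SWR 1.05 each) produce roughly
# +/-0.005 dB of ripple, half of a 0.01 dB budget:
lo, hi = mismatch_limits_db(1.05, 1.05)
print(f"mismatch ripple: {lo:+.4f} / {hi:+.4f} dB")
```

That is the worst-case bound; with measured (vector) S-parameters you can correct for it rather than just budget for it, which is why the VNA characterisation comes in.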
The second one... I am pretty sure that is way beyond the capabilities of even major metrology institutes. 0.001 dB works out to about 0.02%. The 438A manual calls for "only" 0.5%. I think NIST or PTB will do thermistor calibrations against their microcalorimeters to 0.3% (maybe 0.2%).
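The dB-to-percent conversions are easy to check directly:

```python
import math

def db_to_fraction(db: float) -> float:
    """Power-ratio error corresponding to a dB error."""
    return 10 ** (db / 10) - 1

# 0.001 dB is about 0.023% in power:
print(f"0.001 dB -> {db_to_fraction(0.001) * 100:.3f} %")

# and the 438A's 0.5% spec is about 0.022 dB:
print(f"0.5 %    -> {10 * math.log10(1.005):.4f} dB")
```

So the meter's own spec is already more than 20x looser than the 0.001 dB target, before you even get to the sensor calibration.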