Isn't the (statistically) achievable level more or less just a matter of the number of samples being averaged (i.e., a matter of the maximum allowed duration of the measurement interval)?
In theory, yes. In practice, a lot of different factors, both systematic and random, conspire to make it difficult to achieve precise phase measurements at low frequencies.
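The "in theory, yes" part can be made concrete with a quick simulation: for purely white (uncorrelated) phase noise, averaging N readings shrinks the statistical uncertainty like 1/sqrt(N). The noise level `SIGMA` and the trial counts below are arbitrary assumed values, just a sketch of the statistics, not a model of any real instrument:

```python
import math
import random

random.seed(0)

TRUE_PHASE = 0.0   # radians; the "real" phase difference (arbitrary here)
SIGMA = 1e-3       # white phase noise per reading, radians (assumed value)

def std_error_of_mean(n_samples, trials=500):
    """Estimate the std. deviation of the N-sample averaged phase reading."""
    sq = 0.0
    for _ in range(trials):
        mean = sum(random.gauss(TRUE_PHASE, SIGMA)
                   for _ in range(n_samples)) / n_samples
        sq += (mean - TRUE_PHASE) ** 2
    return math.sqrt(sq / trials)

e1 = std_error_of_mean(1)
e100 = std_error_of_mean(100)
print(e1, e100, e100 / e1)   # ratio should land near 1/sqrt(100) = 0.1
```

The catch, as the rest of this thread explains, is that the real error budget is not purely white noise, so past a point more averaging stops helping: systematics like AM-PM conversion don't average away.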
In this particular problem, AM-PM conversion will be one of the big stumbling blocks. If you try to measure the phase shift of an RF attenuator at the picosecond level, for example, you'll find that no two instruments in the house give you quite the same numbers as the attenuator is switched on and off. A high-performance counter with averaging will show phase differences that don't agree perfectly with a different brand of counter, which won't agree with the phase measurement button on your DSO, which won't agree with a direct-digital stability analyzer, and so on. These discrepancies are already in the single-digit picosecond range when measuring at 1 MHz, and they will grow as the carrier frequency falls.
One way to think of it is to remember that 'phase' is ultimately defined by zero-crossing times, even if you're not directly measuring the zero crossings themselves. To characterize the timing of the zero crossings, you have to pin down their position. That, in turn, is affected by the harmonic content of the waveform in conjunction with its slew rate. Harmonic distortion and IMD will consist of both tonal and noise-like components that arise in every stage of the signal chain, from the crystal or atomic resonators where the signals originate to the measuring instrument itself. To one extent or another, every stage the signals pass through will turn amplitude variations into phase distortion.
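To put a number on the harmonic-content effect: to first order, a distortion component shifts a zero crossing by (distortion voltage at the crossing) / (slew rate at the crossing). The sketch below adds a small second harmonic to a 1 MHz sine and locates the crossing numerically; the harmonic amplitude `A2` and phase `PHI` are assumed example values. Even a 1% harmonic can move the crossing by over a nanosecond at this frequency:

```python
import math

F0 = 1e6                 # carrier frequency, Hz (example value)
W = 2 * math.pi * F0     # angular frequency; also the slew rate of sin(Wt) at t=0
A2 = 0.01                # second-harmonic amplitude, relative (assumed)
PHI = math.pi / 3        # second-harmonic phase, radians (assumed)

def v(t):
    # Unit-amplitude fundamental plus a small second harmonic
    return math.sin(W * t) + A2 * math.sin(2 * W * t + PHI)

def zero_crossing(f, lo, hi, iters=100):
    # Bisection for the rising zero crossing bracketed by [lo, hi]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

t0 = zero_crossing(v, -0.1 / F0, 0.1 / F0)
# First-order prediction: dt = -(harmonic voltage at crossing) / (slew rate)
predicted = -A2 * math.sin(PHI) / W
print(t0, predicted)     # both roughly -1.4e-9 s, i.e. >1 ns of shift
```

The point is that the shift scales as 1/slew rate, which is why the same distortion level gets proportionally worse as the carrier frequency falls.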
Anyway, the TL;DR is that LF phase measurement is harder than it appears at first. The OP isn't trying to measure an attenuator, presumably, but the example is illustrative because similar factors will confound both measurements.
What does it report then? The average phase difference in the measurement interval?
Phase measurements between two signals at different frequencies are effectively scaled by the ratio of those frequencies. So if you know the nominal or expected frequencies of the two signals (and you generally do), you can scale the readings to make it appear that the signals were measured at the same frequency.
In a measurement where one signal is designated as the "reference" and the other as the "DUT," the reference phase is typically scaled to conform to the nominal DUT frequency. This makes the phase-difference measurement independent of the frequency of the reference in use. For the OP's problem, the two signals are presumed to be at the same frequency, so this extra complication doesn't apply.
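The rescaling itself is just a frequency ratio: convert the reference phase reading to time, then back to phase at the DUT's nominal frequency. A minimal sketch, with the 5 MHz reference and 10 MHz DUT frequencies being arbitrary example values:

```python
F_REF = 5e6    # reference nominal frequency, Hz (example)
F_DUT = 10e6   # DUT nominal frequency, Hz (example)

def scale_ref_phase(phase_ref_cycles, f_ref=F_REF, f_dut=F_DUT):
    """Re-express a reference phase reading in DUT cycles.

    Equivalent to converting phase to time (t = phase / f_ref) and
    back to phase at the DUT frequency (phase' = t * f_dut).
    """
    return phase_ref_cycles * (f_dut / f_ref)

# 0.001 cycle at 5 MHz is 200 ps of time offset,
# which corresponds to 0.002 cycle at 10 MHz.
print(scale_ref_phase(0.001))
```

Working in time units (seconds) instead of cycles sidesteps the scaling entirely, which is one reason stability analyzers often report phase as time.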