You're right. I suppose with any large dataset it's best to work out how you will use it first and structure it accordingly.
Yes, zoomed out over months/years it will be difficult to see individual transitions, just sort of "dithering". But... there are aggregate calculations done on the timeseries, such as "total transitions" and "total on time". Those are calculated dynamically for whatever time period is selected.
However, by storing the "percentage of on time" and the "total transitions" per period, that data can be reconstructed later.
The gas bill is fine, but it's out of phase and messed up by estimated readings, price changes and all manner of things. It would be nice to see total runtime per period and average transitions per day over a period... so I can deduce trends rather than absolute values. "Did the insulation I installed make this winter better?", "Did the sympathetic heating responses lower the amount of cycling?"
It's looking like % + count is the way to go. I can still keep the absolute state transition history for 1 week to diagnose and monitor day-to-day behaviour.
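Something along these lines is roughly what I mean - a sketch only; `summarise_period` and the (timestamp, bool) event shape are illustrative, not my actual schema:

```python
from datetime import datetime, timedelta

def summarise_period(transitions, state_before, start, end):
    """Boil raw on/off transitions down to (% on time, transition count)."""
    on_seconds = 0.0
    count = 0
    cursor, state = start, state_before   # state carried in from before the period
    for t, s in transitions:
        if t < start or t >= end:
            continue                       # ignore events outside the period
        if state:
            on_seconds += (t - cursor).total_seconds()
        cursor, state = t, s
        count += 1
    if state:                              # period ends with the boiler still on
        on_seconds += (end - cursor).total_seconds()
    pct_on = 100.0 * on_seconds / (end - start).total_seconds()
    return pct_on, count

# run nightly for the previous day, store the two numbers,
# then prune raw transitions older than a week
day = datetime(2024, 1, 15)
events = [(day + timedelta(hours=6), True),
          (day + timedelta(hours=8, minutes=30), False)]
print(summarise_period(events, False, day, day + timedelta(days=1)))
# -> (10.41..., 2)
```

Total runtime for any stored period is then just pct_on/100 × period length, so the trend questions above come straight out of the stored aggregates.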
As an aside, the separate mind bender I'm working with is current target temperatures. I have many things - an indeterminate number of things - that can set a "target" to aim for. The trouble is, it's perfectly valid for multiple targets to exist for each zone. The sources of such targets ("schedules") are not aware of each other, so they publish competing data for the same value. I would like to record/display the boiled-down reality, i.e. the highest active target for each zone. Easy. Except it's a distributed, concurrent, async, event-driven architecture. My solution is to store the timestamp of every target temperature for a zone against its value, overriding any existing entry on update. When I want to publish the target temp for a given period, I just find the highest key in the data which isn't expired/stale. Not finished it, but it looks good so far.
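Roughly like this - a sketch, not the finished thing; the 15 minute TTL and the names are placeholders, and the real version has to cope with concurrent async updates per zone:

```python
from datetime import datetime, timedelta

TTL = timedelta(minutes=15)          # assumed staleness window for a target

class ZoneTargets:
    """Per-zone map of target temperature -> timestamp it was last asserted."""

    def __init__(self):
        self._targets = {}

    def update(self, target_temp, when=None):
        # later publications of the same value simply override the timestamp
        self._targets[target_temp] = when or datetime.utcnow()

    def current(self, now=None):
        """Highest target whose timestamp hasn't gone stale, or None."""
        now = now or datetime.utcnow()
        live = [t for t, ts in self._targets.items() if now - ts <= TTL]
        return max(live, default=None)

# two schedules publish competing targets for the same zone;
# the boiled-down reality is whichever is highest and still fresh
zone = ZoneTargets()
zone.update(18.0)     # background schedule
zone.update(21.5)     # boost schedule
print(zone.current()) # -> 21.5 (highest non-stale target)
```

Because each update only overwrites a timestamp keyed by value, the schedules never need to know about each other; a target simply drops out of contention once its publisher stops re-asserting it.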