If it only outputs data when the data changes, then you can make the obvious inference of the values between successive outputs. Simply fill those values into your time series.
Expanding data within the record of truth by interpolation would definitely fall into the category of "NEVER EVER" in data science.
This can, however, be done in the presentation layer, in much the same way it is in a digital scope when interpolation is enabled.
Grafana supports many different ways to interpolate from none through linear to various standard curves.
The data selection and aggregation is very expansive, much more so than the Siglent scope I have. That's without touching the exotic plots either. Things like grouping by non-negative difference/derivative/integral. Some of this I hope to use for classifying electrical consumption into "buckets" by "size of delta" and aiming towards classifying individual devices.
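The "size of delta" classification can be prototyped outside Grafana entirely. A minimal stdlib Python sketch, with made-up readings and made-up thresholds, just to show the non-negative-difference-then-bucket idea:

```python
# Hypothetical cumulative energy meter readings (Wh)
readings = [1000, 1005, 1005, 1305, 1310, 3310]

# Non-negative difference: per-interval consumption, clamped at zero
# (so meter resets don't produce negative consumption)
deltas = [max(b - a, 0) for a, b in zip(readings, readings[1:])]

def bucket(delta):
    """Classify a consumption step by size (thresholds are invented)."""
    if delta == 0:
        return "idle"
    if delta < 50:
        return "small load"
    if delta < 500:
        return "medium load"
    return "large load"

print([(d, bucket(d)) for d in deltas])
# → [(5, 'small load'), (0, 'idle'), (300, 'medium load'),
#    (5, 'small load'), (2000, 'large load')]
```

The same grouping Grafana does server-side can then be checked against this offline.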
If it only outputs data when the mood takes it, then that is equivalent to "missing or sparse samples". I believe there are standard DSP techniques to deal with that, but I am not familiar with them.
Consider a magnetic door sensor. It takes about 3 milliseconds for it to emit the event that the contact position has changed. So, if you want an accurate representation in your DSP model, you would need a sample period of 1.5ms to accurately record the transitions? Over a 24 hour period that is a lot of data. Over a 10 year period you are talking about £1000s in monthly storage costs.
If you do use sparse time-series recording of events then, when the door opens and closes only 3 times a day, we only store 6 records. Without losing ANY precision or accuracy. In fact the effective sample resolution is 1x10^-9 seconds. If someone manages to open and close it 10 times in one minute, it will be recorded.
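The sparse approach can be sketched in a few lines of stdlib Python (timestamps and states here are invented): store one (timestamp, state) pair per change, and recover the state at any instant by looking up the last change at or before it.

```python
from bisect import bisect_right

# Sparse record of truth: one (timestamp_ns, state) pair per change.
# Integer nanoseconds since the epoch; values are hypothetical.
events = [
    (1_700_000_000_000_000_000, "open"),
    (1_700_000_012_000_000_000, "closed"),
    (1_700_000_100_000_000_000, "open"),
]

def state_at(ts_ns, events):
    """Return the door state at an arbitrary instant by finding the
    last recorded change at or before ts_ns."""
    i = bisect_right([t for t, _ in events], ts_ns)
    return events[i - 1][1] if i else None

print(state_at(1_700_000_050_000_000_000, events))  # → closed
```

Six records a day, yet the state is known at nanosecond resolution for every instant in between.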
So the questions I am asking are surrounding how to "quantise", select, aggregate, group, box, window the data to give the best visualisation and produce the most "useful" graph.
That entirely depends on the nature of the data and what you will use the derived information for.
Yes. Now you are getting it. Take the next step and you will be there. This system can record any data, from anything, at any frequency or periodicity, in true and accurate temporal form, without having to store (or even process) a contiguous stream of data points. It doesn't even require any configuration. You just throw data at it whenever you want.
How quickly can a 1 minute rolling average of counts per minute change? That's open to debate, but as it's sampled once a second, then that is the maximum frequency available. Thus it "could" only tell me about features on timescales of two minutes and longer?
A moving average is a form of low-pass filter.
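That can be seen directly: an N-point moving average passes slow trends and attenuates anything that varies faster than its window. A quick stdlib sketch (signals invented for illustration):

```python
def moving_average(xs, n):
    """Simple N-point trailing moving average: a crude low-pass FIR
    filter with n equal taps of 1/n."""
    return [sum(xs[i - n + 1:i + 1]) / n for i in range(n - 1, len(xs))]

# A signal alternating every sample (the fastest possible variation)
fast = [0, 10] * 8
print(moving_average(fast, 2))   # flattened to 5.0 everywhere: fully attenuated

# A slow ramp passes through almost unchanged (just delayed half a sample)
slow = list(range(16))
print(moving_average(slow, 2))   # 0.5, 1.5, 2.5, ...
```

The fast component is wiped out while the slow one survives, which is exactly the low-pass behaviour.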
Consult statistics and signal-processing textbooks.
What is a "count per minute" then averaged with a low-pass filter? It's already a time-compounded metric which is then averaged again over time.
The ideal, although heavy on resources, would be to process each individual "count" event. That would be the highest accuracy of recording. As an ionisation event is one of the most random things possible (in context), its time-domain resolution is effectively unbounded, down to the Planck scale. However, nanosecond resolution would do just fine.
This would work fine for a background of ~20 CPM, but if I put it onto a uranium-glazed plate it will destroy my database server and my network with tens of thousands of events per second.
I could write better firmware for the device to emit "counts" bucketed to the second for higher resolution, but I have what I have for now. I am stuck with the rolling average and the 1 minute "bucket" size and reporting interval.
What would I be missing? At a long reach, if my counts were constantly coming in small groups like C.C.C.C................CC.C.C.C..... the CPM would not tell me.
If you were to use DSP techniques on this data, as I think you are suggesting, you would have to pick a rather large bucket size, aka period, then count and group the ionisation events into a single integer each period. In an impure, interrupted and mutated form. Given you want to store 10 years of data, that bucket size will need to be HUGE if you are recording a non-sparse data series. Good luck processing or storing it.
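The bucketing step itself is a one-liner; what matters is the period you are forced to commit to up front. A stdlib sketch with invented timestamps:

```python
from collections import Counter

BUCKET_NS = 1_000_000_000  # 1-second buckets; the choice of period is the trade-off

# Hypothetical ionisation-event timestamps, in nanoseconds
event_ts = [100, 950_000_000, 1_200_000_000, 1_300_000_000, 2_500_000_000]

# Integer-divide each timestamp by the bucket size to get its bucket index,
# then count events per bucket
counts = Counter(ts // BUCKET_NS for ts in event_ts)
print(sorted(counts.items()))  # → [(0, 2), (1, 2), (2, 1)]
```

Once the events are collapsed into those integers, the sub-bucket timing (the C.C.C.C....CC.C pattern above) is gone for good.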
In my setup I could store every single ionisation event accurate to the nanosecond and still use orders of magnitude less storage and less processing. I can then do any, all and more mathematical and statistical wizardry on that data to produce the same as the DSP would... and much more.
Anyway. It's likely we are basically saying the same thing; it's done in different ways, in different places. If you store the raw data as and when it occurred, you can always run your bucketing and averaging to produce a contiguous stream at a given frequency later. If you do it while you are recording the data, you are stuck with your low-resolution approximation.
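That after-the-fact conversion is cheap. A sketch (function name and timestamps are mine, for illustration): turn sparse raw timestamps into a contiguous, fixed-period count series, zeros included, at whatever bucket size is chosen later.

```python
def to_contiguous_counts(event_ts_ns, bucket_ns, start_ns, end_ns):
    """Derive a contiguous fixed-period count series (zeros included)
    from sparse event timestamps, at a bucket size chosen after the fact."""
    n = (end_ns - start_ns) // bucket_ns
    series = [0] * n
    for ts in event_ts_ns:
        if start_ns <= ts < end_ns:
            series[(ts - start_ns) // bucket_ns] += 1
    return series

events = [150, 2_100, 2_900, 7_500]  # hypothetical nanosecond timestamps
print(to_contiguous_counts(events, 1_000, 0, 10_000))
# → [1, 0, 2, 0, 0, 0, 0, 1, 0, 0]
```

Run it again with a different `bucket_ns` and you get a different-resolution series from the same raw record; going the other way, from buckets back to events, is impossible.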
Maybe I am still wrong, but I can't think in terms of frequency when I have non-periodic samples. Or at least the data storage does not require or retain any periodicity, only the individual discrete sample time stamps.
Non-periodic data still has frequency domain representations. Consider "white" noise, "pink" noise, Poisson processes, (time-domain) impulses.
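One way to see it: bin a random event train into an impulse series and take a naive DFT. For uncorrelated (Poisson-like) arrivals the power spectrum comes out roughly flat, i.e. "white". A stdlib sketch with made-up parameters, illustrative only:

```python
import cmath
import random

random.seed(1)

# Bin a Poisson-ish event train into a 0/1 impulse series
n = 256
impulses = [1 if random.random() < 0.05 else 0 for _ in range(n)]

def dft_power(xs):
    """Naive O(N^2) DFT power spectrum; fine for a small illustrative series."""
    N = len(xs)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / N)
                    for i, x in enumerate(xs))) ** 2
            for k in range(N // 2)]

power = dft_power(impulses)
# Apart from the DC term (power[0], the squared event count), the
# spectrum has no preferred frequency: broadly flat, i.e. white.
```

So "non-periodic" does not mean "no frequency content"; it just means the content isn't concentrated at discrete lines.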
Consider the frequency domain of when I drive my car into the driveway.
You might ask, why would I care about capturing those events with such an accurate timestamp? Well, I don't record these events for pure amusement. There are automations which respond to them, so when one event happens, a cascade of other events may, or may not, happen. Having highly accurate timestamps on discrete events allows their execution order to be determined, and cause and effect to be known.