Hi, I'm Paul and I have a data habit.
paulca:
So I have been pondering your points and wondering why I'm struggling to apply any of it.
So here goes.
The data is not recorded in the frequency domain. It is recorded in the time domain. I'm probably using the terms incorrectly, so let me explain.
Data is NOT periodic. It has no natural frequency; it can be recorded at any frequency, or at multiple frequencies concurrently.
All data points are stored as:
Nano-second Timestamp -> Data point
This is the nature of "time series databases".
There is no "sample frequency" in this data. The source devices emitting data decide when is the most appropriate time to send a sample. These vary from fixed intervals of 5 seconds to 5 minutes. A lot of them incorporate "emit on change" mechanics. So if the value hasn't changed they only emit a repeat value every 5 minutes. However if the value is changing rapidly it might output a data point every few seconds.
The data server is designed to handle thousands of concurrent streams of data and tens of thousands of messages a second. So, depending on the buffering available, you could probably point an ADC at it and use it as a scope! It would not be a very good one though, given all the "variable delays" caused by a distributed set of "generic concurrent" OSs and async application layers. EDIT: If the ADC can send a nano-second timestamp for each sample, then this problem goes away. You can pass the source device's timestamp into the data server if you wish. You just gotta remember that if your clocks are not in sync, your data is not in sync.
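For concreteness, this is roughly what passing the device's own nanosecond timestamp looks like if the server speaks an InfluxDB-style line protocol; the measurement and tag names are hypothetical, and the exact wire format depends on the server actually in use:

--- Code: ---
import time

def line_protocol_point(measurement, tags, fields, ts_ns=None):
    """Build one write line carrying an explicit nanosecond timestamp.
    If the source device supplies its own timestamp, pass it in; otherwise
    fall back to this machine's clock (and accept the clock-sync caveat)."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# hypothetical measurement/tag names; timestamp taken from the device itself
print(line_protocol_point("geiger", {"device": "gm_tube_1"}, {"cpm": 21},
                          ts_ns=1712743590123456789))
--- End code ---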
So the questions I am asking are surrounding how to "quantise", select, aggregate, group, box, window the data to give the best visualisation and produce the most "useful" graph.
In the case of the radiation data, it is by its very nature random. The device provides its 1-minute rolling average "counts per minute" approximately every minute.
How quickly can a 1 minute rolling average of counts per minute change? That's open to debate, but as it's sampled once a second, then that is the maximum frequency available. Thus it "could" only tell me information about every 2 minutes and below?
Anyway. While the data is random, it lies within a "nominal" range. Peaks and troughs are random, but also important. This is why my first attempt was to show the highest sample within any "interval" (peak). Note that the interval is dynamic based on your "zoom" level, much like a digital scope screen: it aggregates multiple samples into one pixel depending on your selector setting (min/max/mean etc.). Then I added a forced longer interval of 1 hour and took the mean... and a further 1-day interval and took the mean.
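A minimal sketch of that zoom-dependent aggregation, done here in pandas purely for illustration (in practice the database/Grafana query does this work): bin the timestamped samples into one min/max/mean row per interval, where the interval width tracks the zoom level.

--- Code: ---
import pandas as pd

# raw samples: timestamp -> CPM reading (a handful of made-up points)
raw = pd.Series(
    [18, 25, 21, 40, 19, 22],
    index=pd.to_datetime(["2024-04-10 09:00", "2024-04-10 09:01", "2024-04-10 09:02",
                          "2024-04-10 09:03", "2024-04-10 09:04", "2024-04-10 09:05"]),
)

# one row per "pixel": the interval width depends on how far you are zoomed out
interval = "2min"
per_pixel = raw.resample(interval).agg(["min", "max", "mean"])
print(per_pixel)
--- End code ---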
What I feel would be useful next is a plotted "standard distribution" over the longer periods of 1 day, 1 week, 1 month and maybe a graph showing the 5% outliers.
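A sketch of that next step, again with pandas and a synthetic stand-in series: per-day mean and standard deviation, plus the samples above the 95th percentile as the "5% outliers".

--- Code: ---
import numpy as np
import pandas as pd

# synthetic stand-in for the stored CPM readings: roughly one point per minute
rng = np.random.default_rng(0)
idx = pd.date_range("2024-04-01", periods=7 * 24 * 60, freq="1min")
cpm = pd.Series(rng.poisson(20, size=len(idx)), index=idx)

daily = cpm.resample("1D").agg(["mean", "std"])    # distribution summary per day
p95 = cpm.quantile(0.95)                           # threshold for the top 5%
outliers = cpm[cpm > p95]                          # the "5% outliers" to plot separately
print(daily.head(), outliers.head(), sep="\n")
--- End code ---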
Maybe I am still wrong, but I can't think in terms of frequency when I have non-periodic samples. Or at least the data storage does not require or retain any periodicity, only the individual discrete sample time stamps.
tggzzz:
--- Quote from: paulca on April 10, 2024, 09:46:30 am ---So I have been pondering your points and wondering why I'm struggling to apply any of it.
...
There is no "sample frequency" in this data. The source devices emitting data decide when is the most appropriate time to send a sample. These vary from fixed intervals of 5 seconds to 5 minutes. A lot of them incorporate "emit on change" mechanics. So if the value hasn't changed they only emit a repeat value every 5 minutes. However if the value is changing rapidly it might output a data point every few seconds.
--- End quote ---
If it only outputs data when the data changes, then you can make the obvious inference of the values between successive outputs. Simply fill those values into your time series.
If it only outputs data when the mood takes it, then that is equivalent to "missing or sparse samples". I believe there are standard DSP techniques to deal with that, but I am not familiar with them.
--- Quote ---So the questions I am asking are surrounding how to "quantise", select, aggregate, group, box, window the data to give the best visualisation and produce the most "useful" graph.
--- End quote ---
That entirely depends on the nature of the data and what you will use the derived information for.
--- Quote ---How quickly can a 1 minute rolling average of counts per minute change? That's open to debate, but as it's sampled once a second, then that is the maximum frequency available. Thus it "could" only tell me information about every 2 minutes and below?
--- End quote ---
A moving average is a form of low-pass filter.
Consult statistics textbooks and signal processing textbooks :)
--- Quote ---Maybe I am still wrong, but I can't think in terms of frequency when I have non-periodic samples. Or at least the data storage does not require or retain any periodicity, only the individual discrete sample time stamps.
--- End quote ---
Non-periodic data still has frequency domain representations. Consider "white" noise, "pink" noise, Poisson processes, (time-domain) impulses.
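One standard tool for getting a frequency-domain view of unevenly or sparsely sampled data like this is the Lomb-Scargle periodogram; a minimal sketch with synthetic data (the hidden 0.01 Hz component is only there to show the peak being recovered):

--- Code: ---
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(1)

# irregularly spaced sample times over one hour, with a hidden 0.01 Hz component
t = np.sort(rng.uniform(0, 3600, 500))
y = np.sin(2 * np.pi * 0.01 * t) + rng.normal(0, 0.5, t.size)

# Lomb-Scargle works directly on unevenly spaced samples, no resampling needed
freqs_hz = np.linspace(0.001, 0.05, 2000)
pgram = lombscargle(t, y - y.mean(), 2 * np.pi * freqs_hz)
print(freqs_hz[np.argmax(pgram)])   # peaks near 0.01 Hz
--- End code ---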
jonpaul:
OCD, get therapy (:-:)
paulca:
--- Quote from: tggzzz on April 10, 2024, 10:16:09 am ---If it only outputs data when the data changes, then you can make the obvious inference of the values between successive outputs. Simply fill those values into your time series.
--- End quote ---
Expanding data within the record of truth by interpolation would definitely fall into the category of "NEVER EVER" in data science.
This will, however, be done in the presentation layer, in much the same way it is in a digital scope when interpolation is enabled.
Grafana supports many different ways to interpolate from none through linear to various standard curves.
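As a sketch of what that presentation-layer fill looks like outside Grafana (pandas here, purely illustrative): the stored sparse records stay untouched, and the filling or interpolation is applied only to the series being drawn.

--- Code: ---
import pandas as pd

# stored record of truth: sparse "emit on change" events, e.g. a door contact
stored = pd.Series(
    [0, 1, 0],
    index=pd.to_datetime(["2024-04-10 08:00:03.123", "2024-04-10 12:30:41.456",
                          "2024-04-10 12:35:12.789"]),
)

# presentation only: project onto a regular grid for plotting; stored data is untouched
drawn_step = stored.resample("1min").ffill()           # step / "last value" fill
drawn_linear = stored.resample("1min").interpolate()   # linear interpolation
print(drawn_step.tail(3))
print(drawn_linear.tail(3))
--- End code ---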
The data selection and aggregation is very expansive, much more so than the Siglent scope I have. That's without touching the exotic plots either. Things like grouping by non-negative difference/derivative/integral. Some of this I hope to use for classifying electrical consumption into "buckets" by "size of delta" and aiming towards classifying individual devices.
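A hedged sketch of that consumption-classification idea: take the non-negative difference of a power reading and bucket each upward jump by its size. The thresholds and bucket labels below are invented for illustration and would need tuning per household.

--- Code: ---
import pandas as pd

# whole-house power readings in watts, recorded whenever the meter emits a change
watts = pd.Series(
    [120, 220, 2220, 2223, 223, 3023, 225],
    index=pd.to_datetime(["2024-04-10 18:00", "2024-04-10 18:05", "2024-04-10 18:06",
                          "2024-04-10 18:20", "2024-04-10 18:45", "2024-04-10 19:02",
                          "2024-04-10 19:30"]),
)

steps = watts.diff().clip(lower=0)    # non-negative difference: switch-on events only
steps = steps[steps > 50]             # ignore small jitter (threshold is a guess)

# bucket each jump by its size; edges and labels are placeholders
buckets = pd.cut(steps, bins=[50, 500, 1500, 3500],
                 labels=["small load", "kettle-ish", "oven/shower-ish"])
print(pd.concat({"delta_w": steps, "guess": buckets}, axis=1))
--- End code ---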
--- Quote ---If it only outputs data when the mood takes it, then that is equivalent to "missing or sparse samples". I believe there are standard DSP techniques to deal with that, but I am not familiar with them.
--- End quote ---
Consider a magnetic door sensor. It takes about 3 milliseconds for it to emit the event that the contact position has changed. So, if you want an accurate representation in your DSP model, you would need a sample period of 1.5 ms to accurately record the transitions? Over a 24-hour period that is a lot of data. Over a 10-year period you are talking about thousands of pounds in monthly storage costs.
If you do use sparse time-series recording of events then, when the door opens and closes only 3 times a day, we only store 6 records, without losing ANY precision or accuracy. In fact the timestamp resolution is 1x10^-9 seconds. If someone manages to open and close it 10 times in one minute, it will be recorded.
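The back-of-envelope numbers behind that comparison, with an assumed (illustrative, not measured) 16 bytes per stored record:

--- Code: ---
SECONDS_PER_DAY = 24 * 60 * 60
RECORD_BYTES = 16                    # assumed: 8-byte timestamp + 8-byte value

# dense sampling at 1.5 ms, fast enough to catch a 3 ms transition
dense_per_day = SECONDS_PER_DAY / 0.0015          # 57.6 million samples a day
dense_bytes_day = dense_per_day * RECORD_BYTES    # ~0.92 GB/day, ~3.4 TB over 10 years

# sparse event recording: 3 open/close cycles a day -> 6 records a day
sparse_bytes_day = 6 * RECORD_BYTES               # 96 bytes a day

print(f"dense:  {dense_bytes_day / 1e9:.2f} GB/day")
print(f"sparse: {sparse_bytes_day} bytes/day")
--- End code ---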
--- Quote ---
--- Quote ---So the questions I am asking are surrounding how to "quantise", select, aggregate, group, box, window the data to give the best visualisation and produce the most "useful" graph.
--- End quote ---
That entirely depends on the nature of the data and what you will use the derived information for.
--- End quote ---
Yes. Now you are getting it. Take the next step and you will be there. This system can record any data, from anything, at any frequency or periodicity, in true and accurate temporal form, without having to store (or even process) a contiguous stream of data points. It doesn't even require any configuration. You just throw data at it whenever you want.
--- Quote ---
--- Quote ---How quickly can a 1 minute rolling average of counts per minute change? That's open to debate, but as it's sampled once a second, then that is the maximum frequency available. Thus it "could" only tell me information about every 2 minutes and below?
--- End quote ---
A moving average is a form of low-pass filter.
Consult statistics textbooks and signal processing textbooks :)
--- End quote ---
What is a "count per minute" then averaged with a low pass filter? It's already a time compound metric which is then averaged again over time.
The ideal, although heavy on resources, would be to process each individual "count" event. That would be the highest accuracy of recording. As an ionisation event is one of the most random things possible (in context), its time-domain resolution is effectively unlimited, down to the Planck scale. However, nano-second would do just fine.
This would work fine for a background of ~20 CPM, but if I put it onto a uranium-glazed plate it will destroy my database server and my network with tens of thousands of events per second.
I could write better firmware for the device to emit "counts" bucketed to the second for higher resolution, but I have what I have for now. I am stuck with the rolling average and the 1 minute "bucket" size and reporting interval.
What would I be missing? At a long reach, if my counts were constantly coming in small groups like C.C.C.C................CC.C.C.C..... the CPM would not tell me.
If you were to use DSP techniques on this data, as I think you are suggesting, you would have to pick a rather large bucket size, aka period, then count and group the ionisation events into a single integer each period, in an impure, interrupted and mutated form. Given you want to store 10 years of data, that bucket size will need to be HUGE if you are recording a non-sparse data series. Good luck processing or storing it.
In my setup I could store every single ionisation event accurate to the nanosecond and still use orders of magnitude less storage and less processing. I can then do any, all and more mathematical and statistical wizardry on that data to produce the same as the DSP would... and much more.
Anyway. It's likely we are basically saying the same thing, done in different ways, in different places. If you store the raw data as and when it occurred, you can always run your bucketing and averaging onto a contiguous stream at a given frequency later. If you do it while you are recording the data, you are stuck with your low-resolution approximation.
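A sketch of that "bucket later" idea, assuming raw per-event timestamps had been stored (the event stream below is synthetic): regenerate a counts-per-minute series at query time, at whatever interval suits the plot.

--- Code: ---
import numpy as np
import pandas as pd

# stand-in for stored ionisation events: one timestamp each, exponential gaps
# averaging 3 s between counts, i.e. roughly 20 counts per minute of background
rng = np.random.default_rng(2)
gaps_s = rng.exponential(3.0, size=2000)
event_times = pd.Timestamp("2024-04-10") + pd.to_timedelta(np.cumsum(gaps_s), unit="s")

# bucket at read time into whatever interval suits the question being asked
events = pd.Series(1, index=event_times)
cpm = events.resample("1min").sum()    # the classic counts-per-minute series
cph = events.resample("1h").sum()      # or per hour, from the same raw events
print(cpm.describe())
--- End code ---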
--- Quote ---
--- Quote ---Maybe I am still wrong, but I can't think in terms of frequency when I have non-periodic samples. Or at least the data storage does not require or retain any periodicity, only the individual discrete sample time stamps.
--- End quote ---
Non-periodic data still has frequency domain representations. Consider "white" noise, "pink" noise, Poisson processes, (time-domain) impulses.
--- End quote ---
Consider the frequency domain of when I drive my car into the driveway.
You might ask, why would I care about capturing those events with such an accurate timestamp? Well, I don't record these events for pure amusement. There are automations which respond to them, so when one event happens a cascade of other events may, or may not, happen. Having highly accurate timestamps on discrete events allows their execution order to be determined, and cause and effect to be known.
paulca:
--- Quote from: jonpaul on April 10, 2024, 10:59:11 am ---OCD, get therapy (:-:)
--- End quote ---
Already in therapy, but not for OCD.
Obsessive? Maybe a little, but what healthy hobby isn't? Compulsive? While I do feel the want to record data, I don't feel compelled in any way to do so. I don't struggle if I miss some data or even if I lose a whole hard disk of it. My hobby development works at a glacial pace compared to the day job.