Author Topic: Software to store, aggregate, analyse and display ~1G points of streamed data  (Read 2240 times)


Offline rs20Topic starter

  • Super Contributor
  • ***
  • Posts: 2322
  • Country: au
I have a sensor that produces 12-bit samples at 1 kHz, which will be left running for (let's say) a couple of weeks. This means that over the 2 weeks, the sensor will produce about a billion samples. I'm looking for a software solution that can store this data as it's streamed in, and perform analysis and visualization at any time during or after the 2 weeks.

One example of the sorts of computation I might want to do is take the deltas between successive samples ("differentiate"), clamp those deltas within a given range, and then do a cumulative sum ("integrate"). I'd then want a graph of those results (obviously decimated, I don't want to be drawing 1G points of data).
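To make that concrete, here's roughly what I mean as a NumPy sketch (the clamp limits and the number of plotted points are just placeholders, and I'm assuming the samples already sit in an int16 array):

```python
import numpy as np

def diff_clamp_integrate(samples, lo=-5, hi=5):
    """Delta between successive samples, clamp each delta, cumulative sum."""
    deltas = np.diff(samples.astype(np.int64))   # "differentiate"
    clamped = np.clip(deltas, lo, hi)            # clamp to [lo, hi]
    return np.cumsum(clamped)                    # "integrate"

def decimate_minmax(y, target_points=2000):
    """Min/max per bucket, so spikes survive decimation for plotting."""
    n = len(y) // target_points * target_points
    buckets = y[:n].reshape(target_points, -1)
    return buckets.min(axis=1), buckets.max(axis=1)

# e.g.:
# samples = np.fromfile("sensor.raw", dtype=np.int16)
# trace = diff_clamp_integrate(samples)
# lo_env, hi_env = decimate_minmax(trace)
```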

It might be tempting to suggest some sort of SQL database or similar because it can store stuff and accept queries in the interim, but SQL databases are much more geared toward creating indexes and joining across tables; I suspect they're very far from optimal for simple sequential data / convolution-style operations. I'd shudder to think how long the example calculation above would take on a billion rows in a SQL database, quite apart from the inelegance of expressing it in SQL.

In particular, the ability to perform streaming computation, and store those results seems like an essential feature of any reasonable solution here (so that the results of the computations can be simply retrieved rather than computed on the fly).

Bonus points if the solution is free (w.r.t. money), and slight bonus points if the visualization is accessible on the internet (either tool has an HTTP interface, or the whole thing is in the cloud).

I could spend weeks crafting a custom bit of software for this, but I'd really prefer to use something a bit more generic in the hope that I'll get some features for free (w.r.t. effort).
 

Offline djacobow

  • Super Contributor
  • ***
  • Posts: 1170
  • Country: us
  • takin' it apart since the 70's
Take a look at InfluxDB and TimescaleDB. Those are the two solid time series databases. Amazon also has a new offering; I forget its name.

For visualization, the typical companion is Grafana. People also use Kibana and Chronograf.

I like TimescaleDB because it is actually just a plug-in for postgres. Basically, you can still do SQL-y stuff, but you can mark certain tables as time series.

Performance may be tricky, though. You might try writing in chunks of several seconds at once.
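For the chunked writes, something along these lines with the Python influxdb client is roughly what I mean (measurement and field names are placeholders; untested sketch):

```python
# Rough sketch of batched writes to InfluxDB; names are placeholders.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="sensor")

def write_chunk(samples, t0_ns, period_ns=1_000_000):   # 1 kHz -> 1 ms spacing
    points = [{
        "measurement": "adc",
        "time": t0_ns + i * period_ns,
        "fields": {"value": int(s)},
    } for i, s in enumerate(samples)]
    client.write_points(points, time_precision="n", batch_size=5000)

# call write_chunk() every few seconds with the samples buffered since the last call
```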
 
The following users thanked this post: rs20

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
I would probably just store it as raw data in some files; I don't think any database is designed for this. Maybe keep track of the files or metadata in a database, but the actual data points should be stored as 16-bit integers in a file (not as strings).
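Something as simple as this (Python, padding the 12-bit samples into int16; the file name is arbitrary):

```python
# Append each incoming block as raw little-endian int16 (12-bit samples fit easily).
import numpy as np

def append_block(path, samples):
    with open(path, "ab") as f:
        f.write(np.asarray(samples, dtype=np.int16).tobytes())

# reading it all back later:
# data = np.fromfile("sensor.raw", dtype=np.int16)
```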
Email: OwOwOwOwO123@outlook.com
 

Offline OwO

  • Super Contributor
  • ***
  • Posts: 1250
  • Country: cn
  • RF Engineer.
So am I understanding this right, that you want something that can store and display the data as a graph on a web UI, with zooming and all that? If that's the case, there is no existing solution that can handle 1M points. I looked into exactly this problem a while ago (in my case it was about displaying waveforms from an SDR receiver over a web UI). The conclusion was that I had to write everything myself, including the graph UI, if performance matters at all. The server side has to precompute min/max downsampled "mipmaps" of the data and serve bits of them on request, while the client side has to transparently switch between mipmap levels (or portions of the raw data when fully zoomed in). There is currently no graphing library that will do this.
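The precomputation I mean is basically this kind of thing (sketch; a factor of 16 per level is an arbitrary choice):

```python
# Precompute min/max "mipmap" levels; the viewer picks the level matching its zoom.
import numpy as np

def build_mipmaps(samples, factor=16, levels=5):
    mips = []
    cur_min = cur_max = np.asarray(samples, dtype=np.int16)
    for _ in range(levels):
        n = len(cur_min) // factor * factor
        cur_min = cur_min[:n].reshape(-1, factor).min(axis=1)
        cur_max = cur_max[:n].reshape(-1, factor).max(axis=1)
        mips.append((cur_min, cur_max))
    return mips
```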
« Last Edit: June 20, 2019, 06:07:02 am by OwO »
Email: OwOwOwOwO123@outlook.com
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2549
  • Country: gb
Of course you can't actually plot more points than your display has pixels, so really it's an aggregation problem.
I wouldn't bother with an SQL DB for this simple situation; just write a binary file, starting a new one every hour.
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7453
  • Country: pl
One example of the sorts of computation I might want to do is take the deltas between successive samples ("differentiate"), clamp those deltas within a given range, and then do a cumulative sum ("integrate").
That's the kind of thing a simple awk script could do on a text file with one number per line, and you may want to familiarize yourself with that tool anyway for other jobs, but I'm not sure whether performance would be satisfactory on gigabytes of data. I think you are in the realm of dedicated time series databases, as djacobow said, but I've never used that stuff so no recommendations here.
 

Offline Karel

  • Super Contributor
  • ***
  • Posts: 2275
  • Country: 00
Have a look at EDFbrowser: https://www.teuniz.net/edfbrowser/

It's the fastest waveform viewer available, and open source. In order to use it, you need to convert your data to EDF format.
EDFbrowser has an ASCII (CSV) converter tool that can convert your sample data from CSV to EDF, but with 1 Gsamples this can take quite some time.
Once your data is converted to EDF, you can also open it in Scilab, Octave, R and Python, much faster than reading from CSV files.
 
The following users thanked this post: 3roomlab, RoGeorge

Offline borjam

  • Supporter
  • ****
  • Posts: 908
  • Country: es
  • EA2EKH
Most of the tools mentioned will work. Time series databases (or something similar like Elasticsearch) together with a graphical front end (Grafana, Kibana) can tackle that.

Right now I am working with a database of 1.5 giga records (yes, 1,500,000,000) on Elasticsearch.
 


Offline djacobow

  • Super Contributor
  • ***
  • Posts: 1170
  • Country: us
  • takin' it apart since the 70's
What time-series databases have over flat binary files or whatever is the special indexing on the time stamps that makes them VERY fast to query time intervals, reduce data to longer averaging periods, pull max/min values, etc. You can of course do all that with a flat binary file, but the time series databases have it all worked out. If you have not used one before, it is worth adding to your quiver.

I think Borjam mentioned Elasticsearch, which is convenient and flexible, but Amazon's offering for time series is Timestream, and for interval queries it will outperform Elasticsearch handily. Depending on how much data you have, this may or may not matter.

As for why you'd want SQL, I think you need it when you know you need it. If you have not one time series but thousands and thousands of related time series, at different rates, and related to each other in interesting ways, managing the metadata that keeps track of which series means what becomes a bit of a chore. For example, I helped a lab that stores interval data from thousands of sensors, but those sensors' calibration is sometimes changed, and sometimes the same sensors are redeployed. We needed a database that could store not just the raw sensor data, but also the calibration data and what each sensor was actually measuring, so that years from now we can make sense of what the time series data meant at the time it was collected. It's a real problem in labs whose infrastructure is redeployed for new experiments.

You can then keep the metadata in an RDBMS and the time series in a TSDB, or you can play with TimescaleDB, which lets you just use one database system, with the "special sauce" that it knows how to index a special time stamp column to enable very fast interval queries as above. It works nicely.
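To make the "special sauce" part concrete, it looks roughly like this (psycopg2; the table and column names are placeholders I made up, not a tested schema):

```python
# Rough sketch: TimescaleDB hypertable plus a time_bucket() interval query.
import psycopg2

conn = psycopg2.connect("dbname=sensor user=postgres")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        ts    TIMESTAMPTZ NOT NULL,
        value SMALLINT    NOT NULL
    );
""")
cur.execute("SELECT create_hypertable('samples', 'ts', if_not_exists => TRUE);")
conn.commit()

# reduce to 1-minute min/max/avg over an interval, done inside the database:
cur.execute("""
    SELECT time_bucket('1 minute', ts) AS bucket,
           min(value), max(value), avg(value)
    FROM samples
    WHERE ts BETWEEN %s AND %s
    GROUP BY bucket ORDER BY bucket;
""", ('2019-06-20', '2019-06-21'))
rows = cur.fetchall()
```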
 
The following users thanked this post: rs20

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2549
  • Country: gb
What time-series databases have over flat binary files or whatever is the special indexing on the time stamps that makes them VERY fast to query time intervals

Fair enough, but the OP is recording 12-bit data at 1 kHz; there is nothing simpler or faster than just appending those 12-bit samples to a file. With a constant sample rate you don't even need a timestamp, making it VERY VERY fast to query time intervals. The OP also 'didn't want to spend weeks' developing software, so just write to a file; then, to plot, pick every n'th point in the time range you want and plot those in Octave or whatever. This is like 30 minutes of coding effort, though as a perhaps interesting learning exercise it could be done with databases.
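i.e. something like this, if you'd rather use Python/matplotlib than Octave (file name and decimation step are arbitrary):

```python
# Constant 1 kHz rate: a time range maps straight to a byte offset, no timestamps.
import numpy as np
import matplotlib.pyplot as plt

FS = 1000              # samples per second
SAMPLE_BYTES = 2       # int16 on disk

def plot_range(path, t_start_s, t_end_s, step=1000):
    count = int((t_end_s - t_start_s) * FS)
    data = np.fromfile(path, dtype=np.int16, count=count,
                       offset=int(t_start_s * FS) * SAMPLE_BYTES)
    decimated = data[::step]                       # every n'th point
    t = t_start_s + np.arange(len(decimated)) * step / FS
    plt.plot(t, decimated)
    plt.xlabel("time [s]")
    plt.show()
```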
 
The following users thanked this post: Siwastaja

Offline rs20Topic starter

  • Super Contributor
  • ***
  • Posts: 2322
  • Country: au
What I didn't mention, or didn't mention clearly, is that I wanted to mess around and experiment with different functions, have those functions be nicely precomputed for me, and expose the processed data as interactive graphs in the cloud. With those things in mind, using InfluxDB looks like it'll be vastly less effort than a hacky agglomeration of C/awk/Octave scripts. Thanks for the advice, I'll report back on how it goes!
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 9328
  • Country: fi
Remember that IO and storage are often the bottleneck, both performance-wise and for programming complexity. So precomputing and storing things often isn't a good idea, unless it reduces the amount of data (so that you mostly read the precomputed data and only occasionally need to come back to the raw data).

But whenever your processing functions are computationally small (and operate over a small number of samples), and the amount of data out >= the amount of data in, it is likely faster to recalculate each time, even if you do it thousands of times, since a data fetch from disk (or even from RAM!) to L1 cache or CPU registers is like 99.9% of the work.

But all in all, the problem can't be generalized very well. In some cases, a 100-line custom job does wonders. In other cases, it can get really difficult. So while it's good to collect generic wisdom like you get in this thread, each reply is only a grain of salt, a possible tool in a massive toolbox, since we don't know the exact details.

But based on what we know now, you have a fixed-frequency timebase and, assuming the 12-bit samples are padded to 16 bits, you are generating only 2*1000*60*60*24*14/(1024^3) ≈ 2.25 gigabytes of data during the 2 weeks. This is peanuts; this is something a computer from the late 1990s could process with no issues whatsoever, even as one flat file on an ancient FAT32 filesystem, seeking to any value, with performance limited only by the actual disk seek time.

Now, this data fits in RAM all at once, and reading it from an SSD won't take many seconds, so you don't even need to think about disk seeking to read it partially. Just index into the table trivially.
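For example (sketch, assuming the flat 16-bit file discussed above):

```python
# The whole 2-week capture as one flat int16 file; index it like an array.
import numpy as np

FS = 1000  # Hz
data = np.memmap("capture.raw", dtype=np.int16, mode="r")

# "day 3, 12:00 to 12:05" is just an index range:
start = (3 * 24 * 3600 + 12 * 3600) * FS
stop = start + 5 * 60 * FS
chunk = np.array(data[start:stop])   # ~300k samples, copied into RAM
```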

So from performance viewpoint, there's no need to complicate things. The challenges are all in achieving the graphical visualization that pleases your eye (and allows you to see what you want to see), either by finding a suitable graphical tool, or programming the graphics part of your own processing.

OTOH, the amount of data is probably low enough that you can successfully (ab)use many database systems if they give you the style of output you like, even if they are not algorithmically the "right tool for the job". For creating new software for mass users, I loathe bloat and the misuse of computing resources and energy, but for actual one-offs and analysis jobs it doesn't matter. In cases like this, you optimize for development resources instead of computing efficiency, and often the best choice is something that just matches your personal experience or way of doing things, no matter how "ineffective" others think it is.
« Last Edit: June 23, 2019, 09:04:03 am by Siwastaja »
 

Offline 3roomlab

  • Frequent Contributor
  • **
  • Posts: 850
  • Country: 00
Have a look at EDFbrowser: https://www.teuniz.net/edfbrowser/



I gave it a whirrrr.
I used my old log data: 20k points, 2 columns, 424 kB in CSV.
It churned out an EDF of 78.8 kB.
I've never used it before and I'm not familiar with the interface, but it looked like an audio editor trying to play data points  :P
At first glance, I think the compression looks nice: 424 to 79 kB (2-column data).
My data was generated from a Python logger that auto-saves to xls in 20k-point batches. I could make it do CSV.
Assuming all else stays the same, if I were to do the data collection, 1M points in EDF = 3.95 MB in 50 batches.

I also tried the timescale reduction function; a reduction by 10 produces a 39 kB file from the 79 kB file.
What is interesting is that it has a screen calibration function to fit a 1:1 mm scale.

Comparing load times between re-opening a chart in OpenCalc and opening the EDF, I think EDF is faster.

Now here comes something interesting.
I re-exported the EDF back to CSV,
and we can now see that the compression adds artifacts to the data.
Example:
in col 2 the original data is -223 nV;
the compressed data is now -222.89.
Not sure how you calculate the additional bits of error, but every value is now changed, as you can see in the pic. Larger values suffer less change, so if I were to reformat my Python logging to EDF, the data may need to be multiplied up by 10x or 100x to mitigate the error. Maybe the original data needs to be -223000 pV, so that the compressed data becomes -222999.9 pV.

And I tried it:
-223000 pV in the original becomes -222892.11 in EDF.
I guess the compression data loss is a fixed thing.
Who wants to try and calculate the bit loss from -223 to -222.89? (Looks like ~0.05% error?)

If I had collected data for, say, measuring a 10 kohm precision resistor in milliohms, say 10,000,000.67 milliohm, I think the saved EDF data would no longer be the same, or usable.
I also tried this using randomised 99999.xxxxx data:
the EDF render is unusable as it converts everything to 9999.69482. All 20k data points have this value after compression.

So my conclusion/guess is that EDF data can only be used for roughly 3 to 3.5 digits of precision recording.
Which is what? 10 bit? 11 bit?
« Last Edit: June 26, 2019, 04:36:32 pm by 3roomlab »
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 9328
  • Country: fi
I used my old log data: 20k points, 2 columns, 424 kB in CSV.
It churned out an EDF of 78.8 kB.
...
and we can now see that the compression adds artifacts to the data.
Example:
in col 2 the original data is -223 nV;
the compressed data is now -222.89.

Just for lulz, try zipping the csv file and see how small it gets.
 

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 742
  • Country: de
My first idea was: why not use a lossless(!) audio format and use existing software to process it. I think even Octave should be able to read audio files, no? The good thing is that they are binary files, so reasonably small. But you can't add additional data to the samples (e.g. a timecode).
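In Python the standard-library wave module would already do the container part (untested sketch; mono, 16-bit, with the 1 kHz sensor rate as the "sample rate"):

```python
# (Ab)use a mono 16-bit WAV as the lossless container, 1 kHz "sample rate".
import wave
import numpy as np

def write_wav(path, samples):
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16 bit
        w.setframerate(1000)    # the sensor's 1 kHz rate
        w.writeframes(np.asarray(samples, dtype=np.int16).tobytes())
```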

But then the simplest way would be to use a CSV to record the data. Depending on the requirements, do on-the-fly compression if the system has limited storage capacity (e.g. on Linux you might be able to pipe the file through gzip or similar), or use some binary-to-text coding from the beginning (e.g. Base64). If it turns out the data is in a format that doesn't reach your required performance, re-format the data to match the computation requirements. As others pointed out, current computers are fast, especially if you can store the data on an SSD.
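And the on-the-fly compression can be done inside the logger itself (sketch; read_samples() is a placeholder for your data source):

```python
# Log CSV lines through gzip on the fly instead of as raw text.
import gzip
import time

def log_csv_gz(path, read_samples):
    with gzip.open(path, "at") as f:        # "at" = append, text mode
        for value in read_samples():        # placeholder data source
            f.write(f"{time.time():.3f},{value}\n")
```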

You should think about the effort you spend on this. For a one-time test you might live with a low-performing solution and be done with it. If you re-do it multiple times, spending effort on performance might start to be useful. It's a matter of finding the point of good enough.
 

Offline bson

  • Supporter
  • ****
  • Posts: 2497
  • Country: us
The classic is rrdtool.
 

Offline Karel

  • Super Contributor
  • ***
  • Posts: 2275
  • Country: 00
EDF uses 16-bit integers, so the range is from -32768 to +32767.
When you do the conversion from CSV to EDF, you can also choose to convert to BDF
(which is EDF but with 24-bit integers). This increases the range 256 times.
Probably that's what you need.
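The rounding 3roomlab saw is just that 16-bit scaling: EDF maps a physical min/max range linearly onto the digital codes. Roughly like this (sketch; the +/-7000 nV physical range is my guess at what the converter chose, not something I checked):

```python
# EDF-style storage: a linear map between a physical range and 16-bit codes.
DIG_MIN, DIG_MAX = -32768, 32767

def quantize(x, phys_min, phys_max):
    gain = (DIG_MAX - DIG_MIN) / (phys_max - phys_min)
    return round((x - phys_min) * gain) + DIG_MIN

def dequantize(code, phys_min, phys_max):
    gain = (phys_max - phys_min) / (DIG_MAX - DIG_MIN)
    return (code - DIG_MIN) * gain + phys_min

code = quantize(-223.0, -7000.0, 7000.0)     # assumed +/-7000 nV full scale
print(dequantize(code, -7000.0, 7000.0))     # ~ -222.9, i.e. ~0.21 nV steps
```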
 

