Well, for text data, a simple Huffman algorithm (very lightweight) can get you something in the 30% to 50% range on average. It takes almost nothing in both code and data, and could be enough. If it's pure binary data, though, the result will really depend on the content.
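As a rough sanity check on that figure (my own illustration, not a Huffman implementation), the zeroth-order entropy of a text sample bounds what a simple Huffman coder can approach; a minimal sketch in C, compiled with -lm:

```c
/* Estimates the best size a per-byte Huffman coder can approach,
   by computing the zeroth-order entropy of a buffer. */
#include <stdio.h>
#include <string.h>
#include <math.h>

double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t counts[256] = {0};
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;

    double h = 0.0;
    for (int s = 0; s < 256; s++) {
        if (counts[s] == 0)
            continue;
        double p = (double)counts[s] / (double)len;
        h -= p * log2(p);
    }
    return h; /* bits per input byte; 8.0 means incompressible */
}

int main(void)
{
    const char *text = "the quick brown fox jumps over the lazy dog";
    double h = entropy_bits_per_byte((const unsigned char *)text,
                                     strlen(text));
    printf("~%.1f bits/byte, i.e. roughly %.0f%% of the original size\n",
           h, 100.0 * h / 8.0);
    return 0;
}
```

Typical English text lands around 4 to 5 bits per byte, which is where the 30% to 50% savings come from.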
I'm not talking about compressing text. I'm talking about using text as a way to reduce the amount of data. In binary, a 4-byte number will always take 4 bytes; even if it doesn't change, whether it can be compressed depends entirely on the surrounding data. Say you have a binary record with six 4-byte numbers (24 bytes in total). The text format I used works as follows:
1234;56473;587596;226;5859;492|
;;;;;|
1236;56472;;;;|
The first record is complete. In the second record nothing has changed, and in the third record only the first 2 fields changed. In total 52 characters are used instead of 72 bytes. Shove the 52 characters into 4 bits each (0..9 plus 2 extra codes as field and record separators) and you need 26 bytes, which is more than a 60% reduction. Like any compression algorithm, the actual compression will depend on the data, but if you need something with a small footprint and low processing overhead then this is a simple & effective way.
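A minimal sketch of that 4-bit packing in C (the function names and the padding code are my choices, not part of the scheme as described):

```c
/* Packs the digit/separator alphabet into nibbles: '0'..'9' map to
   0..9, ';' (field separator) to 10, '|' (record separator) to 11,
   two codes per output byte. */
#include <stddef.h>

static int code4(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c == ';') return 10;
    if (c == '|') return 11;
    return -1; /* not representable in this alphabet */
}

/* Returns the number of output bytes, or 0 on an invalid character.
   `out` must hold at least (len + 1) / 2 bytes. */
size_t pack_nibbles(const char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i += 2) {
        int hi = code4(in[i]);
        int lo = (i + 1 < len) ? code4(in[i + 1]) : 15; /* 15 = padding */
        if (hi < 0 || lo < 0)
            return 0;
        out[n++] = (unsigned char)((hi << 4) | lo);
    }
    return n;
}
```

Fed the 52 characters above, this yields the 26 bytes claimed.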
@Martin F: I think an existing compression library will be hard to use on a microcontroller since you'll need lots of memory (both flash and RAM) and a heap. I have been down that road trying to compress FPGA images.
You are taking advantage of a peculiar feature of your dataset: it mostly contains small numbers, with some odd large outliers. Sure, in that case, the extra overhead of using ASCII may be less than the compression you achieve.
You can achieve the same idea in many other ways in binary, without using ASCII, and then the result will be even smaller. For example:
If the value is < 65535, write it out in 16 bits (2 bytes).
If the value is >= 65535, first write 65535 (a marker meaning that 4 more bytes follow), then the full 4 bytes.

As a result, a typical "median" value takes 2 bytes (instead of 5 as in your ASCII solution), and the worst case takes 6 bytes (as opposed to 11 bytes in your ASCII version).
Your first 6-item record is 24 bytes as full 32-bit numbers, 30 bytes in your ASCII representation (it has grown, not shrunk as you seem to think), and 16 bytes in my proposed encoding.
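A sketch of that escape encoding in C (the little-endian byte order and function names are my own assumptions, nothing the scheme prescribes):

```c
/* Escape encoding: values below 65535 take 2 bytes; anything larger
   takes a 2-byte 0xFFFF marker plus the full 4-byte value. */
#include <stdint.h>
#include <stddef.h>

static size_t put_u16(uint8_t *out, uint16_t v)
{
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)(v >> 8);
    return 2;
}

/* Returns bytes written: 2 for small values, 6 for large ones. */
size_t encode_value(uint8_t *out, uint32_t v)
{
    if (v < 0xFFFF)
        return put_u16(out, (uint16_t)v);

    size_t n = put_u16(out, 0xFFFF);       /* escape marker */
    out[n++] = (uint8_t)(v & 0xFF);        /* then the full 32 bits */
    out[n++] = (uint8_t)((v >> 8) & 0xFF);
    out[n++] = (uint8_t)((v >> 16) & 0xFF);
    out[n++] = (uint8_t)((v >> 24) & 0xFF);
    return n;
}
```

On your first record, five of the six values fit in 2 bytes and only 587596 needs the 6-byte form, hence 16 bytes.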
Your data size reduction comes from your "not changed" compression. A binary solution for the same is obviously even smaller.
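For illustration, a one-byte change mask in front of each record, with one bit per field, shrinks the binary equivalent of your ";;;;;|" record to a single byte. This is a hypothetical sketch, not your format:

```c
/* Hypothetical "only changed fields" binary record: a 1-byte mask
   leads each record, bit i set = field i follows. An all-unchanged
   record is a single 0x00 byte. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define NFIELDS 6

size_t encode_record(uint8_t *out, const uint32_t cur[NFIELDS],
                     const uint32_t prev[NFIELDS])
{
    size_t n = 1; /* reserve the mask byte */
    uint8_t mask = 0;
    for (int i = 0; i < NFIELDS; i++) {
        if (cur[i] != prev[i]) {
            mask |= (uint8_t)(1u << i);
            memcpy(out + n, &cur[i], 4); /* raw field, native byte order */
            n += 4;
        }
    }
    out[0] = mask;
    return n;
}
```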
Your idea of applying temporal compression by only saving changed fields is sound and widely used, but I struggle to understand how you attribute it to using ASCII, and especially to XML-ism (which your encoding obviously doesn't have, thank god).
Do note that in the presence of noisy sensor readings, such a simple scheme is only useful when some sensors sample more often than others and your logging scheme forces you to do equitemporal writes. In reality, a logger which can list which values are present and which are not (instead of always separator-delimiting everything) would likely be even smaller, especially if you could fit the data ID into 8 bits.
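To make that concrete, here is a rough sketch of such an ID-tagged entry; the ID names and the 16-bit payload width are assumptions for illustration only:

```c
/* ID-tagged logging: each sample is an 8-bit data ID followed by its
   value, so sensors sampled at different rates never force empty
   placeholder fields into the stream. */
#include <stdint.h>
#include <stddef.h>

enum { ID_RPM = 0x01, ID_COOLANT_TEMP = 0x02, ID_THROTTLE = 0x03 };

/* Appends one (id, value) pair: 3 bytes per present value,
   0 bytes for sensors with no new reading this cycle. */
size_t log_value(uint8_t *out, uint8_t id, uint16_t value)
{
    out[0] = id;
    out[1] = (uint8_t)(value & 0xFF);
    out[2] = (uint8_t)(value >> 8);
    return 3;
}
```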
Usually in vehicle data logging, I'd guesstimate, the sensor measurements fall within a much smaller range, so 32-bit values are not needed. And here lies the clue to the idea of rearranging: if you parse the data anyway, you may be able to fit RPM, for example, into 14 bits. If you happen to have an 8-bit-wide flag field which only contains 1 or 2 bits, you can put them into the always-zero MSbs of the 16-bit RPM value (sketched below). But if you want to just dump the raw CAN data stream to the card, having to parse everything to rearrange it may be an unnecessary burden that even a simple generic compression algorithm would solve for you without much custom work.
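The flags-into-spare-MSBs rearrangement could look like this (the 14-bit RPM width and two flags are my own illustration):

```c
/* RPM fits in 14 bits, so the two always-zero MSBs of its 16-bit
   slot can carry two 1-bit flags for free. */
#include <stdint.h>

uint16_t pack_rpm_flags(uint16_t rpm /* 0..16383 */, int flag_a, int flag_b)
{
    return (uint16_t)((rpm & 0x3FFF)
                    | ((flag_a ? 1u : 0u) << 14)
                    | ((flag_b ? 1u : 0u) << 15));
}

void unpack_rpm_flags(uint16_t packed, uint16_t *rpm,
                      int *flag_a, int *flag_b)
{
    *rpm    = packed & 0x3FFF;
    *flag_a = (packed >> 14) & 1u;
    *flag_b = (packed >> 15) & 1u;
}
```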