Well, for text data, a simple Huffman algorithm (very lightweight) can get you something in the 30% to 50% range on average. It takes almost nothing in both code and data, and could be enough. If it's pure binary data, though, the result will really depend on the content.
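As a rough sanity check on that figure (my own illustration, not a Huffman implementation), the zeroth-order entropy of a text sample bounds what a simple Huffman coder can approach; a minimal sketch in C, compiled with -lm:

```c
/* Estimates the best size a per-byte Huffman coder can approach,
   by computing the zeroth-order entropy of a buffer. */
#include <stdio.h>
#include <string.h>
#include <math.h>

double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t counts[256] = {0};
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;

    double h = 0.0;
    for (int s = 0; s < 256; s++) {
        if (counts[s] == 0)
            continue;
        double p = (double)counts[s] / (double)len;
        h -= p * log2(p);
    }
    return h; /* bits per input byte; 8.0 means incompressible */
}

int main(void)
{
    const char *text = "the quick brown fox jumps over the lazy dog";
    double h = entropy_bits_per_byte((const unsigned char *)text,
                                     strlen(text));
    printf("~%.1f bits/byte, i.e. roughly %.0f%% of the original size\n",
           h, 100.0 * h / 8.0);
    return 0;
}
```

Typical English text lands around 4 to 5 bits per byte, which is where the 30% to 50% savings come from.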
I'm not talking about compressing text. I'm talking about using text as a way to reduce the amount of data. In binary, a 4-byte number will always take 4 bytes; even if it doesn't change, whether it can be compressed depends entirely on the surrounding data. Say you have a binary record with six 4-byte numbers (24 bytes in total). The text format I used works as follows:
1234;56473;587596;226;5859;492|
;;;;;|
1236;56472;;;;|
The first record is complete. In the second record nothing has changed, and in the third record only the first 2 fields changed. In total 52 characters are used instead of 72 bytes. Shove the 52 characters into 4 bits each (0..9 plus 2 extra codes as field and record separators) and you need 26 bytes, which is more than a 60% reduction. Like any compression algorithm, the actual compression will depend on the data, but if you need something with a small footprint and low processing overhead then this is a simple & effective way.
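A minimal sketch of that 4-bit packing in C (the function names and the padding code are my choices, not part of the scheme as described):

```c
/* Packs the digit/separator alphabet into nibbles: '0'..'9' map to
   0..9, ';' (field separator) to 10, '|' (record separator) to 11,
   two codes per output byte. */
#include <stddef.h>

static int code4(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c == ';') return 10;
    if (c == '|') return 11;
    return -1; /* not representable in this alphabet */
}

/* Returns the number of output bytes, or 0 on an invalid character.
   `out` must hold at least (len + 1) / 2 bytes. */
size_t pack_nibbles(const char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i += 2) {
        int hi = code4(in[i]);
        int lo = (i + 1 < len) ? code4(in[i + 1]) : 15; /* 15 = padding */
        if (hi < 0 || lo < 0)
            return 0;
        out[n++] = (unsigned char)((hi << 4) | lo);
    }
    return n;
}
```

Fed the 52 characters above, this yields the 26 bytes claimed.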
@Martin F: I think an existing compression library will be hard to use on a microcontroller since you'll need lots of memory (both flash and RAM) and a heap. I have been down that road trying to compress FPGA images.
You are taking advantage of a peculiar feature of your dataset: it mostly contains small numbers, with some odd large outliers. Sure, in that case, the extra overhead of using ASCII may be less than the compression you achieve.
You can achieve the same idea in many other ways in binary, without using ASCII, and then the result will be even smaller. For example:
If the value is < 65535, write it out in 16 bits (2 bytes).
If the value is >= 65535, first write 65535 (a marker meaning that 4 more bytes follow), then the full 4 bytes.

As a result, a typical "median" value takes 2 bytes (instead of 5 as in your ASCII solution), and the worst case takes 6 bytes (as opposed to 11 bytes in your ASCII version).
Your first 6-item record is 24 bytes as full 32-bit numbers, 30 bytes in your ASCII representation (it has grown, not shrunk as you seem to think), and 16 bytes in my proposed encoding.
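A sketch of that escape encoding in C (the little-endian byte order and function names are my own assumptions, nothing the scheme prescribes):

```c
/* Escape encoding: values below 65535 take 2 bytes; anything larger
   takes a 2-byte 0xFFFF marker plus the full 4-byte value. */
#include <stdint.h>
#include <stddef.h>

static size_t put_u16(uint8_t *out, uint16_t v)
{
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)(v >> 8);
    return 2;
}

/* Returns bytes written: 2 for small values, 6 for large ones. */
size_t encode_value(uint8_t *out, uint32_t v)
{
    if (v < 0xFFFF)
        return put_u16(out, (uint16_t)v);

    size_t n = put_u16(out, 0xFFFF);       /* escape marker */
    out[n++] = (uint8_t)(v & 0xFF);        /* then the full 32 bits */
    out[n++] = (uint8_t)((v >> 8) & 0xFF);
    out[n++] = (uint8_t)((v >> 16) & 0xFF);
    out[n++] = (uint8_t)((v >> 24) & 0xFF);
    return n;
}
```

On your first record, five of the six values fit in 2 bytes and only 587596 needs the 6-byte form, hence 16 bytes.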
Your data size reduction comes from your "not changed" compression. A binary solution for the same is obviously even smaller.
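For illustration, a one-byte change mask in front of each record, with one bit per field, shrinks the binary equivalent of your ";;;;;|" record to a single byte. This is a hypothetical sketch, not your format:

```c
/* Hypothetical "only changed fields" binary record: a 1-byte mask
   leads each record, bit i set = field i follows. An all-unchanged
   record is a single 0x00 byte. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define NFIELDS 6

size_t encode_record(uint8_t *out, const uint32_t cur[NFIELDS],
                     const uint32_t prev[NFIELDS])
{
    size_t n = 1; /* reserve the mask byte */
    uint8_t mask = 0;
    for (int i = 0; i < NFIELDS; i++) {
        if (cur[i] != prev[i]) {
            mask |= (uint8_t)(1u << i);
            memcpy(out + n, &cur[i], 4); /* raw field, native byte order */
            n += 4;
        }
    }
    out[0] = mask;
    return n;
}
```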
Your idea of applying temporal compression by only saving changed fields is sound and widely used, but I struggle to understand how you attribute it to using ASCII, and especially to XML-ism (which your encoding obviously doesn't have, thank god).
Do note that in the presence of noisy sensor readings, such a simple scheme is only useful when some sensors sample more often than others and your logging scheme forces you to do equitemporal writes. In reality, a logger which can list which values are present and which are not (instead of always separator-delimiting everything) would likely be even smaller, especially if you could fit the data ID into 8 bits.
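To make that concrete, here is a rough sketch of such an ID-tagged entry; the ID names and the 16-bit payload width are assumptions for illustration only:

```c
/* ID-tagged logging: each sample is an 8-bit data ID followed by its
   value, so sensors sampled at different rates never force empty
   placeholder fields into the stream. */
#include <stdint.h>
#include <stddef.h>

enum { ID_RPM = 0x01, ID_COOLANT_TEMP = 0x02, ID_THROTTLE = 0x03 };

/* Appends one (id, value) pair: 3 bytes per present value,
   0 bytes for sensors with no new reading this cycle. */
size_t log_value(uint8_t *out, uint8_t id, uint16_t value)
{
    out[0] = id;
    out[1] = (uint8_t)(value & 0xFF);
    out[2] = (uint8_t)(value >> 8);
    return 3;
}
```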
Usually in vehicle data logging, I'd guesstimate, the sensor measurements fall within a much smaller range, so 32-bit values are not needed. And here lies the clue to the idea of rearranging: if you parse the data anyway, you may be able to fit RPM, for example, into 14 bits. If you happen to have an 8-bit-wide flag field which only contains 1 or 2 bits, you can put them into the always-zero MSbs of the 16-bit RPM value (sketched below). But if you want to just dump the raw CAN data stream to the card, having to parse everything to rearrange it may be an unnecessary burden that even a simple generic compression algorithm would solve for you without much custom work.
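The flags-into-spare-MSBs rearrangement could look like this (the 14-bit RPM width and two flags are my own illustration):

```c
/* RPM fits in 14 bits, so the two always-zero MSBs of its 16-bit
   slot can carry two 1-bit flags for free. */
#include <stdint.h>

uint16_t pack_rpm_flags(uint16_t rpm /* 0..16383 */, int flag_a, int flag_b)
{
    return (uint16_t)((rpm & 0x3FFF)
                    | ((flag_a ? 1u : 0u) << 14)
                    | ((flag_b ? 1u : 0u) << 15));
}

void unpack_rpm_flags(uint16_t packed, uint16_t *rpm,
                      int *flag_a, int *flag_b)
{
    *rpm    = packed & 0x3FFF;
    *flag_a = (packed >> 14) & 1u;
    *flag_b = (packed >> 15) & 1u;
}
```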