Do we even agree on the distinction between "text" and "binary" formats? In my use, they correspond roughly to "human-readable/writable" and "non-human-readable", which is pretty vague and unhelpful. Here are the four format types that do matter to me:
- Fixed formats, where the relevance of each byte is dictated by its position
- Chunked formats, where each chunk includes the exact length of said chunk
- Stream formats, where control and formatting sequences begin with a reserved value that is escaped or otherwise masked in the data stream
- Hybrid formats
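To make the "reserved value that is escaped" idea concrete, here is a minimal sketch of generic byte stuffing. The reserved values (0x7E as frame end, 0x7D as escape, XOR with 0x20) are borrowed from PPP's HDLC-like framing purely as an illustration; the function and type names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reserved values, chosen here as in PPP's HDLC-like framing (an assumption
   for this sketch; any two reserved byte values would do). */
enum { FRAME_END = 0x7E, FRAME_ESC = 0x7D, FRAME_XOR = 0x20 };

/* Encode len payload bytes into out (capacity cap) and append FRAME_END.
   Returns the number of bytes written, or 0 if out is too small. */
size_t stuff_frame(const uint8_t *src, size_t len, uint8_t *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = src[i];
        if (b == FRAME_END || b == FRAME_ESC) {
            /* Reserved value in the data: emit escape, then the masked byte. */
            if (n + 2 > cap) return 0;
            out[n++] = FRAME_ESC;
            out[n++] = (uint8_t)(b ^ FRAME_XOR);
        } else {
            if (n + 1 > cap) return 0;
            out[n++] = b;
        }
    }
    if (n + 1 > cap) return 0;
    out[n++] = FRAME_END;
    return n;
}

/* Decoder state: a single flag, which is all the memory the parser needs. */
typedef struct { int escaped; } unstuff_state;

/* Feed one received byte. Returns 1 and stores a payload byte in *out,
   0 if the byte produced no payload, or -1 at the end of a frame. */
int unstuff_byte(unstuff_state *st, uint8_t in, uint8_t *out)
{
    if (in == FRAME_END) { st->escaped = 0; return -1; }
    if (in == FRAME_ESC) { st->escaped = 1; return 0; }
    *out = st->escaped ? (uint8_t)(in ^ FRAME_XOR) : in;
    st->escaped = 0;
    return 1;
}
```

The encoder writes to a buffer only to keep the example easy to test; each output byte depends only on the current input byte, so it could just as well be pushed straight to a UART.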
It does not matter to me much if values are restricted in range or in a specific base: parsing big-endian multi-byte binary integers is not that different to parsing big-endian decimal numbers.
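As a rough illustration of how similar the two are, the sketch below parses a big-endian binary integer and a decimal number with essentially the same loop; only the base and the digit extraction differ. The function names are invented for this example, and overflow checking is omitted to keep the parallel visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian base 256: each byte is one "digit", most significant first. */
uint32_t parse_be_binary(const uint8_t *p, size_t len)
{
    uint32_t v = 0;
    for (size_t i = 0; i < len; i++)
        v = v * 256u + p[i];
    return v;
}

/* Big-endian base 10: each ASCII character is one digit, most significant
   first. Parsing stops at the first non-digit character. */
uint32_t parse_be_decimal(const char *s, size_t len)
{
    uint32_t v = 0;
    for (size_t i = 0; i < len && s[i] >= '0' && s[i] <= '9'; i++)
        v = v * 10u + (uint32_t)(s[i] - '0');
    return v;
}
```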
For example, HTTP/1 is a hybrid format. Most of it is a stream format with CR LF (0x0D 0x0A) as the delimiter, but with chunked transfer encoding, it provides a length (hexadecimal encoded, followed by a CR LF), then exactly that many raw bytes of payload (followed by an "extra" CR LF delimiter). It even supports gzip compression for the data payload, on top of chunked transfer encoding, in which case the length refers to the encoded length, not the decoded length.
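A minimal sketch of parsing one such chunk from a buffer, to show where the stream part (scan for CR LF) hands over to the chunked part (skip exactly that many bytes). Note that HTTP/1.1 (RFC 9112) encodes the chunk size in hexadecimal. The function name is made up; chunk extensions and overflow of the size field are deliberately not handled here.

```c
#include <stddef.h>

/* Parse one chunk from buf[0..len). On success, returns the number of bytes
   consumed and sets *data and *size to the payload. Returns 0 if the chunk
   is incomplete or malformed. */
size_t parse_chunk(const char *buf, size_t len, const char **data, size_t *size)
{
    size_t i = 0, n = 0;
    int digits = 0;

    /* Stream part: hexadecimal size digits, terminated by CR LF. */
    for (; i < len; i++) {
        char c = buf[i];
        unsigned d;
        if (c >= '0' && c <= '9') d = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (unsigned)(c - 'a') + 10u;
        else if (c >= 'A' && c <= 'F') d = (unsigned)(c - 'A') + 10u;
        else break;
        n = n * 16u + d;
        digits++;
    }
    if (!digits) return 0;
    if (i + 2 > len || buf[i] != '\r' || buf[i + 1] != '\n') return 0;
    i += 2;

    /* Chunked part: exactly n raw bytes, then the extra CR LF. */
    if (len - i < 2 || n > len - i - 2) return 0;
    if (buf[i + n] != '\r' || buf[i + n + 1] != '\n') return 0;

    *data = buf + i;
    *size = n;
    return i + n + 2;
}
```

The payload bytes are never inspected, so they may contain CR LF or anything else; that is the practical difference from the surrounding header lines.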
GIF images are an example of a fixed format. PNG and JFIF JPEG images are examples of chunked formats. HTML and XML are stream formats.
Standard encoding of protobufs is a stream format, but netstrings are a chunked format.
Many people suggest chunked formats, but they tend to require significant buffering (consider e.g. netstrings: to provide a value, you need to know its exact length in bytes before you can start emitting it, thus necessitating an output buffer of the maximum value length), making them annoying/problematic on small microcontrollers.
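The buffering problem shows up directly in the signature of even a minimal netstring encoder. In the sketch below (the function name is made up), the length prefix is the first thing written, so the caller has to hand over the complete value up front:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write value[0..len) as a netstring ("<decimal length>:<bytes>,") into out.
   Returns the number of bytes written, or 0 if it does not fit in cap.
   Note that len must be known before the first byte can be emitted. */
size_t netstring_encode(const void *value, size_t len, char *out, size_t cap)
{
    int hdr = snprintf(out, cap, "%zu:", len);
    if (hdr < 0 || (size_t)hdr >= cap) return 0;
    if (len > cap - (size_t)hdr - 1) return 0;  /* room for data and ',' */
    memcpy(out + hdr, value, len);
    out[(size_t)hdr + len] = ',';
    return (size_t)hdr + len + 1;
}
```

If the value is produced incrementally (say, formatted sensor readings), it has to be assembled somewhere first just to learn its length, and that buffer must be sized for the largest possible value.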
I do have my own solution: a stream-based format that can be generated and parsed with minimal memory and processing, using a relatively simple state machine. It supports lossless conversion from any XML-derived format with a lot less overhead than XML, can be used to replace protocols like HTTP, and can easily be extended to support multiple independent data streams on top of the same transport stream with minimal buffering (4-8 bytes). It can be expressed and defined in human-readable terms, but a binary representation (reserved byte values) tends to be more effective. With the multi-stream extension, the worst-case overhead can be as high as 33%, but it is negligible for typical data given a sensible choice of reserved byte values.
I only bring it up because it fits both "binary" and "text" format categories, depending on the reserved byte value choices.
(I really should try and publish it, because I think it would be useful to many if they just knew of the technique. Alas, I don't have the social wherewithal to become a vocal proponent for it and push it to the relevant working groups and projects. Mentioning it here, like DiTBho does for their my-c, is near my limit.)
It is not suitable for CAN bus, however, which has its own standard frame types, with both base and extended frames carrying 0 to 8 data payload bytes (and the number of payload bytes already specified in the data frame). Each payload segment is also so short that only a fixed format makes sense here. For up to a reasonable number of different logical payloads, different CAN bus identifiers can be used for each part/sub-message (each bus supporting 2048 unique 11-bit base identifiers, and far more with 29-bit extended identifiers), so multi-part messages do not need to use the payload bytes to identify the payload itself.
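As a sketch of what a fixed format in an 8-byte payload can look like, here is a hypothetical sensor reading packed big-endian. The struct, field names, and byte layout are invented for illustration; the CAN identifier of the frame, not the payload, would say which layout applies.

```c
#include <stdint.h>

/* Hypothetical logical payload; the meaning of each byte is fixed by position. */
typedef struct {
    int16_t  temperature;  /* payload bytes 0-1 */
    uint16_t pressure;     /* payload bytes 2-3 */
    uint32_t timestamp;    /* payload bytes 4-7 */
} reading_t;

/* Pack into the 8 data bytes of one frame, big-endian. */
void pack_reading(const reading_t *r, uint8_t data[8])
{
    uint16_t t = (uint16_t)r->temperature;
    data[0] = (uint8_t)(t >> 8);
    data[1] = (uint8_t)t;
    data[2] = (uint8_t)(r->pressure >> 8);
    data[3] = (uint8_t)r->pressure;
    data[4] = (uint8_t)(r->timestamp >> 24);
    data[5] = (uint8_t)(r->timestamp >> 16);
    data[6] = (uint8_t)(r->timestamp >> 8);
    data[7] = (uint8_t)r->timestamp;
}

/* Unpack; no tags, lengths, or escapes to interpret. */
void unpack_reading(const uint8_t data[8], reading_t *r)
{
    /* Assumes two's complement for the unsigned-to-signed conversion. */
    r->temperature = (int16_t)(uint16_t)(((unsigned)data[0] << 8) | data[1]);
    r->pressure    = (uint16_t)(((unsigned)data[2] << 8) | data[3]);
    r->timestamp   = ((uint32_t)data[4] << 24) | ((uint32_t)data[5] << 16)
                   | ((uint32_t)data[6] << 8)  |  (uint32_t)data[7];
}
```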
Note that if the CAN bus message order cannot be absolutely controlled, you can use N times the number of CAN bus identifiers, in a cyclic round-robin scheme, for N different logical messages in time. (You'd also want to have the additional bit mask I described, to determine when all parts of a logical message have been received.) This is also useful when using an RTOS and a mailbox scheme, because then you have (theoretically) N-1 logical message intervals to process the message in the main loop, before it is overwritten by a receive interrupt. (Obviously you can also use a queue for the logical messages to avoid losing any messages, but the queue primitives needed may not be available in an interrupt context; see my post about the problem of using a mutex in an interrupt context above.)
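One way the round-robin identifiers and the bit mask could fit together is sketched below. Everything here is an assumption for illustration: the base identifier, three parts per logical message, N = 4 slots, fixed 8-byte parts, and the identifier layout base + slot * parts + part. It also assumes no frames are lost; a real implementation would likely want a sequence counter or timeout to discard stale partial messages.

```c
#include <stdint.h>
#include <string.h>

#define BASE_ID   0x100u                 /* hypothetical first identifier */
#define PARTS     3u                     /* parts per logical message */
#define SLOTS     4u                     /* N: round-robin depth */
#define ALL_PARTS ((1u << PARTS) - 1u)   /* mask value when complete */

typedef struct {
    uint8_t data[PARTS][8];  /* payload of each part, in part order */
    uint8_t mask;            /* bit k set when part k has been received */
} slot_t;

static slot_t slots[SLOTS];

/* Called for each received frame, in any order. Returns the slot index when
   that slot's logical message has just become complete, otherwise -1. */
int on_frame(uint16_t id, const uint8_t payload[8])
{
    if (id < BASE_ID || id >= BASE_ID + PARTS * SLOTS)
        return -1;                       /* not one of our identifiers */

    unsigned idx  = id - BASE_ID;
    unsigned slot = idx / PARTS;
    unsigned part = idx % PARTS;
    slot_t  *s    = &slots[slot];
    uint8_t  bit  = (uint8_t)(1u << part);

    /* Seeing the same part again means the sender has wrapped around and is
       reusing this slot for a newer logical message. */
    if (s->mask & bit)
        s->mask = 0;

    memcpy(s->data[part], payload, 8);
    s->mask |= bit;
    return (s->mask == ALL_PARTS) ? (int)slot : -1;
}
```

With this layout, a completed slot is not touched again until the sender has cycled through the other N-1 slots, which is the processing window mentioned above.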