In debug mode it easily generates several kilobytes of formatted number containing text per minute. And, what the library does is so complicated that we really need the debug output to be generated all of the time (both in development and production).
Unfortunately, It turns out that printf is so slow that it causes the system to fail in a variety of subtle yet completely catastrophic ways. (due to it missing real time deadlines and not being able to detect that it has missed them)
Has anyone dealt with this before?
Not the exact situation, but I've worked a lot on high-performance computing, especially on classical-type molecular dynamics simulators, which generate a LOT of numerical data. For example, a colleague ran a large simulation with short snapshot time intervals which produced some 4 Tb of data, basically in
Protein Data Bank file format –– essentially formatted text with many numeric fields.
I've specialized in simulator development. Currently, the main problem is very similar to what you see: instead of interleaving communications and state snapshotting with computation, they compute, then communicate/snapshot, then compute, and so on, wasting a significant fraction of available CPU time. (Some people have objected to my complaint by saying that if they fully utilized the CPUs in the clusters, it "might overwhelm their cooling capability", which to me is an asinine cop-out.)
The solution here is obvious: instead of formatting the output immediately, you need to store the output data, and optionally format it at leisure.
Because
printf() format string parsing takes a surprising amount of unnecessary CPU time, and you have to do that in order to determine the number and type of each parameter, I would recommend reimplementing the output using individual functions per parameter type. That is, you convert
printf("Foo %d bar %.4f baz\n", x, y);into something like
// Converted from printf("Foo %d bar %.4f baz\n", x, y); emit_const("Foo "); emit_int(x); emit_const(" bar "); emit_float(y, 4); emit_const(" baz"); emit_newline();Note that I'm assuming you use C and not C++ here. With C++, overloading
<< is a better approach.
The
emit_() family of functions (or the overloaded operator in C++) appends the type of the parameter and the parameter value to a circular buffer. This takes minimal time.
When the processor is idle, it converts
one parameter from the circular buffer to output and returns, minimizing the latency/duration to one single conversion at a time. You can then choose whether you allow the buffer to overrun, or whether you convert from the circular buffer before adding more to it. It is even possible to flush this buffer to storage (say microSD card or similar).
In C, I would separate the parameter type into its own circular buffer, and the parameter data to its own fixed-size (32- or 64-bit/4- or 8-byte) records:
#define LOG_SIZE 29127 // Number of elements in the circular buffer; this is 262143 == 256k - 1
union log_item_union { /* Type, example values only */
// 0 reserved for unused item
char c[8]; // 1 .. 8, length
uint8_t u8[8]; // 9 .. 16, count+8
int8_t i8[8]; // 17 .. 24, count+16
uint16_t u16[4]; // 25 .. 28, count+24
int16_t i16[4]; // 29 .. 32, count+28
uint32_t u32[2]; // 33 .. 34, count+32
int32_t i32[2]; // 35 .. 36, count+34
uint64_t u64[1]; // 37
int64_t i64[1]; // 38
float f32[2]; // 39 .. 40, count+38
double f64[1]; // 41
const char *cc; // 42 .. 255, length+41
};
uint8_t log_type[LOG_SIZE];
union log_item_union log_item[LOG_SIZE];
uint_fast16_t log_head = 0; // Next free item in buffer
uint_fast16_t log_tail = 0; // Oldest item in buffer, unless equal to head
This uses 9 bytes of RAM per log item. The type specifies the item type in the 64-bit union, and supports various numeric types and vectors (whatever fits into 64 bits), various constant strings stored in Flash, plus short strings of up to 8 bytes/
chars long. The reason for not interleaving the type and the item data is hardware alignment requirements; typically, larger than byte sized items either require an aligned address, or are slower to access from an unaligned address, depending on the hardware architecture used.
Note that the above format would also allow
emit_2float(f1,f2),
emit_i16(i1),
emit_2i16(i1,i2),
emit_3i16(i1,i2,i3),
emit_4i16(i1,i2,i3,i4), and so on.
Note that you can then choose whether you convert the circular buffer contents to strings at your leisure, or whether you simply write them to some serial stream. The key is to do it when your CPU is otherwise idle, and keep each conversion/write operation short enough to not block so long as to affect the main computation.
Writing the data synchronously just will not work; the circular buffer here is what makes the prints asynchronous, and lets you handle the actual operations later on in small enough chunks to not tie up your resources for too long. For floating point types, using a fixed number of decimals and a limited range (say, -99999999.99999999 to +99999999.99999999) means you can speed up the conversion significantly compared to printf() et al.; it is supporting values very close to zero or very large in magnitude that makes generic precise conversion to string slow.