Constructing short C strings in limited or constrained situations

Nominal Animal:
When constructing strings in C, snprintf() and vsnprintf() are among the first tools we reach for.  Unfortunately, in many embedded and freestanding environments they may not be available, or their code size is too large to be practical; and even in hosted POSIXy systems, one cannot use them in signal handlers, because they are not async-signal safe.

I recently realized that the tools I usually reach for in these situations can be divided into two categories: small string buffers and reverse builders.  Small string buffers are exactly what they sound like, just written to be small and efficient (both time and space) with minimal to no external dependencies (so usable even in interrupt handlers in embedded environments, and in signal handlers in hosted POSIX C environments).  Reverse builders are also string buffers, except they work bass-ackwards, from the end towards the beginning of the string.  If the string consists (mostly) of integers in decimal form, the reverse builders are desirable as they don't do any extra work at all (like moving the string buffer around, or extra divisions by ten to find out how many characters one needs to reserve for them; most integer formatting implementations tend to have to do one or the other).

So, let me describe the small string buffers: their use cases, their design requirements, and my preferred implementation.  If you are a new C programmer or still learning about these things, perhaps there is something about the process or how one ends up with this kind of thing that may be useful to you.  If you are an old hand at C (or the subset of C++ used in embedded environments, which sadly usually omits all the nice C++ string stuff), perhaps you have a different approach you use, that you care to compare and contrast?  Or if you find this yukky, feel free to say so, although I do need to point out I only have these because I've needed them in practice before.  You cannot always reasonably punt the work from the constrained/limited/critical section to say a worker thread; sometimes it just makes most sense to do it right there in the bind, even if one needs to implement one's own tools for it.

Use cases

Most typical use cases are human-readable time stamps ("YYYY-MM-DD HH:MM:SS.sss" being my preferred format) and reporting variable values to a buffer or stream of some sort, for logging purposes, or as responses to a query or a command.

The common theme here is a very restricted environment.  Your code may be an interrupt handler on an AVR, with limited memory and computing resources, and because almost none of the base library functions are re-entrant (safe to use in an interrupt handler), you either do the formatting elsewhere, or with your own code.  Or you might have a Linux program or daemon of some sort, and want to log some signals directly from their signal handlers, without setting up a thread to sigwait()/sigtimedwait() on those signals so one is not restricted to async-signal safe functions.  If you can use snprintf()/vsnprintf(), by all means do so: you should only reach for this kind of alternative when you reasonably cannot use snprintf()/vsnprintf() or whatever formatting functions your C (or subset-of-C++) environment provides.

For the same reasons, these strings are severely restricted in length.  Typically, they are less than a hundred characters long.  Sometimes you have a small formatted part before and/or after a bulk data part, so the total may be much higher.  I personally find a hard upper limit of say 250 characters here perfectly acceptable.

When working on embedded systems, I do not want to waste RAM, though.  For the same reason, I don't want to use C++ templates either for each different string buffer length; I need the functionality small and tight – think very specific surgical tool, not a swiss army knife.

Design criteria

I want to specify the string buffer size at compile time (or if on-stack, at run time), and not waste any memory.
I want a single set of functions to act on the buffers. I do not want multiple copies of the functions just because they access string buffers of slightly different sizes.
I want the functions to be fully re-entrant (meaning, if code working on one string buffer is interrupted, the interrupting code is safe to work on any other string buffers) and async-signal safe.
I accept a low hard upper limit on the buffer size, say around 250 chars or so.  Typically, mine are much smaller, on the order of 30 or so.
I want these to be efficient.  They don't need to be optimized yet, just do what they do without wasting effort on theoretical stuff.
I want each string buffer to also record its state, so that I can stuff the components that form the strings into it, and only have to check at the end whether it all fits.

Practical use example

An interrupt or signal handler needs to provide a string containing a timestamp in human readable form.  The timestamp is a 32-bit unsigned integer in milliseconds.  That is, if it has value 212×1000×60×60+34×1000×60+56×1000+789 = 765,296,789 = 0x2D9D8095, the timestamp should read "212:34:56.789".  Thus:

        uint32_t  millisecs;  // To be described in human readable form
        SSB_DEFINE(timestamp, 14);  // Maximum 32-bit timestamp corresponds to 1193:02:47.295, which has 14 characters
       
        const int  hours = millisecs / (60*60*1000UL);  // UL: 60*60*1000 would overflow a 16-bit int (e.g. on AVR)
        millisecs -= (uint32_t)hours * (60*60*1000UL);  // or millisecs %= 60*60*1000UL;
        ssb_append_int(timestamp, hours, 0);
        ssb_append_chr(timestamp, ':');
        const int  minutes = millisecs / (60*1000UL);
        millisecs -= (uint32_t)minutes * (60*1000UL);  // or millisecs %= 60*1000UL;
        ssb_append_int(timestamp, minutes, 2);
        ssb_append_chr(timestamp, ':');
        const int  seconds = millisecs / 1000;
        millisecs -= (uint32_t)seconds * 1000;  // or millisecs %= 1000;
        ssb_append_int(timestamp, seconds, 2);
        ssb_append_chr(timestamp, '.');
        ssb_append_int(timestamp, (int)millisecs, 3);
       
        if (ssb_valid(timestamp)) {
            // string is ssb_ptr(timestamp, NULL), length ssb_len(timestamp, 0).
        } else {
            // Failed. Either buffer was too short, or a conversion couldn't fit
            // in the allotted space, or something similar.
        }

At file scope, one can declare a small string buffer via say
    SSB_DECLARE(common, 30);
and in each scope where it is used (but note that these uses must not interrupt each other!), it is just initialized with
        SSB_INIT(common);

In a local scope, to reuse memory currently unused for a small string buffer, one can use say
        SSB_STEAL(newname, pointer_to_memory, size_of_memory);
but again, it is up to you to make sure that while this memory is being used as a string buffer, we don't get interrupted by some other code that also uses it.  Similarly, if we steal it, we'll be rewriting its contents, so at this point it better not have any useful stuff in it anymore.
This will declare a new variable named newname, compatible with the ssb_ functions, pointing to the data.  Hopefully, the compiler is smart enough to optimize the variable away.  (In C, one can only make the new one const, but in C++ the newname pointer can be a constexpr.)

To append a floating-point number in decimal format (not scientific format) using six decimal places, one could use say
        ssb_append_float(buffer, value, 0, 6);
where the first 0 indicates "as many digits as needed on the left side of the decimal point", and 6 "six digits on the right side of the decimal point".

(I like using positive numbers to indicate fixed numbers of digits with zero padding, negative for fixed width with zero padding, and zero for "the size of the value without any extra padding". I do not normally have a way to force a positive sign; the decimal strings that I tend to need are either negative (with a negative sign), or nonnegative (without any signs).)

In hosted environments, I like having ssb_write(buffer, descriptor); which tries hard to write the string buffer contents to the specified descriptor, provided the string buffer is valid, i.e. none of the operations on it have failed.  If any have, then this (like all other ssb_ functions) does nothing.

Implementation

A small string buffer is an array of 4 to 258 unsigned chars.
The first one is the maximum number of chars it can hold (maxlen), not including the trailing nul char.  This one must be between 1 and 254 (UCHAR_MAX-1), inclusive; the smallest and largest possible values are reserved (and detected as *broken* buffers, scribbled over by some other code).
The second one is the current number of chars in the buffer (len), not including the trailing nul char.
There is always room for the trailing nul, but an implementation may only set it there when the buffer pointer is obtained via a ssb_ptr() call.

Each small string buffer has three states:
* Valid: Operations have all succeeded thus far. (maxlen >= 1 && maxlen < UCHAR_MAX && len >= 0 && len <= maxlen).
* Invalid: Operations have failed. (maxlen >= 1 && maxlen < UCHAR_MAX && len > maxlen).
* Broken: Someone overwrote maxlen. (maxlen < 1 || maxlen >= UCHAR_MAX).

Initializing operations cause the buffer to have maxlen computed based on their size, with len zero, and the first byte of data also zero.  The rest of the data bytes do not need to be initialized.  (On embedded architectures, this uses three assignments instead of an initializer, to keep the ROM/Flash overhead minimal.)

Aside from the maximum length value, and the state logic encoded by the length and maximum length values, this is a common string or string buffer format, called either length-prefixed, or more commonly Pascal strings, since many Pascal compilers use(d) a length-prefixed format for strings.

ssb_valid(), ssb_invalid(), and ssb_broken() functions can be used to check the buffer state, but I rarely need them; usually only to set up a debugging flag if a report had to be skipped because the string buffer was too small.  These functions also tend to be static inline and/or have the pure used leaf function attributes, so that the compiler doesn't include them at all in the final binary if they're not used, and can optimize them with rather extreme prejudice.  They are so simple that even on AVR, it would be more effort to move the data to suitable registers for a function call, than just straight-up inlining the accesses involves.

The other ssb_ functions only operate on small string buffers in the Valid state.  If the buffer is in Invalid or Broken state, it is not modified nor accessed beyond the two bytes that define maxlen and len.

A particular detail I like is that the pointer accessor function takes an extra ifbad argument,
    char *ssb_ptr(buffer, char *ifbad);
as does the length accessor function,
    int  ssb_len(buffer, int ifbad);
so that one can handle the invalid/broken buffer states with a custom return value, without needing e.g. a ternary expression at each call site.
These are static inline accessor functions, because on most architectures moving the parameters to their respective registers is as much code as doing the accesses, so the extra if-bad parameter/return value doesn't really cost anything.
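
To make the layout and state logic above concrete, here is a minimal sketch.  It is not the implementation discussed in this thread (that code is not shown); it only follows the description: byte 0 is maxlen, byte 1 is len, the data follows, and a failed operation sets len > maxlen.  The ssb_t typedef, the ssb_fail() helper, and the exact meaning of the third ssb_append_int() parameter (0 for as many digits as needed, positive for exactly that many zero-padded digits, negative not supported here) are assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Assumed layout: b[0] = maxlen, b[1] = len, b[2..] = chars plus room for nul. */
typedef unsigned char ssb_t;

/* Define an on-stack buffer for up to 'size' chars (1..254), initialized
   with three assignments rather than an initializer. */
#define SSB_DEFINE(name, size)      \
    ssb_t name[(size) + 3];         \
    name[0] = (size); name[1] = 0; name[2] = 0

static inline int ssb_broken(const ssb_t *b)  { return b[0] < 1 || b[0] >= UCHAR_MAX; }
static inline int ssb_valid(const ssb_t *b)   { return !ssb_broken(b) && b[1] <= b[0]; }
static inline int ssb_invalid(const ssb_t *b) { return !ssb_broken(b) && b[1] >  b[0]; }

/* Hypothetical helper: mark the buffer Invalid (len > maxlen).
   maxlen is at most 254, so maxlen + 1 always fits in one byte. */
static inline void ssb_fail(ssb_t *b) { b[1] = (unsigned char)(b[0] + 1); }

/* Pointer to the nul-terminated contents, or ifbad if not Valid. */
static inline char *ssb_ptr(ssb_t *b, char *ifbad)
{
    if (!ssb_valid(b))
        return ifbad;
    b[2 + b[1]] = '\0';          /* nul is only stored when the pointer is asked for */
    return (char *)(b + 2);
}

/* Current length, or ifbad if not Valid. */
static inline int ssb_len(const ssb_t *b, int ifbad)
{
    return ssb_valid(b) ? (int)b[1] : ifbad;
}

void ssb_append_chr(ssb_t *b, char c)
{
    if (!ssb_valid(b))
        return;                  /* Invalid or Broken: touch nothing */
    if (b[1] >= b[0]) {
        ssb_fail(b);             /* no room left */
        return;
    }
    b[2 + b[1]] = (unsigned char)c;
    b[1]++;
}

void ssb_append_int(ssb_t *b, int value, int digits)
{
    unsigned int u = (value < 0) ? 0u - (unsigned int)value : (unsigned int)value;
    unsigned int t = u;
    int need = 1;

    if (!ssb_valid(b))
        return;

    /* First loop: count the digits needed. */
    while (t >= 10) {
        t /= 10;
        need++;
    }
    if (digits < 0 || (digits > 0 && need > digits)) {
        ssb_fail(b);             /* unsupported width, or value does not fit in it */
        return;
    }
    if (digits > need)
        need = digits;           /* zero padding on the left */

    const int total = need + (value < 0);
    if (total > b[0] - b[1]) {
        ssb_fail(b);             /* does not fit in the remaining space */
        return;
    }

    unsigned char *p = b + 2 + b[1];
    if (value < 0)
        *p++ = '-';

    /* Second loop: generate the digits, least significant first. */
    for (int i = need; i-- > 0; ) {
        p[i] = (unsigned char)('0' + u % 10);
        u /= 10;
    }
    b[1] = (unsigned char)(b[1] + total);
}
```

With this sketch, the timestamp example from earlier works as written: after the seven append calls, ssb_valid(timestamp) is true and ssb_ptr(timestamp, NULL) points to the finished string; had any append not fit, every later append would have been a no-op and ssb_ptr() would return the ifbad value instead.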

My current implementations for the equivalents of ssb_append_int() and ssb_append_float() do include an "extra" loop that divides the integral part by ten repeatedly, to find the number of integer digits needed.  That loop is then repeated to generate the actual digits.  (This can be avoided by using buffer space left starting at the very end, so that only one divide-by-ten loop is needed, followed by copying/moving the data to the correct position in the buffer.  Both have their use cases.)  The fractional digits are generated by repeated multiplications by ten, followed by extracting the integer part, so is also suitable for any fixed-point formats one might use (although one does need to write the ssb_append_type() function for each different fixed-point format).

While the repeated division-by-ten with remainder giving the next decimal digit, and multiplying the fractional part repeatedly by ten and grabbing the integral part for the next digit (and incrementing it if necessary for the final digit after comparing the final fractional part to 0.5), may sound inefficient, they actually easily outperform most printf() implementation variants with the same formatting specs, while producing the same string.  So, while I definitely do not consider these (or at least my) small string buffers optimized, I do consider them efficient.  The possible exception is that integer part loop on different architectures: those that have a fast and compact integer divide-by-ten-and-keep-remainder machine instruction will prefer the extra division loop, others will prefer copying the few bytes over.
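
As an illustration of the multiply-by-ten approach for fractional digits, here is a minimal sketch for an assumed fixed-point format where the fraction is a 16-bit value meaning frac/65536.  The function name q16_frac_digits() is made up for this example, and for brevity it truncates instead of rounding the final digit.

```c
#include <stdint.h>

/* Hypothetical helper: write 'digits' decimal digits of frac/65536 to out[],
   truncating (no rounding of the final digit), and nul-terminate.
   out[] needs room for digits + 1 chars.  Returns the number of digits. */
unsigned q16_frac_digits(char *out, uint16_t frac, unsigned digits)
{
    uint32_t f = frac;
    unsigned i;

    for (i = 0; i < digits; i++) {
        f *= 10;                            /* at most 65535*10, fits in 32 bits */
        out[i] = (char)('0' + (f >> 16));   /* integer part is the next digit */
        f &= UINT32_C(0xFFFF);              /* keep only the fractional part */
    }
    out[i] = '\0';
    return i;
}
```

For example, a fraction of 0x8000 (one half) with four digits yields "5000".  No division is involved at all, which is why the same loop suits any fixed-point format once the shift and mask are adjusted.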

I can provide my own implementations as CC-BY, but at this point, I haven't fully integrated everything into a single codebase, and I'm seeing so many #ifdef macros in the single-source, I think it might be better to keep the interface the same but decidedly different implementations (say, AVR vs. AMD64) separate.  I'm not sure about whatchamacallit, either; my source right now uses tinystr, but I think some three-letter acronym implying "buf" would work better, although "ssbuf" sounded a bit off to me to use.  And my idea here was not to look at just my code, I want to see what others do and use.  You know, shop talk.

Those I've talked to face-to-face tell me they don't like showing this kind of code, because although theirs is just as functional as mine, this kind of constrained code is often left in its ugliest working form; only worked on if it breaks, rewritten from scratch (or from memory) for each use case.  Because face it, people most often just choose to wing it instead, hoping that this time they won't get bitten by re-entrancy issues or async-signal unsafety, if they use snprintf() etc.; and if they do get bitten, they (and I myself) more often punt these to worker threads and less constrained parts, rather than work on such ugly little code snippets.  In all honesty, just look at even the outlined interfaces: I even need macros to declare/initialize/define the buffer variables, and that just starts the odd side.

The reason I originally started paying attention to integer and floating-point number formatting, was when I was dealing with huge molecular datasets ("movies", dozens of gigabytes in slowish spinning rust, well over a decade ago now) in a nicely spatially and temporally sliceable distributed binary form (dataset scattered over hundreds of binary-formatted files).  I needed to generate the interchange/transport formats (various human-readable text formats for visualization et cetera), and that was severely bottlenecked on printf().  A main source saving to those binary files was written in Fortran 95, too... So, I wrote faster convert-doubles-to-string versions, and verified the strings matched for all normal (as in finite non-subnormal) doubles.  (Well, not all 2^62 or so of such values, but maybe 2^40 of them.)  It was so ugly I hid the code, but it was/is functional.

Nominal Animal:
Reverse string building

Building strings containing mostly integers in reverse, from the end towards the beginning, can simplify and speed up the operation significantly.  For us humans, using an imperative language like C, that reverse order can be somewhat overwhelming.

For example, consider a function that is provided with a buffer with room for at least 25 chars, and three signed 16 bit integers, and formats the integers into a human-readable three-component geometric vector, "(x, y, z)":

--- Code: ---#include <stdint.h>

static char *backwards_i16(char *ptr, int16_t val)
{
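    /* Negate in unsigned arithmetic, so that -32768 is safe even when int is 16 bits wide. */
    uint16_t  u = (val < 0) ? (uint16_t)(0u - (uint16_t)val) : (uint16_t)(val);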

    do {
        *(--ptr) = '0' + (u % 10);
        u /= 10;
    } while (u);

    if (val < 0)
        *(--ptr) = '-';

    return ptr;
}

/** Return a pointer to "(x, y, z)" somewhere within the buffer,
 *  with buffer having room for at least 1+6+2+6+2+6+1+1 = 25 chars.
*/
char *i16_vec3_to_asciiz(char *buffer, int16_t x, int16_t y, int16_t z)
{
    char *p = buffer + 1 /* '(' */
                     + 6 /* -32768 .. 32767 */
                     + 2 /* ", " */
                     + 6 /* -32768 .. 32767 */
                     + 2 /* ", " */
                     + 6 /* -32768 .. 32767 */
                     + 1 /* ')' */
                     + 1 /* terminator, '\0'. */ ;

    *(--p) = '\0';
    *(--p) = ')';

    p = backwards_i16(p, z);

    *(--p) = ' ';
    *(--p) = ',';

    p = backwards_i16(p, y);

    *(--p) = ' ';
    *(--p) = ',';

    p = backwards_i16(p, x);

    *(--p) = '(';

    /* Note: if the caller expects the result to start at the beginning
       of the buffer, we need to do the equivalent of
            if (p > buffer)
                memmove(buffer, p, (size_t)(buffer+1+6+2+6+2+6+1+1 - p));
       Our description says result is somewhere within the buffer,
       so we do not need to.
    */

    return p;
}

--- End code ---
As you can see, the code is very compact, and relatively straightforward.  What is difficult with it, is to remember that to see what kind of string it constructs one needs to read the code backwards: start at the Note: comment, then go upwards, until you see *(--p) = '\0'; which is responsible for terminating the string.

The basic operation used here is divide by ten with remainder.  It does not imply that hardware division is actually used, though.  We can write the basic operation as
    unsigned char  div10(unsigned int *const arg) {
        const unsigned char  result = (*arg) % 10;
        (*arg) /= 10;
        return result;
    }
which on most architectures does not actually involve hardware division at all, but a (wider word-width) multiplication using a reciprocal represented by a fixed point integer.  For example, on AMD64, Clang-10 -O2 generates the same code for above as for
    unsigned char  div10(unsigned int *const arg) {
        const unsigned int  dividend = *arg;
        const unsigned int  quotient = (uint64_t)((uint64_t)dividend * 3435973837) >> 35;
        const unsigned char  remainder = dividend - 10*quotient;
        *arg = quotient;
        return remainder;
    }
where the magic constant, 3435973837 / 2^35, represents 0.1000000000 (to ten decimal places).

In general, division of an unsigned integer q by a constant non-power-of-two positive divisor d is based on q/d ≃ (a×q+c)/2^n, with the parenthesized part using extra range (often twice the width of q).
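
The formula above, with d = 10, n = 35, a = 3435973837 (which is 2^35/10 rounded up) and c = 0, can be written out and checked against ordinary division on any hosted system.  This is only a sketch for experimenting; the function names are made up here, and a compiler will normally do this transformation for you.

```c
#include <stdint.h>

/* q / 10 computed as (a*q) >> n with a = 3435973837 and n = 35.
   The product needs 64 bits: it is always less than 2^64. */
uint32_t div10_by_reciprocal(uint32_t q)
{
    return (uint32_t)(((uint64_t)q * UINT64_C(3435973837)) >> 35);
}

/* q % 10 derived from the quotient, as in the div10() shown above. */
uint32_t mod10_by_reciprocal(uint32_t q)
{
    return q - 10 * div10_by_reciprocal(q);
}

/* Returns nonzero if the reciprocal version agrees with plain / and %. */
int div10_agrees(uint32_t q)
{
    return div10_by_reciprocal(q) == q / 10
        && mod10_by_reciprocal(q) == q % 10;
}
```

Looping div10_agrees() over all 2^32 values takes only a few seconds on a desktop machine, which is a cheap way to convince yourself of a magic constant before hand-coding it for a target whose compiler does not do this on its own.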

Converting integers to strings using subtraction only

Sometimes you have an architecture where division by constant (implemented either in hardware, or similarly to above as multiplication via reciprocal) is too costly.  There, you can use repeated subtractions.

For example, on 8-bit architectures without hardware division or multibyte multiplication, you might find you need to efficiently convert 32-bit signed and unsigned integers to strings using subtraction only.  Consider:

--- Code: ---#include <stdint.h>

static const  uint32_t  powers_of_ten[9] = {
    UINT32_C(10),
    UINT32_C(100),
    UINT32_C(1000),
    UINT32_C(10000),
    UINT32_C(100000),
    UINT32_C(1000000),
    UINT32_C(10000000),
    UINT32_C(100000000),
    UINT32_C(1000000000),
};

unsigned char u32_to_asciiz(char *const buffer, uint32_t value)
{
    unsigned char  pot = 0;
    unsigned char  len = 0;

    /* Check pot < 9 first, so that powers_of_ten[9] is never read. */
    while (pot < 9 && value >= powers_of_ten[pot]) {
        pot++;
    }

    /* Note: pot is the power of ten for the most significant decimal digit.
             pot == 0 is equivalent to saying value < 10, and
             pot == 9 is equivalent to saying value >= 1000000000. */

    while (pot--) {
        const uint32_t  base = powers_of_ten[pot];
        char            digit = '0';

        while (value >= base) {
            value -= base;
            digit ++;
        }

        buffer[len++] = digit;
    }

    buffer[len++] = '0' + value;
    buffer[len  ] = '\0';

    return len;
}

unsigned char i32_to_asciiz(char *const buffer, int32_t value)
{
    if (value >= 0) {
        return u32_to_asciiz(buffer, (uint32_t)(value));
    } else {
        buffer[0] = '-';
        /* Negate in unsigned arithmetic, so that INT32_MIN is handled without overflow. */
        return u32_to_asciiz(buffer + 1, (uint32_t)0 - (uint32_t)value) + 1;
    }
}

--- End code ---
These require a buffer of sufficient size (12 chars will suffice for all possible values), and return the length, excluding the terminating nul byte.  The string starts at the beginning of the buffer.

For example, compiling the above to ATmega32u4 using GCC 5.4.0 via avr-gcc -std=c11 -Os -Wall -ffreestanding -mmcu=atmega32u4 -c above.c, the above takes 202 bytes of ROM/Flash (166 bytes of code, 36 bytes for the constant array); 262 bytes with -O2 (226 bytes of code, 36 bytes for the constant array).

Interestingly, the runtime cost is not nearly as big as one might think.  The slowest 32-bit unsigned value to convert is 3,999,999,999, which does 75 iterations of the subtraction loop overall (3 for the leading digit and 9 for each of the next eight; the final digit is whatever is left over).  In essence, one trades each division-by-ten-with-remainder operation, for up to nine subtractions.  (This does not count keeping the iteration count – which is always less than 10, or between 48 and 57 if using ascii digits '0'..'9', nor updating the length of the buffer etc., since those should be "native" but the subtraction may be between much wider integers than fit in a single machine register.)

Even implementing the conversion via repeated subtraction for 64-bit integers isn't horribly slow, since even the slowest values like 17,999,999,999,999,999,999 only need 170 iterations of the subtraction loop.

An interested Arduino programmer might wish to try the above on their favourite 8-bit architecture, and see what the timing/efficiency is like, and compare to the conversion functions provided by the base libraries (snprintf() or the String() constructor in Arduino, for example).
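
Before timing anything on real hardware, the number of subtractions a given value costs can be counted on any desktop machine.  The following sketch mirrors the loops of u32_to_asciiz() above but only counts; the function and array names are made up for this example.

```c
#include <stdint.h>

static const uint32_t pow10_u32[9] = {
    UINT32_C(10),       UINT32_C(100),       UINT32_C(1000),
    UINT32_C(10000),    UINT32_C(100000),    UINT32_C(1000000),
    UINT32_C(10000000), UINT32_C(100000000), UINT32_C(1000000000),
};

/* Number of times the inner "value -= base" statement would run
   when converting 'value' with the subtraction-only method. */
unsigned u32_subtraction_count(uint32_t value)
{
    unsigned pot = 0;
    unsigned count = 0;

    /* Find the most significant decimal digit position. */
    while (pot < 9 && value >= pow10_u32[pot])
        pot++;

    /* One subtraction per unit of each digit, except the last digit,
       which is simply what remains in 'value'. */
    while (pot--) {
        const uint32_t base = pow10_u32[pot];
        while (value >= base) {
            value -= base;
            count++;
        }
    }
    return count;
}
```

The count is simply the sum of all decimal digits except the last one, so values with many nines are the expensive ones and round numbers are nearly free.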


Whenever I discover myself needing some kind of integer to string conversion in a very constrained situation (i.e., without existing conversion functions I can use), I do often end up implementing some crude conversion first, then cleaning it up later – not because of laziness, but because I need to see what is needed and useful first, before I am willing to commit to a specific approach.  Just like the Linux kernel developers, who steadfastly refuse any idea of an internal ABI just because it would bind their hands to such ABIs, I too want to keep my options as open as possible, whenever I'm dealing with very tightly constrained situations like interrupt handlers and POSIX signal handlers.

One important takeaway for those learning C from this and my previous post above, is that there is no "best approach".  I've shown some of the tradeoffs I make in certain situations, but the process of first finding out what kind of tradeoffs are possible, and then making informed choices, is the interesting bit.

My initial choices are often wrong, because I learn as I create.  There is nothing bad about that, and indeed I do not even notice, because I try to keep such choices refactorable, and if I have time, sometimes refactor code just to see if I have learned enough in between (writing the original code and when I decide to refactor) to make a difference.  It reminds me of trying new recipes and techniques when cooking, really.

SiliconWizard:
I was going to reply, but your two posts are so detailed that I'm not sure what to add really.
All I can say is that I certainly second the implementation of your own conversion functions if you don't have access to the standard xxprintf() functions, they take up too much space or they are just not efficient enough for your application. I also second the "one-function-per-type" scheme. Formatted functions are all nice, but they are highly inefficient by nature, and can pose some security issues as well. (Just think that the format string itself can be modified in memory, and imagine what can happen in this case...)

Nominal Animal:

--- Quote from: SiliconWizard on July 02, 2021, 08:41:06 pm ---I was going to reply, but your two posts are so detailed that I'm not sure what to add really.

--- End quote ---
Sorry about that; in my enthusiasm for listing my current thoughts on this, I now see I made it really hard for anyone to really grab the talking stick and run with it.
My failure; I do apologize, and am working on it.

Anecdotes of one's own memorable tight spots, and how one found one's way out of them, would be very valuable to both old hands and newbies.  I for one promise to read every single one twice; in my experience, they are just that useful, interesting, and fun.

(In case any one wonders, I am not interested in writing monographs; I desire discussion and argumentation for and against, as that is the fertile soil ideas need to grow on.  In.  At?  English!  If I wanted monographs or just attention to myself, I'd put them on my web site or at github.  This darned verbosity of mine is a fault I do recognize and try to deal with.)

Nominal Animal:
And to continue the data spew, a quick look at those normal, non-constrained situations a C programmer should know and reach for first.

Not all of these are defined in the C standard.  Some are defined in POSIX, and some are very commonly available GNU and/or BSD extensions.  On the rare architectures where they do not exist, it is relatively straightforward (but dull, careful work and testing) to implement these in terms of standard C functions.  My links point to the Linux man pages online project, but only because it very carefully and systematically describes which standards a function conforms to (under the Conforming to heading), plus notes, bugs, oddities, and implementation differences.  I am not pushing you towards Linux, I've just found it more reliable and up to date than any of the alternatives like linux.die.net/man or unix.com/man-page, although those do have some pages not listed in Linux man pages, especially the latter wrt. functions not implemented in any Linux variants, for example those only on OpenSolaris.

These do have the same limitations as the normal C printf() family of functions, and none of these are async-signal safe.

* printf(), fprintf(), vprintf(), and vfprintf() (Standard C)

These are the most commonly used functions of the printf family.  Plain printf() prints to standard output, and fprintf() to any stream (FILE *).
The parameters the formatting specification refers to are passed as extra parameters to fprintf() and printf(), and as a variable argument list to vfprintf() and vprintf() (see <stdarg.h>).


* snprintf() and vsnprintf() (Standard C)

These take a pointer to a buffer, the size of that buffer, a formatting string, and either the parameters needed by the formatting string, or a variable argument list containing those (see <stdarg.h>).  They return the number of characters needed to describe the string (not including a terminating nul), or negative in case of an error.

If the buffer is not large enough, the return value will be the buffer size or larger, but the function will not write past the specified size: the output is truncated, and (in C99 and later, whenever the size is nonzero) still terminated with a nul char.  This means that a return value equal to or greater than the buffer size tells you the buffer was not large enough, and holds only a truncated string.  (Some pre-C99 look-alikes, such as _snprintf() in older Microsoft C runtimes, do not terminate the buffer in this case.)  It is easy to get this logic wrong if you are not aware of it, but here is a snippet as an example of proper handling:
   
    char  buf[10];
    int  len = snprintf(buf, sizeof buf, "%d", X);
    if (len < 0) {
        /* snprintf() reported an error.  Nothing useful in buf[]. */
    } else
    if ((size_t)len < sizeof buf) {
        /* We have the decimal representation of X in buf as a string.
           There is a terminating nul char at buf[len]. */
    } else {
        /* buf was too small to hold the entire string.
           len may be much bigger than sizeof buf,
           so do not assume buf[len] can be accessed.
           buf holds only the first sizeof buf - 1 chars
           of the result, followed by a terminating nul
           (as guaranteed by C99 and later). */
    }


* dprintf() and vdprintf() (POSIX, was GNU long time ago)

These are the printf family functions you can use to format and write a string to a raw descriptor (without the standard I/O stream abstraction represented by FILE * handles).  Typically, you clear errno to zero before calling these, to be able to differentiate between printf formatting errors and I/O errors; and I only recommend using these on files, character devices, and stream sockets, not on datagram sockets.  For datagram sockets, using asprintf()/vasprintf() and then send() on the properly constructed message, is the proper way to ensure the entire message is sent correctly (since send() on datagram sockets does not return short counts in some non-error situations like write() does), and lets you the programmer differentiate between message formatting ("printf-related") and connection I/O issues.


* strftime() (Standard C, extended by POSIX), the Swiss Army Knife for formatting timestamps in a struct tm structure.

You should use clock_gettime(CLOCK_REALTIME, &timespec) to obtain the Unix Epoch time (same as time() and gettimeofday() report) at nanosecond resolution, then convert the .tv_sec to a struct tm using localtime_r() (if you want the readable form in local time) or gmtime_r() (if you want the readable form in UTC/GMT or the closest equivalent).  The only downside with strftime() is that since struct tm does not have a field for fractions of seconds (like struct timespec and struct timeval have), you are limited to second granularity in your timestamps.

If you have multiple 'clients' your code is connected to or services, use newlocale() to get a separate locale_t for each one, then use uselocale() before calling strftime() (or pass the locale_t directly to the POSIX strftime_l()). uselocale() is thread-specific, like errno, so only affects the current thread.

If you intend your code to be localizable, i.e. messages and date and timestamps configurable to each locale and language via gettext(), this function is indispensable, because you only need to make the strftime() format pattern a gettext message, and those creating translations and localizations can then define the date/time format in the localization catalog for this program.  Very powerful, even if limited to one-second precision!
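
The one-second limit of strftime() is easy to work around in unconstrained code by appending the fraction separately.  Here is a minimal sketch, assuming a POSIX system, that produces the "YYYY-MM-DD HH:MM:SS.sss" format in UTC; the function name format_utc_millis() is made up for this example, and it is of course not async-signal safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Format *ts as "YYYY-MM-DD HH:MM:SS.sss" in UTC into buf[size].
   Returns the string length, or -1 if it failed or did not fit. */
int format_utc_millis(char *buf, size_t size, const struct timespec *ts)
{
    struct tm  tm;

    if (!gmtime_r(&ts->tv_sec, &tm))
        return -1;

    /* strftime() returns 0 if the result (with nul) does not fit. */
    const size_t  n = strftime(buf, size, "%Y-%m-%d %H:%M:%S", &tm);
    if (n == 0)
        return -1;

    /* Append the milliseconds; tv_nsec is in 0..999999999. */
    const int  m = snprintf(buf + n, size - n, ".%03ld", ts->tv_nsec / 1000000L);
    if (m < 0 || (size_t)m >= size - n)
        return -1;

    return (int)n + m;
}
```

Typical use is to fill a struct timespec with clock_gettime(CLOCK_REALTIME, &now) and pass its address along with a buffer of at least 24 chars.  Swapping gmtime_r() for localtime_r() gives local time instead.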


* asprintf() and vasprintf() (GNU, BSD)

These functions dynamically allocate a buffer large enough to hold the result, with the pointer stored to the location pointed to by the first parameter.  The second parameter is the familiar printf() family formatting string.  asprintf() takes additional parameters just like printf() does, and vasprintf() takes a variable argument list (see <stdarg.h>).  They return the number of chars in the buffer, not including the terminating nul byte, or negative if an error occurs.  BSD and GNU behave a bit differently if that happens: BSD resets the buffer pointer to NULL, while GNU leaves it unmodified.

I personally warmly recommend the following pattern:
   
    char *buffer = NULL;
    errno = 0;
    int  length = asprintf(&buffer, ...);
    if (length >= 0) {
        /* No problems; buffer[length] == '\0' */
        do_something_with(buffer, length);
        free(buffer);
    } else {
        /* Error; see 'errno' for cause.
           You don't need to, but it is safe to
               buffer = NULL;
           here; you won't leak memory. */
    }


If the <stdarg.h> "variable argument list" stuff sounds odd to you, consider the following working example snippet:
   
    #define  _POSIX_C_SOURCE  200809L
    #define  _GNU_SOURCE
    #include <stdlib.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <errno.h>
   
    static volatile int  my_logging_oom = 0;
   
    /* Log an error with printf formatting support.
       Returns 0 if success, errno error code otherwise. */
    int my_logging_function(const char *fmt, ...)
    {
        va_list  args;
        char  *message = NULL;
        int  len;
       
        va_start(args, fmt);
        len = vasprintf(&message, fmt, args);
        va_end(args);
        if (len < 0) {
            my_logging_oom = 1;
            return errno = ENOMEM;
        }
       
        somehow_log_message(message, len);
       
        free(message);
        return 0;
    }

The #defines tell the C library on Linux to expose both POSIX and GNU extensions in the header files included.  BSDs expose them by default.

The my_logging_oom variable is just a volatile flag used to record if logging ever fails due to Out Of Memory.  I'd expect other code to examine it every now and then, and report to the user if it ever becomes nonzero.

The only "trick" with this kind of variadic function is that its variadic parameters go through default argument promotion, as described in the C standard.  Essentially, both float and double are passed as double.  Any integer types smaller than int will be converted to int if that retains the value, and to unsigned int otherwise.  Fortunately, this does not affect pointers: a pointer to a float is passed as a pointer to a float, because pointers are not subject to default argument promotion.  It doesn't affect arrays either, because they decay to a pointer to their first element, so passing the name of an array to a variadic function is the same as passing a pointer to the first element of that array.
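
To illustrate the promotions, here is a small sketch.  The function sum_values() and its one-letter type codes are made up for this example; the point is only which type each va_arg() has to name:

```c
#include <stdarg.h>

/* Hypothetical example.  Each character in 'types' names the type of the
   corresponding variadic argument as the caller wrote it:
   'c' char, 'h' short, 'i' int, 'f' float, 'd' double. */
double sum_values(const char *types, ...)
{
    va_list  args;
    double   sum = 0.0;

    if (!types)
        return 0.0;

    va_start(args, types);
    for (; *types; types++) {
        switch (*types) {
        case 'c': case 'h': case 'i':
            /* char and short were promoted to int by the caller. */
            sum += va_arg(args, int);
            break;
        case 'f': case 'd':
            /* float was promoted to double by the caller. */
            sum += va_arg(args, double);
            break;
        default:
            /* Unknown code: we cannot know what was passed, so stop here. */
            va_end(args);
            return sum;
        }
    }
    va_end(args);
    return sum;
}
```

For example, sum_values("chifd", (char)1, (short)2, 3, 4.5f, 5.25) fetches three ints and two doubles.  Using va_arg(args, float) or va_arg(args, char) instead would be undefined behaviour.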

So, if you wanted, you definitely could implement your own printf() family of functions using <stdarg.h>.  However, as SiliconWizard mentioned, this formatting approach is not always superior to just constructing the message piece by piece, using a type-specific function call for each conversion.  Aside from ease of use, the one truly useful thing is making the printf format specification be a gettext() message, so that end users can very easily translate and localize the program without recompiling the binaries and adding code.  A practical example:

--- Code: ---
#include <stdlib.h>
#include <locale.h>
#include <string.h>
#include <stdio.h>
#include <libintl.h>

#define _(msgid) (gettext(msgid))

int main(int argc, char *argv[])
{
    /* This program is locale-aware. */
    setlocale(LC_ALL, "");

    /* Let's call ourselves 'greeting', so that if you want, you can put
       a message catalog at say /usr/share/locale/<yourlocale>/LC_MESSAGES/greeting.mo
    */
    textdomain("greeting");

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        const char  *arg0 = (argc > 0 && argv && argv[0] && argv[0][0]) ? argv[0] : "(this)";

        fprintf(stderr, _("\n"
                          "Usage: %1$s [ -h | --help ]\n"
                          "       %1$s NAME1 NAME2\n"
                          "\n"
                          "This prints a localizable greeting.\n"
                          "\n"), arg0);

        return EXIT_FAILURE;
    }

    const char  *name1 = (argv[1][0]) ? argv[1] : _("(emptyname1)");
    const char  *name2 = (argv[2][0]) ? argv[2] : _("(emptyname2)");

    printf(_("Hey %1$s, %2$s sends their greetings.\n"), name1, name2);

    return EXIT_SUCCESS;
}

--- End code ---
Note that a formatting directive that begins with say %1$ means "the first variadic argument", whereas a plain % means "the next variadic argument".  You can use either form in any printf formatting string, but you cannot and must not mix the two forms in the same string.  As an example, %3$d formats the third variadic argument as an int, and %2$s the second variadic argument as a string.

A message catalog for say "Formal English" might replace the "Hey ..." message with say "Greetings from %2$s to %1$s.\n".
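
To see the reordering without involving any message catalogs, here is a small sketch that passes the pattern in as a parameter.  format_greeting() is a hypothetical helper, and the numbered %n$ form is a POSIX feature, not part of standard C:

```c
#include <stdio.h>

/* Hypothetical helper: 'pattern' is expected to refer to the recipient
   as %1$s and to the sender as %2$s, in whichever order it likes.
   Returns the length of the result, or -1 if it failed or did not fit. */
int format_greeting(char *buf, size_t size, const char *pattern,
                    const char *recipient, const char *sender)
{
    int  len;

    if (!buf || size < 1 || !pattern || !recipient || !sender)
        return -1;

    len = snprintf(buf, size, pattern, recipient, sender);
    if (len < 0 || (size_t)len >= size)
        return -1;

    return len;
}
```

Calling it with the patterns "Hey %1$s, %2$s sends their greetings." and "Greetings from %2$s to %1$s.", with the names in the same order both times, produces the names in different orders in the output; the calling code does not need to know which.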

I personally do not like using a _(msgid) macro at all, and much prefer say MESSAGE() or LOCALIZE() instead.  The reason I kept it in above is that it is a common pattern in C that confuses those who don't know about it beforehand, so I thought it a good idea to stuff that in there as well.

If you want to play with the above, save the following as greeting.po:

--- Code: ---
msgid ""
msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid ""
"\n"
"Usage: %1$s [ -h | --help ]\n"
"       %1$s NAME1 NAME2\n"
"\n"
"This prints a localizable greeting.\n"
"\n"
msgstr ""
"\n"
"Usage: %1$s [ -h | --help ]\n"
"       %1$s NAME1 NAME2\n"
"\n"
"This prints a localizable greeting from NAME2 to NAME1.\n"
"\n"

msgid "(emptyname1)"
msgstr "(unknown person)"

msgid "(emptyname2)"
msgstr "(unknown person)"

msgid "Hey %1$s, %2$s sends their greetings.\n"
msgstr "Greetings from %2$s to %1$s.\n"

--- End code ---
where msgid describes the exact key the program is looking for, and msgstr the replacement for this message catalog.
(If there is no message catalog, msgid is used as-is. That's why it looks a bit funky at first glance. It's a very simple, easy to manage format, though.)

You can 'compile' the human-readable greeting.po into an efficient binary message catalog file greeting.mo using
    msgfmt greeting.po -o greeting.mo
and install that to say the en_GB locale via
    sudo install -m 0644 -o root -g root  greeting.mo  /usr/share/locale/en_GB/LC_MESSAGES/
You can then compare the program output when run with different locales:
    LC_MESSAGES=C ./greeting
    LC_MESSAGES=en_GB.utf8 ./greeting

Do check locale -a output and the /usr/share/locale/ directory to see which locales you have available.  Many Linux distributions use the .utf8 suffix to denote UTF-8 locales, but there are alternate ways, so the above might not apply exactly as-is to yours.

Obviously, there are much better tools and even IDEs for maintaining and dealing with message catalogs; the above is just the most basic functioning example I could put together.  Interesting stuff, anyway, and perhaps important as a counterpoint to why/when one should use the standard tools for string formatting, instead of rolling one's own.
