Author Topic: Constructing short C strings in limited or constrained situations  (Read 5836 times)


Nominal Animal (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
When constructing strings in C, snprintf() and vsnprintf() are among the first tools we reach for.  Unfortunately, in many embedded and freestanding environments they may not be available, or their code size is too large to be practical; and even in hosted POSIXy systems, one cannot use them in signal handlers, because they are not async-signal safe.

I recently realized that the tools I usually reach for in these situations can be divided into two categories: small string buffers and reverse builders.  Small string buffers are exactly what they sound like, just written to be small and efficient (in both time and space) with minimal to no external dependencies (so usable even in interrupt handlers in embedded environments, and in signal handlers in hosted POSIX C environments).  Reverse builders are also string buffers, except they work bass-ackwards, from the end towards the beginning of the string.  If the string consists (mostly) of integers in decimal form, reverse builders are desirable because they do no extra work at all (like moving the string buffer around, or extra divisions by ten to find out how many characters to reserve; most integer formatting implementations tend to do one or the other).

So, let me describe the small string buffers: their use cases, their design requirements, and my preferred implementation.  If you are a new C programmer or still learning about these things, perhaps there is something about the process, or how one ends up with this kind of thing, that may be useful to you.  If you are an old hand at C (or the subset of C++ used in embedded environments, which sadly usually omits all the nice C++ string stuff), perhaps you have a different approach that you care to compare and contrast?  Or if you find this yucky, feel free to say so, although I do need to point out I only have these because I've needed them in practice.  You cannot always reasonably punt the work from the constrained/limited/critical section to, say, a worker thread; sometimes it just makes the most sense to do it right there in the bind, even if one needs to implement one's own tools for it.

Use cases

Most typical use cases are human-readable time stamps ("YYYY-MM-DD HH:MM:SS.sss" being my preferred format) and reporting variable values to a buffer or stream of some sort, for logging purposes, or as responses to a query or a command.

The common theme here is a very restricted environment.  Your code may be an interrupt handler on an AVR, with limited memory and computing resources, and because almost none of the base library functions are re-entrant (safe to use in an interrupt handler), you either do the formatting elsewhere, or with your own code.  Or you might have a Linux program or daemon of some sort, and want to log some signals directly from their signal handlers (which restricts you to async-signal safe functions), without setting up a dedicated thread to sigwait()/sigtimedwait() on those signals.  If you can use snprintf()/vsnprintf(), by all means do so: you should only reach for this kind of alternative when you reasonably cannot use the formatting functions provided by your C (or subset-of-C++) environment.

For the same reasons, these strings are severely restricted in length.  Typically, they are less than a hundred characters long.  Sometimes you have a small formatted part before and/or after a bulk data part, so the total may be much higher.  I personally find a hard upper limit of say 250 characters here perfectly acceptable.

When working on embedded systems, I do not want to waste RAM, though.  For the same reason, I don't want to use C++ templates for each different string buffer length either; I need the functionality small and tight: think very specific surgical tool, not a Swiss Army knife.

Design criteria

  • I want to specify the string buffer size at compile time (or, if on-stack, at run time), and not waste any memory.
  • I want a single set of functions to act on the buffers.  I do not want multiple copies of the functions just because they access string buffers of slightly different sizes.
  • I want the functions to be fully re-entrant (meaning that if code working on one string buffer is interrupted, the interrupting code is safe to work on any other string buffer) and async-signal safe.
  • I accept a low hard upper limit on the buffer size, say around 250 chars or so.  Typically, mine are much smaller, on the order of 30 or so.
  • I want these to be efficient.  They don't need to be optimized yet, just do what they do without wasting effort on theoretical stuff.
  • I want each string buffer to also record its state, so that I can stuff the components that form the strings into it, and only have to check at the end whether it all fit.

Practical use example

An interrupt or signal handler needs to provide a string containing a timestamp in human readable form.  The timestamp is a 32-bit unsigned integer in milliseconds.  That is, if it has value 212×1000×60×60+34×1000×60+56×1000+789 = 765,296,789 = 0x2D9D8095, the timestamp should read "212:34:56.789".  Thus:

        uint32_t  millisecs;  // To be described in human readable form
        SSB_DEFINE(timestamp, 14);  // Maximum 32-bit timestamp corresponds to 1193:02:47.295, which has 14 characters
       
        const int  hours = millisecs / (60*60*1000);
        millisecs -= (uint32_t)hours * (60*60*1000);  // or millisecs %= 60*60*1000;
        ssb_append_int(timestamp, hours, 0);
        ssb_append_chr(timestamp, ':');
        const int  minutes = millisecs / (60*1000);
        millisecs -= (uint32_t)minutes * (60*1000);  // or millisecs %= 60*1000;
        ssb_append_int(timestamp, minutes, 2);
        ssb_append_chr(timestamp, ':');
        const int  seconds = millisecs / 1000;
        millisecs -= (uint32_t)seconds * 1000;  // or millisecs %= 1000;
        ssb_append_int(timestamp, seconds, 2);
        ssb_append_chr(timestamp, '.');
        ssb_append_int(timestamp, (int)millisecs, 3);
       
        if (ssb_valid(timestamp)) {
            // string is ssb_ptr(timestamp, NULL), length ssb_len(timestamp, 0).
        } else {
            // Failed. Either buffer was too short, or a conversion couldn't fit
            // in the allotted space, or something similar.
        }

In a file scope, one can declare a small string buffer via say
    SSB_DECLARE(common, 30);
and in each scope where it is used (but note that these uses must not interrupt each other!), it is just initialized with
        SSB_INIT(common);

In a local scope, to reuse memory currently unused for a small string buffer, one can use say
        SSB_STEAL(newname, pointer_to_memory, size_of_memory);
but again, it is up to you to make sure that while this memory is being used as a string buffer, we don't get interrupted by some other code that also uses it.  Similarly, if we steal it, we'll be rewriting its contents, so at this point it better not have any useful stuff in it anymore.
This will declare a new variable named newname, compatible with the ssb_ functions, pointing to the data.  Hopefully, the compiler is smart enough to optimize the variable away.  (In C, one can only make the new one const, but in C++ the newname pointer can be a constexpr.)

To append floating point number in decimal format (not scientific format) using six decimal places, one could use say
        ssb_append_float(buffer, value, 0, 6);
where the first 0 indicates "as many digits as needed on the left side of the decimal point", and 6 "six digits on the right side of the decimal point".

(I like using positive numbers to indicate fixed numbers of digits with zero padding, negative for fixed width with zero padding, and zero for "the size of the value without any extra padding". I do not normally have a way to force a positive sign; the decimal strings that I tend to need are either negative (with a negative sign), or nonnegative (without any signs).)

In hosted environments, I like having ssb_write(buffer, descriptor); which tries hard to write the string buffer contents to the specified descriptor.  That is, if the string buffer is valid, and none of the operations on it have failed; if they have, then this (and all other ssb_ functions) do nothing.
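The post does not show ssb_write() itself, so here is a minimal sketch of how it could look on a hosted POSIX system, assuming the layout described in the Implementation section below (first byte maxlen, second byte len, contents from the third byte on) and an errno-style return convention; the function name, signature, and error codes are my assumptions, not the author's actual code.  It uses only the async-signal-safe write() call, retries on EINTR, and preserves errno for signal-handler hygiene:

```c
#include <unistd.h>
#include <errno.h>
#include <limits.h>

/* Write the contents of a small string buffer to a descriptor.
   Returns 0 on success, an errno error code otherwise.
   Does nothing unless the buffer is in the Valid state. */
static int ssb_write(unsigned char *ssb, int descriptor)
{
    /* Valid state: 1 <= maxlen <= UCHAR_MAX-1 and len <= maxlen. */
    if (ssb[0] < 1 || ssb[0] >= UCHAR_MAX || ssb[1] > ssb[0])
        return EINVAL;

    const unsigned char *p = ssb + 2;   /* string contents */
    unsigned char  left = ssb[1];       /* chars still unwritten */
    const int  saved_errno = errno;     /* preserve errno in signal handlers */

    while (left > 0) {
        ssize_t  n = write(descriptor, p, left);
        if (n > 0) {
            p += n;
            left -= (unsigned char)n;
        } else if (n == -1 && errno == EINTR) {
            continue;                   /* interrupted; retry */
        } else {
            const int err = (n == -1) ? errno : EIO;
            errno = saved_errno;
            return err;
        }
    }

    errno = saved_errno;
    return 0;
}
```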

Implementation

A small string buffer is an array of 4 to 257 unsigned chars.
The first one is the maximum number of chars it can hold (maxlen), not including the trailing nul char.  This one must be between 1 and 254 (UCHAR_MAX-1), inclusive; the smallest and largest possible values are reserved (and detected as *broken* buffers, scribbled over by some other code).
The second one is the current number of chars in the buffer (len), not including the trailing nul char.
There is always room for the trailing nul, but an implementation may only set it there when the buffer pointer is obtained via a ssb_ptr() call.

Each small string buffer has three states:
  • Valid: Operations have all succeeded thus far. (maxlen >= 1 && maxlen < UCHAR_MAX && len >= 0 && len <= maxlen).
  • Invalid: Operations have failed. (maxlen >= 1 && maxlen < UCHAR_MAX && len > maxlen).
  • Broken: Someone overwrote maxlen. (maxlen < 1 || maxlen >= UCHAR_MAX).
Initializing operations cause the buffer to have maxlen computed based on their size, with len zero, and the first byte of data also zero.  The rest of the data bytes do not need to be initialized.  (On embedded architectures, this uses three assignments instead of an initializer, to keep the ROM/Flash overhead minimal.)
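To make the layout concrete, here is one possible sketch of the declaration/initialization macros and the state-check functions, using the names from this thread (SSB_DEFINE, SSB_DECLARE, SSB_INIT, ssb_valid, ssb_invalid, ssb_broken); the actual implementation may well differ, this is just the layout described above (data[0] = maxlen, data[1] = len, contents plus trailing nul from data[2] on) written out:

```c
#include <limits.h>

/* Define and initialize a buffer holding up to 'size' chars (1..254). */
#define SSB_DEFINE(name, size)  unsigned char name[(size) + 3] = { (size), 0, 0 }

/* Declare without initializing (e.g. at file scope); initialize later. */
#define SSB_DECLARE(name, size) unsigned char name[(size) + 3]

/* Three assignments instead of an initializer, to keep ROM/Flash use low. */
#define SSB_INIT(name)          do { (name)[0] = sizeof (name) - 3; \
                                     (name)[1] = 0;                 \
                                     (name)[2] = 0; } while (0)

static inline int ssb_broken(const unsigned char *ssb)
{
    return ssb[0] < 1 || ssb[0] >= UCHAR_MAX;   /* maxlen scribbled over */
}

static inline int ssb_valid(const unsigned char *ssb)
{
    return !ssb_broken(ssb) && ssb[1] <= ssb[0];  /* len within maxlen */
}

static inline int ssb_invalid(const unsigned char *ssb)
{
    return !ssb_broken(ssb) && ssb[1] > ssb[0];   /* an operation failed */
}
```

Note how the Invalid state costs nothing extra to represent: any len greater than maxlen marks it, so a failing append only needs to set len out of range.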

Aside from the maximum length value, and the state logic encoded by the length and maximum length values, this is a common string or string buffer format, called either length-prefixed, or more commonly Pascal strings, since many Pascal compilers use(d) a length-prefixed format for strings.

ssb_valid(), ssb_invalid(), and ssb_broken() functions can be used to check the buffer state, but I rarely need them; usually only to set up a debugging flag if a report had to be skipped because the string buffer was too small.  These functions also tend to be static inline and/or carry the pure, used, and leaf function attributes, so that the compiler doesn't include them at all in the final binary if they're not used, and can optimize them with rather extreme prejudice.  They are so simple that even on AVR, it would be more effort to move the data to suitable registers for a function call than just straight-up inlining the accesses.

Other ssb_ functions only operate on small string buffers in the Valid state.  If the buffer is in the Invalid or Broken state, it is not modified, nor accessed beyond the two bytes that define maxlen and len.

A particular detail I like is that the pointer accessor function takes an extra ifbad argument,
    char *ssb_ptr(buffer, char *ifbad);
as does the length accessor function,
    int  ssb_len(buffer, int ifbad);
so that, e.g. via ternary expressions, one can handle the invalid/broken buffer states with a custom return value.
These are static inline accessor functions, because on most architectures moving the parameters to their respective registers is as much code as doing the accesses, so the extra if-bad parameter/return value doesn't really cost anything.
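Under the same layout assumption as before (data[0] = maxlen, data[1] = len, contents from data[2]), these accessors could be sketched as follows; again the exact signatures are my guess from the descriptions above:

```c
#include <limits.h>

/* Return a pointer to the string contents, or ifbad if the buffer
   is Invalid or Broken.  The trailing nul is set lazily, here. */
static inline char *ssb_ptr(unsigned char *ssb, char *ifbad)
{
    if (ssb[0] < 1 || ssb[0] >= UCHAR_MAX || ssb[1] > ssb[0])
        return ifbad;
    ssb[2 + ssb[1]] = '\0';      /* there is always room for the nul */
    return (char *)(ssb + 2);
}

/* Return the string length, or ifbad if the buffer is Invalid or Broken. */
static inline int ssb_len(const unsigned char *ssb, int ifbad)
{
    if (ssb[0] < 1 || ssb[0] >= UCHAR_MAX || ssb[1] > ssb[0])
        return ifbad;
    return ssb[1];
}
```

A typical use would then be e.g. `write(fd, ssb_ptr(buf, ""), ssb_len(buf, 0))`, which degrades to a zero-length write if anything went wrong.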

My current implementations of the equivalents of ssb_append_int() and ssb_append_float() do include an "extra" loop that divides the integral part by ten repeatedly, to find the number of integer digits needed.  That loop is then repeated to generate the actual digits.  (This can be avoided by using the free buffer space starting at the very end, so that only one divide-by-ten loop is needed, followed by copying/moving the data to the correct position in the buffer.  Both have their use cases.)  The fractional digits are generated by repeated multiplications by ten, followed by extracting the integer part, so it is also suitable for any fixed-point formats one might use (although one does need to write the ssb_append_type() function for each different fixed-point format).
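The two-loop approach can be sketched like this; it is not the author's actual ssb_append_int(), just an illustration of the technique under the same assumed layout (data[0] = maxlen, data[1] = len), with a positive `digits` meaning zero-padding to that many digits, and 0 meaning "as many digits as the value needs":

```c
/* Append a decimal integer to a small string buffer.
   On failure to fit, marks the buffer Invalid (len > maxlen). */
static void ssb_append_int(unsigned char *ssb, int value, int digits)
{
    if (ssb[0] < 1 || ssb[0] >= 255 || ssb[1] > ssb[0])
        return;                          /* Invalid or Broken: do nothing */

    /* Unsigned negation avoids overflow even for INT_MIN. */
    unsigned int  u = (value < 0) ? 0u - (unsigned int)value
                                  : (unsigned int)value;

    /* First loop: count the digits needed. */
    int  n = 0;
    unsigned int  temp = u;
    do { n++; temp /= 10; } while (temp);
    if (n < digits)
        n = digits;                      /* zero-padding requested */

    const int  total = n + (value < 0); /* plus one for the minus sign */
    if (ssb[1] + total > ssb[0]) {
        ssb[1] = ssb[0] + 1;             /* does not fit: mark Invalid */
        return;
    }

    if (value < 0)
        ssb[2 + ssb[1]++] = '-';

    /* Second loop: emit the digits in reverse, last digit first. */
    unsigned char *p = ssb + 2 + ssb[1] + n;
    ssb[1] += n;
    while (n--) {
        *(--p) = '0' + (u % 10);
        u /= 10;
    }
}
```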

While the repeated division by ten with the remainder giving the next decimal digit, and the repeated multiplication of the fractional part by ten with the integral part giving the next digit (incrementing the final digit if necessary after comparing the final fractional part to 0.5), may sound inefficient, they actually easily outperform most printf() implementation variants with the same formatting specs, while producing the same string.  So, while I definitely do not consider these (or at least my) small string buffers optimized, I do consider them efficient.  A possible exception is that integer-part loop on different architectures: those that have a fast and compact divide-by-ten-and-keep-remainder machine instruction will prefer the extra division loop, others will prefer copying the few bytes over.

I can provide my own implementations as CC-BY, but at this point, I haven't fully integrated everything into a single codebase, and I'm seeing so many #ifdef macros in the single-source version that I think it might be better to keep the interface the same but the decidedly different implementations (say, AVR vs. AMD64) separate.  I'm not sure about the name, either; my source right now uses tinystr, but I think some three-letter acronym implying "buf" would work better, although "ssbuf" sounded a bit off to me.  And my idea here was not to look at just my code; I want to see what others do and use.  You know, shop talk.

Those I've talked to face-to-face tell me they don't like showing this kind of code, because although theirs is just as functional as mine, this kind of constrained code is often left in its ugliest working form; only worked on if it breaks, rewritten from scratch (or from memory) for each use case.  Because, face it, people most often just choose to wing it instead, hoping that this time they won't get bitten by re-entrancy issues or async-signal unsafety if they use snprintf() etc.; and if they do get bitten, they (and I myself) more often punt these to worker threads and less constrained parts, rather than work on such ugly little code snippets.  In all honesty, just look at even the outlined interfaces: I even need macros to declare/initialize/define the buffer variables, and that is just the start of the oddities.

The reason I originally started paying attention to integer and floating-point number formatting was when I was dealing with huge molecular datasets ("movies", dozens of gigabytes on slowish spinning rust, well over a decade ago now) in a nicely spatially and temporally sliceable distributed binary form (dataset scattered over hundreds of binary-formatted files).  I needed to generate the interchange/transport formats (various human-readable text formats for visualization et cetera), and that was severely bottlenecking on printf().  A main source saving to those binary files was written in Fortran 95, too...  So, I wrote faster convert-double-to-string versions, and verified that the strings matched for all normal (as in finite, non-subnormal) doubles.  (Well, not all 2⁶² or so of such values, but maybe 2⁴⁰ of them.)  It was so ugly I hid the code, but it was/is functional.
 
The following users thanked this post: evb149, Jacon, DiTBho

Nominal Animal (Topic starter)
Re: Constructing short C strings in limited or constrained situations
« Reply #1 on: July 02, 2021, 08:18:33 pm »
Reverse string building

Building strings that consist mostly of integers in reverse can simplify and speed up the operation significantly.  For us humans, using an imperative language like C, that reverse order can be somewhat disorienting, though.

For example, consider a function that is provided with a buffer with room for at least 25 chars and three signed 16-bit integers, and formats the integers into a human-readable three-component geometric vector, "(x, y, z)":
Code: [Select]
#include <stdint.h>

static char *backwards_i16(char *ptr, int16_t val)
{
    uint16_t  u = (val < 0) ? (uint16_t)0 - (uint16_t)val : (uint16_t)val;  /* unsigned negation: safe even for -32768 */

    do {
        *(--ptr) = '0' + (u % 10);
        u /= 10;
    } while (u);

    if (val < 0)
        *(--ptr) = '-';

    return ptr;
}

/** Return a pointer to "(x, y, z)" somewhere within the buffer,
 *  with buffer having room for at least 1+6+2+6+2+6+1+1 = 25 chars.
*/
char *i16_vec3_to_asciiz(char *buffer, int16_t x, int16_t y, int16_t z)
{
    char *p = buffer + 1 /* '(' */
                     + 6 /* -32768 .. 32767 */
                     + 2 /* ", " */
                     + 6 /* -32768 .. 32767 */
                     + 2 /* ', " */
                     + 6 /* -32768 .. 32767 */
                     + 1 /* ')' */
                     + 1 /* terminator, '\0'. */ ;

    *(--p) = '\0';
    *(--p) = ')';

    p = backwards_i16(p, z);

    *(--p) = ' ';
    *(--p) = ',';

    p = backwards_i16(p, y);

    *(--p) = ' ';
    *(--p) = ',';

    p = backwards_i16(p, x);

    *(--p) = '(';

    /* Note: if the caller expects the result to start at the beginning
       of the buffer, we need to do the equivalent of
            if (p > buffer)
                memmove(buffer, p, (size_t)(buffer+1+6+2+6+2+6+1+1 - p));
       Our description says result is somewhere within the buffer,
       so we do not need to.
    */

    return p;
}
As you can see, the code is very compact, and relatively straightforward.  What is difficult is remembering that to see what kind of string it constructs, one needs to read the code backwards: start at the Note: comment, then go upwards, until you see *(--p) = '\0'; which is responsible for terminating the string.

The basic operation used here is divide by ten with remainder.  It does not imply that hardware division is actually used, though.  We can write the basic operation as
    unsigned char  div10(unsigned int *const arg) {
        const unsigned char  result = (*arg) % 10;
        (*arg) /= 10;
        return result;
    }
which on most architectures does not actually involve hardware division at all, but a (wider word-width) multiplication using a reciprocal represented as a fixed-point integer.  For example, on AMD64, Clang-10 -O2 generates the same code for the above as for
    unsigned char  div10(unsigned int *const arg) {
        const unsigned int  dividend = *arg;
        const unsigned int  quotient = (uint64_t)((uint64_t)dividend * 3435973837) >> 35;
        const unsigned char  remainder = dividend - 10*quotient;
        *arg = quotient;
        return remainder;
    }
where the magic constant, 3435973837 / 2³⁵, represents 0.1000000000.

In general, dividing an unsigned integer q by a constant non-power-of-two positive divisor d is based on q/d ≃ (a×q+c)/2ⁿ, with the parenthesized part computed using extra range (often twice that of the quotient q).
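To see that the d = 10, a = 3435973837, n = 35 parameters above really do reproduce hardware division over the whole 32-bit range, one can write the reciprocal form as a standalone function and compare it against the / operator; the multiplier is ceil(2³⁵/10), which is why the approximation never undershoots for 32-bit operands:

```c
#include <stdint.h>

/* Divide a 32-bit unsigned integer by ten using the fixed-point
   reciprocal: q/10 == (q * ceil(2^35/10)) >> 35 for all 32-bit q. */
static uint32_t div10_reciprocal(uint32_t q)
{
    return (uint32_t)(((uint64_t)q * UINT64_C(3435973837)) >> 35);
}
```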

Converting integers to strings using subtraction only

Sometimes you have an architecture where division by a constant (implemented either in hardware, or similarly to the above as multiplication by a reciprocal) is too costly.  There, you can use repeated subtraction.

For example, on an 8-bit architecture without hardware division or multibyte multiplication, you might find you need to efficiently convert 32-bit signed and unsigned integers to strings using subtraction only.  Consider:
Code: [Select]
#include <stdint.h>

static const  uint32_t  powers_of_ten[9] = {
    UINT32_C(10),
    UINT32_C(100),
    UINT32_C(1000),
    UINT32_C(10000),
    UINT32_C(100000),
    UINT32_C(1000000),
    UINT32_C(10000000),
    UINT32_C(100000000),
    UINT32_C(1000000000),
};

unsigned char u32_to_asciiz(char *const buffer, uint32_t value)
{
    unsigned char  pot = 0;
    unsigned char  len = 0;

    /* Check pot first, to avoid reading past the end of powers_of_ten. */
    while (pot < 9 && value >= powers_of_ten[pot]) {
        pot++;
    }

    /* Note: pot is the power of ten for the most significant decimal digit.
             pot == 0 is equivalent to saying value < 10, and
             pot == 9 is equivalent to saying value >= 1000000000. */

    while (pot--) {
        const uint32_t  base = powers_of_ten[pot];
        char            digit = '0';

        while (value >= base) {
            value -= base;
            digit ++;
        }

        buffer[len++] = digit;
    }

    buffer[len++] = '0' + value;
    buffer[len  ] = '\0';

    return len;
}

unsigned char i32_to_asciiz(char *const buffer, int32_t value)
{
    if (value >= 0) {
        return u32_to_asciiz(buffer, (uint32_t)(value));
    } else {
        buffer[0] = '-';
        /* Unsigned negation is safe even for INT32_MIN. */
        return u32_to_asciiz(buffer + 1, (uint32_t)0 - (uint32_t)value) + 1;
    }
}
These require a buffer of sufficient size (12 chars will suffice for all possible values), and return the length, excluding the terminating nul byte.  The string starts at the beginning of the buffer.

For example, compiling the above to ATmega32u4 using GCC 5.4.0 via avr-gcc -std=c11 -Os -Wall -ffreestanding -mmcu=atmega32u4 -c above.c, the above takes 202 bytes of ROM/Flash (166 bytes of code, 36 bytes for the constant array); 262 bytes with -O2 (226 bytes of code, 36 bytes for the constant array).

Interestingly, the runtime cost is not nearly as big as one might think.  The slowest 32-bit unsigned value to convert is 3,999,999,999, which does 75 iterations of the subtraction loop overall: each digit of value d costs d subtractions, and the final digit is emitted directly.  In essence, one trades each division-by-ten-with-remainder operation for up to nine subtractions.  (This does not count keeping the per-digit count, which is always less than 10, or between 48 and 57 if using ASCII digits '0'..'9', nor updating the length of the buffer etc., since those should be "native"; the subtraction, however, may be between much wider integers than fit in a single machine register.)

Even implementing the conversion via repeated subtraction for 64-bit integers isn't horribly slow, since even the slowest values, like 17,999,999,999,999,999,999, only need 161 iterations of the subtraction loop.

An interested Arduino programmer might wish to try the above on their favourite 8-bit architecture, and see what the timing/efficiency is like, and compare to the conversion functions provided by the base libraries (snprintf() or the String() constructor in Arduino, for example).



Whenever I discover myself needing some kind of integer-to-string conversion in a very constrained situation (i.e., without existing conversion functions I can use), I do often end up implementing some crude conversion first, then cleaning it up later – not because of laziness, but because I need to see what is needed and useful first, before I am willing to commit to a specific approach.  Just like the Linux kernel developers, who steadfastly refuse any idea of a stable internal ABI just because it would bind their hands, I too want to keep my options as open as possible, whenever I'm dealing with very tightly constrained situations like interrupt handlers and POSIX signal handlers.

One important takeaway for those learning C from this and my previous post above, is that there is no "best approach".  I've shown some of the tradeoffs I make in certain situations, but the process of first finding out what kind of tradeoffs are possible, and then making informed choices, is the interesting bit.

My initial choices are often wrong, because I learn as I create.  There is nothing bad about that, and indeed I do not even notice, because I try to keep such choices refactorable, and if I have time, sometimes refactor code just to see if I have learned enough in between (writing the original code and when I decide to refactor) to make a difference.  It reminds me of trying new recipes and techniques when cooking, really.
 
The following users thanked this post: DiTBho

SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Constructing short C strings in limited or constrained situations
« Reply #2 on: July 02, 2021, 08:41:06 pm »
I was going to reply, but your two posts are so detailed that I'm not sure what to add really.
All I can say is that I certainly second implementing your own conversion functions if you don't have access to the standard xxprintf() functions, if they take up too much space, or if they are just not efficient enough for your application.  I also second the "one-function-per-type" scheme.  Formatted functions are all nice, but they are highly inefficient by nature, and can pose some security issues as well.  (Just think that the format string itself can be modified in memory, and imagine what can happen in that case...)
 

Nominal Animal (Topic starter)
Re: Constructing short C strings in limited or constrained situations
« Reply #3 on: July 02, 2021, 09:47:44 pm »
I was going to reply, but your two posts are so detailed that I'm not sure what to add really.
Sorry about that; in my enthusiasm for listing my current thoughts on this, I now see I made it really hard for anyone to really grab the talking stick and run with it.
My failure; I do apologize, and am working on it.

Anecdotes of one's own memorable tight spots, and how one found one's way out of them, would be very valuable to both old hands and newbies.  I for one promise to read every single one twice; in my experience, they are just that useful, interesting, and fun.

(In case any one wonders, I am not interested in writing monographs; I desire discussion and argumentation for and against, as that is the fertile soil ideas need to grow on.  In.  At?  English!  If I wanted monographs or just attention to myself, I'd put them on my web site or at github.  This darned verbosity of mine is a fault I do recognize and try to deal with.)
 

Nominal Animal (Topic starter)
Re: Constructing short C strings in limited or constrained situations
« Reply #4 on: July 03, 2021, 12:01:48 am »
And to continue the data spew, a quick look at those normal, non-constrained situations a C programmer should know and reach for first.

Not all of these are defined in the C standard.  Some are defined in POSIX, and some are very commonly available GNU and/or BSD extensions.  On the rare architectures where they do not exist, it is relatively straightforward (but dull, careful work and testing) to implement them in terms of standard C functions.  My links point to the Linux man pages online project, but only because it very carefully and systematically describes which standards a function conforms to (under the Conforming to heading), plus notes, bugs, oddities, and implementation differences.  I am not pushing you towards Linux; I've just found it more reliable and up to date than alternatives like linux.die.net/man or unix.com/man-page (although those do describe some functions not listed in the Linux man pages, especially the latter for functions not implemented in any Linux variant, for example those only in OpenSolaris).

These have the same limitations as the normal C printf() family of functions, and none of them are async-signal safe.
  • printf(), fprintf(), vprintf(), and vfprintf() (Standard C)

    These are the most commonly used functions of the printf family.  Plain printf() prints to standard output, and fprintf() to any stream (FILE *).
    The parameters the formatting specification refers to are passed as extra parameters to fprintf() and printf(), and as a variable argument list to vfprintf() and vprintf() (see <stdarg.h>).

  • snprintf() and vsnprintf() (Standard C)

    These take a pointer to a buffer, the size of that buffer, a formatting string, and either the parameters needed by the formatting string, or a variable argument list containing those (see <stdarg.h>).  They return the number of characters needed to describe the string (not including a terminating nul), or negative in case of an error.

    If the buffer is not large enough, the return value will be the buffer size or larger, but the function will not write past the specified size.  This means that a return value greater than or equal to the buffer size indicates the output was truncated.  (A C99-conforming snprintf() still nul-terminates the buffer whenever the size is nonzero; some pre-C99 and Windows variants do not.)  It is easy to get this logic wrong if you are not aware of it, but here is a snippet as an example of proper handling:
       
        char  buf[10];
        int  len = snprintf(buf, sizeof buf, "%d", X);
        if (len < 0) {
            /* snprintf() reported an error.  Nothing useful in buf[]. */
        } else
        if ((size_t)len < sizeof buf) {
            /* We have the decimal representation of X in buf as a string.
               There is a terminating nul char at buf[len]. */
        } else {
            /* buf was too small to hold the entire string.
               len may be much bigger than sizeof buf,
               so do not assume buf[len] can be accessed.
               A C99-conforming snprintf() still terminates
               the buffer, so buf[sizeof buf - 1] == '\0',
               but the contents are truncated. */
        }

  • dprintf() and vdprintf() (POSIX; originally a GNU extension)

    These are the printf family functions you can use to format and write a string to a raw descriptor (without the standard I/O stream abstraction represented by FILE * handles).  Typically, you clear errno to zero before calling these, to be able to differentiate between printf formatting errors and I/O errors; and I only recommend using these on files, character devices, and stream sockets, not on datagram sockets.  For datagram sockets, using asprintf()/vasprintf() and then send() on the fully constructed message is the proper way to ensure the entire message is sent correctly (since send() on a datagram socket does not return short counts in non-error situations the way write() can), and lets you, the programmer, differentiate between message formatting ("printf-related") and connection I/O issues.
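    The errno-clearing pattern described above could look like the following sketch; the function name and message are illustrative only:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <errno.h>

/* Log a counter value to an already-open descriptor.
   Returns 0 on success, an errno error code otherwise. */
static int log_counter(int fd, unsigned long counter)
{
    errno = 0;   /* so we can tell I/O errors from formatting errors */
    if (dprintf(fd, "counter = %lu\n", counter) < 0)
        return errno ? errno : EIO;  /* errno unset: a formatting error */
    return 0;
}
```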

  • strftime() (Standard C), the Swiss Army knife for formatting timestamps in a struct tm structure.

    You should use clock_gettime(CLOCK_REALTIME, &timespec) to obtain the Unix Epoch time (same as time() and gettimeofday() report) at nanosecond resolution, then convert the .tv_sec to a struct tm using localtime_r() (if you want the readable form in local time) or gmtime_r() (if you want the readable form in UTC/GMT or the closest equivalent).  The only downside with strftime() is that since struct tm does not have a field for fractions of seconds (like struct timespec and struct timeval have), you are limited to second granularity in your timestamps.

    If you have multiple 'clients' your code is connected to or services, use newlocale() to get a separate locale_t for each one, then use uselocale() before using localtime_r(). uselocale() is thread-specific, like errno, so only affects the current thread.

    If you intend your code to be localizable, i.e. messages, dates, and timestamps configurable for each locale and language via gettext(), this function is indispensable, because you only need to make the strftime() format pattern a gettext message, and those creating translations and localizations can then define the date/time format in the localization catalog for this program.  Very powerful, even if limited to one-second precision!
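    The clock_gettime() + localtime_r() + strftime() combination described above can be sketched as follows, with the millisecond part appended separately since struct tm has no sub-second field; the function name and fixed format are my own choices, not a standard API:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Format the current local time as "YYYY-MM-DD HH:MM:SS.sss".
   The buffer needs room for at least 24 chars.
   Returns 0 on success, -1 on failure. */
static int timestamp_now(char *buf, size_t size)
{
    struct timespec  ts;
    struct tm        tm;

    if (clock_gettime(CLOCK_REALTIME, &ts) == -1)
        return -1;
    if (!localtime_r(&ts.tv_sec, &tm))
        return -1;

    size_t  n = strftime(buf, size, "%Y-%m-%d %H:%M:%S", &tm);
    if (!n)
        return -1;

    /* Append milliseconds; strftime() cannot do this itself. */
    if (snprintf(buf + n, size - n, ".%03ld", ts.tv_nsec / 1000000L) < 0)
        return -1;
    return 0;
}
```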

  • asprintf() and vasprintf() (GNU, BSD)

    These functions dynamically allocate a buffer large enough to hold the result, storing the pointer to it in the location pointed to by the first parameter.  The second parameter is the familiar printf() family formatting string.  asprintf() takes additional parameters just like printf() does, and vasprintf() takes a variable argument list (see <stdarg.h>).  They return the number of chars in the buffer, not including the terminating nul byte, or negative if an error occurs.  BSD and GNU behave a bit differently if that happens: BSD resets the buffer pointer to NULL, while GNU leaves it unmodified.

    I personally warmly recommend the following pattern:
       
        char *buffer = NULL;
        errno = 0;
        int  length = asprintf(&buffer, ...);
        if (length >= 0) {
            /* No problems; buffer[length] == '\0' */
            do_something_with(buffer, length);
            free(buffer);
        } else {
            /* Error; see 'errno' for cause.
               You don't need to, but it is safe to
                   buffer = NULL;
               here; you won't leak memory. */
        }



If the <stdarg.h> "variable argument list" stuff sounds odd to you, consider the following working example snippet:
   
    #define  _POSIX_C_SOURCE  200809L
    #define  _GNU_SOURCE
    #include <stdlib.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <errno.h>
   
    static volatile int  my_logging_oom = 0;
   
    /* Log an error with printf formatting support.
       Returns 0 if success, errno error code otherwise. */
    int my_logging_function(const char *fmt, ...)
    {
        va_list  args;
        char  *message = NULL;
        int  len;
       
        va_start(args, fmt);
        len = vasprintf(&message, fmt, args);
        va_end(args);
        if (len < 0) {
            my_logging_oom = 1;
            return errno = ENOMEM;
        }
       
        somehow_log_message(message, len);
       
        free(message);
        return 0;
    }

The #defines tell the C library on Linux to expose both POSIX and GNU extensions in the header files included.  BSDs expose them by default.

The my_logging_oom variable is just a volatile flag used to record if logging ever fails due to Out Of Memory.  I'd expect other code to examine it every now and then, and report to the user if it ever becomes nonzero.

The only "trick" with this kind of variadic function is that its parameters go through default argument promotion, as described in the C standard.  Essentially, both float and double are passed as double.  Any integer type smaller than int will be converted to int if that retains the value, and to unsigned int otherwise.  Fortunately, this does not affect pointers: a pointer to a float is passed as a pointer to a float, because pointers are not subject to default argument promotion.  It doesn't affect arrays either, because they decay to a pointer to their first element, so passing the name of an array to a variadic function is the same as passing a pointer to its first element.

So, if you wanted, you definitely could implement your own printf() family of functions using <stdarg.h>.  However, as SiliconWizard mentioned, this formatting approach is not always superior to just constructing the message piece by piece, using a type-specific function call for each conversion.  Aside from ease of use, the one truly useful thing is making the printf format specification a gettext() message, so that end users can very easily translate and localize the program without recompiling the binaries or adding code.  A practical example:
Code: [Select]
#include <stdlib.h>
#include <locale.h>
#include <string.h>
#include <stdio.h>
#include <libintl.h>

#define _(msgid) (gettext(msgid))

int main(int argc, char *argv[])
{
    /* This program is locale-aware. */
    setlocale(LC_ALL, "");

    /* Let's call ourselves 'greeting', so that if you want, you can put
       a message catalog at say /usr/share/locale/<yourlocale>/LC_MESSAGES/greeting.mo
    */
    textdomain("greeting");

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        const char  *arg0 = (argc > 0 && argv && argv[0] && argv[0][0]) ? argv[0] : "(this)";

        fprintf(stderr, _("\n"
                          "Usage: %1$s [ -h | --help ]\n"
                          "       %1$s NAME1 NAME2\n"
                          "\n"
                          "This prints a localizable greeting.\n"
                          "\n"), arg0);

        return EXIT_FAILURE;
    }

    const char  *name1 = (argv[1][0]) ? argv[1] : _("(emptyname1)");
    const char  *name2 = (argv[2][0]) ? argv[2] : _("(emptyname2)");

    printf(_("Hey %1$s, %2$s sends their greetings.\n"), name1, name2);

    return EXIT_SUCCESS;
}
Note that a formatting directive of the form %N$ refers to the N'th variadic argument by position, whereas a plain % means "the next variadic argument".  You can use either form in any printf formatting string, but you cannot and must not mix the two.  As an example, %3$d formats the third variadic argument as an int, and %2$s the second variadic argument as a string.

A message catalog for say "Formal English" might replace the "Hey ..." message with say "Greetings from %2$s to %1$s.\n" .

I personally do not like using a _(msgid) macro at all, and much prefer say MESSAGE() or LOCALIZE() instead.  The reason I kept it in above is that it is a common pattern in C that confuses those who don't know about it beforehand, so I thought it a good idea to include it here as well.

If you want to play with the above, save the following as greeting.po:
Code: [Select]
msgid ""
msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid ""
"\n"
"Usage: %1$s [ -h | --help ]\n"
"       %1$s NAME1 NAME2\n"
"\n"
"This prints a localizable greeting.\n"
"\n"
msgstr ""
"\n"
"Usage: %1$s [ -h | --help ]\n"
"       %1$s NAME1 NAME2\n"
"\n"
"This prints a localizable greeting from NAME2 to NAME1.\n"
"\n"

msgid "(emptyname1)"
msgstr "(unknown person)"

msgid "(emptyname2)"
msgstr "(unknown person)"

msgid "Hey %1$s, %2$s sends their greetings.\n"
msgstr "Greetings from %2$s to %1$s.\n"
where msgid describes the exact key the program is looking for, and msgstr the replacement for this message catalog.
(If there is no message catalog, msgid is used as-is. That's why it looks a bit funky at a first glance. It's a very simple, easy to manage format, though.)

You can 'compile' the human-readable greeting.po into an efficient binary message catalog file greeting.mo using
    msgfmt greeting.po -o greeting.mo
and install that to say the en_GB locale via
    sudo install -m 0644 -o root -g root  greeting.mo  /usr/share/locale/en_GB/LC_MESSAGES/
You can then compare the program output when run with different locales:
    LC_MESSAGES=C ./greeting
    LC_MESSAGES=en_GB.utf8 ./greeting

Do check the locale -a output and the /usr/share/locale/ directory to see which locales you have available.  Many Linux distributions use the .utf8 suffix to denote UTF-8 locales, but there are alternate ways, so the above might not apply exactly as-is to yours.

Obviously, there are much better tools and even IDEs for maintaining and dealing with message catalogs; the above is just the most basic functioning example I could put together.  Interesting stuff, anyway, and perhaps important as a counterpoint to why/when one should use the standard tools for string formatting, instead of rolling one's own.
« Last Edit: July 03, 2021, 02:07:16 am by Nominal Animal »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Constructing short C strings in limited or constrained situations
« Reply #5 on: July 03, 2021, 03:13:41 am »
I really struggle to think of or even imagine any machine in which repeating a divide by ten is going to be cheaper than copying a character from one place to another. If an actual divide instruction is used that's usually going to be at least 32 clock cycles on a 32 bit machine.

Even if you use n/10 = (n*0xCCCCCCCD)>>35 that's not super cheap because you need a 32x32->64 multiplier (or MUL and MULH instructions).

It's going to be cheaper to do something like:

Code: [Select]
digits = 1;
tst = 10;
while (digits < 10 && n >= tst){
  digits++;
  tst *= 10;
}

The multiply by 10 will turn into something like either tst = (tst<<1) + (tst<<3) or tst += tst<<2;tst <<= 1 depending on the instruction set.

But, seriously, copying characters from a format-number-backwards buffer to the message-construction-buffer is super quick so there's no reason not to do it.
 
The following users thanked this post: DiTBho

Offline mazurov

  • Frequent Contributor
  • **
  • Posts: 524
  • Country: us
Re: Constructing short C strings in limited or constrained situations
« Reply #6 on: July 03, 2021, 04:02:50 am »
Hi,

I've done something similar, for an application where I'd have to print a ton of strings and some numbers. Due to that, it made sense to have everything as const char *, and instead of chars, put pointers to strings into the print queue. Consequently, I had to print numbers as strings too. Here is the code:

https://github.com/felis/ICBM/blob/master/src/console.c#L436

The point argument takes the position where a decimal point shall be printed; this is the only formatting argument. Perhaps you can mod this to insert dashes into fixed positions or something along these lines.

Enjoy,
/felis

PS: Sorry for the mess. Does anyone know how to put code pieces inline on this board?
With sufficient thrust, pigs fly just fine - RFC1925
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Constructing short C strings in limited or constrained situations
« Reply #7 on: July 03, 2021, 05:37:38 pm »
I really struggle to think of or even imagine any machine in which repeating a divide by ten is going to be cheaper than copying a character from one place to another. If an actual divide instruction is used that's usually going to be at least 32 clock cycles on a 32 bit machine.

Well, many performance-oriented CPUs will take much fewer cycles for a divide than this, but, that's right that it's usually more than 1 - and even on optimized implementations, it's rarely less than 4, so all in all, you're right regarding comparing that to copying a character, although, occasional cache misses (when there are caches) may ruin it.

Some CPUs may also implement improved divide-by-ten instructions (like the now relatively uncommon BCD instructions), but that would really be the exception rather than the norm.
 

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #8 on: July 03, 2021, 08:51:25 pm »
I really struggle to think of or even imagine any machine in which repeating a divide by ten is going to be cheaper than copying a character from one place to another. If an actual divide instruction is used that's usually going to be at least 32 clock cycles on a 32 bit machine.
Very good point!  Of course, if you don't mind a power-of-ten lookup table (10=10¹, 100=10², and so on) – 36 bytes for 32-bit – it turns into just a set of comparisons.

I think uint32_digits() is one of those rare functions better written in (extended inline) assembly on 8-bit MCUs.  On 32-bit it is trivial (a binary decision tree with nine comparisons and ten results); on 16-bit it is two decision trees with a bit of overlap; so those need no hand-holding, and even simple C code should compile to pretty efficient machine code.

I got sucked down a rabbit hole trying to find the minimum-depth decision tree (ternary tree, with < = > leaves); even wrote some Python3 code for formal analysis (with a tunable cost model) for arithmetic comparison rule decision trees... :palm:
I get so excited when I get to chew on a type of problem I haven't chewed on in a while, you see.
On 8-bit MCUs, 32-bit integers span four bytes; the values we need to compare against are
    1000000000  =  3b 9a ca 00   = 10⁹; at and above, the result is 10
     100000000  =  05 f5 e1 00   = 10⁸; at and above, the result is 9
      10000000  =  00 98 96 80   = 10⁷; at and above, the result is 8
       1000000  =  00 0f 42 40   = 10⁶; at and above, the result is 7
        100000  =  00 01 86 a0   = 10⁵; at and above, the result is 6
         10000  =  00 00 27 10   = 10⁴; at and above, the result is 5
          1000  =  00 00 03 e8   = 10³; at and above, the result is 4
           100  =  00 00 00 64   = 10²; at and above, the result is 3
            10  =  00 00 00 0a   = 10¹; at and above, the result is 2
with the result being 1 otherwise.
If the most significant byte is 3c..ff, the result is always 10.
If the most significant byte is 3b, the result is either 9 or 10, depending on the three other bytes.
If the most significant byte is 06..3a, the result is 9.
If the most significant byte is 05, the result is either 8 or 9 depending on the three other bytes.
If the most significant byte is 01..04, the result is 8.
Otherwise, the most significant byte must be zero, and the result depends on the three other bytes.
For the most significant byte, the decision tree involves comparing against 3b, 05, and 00.
So, one of the most efficient approaches is to first compare the most significant byte to 05, and cascade from there on down in a ternary tree (with 'less', 'equal', and 'greater' edges from each node, each node comparing one of the four bytes to a constant).

The actual number of byte comparisons needed to find any single answer is surprisingly small.  Code that does the comparisons in linear order will be almost as efficient, depending somewhat on the instruction set.  For the most efficient approaches, the code ends up being write-only: nobody will ever debug or modify it, only rewrite if needed.
I don't mind, because I personally test all 32-bit conversion functions with all possible inputs, and for larger input spaces, I apply some sequential unit testing over any regions where the algorithm changes (like arguments between 9999..99999 above, because they span both 16- and 24-bit arguments), plus randomized testing over the entire input range.  So I know what I trust or not, but cow-orkers might balk.  But that's why I like writing low-level utility functions: once implemented this way, they are efficient and trustworthy.

But, seriously, copying characters from a format-number-backwards buffer to the message-construction-buffer is super quick so there's no reason not to do it.
Yes, and since I suspected there is much more experience behind that statement than one might think, I went looking.  And you're absolutely right.

One use case I used that approach with was when I needed to emit the digits in left-to-right order to some device, say a UART, or parallel GPIO (stealing the high bit as a clock/write strobe, so two writes per byte), but didn't want to keep a buffer in the first place (because it was a quick-and-dirty data stream, emitted from an interrupt context).  On AVR, conversion uses so many registers (according to code emitted by gcc and llvm/clang) that just using a buffer, even one on the stack, uses less RAM.  So the approach is counterproductive there; a buffer is more efficient even memory-use-wise.

On 32-bit architectures like ARMs, spilling three registers onto the stack takes up the memory you need to buffer any 32-bit signed or unsigned integer.  If the code that generates the string without a RAM buffer uses three registers that otherwise would not have needed to be saved and restored, one just does extra work without any gain, paying the same RAM overhead by clobbering more registers than the buffer would have taken.  Since pointers etc. fit in single registers, the conversion-to-buffer code is particularly simple.

On AMD64 and 64-bit arches, the buffer is basically free.  You can do register-based conversion there, holding up to 8 ASCII characters in a single register, because of the sheer size of the registers, but using even two extra 64-bit registers that otherwise would not have been clobbered (and therefore not saved and restored), takes up more memory than a typical 32-bit integer conversion string buffer.  For 64-bit integers, the buffer needs less RAM than storing and restoring three registers.

So, it isn't even that "this is simpler and faster", but more like "If you check, you'll almost always find you didn't save any memory by doing it the slow, complicated way".

In very memory-constrained situations, using BCD would halve the buffer size needed, and given a suitably aligned, easily accessible 16-entry lookup table, one can stuff non-digit characters like implicit or explicit padding, decimal points, thousands separators, and plus or minus signs in them very easily.

(Those who are unfamiliar: In Binary Coded Decimal, byte value v consists of two decimal digits a=0..9 and b=0..9 via v=(a≪4)+b, where ≪ is binary shift left.  The ASCII character corresponding to b is (v&15)+48, and the one corresponding to a is (v≫4)+48.  a is a valid decimal digit if v is less than 160.  If you use (((v≫4)+6)&15)+42 and ((v+6)&15)+42, then nibble values 10-15 encode the ASCII characters * + , - . /.  Yet another reference to the good ol' forty-two.  A lot of old tricks used with BCD have been forgotten.)

Anyone know how to put code pieces inline on this board?
I use [tt]..[/tt] for the teletype look.  The forum software retains newlines inserted in the editor so for example
Code: [Select]
[tt]    foo = 2[sup]5[/sup][/tt]
[tt]    bar = 11100100[sub]2[/sub][/tt]
renders nicely as
    foo = 2⁵
    bar = 11100100₂
 
The following users thanked this post: DiTBho

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: Constructing short C strings in limited or constrained situations
« Reply #9 on: July 03, 2021, 11:28:18 pm »
(little OT: I am implementing a little parser in php; given a text with forum-like keywords, it generates html code. I think I will include the sub/sup/tt keywords. Nice trick! Thank you even for this  :D )
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Constructing short C strings in limited or constrained situations
« Reply #10 on: July 04, 2021, 12:42:54 am »
I really struggle to think of or even imagine any machine in which repeating a divide by ten is going to be cheaper than copying a character from one place to another. If an actual divide instruction is used that's usually going to be at least 32 clock cycles on a 32 bit machine.

Well, many performance-oriented CPUs will take much fewer cycles for a divide than this, but, that's right that it's usually more than 1 - and even on optimized implementations, it's rarely less than 4

What CPU can do a 32/32 divide in 4 clock cycles?

With a significant amount of effort and hardware you can get up to 3 bits of result each clock cycle.

ARM A72 (and I think A75 the same) has a worst-case of 12 clock cycles for a 32 bit divide and 20 clock cycles for 64 bit.

Intel Skylake needs up to 95 cycles for a 64-bit divide, and AMD Ryzen 47.  Apparently Cannon Lake is 18 cycles, which is a lot better than Skylake (hardly a low-performance CPU!) but still a heck of a lot more than a load and store in L1 cache.

In microcontroller cores, Cortex M0 doesn't have a divide instruction at all. M3 takes up to 12 clock cycles for a 32 bit divide.

Most implementations can be a lot faster than these maximums with the right inputs, for example the ARM cores with 12 cycle maximum have 2 cycle minimum. The minimum happens when the two inputs are about the same size so the answer is 0 or 1 or maybe a little bigger. Dividing a big number by 10 is not a fast case.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Constructing short C strings in limited or constrained situations
« Reply #11 on: July 04, 2021, 01:01:59 am »
I think uint32_digits() is one of those rare functions better written in (extended inline) assembly on 8-bit MCUs.  On 32-bit it is trivial (binary decision tree with nine comparisons and ten results); and on 16-bit is two decision trees with a bit of overlap; so those need no hand-holding, and even simple C code should compile to pretty efficient machine code.

I'm not sure this is the best idea.

- a binary decision tree with nine comparisons is the same amount of code as straight line code with the same nine comparisons, but has more taken conditional branches which tend to take longer than falling through. So three or four comparisons probably isn't a lot faster than nine.

- you need to think about the distribution of input values. If the values are randomly chosen from a uniform distribution (e.g. standard rand() function) then 77% of them will be over 999,999,999 so your decision tree will probably be faster than a linear search starting from 1 digit and working up. But it will be slower than a linear search starting from 10 digits :-)  But this is probably not usual. If the inputs follow a log distribution with equal numbers of inputs for each number of digits then the linear search will need 5 comparisons on average, which is barely more than the tree. But in real programs I'd expect most numbers to be under 1000, and maybe under 100.


The byte-by-byte decision tree makes much more sense on an 8 bit processor. But you might minimise total size of code+data by using a table-driven approach, at very little cost in speed.
 

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #12 on: July 04, 2021, 05:05:30 pm »
generates html code [..] include the sub/sup/tt keywords.
Consider extending the support to all (or your preferred subset) of the HTML5 text-level elements, specifically the i b u s em strong small code kbd var sub sup cite q elements, with their visual styling defined in the CSS stylesheet.  Or you can allow these elements unchanged in the text (as HTML elements), since their effect will be limited to the block element they're in (and an error, say a missing closing element, should not affect visuals or layout beyond that particular block).  (You can define their layout and flow as block-level elements, so that e.g. [code]...[/code] generates <code class="blockcode">...</code> but [tt]..[/tt] generates <code class="inlinecode">...</code>; the former laid out as a paragraph, and the latter as an inline code snippet.  Similarly, if you want to implement [pre]..[/pre], just reuse the HTML code element with a style class that has the white-space: pre; attribute.)

In my experience, storing content snippets converted to HTML, except for the highest-level constructs (those that need per-client dynamic processing at runtime to map to HTML, say the time right now as opposed to when submitted/stored/updated), is the best option.  On the server side, there are many libraries (including Tidy) you can use to check that your markup-to-HTML munger generates correct code at submit/preview time (and will complain to the user when needed).  Finally, you can use HTML5 comments with a fixed internal form, say <!--{ high-level-construct-information }-->, to describe the high-level constructs within the now-HTML server-side stored snippets, with a simple post-processor that rips all HTML5 comments out and injects any dynamically generated HTML for those high-level constructs.  Do note that you should NOT use such constructs for private or confidential data, because at some point you'll have a bug that leaks those comments unformatted to the user.

(Why then "hide" the constructs?  It's not to "hide" them per se, just to make them easily detectable while passing Tidy validation.  Any special character sequence like <({[ high-level-construct-information ]})> either gets Tidy to complain, or can occur in user-submitted text.  Tidy ignores the comment format, and you already convert any HTML-like markup, so anything looking like an HTML comment gets converted into HTML text, i.e. <!-- into &lt;!-- , so HTML comments are the only thing that works well here without additional constraints.  The fact that if (when!) these leak through to the user they will not be visible except in the source is actually an unwanted side effect: now you need to check the generated HTML not just for visual correctness, but also the source for comments leaking high-level constructs.)

I'm not sure this is the best idea.

- a binary decision tree with nine comparisons is the same amount of code as straight line code with the same nine comparisons, but has more taken conditional branches which tend to take longer than falling through. So three or four comparisons probably isn't a lot faster than nine.
Right.  If the architecture has any kind of branch prediction or speculative execution features, it cannot be; any jumps will cost more than the arithmetic operators. Generally, on 32-bit (and wider) architectures
   
    unsigned char  u32_digits(const uint32_t  value) {
        return (value >= UINT32_C(1000000000))
             + (value >= UINT32_C(100000000))
             + (value >= UINT32_C(10000000))
             + (value >= UINT32_C(1000000))
             + (value >= UINT32_C(100000))
             + (value >= UINT32_C(10000))
             + (value >= UINT32_C(1000))
             + (value >= UINT32_C(100))
             + (value >= UINT32_C(10))
             + 1;
    }
   
tends to generate pretty efficient code across all compilers.  If one takes a look at the code e.g. Clang-10 generates on AMD64, it is fun-ky indeed.
Depending on the instruction set, at least clang-10 on 32-bit arches may generate better code for the exactly equivalent function written as
   
    unsigned char  u32_digits(const uint32_t  value) {
        return 10 - (value < UINT32_C(1000000000))
                  - (value < UINT32_C(100000000))
                  - (value < UINT32_C(10000000))
                  - (value < UINT32_C(1000000))
                  - (value < UINT32_C(100000))
                  - (value < UINT32_C(10000))
                  - (value < UINT32_C(1000))
                  - (value < UINT32_C(100))
                  - (value < UINT32_C(10));
    }
     
since this one seems to help clang-10 use the "subtract with borrow" machine instruction, instead of "load carry to register, then add two registers" idiom it uses for the former.

And, as brucehoult suggested earlier (slightly altered to match the above functions), writing the function as
   
    unsigned char  u32_digits(const uint32_t  value) {
        unsigned char  result = 1;
        uint32_t       limit = 10;
        while (result < 10 && value >= limit) {
            result++;
            limit *= 10;
        }
        return result;
    }
   
generates acceptable code with the least silliness across all compilers I tested: avr-gcc 5.4.0, gcc-7.5.0, and clang-10; all with -O2 and -Os, on 8-bit (AVR5, for ATmega32u4), 32-bit (x86), and 64-bit (AMD64 aka x86_64).  The code generated may be less than optimal, but it should never turn into crazy-silly machine code in any situation.

(In case anyone wonders, yes, I did verify that all above versions and radix_digits(v, 10) yield identical results over all possible inputs.  It only takes a minute to verify via brute force comparison of all inputs across the three variants on an x86-64 laptop with Core i5-7200U, after all.)

Using this simple form as suggested by brucehoult is, in my opinion, also a perfect example of why it makes sense to think of the related standards as recording compatible practices instead of describing or dictating correctness.  If you subscribe to the former view like I do, you'll intuitively understand that C compilers are "more practical" when given commonly used code patterns; and if given "tricky code", they can generate oodles of non-practical machine code that is nevertheless perfectly compliant with those standards.

I also can't help but point out this is the perfect case where a descriptive macro (say, #define  UINT32_DIGITS  10) would make a big positive difference, as it can be hard to keep track of what each "10" in that function actually means.  The while loop clause should really read while (result < UINT32_DIGITS && value >= limit); the other two tens are the radix (also known as the base).  If you change the radix, the maximum digit count must be updated to match as well.  (Mathematically, we can describe the function as \$1 + \lfloor \log_{10} v\rfloor\$.)

Since the compilers I used here all support the __builtin_mul_overflow() integer overflow built-in, I personally prefer the "generic" form, mathematically \$1 + \lfloor \log_r v \rfloor\$, for any unsigned integer arithmetic type UTYPE via e.g.
   
    __attribute__((const, leaf, nothrow, pure))
    static inline unsigned char  radix_digits(const UTYPE  value, const UTYPE  radix) {
        unsigned char  result = 0;
        UTYPE          limit = radix;
        while (value >= limit) {
            ++result;
            if (__builtin_mul_overflow(radix, limit, &limit))
                break;
        }
        return result;
    }
   
noting that this is a pattern that optimizes well when inlined with a radix the compiler recognizes as a compile-time constant.  For example, Clang-10 tends to turn the loop into a sequence of comparisons like earlier, but with radix-specific constants.  Because it detects the unsigned type limit from the overflow, it does not need to know the maximum number of digits in that radix beforehand, either.

Finally, those nonstandard function attributes tell at least GCC and Clang (which may warn about not recognizing 'leaf', but will not consider it an error) that if they care to, they could treat expressions like radix_digits(~(uint32_t)0, 10) as constants.  Currently, they do not treat them as compile-time constants in the sense of the C standard, even though Clang-10 with -O2 or -Os compiles n = radix_digits(~(uint32_t)0, 10); literally into n = 10; at the machine code level; no computation of any sort.  (And it is not just a special case; n7 = radix_digits(~(uint32_t)0, 7); compiles to n7 = 12; too, telling us unsigned 32-bit integers have up to 12 digits in base 7.)

Right now, my preference in the general case (code not specific to some hardware or word or register width) is the radix_digits() form, hands down.  Of course, the integer overflow built-ins have not yet been standardized, so whether one can use this or not depends on one's compiler.  Always a tradeoff of some sort!

The byte-by-byte decision tree makes much more sense on an 8 bit processor. But you might minimise total size of code+data by using a table-driven approach, at very little cost in speed.
Perhaps, but getting the approach expressed in C in such a form that one never gets silly machine code (for 8 bit processors) is hard.  I definitely haven't found a way to do that yet.

My decision tree approach did tell me that on an 8-bit architecture with cheap full ternary branching (a single comparison of a register against a constant, followed by any two conditional branches among <, <=, ==, >=, and >, with the third case falling through), the code needs only 24 comparisons between a register and a constant in total; and to arrive at the answer, at most six comparisons (and six branches, with at most one unconditional jump) are needed.  The graph seems correct, but I haven't verified this in practice yet.
« Last Edit: July 04, 2021, 05:21:39 pm by Nominal Animal »
 
The following users thanked this post: DiTBho

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Constructing short C strings in limited or constrained situations
« Reply #13 on: July 04, 2021, 05:31:42 pm »
I really struggle to think of or even imagine any machine in which repeating a divide by ten is going to be cheaper than copying a character from one place to another. If an actual divide instruction is used that's usually going to be at least 32 clock cycles on a 32 bit machine.

Well, many performance-oriented CPUs will take much fewer cycles for a divide than this, but, that's right that it's usually more than 1 - and even on optimized implementations, it's rarely less than 4

What CPU can do a 32/32 divide in 4 clock cycles?
(...)

Yeah well, as you mentioned we can have a typical average of 12-13 cycles on at least Intel CPUs. Worst case can be more than this, and IIRC, 64-bit divides can be up to 40 or 50 cycles.
That's for a single divide.

But for dual-issue CPUs with OoO exec and proper "pre-scheduling" done by the compiler, the actual number of cycles an integer divide takes, averaged over a typical loop execution, can be much less than this; I've seen benchmarks that showed something in the order of 4 cycles on average, hence what I remembered here.  My memory may have been a bit optimistic, but that's what I remembered.  Now, yes, those are best cases for nicely scheduled code, when the code itself lends itself to this.

But anyway, exact number of cycles is not that relevant here IMHO at least on complex modern architectures. And that was what I meant to say. For one because the exact number of cycles heavily depends on context, and as I mentioned, for memory access, that's the same. Due to caches, a given execution could take more time copying data than issuing a divide. Deciding on one or the other to improve performance would be pretty much trying to micro-optimize.

Of course, on simple CPUs such as small MCUs, there is no contest, and this kind of consideration is a lot more relevant.
 
The following users thanked this post: DiTBho

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #14 on: July 04, 2021, 09:32:11 pm »
But anyway, exact number of cycles is not that relevant here IMHO at least on complex modern architectures. And that was what I meant to say. For one because the exact number of cycles heavily depends on context, and as I mentioned, for memory access, that's the same. Due to caches, a given execution could take more time copying data than issuing a divide. Deciding on one or the other to improve performance would be pretty much trying to micro-optimize.
Well put.

For those learning software development in any language, this also means that no matter how much you optimize an operation, or prove in a microbenchmark that it is more efficient than anything else, until you use it in practice on a real workload, you don't know how the combined work behaves.  And what matters is the combined work: without Superman vision, we can't see inside the combined work to see how the parts are balanced against each other.  End users, even those with Superman vision, do not care, because what matters is how well the tool works for the job.  It is the whole that matters, not its individual parts in isolation.

This is because the operations are not singletons in a vacuum: the way an operation uses available resources (memory cache in particular) affects any operations that are done in parallel or immediately afterwards.  Sure, your code may be better than anything else on the planet, but if it causes all other operations done at the same time or immediately afterwards to crawl to a halt, then using your code for that operation can make the overall program take longer to run.

(Radix sorting is the example I usually reach for here.  With data having fixed-size keys, it is the optimum sorting method, since it can do the job in O(N) time complexity.  However, it has a particularly hard to predict memory access pattern, and needs quite a lot of temporary storage, so it basically evicts most of your local cache.  Its "constant factors", or the setup cost and the cost of computing per element, are also so high that you need lots of elements before the linear scaling with high setup and unit costs beats worse scaling with small setup and unit costs.  Indeed, it is common to use O(N²) time complexity algorithms for the smallest arrays, because their setup and unit costs are so low it makes sense.  Mathematically, \$C x^2 + D x + E < A x + B\$ holds only for a bounded range of values \$x\$, given \$C > 0\$.  The last time I checked, for IEEE 754 Binary64 (double in most C implementations), I needed an array with dozens to hundreds of millions of values before radix sort could sort uniformly random data faster than the well-known O(N log N) alternatives.  Radix sort run time does not vary much depending on the element order prior to sorting, but some sorting algorithms with O(N²) time complexity (and possibly some with O(N log N) time complexity) effectively skip already sorted data, so unless you know the data is randomly ordered and you have lots and lots and lots of it, radix sort isn't the tool you should reach for, despite its clearly superior time complexity.  The trick is to be smart lazy, and avoid doing unnecessary work in the first place.)
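To make the "constant factors" concrete, here is a minimal LSD radix sort sketch for 32-bit keys (my own illustration, not a tuned implementation); note the N-element temporary buffer and the per-pass histogram, which are exactly the setup and cache costs discussed above:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Minimal LSD radix sort sketch for uint32_t keys: four counting
   passes, one byte of the key per pass.  O(N) scaling, but each pass
   builds a 256-bucket histogram and scatters into an N-element
   temporary buffer, touching memory in a hard-to-predict order. */
static void radix_sort_u32(uint32_t *data, uint32_t *temp, size_t n)
{
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};

        /* Histogram of this byte, offset by one for the prefix sum. */
        for (size_t i = 0; i < n; i++)
            count[((data[i] >> shift) & 255u) + 1]++;

        /* Prefix sum turns counts into starting offsets per bucket. */
        for (size_t b = 0; b < 256; b++)
            count[b + 1] += count[b];

        /* Stable scatter into the temporary buffer. */
        for (size_t i = 0; i < n; i++)
            temp[count[(data[i] >> shift) & 255u]++] = data[i];

        memcpy(data, temp, n * sizeof data[0]);
    }
}
```

Even in this toy form, the scatter loop makes essentially random writes across up to 256 cache lines per pass, which is where the cache eviction comes from.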

In practice, that is also why for interactive programs, we strive to keep the latency to a minimum, at the cost of possibly using a bit more CPU time.  Humans don't really care if the operation takes 3.0 or 3.1 seconds, but if their tool is frozen and unresponsive for most of that time, they will be unhappy.  Even if you were to shave a quarter off an operation that takes a minute to run – so it runs in 45 seconds instead of 60 seconds –, if it means the user interface becomes unusable or even choppy, the users won't be happy.  The users need the device/application to respond immediately, even if it is just to tell the user that "yes, will comply; but it will take a while until I can provide you with the results, so please wait".

On the other hand, in systems programming, background housekeeping tasks try to be as unintrusive as possible.  Things like YouTube playback getting choppy just because your Helpful Service decided that 3AM is always a good time to do housekeeping work, spending quite a lot of CPU time and memory to do so, lead to angry users.  Good sysadmins and software package maintainers know how to use tools like nice and ionice to make such policy decisions, but it is up to us systems programmers to make sure our tools respect such policy decisions – and better yet, provide more fine-grained control over the details if and when necessary.  Such services don't care much when they get to do their work, as long as it gets done eventually; and doing the work with the least disruption to human-interactive stuff is way more important than saving some CPU cycles.

Simply put, "best" has no meaning beyond that which you decide to give it.  "Fast" can mean either "low latency" or "short run time".  Beginners do not realize there is a distinction, and if enthusiastic, often get caught up optimizing some small detail that either does not affect the overall task at hand, or even actively hinders it.  "Premature optimization" is a common pit to accidentally fall into: we concentrate on a detail, when we should worry about the whole.  We should always look at the whole first, see if there is work or computation we are doing that we don't actually need to do, and whether the algorithm we are using is suitable for the dataset at hand.  Proper benchmarking involves running actual tasks and measuring their cost to the human running them, in human terms; microbenchmarking is just a basic indicator showing either "this looks promising" or "this does not look promising at all", but not much beyond that.
« Last Edit: July 04, 2021, 09:36:32 pm by Nominal Animal »
 
The following users thanked this post: DiTBho

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Constructing short C strings in limited or constrained situations
« Reply #15 on: July 04, 2021, 11:04:32 pm »
Due to caches, a given execution could take more time copying data than issuing a divide.

For a character at some random address that you don't know the history of, sure.

For a character being read from a 10 byte buffer on the stack where you just a few instructions ago wrote it while converting it from an integer to a string? No. If there is a cache, it *will* be in it.

Assuming you haven't taken an interrupt / been task switched / been migrated to a different CPU core / been put into hibernate etc etc in the meantime.
 

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #16 on: July 05, 2021, 09:34:47 pm »
For a character being read from a 10 byte buffer on the stack where you just a few instructions ago wrote it while converting it from an integer to a string? No. If there is a cache, it *will* be in it.
You have no idea how hard I'm trying not to use that as the perfect spring-board into introducing the concepts of locality of reference and efficient cache use patterns to anyone still reading the posts in this thread but not yet familiar with them.  They are crazy-useful to programmers on architectures that do have caches, both at the low-level end and at the high-level end (algorithm design/choice/implementation) in high-performance computing.  Nnnngh.

(I won't; that'd veer too far from the subject even for me.)
« Last Edit: July 05, 2021, 09:38:52 pm by Nominal Animal »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Constructing short C strings in limited or constrained situations
« Reply #17 on: July 05, 2021, 10:32:20 pm »
I always wonder why it's not popular to implement "custom" versions of sprintf() that behave the way a particular application needs, but doesn't include the features not needed...  I mean, there's a well-established framework for implementing it (though varargs can be expensive), and it's likely to be immediately recognized by anyone reading the code.
Code: [Select]
    sprintfFoo(output, "%d:%02d:%02d:%03d", hours, minute, seconds, millis);
Division too expensive?  Only implement hex...  etc.
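As a sketch of that idea (sprintfFoo is the post's hypothetical name, not a real library function), a formatter restricted to literal characters and %d with an optional zero-padded width can be quite small:

```c
#include <stdarg.h>

/* Minimal custom formatter: supports only '%d' with an optional
   zero-padded width (e.g. "%02d"), plus literal characters.  Returns
   the number of characters written, excluding the terminating NUL.
   Like sprintf(), no bounds checking -- a sketch, not a product. */
static int sprintfFoo(char *out, const char *fmt, ...)
{
    va_list ap;
    char *p = out;

    va_start(ap, fmt);
    while (*fmt) {
        if (*fmt != '%') { *p++ = *fmt++; continue; }
        fmt++;                            /* skip '%' */
        int  width = 0;
        char pad = (*fmt == '0') ? '0' : ' ';
        while (*fmt >= '0' && *fmt <= '9')
            width = width * 10 + (*fmt++ - '0');
        if (*fmt == 'd') {
            fmt++;
            int v = va_arg(ap, int);
            char digits[12];
            int  n = 0;
            /* Emit sign first; negate via -(v+1)+1 to avoid overflow
               on the most negative int. */
            unsigned u = (v < 0) ? (*p++ = '-', (unsigned)-(v + 1) + 1u)
                                 : (unsigned)v;
            do { digits[n++] = (char)('0' + u % 10u); u /= 10u; } while (u);
            while (width-- > n)
                *p++ = pad;
            while (n)
                *p++ = digits[--n];
        }
    }
    *p = '\0';
    va_end(ap);
    return (int)(p - out);
}
```

Dropping everything but the conversions you actually use is exactly where the code-size win comes from.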
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Constructing short C strings in limited or constrained situations
« Reply #18 on: July 06, 2021, 01:16:53 am »
I always wonder why it's not popular to implement "custom" versions of sprintf() that behave the way a particular application needs, but doesn't include the features not needed...  I mean, there's a well-established framework for implementing it (though varargs can be expensive), and it's likely to be immediately recognized by anyone reading the code.
Code: [Select]
    sprintfFoo(output, "%d:%02d:%02d:%03d", hours, minute, seconds, millis);
Division too expensive?  Only implement hex...  etc.

If you only need to format characters, strings, and register-sized integers then printf() can be pretty small. Supporting field widths and padding options adds a little bit, but not bad. It's the floating point capability that bloats things, whether you're using it or not, especially if you don't have floating point hardware. And integers bigger than your registers.

Newlib Nano has a cut-down printf. Most embedded toolchains use Newlib, and you enable Nano by adding "--specs=nano.specs" to CFLAGS and LDFLAGS.
 

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #19 on: July 06, 2021, 06:07:09 am »
Parsing the formatting string is extra work that could be done at compile time, but in C and C++ has to be done at run time when using formatting strings.

In hosted environments, at least GCC and Clang turn printf("no formatting") to fputs("no formatting", stdout), but that's about it.

There is no reason why a compiler could not convert say
    int len = snprintf(buf, sizeof buf, "%02d:%02d:%02d.%03d", hh, mm, ss, ttt);
into say
    int len;
    {
        struct snprintf_context  ctx;
        ctx.buf = buf;
        ctx.maxlen = sizeof buf - 1;
        ctx.len = 0;
        snprintf_ctx_int(ctx, hh, 2);
        snprintf_ctx_char(ctx, ':');
        snprintf_ctx_int(ctx, mm, 2);
        snprintf_ctx_char(ctx, ':');
        snprintf_ctx_int(ctx, ss, 2);
        snprintf_ctx_char(ctx, '.');
        snprintf_ctx_int(ctx, ttt, 3);
        len = snprintf_ctx_terminate(ctx);
    }
since the format specification is no longer a detail internal to the C library, but is actively analyzed and parsed by the C compiler already; see e.g. the GCC format function attribute, which is also supported by LLVM/Clang.  Not only are the printf formatting patterns known and parsed to match specific parameters already (albeit just so that the compiler can verify that the parameter types match the specification string), but so are the patterns used for scanf and strftime.

(Well, there is one reason, which is that it does not match the abstract state machine used by the C standard lawyers, and cannot be mapped to one in a way that wouldn't leave one's mind reeling from the complexity.  But I've ranted enough about those already.)
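For illustration only: the snprintf_ctx_* helpers in the expansion above are hypothetical names, not any real API, but a minimal sketch of them (taking a pointer rather than the context by value, so the calls can update it) might look like:

```c
#include <stddef.h>

/* Sketch of the hypothetical snprintf_ctx_* helpers; the names follow
   the speculative compiler expansion above and are not a real API. */
struct snprintf_context {
    char   *buf;
    size_t  maxlen;   /* room for characters, excluding the NUL */
    size_t  len;
};

static void snprintf_ctx_char(struct snprintf_context *ctx, char c)
{
    if (ctx->len < ctx->maxlen)
        ctx->buf[ctx->len] = c;
    ctx->len++;       /* like snprintf(), count even what didn't fit */
}

static void snprintf_ctx_int(struct snprintf_context *ctx, int v, int width)
{
    char digits[12];
    int  n = 0;
    /* Sign first; negate via -(v+1)+1 to stay safe on the most
       negative int. */
    unsigned u = (v < 0)
               ? (snprintf_ctx_char(ctx, '-'), (unsigned)-(v + 1) + 1u)
               : (unsigned)v;

    do { digits[n++] = (char)('0' + u % 10u); u /= 10u; } while (u);
    while (width-- > n)
        snprintf_ctx_char(ctx, '0');
    while (n)
        snprintf_ctx_char(ctx, digits[--n]);
}

static int snprintf_ctx_terminate(struct snprintf_context *ctx)
{
    ctx->buf[(ctx->len < ctx->maxlen) ? ctx->len : ctx->maxlen] = '\0';
    return (int)ctx->len;
}
```

The point is that each helper is trivially inlinable, so a compiler expanding a constant format string this way would pay nothing for the format parser at run time.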

What we lack, is a way to tell the compiler which functions to call and with what parameters, because those definitely are implementation details we do not want to hardcode.
Sort of a library-interface-to-function mapping, which is currently a deeply secretive internal contract between the C compiler and its standard library counterpart.

Indeed, GCC can use its own built-in implementation of snprintf() unless told otherwise, which can surprise those who expect their interposed snprintf() to be used instead, when it isn't.

(To try and use the GCC version instead of your standard library version, see if __builtin_snprintf() is available; if it is, it is GCC's own implementation of snprintf().  Which can, depending on the version, fall back to the library version.  Who knows? We end users are not privy to such privileged information.)

It would be sooo nice if we had a way to specify formatting string formats for the compiler to map to individual calls, and to specify whether the order is left-to-right or right-to-left.  Consider it an extra-powerful preprocessing step.  In the case of a runtime formatting string, you'd get those functions called from a runtime parser the compiler would provide (since in the compile-time constant version, it would replace exactly that with the fixed sequence of conversion function calls instead).  Right now, the Linux kernel internal logging functions use extremely wonky formatting string formats, because they "need" to stay compatible with the C printf() formatting strings so the compiler can keep checking that the format matches the arguments for literal constant formatting strings.  They really could use a better minispec there.
« Last Edit: July 06, 2021, 06:08:44 am by Nominal Animal »
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Constructing short C strings in limited or constrained situations
« Reply #20 on: July 06, 2021, 04:54:43 pm »
Parsing the formatting string is extra work that could be done at compile time, but in C and C++ has to be done at run time when using formatting strings.

When using the standard functions as they are defined, yes, but I guess it should be possible to implement something fully static in C++ with templates?  I'm frankly not skilled enough with C++ templates, especially with recent standard versions, to do this myself, but I'm guessing this should be possible?

Modern compilers actually do static analysis of format strings while compiling when using the std functions, and emit warnings when they see that the additional parameters do not match the formats. So, in theory, they would be perfectly capable of transforming a xxprintf() call, for instance, into a fully static implementation using the format string, at compile time. Of course that would work only if the format string is a constant (which IMO covers the vast majority of cases.) But those functions actually allow passing any string, even one built at run-time, as a format string, so in that case, the compiler would still call the full xxprintf() functions from the C library. (Note that using dynamic string formats is very slippery and often considered bad practice. I admit I've used this occasionally though, as it can be pretty powerful, but it's an excellent way of shooting yourself in the foot.)
 

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #21 on: July 06, 2021, 11:32:05 pm »
I personally dislike both method chaining (s.append("Hello, ").append("World").append("!")) and operator overload (I absolutely detest cout << "Hello, " << "World" << "!") for building or emitting strings, but I think my reasons are personal rather than technical.

(Note that using dynamic string formats is very slippery and often considered bad practice.
Except when internationalization (i18n) is involved: then, dynamic string formats obtained from a message catalog via gettext()/catgets()/LoadString() are definitely the recommended pattern.

(If one ever finds oneself wondering what the heck a formatting specification like "%3$d" means, it is the positional variant of %d, needed for formatted strings to be easily internationalized.  Specifically, it means "the third variadic argument, converted using %d".  We're not allowed to mix sequential and positional variants, though: "%s%2$d" is not allowed.)
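A small example of the positional conversions in action (the n$ forms are a POSIX extension, not ISO C, so not every libc supports them; the helper function and its messages are just illustration):

```c
#include <stdio.h>
#include <stddef.h>

/* POSIX positional conversions let the format string reorder the
   arguments, which is exactly what translated message catalogs need:
   the argument list stays fixed, only the catalog string changes. */
static void demo_positional(char *en, size_t ensz, char *xx, size_t xxsz)
{
    /* English word order: "3 files" */
    snprintf(en, ensz, "%1$d %2$s", 3, "files");
    /* A translation might need the noun first: "files: 3" */
    snprintf(xx, xxsz, "%2$s: %1$d", 3, "files");
}
```

The same snprintf() call sites would keep working no matter which catalog string the user's locale selects.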

This can be useful even in embedded devices, where for example the displayed strings are localized and localizable: consider one with run-time selectable languages, without knowing beforehand which or how many languages you want to support.

Contrasting to formatting methods used in other programming languages, and examining how the formatting string parsing could be done more efficiently at runtime, it is obvious that the one C uses right now is not particularly good at all.

Let me describe a variant that I've puttered with a bit.  This ought to be on-topic, because I do intend it to be usable even in very limited and constrained situations, with dynamic memory allocation in particular completely optional.  While implementable in C and C++ right now without additional compiler support, this is only speculative, not a description of anything I've fully explored in practice yet.

Escape sequences

A major limitation in C string literals (including formatting strings) is that \\ is used to escape any backslash characters, \.  Because of this, and this only, string literals are computationally expensive to de-escape right-to-left.

The trivial, and surprisingly effective, fix is to deprecate \\ in favour of \/.  Similarly, in formatting strings, all starting and/or terminating characters surrounding a formatter should be escaped by replacement instead of prefixing with a backslash.  I will use { and } to bracket formatting string expressions, and in formatting strings such literals are escaped as (for example) \( and \) .

This does not look at all important, but if you write some C or C++ code that implements this escape scheme, you'll find that with this, you can build formatted strings either forwards or backwards; there is no ambiguity.  Try to implement backwards scanning for the current C string escape sequences and formatting patterns (printf, scanf, strftime, strptime), then contrast with this scheme, and the difference is HUGE.
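To make the difference concrete, here is a sketch (my own illustration of the proposed scheme, using \/ for a literal backslash and \( \) for literal braces) of a right-to-left de-escaper.  Because a backslash can never be the second character of an escape pair, one glance at the preceding character classifies each position; with the usual \\ scheme you would have to count the whole run of preceding backslashes to get the parity right:

```c
#include <stddef.h>

/* Right-to-left de-escaping under the proposed scheme:
   "\/" -> '\\', "\(" -> '{', "\)" -> '}'.
   Fills out[] from the end and returns the number of characters
   produced; the result occupies out[outsz - result .. outsz - 1]. */
static size_t unescape_reverse(const char *s, size_t len,
                               char *out, size_t outsz)
{
    size_t o = outsz;
    size_t i = len;

    while (i > 0 && o > 0) {
        char c = s[--i];
        /* One backward glance fully classifies s[i]: a backslash is
           always the first character of an escape pair, never the
           second, so no run-counting is needed. */
        if (i > 0 && s[i - 1] == '\\') {
            i--;                  /* consume the escape prefix */
            if (c == '/') c = '\\';
            else if (c == '(') c = '{';
            else if (c == ')') c = '}';
        }
        out[--o] = c;
    }
    return outsz - o;
}
```

Writing the equivalent for the standard \\ escapes forces either a forward pre-scan or counting backslash runs at every step, which is the whole point of the replacement-based escapes.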

Formatting schemes

The C formatting patterns are an agglomeration of practices, and it shows.  Individually, they work; but there is no common scheme.  For example, compare printf and scanf patterns: you cannot use the same parser/tokeniser to separate fixed strings from formatting specifications for both.

Bourne shell and derivatives, up to current standard POSIX shells and including Bash (becoming somewhat a standard of its own, being so ubiquitous), started with $varname substitution and `command...` command substitution, but quickly moved to ${varexpression} for variable expansion and $(command...) for command substitution, to allow for nested expressions and simple formatting operations.  For example, in POSIX shells, current Bourne derivatives, and Bash, ${varname%/*} yields the value of variable varname with its last slash and everything following it removed (if it has any): given x="foo/bar/baz", ${x%/*} yields foo/bar, while ${x%J*} yields foo/bar/baz unchanged.

Python supports more than one string formatting scheme.  The first is the C-like "formatting string" % (formatted items), favoured by C programmers due to its similarity to the same operation in C.  The second, introduced in Python 2.6 as the str.format() method on strings, is the default/standard in Python 3. For example, to convert a floating point value value to a string with the number of decimals specified by another variable fracdigits, you can use the str.format() method, "{0:.{1}f}".format(value, fracdigits), or "{val:.{digits}f}".format(val=value, digits=fracdigits) ; or f"{value:.{fracdigits}f}" per the Python 3 format specification mini-language.
Unfortunately, because they chose {{ as the escape sequence for a single opening brace, "{{name}}".format(value=1, name="value") does not evaluate to "1" (it evaluates to "{name}" instead).

PHP supports simple shell-like string substitutions (like "Hello, $name!"), has printf()-style functions, and generic expansion of PHP expressions via "{$expression}".

It seems that decades of experience say that using { and } to denote a conversion specification works best, so let's run with that.  At this point I'm not considering whether to forbid, support, or optionally support recursion within conversion specifications, and will leave it for the future.  If we use \/, \(, and \) to escape \, {, and }, respectively, we have a formatting specification that allows one to build the strings either left-to-right or right-to-left.  (Note, however, that ignoring recursion, each formatting specification can be treated as a singular unit; for example, passed to a function that provides compatible conversion facilities.)

Because C does not have dictionaries as a base type, the conversion specifications should refer to the position of the variadic argument, not to any "name".  (Because I envision that there might be a need for a conversion specification that does not refer to any variadic arguments, I'm inclined to reserve 0 for those, and refer to the first variadic argument as 1.)  There may be need for implicit names, say number of characters in the string buffer before/after a specific conversion, or the length of a specific conversion.
Since the formatting parser has to determine the type for each variadic argument, the format spec should begin with the position and type of the variadic argument.
Since a formatting string has to specify the type of each variadic argument without holes, we also need a way to just specify the type of an unused variadic argument, without it generating anything into the string; just like the * suppression in scanf patterns.

Despite its shortcomings, the C style conversion specifications have endured the test of time, so we should try to reuse everything we can from them.  A particular feature is compactness, especially single letter names for types.  Explicit-width and minimum-width unsigned and twos complement integer types do have numbers in their names, so although basic types should have single-letter names, it would be nice if at least in theory any valid type name in C could be used instead.

Note that C does already have more than one string literal type.  For example, L"foo" is a wide string literal.  So, there is precedent for adding a new base type, say formatchar, with f"formatting string" telling the compiler to apply formatting rule checks per this base specification.

Example C string formatting (speculative alternative)

Although augmented Backus-Naur form is the recommended method to specify such formatting, let's look at the typical cases:

    '{' pos type '}'
    '{' pos type '!' '}'
    '{' pos type '/' flags '}'
    '{' pos type ':' proto '}'
    '{' pos type '/' flags ':' proto '}'
    '{' pos type ':' proto '/' flags '}'

where
  • pos is the positive decimal integer identifying the variadic argument, 1 being the first variadic argument.
  • type determines the type of the variadic argument.  This is either one or more characters identifying standard types, or comma (,) followed by the unqualified type name (visible as a typedef in this scope) this variadic argument is a pointer to.
    The first form is the standard one everybody would use.  It probably should include pointers to standard types, too.
    The second one is how the formatting facilities would be user-extensible to arbitrary types, and would be restricted to pointers or references to those types.
  • flags are an unordered set of single-character boolean formatting options.
    These include things like always preceding numerical values with their sign, and an extra check for parameters passed via a pointer, so that if the pointer is NULL, this conversion is just silently ignored.
  • proto is a type-specific formatting prototype or formatting directives.
    For example, for floating-point conversion, it might be 5.2 (indicating five integral digits and two fractional digits); for strings, it might be -20 to indicate left-aligning within a 20-character field.
  • In cases where none of the flags are valid for proto, it is okay to omit the separator between the two (but not between type and these two).
For the named type support, unless the format string parsing is fully integrated into the compiler, instead of relying on the concept of compatible types, we'd have to rely on the exact type name instead.  In practice, the full type name would really just be a type identifier string, used when registering a formatting handler for a new type.
So, in a real way, you could read say {2,typename} conversion simply as: Use the formatter registered for "typename" with no flags or prototype to format the second variadic argument.

The core idea is that the format string parser would be implemented by the C compiler, not by any library.  It would have two variants: one that takes only formatting string literals, and one that takes a pointer to a formatting string at runtime.  The compiler is explicitly allowed to parse literal formatting strings at compile time.  The expressions associating the literal formatter with a custom format are always in the file scope, and the compiler is not required to provide any objects in the compiled object file corresponding to these registrations.  Run-time registration of custom formatters affect the non-literal-string formatter, and has similar execution context limitations as say memory allocation functions.

Essentially, given a formatting string literal f"Process {1d} says Hello, {2s}", the compiler would explicitly be allowed to generate something like
    APPEND_LITERAL(output, "Process ");
    APPEND_INT(output, first variadic parameter goes here);
    APPEND_LITERAL(output, " says Hello, ");
    APPEND_STRING(output, second variadic parameter goes here);
instead of emitting a call to a specific formatting function.  As I mentioned before, GCC and Clang do a minor variant of this to printf()/fprintf() family of functions where the formatting string contains no parameter specs at all, turning them into fputs().  But, describing this facility in terms of the C abstract state machine is utterly horrible, not worth the effort.

Given a pointer to a formatting string, the C compiler would provide the code that parses the formatting string, and calls either its internal formatting functions, or registered formatting functions, to generate the result.

There are a few additional warts, of course; this isn't something I present as a done deal, but as a concept we could work on to push C string formatting forward – not just by extending its features towards more complexity and user extensibility, but towards the lower end, becoming more frugal with resources used to construct strings in the first place.

Raw linear string comparison would not fly for complex processes, so we explicitly need to allow the compiler to refuse a specific type identifier at compile time, or a registration function to fail at runtime, just because of a hash collision or such.  (A good implementation wouldn't; we just want to allow it so that tightly constrained space-efficient compilers are not required to emit oodles of binary code just to support corner cases the programmer can easily avoid – but which might in rare cases only be noticed via testing.)  Things like hash algorithm selection are implementation details, although I'd like compile-time options, please.  DJB2 works well here, though.
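For reference, DJB2 (mentioned above) is tiny; the usual sketch multiplies the running hash by 33 and adds each character:

```c
/* The DJB2 string hash: h = h*33 + c per character, starting from 5381.
   Tiny and fast, which is why it suits hashing short type-identifier
   strings like the ones a formatter registry would use. */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}
```

With a hash this cheap, the registry lookup cost is dominated by the (rare) full-string comparison on a bucket hit, not by the hashing itself.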

The exact same overall specification format would also work for scanf.  For strptime and strftime patterns, we'd probably need to omit the positional argument (which as a number would logically refer to the specific broken-down time field) and merge it into the short type instead; like Y for years, and so on.

Additionally, we could introduce a new family of formatting/scanning functions that instead of variadic arguments, take an array of pointers, or a pointer and an array of offsets, instead.

I can't really express how useful that would be for me in practice, but it boils down to this: such an array would essentially be context, and any number of formatting strings could construct information from that context.  If the number of entries in that array is included when calling such functions, the formatting string would have full access to the context – printing functions describing it to any detail desired, scanning functions replacing values in it if the scanning pattern matches – but no risk of accessing anything else, or leaking unrelated information.  It would be stupid powerful and efficient in several real world use cases I have.

No, wait, I can: they are the extension of strftime()/strptime() to user structures, except that the structure is described using pointers to its fields (or as offsets constructed via _Offsetof or similar based on the structure itself, and the separately passed base pointer is to the structure itself).  strftime() and strptime() would be just the library-provided versions of these, with the user being able to write their own for their own custom types.  Stooo-pid powerful, but very compact and efficient.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Constructing short C strings in limited or constrained situations
« Reply #22 on: July 07, 2021, 12:42:26 am »
(Note that using dynamic string formats is very slippery and often considered bad practice.
Except when internationalization (i18n) is involved: then, dynamic string formats obtained from a message catalog via gettext()/catgets()/LoadString() is definitely the pattern recommended for use.

I'm assuming your "except" here was answering my "bad practice". The fact that it is used for i18n purposes doesn't make this particularly good. It's a weird argument. But, while I think my point was clear (but you never know how people will exactly understand your points in the end), I was mostly pointing out the risks of security exploits and potential nasty bugs associated with using dynamic format strings. I also said they can be pretty powerful, so I was definitely not denying their potential usefulness.

To make things even clearer: what I define here as dynamic format strings are format strings built dynamically at run-time. Just taking a *constant* string from an existing table of constant strings and using it as a format string was not what I was talking about, although it still of course makes static analysis harder/or even impossible to do.

While both cases can be risky, completely dynamically building format strings (as opposed to merely fetching them from a table) adds even more risks. That's really the case I had in mind when I said bad practice.

Now, even the case of fetching a constant string from a table is of course more risky than using a fixed constant string. If your table contains even just one string the format of which doesn't match the expected format, then that could lead to a nasty crash. I frankly don't like this too much. And, if those catalogs come from additional files, especially files that can possibly be easily altered, that could be considered like this:

https://cwe.mitre.org/data/definitions/134.html

So, just my 2 cents and to make my point clearer.
 

Online Nominal AnimalTopic starter

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Constructing short C strings in limited or constrained situations
« Reply #23 on: July 07, 2021, 10:19:51 am »
To make things even clearer: what I define here as dynamic format strings are format strings built dynamically at run-time.
Ah, I see now.  I was thinking of those as run-time built format strings.  I don't like them either.

And yes, it definitely is important to verify that formatting strings obtained from some sort of a message catalog, are "safe": only use the variadic arguments the original also used, and with the same or compatible type.  Sadly, current i18n approaches do not verify these, as far as I know.
 

Offline emece67

  • Frequent Contributor
  • **
  • !
  • Posts: 614
  • Country: 00
Re: Constructing short C strings in limited or constrained situations
« Reply #24 on: July 07, 2021, 03:28:58 pm »
.
« Last Edit: August 19, 2022, 04:31:00 pm by emece67 »
 

