Author Topic: [SOLVED] 24bits/96kHz conversion of s32le to f32le in a stdout to stdin pipe?

RoGeorge

  Super Contributor
  • ***
  Posts: 7085
  • Country: ro
Re: 24bits/96kHz conversion of s32le to f32le in a stdout to stdin pipe?
« Reply #50 on: August 08, 2022, 05:28:49 pm »
TBH, I don't mind if incorrect info gets posted, no matter by whom, but please do not lock the topic.

I was planning to test the bufferless C converter from Nominal Animal and post how it went.  I just don't have the time to study his code right now (at a brief look, it seems worth looking at more closely), but I will do it later or tomorrow and post how it goes.  If you expect further rants, then close it and I'll PM Nominal Animal, or open another topic about that code only.  Your call.

From my side, it was solved when I posted the successful screenshots and the change in noise floor with 24/16 bits:
thank you all, solved!  :D
« Last Edit: August 08, 2022, 05:32:54 pm by RoGeorge »
The following users thanked this post: Ed.Kloonk, gnif

gnif

  Administrator
  • *****
  Posts: 1720
  • Country: au
  • Views and opinions are my own
    • AMD
No worries, I will keep this topic on my watch list :)
The following users thanked this post: RoGeorge

Nominal Animal

  Super Contributor
  • ***
  Posts: 7313
  • Country: fi
    • My home page and email address
Re: 24bits/96kHz conversion of s32le to f32le in a stdout to stdin pipe?
« Reply #52 on: August 08, 2022, 05:58:50 pm »
The choppy thing of 5fps-like only happens while measuring the spectrum with baudline, and it does not affect anything else except the spectrum image rolling in baudline.
Well, 5 fps corresponds to a DFT with 0.2-second non-overlapping windows, doesn't it?  Does the minimum frequency happen to be 5 Hz?

I do not use baudline, so I don't know how tunable its spectrum (discrete Fourier transform) options are.

When using FFTW3 for DFTs, it is important to gather "wisdom" first for the selected window size.  Gathering this "wisdom" means, simply put, timing the different methods of performing a DFT of that size; once found, the best one yields an extremely fast transform on that machine at that window size.

The window size affects the spectrum, because the windowing function itself acts like a filter.  (When the window is advanced by half, and the spectra summed, you often get a better time-domain representation of the actual signal spectrum.  You could, in theory, advance the window sample by sample, but that just interpolates between the different spectrum representations, and does not really provide any new information.)

A windowed sinc has a very steep (but not vertical) stop band, and very flat low pass response, so it works well for this kind of application.  It would be more efficient if it was incorporated into the DFT (it would save one multiplication per sample), but any desktop machine should be able to do this without noticeable slowdown at any window size one wants.  (Just remember that with a minimum frequency f, each window is 1/f long, so the maximum update rate is 2f when advancing each DFT by half a window, and f when advancing each DFT by a full window.  Having audio buffers of exactly 1/(2f) or 1/(4f) in duration will yield minimum latency and best overall throughput.  The 1/(4f) buffer size can help on multicore machines.)
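
As a rough illustration of the arithmetic (a sketch only: the function names are invented for this example, and it assumes the window spans exactly one period of the minimum frequency), the window length and the resulting update rates could be computed like this:

```c
#include <stddef.h>

/* Hypothetical helper: window length in samples for a DFT whose
   lowest nonzero frequency bin is f_min Hz (window duration 1/f_min). */
static size_t window_samples(double sample_rate, double f_min)
{
    return (size_t)(sample_rate / f_min + 0.5);
}

/* Hypothetical helper: spectrum updates per second when each new DFT
   starts hop samples after the previous one. */
static double updates_per_second(double sample_rate, size_t hop)
{
    return sample_rate / (double)hop;
}
```

For 96 kHz audio and a 5 Hz minimum frequency, the window is 19200 samples; advancing by a full window gives 5 updates per second, and advancing by half a window (9600 samples) gives 10.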

In practice, the program would read those fractional buckets of samples, then apply the windowed sinc to 2 or 4 consecutive buckets, and the DFT to the windowed samples.  (The older half of the buckets can then be recycled for new data.)  The DFT consists of complex numbers, with the absolute magnitude (square root of the sum of squared real and imaginary parts) being the interesting part, and the phase (atan2(imaginary part, real part)) often not visualized at all.  This is then updated to the display, usually as a graph, but sometimes also as an intensity diagram in a continuously scrolling display.  Each of these sub-tasks is well handled by a separate thread, utilizing multiple cores simultaneously.  Whenever the DFT-to-absolute-magnitudes thread has constructed a new spectrogram, it just pushes it to a queue and uses a thread-safe notification to tell the UI toolkit that a new one is available for drawing.  The total latency does depend on how much "extra" CPU power you have available, but should not be more than one or two full windows in duration, which should be acceptable for realtime spectrum analysis.

I was planning to test the bufferless C converter from Nominal Animal, and post how it went.
That would be nice.  Note that if you want to add scaling, say divide each sample by 32768, you can just change the conversion loop to
Code:
        /* Convert to float. */
        {
            float *const end = out_buf + have;
            int32_t     *src = in_buf;
            float       *dst = out_buf;

            while (dst < end)
                *(dst++) = (float)(*(src++)) / 32768.0f;
        }
or, if you prefer to do something more complicated, then maybe
Code:
        /* Convert to float. */
        {
            float *const end = out_buf + have;
            int32_t     *src = in_buf;
            float       *dst = out_buf;

            while (dst < end) {
                const double  sample = *(src++);
                *(dst++) = sample / 32768.0;  /* Or some other operations */
            }
        }
It is not technically bufferless either; it does read up to BLOCK_SIZE samples, whatever is immediately available in the pipe, and converts and outputs them as soon as possible.  It never waits for more data to arrive, unless it has already processed all previous data (or the data it has does not end at a sample boundary).  You can increase BLOCK_SIZE if you want (by using gcc -Wall -O2 -DBLOCK_SIZE=65536 ... when compiling); anything above 262144 is probably useless (because of the default pipe size limit in Linux: 262144 four-byte samples is 1048576 bytes, the default value of /proc/sys/fs/pipe-max-size).  A more proper term would be that it tries to use the same buffer size as its input, but not more than BLOCK_SIZE samples at a time.
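
Going back to the scaling loops: if the same scaling is needed in more than one place, it could be wrapped in a small function.  This is only a sketch; convert_scaled and its parameters are names invented for this example, not part of the original program.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n samples from int32_t to float,
   dividing each one by divisor (for example 32768.0f). */
static void convert_scaled(const int32_t *in, float *out, size_t n, float divisor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (float)in[i] / divisor;
}
```

Calling convert_scaled(in_buf, out_buf, have, 32768.0f) would then do the same work as the first loop above.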

RoGeorge

  Super Contributor
  • ***
  Posts: 7085
  • Country: ro
Wow, 125+ fps, thank you!  :-+

So far that's the fastest one.  BLOCK_SIZE almost doesn't matter, but I had to specify 10ms of buffer in arecord (-B 10000).  Though, the code is way over my head, with all the low-level access and the inline assembly.  :)

Nominal Animal

  Super Contributor
  • ***
  Posts: 7313
  • Country: fi
    • My home page and email address
Well, in case it is useful, let me describe the code!

If not, no worries; I'm perfectly happy to just leave this here in case someone else may stumble on this later on, and find this useful.

The core idea is that we use a single read() call, to obtain whatever is already available in the pipe.  See man 7 pipe, especially the "I/O on pipes" and "Pipe capacity" sections.
To avoid dealing with partial samples, we do additional reads until we have a multiple of 4 chars (that is, only complete 32-bit samples) or a full buffer of BLOCK_SIZE samples.

If read() returns 0, it indicates end-of-input, and the program can exit.  For simplicity, if this occurs after a supplementary read because the previous ones did not return an integral number of samples, we just throw the entire tail buffer away.  (This really is intended to be used as a real-time continuous data filter.  To fix this, we could just set a flag ending the outer loop iterations, instead of immediately returning.)

Read should return -1 on error, but because of certain old bugs related to 32-bit versus 64-bit return values, I like to consider all other negative values as EIO errors as well.
(Technically, read will return -1 with errno==EINTR when a signal is delivered to a handler installed without the SA_RESTART flag; and if nonblocking, errno==EAGAIN or errno==EWOULDBLOCK if no data is available, but since this program does not have userspace handlers and we can assume standard input and output are blocking (uh, not nonblocking), we can safely consider all of them as actual errors the program cannot deal with.)

On the write side, we assume that most reads we do are of the original buffer size, and therefore will fit in the output buffer in a single write() call.
However, the code does not assume that this happens every single time.  Instead, it uses a loop to write the entire converted output buffer.
See man 2 write for the description, rules, and further information.

Since there is no cleanup or such to do when complete, and we exit the program directly from within the loop, the outermost loop is a forever loop.  I prefer while (1) { ... }, some others prefer for (;;) { ... }; either one would work fine here.

The logic thus described, let's open up the important parts.  For clarity, I'll separate each chunk with a horizontal line.

Code:
        /* Read some (complete) int32_t's. */
        do {
            ssize_t  n = read(STDIN_FILENO, (char *)(in_buf) + have, (sizeof in_buf) - have);
            if (n > 0)
                have += n;
            else
            if (n == 0)
                return EXIT_SUCCESS;
            else
                return EXIT_FAILURE;
        } while (!have || (have % sizeof in_buf[0]));
Since read() returns the number of chars, we need to consider our input buffer as a buffer of chars.  Within this loop, have is the number of chars we already have in the in_buf.  Thus, (char *)(in_buf) + have is a pointer to the first unused char in in_buf, and (sizeof in_buf) - have is the number of unused chars in it.  We do a read(), trying to fill the rest of the buffer; since have starts at 0, the first read actually tries to fill the entire buffer.  Of course, read() will block until there is some data in the pipe, and then return that data (up to the limit we specified).  n is the actual number of chars we read.

The if clauses check if we did get any data; and if not, exit the program.  (Again, while we may throw some already read data away, we only do so if that already read data did not end at a valid sample boundary.  To me, this is sufficient to indicate the tail part of the data is suspect anyway, and not worth processing.)

The loop condition can be read in English as "as long as have is zero, or have is not a multiple of the size of in_buf array elements".

After the above loop is done, we divide have by the size of the in_buf array elements, so that it becomes the number of samples we have in in_buf.  Because our reads started at the beginning of the buffer, we know the data is properly aligned.  Essentially, it is at this point that we change from treating the input as a stream of chars, and interpret it as the representation of the in_buf array elements.
(I like having such clear logical transition points.)
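
For anyone who wants to reuse the read logic, here is a minimal sketch of it as a standalone function.  The name read_samples, its parameters, and its return convention are assumptions made for this example; the original program keeps this logic inline and exits directly instead of returning.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: read whatever is available from fd, but only
   return once we hold a whole number of int32_t samples.
   Returns the number of samples (> 0), 0 at end of input
   (dropping any partial tail), or -1 on error. */
static long read_samples(int fd, int32_t *buf, size_t max_samples)
{
    const size_t size = max_samples * sizeof buf[0];
    size_t       have = 0;   /* Number of chars in buf so far */

    do {
        ssize_t n = read(fd, (char *)buf + have, size - have);
        if (n > 0)
            have += (size_t)n;
        else if (n == 0)
            return 0;
        else
            return -1;
    } while (!have || (have % sizeof buf[0]));

    /* From here on, we count samples instead of chars. */
    return (long)(have / sizeof buf[0]);
}
```

The division at the end is the transition point described above: before it, the data is a stream of chars; after it, it is an array of samples.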

The next part, the conversion from in_buf to out_buf,
Code:
        /* Convert to float. */
        {
            float *const end = out_buf + have;
            int32_t     *src = in_buf;
            float       *dst = out_buf;

            while (dst < end)
                *(dst++) = *(src++);
        }
is just the pointer version of the simple loop
Code:
        for (size_t i = 0; i < have; i++) {
            out_buf[i] = in_buf[i];
        }
Why did I write it in the pointer form, when the simple loop is so much more readable?

I'm a creature of habit, and GCC used to generate better code when using pointers, compared to array indexing, on some architectures.  x86 and x86-64 have powerful indexing built in to their instruction sets, so the simple loop tended to be preferable on x86 and x86-64 even using old versions of GCC.
Or, you could equally say that since I was in the pointer-logic-thought-mode when writing the code, I just didn't stop and think before I wrote the loop, and just let my fingers type the solution when I was already thinking something else.

Even a quick look at the code generated for the two loops at Compiler Explorer shows I really should have written the array indexing form instead.  What can I say in my defense? I never claimed the code was the best I could think of, I only indicated this would be something I would write in a couple of minutes to perform the task I needed it to perform, with the logic I described above.
There is always room to learn and improve.
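
To compare the two forms side by side (for example at Compiler Explorer), they can be put into two small functions.  This is just a sketch; the function names are made up for this example, and the original program has the loop inline.

```c
#include <stddef.h>
#include <stdint.h>

/* Pointer form, as written in the program. */
static void convert_ptr(const int32_t *in, float *out, size_t n)
{
    float *const   end = out + n;
    const int32_t *src = in;
    float         *dst = out;

    while (dst < end)
        *(dst++) = *(src++);
}

/* Array indexing form; produces exactly the same results. */
static void convert_idx(const int32_t *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}
```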

The final part, writing out the (full) converted out_buf, uses the same logic as the read loop, except that this time we loop until the buffer is empty.
end points to just past the final char to be written.  Note how (char *)(out_buf + have) is the char pointer to the have'th element; it is deliberately NOT (char *)(out_buf) + have, which would be a char pointer to the have'th char.

Both ptr (pointing to the first char that needs to be written) and end (pointing to the char following the last char to be written, or first char after the data to be written) are pointers to char, because those are the units write() operates in.  The expression (size_t)(end - ptr) is the number of chars between the two pointers, if and only if ptr <= end (or equivalently end >= ptr).
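
The difference between the two casts is easy to check with a small sketch (the helper names are invented for this example) that measures how far each expression is from the start of the buffer, in chars:

```c
#include <stddef.h>

/* Offset in chars of (char *)(buf + have): the have'th float element. */
static size_t offset_of_element(float *buf, size_t have)
{
    return (size_t)((char *)(buf + have) - (char *)buf);
}

/* Offset in chars of (char *)(buf) + have: the have'th char. */
static size_t offset_of_char(float *buf, size_t have)
{
    return (size_t)(((char *)buf + have) - (char *)buf);
}
```

With have == 3, the first one yields 3 * sizeof (float) and the second one yields 3.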

The loop condition can be read in English as "while we have chars between ptr and end to be written":
Code:
            while (ptr < end) {
                ssize_t  n = write(STDOUT_FILENO, ptr, (size_t)(end - ptr));
                if (n > 0)
                    ptr += n;
                else
                    return EXIT_FAILURE;
            }
All error cases are aggregated into the else clause.  In theory, write() should return a positive value, or -1 with errno set (see man 2 write for a full description).  I personally consider both other negative values and 0 as equivalent to EIO error.  (For descriptions of errno codes, see man 3 errno.)

Again, the key thing is that the low-level write() call does not necessarily write the entire buffer: it can return a short count, for example because the output is to a pipe and the pipe is nearly full (because the reader is too slow in reading).  It will block until at least one char is written, or an error occurs.
(It would return -1 with errno==EINTR if interrupted by a signal delivery to a userspace handler installed without the SA_RESTART flag; if the descriptor was nonblocking, with errno==EAGAIN or errno==EWOULDBLOCK if nothing can be written immediately; and so on.)
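
The same write logic, as a minimal sketch of a reusable function.  The name write_all and its return convention are assumptions for this example; the original program has the loop inline and returns EXIT_FAILURE directly.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write all len chars starting at data to fd.
   Returns 0 on success, or -1 if write() reports an error
   or makes no progress. */
static int write_all(int fd, const void *data, size_t len)
{
    const char       *ptr = data;
    const char *const end = ptr + len;

    while (ptr < end) {
        ssize_t n = write(fd, ptr, (size_t)(end - ptr));
        if (n > 0)
            ptr += n;      /* Short write: continue with the rest */
        else
            return -1;     /* 0 and all negative values are errors here */
    }
    return 0;
}
```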

If standard output is a pipe, and the reader closes its end, this process will be killed by the SIGPIPE signal the first time we attempt to write to the pipe after that.  We can catch or ignore that signal, in which case the write would fail with errno==EINTR or errno==EPIPE.  However, since being killed by SIGPIPE is fine with us (humans using this tool for its stated purpose), we don't need to worry about this either.
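
For the curious, the ignored-SIGPIPE case can be tried out with a small sketch like the one below.  The function name is made up for this example, and note that it changes the SIGPIPE disposition for the whole process, which the original program never does.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE, write one char to a pipe whose read end is already
   closed, and return the errno that write() reports (0 if the write
   unexpectedly succeeds, -1 if the pipe could not be created). */
static int errno_after_write_to_closed_pipe(void)
{
    int fds[2];
    int result = 0;

    signal(SIGPIPE, SIG_IGN);
    if (pipe(fds) == -1)
        return -1;
    close(fds[0]);                      /* No reader left */
    if (write(fds[1], "x", 1) == -1)
        result = errno;                 /* Save before close() can change it */
    close(fds[1]);
    return result;
}
```

With SIGPIPE ignored, the write fails with EPIPE instead of the process being killed.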

Even if you personally –– whoever might be reading this post –– do not find this useful, I think there is a chance it might be useful to someone finding this thread later using a web search due to the keywords (stdin stdout pipe conversion).

In particular, those comparing the freestanding implementation to the standard library implementation might find it useful, because it illustrates how the logic stays the same even when the standard library is not used, and how GCC/clang/ICC extended assembly can be used to interface to something completely different (in this case, to a Linux kernel providing us with read, write, and exit/exit_group syscalls).

In a very real way, the standard C library is an abstraction we can replace if we wanted to, as long as we can devise sane/effective/acceptable function interfaces for the things we need.  Because the three syscalls this program needs are so simple, I just used the Linux equivalents for the API.
In particular, in all Linux architectures, sizeof (long) == sizeof (void *), so that long has the same properties as intptr_t, and unsigned long the same properties as uintptr_t.  This is why the function wrappers around the syscalls use long and not some other type.  Other, non-Linux(-compatible) systems use other conventions, so this is the kind of thing one has to think about when creating interfaces and APIs in freestanding C.
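
To give an idea of what such a wrapper can look like, here is a minimal sketch of a write wrapper using GCC/clang extended assembly.  It is not the code from the freestanding implementation: the name raw_write is invented for this example, the assembly assumes Linux on x86-64 only (syscall number 1 in rax, arguments in rdi, rsi, rdx), and on anything else it simply falls back to the standard library write().

```c
#include <stddef.h>
#include <unistd.h>

/* The wrapper passes a pointer in a long, so check the assumption. */
_Static_assert(sizeof (long) == sizeof (void *), "long must be pointer-sized");

#if defined(__linux__) && defined(__x86_64__)
/* Raw write syscall: returns the number of chars written,
   or a negative errno value on error (the kernel convention). */
static long raw_write(long fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (1L), "D" (fd), "S" ((long)buf), "d" (len)
                      : "rcx", "r11", "memory");
    return ret;
}
#else
/* Fallback for other systems: returns -1 on error (errno is set). */
static long raw_write(long fd, const void *buf, unsigned long len)
{
    return (long)write((int)fd, buf, (size_t)len);
}
#endif
```

The error convention is the main interface decision here: the raw syscall returns a negative errno value, whereas the standard library wrapper returns -1 and sets errno.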
The following users thanked this post: Ed.Kloonk, gnif, phs, newbrain, RoGeorge

