In general, you can tune the /proc/sys/vm/dirty* (https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html) pseudofiles to get the exact behaviour you want. (That is, what you see is not "Linux behaviour", it is just a consequence of the defaults set by your Linux distribution, or perhaps even just the kernel defaults that are the developers' best guess. They are definitely exposed for sysadmins to tune if so desired.)
For this, I'd use the same approach as you have.
I'd recommend using larger power-of-two write chunks, up to 2 MiB (so 32768, 65536, 131072, 262144, 524288, 1048576, or 2097152 bytes). I'd also use a third thread, regularly calling fdatasync() on the file descriptor. With multiple files, you only really need one common datasync thread. Just make sure that if the descriptor is already clean (so the call does not block), you don't hammer it thousands of times per second. In a very real sense, this is an "asynchronous" datasync. Also, do remember to use a sensibly sized stack for the threads: 2*PTHREAD_STACK_MIN suffices well if you do not recurse or keep large structures or arrays on the stack, and it keeps the process VM size small, which is always a plus.
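Something along these lines would do; a minimal sketch only, where the 250 ms interval and the names are just an illustration:
#include <pthread.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t datasync_stop = 0;

/* Periodic "datasync caretaker" thread: flushes file data to disk a few
   times per second at most, so dirty pages never pile up into one huge
   blocking writeback. Pass a pointer to the output file descriptor. */
static void *datasync_thread(void *arg)
{
    const int fd = *(const int *)arg;
    const struct timespec interval = { .tv_sec = 0, .tv_nsec = 250000000L };

    while (!datasync_stop) {
        fdatasync(fd);              /* does not block if the descriptor is already clean */
        nanosleep(&interval, NULL); /* rate limit: don't hammer a clean descriptor */
    }
    fdatasync(fd);                  /* final flush before exiting */
    return NULL;
}
You would start it with pthread_create() right after opening the output file (using a pthread_attr_t with the reduced stack size mentioned above), and set datasync_stop before joining it at exit.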
I'd also make the parameters tunable via command-line arguments: the size of each data bucket (to be written on disk using a single write() call if possible), the number of buckets in rotation, the minimum interval between fdatasync() calls. Experiment a bit, and set the defaults to whatever works for you now. (I'd even make them build-time configurable, for maximum tunability.)
If you use
BUCKET_SIZE := 131072
CFLAGS := -DBUCKET_SIZE=$(BUCKET_SIZE) -Wall -O2
in your Makefile, you can use
#ifndef BUCKET_SIZE
#define BUCKET_SIZE 65536
#endif
in your C source files. This way, the bucket size will default to 64k in the C source, but the Makefile will override it to 128k or whatever the user specifies at build time (make BUCKET_SIZE=262144 clean all, for example).
Alternatively, put the plain defines in an intuitively named file, say config.h.
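For example (BUCKET_SIZE matches the Makefile above; the other two knobs are only illustrative names):
/* config.h: build-time defaults, each overridable with -DNAME=value in CFLAGS,
   or at run time via command-line options. */
#ifndef CONFIG_H
#define CONFIG_H

#ifndef BUCKET_SIZE
#define BUCKET_SIZE           65536   /* bytes written per write() call */
#endif

#ifndef BUCKET_COUNT
#define BUCKET_COUNT          1000    /* buckets in rotation */
#endif

#ifndef DATASYNC_INTERVAL_MS
#define DATASYNC_INTERVAL_MS  250     /* minimum interval between fdatasync() calls */
#endif

#endif /* CONFIG_H */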
Command-line parsing is very simple using getopt()/getopt_long() (https://www.man7.org/linux/man-pages/man3/getopt.3.html) and small value-parsing helpers like
#include <stdlib.h>
#include <ctype.h>
#include <errno.h>

int parse_size(const char *src, size_t *to)
{
    const char *end;
    unsigned long val;

    if (!src || !*src)
        return -1; /* No string specified */

    end = src;
    errno = 0;
    val = strtoul(src, (char **)&end, 0);
    if (errno || end == src)
        return -1; /* Conversion failed */

    if ((unsigned long)((size_t)val) != val)
        return -1; /* Overflow, does not fit into target type */

    while (isspace((unsigned char)(*end)))
        end++;
    if (*end)
        return -1; /* Garbage at end of string; could also check for size suffixes (G,M,k,m,u,n) */

    if (to)
        *to = (size_t)val;
    return 0;
}
(and similarly for double, except for using strtod() instead of strtoul(), and so on).
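Putting it together, an option-parsing skeleton could look roughly like this; the option names and defaults are purely illustrative, and it assumes the parse_size() helper and the BUCKET_SIZE default shown above:
#include <stdio.h>
#include <getopt.h>

int main(int argc, char *argv[])
{
    size_t bucket_size  = BUCKET_SIZE;  /* compile-time default */
    size_t bucket_count = 1000;

    static const struct option longopts[] = {
        { "bucket-size",  required_argument, NULL, 'b' },
        { "bucket-count", required_argument, NULL, 'n' },
        { "help",         no_argument,       NULL, 'h' },
        { NULL, 0, NULL, 0 }
    };

    int opt;
    while ((opt = getopt_long(argc, argv, "b:n:h", longopts, NULL)) != -1) {
        switch (opt) {
        case 'b':
            if (parse_size(optarg, &bucket_size)) {
                fprintf(stderr, "%s: invalid bucket size.\n", optarg);
                return 1;
            }
            break;
        case 'n':
            if (parse_size(optarg, &bucket_count)) {
                fprintf(stderr, "%s: invalid bucket count.\n", optarg);
                return 1;
            }
            break;
        case 'h':
            printf("Usage: %s [-b SIZE] [-n COUNT]\n"
                   "Receive a TCP stream and write it to disk.\n", argv[0]);
            return 0;
        default: /* unknown option or missing argument */
            fprintf(stderr, "Run '%s --help' for usage.\n", argv[0]);
            return 1;
        }
    }

    /* ... actual work using bucket_size and bucket_count ... */
    return 0;
}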
If you always implement -h | --help, with a short description of what the utility does in the usage, you'll thank yourself years later. I've done this for years myself, for all kinds of runtime tests and experiments, and I literally have hundreds of these (well over a thousand, actually; but most of them in cold storage). Just running the utility with --help tells me what it does, so I don't have to memorize anything. I also add a README file describing the reasons why I created it, and my key findings, so that when I come back to it later, I don't have to spend much cognitive effort to continue working on it.
I keep these one per directory, in various trees (currently, I have one directory for EEVBlog examples, one for physics stuff, one for esoteric data structures and experiments, and so on).
Make sure you aren't fully covered by
nc 192.168.1.2 123 |dd bs=8M iflag=fullblock oflag=dsync of=output.file
If you need to roll your own (say, the TCP protocol is more complex than just receiving a stream), then yes, your approach is probably already good enough for anything but ridiculously underpowered hardware. You can hopefully queue many seconds' worth of data before running out of RAM, and the CPU load is likely negligible.
Syncing larger chunks could help the filesystem better organize data contiguously on disk, but 8 MB is perhaps already nothing to worry about.
You could still use POSIX_FADV_DONTNEED after fdatasync() to hopefully discard the written data from the page cache and reduce cache pollution. Particularly useful if other things run on the same machine.
Make sure you aren't fully covered by
nc 192.168.1.2 123 |dd bs=8M iflag=fullblock oflag=dsync of=output.file
Good point. I admit I assumed there was more to the communications than just the above; but the above does work very well (although I like to use bs=2M as it seems to be the sweet spot on my machines).
Syncing larger chunks could help the filesystem better organize data contiguously on disk, but 8 MB is perhaps already nothing to worry about.
True. There are also I/O speed benefits to using 2^n block sizes up to 2M; above that, you are basically telling the FS to use contiguous fragments for the writes. The largest single write Linux supports is 0x7FFFF000 = 2,147,479,552 bytes, so anything above 2G will add extra work for the FS. (There are only 9 powers of two between 2M and 2G: 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, and 1G.)
You could still use POSIX_FADV_DONTNEED after fdatasync() to hopefully discard the written data from the page cache and reduce cache pollution. Particularly useful if other things run on the same machine.
Yep, that is a good idea.
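For reference, the combination boils down to something like this (a sketch only; fd is the output file and len is however much has been written so far):
#include <fcntl.h>      /* posix_fadvise(), POSIX_FADV_DONTNEED */
#include <unistd.h>     /* fdatasync() */
#include <sys/types.h>  /* off_t */

/* Flush the written data to disk, then tell the kernel it will not be read
   back, so the pages can be dropped from the page cache instead of
   accumulating. A len of 0 means "from offset to end of file". */
static void flush_and_drop(int fd, off_t len)
{
    if (fdatasync(fd) == 0)
        posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
}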
There is also the Linux-specific splice() (https://man7.org/linux/man-pages/man2/splice.2.html) call if you do not do anything to the data received via TCP except write to the file. You do need to call it in a loop until it returns 0 (indicating sender closed the write end) or -1 (with errno set to indicate the error). Thing is, I haven't experimented with it in this direction (socket → file), so I don't know how it would behave in this particular situation. It would be worth carefully examining, though, if a Linux-only solution is acceptable.
I suspect that one would still want a caretaker thread to examine the file position (which splice() is supposed to maintain) via len = lseek(fd, 0, SEEK_CUR), and then call fdatasync(fd) and posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED) to keep the data off the page cache. The len could then be used to measure progress, too.
This does reduce memory copies, and reduces the need for userspace buffers, but again, it is Linux-specific (so completely unportable to other systems), and I haven't verified whether it would actually work measurably better in this use case.
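If you do try it, note that splice() requires one end of each copy to be a pipe, so a socket-to-file transfer goes through an intermediate pipe (the data still never enters userspace). A rough, untested sketch, assuming sockfd is the connected TCP socket and filefd is the output file:
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Move data from a TCP socket to a file without copying it through
   userspace. Returns 0 on clean end of stream, -1 on error. */
static int splice_socket_to_file(int sockfd, int filefd)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    while (1) {
        /* Socket -> pipe */
        ssize_t n = splice(sockfd, NULL, pipefd[1], NULL,
                           1048576, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n == 0)
            break;                      /* sender closed the connection */
        if (n == -1) {
            if (errno == EINTR)
                continue;
            close(pipefd[0]);
            close(pipefd[1]);
            return -1;
        }

        /* Pipe -> file; splice() advances the file position for us */
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, filefd, NULL,
                               (size_t)n, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m == -1) {
                if (errno == EINTR)
                    continue;
                close(pipefd[0]);
                close(pipefd[1]);
                return -1;
            }
            n -= m;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}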
The queue between the network receive thread and the file write thread fills up. Once that happens, no more data can be inserted into the queue by the network receiver, and it opts to drop the data rather than stall. The server cannot accept backpressure - it's reading data from an SDR.
The queue (https://pastebin.com/BcS0AUZV) is a circular buffer with semaphores to count the number of entries and signal when it's safe to increment the read or write pointers. The queue length defaults to 1000 packets (~32k each), but can be changed. I found that in operation, the queue depth is at or near zero almost all the time, but due to variable I/O delay, it can fill up to a couple of hundred entries.
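In outline, the pattern is the classic counting-semaphore ring buffer; a simplified sketch (not the actual pastebin code, and sized with the defaults mentioned above):
#include <semaphore.h>
#include <string.h>

#define QUEUE_LEN   1000
#define PACKET_SIZE 32768

/* 'slots' counts free entries, 'items' counts filled ones: the consumer
   blocks when empty, and the producer drops packets instead of blocking
   when full (sem_trywait()). head is touched only by the producer, tail
   only by the consumer. */
struct packet_queue {
    unsigned char data[QUEUE_LEN][PACKET_SIZE];
    size_t        len[QUEUE_LEN];
    unsigned      head, tail;
    sem_t         slots, items;
};

static int queue_init(struct packet_queue *q)
{
    q->head = q->tail = 0;
    if (sem_init(&q->slots, 0, QUEUE_LEN) == -1)
        return -1;
    return sem_init(&q->items, 0, 0);
}

/* Producer: returns 0 if queued, -1 if the queue is full (packet dropped). */
static int queue_push(struct packet_queue *q, const void *buf, size_t n)
{
    if (sem_trywait(&q->slots) == -1)
        return -1;
    memcpy(q->data[q->head], buf, n);
    q->len[q->head] = n;
    q->head = (q->head + 1) % QUEUE_LEN;
    sem_post(&q->items);
    return 0;
}

/* Consumer: blocks until a packet is available, returns its length. */
static size_t queue_pop(struct packet_queue *q, void *buf)
{
    sem_wait(&q->items);
    size_t n = q->len[q->tail];
    memcpy(buf, q->data[q->tail], n);
    q->tail = (q->tail + 1) % QUEUE_LEN;
    sem_post(&q->slots);
    return n;
}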
Hardware: not super modern, but no slouch - AMD FX8350, 16 GB RAM, i219 NIC, a 1 TB 7200 rpm SATA disk (xfs; I had tried ext4 with similar results). It really is not the hardware though, it's poorly managed page caching. Once I use posix_fadvise() to let the kernel know that I don't really need it to cache the data in memory, it flushes the data out earlier, rather than waiting and trying to do it all at once.
It's fairly easy to verify that the kernel is indeed keeping the dirty pages for a long time by doing:
cat /proc/meminfo | grep -i dirty
Without calling posix_fadvise() or fdatasync(), the dirty cache size keeps growing until all free memory is used, then it decides to write it all at once, causing subsequent write()s to block.
Edit:
I haven't tried a single-thread approach, but that would not work: if the write() to disk blocks, I would be relying on the server to queue up data that it cannot send() to the client. In my (limited) experience, file I/O, especially to mechanical HDDs, can have brief latency spikes, hence the use of a queue. Ideally, the kernel's buffering should do this for me, but that hasn't been my experience so far.