EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: radar_macgyver on October 28, 2022, 06:26:01 am

Title: Streaming network data to disk
Post by: radar_macgyver on October 28, 2022, 06:26:01 am
I'm writing a simple program in C for Linux that reads a stream of data from a TCP socket, and writes it to disk. It uses two threads, one for reading from the network, and one to write to disk. In between is a queue that holds packets that were received but not yet written. The queue is thread-safe. The average data rate coming in is about 30 to 60 MB/s, each packet is about 32 kB.

The issue I'm having is that the thread writing to disk is affected by Linux's page caching behavior. Initially, nothing's written to disk, it just piles up in the page cache and writes proceed smoothly. Then, once most free system memory is used up, the kernel flushes those buffers, and the next write() takes a long time to complete, resulting in the queue filling up, and data being dropped.

One naive mitigation I tried was to call fdatasync() after every 8 MB was written. The call to fdatasync() blocks until the buffers are flushed, but since it's only 8 MB, the queue absorbs the latency. I did stumble across the O_DIRECT flag to pass to open(), but it does come with a lot of other requirements, like requiring size and alignment restrictions on the buffer passed to write(). Torvalds apparently doesn't think it's a good idea (https://yarchive.net/comp/linux/o_direct.html), and recommends using posix_fadvise() instead. I tried that, with the POSIX_FADV_DONTNEED flag, but it didn't seem to do very much when called right after doing a write(). If called before calling close(), it seems to have the same effect as calling fdatasync().

Is there a better way to stream data to disk?
Title: Re: Streaming network data to disk
Post by: Nominal Animal on October 28, 2022, 01:39:52 pm
In general, you can tune the /proc/sys/vm/dirty* (https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html) pseudofiles to get the exact behaviour you want.  (That is, what you see is not "Linux behaviour", it is just a consequence of the defaults set by your Linux distribution, or perhaps even just the kernel defaults that are the developers' best guess.  They are definitely exposed for sysadmins to tune if so desired.)



For this, I'd use the same approach as you have.

I'd recommend using larger power-of-two write chunks, up to 2 MiB (so 32768, 65536, 131072, 262144, 524288, 1048576, or 2097152 bytes).  I'd also use a third thread, regularly calling fdatasync() on the file descriptor.  With multiple files, you only really need one common datasync thread.  Just make sure that if the descriptor is already clean (the call does not block), you don't just hammer it thousands of times per second.  In a very real sense, this is an "asynchronous" datasync.  Also, do remember to use a sensibly sized stack for the threads: 2*PTHREAD_STACK_MIN suffices well if you do not do recursion or have large local structures or arrays on stack, and keeps the process VM size small, which is always a plus.

I'd also make the parameters tunable via command-line arguments: the size of each data bucket (to be written on disk using a single write() call if possible), the number of buckets in rotation, the minimum interval between fdatasync() calls.  Experiment a bit, and set the defaults to whatever works for you now.  (I'd even make them build-time configurable, for maximum tunability.)

If you use
    BUCKET_SIZE := 131072
    CFLAGS := -DBUCKET_SIZE=$(BUCKET_SIZE) -Wall -O2
in your Makefile, you can use
    #ifndef  BUCKET_SIZE
    #define  BUCKET_SIZE  65536
    #endif
in your C source files.  This way, the bucket size will default to 64k in the C source, but the Makefile will override it to 128k or whatever the user specifies at build time (make BUCKET_SIZE=262144 clean all, for example).
Alternatively, put the plain defines in an intuitively named file, say config.h.

Command-line parsing is very simple using getopt()/getopt_long() (https://www.man7.org/linux/man-pages/man3/getopt.3.html) and value parsers such as
Code: [Select]
#include <stdlib.h>
#include <ctype.h>
#include <errno.h>

int parse_size(const char *src, size_t *to)
{
    if (!src || !*src)
        return -1;  /* No string specified */

    const char *end;
    unsigned long  val;

    end = src;
    errno = 0;
    val = strtoul(src, (char **)&end, 0);
    if (errno || end == src)
        return -1;  /* Conversion failed */

    if ((unsigned long)((size_t)val) != val)
        return -1;  /* Overflow, does not fit into target type */

    while (isspace((unsigned char)(*end)))
        end++;
    if (*end)
        return -1;  /* Garbage at end of string; could also check for size suffixes (G,M,k,m,u,n) */
   
    if (to)
        *to = (size_t)val;
    return 0;
}
(and similarly for double, except for using strtod() instead of strtoul(), and so on).
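For instance, the double variant might look like this (a sketch along the same lines, not battle-tested code):
Code: [Select]
```c
#include <stdlib.h>
#include <ctype.h>
#include <errno.h>

int parse_double(const char *src, double *to)
{
    if (!src || !*src)
        return -1;  /* No string specified */

    const char *end = src;
    errno = 0;
    double val = strtod(src, (char **)&end);
    if (errno || end == src)
        return -1;  /* Conversion failed */

    /* Allow trailing whitespace, but nothing else. */
    while (isspace((unsigned char)(*end)))
        end++;
    if (*end)
        return -1;  /* Garbage at end of string */

    if (to)
        *to = val;
    return 0;
}
```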

If you always implement -h | --help, with a short description of what the utility does in the usage, you'll thank yourself years later.  I've done this for years myself, for all kinds of runtime tests and experiments, and I literally have hundreds of these (well over a thousand, actually; but most of them in cold storage).  Just running the utility with --help tells me what it does, so I don't have to memorize anything.  I also add a README file describing the reasons why I created it, and my key findings, so that when I come back to it later, I don't have to spend much cognitive effort to continue working on it.
I keep these one per directory, in various trees (currently, I have one directory for EEVBlog examples, one for physics stuff, one for esoteric data structures and experiments, and so on).
Title: Re: Streaming network data to disk
Post by: magic on October 28, 2022, 06:58:42 pm
Make sure your use case isn't already fully covered by
Code: [Select]
nc 192.168.1.2 123 |dd bs=8M iflag=fullblock oflag=dsync of=output.file

If you need to roll your own (say, the TCP protocol is more complex than just receiving a stream) then yeah, your approach is probably already good enough for anything but ridiculously underpowered hardware. You can hopefully queue many seconds' worth of data before you run out of RAM, and CPU load is likely negligible.

Syncing larger chunks could help the filesystem organize data contiguously on disk, but 8MB is perhaps nothing to worry about already.

You could still use POSIX_FADV_DONTNEED after fdatasync() to hopefully discard written data from page cache and simply reduce system pollution. Particularly useful if other things run on the same machine.
Title: Re: Streaming network data to disk
Post by: Nominal Animal on October 28, 2022, 10:00:18 pm
Make sure your use case isn't already fully covered by
Code: [Select]
nc 192.168.1.2 123 |dd bs=8M iflag=fullblock oflag=dsync of=output.file
Good point.  I admit I assumed there was more to the communications than just the above; but the above does work very well (although I like to use bs=2M as it seems to be the sweet spot on my machines).

Syncing larger chunks could help the filesystem organize data contiguously on disk, but 8MB is perhaps nothing to worry about already.
True.  There are also I/O speed benefits to using power-of-two (2^n) block sizes up to 2M; above that, you are basically telling the FS to use contiguous fragments for the writes.  The largest single write() Linux supports is 0x7FFFF000 = 2,147,479,552 bytes, so anything above 2G will add extra work for the FS.  (There are only 9 powers of two between 2M and 2G: 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, and 1G.)

You could still use POSIX_FADV_DONTNEED after fdatasync() to hopefully discard written data from page cache and simply reduce system pollution. Particularly useful if other things run on the same machine.
Yep, that is a good idea.



There is also the Linux-specific splice() (https://man7.org/linux/man-pages/man2/splice.2.html) call if you do not do anything to the data received via TCP except write to the file.  You do need to call it in a loop until it returns 0 (indicating sender closed the write end) or -1 (with errno set to indicate the error).  Thing is, I haven't experimented with it in this direction (socket → file), so I don't know how it would behave in this particular situation.  It would be worth carefully examining, though, if a Linux-only solution is acceptable.
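If one were to experiment with it, the loop might look roughly like this (untested sketch; note that splice() requires one side to be a pipe, so the usual pattern is socket → pipe → file):
Code: [Select]
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Move data from a socket to a file with splice(), avoiding a userspace
 * copy.  Returns total bytes moved, or -1 on error. */
static long long splice_to_file(int sockfd, int filefd)
{
    int pipefd[2];
    if (pipe(pipefd) != 0)
        return -1;

    long long total = 0;
    for (;;) {
        /* socket -> pipe */
        ssize_t n = splice(sockfd, NULL, pipefd[1], NULL, 65536, SPLICE_F_MOVE);
        if (n == 0)
            break;                      /* sender closed the write end */
        if (n == -1) {
            if (errno == EINTR)
                continue;
            total = -1;
            break;
        }
        /* pipe -> file; drain everything we just received */
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, filefd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m == -1) { total = -1; goto done; }
            n -= m;
            total += m;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```
With off_out left NULL, splice() advances the file descriptor's own position, which is what lets a caretaker thread observe progress via lseek().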

I suspect that one would still want a caretaker thread to examine the file position (which splice() is supposed to maintain) via len = lseek(fd, 0, SEEK_CUR), and then fdatasync(fd) and posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED) to keep the data off the page cache.  The len could then be used to measure progress, too.

This does reduce memory copies, and reduces the need for userspace buffers, but again, it is Linux-specific (so completely unportable to other systems), and I haven't verified whether it would actually work measurably better in this use case.
Title: Re: Streaming network data to disk
Post by: radar_macgyver on October 28, 2022, 11:52:38 pm
Thank you Nominal and magic for your detailed responses.

The program in question does have to do some handshaking with the server prior to data being available, so using splice() or nc wouldn't work. Would have been nice though!

The thing I was missing was one can call fdatasync() or posix_fadvise() asynchronously from another thread. That keeps the thread doing the write() from blocking, and the queue from growing too large (as long as the average data rate stays below that of the underlying HDD). I am now spawning a new thread each time I need to do the sync operation. I guard against more than one concurrent sync thread by calling pthread_join() on the previous thread ID (if it exists). The pthread_join() would of course block, but that's an indication that the underlying drive can't keep up.

I missed the para in the posix_fadvise() man page that mentioned that the offset and length must be multiples of the underlying device's block size - this is why my previous attempt failed - I tried to call write(), followed by posix_fadvise() with the appropriate offset and the same length as the buffer previously written. Simply setting both to zero worked like a champ. While the man pages were silent on the difference between POSIX_FADV_DONTNEED and POSIX_FADV_NOREUSE, this post (https://lwn.net/Articles/449420/) indicated that POSIX_FADV_DONTNEED is appropriate for my use-case.

Strangely, calling posix_fadvise(POSIX_FADV_DONTNEED) by itself (without the fdatasync()) blocked. I thought that posix_fadvise() would just set a flag somewhere and the kernel cleanup threads would be signaled to do their job asynchronously. Oh well, calling it from a separate thread did the trick. Thank you for the tip!

Adjusting the VM tunables helped. On my CentOS 8 box, most were at their defaults. Adjusting dirty_expire_centisecs from 3000 down to 100 made things work better, but this was not needed after adding posix_fadvise(POSIX_FADV_DONTNEED) in a separate thread.

My program does use getopt_long to configure itself, along with a few environment variables to set defaults (like the server to connect to). That saves me from having to type it out each time, and also from hard-coding it. However, I do need to get into the habit of doing this for various 'throwaway' test code.


As an aside, while testing this, I was using a mechanical SATA drive to avoid putting wear on my SSD. smartctl reports that this disk's power-on counter is 132609, so at least 15 years old! Put a newer disk in, and it's a lot faster.
Title: Re: Streaming network data to disk
Post by: ejeffrey on October 29, 2022, 05:45:40 am
The issue I'm having is that the thread writing to disk is affected by Linux's page caching behavior. Initially, nothing's written to disk, it just piles up in the page cache and writes proceed smoothly. Then, once most free system memory is used up, the kernel flushes those buffers, and the next write() takes a long time to complete, resulting in the queue filling up, and data being dropped.

Where is the data being dropped?  Nothing you described should drop data.  If you have some real-time constraints or systems that can't accept back pressure, that would be helpful information.

30-60 MB/s is not nothing, but it shouldn't really require fine tuning of kernel parameters to sustain.  Linux shouldn't leave dirty buffers around for as long as it sounds like you are seeing.  When you do a write, it should start flushing to disk reasonably quickly.  The data will stay in memory, and the VM system is not very aggressive about detecting streaming IO and discarding data early, but those should always be clean pages that can be discarded cheaply to make room for new data.

What kind of hardware is this?  I'm assuming a reasonably modern x86 CPU, 1 Gbit built-in networking, and an SSD? NVMe or SATA?  How much memory is available?

Have you tried a single-threaded approach?  Or is there other work going on that makes that impractical?

What type of queue are you using and how big is it?
Title: Re: Streaming network data to disk
Post by: magic on October 29, 2022, 07:07:12 am
Linux absolutely is retarded like that and infamous for latency problems in presence of heavy write workloads.

It's legendary that you can fill a pendrive with gigabytes of data in a few seconds and then wait a few minutes until it's all written back, during which time you are effectively out of RAM and later all your previously cached files need to be re-fetched from disk ::)

In OP's case it works somewhat like this:
You want to write something? Great, I will keep it around for a moment and maybe optimize the on-disk placement better and whatnot.
Oh, you just wrote more? Fine, we will find some memory to buffer it by discarding stale data nobody accessed in the last 100 milliseconds.
Oh fuck fuck fuck, I just swapped out the code segment of the process doing the writes and it's blocked waiting for swap-in.
Gotta flush some of those dirty buffers to disk...
Meanwhile TCP stops accepting packets and the sender is forced to drop data after it runs out of its own buffer memory.

The program in question does have to do some handshaking with the server prior to data being available, so using splice() or nc wouldn't work. Would have been nice though!
For the record, it's possible to do some initial back and forth manually and then call splice() or even execl("dd", ...).

That keeps the thread doing the write() from blocking, and the queue from growing too large (as long as the average data rate stays below that of the underlying HDD). I am now spawning a new thread each time I need to do the sync operation.
This is misguided and the root of your problem. If your write() doesn't block, that just means the data have been moved from one place in memory to another. And the kernel will accept a lot of data before doing anything about it and it will always keep a fairly long backlog which does nothing good for you, only using up your RAM and increasing the amount of data lost in the event of crash or power loss.

OTOH, if you block the writer thread, then the kernel will have nothing better to do than writing your data to disk. Particularly if the writer is blocked on fdatasync(), explicitly demanding a writeback. BTW, note that this call can return I/O errors, which indicates your data are not saved. And the more data you queue up in the kernel before calling fdatasync(), the less idea you will have of which part had been lost.

There is the O_DSYNC flag to open(), which causes all writes to block (and possibly return errors) until an implicit sync completes. Possibly not a bad idea. This is what dd uses with oflag=dsync.

There is zero problem with keeping a queue in your process. You only need to ensure that your TCP side receives data into the queue with low latency, because otherwise TCP will stop the stream and the sender will have to deal with it, which it may not be able to do for long.
Title: Re: Streaming network data to disk
Post by: radar_macgyver on October 29, 2022, 07:19:16 am
The queue between the network receive thread and the file write thread fills up. Once that happens, no more data can be inserted into the queue by the network receiver, and it opts to drop the data rather than stall. The server cannot accept backpressure - it's reading data from an SDR.

The queue (https://pastebin.com/BcS0AUZV) is a circular buffer with semaphores to count the number of entries and signal when it's safe to increment the read or write pointers. The queue length defaults to 1000 packets (~32k each), but can be changed. I found that in operation, the queue depth is at or near zero almost all the time, but due to variable I/O delay, it can fill up to a couple of hundred entries.
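For reference, the general shape of the queue is something like this (a simplified sketch of the idea, not the exact pastebin code):
Code: [Select]
```c
#include <semaphore.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_LEN   1000    /* default queue length, in packets */
#define PACKET_SIZE 32768   /* ~32k per packet */

struct packet { unsigned char data[PACKET_SIZE]; size_t len; };

struct queue {
    struct packet   slot[QUEUE_LEN];
    unsigned        head, tail;     /* head: next write, tail: next read */
    sem_t           free_slots;     /* counts empty slots; init to QUEUE_LEN */
    sem_t           used_slots;     /* counts filled slots; init to 0 */
    pthread_mutex_t lock;           /* guards head/tail */
};

/* Producer (network thread): 0 on success, -1 if full (drop the packet
 * rather than stall the TCP receive path). */
int queue_try_put(struct queue *q, const struct packet *p)
{
    if (sem_trywait(&q->free_slots) != 0)
        return -1;
    pthread_mutex_lock(&q->lock);
    q->slot[q->head] = *p;
    q->head = (q->head + 1) % QUEUE_LEN;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->used_slots);
    return 0;
}

/* Consumer (disk thread): blocks until a packet is available. */
void queue_get(struct queue *q, struct packet *p)
{
    sem_wait(&q->used_slots);
    pthread_mutex_lock(&q->lock);
    *p = q->slot[q->tail];
    q->tail = (q->tail + 1) % QUEUE_LEN;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->free_slots);
}
```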

Hardware: not super modern, but no slouch - AMD FX8350, 16 GB RAM, i219 NIC, a 1TB 7200 rpm SATA disk (xfs, had tried it with ext4 with similar results). It really is not the hardware though, it's poorly managed page caching. Once I use posix_fadvise() to let the kernel know that I don't really need it to cache the data in memory, it flushes it out earlier, rather than waiting and trying to do it all at once.

It's fairly easy to verify that the kernel is indeed keeping the dirty pages for a long time by doing:
Code: [Select]
cat /proc/meminfo | grep -i dirty
Without calling posix_fadvise() or fdatasync(), the dirty cache size keeps growing until all free memory is used, then it decides to write it all at once, causing subsequent write()s to block.

Edit:
I haven't tried a single-thread approach, but that would not work since if the write() to disk blocks, then I'm relying on the server to queue up data that it can't send() to the client. In my (limited) experience, file I/O, especially to mechanical HDDs can have brief spikes in latency, hence the use of a queue. Ideally, the kernel buffering data should do this for me, but that hasn't been my experience so far.
Title: Re: Streaming network data to disk
Post by: radar_macgyver on October 29, 2022, 07:26:12 am
This is misguided and the root of your problem. If your write() doesn't block, that just means the data have been moved from one place in memory to another. And the kernel will accept a lot of data before doing anything about it and it will always keep a fairly long backlog which does nothing good for you, only using up your RAM and increasing the amount of data lost in the event of crash or power loss.
Not if I tell the kernel I'm not interested in keeping the data around by calling posix_fadvise(POSIX_FADV_DONTNEED). I did verify that this works for me, and that the kernel is not keeping the data around by watching /proc/meminfo.

OTOH, if you block the writer thread, then the kernel will have nothing better to do than writing your data to disk. Particularly if the writer is blocked on fdatasync(), explicitly demanding a writeback. BTW, note that this call can return I/O errors, which indicates your data are not saved. And the more data you queue up in the kernel before calling fdatasync(), the less idea you will have of which part had been lost.
I don't actually need an explicit writeback, I just want to tell the kernel not to keep around data that it won't be needing. So rather than call both fdatasync() and posix_fadvise(), I just do the latter. I was mildly surprised that this still blocks, but it doesn't matter once it's being called in a different thread than the one doing the write(). For the same reason, it doesn't make sense to add the O_DSYNC flag.
Title: Re: Streaming network data to disk
Post by: magic on October 29, 2022, 07:49:33 am
Well, fair enough, if this works for you.

My recommendation to sync often is motivated by catching and handling I/O errors, as my background is in software which needs to write data to disk and guarantee that they are there ;)

If you want, you can still get your I/O errors in the return value of close(), just call it explicitly before exiting the program.

Non blocking calls don't wait for completion and therefore are not guaranteed to notify you about errors.
Title: Re: Streaming network data to disk
Post by: radar_macgyver on October 29, 2022, 08:03:53 am
I didn't set O_NONBLOCK, just call fdatasync() from a different thread than write(), so that while the call to fdatasync() blocked, write()s could proceed. This is what Nominal Animal suggested in his first reply. Later, I replaced the call to fdatasync() with posix_fadvise().

If you have the time: https://pastebin.com/wHjCZ7m6

(please don't judge too harshly!)
Title: Re: Streaming network data to disk
Post by: magic on October 29, 2022, 09:12:35 am
Not setting O_NONBLOCK doesn't guarantee that writes will wait until data hit the platter, it just means that they may block sometimes. For example, because a significant fraction of your 16GB has been filled with dirty data and the kernel decided it no longer accepts writes and all your applications that write to any disk slow down to a crawl. Sometimes also those which just need to allocate more memory, because it's all filled up with junk. The default behavior (in absence of fsync/fadvise) is that stupid.

Without syncing there is no guarantee that anything has been written or will be written successfully, even if write() returns success. As far as I see in the manual, posix_fadvise doesn't notify you of I/O errors, but fdatasync() does.

Detecting I/O errors is admittedly kinda irrelevant if you have no plan to deal with them other than using a good disk in the first place. Although it may be helpful to know that they happen, before you waste a few hours recording some data of which only the first few minutes actually survive.

If you want to deal with errors, it's easiest to just keep the data in your process and if write() or fdatasync() fails, either retry the write (and likely get the same error, print a message and exit) or send the data to another disk/machine to be dealt with (high reliability system scenario).

There is not much difference between keeping a queue in your own process vs shoving it as fast as possible into the page cache. You are clogging up RAM either way. With the in-process approach your program maintains full control and ownership of the data and has real time visibility of the actual progress of writing to disk. So I would prefer to just enlarge the internal queue. And fadvise Linux to drop each synced data block before writing another one.
Title: Re: Streaming network data to disk
Post by: radar_macgyver on October 30, 2022, 05:44:03 pm
Detecting I/O errors is admittedly kinda irrelevant if you have no plan to deal with them other than using a good disk in the first place. Although it may be helpful to know that they happen, before you waste a few hours recording some data of which only the first few minutes actually survive.
True - and in this case, I'm mostly using the recorded data to verify performance of the SDR's data acquisition. In normal use, the SDR data is streamed to a real-time signal processor, which reduces the data rate by about a factor of 10.

I'm surprised that Linux's handling of large amounts of writes is the way it is - as you mentioned earlier, this becomes very apparent when writing an ISO to a USB stick: dd happily reports that it's done, but the data is only actually written when you issue a 'sync' prior to ejecting the stick. Wouldn't it be prudent for the kernel to note which PID is doing a lot of writes (as determined by a VM tunable) to a block device and preferentially flush just those dirty pages?
Title: Re: Streaming network data to disk
Post by: magic on October 30, 2022, 06:04:34 pm
It's a double edged sword. In some cases it helps - before I had SSDs in my PC, I was glad that I could download a few gigabyte ZIP archive, unpack it in a second to the page cache and start browsing the contents right away, while the disk was slowly absorbing it all in the background. On the downside, if I deleted the ZIP immediately and then hard-rebooted the machine, both the ZIP and the contents would be lost.

Yes, it's annoying when a single process can completely fill available "dirty bytes" quota and paralyze not only itself but everything else, but that's what it is ::)

There are effective workarounds, as discussed above.
Title: Re: Streaming network data to disk
Post by: Nominal Animal on October 30, 2022, 07:15:30 pm
I'm surprised that Linux's handling of large amounts of writes is the way it is
Except it isn't, because it is tunable, and you're just seeing the effects of the default setting.

It is a complex problem if you consider all the different possible workloads a Linux kernel can be used for, from embedded appliances to HPC clusters, and absolutely everything in between.

For a very simple test, run
    sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'
to flush all currently cached data to disk, and clear all caches.  (This is safe, and will never discard data; the only way it can lose data is if you have an actual storage media write error.)

Then, change the duration after which data starts to be considered for writeback.  It is in centiseconds (hundredths of a second), and defaults to thirty seconds (3000).  Change it to, say, one second:
    sudo sh -c 'echo 100 > /proc/sys/vm/dirty_expire_centisecs'
or equivalently
    sudo sysctl vm.dirty_expire_centisecs=100

There are two main triggers for writeback: online/normal, and background.  They are triggered by the amount of cached data, set either in bytes, or as (integer) percentage of memory available for userspace processes.  Typical desktop settings on x86-64 are 20% normal limit, and 10% background limit.
Set for example a 1 megabyte background limit,
    sudo sh -c 'echo $[1024*1024] > /proc/sys/vm/dirty_background_bytes'
and a 32 megabyte online/normal limit,
    sudo sh -c 'echo $[32*1024*1024] > /proc/sys/vm/dirty_bytes'
or equivalently
    sudo sysctl vm.dirty_background_bytes=1048576 vm.dirty_bytes=33554432

If the workload is such that it keeps modifying the same file, and we use lazytime to reduce the amount of actual storage writes, we may wish to control how old the data can be before it is actually written back to storage.  This is in seconds, and defaults to 12 hours.  In your case, consider something like 15 minutes:
    sudo sh -c 'echo $[15*60] > /proc/sys/vm/dirtytime_expire_seconds'
or equivalently
    sudo sysctl vm.dirtytime_expire_seconds=900



To reset all these back to defaults, first run for N in /proc/sys/vm/dirty* ; do echo "$N: $(cat $N)" ; done to see what the defaults are on your system, then write the ones that default to zeros first, and finally the nonzero values, just like above.  They're run-time tunables, usually set at boot time to values set in /etc/sysctl.conf and/or /etc/sysctl.d/*.conf unless kernel defaults are used; Linux Mint 20.3 on x86-64 definitely uses kernel defaults.

If you find a set of values you like, all you need to do is create a file, say /etc/sysctl.d/20-nominal-animal-prefs.conf, containing say
    # These are the settings corresponding to the example values Nominal Animal mentioned.
    vm.dirty_expire_centisecs = 100
    vm.dirty_background_bytes = 1048576
    vm.dirty_bytes = 33554432
    vm.dirtytime_expire_seconds = 900
and they will be set at next boot automagically.

For testing, you can create a file that contains the defaults, and one or more files with values you test (perhaps commenting them with your testing results or observations), and load one –– noting that only the specified values are changed; nothing is "reset" to a default –– using
    sudo sysctl --load filename

Copying one to /etc/sysctl.d/ will then use those settings on subsequent boots.



See?  I know most users would prefer Linux to just have better defaults, but the thing is, what would be better?

The actual task is to first find the values that work well for you, without causing any annoyances.  The suggestions above are just extreme values I pulled out of my backside, good for perhaps starting with, and perhaps not even that, because I haven't experimented with exactly the kind of workload you're dealing with.  I do often lower the dirty limits, because my memory use is "spiky", and if the dirty writeback is too slow, I may have to wait for a second or so for the kernel to evict dirty data before it can provide backing for my multi-gigabyte memory maps.

Yet, when you have found some settings that work for you better –– the tunables are explained in detail at Linux admin guide, sysctl section (https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html) –– it is very easy to make them stick.
Title: Re: Streaming network data to disk
Post by: magic on October 31, 2022, 08:12:39 am
The kernel default of dirty_expire_centisecs is 30s IIRC.
Reducing this or dirty_background_bytes could help by pushing the data out somewhat earlier, hopefully.
Reducing dirty_bytes is counterproductive when OP's problem is exactly inability to buffer enough of the stream when I/O slows down for whatever random reasons.

None of that is as effective as regular flushing, none of it addresses the page cache pollution.

I regard such tuning as desperate measures against problems with black box applications provided by others, but hardly the go-to choice for your own programs. And it's one set of tweaks for the whole system, good luck on a personal or development desktop machine running all sorts of things at various times.
Title: Re: Streaming network data to disk
Post by: Nominal Animal on October 31, 2022, 12:07:32 pm
None of that is as effective as regular flushing, none of it addresses the page cache pollution.
Of course, that's why I suggested using a third thread to do the writeback asynchronously: no reason to rely on kernel heuristics, when you can tell the kernel to sync and flush the data.  I still think that (the third thread syncing and advising) is the appropriate solution here.

I regard such tuning as desperate measures against problems with black box applications provided by others, but hardly the go-to choice for your own programs.
No, no, that's not what I use the tunables for.  There are two use cases in which I fiddle with them: one is when I build my own workstation, say with two SSDs in software RAID0/1 configuration, and lots of RAM.  As my RAM use is spiky, I do want it to be used for caching, but I also want dirty caches to be flushed to storage early and often.  The defaults don't do that.  The second use case is when I do my own systems integration with e.g. OpenWRT and even my own bare kernel appliances – consider things like dedicated logging from several servers with no direct internet access (for security), media servers, NAS boxes, and so on.  They often have very little RAM, but the storage medium (especially if it is just an SD card) also doesn't like to be hit too often, so tuning the dirty writeback becomes a tradeoff between latency and storage lifetime (initiating writeback only when necessary prolongs the medium's life).
In particular, mounting your system partition read-only by default (only remounting it read-write for updates), and separate partitions for /var/log/ and /tmp can significantly increase the reliability of an appliance/SBC running off an SD card.

I definitely would not use the tunables to "fix" the behaviour of a single program, either.

The point of my posts about the VM tunables, with examples, is that the observed behaviour is not "Linux behaviour", it is just a result of default settings that one can tune if the behaviour overall (as in system-wide) is not something you like.  They are global tunables, affecting all processes, because they exactly affect when and how the kernel does the dirty writeback.  And that although they are trivial to change, the true problem is to find settings that work well for all workloads.

Obviously, for a single process, you want the process to work well regardless of the current kernel settings and heuristics.  That is exactly why the venerable dd (https://www.man7.org/linux/man-pages/man1/dd.1.html) supports oflags like direct, dsync, sync, and nocache.  In my mind, OP asked how to implement similar behaviour to their own program; and the Linux VM dirty parameters are just a sidetrack, related to why one (or dd) would need such options in the first place.

Apologies for the confusion, I should have been clearer.
Title: Re: Streaming network data to disk
Post by: radar_macgyver on October 31, 2022, 10:42:13 pm
For the record, I understood what you were trying to say, Nominal. With the changes made to my program, it works exactly as I expected it to.

It is quite amazing how much one can tweak the behavior of the kernel with the vm tunables, but I was wondering what drove the design decision to apply the flushing thresholds globally rather than per process. I'm guessing if I asked a question like this on LKML or the like, the answer would be along the lines of "rewrite your application, you deranged monkey on mind-controlling substances!"
Title: Re: Streaming network data to disk
Post by: magic on November 01, 2022, 06:10:27 am
There could be an ulimit for that, or something of that sort.

Maybe it's possible with cgroups, dunno :-//
Title: Re: Streaming network data to disk
Post by: Nominal Animal on November 01, 2022, 04:48:53 pm
Maybe it's possible with cgroups
Yes, it is possible with cgroups (https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html) (documentation link).

It is even integrated with the kernel's writeback machinery: with cgroup v2, when both the memory and io controllers are enabled for a cgroup, dirty pages are tracked and writeback is throttled per cgroup, so you can constrain the flushing behaviour of just that group of processes without touching the global sysctls.
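A command-line sketch of constraining a single process's caching and writeback via cgroup v2 (assuming the v2 hierarchy is mounted at /sys/fs/cgroup and you have root; the cgroup name, limits, and device numbers are made up for illustration):

```shell
# Create a cgroup for the recorder, and enable the controllers that
# cgroup-v2 per-group writeback depends on (memory + io):
mkdir /sys/fs/cgroup/recorder
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

# Cap the cgroup's memory footprint, so its dirty page cache stays
# bounded and writeback starts well before RAM fills up:
echo 268435456 > /sys/fs/cgroup/recorder/memory.high

# Optionally throttle its write bandwidth to one device (major:minor):
echo "8:0 wbps=104857600" > /sys/fs/cgroup/recorder/io.max

# Finally, move the recording process into the cgroup:
echo "$PID" > /sys/fs/cgroup/recorder/cgroup.procs
```

The rest of the system keeps the default behaviour; only the processes in the cgroup are affected.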

I was wondering what drove the design decision to apply the flushing thresholds globally rather than per process.
From the kernel side, making the thresholds per-process would be quite complicated.  You'd need a separate writeback elevator (a priority-implementing mechanism) to choose what to write back and when.  Even then, considering how rarely sysadmins touch these tunables, in 99.9% of cases they'd be exactly the same for all processes, so all that overhead would be wasted effort.  In comparison, kernel-wide tunables are straightforward, and cater to the needs of the advanced sysadmins who do touch them –– at least until about fifteen years ago, when some started needing more finely focused tunables for specific workloads (in particular, virtual machines), spearheaded by some Google types.  See cgroup history (https://en.wikipedia.org/wiki/Cgroup).

Do note that there are already ionice (https://www.man7.org/linux/man-pages/man1/ionice.1.html) (which sets the I/O priority at process granularity) and nice (https://www.man7.org/linux/man-pages/man1/nice.1.html) (which sets the CPU priority at process granularity); both can be used to control the "preference" or "importance" of each process.

(The sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync' trick is also useful when you completely switch workloads, as it clears all your caches without losing any data.  In particular, it is useful when doing microbenchmarking from a "cold cache" state (as opposed to subsequent runs, where, given enough memory, the caches are already "hot" with useful data).  Just pointing this out in case you ever find yourself doing that kind of stuff.)

I'm guessing if I asked a question like this on LKML or the like, the answer would be along the lines of "rewrite your application, you deranged monkey on mind-controlling substances!"
No, they just ignore you.  Even LKML nowadays follows a quite woke Code of Conduct (https://www.kernel.org/doc/html/latest/process/code-of-conduct.html), which means that unless you get a response from Greg KH or Michael Kerrisk or someone personally interested in the issue, the kernel developers won't risk/bother responding to you at all (actually, they won't even risk/bother reading your email).  (Not responding is always safe wrt. the Code of Conduct; even when called on it, it is easy to throw out a few excuses that will be accepted.)

Personally, I'd take your suggested response calling me a deranged monkey any time over being ignored with zero response.  I liked the past Linus much more than the carefully mellow Linus, but I'm a perkele-class Finn myself, so likely a tiny minority in this.  :-//

You might get a positive response at kernelnewbies.org (https://kernelnewbies.org/), though.  Even there, I believe the suggestions would be along the same lines as I've posted here.  Best case, you'd get a response from someone who has used splice() in a similar manner (you can do handshaking and transfer data before splicing the descriptors together, as long as the subsequent bulk data is stored as-is and needs no specific responses to continue), maybe with some example code.

It is interesting, because splice() would definitely be more efficient (avoiding at least two copies of the data across the kernel-userspace boundary); but I just feel too lazy/tired/potato to examine it myself right now :-[.