Basically, I manually implemented a mini virtual-memory engine: things that are offered for free by the Linux kernel, as you pointed out.
Right; sometimes the memory-mapping approach just isn't viable. Another example would be a low-powered embedded device that provides access to some large database, where time is not as big a factor as the RAM footprint is: then I would use low-level I/O as well.
For database stuff, I use POSIX memory mapping with the Linux-specific MAP_NORESERVE, (...)
Yes, that would be the best approach, but it is not fully portable...
Speaking of SQLite, I've used it, but I admit I haven't taken a look at how they do it.
It is actually quite nice, implementing its own pager (page cache), with memory-mapping support on many OSes (even partial maps, not simply "map this entire file" stuff).
For implementing low-level I/O access to a binary database-like file, I do recommend taking a look at the POSIX pread() and pwrite() functions. They take a file descriptor, a pointer to the buffer, the size of that buffer (noting that it is not guaranteed that all of it is read or written!), plus the offset at which the read/write should start. These do not affect the file position, you see.
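Here is a minimal sketch of what using them looks like (the file name data.db and the 128-byte record size are just placeholders of my own):

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    unsigned char  record[128];                     /* hypothetical fixed-size record */
    int            fd = open("data.db", O_RDONLY);  /* hypothetical database file */

    if (fd == -1) {
        fprintf(stderr, "data.db: %s.\n", strerror(errno));
        return 1;
    }

    /* Read record #7, without touching the file position at all.
       Note that pread() may return fewer bytes than requested. */
    ssize_t n = pread(fd, record, sizeof record, (off_t)7 * sizeof record);
    if (n == -1)
        fprintf(stderr, "Read error: %s.\n", strerror(errno));
    else if ((size_t)n != sizeof record)
        fprintf(stderr, "Short read: got only %zd bytes.\n", n);

    close(fd);
    return 0;
}
```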
For portable code that needs to do reads and writes to the same stream, I would consider implementing wrapper functions

    size_t file_read(FILE *stream, void *buffer, size_t size, size_t count, off_t offset);
    size_t file_write(FILE *stream, const void *buffer, size_t size, size_t count, off_t offset);

with the functions returning zero with errno set to indicate the error, if an error occurs; otherwise the count of successfully read or written records of size bytes each.
On Linux and Unix systems that do provide unlocked stdio, they can be made thread-safe by locking the stream handle (using flockfile()/funlockfile()). They'd do an fseek() to the specified offset, and the write variant an fflush() afterwards (before releasing the stream handle).
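An untested sketch of the read side, assuming POSIX fseeko(); the write side would be analogous, with the fflush() added just before unlocking:

```c
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>   /* off_t */

size_t file_read(FILE *stream, void *buffer, size_t size, size_t count, off_t offset)
{
    size_t  n;

    if (!stream || (!buffer && size > 0 && count > 0)) {
        errno = EINVAL;
        return 0;
    }

    flockfile(stream);   /* keep other threads off this stream for now */

    if (fseeko(stream, offset, SEEK_SET) == -1) {
        funlockfile(stream);
        return 0;        /* errno set by fseeko() */
    }

    /* The stream lock is recursive, so a plain fread() is fine here;
       on glibc one could use fread_unlocked() to skip the extra locking. */
    n = fread(buffer, size, count, stream);

    funlockfile(stream);
    return n;            /* on error, the underlying read sets errno */
}
```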
On systems that do have the file descriptor abstraction (basically all; Windows just calls them handles instead of descriptors), I would consider

    size_t fd_read(int desc, void *buffer, size_t size, size_t count, off_t offset);
    size_t fd_write(int desc, const void *buffer, size_t size, size_t count, off_t offset);

with three different implementations: one for Linux, BSDs, and Unix systems having pread() and pwrite(); one for those that do not; and one for Windows; selected via pre-defined compiler macros. I would also use a compile- or run-time option, or perhaps add a sixth flags parameter, so that if desired, the operation takes an advisory record lock via fcntl(). This provides "atomic" accesses across processes that do take advisory locks (but not across threads in the same process).
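A sketch of the pread()-based variant, with that sixth parameter; the FD_LOCK flag name and the macro selection are my own assumptions, and the lseek()+read() and Windows (ReadFile() with an OVERLAPPED offset) variants are only stubbed out:

```c
#include <errno.h>
#include <sys/types.h>

#define FD_LOCK  1   /* hypothetical flag: take an advisory record lock */

#if defined(_WIN32)
/* Windows variant: ReadFile() on the HANDLE obtained via _get_osfhandle(),
   with the offset carried in an OVERLAPPED structure. Omitted for brevity. */
#elif defined(__linux__) || defined(__unix__) || defined(__APPLE__)
#include <unistd.h>
#include <fcntl.h>

size_t fd_read(int desc, void *buffer, size_t size, size_t count, off_t offset, int flags)
{
    struct flock  lk = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                         .l_start = offset, .l_len = (off_t)(size * count) };
    ssize_t       n;

    if (size < 1 || count < 1) {
        errno = EINVAL;
        return 0;
    }

    /* Optionally serialize against other cooperating processes. */
    if ((flags & FD_LOCK) && fcntl(desc, F_SETLKW, &lk) == -1)
        return 0;   /* errno set by fcntl() */

    n = pread(desc, buffer, size * count, offset);

    if (flags & FD_LOCK) {
        const int saved_errno = errno;
        lk.l_type = F_UNLCK;
        (void)fcntl(desc, F_SETLK, &lk);   /* release the advisory lock */
        errno = saved_errno;
    }

    if (n == -1)
        return 0;              /* errno set by pread() */
    return (size_t)n / size;   /* complete records; a short read yields fewer */
}
#else
/* Fallback variant: lseek() + read(), with the same optional fcntl() lock.
   Omitted for brevity. */
#endif
```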
As you can see, the POSIX approach offers many more options (even avoiding the file position mess completely), and that's why it is better suited to mixed read and write accesses to binary data. Of course, one could say you just switch to a different set of pitfalls, because short reads and writes are always possible in practice (i.e., you may need more than one call, in a loop, to get all the data you want), and because some systems, like Linux, limit a single read or write call to just under 2 GiB (because of historical bugs in certain filesystem kernel drivers).
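The usual antidote for both is a loop; a sketch, where the pread_all() helper name, the 1 GiB chunk cap, and the ENODATA-at-EOF convention are all choices of my own:

```c
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

/* Read exactly 'size' bytes at 'offset', looping over short reads.
   Returns 0 on success, or -1 with errno set on error. */
static int pread_all(int desc, void *buffer, size_t size, off_t offset)
{
    unsigned char  *p = buffer;
    const size_t    chunk_max = (size_t)1 << 30;  /* stay well under 2 GiB */

    while (size > 0) {
        const size_t   chunk = (size < chunk_max) ? size : chunk_max;
        const ssize_t  n = pread(desc, p, chunk, offset);

        if (n > 0) {
            p      += n;
            offset += n;
            size   -= (size_t)n;    /* short read: just keep going */
        } else if (n == 0) {
            errno = ENODATA;        /* premature end of file */
            return -1;
        } else if (errno != EINTR) {
            return -1;              /* real error; errno already set */
        }
        /* On EINTR (interrupted by a signal), simply retry. */
    }

    return 0;
}
```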
Thinking about this further, I would claim that the solution here, in this particular case, is not to add a call before and/or after each standard I/O call, but to create wrapper functions that implement the logical read and write operations one needs. These wrapper functions need to take care of the fflush()/fseek(), obviously, but the "trickiness" is then restricted to those wrapper functions.
My own mind needs this kind of tool to work well on complex applications and problems. Not only does it let me concentrate on the issues at the correct complexity level (from the nitty-gritty details up to the highest concept level: "okay, so how are the users going to do their thang with this app?"), it also lets me unit test such wrappers and, after testing, trust them. That means that whenever there is a bug, I have sort-of automatically limited the scope of that bug, simply by observing which function reports the problem first. (That also means my programs are often full of "unnecessary" error checks, with people offhandedly mentioning that "that call can never fail". Ha! In a perfect world, yes. In my world, everything I touch can fail.)