Author Topic: Python3, simpleflock - problems with multiple processes accessing same file  (Read 5739 times)


Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
(This is on Linux)

I am playing with Python, working towards a project which will have one script writing to a file, and one or more other scripts reading from the same file.  Each script runs in a separate instance of the Python interpreter (I have been advised to avoid Python threads if possible!)
Ultimately I plan to have one script reading analogue sensors, processing the data, and saving it to a file each cycle as a pickled dictionary (on a tmpfs ramdisk).  One or more scripts will then read the file and display the data, serve it over TCP, or whatever.  Writes and reads will be 3 times a second.
At the moment I am just experimenting with a writer script and a reader script, using SimpleFlock with a lockfile to prevent clashes between them.  I don't care if either the writer or the reader cannot acquire the lockfile before the timeout - they will just miss a write / read for one cycle.  (I catch the BlockingIOError thrown by SimpleFlock if it times out.)

Anyway, the reader still gets the odd EOFError, which I assume means it is still reading from the file when the writer truncates that file in order to write the fresh data.
Then, when inserting a delay into the reader to test how the writer copes with not being able to acquire the lockfile, I get "OSError: [Errno 24] Too many open files: 'live_data.pdick'".
I thought my with block would make sure the file got closed every time, so how can I have too many open files?

So what am I doing wrong?  I was hoping that using a lockfile would 100% guarantee that only one process could access the pickle file at a time, with the other just waiting and giving up nicely if it can't get it.  (I have chosen my loop times and inserted delays deliberately to induce clashes in the test code.)

Or is SimpleFlock (or lockfiles in general) just not robust enough for this kind of thing?

Code snippet from the writer (runs every 0.333 seconds):
Code: [Select]
lock_acquire_start = timer()
try:
    with simpleflock.SimpleFlock("live_data.pdick.lock", timeout = FLOCK_TIMEOUT):
        status["lockfile_acquire_time"] = timer() - lock_acquire_start
        with open("live_data.pdick", "wb") as handle:
            pickle.dump(live_data_dick, handle)
except BlockingIOError:
    timed_out = round(timer() - lock_acquire_start, 3)
    logging.warning("Unable to acquire lockfile! Timed out after " + str(timed_out))
    status["lockfile_acquire_timeouts"] += 1


And the reader, which also runs every 0.333 seconds:
Code: [Select]

cycle_start_time = timer()
try:
    with simpleflock.SimpleFlock("live_data.pdick.lock", timeout = FLOCK_TIMEOUT):
        with open("live_data.pdick", "rb") as handle:
            try:
                live_data_dick = pickle.load(handle)
                time.sleep(0.4)
            except EOFError:
                logging.warning("EOF error when reading dickfile!")
                dickfile_failures += 1

except BlockingIOError:
    logging.warning("Unable to acquire lockfile!")
    lockfile_failures += 1


I have used timeouts between 0.1 and 0.01 seconds (again, I was deliberately getting it to time out, catch the exception and carry on, which it does fine).

So why are two processes accessing the same file at the same time?


EDIT: One or more readers.
« Last Edit: November 18, 2018, 07:13:11 pm by Delta »
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
Use a named pipe instead of a shared file: https://linux.die.net/man/3/mkfifo

Write a line to the pipe in one process.

Read a line from the pipe in another process.

Concurrency is better solved by messaging than by shared state.

Edit: to note, locks aren’t necessarily synchronous: between establishing that the lock needs to be written and actually writing the lock file, the other process could have opened the file. Plus, from experience, literally every damn lock implementation seems to be broken.

Edit 2: I wouldn’t pickle it. Make your own serializer that writes the data out as deterministic text lines, be that JSON or something easy to parse such as KV pairs in text. Unix likes text, as do humans when they have to debug it :)

I would use

:field1:field2:field3\n
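Something along these lines, as a rough Python sketch (the FIFO path is just an example, the mkfifo only needs doing once, and note that opening the writing end blocks until a reader opens the other end):
Code: [Select]
# writer side - a sketch, not drop-in code
import os

FIFO = "/tmp/live_data.fifo"            # example path
if not os.path.exists(FIFO):
    os.mkfifo(FIFO)

with open(FIFO, "w") as pipe:           # blocks until a reader opens the FIFO
    pipe.write(":%s:%s:%s\n" % (1.23, 4.56, 7.89))
    pipe.flush()

# reader side
FIFO = "/tmp/live_data.fifo"
with open(FIFO, "r") as pipe:
    for line in pipe:                   # one record per line; loop ends when the writer closes
        fields = line.rstrip("\n").split(":")[1:]
        print(fields)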
« Last Edit: November 18, 2018, 05:29:24 pm by bd139 »
 
The following users thanked this post: Delta

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2300
  • Country: gb
It sounds like you have just one writer which only appends to a file.
In that case, you don't need locks, just use the file as normal in all programs.
Your reader(s) need to handle the case of reading the file when only a partial write has been done.
Say you are writing 100-byte datasets: your reader(s) should check the file size to see whether there are any more complete datasets, i.e. whether floor(filesize/100) is greater than the last time it was checked; if so, read the new data.
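Roughly something like this on the reader side (a sketch assuming 100-byte datasets and a made-up filename):
Code: [Select]
import os

RECORD_SIZE = 100                  # assumed fixed dataset size
records_seen = 0                   # complete datasets already read

def read_new_records(path="data.bin"):
    global records_seen
    complete = os.path.getsize(path) // RECORD_SIZE    # floor(filesize / 100)
    new = []
    if complete > records_seen:
        with open(path, "rb") as f:
            f.seek(records_seen * RECORD_SIZE)
            for _ in range(complete - records_seen):
                new.append(f.read(RECORD_SIZE))
        records_seen = complete
    return new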
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
Use a named pipe instead of a shared file: https://linux.die.net/man/3/mkfifo

Write a line to the pipe in one process.

Read a line from the pipe in another process.

Concurrency is better solved by messaging than by shared state.

I have played with named pipes when playing with Bash scripting; will I not run into problems if the writer is (well, erm) writing to the FIFO but there are no readers?  This is a plausible scenario, and that's why files appealed to me; it doesn't matter whether anything is reading them.
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
I usually fork the reader and writer in the same process so this isn’t usually a problem.

Really I use RabbitMQ for such tasks though as it is persistent, restartable and both ends are transaction aware.
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
It sounds like you have just one writer which only appends to a file.

There is only one writer, but it doesn't append, it overwrites the file. There is only the latest data available.


Bd, also wouldn't a fifo be unsuitable as I want to have multiple readers read the same data?
That the writer can write merrily regardless of whether anything is reading, and that multiple readers can read the same data, is the thing that pushed me towards using a file.

I do appreciate the suggestions of messaging techniques etc, but I'd like to figure this issue out for now.

I can handle (and now understand why I get) the odd EOF error, but why am I getting the "too many open files" error?  It happened when I deliberately caused the writer to frequently time out when trying to acquire the lockfile.  Does this mean that my code is leaving a file descriptor open when flock times out?
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
This may not help much, but I can write to a file at 100/sec and read the file from multiple 'threads' at the same rate. It takes me a while to get up to speed on Python, so I usually avoid it unless necessary, and wrote this in Rebol instead:
Code: [Select]
;http://www.rebol.com/downloads.html
;http://www.rebol.com/docs.html

;----------------
;writer thread
;press ESC to break, Q or Ctrl-C to quit
;----------------
mydata: context [
    dat1: dat2: dat3: 0
]
forever [
    mydata/dat1: random 256
    mydata/dat2: random 256
    mydata/dat3: random 256
    ;write mydata object to loadable format (text representation)
    write %test.txt mold mydata
    wait 0.01 ;write at 100/s
]



;----------------
;reader thread(s)
;press ESC to break, Q or Ctrl-C to quit
;----------------
forever [
    all [
        attempt [ mydata: do read %test.txt ]
        print rejoin [ mydata/dat1 " " mydata/dat2 " " mydata/dat3 ]
    ]
    wait 0.01
]




;----------------
;reader thread
;show where attempt fails
;press ESC to break, Q or Ctrl-C to quit
;----------------
r-count: 0
d-count: 0
forever [
    r-ok: false
    d-ok: false
    all [
        attempt [ mydata: read %test.txt ]
        r-ok: true
        attempt [ mydata: do mydata ]
        d-ok: true
    ]
    all [
        not d-ok 
        print rejoin [ "bad/imcomplete data: " d-count ]
        d-count: d-count + 1
    ]
    all [
        not r-ok
        print rejoin [ "read error: " r-count ]
        r-count: r-count + 1
    ]
    wait 0.01
]
;----------------
;mostly data errors
;a few file read errors
;------------------------------
I've been running a few 'threads' doing the reading (for about 10 minutes), with no problems. I'm sure the key is in the readers: in the Rebol version here, attempt is doing all the work. If the file cannot be read, or if the 'do' on the read data fails (mostly the latter), try again (the 'do' converts the text back into a Rebol data type, in this case an object).

You could probably come up with something similar in Python: if you cannot read the file, or the data does not make sense, try again. I'm not sure what Python has for loading data from a text representation, but I'm sure it can't be too hard to come up with some way to make sure the 'loaded' data is valid.

edit-
I should also add that in my example the 'do' will only succeed if all the chars of the file are read, as the last char is an end-block char (]) - without that last char it will fail (no success on incomplete reads).

and here is a Python version (I'm not a Python guru):
Code: [Select]
#---------------
#writer thread
#---------------
from random import randrange as random
from time import sleep

while True:
    with open("test.txt",'w') as f:
        dat1 = random(256)
        dat2 = random(256)
        dat3 = random(256)
        #each var on a line, end of data marker is ':'
        f.write(str(dat1) + '\n' + str(dat2) + '\n' + str(dat3) + '\n:')
    sleep(0.01)


#---------------
#reader thread(s)
#---------------
from time import sleep

while True:
    try:
        with open("test.txt",'r') as f:
            dat = f.read().split('\n')
    except:
        print('failed open/read')
    else:
        if len(dat) == 4 and dat[3] == ':':
            print(dat[0] + ' ' + dat[1] + ' ' + dat[2])
            sleep(0.01)
        else:
            print('incomplete read')
« Last Edit: November 19, 2018, 03:31:29 am by cv007 »
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6265
  • Country: fi
    • My home page and email address
(I have been advised to avoid python threads if possible!)
Bad rule of thumb. Here's a better one:

Only one thread will execute Python code at a time. Multiple threads can be blocking in a system call, for example reading from or writing to a pipe or a socket.



Ultimately I plan to have one script reading analogue sensors

That is exactly the pattern where Python threads will work just fine.

Use one Python threading thread to read the sensors, and process them.

There are two options on how to handle the data.  If the data is a stream (with readers expected to obtain successive readings, not just a snapshot), use a queue.  If the readers are only interested in snapshots, use a threading.Lock to protect the stored sensor readings; before accessing or modifying them, acquire() the lock, quickly copy the value, and release() it.
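For the snapshot case, roughly (read_sensors() here is just a placeholder for whatever does the actual acquisition and processing):
Code: [Select]
import threading
import time

latest = {}                          # most recent readings
latest_lock = threading.Lock()

def sensor_thread():
    while True:
        readings = read_sensors()    # placeholder acquisition/processing step
        with latest_lock:            # acquire()/release() via the context manager
            latest.update(readings)
        time.sleep(0.333)

def get_snapshot():
    with latest_lock:
        return dict(latest)          # quick copy; the lock is released on return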

Use a separate thread to respond to sensor queries.  If the queries are limited to the local machine, use a Unix domain datagram socket. If the queries are limited to the local area network, use a UDP socket.  This way there are no persistent connections to worry about, and each query is immediately answered with the corresponding response. (You could use a protocol where the request names the sensors it is interested in, with the response containing the corresponding values; perhaps with a special query that returns the list of currently supported sensors. At such low data rates, ASCII text is fine.)
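A rough sketch of such a query/response thread, shown here with UDP (the port and the text protocol are just examples; a Unix domain datagram socket works the same way with an AF_UNIX address):
Code: [Select]
import socket

def query_thread(host="127.0.0.1", port=5005):           # example address
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        request, sender = sock.recvfrom(1024)             # e.g. b"temperature pressure"
        names = request.decode("ascii", "replace").split()
        snapshot = get_snapshot()                         # lock-protected copy, as above
        reply = " ".join("%s=%s" % (n, snapshot.get(n, "?")) for n in names)
        sock.sendto(reply.encode("ascii"), sender)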

For streaming sensor readings, a connection-oriented socket (Unix domain stream, or TCP) is obviously better, but then you also need to worry about the maximum number of connections allowed, and how to detect when the connection drops or the other end falls too far behind.  I personally like to use a separate thread for accepting new connections, passing them via a queue to the request-response thread, and notifying the request-response thread via a pipe.  I prefer asynchronous request-response handlers, based on select or selectors, and non-blocking sockets.
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
I have found why I get the "Too many open files" error.  When I check /proc/PID/fd for the writer I can see loads of FDs for the lockfile, but they all say (deleted).

Code: [Select]
...
33 -> /home/delta/pythontesting/live_data.pdick.lock (deleted)
34 -> /home/delta/pythontesting/live_data.pdick.lock (deleted)
35 -> /home/delta/pythontesting/live_data.pdick.lock (deleted)
.....
A bit of googling says that the limit is not how many files you can have open, but how many have ever been opened within that process.

Is there a way to completely remove the (deleted) file descriptors?
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
Sounds like it didn’t close the file.  It won’t get rid of the FD until it is closed AND unlinked. If you unlink it the FD will stay around until the file is closed. Unlink is called unlink for a reason. The file is only gone when unlinked from the FS and all FDs closed.
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #10 on: November 19, 2018, 08:49:26 am »
Sounds like it didn’t close the file.  It won’t get rid of the FD until it is closed AND unlinked. If you unlink it the FD will stay around until the file is closed. Unlink is called unlink for a reason. The file is only gone when unlinked from the FS and all FDs closed.

Even though the fd is marked as (deleted) when viewing it with ls -l /proc/PID/fd?
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #11 on: November 19, 2018, 09:10:46 am »
That is correct. Here's a quick example now that I'm at a Linux machine:

Code: [Select]
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    FILE *fd;
    fd = fopen("/tmp/ourfile", "w");
    unlink("/tmp/ourfile");
    getchar(); // FIRST PAUSE
    fclose(fd);
    getchar(); // SECOND PAUSE
}

At FIRST PAUSE:

Code: [Select]
$ ls -l /proc/16876/fd
total 0
lrwx------. 1 chris chris 64 Nov 19 09:07 0 -> /dev/pts/2
lrwx------. 1 chris chris 64 Nov 19 09:07 1 -> /dev/pts/2
lrwx------. 1 chris chris 64 Nov 19 09:07 2 -> /dev/pts/2
l-wx------. 1 chris chris 64 Nov 19 09:07 3 -> /tmp/ourfile(deleted)

Note the fd is still there but marked as deleted. At this point you can actually get the file back and undelete it by catting that fd out to a new file, or by using ln to link the inode number back into the filesystem namespace. It's not gone until the fds are closed.

At SECOND PAUSE:

Code: [Select]
$ ls -l /proc/16876/fd
total 0
lrwx------. 1 chris chris 64 Nov 19 09:07 0 -> /dev/pts/2
lrwx------. 1 chris chris 64 Nov 19 09:07 1 -> /dev/pts/2
lrwx------. 1 chris chris 64 Nov 19 09:07 2 -> /dev/pts/2

Note fd is now gone after the close.

So you have opened that file multiple times, which suggests that simpleflock is probably buggy or something is wrong.

Edit: for any casual observers, the first three fds are 0=stdin, 1=stdout, 2=stderr, hence the stream numbers in your shell scripts (1>&2 etc.).
« Last Edit: November 19, 2018, 09:14:20 am by bd139 »
 
The following users thanked this post: Delta

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #12 on: November 19, 2018, 09:18:11 am »
That's good info, and thanks for the example bd.

So is this answer from SO wrong?
Quote
And I want to make it clear that it doesn't matter you remove the files created while python session still running, it will still throws such error.

Think as it's maximum number of files ever created (including deleted) per python session.
To me that implies that even if my process does correctly close the fd each time, I am limited in how many it can ever open.

You are right though, it certainly looks like SimpleFlock is doing something wrong when it fails to acquire the lockfile.


TL;DR: SimpleFlock leaves a (deleted) file descriptor sitting around when it fails to acquire the lockfile.
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #13 on: November 19, 2018, 09:29:36 am »
The answer is totally wrong. ulimit sets the total number of file descriptors that can be open at one time for that process and all child processes. Lots of useless information out there on SO, I find.

Edit: I looked at the source.

https://github.com/derpston/python-simpleflock/blob/master/src/simpleflock.py#L18

It has a while loop that spins until a lock is acquired. Or does it?

Looks like the developer can't RTFM. fcntl.flock doesn't have those arguments; fcntl.lockf does  :palm:  LOCK_EX / LOCK_NB aren't part of the underlying POSIX call and are Python extensions. This is consistent across Python 2.7 and 3.

Throw simpleflock in the bin! Also, looking at the issue tracker, there is an issue, four years old, that has not been dealt with. Trash!

I'd just use the simple fcntl API described here: https://docs.python.org/3/library/fcntl.html#fcntl.lockf
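Something like this, roughly (a non-blocking attempt; the docs say lockf raises OSError if the lock can't be taken with LOCK_NB, so catch that and just skip the cycle):
Code: [Select]
import fcntl
import pickle

live_data_dick = {"example": 123}                        # stand-in for the real data

with open("live_data.pdick", "ab") as handle:            # "ab" so open() never truncates
    try:
        fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:                                      # lock held elsewhere - skip this cycle
        pass
    else:
        handle.truncate(0)                               # wipe the old data only while holding the lock
        pickle.dump(live_data_dick, handle)
        # the lock is released when the file is closed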

« Last Edit: November 19, 2018, 09:40:42 am by bd139 »
 

Offline TomS_

  • Frequent Contributor
  • **
  • Posts: 834
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #14 on: November 19, 2018, 11:46:25 am »
Make your own serializer that writes the data out as a deterministic text lines, be that JSON ...

If you're just going to use JSON, don't bother writing your own serialiser, just

Code: [Select]
import json
and use an existing library.
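Roughly (field names made up):
Code: [Select]
import json

live_data = {"temperature": 21.4, "pressure": 1013.2}    # made-up fields

# writer side
with open("live_data.json", "w") as handle:
    json.dump(live_data, handle)

# reader side
with open("live_data.json") as handle:
    live_data = json.load(handle)
The same locking (or atomic replace) considerations still apply, of course.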

You could, of course, always Base64 encode your dumped pickle to send it in "plain text".  ^-^
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6265
  • Country: fi
    • My home page and email address
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #15 on: November 19, 2018, 12:15:49 pm »
I fully agree with what bd139 said about that simpleflock thing.

To read and replace a file with advisory record locks, you only need something like
Code: [Select]
import fcntl

def save_file(path, contents, encoding="UTF-8"):
    data = bytes(contents, encoding)
    # "r+b" so the existing file is not truncated before the lock is taken
    with open(path, "r+b") as handle:
        fcntl.lockf(handle.fileno(), fcntl.LOCK_EX)   # exclusive (write) lock
        handle.write(data)
        handle.truncate(len(data))                    # drop any old data beyond the new contents
        handle.close()                                # closing also releases the lock

def read_file(path, encoding="UTF-8"):
    with open(path, "rb") as handle:
        fcntl.lockf(handle.fileno(), fcntl.LOCK_SH)   # shared (read) lock
        data = handle.read()
        handle.close()
    return data.decode(encoding)
The read_file(path) returns the contents of the file as a string, taking an advisory shared record lock while reading the file.

The save_file(path, contents) replaces the contents of the file, with the new contents specified as a string.  It takes an advisory exclusive record lock (on Linux) while modifying the file.

Exclusive lock is also called a write lock, because it is taken when the target is modified, and is exclusive.  Shared lock is also called a read lock, because it is taken when the target is read but not modified, and so multiple readers/shared locks/read locks are allowed at the same time.  Exclusive locks and shared locks are not allowed at the same time.

"Advisory" in this context means that the locking only works with co-operating processes. Reads and writes are not affected at all by these locks; only the lock operations themselves.

If you use only the above two functions to read and modify a file, and no OSError exceptions occur, then you can be sure that each process gets a valid snapshot of the file, no matter how many processes you have reading and writing to that file.
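For example, the writer loop from the first post could then be reduced to roughly this (assuming the file already exists, since save_file() opens it with "r+b", and assuming JSON as the text representation; read_sensors() is a placeholder):
Code: [Select]
import json
import time

while True:
    live_data = read_sensors()                            # placeholder acquisition step
    save_file("live_data.json", json.dumps(live_data))    # exclusive lock taken inside
    time.sleep(0.333)
and a reader simply does json.loads(read_file("live_data.json")), with the shared lock taken inside.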
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #16 on: November 19, 2018, 06:49:48 pm »
Thanks for all the advice, I'm on my phone now so can't post any code snippets just now.

I think my issue was (as bd pointed out) that SimpleFlock is a pile of crap.
I ditched it and blundered my way through using fcntl.lockf.
I used it in blocking mode, and I no longer get loads of file descriptors littering the place, and don't need to use a separate lockfile.

However when cranking the reader up to read every 10ms, I still get lots of EOF errors!
Is this type of locking just not atomic enough?
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #17 on: November 19, 2018, 06:51:32 pm »
Nominal, thanks for the examples, I'll have a go with them when I'm back on the computer.
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #18 on: November 19, 2018, 07:09:28 pm »
If you’re getting a lot of EOFs it is likely because the file is being truncated when you open it with “wb” and there’s another file handle pointing past EOF.

Using a queue / fifo makes this stuff much easier. It is a concurrency primitive basically.
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
Re: Python3, simpleflock - problems with multiple processes accessing same file
« Reply #19 on: November 19, 2018, 07:53:34 pm »
If you’re getting a lot of EOFs it is likely because the file is being truncated when you open it with “wb” and there’s another file handle pointing past EOF.
Should my locks not prevent that from happening?

Quote
Using a queue / fifo makes this stuff much easier. It is a concurrency primitive basically.

I've played with fifos in Bash, and queues in Python.
I couldn't find a good way to reliably:
1. Be able to write even if nothing is reading.
2. Have multiple (and varying numbers of) readers get the same data.
I've even resorted to broadcasting UDP on 127.255.255.255.
 

Offline DeltaTopic starter

  • Super Contributor
  • ***
  • Posts: 1221
  • Country: gb
If you’re getting a lot of EOFs it is likely because the file is being truncated when you open it with “wb” and there’s another file handle pointing past EOF.

Using a queue / fifo makes this stuff much easier. It is a concurrency primitive basically.

I had forgotten about this thread, so thought I'd post back to say thanks.  You were absolutely correct there - the problem was my opening the file with "wb" - I now know that opening a file for writing immediately truncates it; and obviously the file cannot be locked before it is opened!

I now open the file as "ab", and then before writing the fresh data I do a truncate(0) to wipe it.  The file is locked during this operation, so the reader can't get at it, so no more EOF errors!  :)

Thanks also to NA for the code snippets.

I open the file outside the infinite writing loop, leaving it open the entire time the script is running, and just locking/unlocking for each write.  I assume this is the best way, rather than open/lock/write/unlock/close each time.
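For anyone finding this later, the writer loop now looks roughly like this (simplified, not my exact code; read_sensors() stands in for the acquisition/processing):
Code: [Select]
import fcntl
import pickle
import time

handle = open("live_data.pdick", "ab")         # "ab": opening does not truncate the file
while True:
    live_data_dick = read_sensors()            # placeholder for the acquisition/processing step
    fcntl.lockf(handle, fcntl.LOCK_EX)         # blocking exclusive lock
    handle.truncate(0)                         # wipe the old data only while holding the lock
    pickle.dump(live_data_dick, handle)
    handle.flush()                             # push the new data out before unlocking
    fcntl.lockf(handle, fcntl.LOCK_UN)
    time.sleep(0.333)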
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23032
  • Country: gb
Cool glad to hear it is solved.

Regarding when to lock it depends on what you are doing and to some degree the weather. If you’re doing small writes then that is correct.

Edit: perhaps ironically I’m actually dealing with a file locking bug today :(
 

Offline MarkR42

  • Regular Contributor
  • *
  • Posts: 139
  • Country: gb
I realise I'm a bit late to the party, but,

How about writing the file under a new name and using rename() to move the new file over the old one? (e.g. open('mydata.bin.temp', 'wb').write(b'hello'); os.rename('mydata.bin.temp', 'mydata.bin'))

Renaming a file happens atomically, and if another process has just opened the old file, it simply continues to read the old file; a reader never sees a partially completed file.
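Roughly, in Python (os.replace does the atomic rename; filenames are just examples):
Code: [Select]
import os
import pickle

def publish(data, path="live_data.pdick"):
    tmp = path + ".tmp"
    with open(tmp, "wb") as handle:            # write the new snapshot to a temporary file
        pickle.dump(data, handle)
        handle.flush()
        os.fsync(handle.fileno())              # make sure the bytes hit the (ram)disk first
    os.replace(tmp, path)                      # atomic: readers see either the old or the new file
Readers then just open and unpickle the file, with no locks needed.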
 

