I believe I have located the issue: it is a bug in how PHP handles file locking for sessions. I attached a debugger to PHP during another outage just now, which confirmed it was hung on a file lock request.
To work around this I am evaluating switching session storage to memcached. It seems, though, that SMF doesn't like this much, so I am investigating options right now to try to resolve it.
Update: SMF tries to store too much data in its sessions for memcached to be usable, so I have instead switched it over to session storage in the database. We turned this off years ago because of the performance hit caused by the inability to use InnoDB tables and a lack of server RAM. Neither is the case any longer, so we can return to using the native SMF database session storage.
Does SMF follow the same session locking strategy whether memcached or a table is used, with the only difference being that one is faster than the other? If that is the case (I don't claim to know one way or the other), then won't the session lock still happen under the same circumstances, albeit perhaps with a smaller timing window in the faster case?
At the moment, just based on reading what you've said, I don't see that waiting on an unavailable lock is necessarily a bug. Waiting may not be the best response in some cases, and it might be better for a session to release all locks and try later if some locks are currently held.
Perhaps if gnif hasn't the time to explain the nature of the bug, someone else here who does this stuff can explain it.
PHP can handle sessions using several different methods.
1) Files (default). It creates a /tmp/sess_XXXXX file for each session which contains the PHP serialized data for the session.
2) Memcache. In this instance memcache handles the atomic operations in RAM rather than on the FS, which is extremely fast, but the record size is too constrained for SMF to work with it.
3) Custom. You can register your own handlers and store the session however you want; in the case of SMF it uses a session table in the database.
With files, PHP relies on the flock system call to lock the session file and unlocks it (flock again, with LOCK_UN) at the end of the request. This is normally fine, but on a very busy website, if a PHP process takes an age to complete and gets terminated by the CGI handler (FPM in this instance), the unlock is never called and the file remains locked (PHP processes are reused, so the process doesn't terminate and the kernel doesn't clean the lock up). Then on the next request, PHP hangs without a timeout waiting on its call to flock the session. This should really be handled better, such as by registering the locks with the CGI handler, or something similar. There is only a finite number of PHP handlers running; once they are all hung waiting for locks that will never be released, there are none left for Nginx to pass the request to, and thus the 502.
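To make the "hangs without a timeout" part concrete, here is a minimal sketch in Python rather than PHP, purely as an illustration of how flock behaves on Linux. It is not what PHP's session handler does internally. The helper name lock_with_timeout and the sess_demo file name are made up for the example. It shows that a second opener of a locked file cannot get the lock until the holder releases it, and how a non-blocking attempt with a deadline lets the waiter give up instead of hanging forever.

```python
import fcntl
import os
import tempfile
import time


def lock_with_timeout(fd, timeout=2.0, poll=0.05):
    """Try to take an exclusive flock on fd; give up after `timeout` seconds.

    Hypothetical helper for illustration. A plain blocking flock() would
    wait indefinitely, which is the behaviour described above.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)


def demo():
    # Stand-in for a session file such as /tmp/sess_XXXXX.
    path = os.path.join(tempfile.mkdtemp(), "sess_demo")
    holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    waiter = os.open(path, os.O_RDWR)
    try:
        # First opener takes the lock and "forgets" to release it.
        fcntl.flock(holder, fcntl.LOCK_EX)
        got_while_held = lock_with_timeout(waiter, timeout=0.3)
        # Only an explicit unlock (or closing the descriptor) frees it.
        fcntl.flock(holder, fcntl.LOCK_UN)
        got_after_release = lock_with_timeout(waiter, timeout=0.3)
        return got_while_held, got_after_release
    finally:
        os.close(holder)
        os.close(waiter)
```

Calling demo() makes one attempt while the lock is held and one after it is released. With a deadline like this a worker could at least return an error rather than sit in flock until every handler is stuck.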
With memcache, there is a limit on the size of each stored object, and SMF stores far too much in the session to fit. There is also the issue of persistence: memcache does not guarantee that stored data will be kept, and may evict it to make room for other entries, so there is a chance that session data gets dropped.
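As a rough illustration of the size problem, the sketch below checks whether a serialized session would fit under an item limit. It is in Python, and the numbers are assumptions: 1 MiB is the stock memcached default (tunable with -I), which may not match this server. pickle stands in for PHP's own serializer, and the check ignores the key and per-item overhead. fits_in_memcached is a hypothetical helper, not part of any library.

```python
import pickle

# Assumption: stock memcached item limit of 1 MiB; the real limit on a
# given server may differ.
ASSUMED_ITEM_LIMIT = 1024 * 1024


def fits_in_memcached(session, limit=ASSUMED_ITEM_LIMIT):
    """Return (payload_size, fits) for a session dict.

    Hypothetical helper: serializes with pickle as a stand-in for PHP's
    session serializer and compares the value size alone to the limit.
    """
    payload = pickle.dumps(session)
    return len(payload), len(payload) <= limit


# A small session fits; one carrying a couple of megabytes does not.
small = {"user_id": 42, "theme": "default"}
large = {"cache": "x" * (2 * 1024 * 1024)}
```

A store that rejects oversized values will refuse the second kind of session, which is the failure mode described above.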
With the custom method, in this instance DB storage, it is up to MySQL to handle the locking. MySQL detects when a client disconnects and releases any locks that session held.
I hope this clears things up a little.