Also consider deadlock/livelock that isn't related to the messaging mechanism, but to the applications themselves. That is a user design error, but if different companies produce different parts of the application, then no one person can comprehend the entire system.
Yes; I skipped that in an effort to keep my responses from being so darn long. I know some people hate my walls of text. Sorry!
Deadlock and livelock situations inherently involve more than one lock, contended by the same parties (the code doing the locking). It is an interesting topic, and very important in practical parallel applications and services, but not directly related to asynchrony. "Grab", "try", "hold", and "release" are clear and easily understood concepts when dealing with mutexes; others are needed for rwlocks, semaphores, and condition variables. But, as a whole, nontrivial locking is a subject that can be handled separately.
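To make those mutex concepts concrete in Python terms (since Python comes up again below), here is a minimal sketch of "grab", "hold", "try", and "release" using threading.Lock; the structure is mine, not from any particular codebase:

```python
import threading

lock = threading.Lock()

# "Grab": block until the lock is ours, then "hold" it.
lock.acquire()
try:
    pass  # critical section: we hold the lock here
finally:
    lock.release()  # "release"

# "Try": attempt to grab without blocking; do something else on failure.
got_it = lock.acquire(blocking=False)
if got_it:
    try:
        pass  # we hold the lock
    finally:
        lock.release()
```

The try/finally shape matters: a "grab" that can leak on an exception is exactly how a process ends up holding a lock forever.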
For example, when writing multi-threaded Python code -- and you will want to if you use GTK+ or Qt and do any kind of communication or significant computation in the same process without freezing the user interface -- you can use Queue, a synchronized queue class for passing messages between threads. Now, the most common Python interpreters only execute one thread at a time, but that does not mean threads have no benefits in Python. Using blocking I/O with Python threads (the threading module, or GTK+ or Qt threads) makes a lot of sense, for both code simplicity and efficiency, as Python threads release the interpreter lock for the duration of the blocking call; essentially, such code works as if multiple Python threads did execute at the same time. For heavy computation, you'll want to use separate processes -- the multiprocessing module -- to make use of more than one core for concurrent computation. With this approach, your code may not need any explicit locks at all, and avoiding ordering issues is just a matter of not requiring a specific message order in the Queues. (Combining Arduino and Python to create a user interface for a USB-connected microcontroller is one of my long-term projects; I'd like to write a tutorial about it.)
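A minimal sketch of that pattern (the names and the trivial stand-in "work" are my own illustration): a worker thread blocks on a Queue, and while it waits, the interpreter lock is released so the other thread keeps running:

```python
import queue
import threading

jobs = queue.Queue()     # main thread -> worker
results = queue.Queue()  # worker -> main thread

def worker():
    while True:
        item = jobs.get()      # blocks; releases the GIL while waiting
        if item is None:       # sentinel: shut down cleanly
            break
        results.put(item.upper())  # stand-in for real blocking I/O or work

t = threading.Thread(target=worker, daemon=True)
t.start()

jobs.put("hello")
jobs.put(None)  # tell the worker to exit
t.join()
out = results.get()
print(out)  # → HELLO
```

Note there is not a single explicit lock in that code; the Queues do all the synchronization, which is exactly the point.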
I personally do not have experience in teaching how to avoid deadlock/livelock in multi-lock, multi-party systems. I'm stupid enough to distrust locking schemes that I don't understand, so I tend to rework the structure to avoid the problem entirely. I am aware of the locking issues in the Linux kernel, and how they evolved (from the time when there was just one Big Kernel Lock), but other than how to analyze such schemes with the help of tools like Graphviz graphs, and how to simplify such locking schemes, I don't know much. (I know nothing about current CS research about locking schemes, for example; and haven't used an FPGA yet.)
A dual-core (heterogeneous crossover) processor like the i.MX RT1170 is not something I'd consider so complex that locking would be particularly susceptible to deadlock/livelock, especially since the two cores' tasks would be rather separate. In particular, I believe most use cases would involve atomics instead; and that one of the cores would in most use cases never hold more than one lock at a time, which is one of the cases where lock analysis and avoiding deadlock/livelock is easy.
The only resolution (not solution) to that is liberal use of timeouts when a reply is required or a transmit queue might become full.
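In Python's Queue, for instance, both directions take a timeout directly; a sketch (the timeout values are arbitrary):

```python
import queue

q = queue.Queue(maxsize=1)
q.put("occupied")  # fill the queue so the next put must wait

# Sender side: don't block forever if the transmit queue stays full.
try:
    q.put("message", timeout=0.1)
except queue.Full:
    print("transmit queue full; log, drop, or recover")

reply = q.get()  # drain the slot

# Receiver side: don't wait forever for a reply that may never come.
try:
    q.get(timeout=0.1)
except queue.Empty:
    print("no reply within timeout; assume the peer is stuck")
```

The important part is that both exception handlers exist and do something sensible; a timeout that just crashes the program has not resolved anything.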
Well, you need those to detect unexpected problems, which can always occur in real life. I consider them complementary to a robust design; a catch clause for "but what if a situation occurs where these assumptions do not hold".
Did you really mean complementary? Dealing with the unexpected is at the very heart of creating robust designs.
I was referring to liberal timeouts, the ones that do not have an immediately obvious purpose. (Remember, the context is avoiding deadlock/livelock: things that should not happen if the system is working as designed.)
Timeouts, and checking whether a buffer has enough space, are an integral, low-level part of all robust designs, yes.
Even a robust design makes assumptions about the context, and about what should happen in what order. Typically, this is reflected in how some problems can be worked around while others are fatal. These assumptions are inherent in the design. Remember, a computer program cannot really deal with anything truly unexpected: all you can do is check whether each operation succeeds or fails, and be prepared for failure. (And, obviously, one has to remember that just because a send() succeeded does not mean the data has reached the other end, or is even on the wire yet; and so on.)
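That send() caveat is concrete even at the API level: on a stream socket, a successful send() may have accepted only part of the buffer, and even full acceptance only means the bytes are in the local kernel buffer. A sketch of the standard loop (socketpair is POSIX-only; the demo is my own):

```python
import socket

def send_all(sock, data):
    """send() may accept only part of the buffer; loop until done.

    Even then, success only means the bytes sit in the local kernel
    buffer -- nothing is known about the other end yet.
    """
    view = memoryview(data)
    while view:
        sent = sock.send(view)
        view = view[sent:]

# Demo with a local connected socket pair.
a, b = socket.socketpair()
send_all(a, b"hello")
print(b.recv(5))  # → b'hello'
```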
As an analog for liberal timeouts, consider a C function checking whether a pointer is NULL, even when calling that function with a NULL pointer makes absolutely no sense and you're sure your code never does. Similarly, a networked service might have a timeout for the entire handshake-and-authorization process, so that if it does not complete within, say, half a minute, the connection is dropped (or moved to a tarpit as hostile), even though no error has occurred.
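A sketch of such a whole-operation deadline (the helper class and the half-minute figure are my own illustration; per-step socket timeouts would be derived from remaining()):

```python
import time

class Deadline:
    """Overall time budget for a multi-step operation, e.g. handshake+auth."""

    def __init__(self, seconds):
        self._end = time.monotonic() + seconds

    def expired(self):
        return time.monotonic() > self._end

    def remaining(self):
        # Suitable for per-step timeouts, e.g. sock.settimeout(d.remaining())
        return max(0.0, self._end - time.monotonic())

# Usage sketch: every step of the handshake checks the same deadline.
d = Deadline(30.0)
# for each protocol step:
#     sock.settimeout(d.remaining())
#     ... recv/send one handshake message ...
#     if d.expired(): drop (or tarpit) the connection
print(round(d.remaining()))  # → 30
```

The point is that no individual step has failed when the deadline fires; the timeout guards an assumption ("a legitimate peer finishes the handshake promptly") rather than an operation.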
The "complementary" bit I mentioned is writing defensive code that has extra checks and liberal timeouts, with the purpose of catching errors in the assumptions inherent in the design. Typically, they catch programming errors, and design errors like a deadlock/livelock-susceptible locking scheme. (Unfortunately, POSIX pthreads does not have a function to check whether the current thread is already holding a mutex without affecting the state of that mutex. If there were, sprinkling such checks into code that is not supposed to be called with certain locks held would belong in this category too.) For example, the most commonly used code path might check twice that the connection handle is still valid, and many programmers feel that the second check is unnecessary. However, I may have put it there deliberately, because I believe it likely that a secondary code path will be added in the future that shares the latter check, and I want to catch the most likely errors the implementors of that secondary code path will make. (A typical reason for this is that I've seen similar cases lead to exactly that sort of bug.)
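Python's standard locks don't expose "is this held by the current thread" either, but for debug builds one can sketch a wrapper that tracks the owner (the class and names are my own; a debugging aid, not production-grade):

```python
import threading

class CheckedLock:
    """A mutex that remembers its owner, for defensive assertions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # thread ident of the current holder, if any

    def acquire(self):
        self._lock.acquire()
        # Only the holder ever writes _owner, so this is safe enough
        # for debug assertions.
        self._owner = threading.get_ident()

    def release(self):
        self._owner = None
        self._lock.release()

    def held_by_me(self):
        return self._owner == threading.get_ident()

lock = CheckedLock()

def must_not_hold_lock():
    # Sprinkled into code that must not be entered with `lock` held.
    assert not lock.held_by_me(), "called with lock held: design error"

must_not_hold_lock()   # fine: we don't hold it
lock.acquire()
assert lock.held_by_me()
lock.release()
```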
Others call it paranoia, but I've found it a very useful approach in security-sensitive situations.