At the hardware level and the kernel-userspace interface level, the parts needed to implement a good message passing interface are surprisingly useful for other tasks as well.
For example, from the Berkeley Packet Filter implementation, Linux grew seccomp (usually used via libseccomp, i.e. seccomp_init() et al.). This is essentially a very lightweight bytecode interpreter on the kernel side of each syscall that a process can install. The filter can be made permanent, and is often used to implement sandboxes where only a specific set of syscalls is allowed.
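As a minimal sketch of that use (compile and link with -lseccomp; the allowed syscall set here is just an illustration), a process can restrict itself to a handful of syscalls like this:

```c
#include <unistd.h>
#include <seccomp.h>

int main(void)
{
    /* Any syscall not explicitly allowed kills the process. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx)
        return 1;

    /* Allow only write() and exit_group() for this illustration. */
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) < 0)
        return 1;

    /* Install the filter; it stays in effect until the process exits. */
    if (seccomp_load(ctx) < 0)
        return 1;
    seccomp_release(ctx);

    const char msg[] = "Hello from the sandbox\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    return 0;
}
```

After seccomp_load() succeeds, any syscall outside the allowed set (even an innocent-looking one from a library) terminates the process.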
As an IPC mechanism, message passing works really well (in my experience) with event-driven programming models. That combination is a natural fit when designing an application (say, a simulator), or a distributed implementation of some algorithm like merge sort.
I'd really be interested if anyone is up for the challenge.
A generic approach (something providing a simple API like distributed_msort()) is unlikely to perform that well, because the sources of efficiency boosts are in the details. For example:
- Is the data to be sorted on a specific node, duplicated over all nodes, distributed evenly to the nodes, or distributed unevenly to the nodes?
- Should the sorted results be on a specific node, duplicated over all nodes, or distributed (evenly?) to the nodes?
- Can we collect simple statistics on the data to be sorted beforehand? Specifically, minimum and maximum?
- Can we collect a simple histogram of the data, say 4 × (number of nodes) bins, and combine and distribute the histogram to all nodes?
The last two are steps that can often be done in earlier passes, for example when the data is obtained in the first place.
(The practical example of this kind of thing I like to tell is the classic sort program. If you want to minimize the wall-clock time taken to sort the data, you do not read all the data first and then apply your spiffiest sort function, because the I/O will almost always be the bottleneck. Instead, you use an on-line sort algorithm, say a binary heap, that may use more CPU time but lets you sort each line as it arrives. Each insert is likely to take less time than reading a new line on average, so by the time the last input line is read, the lines are already in sorted order and can be output immediately. You want to avoid waiting idle for an event, and instead use most of those wait times to do useful work.)
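Here is a minimal sketch of that idea (POSIX getline() assumed, error handling kept short): each line is pushed into a binary min-heap as it arrives, so the O(log N) sorting work overlaps the input I/O, and output can begin the moment the input ends.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char  **heap = NULL;   /* min-heap of lines, ordered by strcmp() */
static size_t  used = 0, have = 0;

static void heap_push(char *line)
{
    if (used >= have) {
        have = have ? 2 * have : 1024;
        heap = realloc(heap, have * sizeof heap[0]);
        if (!heap)
            exit(EXIT_FAILURE);
    }
    /* Sift up: move parents down until 'line' fits. */
    size_t i = used++;
    while (i > 0 && strcmp(line, heap[(i - 1) / 2]) < 0) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = line;
}

static char *heap_pop(void)
{
    char  *top = heap[0], *last = heap[--used];
    size_t i = 0, c;
    /* Sift down: move the smaller child up until 'last' fits. */
    while ((c = 2 * i + 1) < used) {
        if (c + 1 < used && strcmp(heap[c + 1], heap[c]) < 0)
            c++;
        if (strcmp(last, heap[c]) <= 0)
            break;
        heap[i] = heap[c];
        i = c;
    }
    heap[i] = last;
    return top;
}

int main(void)
{
    char   *line = NULL;
    size_t  size = 0;

    /* Each O(log N) insert happens while the next line is in flight. */
    while (getline(&line, &size, stdin) != -1)
        heap_push(strdup(line));
    free(line);

    /* Input exhausted: extraction yields the lines in sorted order. */
    while (used > 0) {
        char *s = heap_pop();
        fputs(s, stdout);
        free(s);
    }
    free(heap);
    return 0;
}
```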
In the case of a distributed sort, the local sort is "cheap", and passing the data is "expensive", in terms of 2021 computer hardware.
We can assume O(N log N) time complexity for the local sort (although if you have tens of megabytes to gigabytes of data per node, radix sort will do it in O(N), but have a horrible cache behaviour). Since the slowest node will dictate when the entire operation completes, you'll want each node to do roughly the same amount of work. (Or rather, need about the same wall clock time to do the processing, since that minimizes the overall time taken.)
Because message passing always introduces latency, you'll want to initiate the transfers as early as possible, and preferably do some of the local sort work during the transfers.
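A hedged sketch of that overlap using nonblocking MPI (local_sort() is an assumed helper, and the receiver is assumed to merge the two sorted runs it gets):

```c
#include <mpi.h>
#include <stddef.h>

/* Assumed helper: sorts 'count' ints in place. */
extern void local_sort(int *data, size_t count);

/* Sort the first half, ship it while sorting the second half, so the
 * transfer latency is hidden behind useful local work.  The receiver
 * ends up with two sorted runs and merges them on its side. */
void send_while_sorting(int *data, size_t count, int dest)
{
    size_t half = count / 2;
    MPI_Request req;

    local_sort(data, half);
    MPI_Isend(data, (int)half, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);

    local_sort(data + half, count - half);   /* overlaps the transfer */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Send(data + half, (int)(count - half), MPI_INT, dest, 0,
             MPI_COMM_WORLD);
}
```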
With 2021 hardware, transfer latencies will dominate your wall clock time, especially when using actual network hardware (MPI, even with InfiniBand). Note that this is not a question of bandwidth; it is about how much wall clock time is needed to transfer all the messages. (Even if the interface is implemented via shared memory or DMA, not only does copying the data take its time, but the processes also need to be informed of the events; that means context switches or inter-processor interrupts or similar.)
In general, there are three different approaches to solve the merge/broadcast/distribute problem (following the local sort). This same problem occurs in distributed molecular dynamics simulators as well, when the simulated volume is spatially divided among the computational nodes.
- You find out the pivot points in the sorted data (first and last entry in the sorted data per node), and have each node distribute their data directly to the target nodes
- You merge the data e.g. as a binary tree (in pairs), with the root of the tree broadcasting or distributing the completely sorted and merged data to all others
- You have each node broadcast the data (as all-recipient messages) to each other node.
The first one has the least amount of I/O, but the problem of finding the pivot points in the distributed data is nontrivial. However, usually the pivot points do not need to be exactly correct, as the balance between nodes does not need to be exact, and a few percent difference in the dataset sizes is perfectly acceptable. Thus, various histogram methods with perhaps a post-distribution balancing pass between neighboring nodes (in the sort order) can be quite effective, but complex to implement.
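As a rough sketch of the histogram variant (assuming the bin edges were already agreed on from a known global minimum and maximum; bin and node counts are illustrative), each node can combine the per-node histograms with MPI_Allreduce and pick the approximate pivot bins locally:

```c
#include <mpi.h>

/* Combine per-node histograms and pick approximate pivot bins, so that
 * each of 'nodes' nodes ends up with roughly the same element count.
 * 'pivot_bin' receives nodes - 1 bin indices (the cut points). */
void pick_pivots(const long *local_hist, int bins, int nodes, int *pivot_bin)
{
    long global_hist[bins];            /* C99 variable-length array */
    MPI_Allreduce(local_hist, global_hist, bins, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);

    long total = 0;
    for (int i = 0; i < bins; i++)
        total += global_hist[i];

    /* Walk the cumulative histogram, cutting whenever the running
     * count passes the next 1/nodes share of the data. */
    long running = 0;
    int  cut = 0;
    for (int i = 0; i < bins && cut < nodes - 1; i++) {
        running += global_hist[i];
        while (cut < nodes - 1 && running >= total * (long)(cut + 1) / nodes)
            pivot_bin[cut++] = i;
    }
}
```

Since every node sees the same combined histogram, they all compute the same cut points without any further communication.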
The second one has log_k(nodes) levels (k = 2 for pairwise merging), but on each level, the messages are passed pairwise (or within the merge group, to the shared parent node). This means that most messages are independent, and don't need to be routed through a common choke point, so better use of the theoretical bandwidth available is achieved. The downside is that only a few nodes will do the merge passes, and most nodes will only do the local sort, then send the sorted data to their parent, and then just sit and wait to receive the completed dataset from the root; and the passing of data from root to each node may be a severe performance bottleneck.
(A binary tree merge will have half the nodes idle already at the initial level, so in real life, you might wish to reuse nodes and transfer extra data, just so that as many nodes as possible are doing useful work at each level, and the level count stays as small as possible.)
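A hedged sketch of the pairwise variant (inplace_merge() is an assumed helper that merges two adjacent sorted runs; data[] is assumed large enough to hold the entire dataset):

```c
#include <mpi.h>

/* Assumed helper: merges the sorted runs data[0..mid) and
 * data[mid..total) into a single sorted run data[0..total). */
extern void inplace_merge(int *data, int mid, int total);

/* Pairwise (binary tree) merge: on each level, half the remaining
 * nodes send their run to a partner, which merges it and moves up a
 * level.  Rank 0 ends up with everything, and still has to broadcast
 * or distribute the result afterwards. */
void tree_merge(int *data, int *count, int rank, int nodes)
{
    for (int stride = 1; stride < nodes; stride *= 2) {
        if (rank % (2 * stride) == stride) {
            /* Sender: ship our run to the partner and drop out. */
            MPI_Send(data, *count, MPI_INT, rank - stride, 0,
                     MPI_COMM_WORLD);
            *count = 0;
            return;
        }
        if (rank % (2 * stride) == 0 && rank + stride < nodes) {
            /* Receiver: probe for the partner's run size, then merge. */
            MPI_Status status;
            int incoming;
            MPI_Probe(rank + stride, 0, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &incoming);
            MPI_Recv(data + *count, incoming, MPI_INT, rank + stride, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            inplace_merge(data, *count, *count + incoming);
            *count += incoming;
        }
    }
}
```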
The third one is the obvious one if the sorted data is to be duplicated on all nodes, assuming the message passing interface supports "broadcast", or all-recipient messages. For MPI implementations, the details vary even at runtime, but it is a rather common operation, so something to consider when implementing one's own message passing stuff. The simplest "broadcast" implementation is to use a ring, with each node receiving only from the previous node and sending only to the next node: each node sends its locally sorted data first, then forwards the (node count - 1) sorted subsets it receives to the next one, and while doing the I/O, does a merge pass on its current datasets, ending up with the complete dataset at the end. Next simplest is a tree/graph approach, similar to the second option, where each node distributes its own data to two or more designated nodes; this has the same number of transfers, but fewer steps/levels. Best is kernel or hardware routing support, of course.
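A sketch of that ring, under similar assumptions as before (equal block sizes for brevity, and merge_in() as an assumed helper that merges a sorted block into a sorted, growing result):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Assumed helper: merges the 'n' sorted values in src[] into the
 * sorted dst[0..have), which has room for have + n values. */
extern void merge_in(int *dst, int have, const int *src, int n);

/* Ring "broadcast": each node sends its own sorted block first, then
 * forwards each block it receives, merging arrivals in between the
 * transfers.  result[] needs room for nodes * n values. */
void ring_merge(const int *block, int n, int *result, int rank, int nodes)
{
    const int next = (rank + 1) % nodes;
    const int prev = (rank + nodes - 1) % nodes;
    int *send_buf = malloc((size_t)n * sizeof *send_buf);
    int *recv_buf = malloc((size_t)n * sizeof *recv_buf);
    if (!send_buf || !recv_buf)
        MPI_Abort(MPI_COMM_WORLD, 1);

    memcpy(result, block, (size_t)n * sizeof *block);
    memcpy(send_buf, block, (size_t)n * sizeof *block);

    for (int step = 0; step < nodes - 1; step++) {
        MPI_Request req;
        MPI_Isend(send_buf, n, MPI_INT, next, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(recv_buf, n, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* Merge the block that just arrived; it is also what we
         * forward on the next step. */
        merge_in(result, (step + 1) * n, recv_buf, n);
        int *tmp = send_buf; send_buf = recv_buf; recv_buf = tmp;
    }
    free(send_buf);
    free(recv_buf);
}
```

After nodes - 1 steps, every node has seen every block exactly once, and holds the fully merged dataset.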
A real-world MPI example should really be run on real distributed hardware, since e.g. OpenMPI on a single x86-64 or ARM computer implements the message passing via shared memory; I'm not aware of OpenMPI support for artificially delayed I/O for testing, although that'd be real nice. (That is, an adjustable millisecond/sub-millisecond delay, in wall-clock time, before the recipient is notified that the transfer is complete, to simulate network latencies.)