EEVblog Electronics Community Forum
Products => Computers => Programming => Topic started by: SiliconWizard on August 19, 2021, 05:09:25 pm
-
Message-passing looks like a clean, simple and trouble-free way of handling synchronization and data sharing between concurrent tasks (threads / processes / ...).
Some languages, like XC from XMOS, use this paradigm exclusively, as far as I can tell. But it's well adapted to their XMOS architecture. In most concurrent programming approaches to this day, though, memory sharing and all kinds of associated synchronization means (mutexes, semaphores, you name them...) seem to be the "standard way" of doing it.
What are your thoughts about this? The pros of message-passing only are relatively simple to identify. Now there must be cons as well. What are they, in your opinion, that could explain why this approach is not really mainstream yet?
One con I see - but I could be wrong, it would really need to be properly benchmarked in real cases - is that if you're dealing with very large memory areas, for instance for some heavy number crunching, passing all data through messages from one 'task' to another could have a lot more overhead. But as I said, I'm not completely sure this is even true, if the message passing mechanism itself is efficient.
Anyway. Thoughts? Because it looks to me as though that would solve a whole lot of issues with concurrent programming that we are still suffering from.
-
Back in the early 90's, I did an internship in Occam (a CSP/π-calculus language).
It was very easy to write explicitly parallel code and I got more accomplished in that summer internship than I probably would have if I'd tried to do it in C with threads.
One of my tasks was to double the throughput of the system (by adding additional processing nodes, configuring them, and adjusting the code to take advantage of them). That turned into just a two-day task for a lowly intern, and about half of that was the mechanical and wiring work to rewire the system.
Since then, I've mostly coded in "traditional" languages, though I've done some dabbling in Erlang.
Why isn't this style of programming more popular? My strong suspicion is that the ecosystems around the JVM, Python, and JavaScript are wildly richer, such that most problems are easier to solve in those languages, and for most applications concurrency isn't the make-or-break issue for the company.
Have a look at/play around with Erlang, OTP, and Elixir. I enjoyed fiddling around with them, but the problems I face at work are, by and large, not concurrency-heavy where items are processed in a mix of parallel and serial styles. (We have some massively parallel problems, but they're of the form "branch and never rejoin", so we can handle them with traditional queue processing, each in isolation.)
If you've used Discord, WeChat, or WhatsApp, you've used a system based on Erlang. If you've paid using Klarna, same.
Those are applications (particularly the chat ones) where concurrency is a major subset of the complexity of the application.
-
xC+xCORE is an implementation of Hoare's Communicating Sequential Processes computational model. That was designed so that CSP properties could be proved mathematically.
Message passing can be used as an implementation technique for CSP, but is usually used in other ways without the characteristics of CSP.
Message passing is apparently a favoured technique in high performance computing, which has historically pushed computation hardware and software to the breaking point and beyond, and continues to do so.
Its principal advantage is that data is not shared and the processing can be done on a remote machine, hence scaled with minimal and well-defined coherence requirements.
Its principal disadvantage is that data is not shared and the results of processing have to be recombined somewhere.
An engineering challenge is that you have to balance computation and communication costs: bigger computation lumps tend to imply more information has to be communicated.
-
Message passing is crazy old (like most things in CS). The Windows event loop is based on message passing, and that is hardly the first time it's been done. If message passing works for your situation, great! But no one has invented a solution to parallel programming that solves all problems. It's a bit like saying that dependency injection is great, so why isn't all code just dependency injection? It solves one problem one way; if that's what you need, go for it.
To give you an example, consider transactional memory. It's based on what has been used in databases forever, but applied to memory. You can have two threads reading and writing a state in transactions, and transactional memory will ensure that the state is always consistent. That works great in parallel code, but doesn't solve every parallel problem.
Lock-free data structures are great too. Why pass a message or take out a lock if the operation you need to perform can be done in parallel without a lock?
The idea of avoiding working with the basic synchronization primitives and instead using a higher-level model is very good. My point is, there isn't a single higher-level model that solves all problems.
-
Lock-free data structures are great too.
As long as someone else writes them.
Why pass a message or take out a lock if the operation you need to perform can be done in parallel without a lock?
With most lock free programming it's more like as long as it is statistically likely it can be done in parallel ... and otherwise you try again.
Lock free programming feels to me like going back to CSMA from a fully switched network.
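To make the "try again" part concrete, here's a minimal Go sketch (illustrative only; a real counter would just use atomic.AddInt64) of a lock-free compare-and-swap retry loop:
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var total int64
    var wg sync.WaitGroup
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                for {
                    old := atomic.LoadInt64(&total)
                    // The CAS fails if another goroutine got there first;
                    // in that case we simply retry with the fresh value.
                    if atomic.CompareAndSwapInt64(&total, old, old+1) {
                        break
                    }
                }
            }
        }()
    }
    wg.Wait()
    fmt.Println(total) // 4000
}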
-
Have a look at/play around with Erlang, OTP, and Elixir. I enjoyed fiddling around with them, but the problems I face at work are, by and large, not concurrency-heavy where items are processed in a mix of parallel and serial styles.
I know Erlang, and all concurrent programming in it is based on spawning processes and message passing between them, AFAIR. So that's a good example of what I'm talking about. Elixir is the same.
Go has channels, which do essentially the same thing, as far as I understand. I don't know Go well enough to know if it supports shared memory and other synchronization means.
-
My point is, there isn't a single higher-level model that solves all problems.
Anybody that made such a claim would be a snake oil salesman. Nobody here has made such a claim.
-
I know Erlang
Erlang is good, but Elixir is pure crap; what have you programmed in Erlang?
-
Why did I put an Erlang VM in kernel space? To directly take advantage of InfiniBand :D
This is *the* only way, message-passing with no compromise!
-
Since then, I've mostly coded in "traditional" languages, though I've done some dabbling in Erlang.
Why isn't this style of programming more popular? My strong suspicion is that the ecosystems around the JVM, Python, and JavaScript are wildly richer, such that most problems are easier to solve in those languages, and for most applications concurrency isn't the make-or-break issue for the company.
Have a look at/play around with Erlang, OTP, and Elixir. I enjoyed fiddling around with them, but the problems I face at work are, by and large, not concurrency-heavy where items are processed in a mix of parallel and serial styles. (We have some massively parallel problems, but they're of the form "branch and never rejoin", so we can handle them with traditional queue processing, each in isolation.)
Until recently, sequential computer power was increasing so rapidly that it was easy to wait a few months and get whatever speedup parallel processing might have brought. That's no longer true, even to the least visionary developers! Useful ecosystems developed around such "adequate" technologies.
But where a technology is no longer adequate, new technologies with a solid theoretical base become advantageous. Hopefully a solid theoretical basis will make it easier to develop an ecosystem than it is for one where several "conveniences" have either been glued together or have accreted over time. CSP and message passing are clean and useful abstractions, but we need more.
-
I know Erlang
Erlang is good, but Elixir is pure crap; what have you programmed in Erlang?
I've learned Erlang and thus have done a few exercises with it, nothing significant. I just know how it works and the concepts. I don't know Elixir much, but was under the impression it was just a variant of Erlang. Nothing major? So I don't know what would make it pure crap compared to its older brother. But it's off-topic here.
-
In high-performance computing, MPI (https://en.wikipedia.org/wiki/Message_Passing_Interface) (either OpenMPI or vendor-provided implementation, often on top of InfiniBand) is extremely common, even between processes on the same machine.
(I like to separate "concurrent" into the terms "parallel" and "distributed", using distributed for parallel processing not on the same machine, and parallel for processing done on a single machine. Just my own convention, and thus it must always be described before use.)
Remote memory-mapping techniques are not responsive enough to be used in distributed HPC. It turns out that typical scientists writing simulators don't really grasp the current hardware that well, with things like threads being outside their typical skillset. Message passing interfaces provide exactly the kind of mental model that lets them implement various algorithms efficiently. In comparison, I've seen similar algorithms implemented using OpenMP (thread-based parallelism), with atrocious performance. (It is common for MPI-using programmers to claim it is more efficient than thread-based parallelism.)
(Indeed, most widely used distributed simulators still do computation and communication as separate steps, instead of interleaving them so that the processor(s) would do maximal work in the shortest possible time. And it is easier to buy more hardware, than hire the expensive developers that know how to implement those simulators more efficiently. Personal experience.)
So, while theoretical performance is interesting, it is the real world that one needs to concentrate on.
I can do hybrid parallel-distributed simulations, no problem, with say MPI used between machines, but threads and shared memory within each machine.
However, it turns out that batch processing systems (or rather, their admins) don't really like that sort of complexity at all, even though one might show a bit of efficiency boost that way: it is a maintenance burden, especially compared to single-threaded MPI processes (and their batch queue configuration).
Fortunately, even OpenMPI uses shared memory for messaging between processes on the same machine, so the difference (to native shared memory use between processes on the same machine) in performance is basically irrelevant: just a couple of memory copies, really.
Exactly how you implement the message passing, makes a BIG difference.
In MPI, I use async I/O a lot, with often about a dozen different messages in flight at the same time (some to/from the same processes, others to/from different ones). As long as you do it correctly (follow the model, with unique tags identifying each message type, and don't expect any particular message arrival order), it works extremely well, even when running on top of different MPI implementations. Typically, there is an extra thread in the process that does the I/O, that is completely controlled by the MPI libraries, and not really accessible to the program at all, even when writing Fortran.
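Not MPI itself, but the same discipline can be sketched in Go terms (the tags and payloads here are made up): several messages are in flight at once, each carries a tag identifying its type, and the receiver dispatches on the tag rather than on arrival order:
package main

import "fmt"

// Hypothetical message: the tag identifies what the payload means,
// precisely so that arrival order does not matter.
type msg struct {
    tag     string
    payload int
}

func main() {
    ch := make(chan msg)
    // Several "in flight" messages, sent concurrently in no particular order.
    for i := 0; i < 3; i++ {
        i := i
        go func() { ch <- msg{tag: "descriptor", payload: i} }()
        go func() { ch <- msg{tag: "bulk", payload: i * 1000} }()
    }
    // The receiver dispatches on the tag, never on arrival order.
    for n := 0; n < 6; n++ {
        m := <-ch
        switch m.tag {
        case "descriptor":
            fmt.Println("got descriptor", m.payload)
        case "bulk":
            fmt.Println("got bulk chunk", m.payload)
        }
    }
}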
I often have small chunks of descriptive data, and then large chunks or streams of bulk data (size and properties described by the descriptive data). Because this is common for HPC, MPI implementations MUST handle both well: otherwise, they'd perform poorly in common real-life scenarios.
Similarly, when designing a message passing interface (the API, I mean), one needs to make very darn sure the API can handle the real-world use cases.
I'd consider an asynchronous interface for bulk data a necessity (and I don't mean nonblocking, I mean true async transfers).
If used for IPC, I'd require the possibility of point-to-point links for bulk data, instead of using the "common bus" (whatever it might be) for everything. (This is also the deficiency in D-Bus (https://en.wikipedia.org/wiki/D-Bus), and the reason it is completely unsuitable for bulk data.)
So, my cons would be deficiencies in the API, especially a lack of async bulk data transfer support (and especially if bulk data transfers block short messages, say due to buffer bloat or such), nothing specific to message passing as a technique.
The massive-bandwidth cases, where immense amount of data is continuously transferred, and extra memory copies affect the achievable bandwidth (and thus message-passing interfaces have unwanted overhead), are in my experience very rare exceptions, and need hardware-specific implementations anyway; they're so rare and special that they cannot really be used as a con against message passing in general.
-
People want to try some Go. This seems to kill a lot of the shortcomings of message passing APIs.
channel <- message   // send
message := <-channel // receive

select { // wait on whichever channel is ready first
case m := <-channel:
    // use m
case p := <-channel2:
    // use p
}
Channels are buffered or not. Add goroutines. Channels can have channels sent over them too. That's about it. I've written an implementation which bridges this over the network transparently, with some limitations. I'm working on sending channels over channels over the network, which allows fully notifiable remote cancellation of async tasks :popcorn:
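As a rough illustration of channels-over-channels (a toy sketch, not my actual implementation; the names are arbitrary), a request can carry its own reply channel plus a cancel channel:
package main

import (
    "fmt"
    "time"
)

// A request carries its own reply channel (a channel sent over a channel),
// plus a cancel channel the caller can close to abandon the work.
type request struct {
    input  int
    reply  chan int
    cancel chan struct{}
}

func worker(requests chan request) {
    for req := range requests {
        select {
        case <-req.cancel:
            // Caller gave up; drop the work.
        case req.reply <- req.input * 2:
            // Result delivered on the per-request channel.
        }
    }
}

func main() {
    requests := make(chan request)
    go worker(requests)

    req := request{input: 21, reply: make(chan int, 1), cancel: make(chan struct{})}
    requests <- req

    select {
    case r := <-req.reply:
        fmt.Println("result:", r)
    case <-time.After(time.Second):
        close(req.cancel) // notify the worker we no longer want the result
    }
}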
Slightly different domain here, i.e. massive distributed systems. Regarding bulk transfers: they don't scale and they do block. This is a solved problem: we disallow them and enforce maximum message sizes and priority tagging. If you want to do fully asynchronous transfers of large messages you have to split them and implement reassembly. This vastly simplifies QoS enforcement. The other option is just sending references to the bulk data and leaving it where it is. This seems to work nicely if you're using something in AWS where you end up with an S3 bucket containing a few hundred terabytes (our record is 1200 TiB in one so far :-DD)
-
In high-performance computing, MPI (https://en.wikipedia.org/wiki/Message_Passing_Interface) (either OpenMPI or vendor-provided implementation, often on top of InfiniBand) is extremely common, even between processes on the same machine.
Yes, indeed.
Exactly how you implement the message passing, makes a BIG difference.
Absolutely. Both in terms of performance, of course, but also in terms of security. Because for the latter, the benefits of message passing (not having to share any memory) can be at least partially obscured by the fact that you would be doing this using libraries that themselves use shared memory and classic synchronization for passing messages.
Ideally, hardware would provide much better facilities for message passing. Some CPUs include tightly-coupled hardware mailboxes to pass messages between cores, bypassing the memory subsystems entirely. Now that's an interesting approach both in terms of performance and security.
To me, the "security" point is major here. Ensuring good (and provable) security with message passing is much easier than with shared memory.
The massive-bandwidth cases, where immense amount of data is continuously transferred, and extra memory copies affect the achievable bandwidth (and thus message-passing interfaces have unwanted overhead), are in my experience very rare exceptions, and need hardware-specific implementations anyway; they're so rare and special that they cannot really be used as a con against message passing in general.
I don't have evidence to back this up, but this is also my impression. And you can almost always design your "tasks" so that they minimize the amount of data passed between them. This isn't easy though. But programming using this paradigm requires specific algorithms IMO. You can't just use message passing as you would use shared memory, but just transferring the same data blocks through messages. That wouldn't make sense. As I said though, it isn't easy, and is probably one of the reasons why the approach is not more popular.
-
People want to try some Go. This seems to kill a lot of the shortcomings of message passing APIs.
Yes indeed. (I mentioned Go above.)
I don't like all of Go, but Go channels and goroutines are definite winners here.
-
Go is nuanced and compromised I find. In all the right places :popcorn:
I’d rather write everything in Scheme or SBCL but no one wants to pay me to do that.
-
Channels can have channels sent over them too.
That makes a surprisingly big difference, and can be crucial for agent/delegate type applications/services (at the systems programming level).
All issues with bulk transfers can be avoided, if you can use the channel to send a pipe or any kind of point-to-point bulk link to each recipient.
I like that more than splitting the bulk into chunks, as then the bulk transfers do not load the base channel.
But programming using this paradigm requires specific algorithms IMO. You can't just use message passing as you would use shared memory, but just transferring the same data blocks through messages.
Exactly.
One of my pet peeves is trying to get programmers using MPI to use the async interfaces. I've had more than one "MPI expert" indignantly spout loudly all sorts of crap about "async interfaces being fragile/not well supported", just because they themselves haven't learned to do async I/O properly. And that is just async features of MPI, not a different paradigm (like shared memory vs. message passing).
-
Some CPUs include tightly-coupled hardware mailboxes to pass messages between cores, bypassing the memory subsystems entirely. Now that's an interesting approach both in terms of performance and security.
Yep, I've seen quite a few Linux kernel drivers with those, from Intel Atom audio engine (SST) to TV cards. Not too rare with PCIe devices that do active data processing. The mailbox on these is on the device, and exposed on the host computer via memory mapping.
At the opposite end of the security spectrum is FireWire (OHCI in particular), and the DMA attacks (https://en.wikipedia.org/wiki/Firewire#Security_issues) by nefarious devices.
-
Remote memory-mapping techniques are not responsive enough to be used in distributed HPC.
or too expensive and complex :D
The Flex ASIC was introduced in 2017 and is used in the HPE Superdome Flex. On paper it's capable of ~850 GB/s of peak bisection bandwidth for up to a 32-socket system and 48 TB of coherent shared memory; however, in practice, when you approach programming it, it's no piece of cake, and squeezing out performance and reliability is tricky, especially with languages like C.
-
One of my pet peeves is trying to get programmers using MPI to use the async interfaces. I've had more than one "MPI expert" indignantly spout loudly all sorts of crap about "async interfaces being fragile/not well supported", just because they themselves haven't learned to do async I/O properly. And that is just async features of MPI, not a different paradigm (like shared memory vs. message passing).
On *learning*: I do consider the cost of acquiring that knowledge.
There are several grades of certifications for things introduced from 2009 (2012, 2014) to 2017. If you want to be hired for a job with HPE technology where you are supposed to *know* how to do async I/O properly, you have to pay a lot of money for courses and training.
Just to learn *how* to program properly with the best technology available on the planet. I think only big companies can afford it, so only lucky programmers can really master it :-//
-
The XMOS processors that make such heavy use of message passing do have their limitations sometimes.
The XC language used to program them really tries to make sure you never use the same piece of memory or resource from multiple threads. This is for a good reason, since most of the 'fun' multithreading bugs happen due to those. However, there is a way to force it into using shared memory. The language supports mixing with regular C, and C supports pointers, so you can use C to exchange pointers between two threads and then just start accessing each other's memory using that smuggled pointer. I found this trick used in some of the official example applications that they host on their own website. So even the people working at XMOS got fed up with using message passing for some things.
Of course, when using this trick on an XMOS processor with multiple cores (called 'Tiles' by their marketing wank), you have to make sure you are running this between threads (called 'Cores' by their marketing wank) that exist on the same CPU core, since each core has its own dedicated RAM. But you have to be mindful of that anyway, since messages between chips are slower than messages between threads on the same die, so you want to keep things that communicate a lot together anyway.
So if the XC language tries so hard to keep you from using shared memory, why did the programmers use the dirty hack to circumvent that? Well, it's mostly speed. For some things shared memory is just plain faster. If you need random access to elements of an array sitting in another thread, then you need to message the thread what you want and wait for a response; the other thread might also be busy and not answer right away. In this case it can be orders of magnitude faster to simply go grab it from memory yourself. And even if the other thread responds instantly, it's still more instruction cycles than a simple memory load instruction. So the XMOS programmers use shared memory for things like framebuffers or large cross-thread FIFO buffers. They are very performant and perfectly safe as long as you use them in a thread-safe manner.
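The round-trip cost can be illustrated with a rough Go analogue (not xC, and the names are made up): one goroutine owns the buffer and serves random-access reads over channels, versus simply indexing the shared memory directly:
package main

import "fmt"

// Hypothetical read request: ask the owning goroutine for one element.
type readReq struct {
    index int
    reply chan int
}

func owner(buf []int, reads chan readReq) {
    for r := range reads {
        r.reply <- buf[r.index] // one message round-trip per element
    }
}

func main() {
    buf := []int{10, 20, 30, 40}

    // Message-passing access: correct by construction, but every random
    // access costs a request/reply round trip to the owning goroutine.
    reads := make(chan readReq)
    go owner(buf, reads)
    req := readReq{index: 2, reply: make(chan int, 1)}
    reads <- req
    fmt.Println("via messages:", <-req.reply)

    // Shared-memory access: a single load, but now *you* are responsible
    // for making sure nothing is writing buf concurrently.
    fmt.Println("direct:", buf[2])
}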
But for things that operate in a way that fits messages they are really nice. For example it makes sense to have a network card with mailboxes where threads have the incoming traffic pile up in the inbound mailbox and they can drop off any transmit traffic in the outbound mailbox. That way you could have a super fast single NIC feed lots of CPU cores without one core having to be the middleman that decides what traffic goes where.
-
it makes sense to have a network card with mailboxes where threads have the incoming traffic pile up in the inbound mailbox and they can drop off any transmit traffic in the outbound mailbox. That way you could have a super fast single NIC feed lots of CPU cores without one core having to be the middleman that decides what traffic goes where.
Sure, my infiniband card has it :)
-
The XMOS processors that make such heavy use of message passing do have their limitations sometimes.
The XC language used to program them really tries to make sure you never use the same piece of memory or resource from multiple threads. This is for a good reason, since most of the 'fun' multithreading bugs happen due to those. However, there is a way to force it into using shared memory. The language supports mixing with regular C, and C supports pointers, so you can use C to exchange pointers between two threads and then just start accessing each other's memory using that smuggled pointer. I found this trick used in some of the official example applications that they host on their own website. So even the people working at XMOS got fed up with using message passing for some things.
All technologies have their limitations, of course, and the important point is to recognise them and work around them as far as possible. As I stated in reply #2 above, "An engineering challenge is that you have to balance computation and communication costs: bigger computation lumps tend to imply more information has to be communicated."
Of course, when using this trick on an XMOS processor with multiple cores (called 'Tiles' by their marketing wank), you have to make sure you are running this between threads (called 'Cores' by their marketing wank) that exist on the same CPU core, since each core has its own dedicated RAM. But you have to be mindful of that anyway, since messages between chips are slower than messages between threads on the same die, so you want to keep things that communicate a lot together anyway.
That's too simplistic. They are right to distinguish between threads, cores, tiles and chips.
Usually there is one thread per core, but if some simple programming conventions are observed, multiple threads can be merged onto a single core. The development tools automatically merge such threads and tell you if it isn't possible. That is an advantage of having a simple, clean language: the tools can help you.
There are eight cores in a tile, and memory can be shared between threads running on a single tile.
There are, currently, up to 4 tiles per chip. Message passing is used between tiles.
You can have multiple interconnected chips. Message passing is used between chips.
Pleasingly, message passing is also used with I/O. That cleanly emphasises that off chip (hardware) processes are equivalent to on-chip (software) computation processes.
So if the XC language tries so hard to keep you from using shared memory, why did the programmers use the dirty hack to circumvent that? Well, it's mostly speed. For some things shared memory is just plain faster. If you need random access to elements of an array sitting in another thread, then you need to message the thread what you want and wait for a response; the other thread might also be busy and not answer right away. In this case it can be orders of magnitude faster to simply go grab it from memory yourself. And even if the other thread responds instantly, it's still more instruction cycles than a simple memory load instruction. So the XMOS programmers use shared memory for things like framebuffers or large cross-thread FIFO buffers. They are very performant and perfectly safe as long as you use them in a thread-safe manner.
The key point is that shared memory - in any form - is not scalable. That could be concealed until relatively recently, but we've hit a wall that can no longer be ignored.
Some limitations at small scales are equivalent to limitations at large scales. They always have to be dealt with, and message passing is one of the few concepts that scales well and operates at small and large scales. If you have others, let us know; I suspect they will make you famous!
In a conventional desktop processor, some of the cache memories aren't shared between cores. That causes coherence problems for hardware designers and software designers. Many of the latter don't understand enough about processors to realise the problem exists: witness the lack of comprehension about what various C/C++ keywords don't mean!
In a server system main memories become a limitation. Some companies (e.g. one of Andy Bechtolsheim's) have tried building AMD systems with too many chips. The cache coherence messages dominated useful memory read/writes. In HP's top-end SuperDome servers, there was internal partitioning that couldn't be ignored.
In HPC server farms, shared memory is a non-starter, of course.
But for things that operate in a way that fits messages they are really nice. For example it makes sense to have a network card with mailboxes where threads have the incoming traffic pile up in the inbound mailbox and they can drop off any transmit traffic in the outbound mailbox. That way you could have a super fast single NIC feed lots of CPU cores without one core having to be the middleman that decides what traffic goes where.
That can introduce its own problems, of course. "Big buffers" in networks can cause catastrophic failures when retransmission messages timeout inside long buffers.
-
People want to try some Go. This seems to kill a lot of the shortcomings of message passing APIs.
The Go channel constructs are directly descended from Hoare's CSP channels. Ditto the xC channels.
Languages with such constructs are the future. Those without are tomorrow's COBOL.
-
What's supposed to be the difference between true asynchronous and mere non blocking? Somewhere there's a dynamic queue which will likely fail catastrophically when it's full (if async programmers weren't too lazy to take into account full queues they wouldn't be async programmers). Do you call it async when a system layer does the queuing and non blocking when the library does it?
-
I am not saying that message passing between XMOS threads/cores/chips is a bad design.
Just that it is not the single perfect solution to the problem. It works great for some stuff, while not so much for others. In the case of the XMOS chip it turns out it's not the solution to rule them all, especially because in this type of system shared memory is actually very attractive. The whole tile can access all of the RAM with single-cycle latency, so communicating between 'cores' through memory is actually a lot easier than in a typical x86 machine.
Sure, there is the expandability of XMOS where you can just stick together more and more chips to create this tightly woven cluster that behaves as one machine. However, I've never seen this actually put to good use. I worked with XMOS chips quite a bit; I have a bunch of their dev boards still lying around and was even a moderator on their official forums. The most common use of these chips' innovative multithreading features was for background bitbanging of peripherals on the IO pins, since you didn't get a UART or SPI or I2C. Building very large, complex applications was awkward because those often need one big chunk of memory, so it might force you to split up the application into threads just to make it fit into the memory of two chips. The whole thing was made worse by the chips executing code from RAM: with the early chips only having 64 KB of RAM, the code ate up most of the memory. On the other hand, the threads doing bitbanged UART would only need a few hundred bytes of memory (though most of that, again, was code storage). The actual applications for the XMOS chips became mostly USB, Ethernet, audio, etc. that regular MCUs can also do, because they have peripherals that handle all those things. I still think the architecture is really neat, but unfortunately it appears to be a solution in search of a problem.
Multithreading in large clusters of servers is a whole different can of worms that requires different solutions, message passing is indeed a good fit.
Then you have things like GPUs that perform many TFLOPS of computation spread across 1000 to 10000 cores, and yet it's all pretty much done using shared memory. It's just that in this case the workload is particularly well fitted for multithreading. In fact, sharing memory actually helps it run faster, since cores working on separate nearby pixels will require mostly the same vertex and texture data, so this only needs to get read from external RAM once and can then be used to feed multiple computational units. The only reason this works is that each core grabs its own pixel and runs with it, so cache coherence is nearly nonexistent. If you try to read recently written data then what you get is the luck of the draw: sometimes you might get old data, might be new data, might be garbage. But graphics typically don't need to do this, so it doesn't matter.
-
Message passing with language support can still pass single-owner references without losing any of the logic of CSP. Or make data read-only to allow shared references.
Lock-free programming, of which z-buffer rendering can be an example, should be the clearly delineated exception (or, in Rust terms, unsafe code). It's simultaneously bug-prone and hard to debug. None of the CSP logic applies.
-
Anybody that made such a claim would be a snake oil salesman. Nobody here has made such a claim.
I assure you the claim is common.
Python programmers - a single global lock is a fine solution for all parallel programming
Go programmers - CSP is the way to go, everything should be CSP it solves everything
Javascript programmers - Don't do things in parallel, use async and run sequential. If you really need parallel, run a cluster of node processes, see, no need for threads.
Docker programmers - Don't use shared memory or locks, everything should be over network sockets. Pretend like your processes are on different computers even if they aren't.
Yes, I've known developers who thought this. The original post understood there are cons to CSP; my point isn't so much the cons, it's that CSP just doesn't solve all parallel problems.
-
What's supposed to be the difference between true asynchronous and mere non blocking?
The POSIX C terminology differs a bit from MPI terminology. In POSIX C:
Nonblocking: O_NONBLOCK, MSG_DONTWAIT. The send or receive operation is done synchronously, but may fail (with "please retry later", errno having value EWOULDBLOCK or EAGAIN).
Asynchronous: The send or receive operation is *started*. It may succeed or fail later. The send or receive operation returns immediately, but the actual sending or receiving of data occurs simultaneously while the program keeps running. See e.g. POSIX aio (https://man7.org/linux/man-pages/man7/aio.7.html) (man 7 aio).
MPI calls asynchronous operations "nonblocking" (see e.g. MPI_Irecv() (https://www.open-mpi.org/doc/v4.1/man3/MPI_Irecv.3.php), MPI_Isend() (https://www.open-mpi.org/doc/v4.1/man3/MPI_Isend.3.php)) and describes them as starting the operation, with the buffer contents being accessed after the call returns, so the buffer must not be modified until the operation completes. It also calls normal blocking operations "synchronous".
I believe the POSIX C terminology is clearer: normal operations are synchronous, either blocking or nonblocking. Asynchronous operations are only started by a function call, and complete independently of the program execution, not before the function call returns like synchronous operations. This definition of the terms seems to be the typical one in e.g. CS articles at ACM.
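A rough analogue in Go terms (not the POSIX or MPI APIs themselves, just the concepts): a select with a default branch behaves like a nonblocking call, while starting the transfer in a goroutine and observing its completion later behaves like an asynchronous one:
package main

import "fmt"

func main() {
    ch := make(chan int) // unbuffered: a plain send would block

    // "Nonblocking" in the POSIX sense: attempt the operation synchronously,
    // and fail immediately with "would block" instead of waiting.
    select {
    case ch <- 42:
        fmt.Println("sent")
    default:
        fmt.Println("would block, try again later") // like EAGAIN/EWOULDBLOCK
    }

    // "Asynchronous" in the POSIX aio / MPI_Isend sense: the operation is
    // only *started* here, proceeds while we keep running, and completion
    // is observed separately.
    done := make(chan struct{})
    go func() {
        ch <- 42 // the actual transfer happens concurrently
        close(done)
    }()

    // ... other useful work could go here ...

    fmt.Println("received:", <-ch) // the transfer completes
    <-done                         // and we observe its completion
}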
-
People want to try some Go. This seems to kill a lot of the shortcomings of message passing APIs.
The Go channel constructs are directly descended from Hoare's CSP channels. Ditto the xC channels.
Languages with such constructs are the future. Those without are tomorrow's COBOL.
I'm increasingly coming to the same conclusion.
-
Just a "halftime" thought. It looks like many arguments made in this thread are about performance, almost exclusively. I mentioned security as a strong point of message-passing, and I think only tggzzz also mentioned it as part of the CSP concept and how it makes correctness provable in concurrent systems. Now I apologize if I missed points related to security and/or correctness from other people. If this is the case, please chime in.
As to performance, as tggzzz also mentioned, MP scales a lot better than shared memory mechanisms. And, I think you have to look at the bigger picture as well: using MP on a system that is not optimized for it (typically a general-purpose modern computer) will of course not enable one to take full advantage of it. But if you think of a whole system architected around this, you can actually remove almost all memory protection measures (or at least greatly simplify them), and remove all cache coherence mechanisms as well. Both can be relatively expensive.
Also, well-crafted benchmarks can tell us a thing or two about compared performance, in particular for parallel algorithms dealing with large data sets, for which you'd suspect shared memory would be a lot more efficient. There are a number of papers about this, and the conclusions may surprise some of you.
As I said, *efficient* implementation of parallel algorithms on large data sets using MP is usually a lot harder than with shared memory though. A "fun" exercise is to implement an efficient "parallel" merge sort using MP only, able to do as well as, or maybe even better than, using shared memory. My guess is that you're going to spend a lot of time. But that is what I call computer science. =)
-
There are lots and lots of security details an implementation has to worry about if used as a privilege separation mechanism, like authorization tests having time-of-test-vs-time-of-use races, leaking message passing channels inadvertently to child processes, and so on.
One surprisingly hard problem is intentional or accidental denial-of-service attacks by spamming too many messages. Implementing heuristics and bandwidth shaping sounds easy, but making them unavoidable/mandatory isn't.
Unix domain sockets are a pipe-like construct within a single machine, but allow passing ancillary data, including open file descriptors and sockets. One possible denial-of-service attack on some architectures is to open random file descriptors, pass them via a Unix domain socket to an unsuspecting privileged service, and immediately close the local copies. Depending on the implementation, the unexpectedly received descriptors may take up space in the descriptor table in the recipient, which is of finite size, and can thus lead to the privileged service unexpectedly running out of descriptors it does not know it has open. Passing credentials (ancillary data verified by the kernel to match the sending process) is also vulnerable to time-of-test-vs-time-of-use races (by causing the client to execute a new binary right after the credentials check, while keeping the socket open, with the new binary providing nefarious data), which can be difficult to wrap one's mind around when implementing privileged delegates like, say, Apache SuEXEC.
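For reference, the descriptor-passing mechanism itself is small; a minimal Go sketch over a socketpair (errors ignored, purely illustrative) looks like this:
package main

import (
    "fmt"
    "net"
    "os"
    "syscall"
)

func main() {
    // A connected pair of Unix domain sockets standing in for client/server.
    pair, _ := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
    connA, _ := net.FileConn(os.NewFile(uintptr(pair[0]), "a"))
    connB, _ := net.FileConn(os.NewFile(uintptr(pair[1]), "b"))

    // Sender: attach an already-open descriptor (stdin here) as SCM_RIGHTS
    // ancillary data alongside a one-byte payload.
    oob := syscall.UnixRights(int(os.Stdin.Fd()))
    connA.(*net.UnixConn).WriteMsgUnix([]byte{0}, oob, nil)

    // Receiver: the kernel installs a brand-new descriptor in this process's
    // descriptor table -- the very thing the DoS scenario above abuses.
    buf, oobBuf := make([]byte, 1), make([]byte, 128)
    _, oobn, _, _, _ := connB.(*net.UnixConn).ReadMsgUnix(buf, oobBuf)
    msgs, _ := syscall.ParseSocketControlMessage(oobBuf[:oobn])
    fds, _ := syscall.ParseUnixRights(&msgs[0])
    fmt.Println("received a new open descriptor:", fds[0])
}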
Even very simple message channels, like the Linux kernel early logging ring buffer, can be overwhelmed by unexpectedly large number of messages. (This has already happened, when a certain service decided that when DEBUG was used in the kernel boot command line, it'd spew so much data the initial kernel log messages were lost. And this was just a temporary disagreement, amicably resolved with a bit of diplomacy.) A nefarious program can use something similar to hide any tracks of its wrongdoing, or even use message stuffing to widen a timing race window in a buggy privileged service, making the race window exploitable.
A "fun" exercise is to implement an efficient "parallel" merge sort using MP only, able to do as well, or maybe even better than using share memory. My guess is that you're going to spend a lot of time. But that is what I call computer science. =)
Perhaps, perhaps not, considering that the tape sort variant of merge sort is old hat.
Most HPC clusters demand a minimum scalability factor, if you wish to use a specific simulator with multiple cores in parallel. This means that not only does the message passing need to be efficient, but the amount of work (or rather, the time taken to do the work) per core needs to be roughly equal, as the overall time taken is measured by the slowest core.
Now that adds some really interesting wrinkles to the problem.
(If you're like me, you also like to do the communications in parallel with the calculations. Ouchie. The most interesting variant to me is the one where the sorted dataset also needs to be distributed across the cores so that each core has a roughly equal continuous section of the sorted data. This significantly reduces the amount of data that needs to be transferred.)
Doing the same on shared memory might sound much easier, but there, cacheline ping-pong can really ruin the performance. Doing the sorting directly on the data in the shared memory is not going to be the most efficient method (except on architectures without any data caching); doing local sorts in per-thread/per-core memory will be faster (assuming sufficiently large dataset, of course, that using multiple cores/threads makes sense in the first place).
-
Oh don’t get me started on logging DoS attacks. We did it to ourselves :-DD
-
As I said, *efficient* implementation of parallel algorithms on large data sets using MP is usually a lot harder than with shared memory though. A "fun" exercise is to implement an efficient "parallel" merge sort using MP only, able to do as well as, or maybe even better than, using shared memory. My guess is that you're going to spend a lot of time. But that is what I call computer science. =)
Without having considered that problem, my inclination would be to start by understanding the algorithms used for sorting large datasets stored on magnetic tape.
Edit: nominal animal beat me to that!
-
Doing the same on shared memory might sound much easier, but there, cacheline ping-pong can really ruin the performance. Doing the sorting directly on the data in the shared memory is not going to be the most efficient method (except on architectures without any data caching); doing local sorts in per-thread/per-core memory will be faster (assuming sufficiently large dataset, of course, that using multiple cores/threads makes sense in the first place).
Caches are great on average. But apparently trivial changes anywhere in the entire application's code or in the dataset can degrade performance by orders of magnitude.
-
As I said, *efficient* implementation of parallel algorithms on large data sets using MP is usually a lot harder than with shared memory though. A "fun" exercise is to implement an efficient "parallel" merge sort using MP only, able to do as well as, or maybe even better than, using shared memory. My guess is that you're going to spend a lot of time. But that is what I call computer science. =)
Without having considered that problem, my inclination would be to start by understanding the algorithms used for sorting large datasets stored on magnetic tape.
Edit: nominal animal beat me to that!
Yep, I also thought about that. I remember reading about it in a book on algorithms, but I do not remember the details. It wasn't completely trivial though, AFAIR, and I'm not even sure they really could achieve the N*log(N) complexity of merge sort. But I'll have to dig this up. And, whereas it was meant to be relatively efficient with data only available sequentially (from tape), it wasn't meant to be done in parallel, as far as I can remember. So, whereas there are good ideas to take from there, it probably requires some additional work.
I'd really be interested if anyone is up for the challenge. I will look it up as well, because chances are, it has already been done.
Something like this in Erlang should be fun to look at. But you could have a shot at it in Go. Or OpenMPI.
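For the flavour of it, here is a minimal Go sketch of the idea (a toy, not a tuned HPC implementation; chunk sizes and worker counts are arbitrary): each goroutine sorts its own chunk, and sorted chunks are only ever handed over through channels and merged pairwise, so no memory is shared between workers.
package main

import (
    "fmt"
    "math/rand"
    "sort"
)

// merge two already-sorted slices into a new one
func merge(a, b []int) []int {
    out := make([]int, 0, len(a)+len(b))
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        if a[i] <= b[j] {
            out = append(out, a[i])
            i++
        } else {
            out = append(out, b[j])
            j++
        }
    }
    out = append(out, a[i:]...)
    return append(out, b[j:]...)
}

// sortMP sorts data using message passing only: each chunk is owned by one
// goroutine at a time, and ownership moves over channels. No locks, no shared state.
func sortMP(data []int, workers int) []int {
    if len(data) == 0 {
        return nil
    }
    chunks := make(chan []int, workers)
    n := (len(data) + workers - 1) / workers
    pieces := 0
    for i := 0; i < len(data); i += n {
        end := i + n
        if end > len(data) {
            end = len(data)
        }
        chunk := append([]int(nil), data[i:end]...) // each worker owns its own copy
        pieces++
        go func(c []int) {
            sort.Ints(c) // local sort, no shared state
            chunks <- c  // hand ownership over the channel
        }(chunk)
    }
    // Pairwise merging: take any two sorted chunks, merge them in a fresh
    // goroutine, and put the result back, until a single chunk remains.
    for ; pieces > 1; pieces-- {
        a, b := <-chunks, <-chunks
        go func(a, b []int) { chunks <- merge(a, b) }(a, b)
    }
    return <-chunks
}

func main() {
    data := rand.Perm(20)
    fmt.Println(sortMP(data, 4))
}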
-
Doing the same on shared memory might sound much easier, but there, cacheline ping-pong can really ruin the performance. Doing the sorting directly on the data in the shared memory is not going to be the most efficient method (except on architectures without any data caching); doing local sorts in per-thread/per-core memory will be faster (assuming sufficiently large dataset, of course, that using multiple cores/threads makes sense in the first place).
Caches are great on average. But apparently trivial changes anywhere in entire application's code or in the dataset can degrade performance by orders of magnitude.
That's broken unfortunately in some places. Caches are now horrible as we have to share them with other tenants on the same virtualized excrement >:(
Oh it runs at 20% the speed in AWS but the CPU is newer generation? :-DD
-
On the hardware level and kernel-userspace interface level, the parts needed to implement a good message passing interface are surprisingly useful in other tasks as well.
For example, from the Berkeley Packet Filter implementation, Linux grew seccomp (https://man7.org/linux/man-pages/man2/seccomp.2.html) (usually used via libseccomp, i.e. seccomp_init() (https://man7.org/linux/man-pages/man3/seccomp_init.3.html) et al.) This is essentially a very lightweight bytecode interpreter on the kernel side of each syscall, that a process can install. The filter can be made permanent, and is often used to implement sandboxes where only a specific set of syscalls are possible.
As an IPC mechanism, message passing works really well (in my experience) with event driven programming models. This approach tends to work well when designing an application (say, a simulator), or say a distributed implementation of some algorithm like Merge sort.
I'd really be interested if anyone is up for the challenge.
A generic approach (something providing a simple API like distributed_msort()) is unlikely to perform that well, because the sources of efficiency boosts are in the details. For example:
- Is the data to be sorted on a specific node, duplicated over all nodes, distributed evenly to the nodes, or distributed unevenly to the nodes?
- Should the sorted results be on a specific node, duplicated over all nodes, or distributed (evenly?) to the nodes?
- Can we collect simple statistics on the data to be sorted beforehand? Specifically, minimum and maximum?
- Can we collect a simple histogram of the data, say 4×nodes number of bins, and combine and distribute the histogram to all nodes?
The last two are steps that can often be done in earlier passes, for example when the data is obtained in the first place.
(The practical example of this kind of stuff I like to tell, is the classic sort program. If you want to minimize the wall-clock time taken to sort the data, you do not read the data first, and then use your spiffiest sort function to sort it, because the I/O will almost always be the bottleneck. Instead, you use an on-line sort algorithm, say a binary heap, that may use more CPU time, but lets you sort each line as it arrives. Each sort/insert is likely to take less time than reading a new line on average, so by the time the last input line is read, the lines are already in sorted order, and can be output immediately. You want to avoid waiting for an event, and instead use (most) of those wait times to do useful work.)
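The same idea in a minimal Go sketch (illustrative only), using container/heap to insert each line as it arrives and emit everything in sorted order at end-of-input:
package main

import (
    "bufio"
    "container/heap"
    "fmt"
    "os"
)

// lineHeap implements heap.Interface for strings (a min-heap of lines).
type lineHeap []string

func (h lineHeap) Len() int            { return len(h) }
func (h lineHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h lineHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *lineHeap) Push(x interface{}) { *h = append(*h, x.(string)) }
func (h *lineHeap) Pop() interface{} {
    old := *h
    n := len(old)
    x := old[n-1]
    *h = old[:n-1]
    return x
}

func main() {
    h := &lineHeap{}
    sc := bufio.NewScanner(os.Stdin)
    // Insert each line into the heap as it arrives, instead of reading
    // everything first and sorting afterwards: the O(log n) insert hides
    // behind the I/O wait for the next line.
    for sc.Scan() {
        heap.Push(h, sc.Text())
    }
    // By end-of-input the heap is ready; pop the lines in sorted order.
    for h.Len() > 0 {
        fmt.Println(heap.Pop(h).(string))
    }
}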
In the case of a distributed sort, the local sort is "cheap", and passing the data is "expensive", in terms of 2021 computer hardware.
We can assume O(N log N) time complexity for the local sort (although if you have tens of megabytes to gigabytes of data per node, radix sort will do it in O(N) but has horrible cache behaviour). Since the slowest node will dictate when the entire operation completes, you'll want each node to do roughly the same amount of work. (Or rather, to need about the same wall-clock time to do the processing, since that minimizes the overall time taken.)
Because the message passing always introduces a latency, you'll want to initiate the transfers as early as possible, and preferably do some of the local sort work during the transfers.
With 2021 hardware, transfer latencies will dominate your wall clock time, especially when using actual network hardware (MPI, even with InfiniBand). Note that this is not a question of bandwidth; it is about how much wall clock time is needed to transfer all the messages. (Even if the interface is implemented via shared memory or DMA, not only does copying the data take its time, but the processes need to be informed of the events also; that means context switches or inter-process interrupts or similar.)
In general, there are three different approaches to solve the merge/broadcast/distribute problem (following the local sort). This same problem occurs in distributed molecular dynamics simulators as well, when the simulated volume is spatially divided among the computational nodes.
- You find out the pivot points in the sorted data (first and last entry in the sorted data per node), and have each node distribute their data directly to the target nodes
- You merge the data as e.g. a binary tree (in pairs), with the root of the tree broadcasting or distributing the completely sorted and merged data to all others
- You have each node broadcast the data (as all-recipient messages) to each other node.
The first one has the least amount of I/O, but the problem of finding the pivot points in the distributed data is nontrivial. However, usually the pivot points do not need to be exactly correct, as the balance between nodes does not need to be exact, and a few percent difference in the dataset sizes is perfectly acceptable. Thus, various histogram methods with perhaps a post-distribution balancing pass between neighboring nodes (in the sort order) can be quite effective, but complex to implement.
The second one has log_k(nodes) levels, but on each level, the messages are passed pairwise (or within the merge group, to the shared parent node). This means that most messages are independent, and don't need to be routed through a common choke point, so better use of the theoretical bandwidth available is achieved. The downside is that only a few nodes will do the merge pass, and most nodes will only do the local sort, then send the sorted data to their parent, and then just sit and wait to receive the completed dataset from the root; and the passing of data from root to each node may be a severe performance bottleneck.
(A binary tree merge will have half the nodes idle in the initial level, so in real life, you might wish to reuse nodes and transfer extra data, just so that as many nodes are doing useful work at each level, and the level count as small as possible.)
The third one is the obvious one if the sorted data is to be duplicated on all nodes, assuming the message passing interface does support "broadcast", or all-recipient messages. For MPI implementations, the details vary even at runtime, but it is a rather common operation, so something to consider when implementing ones own message passing stuff. The simplest "broadcast" implementation is to use a ring, with each node receiving only from the previous node, and sending only to the next node, and each node sending their locally sorted data first, then forwarding the (node count - 1) sorted subsets they receive to the next one, and while doing the I/O, doing a merge pass on their current datasets, ending up with the complete dataset at end. Next simplest is a tree/graph approach, similar to the second approach, where each node distributes their own data to two or more designated nodes; this has the same number of transfers, but fewer steps/levels. Best is kernel or hardware routing support, of course.
A real-world MPI example should really be run on real distributed hardware, since using e.g. OpenMPI on a single x86-64 or ARM computer implements the message passing using shared memory; I'm not aware of OpenMPI support for artificially delayed I/O for testing, although that'd be real nice. (That is, an adjustable millisecond/sub-millisecond delay, wall-clock time, before the recipient is notified the transfer is complete, to simulate network latencies.)
-
Doing the same on shared memory might sound much easier, but there, cacheline ping-pong can really ruin the performance. Doing the sorting directly on the data in the shared memory is not going to be the most efficient method (except on architectures without any data caching); doing local sorts in per-thread/per-core memory will be faster (assuming sufficiently large dataset, of course, that using multiple cores/threads makes sense in the first place).
Caches are great on average. But apparently trivial changes anywhere in the entire application's code or in the dataset can degrade performance by orders of magnitude.
That's broken unfortunately in some places. Caches are now horrible as we have to share them with other tenants on the same virtualized excrement >:(
Oh it runs at 20% the speed in AWS but the CPU is newer generation? :-DD
That's a new level of pain that I'm never going to have to feel!
I have no idea how anybody can make any plausible performance prediction on a virtualised system.
-
You can't. Even load testing is now non-deterministic. Performance differs by up to 100% on the same node classes. The only exit is making sure your stuff is horizontally scalable far beyond your expected headroom, and winging it. This inevitably just ends up making Bezos richer and the CxO class more purple, with steam coming out of their ears while there is a cost-reduction scramble.
Can blame the big toilet paper producing groups like Gartner for this shit show.
-
Snipping many points, I'll concentrate on...
On the hardware level and kernel-userspace interface level, the parts needed to implement a good message passing interface are surprisingly useful in other tasks as well.
Yes indeed, and one pleasing example of that is the way xC+xCORE unifies i/o and computation: everything is a "process" connected by "messages/events".
As an IPC mechanism, message passing works really well (in my experience) with event driven programming models. This approach tends to work well when designing an application (say, a simulator), or say a distributed implementation of some algorithm like Merge sort.
Very true. Interestingly many user-level GUI-based applications/frameworks are being designed using event-based concepts. I wonder if that will, over time, encourage more average developers to "think" in terms of events and then to architect their applications that way.
Personally I've always preferred frameworks to be explicitly event-based rather than, say, callback-based. The latter feel too much like trying to preserve an illusion of single-thread-of-execution semantics where it really doesn't exist. Yes, I know they are both equally expressive, but the thought processes and implementation pain points are different.
Because the message passing always introduces a latency, you'll want to initiate the transfers as early as possible, and preferably do some of the local sort work during the transfers.
Latency is a physics problem; bandwidth is an economics problem.
-
Very true. Interestingly many user-level GUI-based applications/frameworks are being designed using event-based concepts. I wonder if that will, over time, encourage more average developers to "think" in terms of events and then to architect their applications that way.
My own first introduction to an event-based domain-specific language was Lingo, the script used in Macromedia Director, in 1995 or so; did a paid CD-ROM project. (Similar to Flash authoring environment, preceded Flash; was used to make lots of CD-ROM based interactive stuff especially for kids. Worked exceptionally well on Macs.) It even used the "on event ... end [event]" idiom for the handlers.
However, even the venerable awk (https://www.gnu.org/software/gawk/manual/) is really event-based. Each record (typically a line, but only because the default record separator is the newline) comes in sequentially, and is handled by one or more rules, or event handlers. (Each record is also split into fields, for easier manipulation, according to the field separator.) The rules are written using a simple imperative script language with good support for regular expressions, so it is easy to miss the event-based approach.
Even things like thread pools share many of the same ideas/approaches as event-driven programming. There, instead of an "event", you have a "work nugget", or similar. The same rules apply: you don't want to impose any specific order for the completion of (parallel) events, especially not rely on sequential event completion, and so on.
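A minimal Go sketch of that "work nugget" pattern (illustrative only; the squaring stands in for real work): a fixed pool of workers pulls jobs from one channel and pushes results to another, with nothing relying on completion order:
package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    jobs := make(chan int) // each value is one "work nugget"
    results := make(chan int)

    var wg sync.WaitGroup
    for w := 0; w < runtime.NumCPU(); w++ { // roughly one worker per core
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobs {
                results <- j * j // the actual work, done in any order
            }
        }()
    }

    go func() {
        for i := 1; i <= 10; i++ {
            jobs <- i
        }
        close(jobs)
        wg.Wait()
        close(results)
    }()

    // Completion order is not guaranteed, and nothing here relies on it.
    for r := range results {
        fmt.Println(r)
    }
}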
Xlib (https://en.wikipedia.org/wiki/Xlib) and XCB (https://en.wikipedia.org/wiki/XCB) are almost completely event-based (aside from initialization), and mostly asynchronous (using request-response messages). Xlib is almost four decades old, and still heavily used; by no means perfect, but that's mostly because of historical baggage and not the event-based approach.
Very useful idioms overall. I do not personally feel constricted or limited, if message-passing is the only IPC method. In C, I do need threads within the same process, which implies some sharing of memory, but that's a low-level computational performance issue. In higher-level languages like Python, I can switch to process parallelism and messages, no problem.
Latency is a physics problem; bandwidth is an economics problem.
That is damned well put! :-+
-
Very true. Interestingly many user-level GUI-based applications/frameworks are being designed using event-based concepts. I wonder if that will, over time, encourage more average developers to "think" in terms of events and then to architect their applications that way.
My own first introduction to an event-based domain-specific language was Lingo, the script used in Macromedia Director, in 1995 or so; did a paid CD-ROM project. (Similar to Flash authoring environment, preceded Flash; was used to make lots of CD-ROM based interactive stuff especially for kids. Worked exceptionally well on Macs.) It even used the "on event ... end [event]" idiom for the handlers.
In general making things explicit aids comprehension and usability. Sometimes the how-to descriptions/"definitions" of how to use a framework actively avoid explaining the computational model. Yes, Apple, I'm looking at you :)
However, even the venerable awk (https://www.gnu.org/software/gawk/manual/) is really event-based. Each record (typically a line, but only because the default record separator is the newline) comes in sequentially, and is handled by one or more rules, or event handlers. (Each record is also split into fields, for easier manipulation, according to the field separator.) The rules are written using a simple imperative script language with good support for regular expressions, so it is easy to miss the event-based approach.
Yes, but there's a big grey area between event and pattern based programming.
The XML DOM processing engines typically either create a large in-memory structure replicating the DOM, or have pattern-matched "events" minimising computational overhead until you encounter the DOM elements of interest. Both are valuable, both have strengths and weaknesses.
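To show the contrast concretely, a rough sketch using Python's standard library (error handling omitted): the DOM route replicates the whole document in memory up front, while the SAX route just fires handler "events" for the elements of interest.
[code]
import xml.sax
from xml.dom import minidom

DOC = ("<orders><order id='1'><total>9.50</total></order>"
       "<order id='2'><total>12.00</total></order></orders>")

# DOM style: build the whole tree in memory, then query it.
dom = minidom.parseString(DOC)
for order in dom.getElementsByTagName("order"):
    total = order.getElementsByTagName("total")[0].firstChild.data
    print("DOM:", order.getAttribute("id"), total)

# SAX style: no tree at all; callbacks fire as elements are encountered.
class TotalHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_total = False
        self.current = None
    def startElement(self, name, attrs):
        if name == "order":
            self.current = attrs["id"]
        self.in_total = (name == "total")
    def characters(self, content):
        if self.in_total:
            print("SAX:", self.current, content)
    def endElement(self, name):
        if name == "total":
            self.in_total = False

xml.sax.parseString(DOC.encode(), TotalHandler())
[/code]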
Even things like thread pools share many of the same ideas/approaches as event-driven programming. There, instead of an "event", you have a "work nugget", or similar. The same rules apply: you don't want to impose any specific order for the completion of (parallel) events, especially not rely on sequential event completion, and so on.
I'll argue thread pools and event based programming are orthogonal. Having said that, I usually use thread pools as an execution mechanism enabling parallel processing of independent events. They have the advantage that they can minimise thrashing if there are ~N threads in the pool for an N-core processor.
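As a trivial Python sketch of that usage, with the pool sized to the core count and no assumptions made about completion order:
[code]
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def handle(event):
    # Placeholder "work nugget": whatever processing one independent event needs.
    return event, sum(i * i for i in range(event * 100_000))

events = range(1, 17)

# ~N workers for an N-core machine keeps the pool from thrashing.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    futures = [pool.submit(handle, e) for e in events]
    for done in as_completed(futures):    # results arrive in completion order,
        event, result = done.result()     # not submission order - and that's fine
        print("event", event, "done:", result)
[/code]
(For CPU-bound pure-Python work you would swap in ProcessPoolExecutor to get real parallelism past the GIL, but the pattern is identical.)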
Very useful idioms overall. I do not personally feel constricted or limited, if message-passing is the only IPC method. In C, I do need threads within the same process, which implies some sharing of memory, but that's a low-level computational performance issue. In higher-level languages like Python, I can switch to process parallelism and messages, no problem.
I attempt to work out how to neatly describe the application-domain level computation, and then use whatever tools aid in the implementation. Having many tools in my toolbox enables me to select the appropriate tool for this job. I don't need to "mis-apply the only tool I know to the current job".
To me the interest is noting large-scale similarities between languages/frameworks, so that I can rapidly understand their strengths/limitations. Unfortunately it is in marketeers' interests to obfuscate that, and young software engineers don't have the depth of experience.
Arguably my first event-driven application was a hard-realtime embedded program in C and executing in a home-grown cooperatively-scheduled RTOS that has many similarities to co-routine programming. Worked nicely, especially since I could debug the high level FSMs and processing at the event level in Xenix on a PDP-11.
As for FSMs, it appears I unwittingly and triumphantly reinvented the concept while at school, with my first assembler-level program which ran on a 39-bit, 8Kword 2kIPS machine :)
Latency is a physics problem; bandwidth is an economics problem.
That is damned well put! :-+
Unfortunately I can't claim originality - but that merely illustrates that it is a memorable way of encapsulating the issues.
-
Because message passing always introduces latency, you'll want to initiate the transfers as early as possible, and preferably do some of the local sort work during the transfers.
I find this statement both right and odd at the same time.
Any operation introduces "latency". But as a means of communicating between tasks, whether message passing introduces more latency than memory sharing, for instance, is absolutely not a given. It entirely depends on the implementation. It's certainly not a hard fact.
Not sure exactly what you have in mind. From your posts, I'm assuming you may think of message passing through networks, for instance. But while MP lends itself well to network-distributed processors, this is just one particular case. And even so, if you're dealing with such a topology, you can't share memory anyway - at least not directly. So there's no definite difference: in either case you have to move data around, and latency will be essentially the same.
-
However, even the venerable awk (https://www.gnu.org/software/gawk/manual/) is really event-based. Each record (typically a line, but only because the default record separator is the newline) comes in sequentially, and is handled by one or more rules, or event handlers. (Each record is also split into fields, for easier manipulation, according to the field separator.) The rules are written using a simple imperative script language with good support for regular expressions, so it is easy to miss the event-based approach.
Yes, but there's a big grey area between event and pattern based programming.
The XML DOM processing engines typically either create a large in-memory structure replicating the DOM, or have pattern-matched "events" minimising computational overhead until you encounter the DOM elements of interest. Both are valuable, both have strengths and weaknesses.
Sure, agreed. I perhaps should have written that as "kinda event-based".
My point was that the ideas underlying the event based approach are ubiquitous, and nothing new.
Both awks I currently use (mawk and gawk; mawk is faster for many tasks, but gawk has more useful features and additional functions) do process records as they are read, and that is one of their core strengths. The event-based approach is useful as a mental model for new awk users; that way, they tend to "grok" how they work, and generate more efficient awk scripts.
If one wanted to be pedantic, then it would be fair to say awk only has two actual events: BEGIN and END, the two special rules run before any input is processed and after all input has been processed. Me, I don't have the kind of command of the English language that would permit me to be that pedantic without being utterly wrong almost all of the time. (As it is, now I'm only often wrong.)
I'll argue thread pools and event based programming are orthogonal.
From the point of view of the worker function implementation, you can consider the work as an incoming event, and the result produced as an outgoing event, when the work is a computing task (as opposed to e.g. a client-service type task).
For example, in a local/non-distributed classical molecular dynamics simulation (not quantum mechanics, but using interaction potentials or force fields as the chemists like to call them), each time step the interaction between each pair (and depending on model, triplet, and even quad) of atoms, up to a maximum distance, needs to be evaluated. For many models, there are two functions to be evaluated; for a pairwise interaction, V_ij(r_ij) and its gradient -∇V_ij(r_ij). Some models, like embedded atom models, require a first pass over all atoms (in EAM, to calculate a scalar approximating the density of relevant electrons at each atom location), and a second pass to calculate the total potential energy at each atom, V, and its gradient (force vector), -∇V.
While the simulator overall is usually not touched by scientists, they often develop new potential models, and add quirks to them. So, giving them an interface that provides an atom (its type, say iron or copper) and the types and distances (both as a scalar and as a vector) to its neighbors, with the function (which those scientists can create or modify) calculating the interactions as above, works better than letting them trash the entire codebase. I describe those functions as a kind of "event handler"; this seems to provide a good mental model for implementing them. The simulator itself should manage the thread pool, the order in which the handlers are called, and so on, with those details kept out of the grubby hands of the dirty scientists who wouldn't know good software engineering if it hit them on the head.
Even if you have asymmetric hardware that produces the exact same results (say, a GPU supporting double-precision IEEE 754 arithmetic), this kind of approach works absolutely fine. The slower hardware will process fewer atoms, but that's perfectly okay.
I don't know what one should call this kind of approach, from the point of view of those writing/modifying only the functions calculating the interactions around a single atom. To me, "event-based" matches best, but I'm perfectly willing to be corrected, if you know a better, widely used term.
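A rough sketch of the kind of interface I mean, in Python for brevity (all names, parameters and the toy pair potential are made up; a real simulator uses compiled code, neighbour lists, and so on):
[code]
def lj_pair_handler(atom_type, neighbours):
    """Model-developer-supplied "event handler": given one atom and its
    neighbours as (type, distance r, unit vector u) tuples, return the
    potential energy contribution and the force vector on this atom.
    Toy Lennard-Jones parameters; sign convention depends on how u is defined."""
    epsilon, sigma = 0.0103, 3.40
    energy = 0.0
    force = [0.0, 0.0, 0.0]
    for n_type, r, u in neighbours:
        sr6 = (sigma / r) ** 6
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
        dVdr = 4.0 * epsilon * (6.0 * sr6 - 12.0 * sr6 * sr6) / r   # dV/dr
        for k in range(3):
            force[k] -= dVdr * u[k]
    return energy, force

# The simulator owns the neighbour search, the thread pool and the call order;
# the handler above is the only part the model developer ever touches.
def step(atom_types, neighbours_of, handler):
    return [handler(t, neighbours_of(i)) for i, t in enumerate(atom_types)]
[/code]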
Arguably my first event-driven application was a hard-realtime embedded program in C and executing in a home-grown cooperatively-scheduled RTOS that has many similarities to co-routine programming. Worked nicely, especially since I could debug the high level FSMs and processing at the event level in Xenix on a PDP-11.
If it matters, I like to use "foo-based" when using the approach/paradigm but not restricted to it only, and "foo-driven" when strictly using the foo model.
Very few things I write follow any specific pure model, though; I'm messy.
I fully agree wrt. having a varied toolkit one can use, and not trying to fit a favourite approach/paradigm to every problem one encounters, too.
Because message passing always introduces latency, you'll want to initiate the transfers as early as possible, and preferably do some of the local sort work during the transfers.
I find this statement both right and odd at the same time.
I wrote it as part of the description of the distributed merge sort, not as a general fact.
I meant that the preferred pattern when implementing such algorithms using message passing is:
- Send messages providing information needed later
- Work on local data as much as possible
- Wait until all messages providing the necessary additional information to complete the work are received
- Complete work on data
Most distributed simulators using MPI currently do:
- Work on local data
- Send changes to affected nodes
- Receive changes from other nodes that affect this node
- Repeat
which means that instead of useful computation being done while the messages are in flight, computation and communication are done as separate steps. This is a horrible pattern, because the more communication is needed between the distributed processing nodes, the smaller the fraction of time the processors spend actually doing useful work.
It is also the real reason why HPC clusters "require" InfiniBand: it has smaller latencies than IP over Ethernet, and often papers nicely over such design flaws.
I want to burn this latter pattern on a pyre and ban it.
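In mpi4py terms (assuming that wrapper is available; the C calls MPI_Isend/MPI_Irecv/MPI_Wait are analogous), the preferred pattern is roughly:
[code]
# A sketch; run with e.g.: mpirun -n 2 python3 overlap.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                          # toy two-rank example

local = list(range(rank, 1000, 2))       # some local data to sort

# 1. Initiate the transfers first (non-blocking send and receive).
send_req = comm.isend(local, dest=peer, tag=0)
recv_req = comm.irecv(source=peer, tag=0)

# 2. Do as much local work as possible while the messages are in flight.
local.sort()

# 3. Only then wait for the remote data...
remote = recv_req.wait()
send_req.wait()

# 4. ...and complete the work that needed it (here a trivial merge).
merged = sorted(local + remote)
if rank == 0:
    print(len(merged), "items merged on rank 0")
[/code]
The common pattern instead does a blocking exchange first and only then starts computing, so the processors sit idle for the whole round trip.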
But as a means of communicating between tasks, whether message passing introduces more latency than memory sharing, for instance, is absolutely not a given. It entirely depends on the implementation. It's certainly not a hard fact.
No, definitely not. The point of the above is to consider even the message passing itself as an asynchronous operation, not a synchronous one, when designing an implementation.
That is, logically, you think of the pattern as "you send a message, and sometime later, the recipient will receive it", instead of "when you send a message, the recipient immediately receives it, and then the send-message function returns". See? Disagree?
Not sure exactly what you have in mind. From your posts, I'm assuming you may think of message passing through networks, for instance.
They were just examples of use I'm very familiar with, to define the underlying reasons why I don't have any issues using message-passing only for concurrent programming (as long as I can use multiple threads or processes to do computation in parallel). I did point out stuff like X11, too.
-
No, definitely not. The point of the above is to consider even the message passing itself as an asynchronous operation, not a synchronous one, when designing an implementation.
But the physical reality is that everything is just synchronous with FIFOs. Some of those FIFOs might be close to the sender end to allow it to quickly go on its merry way while the hardware and hidden software layers figure out how to empty the close FIFO into the remote FIFO ... but FIFOs can fill up. Asynchronous is an illusion, synchronous is reality. Highest performance doesn't generally come from entertaining illusions.
-
But the physical reality is that everything is just synchronous
Actually, if we accept the relativistic worldview, physical reality itself is asynchronous: there is no global, synchronous clock at all, and the order of events can depend on the observer.
Consider this:
- Sending a message is a synchronous operation. A function call to send a message returns successfully if the message was queued, and with failure if the queue was full or some other reason prevented the message from being sent.
- Receiving a message is a synchronous operation. A function call to check for a/the next message will return with the message, or block until a message is available, or report no messages to receive, or report an error.
- The two operations do not occur at the same time, nor are they so tightly coupled that an unrelated message could not be sent or received (by either of the two parties) in between. The only physical limitation is that the message cannot be received before it is sent.
- There is always an interval between the two operations, when the sender and/or the receiver can perform other actions, unless both operations are done sequentially by the same processor.
This means that considering the whole operation (and not just sending and receiving in isolation), passing a message is by default asynchronous (https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)).
You can implement a synchronous message passing interface by not letting the sender succeed until the message has been received, but for the reasons I explained above, it will perform worse (take longer per operation) than an asynchronous one. Such a synchronous interface may be important in some specific rare use cases, of course, but can always be built on top of an asynchronous (and even unreliable) one.
Asynchronous does not imply unreliability. Many message passing interfaces, including MPI, provide the sender with a unique handle when a message is sent. The underlying implementation can provide an acknowledgement when the message has been received (and/or processed), and the sender can check for this via the handle.
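As a sketch of that handle idea using mpi4py (assuming that wrapper is available; MPI_Issend/MPI_Test are the C equivalents), using the synchronous-mode send so that completion of the handle actually means the matching receive has begun:
[code]
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # issend = synchronous-mode send: the request completes only once the
    # matching receive has started, so the handle doubles as a delivery check.
    req = comm.issend({"seq": 42, "payload": "hello"}, dest=1, tag=7)
    # ... do unrelated work here ...
    done, _ = req.test()      # non-blocking poll of the handle
    if not done:
        req.wait()            # block only when we actually need the guarantee
else:
    msg = comm.recv(source=0, tag=7)
    print("got message", msg["seq"])
[/code]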
Synchrony implies a global clock and exact ordering of events. In reality, we have neither.
-
No, definitely not. The point of the above is to consider even the message passing itself as an asynchronous operation, not a synchronous one, when designing an implementation.
But the physical reality is that everything is just synchronous with FIFOs. Some of those FIFOs might be close to the sender end to allow it to quickly go on its merry way while the hardware and hidden software layers figure out how to empty the close FIFO into the remote FIFO ... but FIFOs can fill up. Asynchronous is an illusion, synchronous is reality. Highest performance doesn't generally come from entertaining illusions.
Where comms and physics are involved, asynchronous is the basic mode of operation and synchronous behaviour is something imposed on that because it makes it easier to understand and deploy.
Obvious examples:
- look at how a flip-flop works, and understand metastability
- on-chip logic operations take a small time compared with the time taken to propagate across the chip
- ditto where computations take place on a different machine
and there are many others.
-
However, even the venerable awk (https://www.gnu.org/software/gawk/manual/) is really event-based. Each record (typically a line, but only because the default record separator is the newline) comes in sequentially, and is handled by one or more rules, or event handlers. (Each record is also split into fields, for easier manipulation, according to the field separator.) The rules are written using a simple imperative script language with good support for regular expressions, so it is easy to miss the event-based approach.
Yes, but there's a big grey area between event and pattern based programming.
The XML DOM processing engines typically either create a large in-memory structure replicating the DOM, or have pattern-matched "events" minimising computational overhead until you encounter the DOM elements of interest. Both are valuable, both have strengths and weaknesses.
Sure, agreed. I perhaps should have written that as "kinda event-based".
My point was that the ideas underlying the event based approach are ubiquitous, and nothing new.
Accepted.
Regrettably computer programming courses neglect considering processing as a sequence of events, in favour of data structures and sequential algorithms. Too often I've heard "FSMs are those things in parsers, aren't they?"
I hope that changes; it needs to.
If it matters, I like to use "foo-based" when using the approach/paradigm but not restricted to it only, and "foo-driven" when strictly using the foo model.
Very few things I write follow any specific pure model, though; I'm messy.
I fully agree wrt. having a varied toolkit one can use, and not trying to fit a favourite approach/paradigm to every problem one encounters, too.
Yes indeed, and I want to see radically improved tools, not mere variations on a theme. Ignoring the me-too variations (e.g. Delphi and C#) has saved me an awful lot of time.
If CSP, for example, was the last word, I'd be worried!
-
My problem is: how to ensure that the recipient gets the messages sent exactly in the order they were sent.
-
My problem is: how to ensure that the recipient gets the messages sent exactly in the order they were sent.
Easy, and you probably want to add "once and only once". That's what transport protocols achieve.
Then you realise that in a distributed system there can be no common concept of time, and the best you can achieve is partial ordering. See Leslie Lamport's seminal works from >40 years ago.
Example: in telecom systems, events for a call are generated from all over the world by different companies' systems and equipment that you don't know exist.
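For reference, the core of Lamport's logical clocks is tiny - a sketch (it gives you a consistent partial order of events, nothing more):
[code]
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Timestamp to attach to an outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # On receipt, jump past whatever the sender had seen.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Two nodes exchanging one message: the receive is ordered after the send,
# but concurrent events on different nodes have no defined order at all.
a, b = LamportClock(), LamportClock()
t_send = a.send()            # a's clock: 1
b.local_event()              # b's clock: 1, concurrent with a's send
t_recv = b.receive(t_send)   # b's clock: max(1, 1) + 1 = 2
print(t_send, "<", t_recv)
[/code]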
-
Synchrony implies a global clock and exact ordering of events. In reality, we have neither.
Yes, maybe self-synchronizing data bus protocols with twisted pairs carrying balanced differential signaling can offer some sort of help here, which is better than nothing :-//
-
Synchrony implies a global clock and exact ordering of events. In reality, we have neither.
Yes, maybe self-synchronizing data bus protocols with twisted pairs carrying balanced differential signaling can offer some sort of help here, which is better than nothing :-//
Nope, the problem is fundamental and unavoidable. Short distances mean shorter latency, which means the timing uncertainty is reduced - but that's all.
If you want to consider purely practical limits, where you have two independent clocks you will have variable delays due to metastability and "cycle slippage" between plesiochronous clocks.
-
you probably want to add "once and only once"
Precisely.
That is somewhat easy with solutions like Ethernet, with things like go-back-n in the lower levels of the communication stack to ensure that things arrive in order. The kernel does its job, and from userspace, applications have the illusion that things happen in order, and once and only once;
but it's different when the communication is done between queue-pairs instead of channels, and the data flows to/from user space straight to/from HW instead of going through the kernel stack.
This is basically what you have to solve when you have to create a { read, write, atomic } request on the requestor side, send the request to the responder, and directly access the memory on the responder side.
You basically have roughly a tenth of typical TCP/UDP latency, but no kernel support.
-
On latency it’s fun to think of this on an interplanetary scale, and when time dilation is accounted for. You get to the point that reasoning about synchronisation makes no sense outside of locality. Your events become a combination of locality and ordering. This thinking scales down to a single computer as well. Thus you have to divide the problem into what your high-level local concern needs to represent, and worry about the implementation only when that is established. Surprisingly, I find you rarely end up having to resolve time-of-arrival conflicts or synchronisation issues.
Your bank clearing is a fine example of this.
-
you probably want to add "once and only once"
Precisely.
That is somewhat easy with solutions like Ethernet, with things like go-back-n in the lower levels of the communication stack to ensure that things arrive in order. The kernel does its job, and from userspace, applications have the illusion that things happen in order, and once and only once;
Ethernet doesn't have any concept of "go-back-n". It is fire-and-forget, with the ability to detect one failure mechanism. The rest is the responsibility of the transport protocol, several layers up the stack. Same is true for other PHY/MAC layers.
but it's different when the communication is done between queue-pairs instead of channels, and the data flows to/from user space straight to/from HW instead of going through the kernel stack.
This is basically what you have to solve when you have to create a { read, write, atomic } request on the requestor side, send the request to the responder, and directly access the memory on the responder side.
It isn't possible to guarantee those semantics, unless you solve the Byzantine Generals problem. Do that and you will be famous.
Partial workarounds are two-phase commits and, since those can also fail, three-phase commits, and since ...
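For those who haven't met it, a toy sketch of a two-phase commit round (single process, made-up names, and assuming reliable in-order delivery and a coordinator that never dies - precisely the things it cannot guarantee in real deployments):
[code]
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes = name, will_vote_yes
    def prepare(self, txn):                    # phase 1: vote
        print(self.name, "votes", "yes" if self.will_vote_yes else "no", "on", txn)
        return self.will_vote_yes
    def commit(self, txn):                     # phase 2a
        print(self.name, "commits", txn)
    def abort(self, txn):                      # phase 2b
        print(self.name, "aborts", txn)

def two_phase_commit(participants, txn):
    votes = [p.prepare(txn) for p in participants]   # phase 1: ask everyone
    if all(votes):                                   # phase 2: commit only if unanimous
        for p in participants:
            p.commit(txn)
        return "committed"
    for p in participants:
        p.abort(txn)
    return "aborted"

print(two_phase_commit([Participant("db1"), Participant("db2", False)], "txn-17"))
[/code]
The weak spot: if the coordinator dies between the phases, participants that voted yes are stuck holding their locks - which is what drives people to three-phase commit, and so on.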
You basically have roughly a tenth of typical TCP/UDP latency, but no kernel support.
... and no worthwhile guarantees!
For fun, try to define how you would recover a system which has suffered the "split brain" problem. That can occur at the application level and also at the PHY/MAC level where the classic example is a split token ring leading to zero or two tokens circulating.
-
On latency it’s fun to think of this on an interplanetary scale, and when time dilation is accounted for. You get to the point that reasoning about synchronisation makes no sense outside of locality. Your events become a combination of locality and ordering. This thinking scales down to a single computer as well. Thus you have to divide the problem into what your high-level local concern needs to represent, and worry about the implementation only when that is established. Surprisingly, I find you rarely end up having to resolve time-of-arrival conflicts or synchronisation issues.
Your bank clearing is a fine example of this.
I've encountered it in pay-as-you-go telecom billing systems, where all the different comms channels have to be terminated when the money runs out.
OT, but it is fun to consider that ours may be the only generation where it is possible to have a voice conversation with any member of the human race.
-
Ethernet doesn't have any concept of "go-back-n"
Data link layer, "Go back-n and repeat" is implemented in the kernel; I know because I have often found myself in the position of having to get my hands on it, and not for fun. Now I am with RDMA, and this technology directly shoots in kernel memory bypassing the CPU.
I stand corrected, but that is only of very limited utility since it requires that both ends are on the same link - and that can be difficult to guarantee in an installation over time. In effect it is only beneficial where one specific local link is unreliable.
Message passing requires end-to-end guarantees, and that requires a transport protocol.
-
You could say that one of the major cons of using message-passing is the implementation complexity, when both security and utility are carefully considered.
That stuff cannot be added in as an afterthought. It has to be baked in to the design, or it will be only a half measure, as decades of past experience and examples show.
-
You could say that one of the major cons of using message-passing is the implementation complexity, when both security and utility are carefully considered.
That stuff cannot be added in as an afterthought. It has to be baked in to the design, or it will be only a half measure, as decades of past experience and examples show.
That, of course, is true in any application or library. It is a "non-functional requirement" that you can't write tests for, and therefore it gets ignored in agile development.
Wasn't it Bruce Schneier who said, "If you think encryption will solve your problem, then you don't understand encryption - and you don't understand your problem"?