At the hardware level and the kernel-userspace interface level, the parts needed to implement a good message passing interface are surprisingly useful for other tasks as well.
For example, from the Berkeley Packet Filter implementation, Linux grew seccomp (usually used via libseccomp, i.e. seccomp_init() et al.). This is essentially a very lightweight bytecode interpreter on the kernel side of each syscall that a process can install. The filter can be made permanent, and is often used to implement sandboxes where only a specific set of syscalls is allowed.
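As a minimal sketch of that use (compile and link with -lseccomp; the allowed syscall set here is just an illustration), a process can restrict itself to a handful of syscalls like this:

```c
#include <unistd.h>
#include <seccomp.h>

int main(void)
{
    /* Any syscall not explicitly allowed kills the process. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx)
        return 1;

    /* Allow only write() and exit_group() for this illustration. */
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) < 0)
        return 1;

    /* Install the filter; it stays in effect until the process exits. */
    if (seccomp_load(ctx) < 0)
        return 1;
    seccomp_release(ctx);

    const char msg[] = "Hello from the sandbox\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    return 0;
}
```

After seccomp_load() succeeds, any syscall outside the allowed set (even an innocent-looking one from a library) terminates the process.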
As an IPC mechanism, message passing works really well (in my experience) with event-driven programming models. That combination is a natural fit when designing an application (say, a simulator), or a distributed implementation of some algorithm like merge sort.
I'd really be interested if anyone is up for the challenge.
A generic approach (something providing a simple API like distributed_msort()) is unlikely to perform that well, because the sources of efficiency boosts are in the details. For example:
- Is the data to be sorted on a specific node, duplicated over all nodes, distributed evenly to the nodes, or distributed unevenly to the nodes?
- Should the sorted results be on a specific node, duplicated over all nodes, or distributed (evenly?) to the nodes?
- Can we collect simple statistics on the data to be sorted beforehand? Specifically, minimum and maximum?
- Can we collect a simple histogram of the data, say 4 × (number of nodes) bins, and combine and distribute the histogram to all nodes?
The last two are steps that can often be done in earlier passes, for example when the data is obtained in the first place.
(The practical example of this kind of thing I like to tell is the classic sort program. If you want to minimize the wall-clock time taken to sort the data, you do not read all the data first and then apply your spiffiest sort function, because the I/O will almost always be the bottleneck. Instead, you use an on-line sort algorithm, say a binary heap, that may use more CPU time but lets you sort each line as it arrives. Each insert is likely to take less time than reading a new line on average, so by the time the last input line is read, the lines are already in sorted order and can be output immediately. You want to avoid waiting idle for an event, and instead use most of those wait times to do useful work.)
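Here is a minimal sketch of that idea (POSIX getline() assumed, error handling kept short): each line is pushed into a binary min-heap as it arrives, so the O(log N) sorting work overlaps the input I/O, and output can begin the moment the input ends.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char  **heap = NULL;   /* min-heap of lines, ordered by strcmp() */
static size_t  used = 0, have = 0;

static void heap_push(char *line)
{
    if (used >= have) {
        have = have ? 2 * have : 1024;
        heap = realloc(heap, have * sizeof heap[0]);
        if (!heap)
            exit(EXIT_FAILURE);
    }
    /* Sift up: move parents down until 'line' fits. */
    size_t i = used++;
    while (i > 0 && strcmp(line, heap[(i - 1) / 2]) < 0) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = line;
}

static char *heap_pop(void)
{
    char  *top = heap[0], *last = heap[--used];
    size_t i = 0, c;
    /* Sift down: move the smaller child up until 'last' fits. */
    while ((c = 2 * i + 1) < used) {
        if (c + 1 < used && strcmp(heap[c + 1], heap[c]) < 0)
            c++;
        if (strcmp(last, heap[c]) <= 0)
            break;
        heap[i] = heap[c];
        i = c;
    }
    heap[i] = last;
    return top;
}

int main(void)
{
    char   *line = NULL;
    size_t  size = 0;

    /* Each O(log N) insert happens while the next line is in flight. */
    while (getline(&line, &size, stdin) != -1)
        heap_push(strdup(line));
    free(line);

    /* Input exhausted: extraction yields the lines in sorted order. */
    while (used > 0) {
        char *s = heap_pop();
        fputs(s, stdout);
        free(s);
    }
    free(heap);
    return 0;
}
```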
In the case of a distributed sort, the local sort is "cheap", and passing the data is "expensive", in terms of 2021 computer hardware.
We can assume O(N log N) time complexity for the local sort (although if you have tens of megabytes to gigabytes of data per node, radix sort will do it in O(N), but have a horrible cache behaviour). Since the slowest node will dictate when the entire operation completes, you'll want each node to do roughly the same amount of work. (Or rather, need about the same wall clock time to do the processing, since that minimizes the overall time taken.)
Because message passing always introduces latency, you'll want to initiate the transfers as early as possible, and preferably do some of the local sort work during the transfers.
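A hedged sketch of that overlap using nonblocking MPI (local_sort() is an assumed helper, and the receiver is assumed to merge the two sorted runs it gets):

```c
#include <mpi.h>
#include <stddef.h>

/* Assumed helper: sorts 'count' ints in place. */
extern void local_sort(int *data, size_t count);

/* Sort the first half, ship it while sorting the second half, so the
 * transfer latency is hidden behind useful local work.  The receiver
 * ends up with two sorted runs and merges them on its side. */
void send_while_sorting(int *data, size_t count, int dest)
{
    size_t half = count / 2;
    MPI_Request req;

    local_sort(data, half);
    MPI_Isend(data, (int)half, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);

    local_sort(data + half, count - half);   /* overlaps the transfer */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Send(data + half, (int)(count - half), MPI_INT, dest, 0,
             MPI_COMM_WORLD);
}
```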
With 2021 hardware, transfer latencies will dominate your wall clock time, especially when using actual network hardware (MPI, even with InfiniBand). Note that this is not a question of bandwidth; it is about how much wall clock time is needed to transfer all the messages. (Even if the interface is implemented via shared memory or DMA, not only does copying the data take its time, but the processes also need to be informed of the events; that means context switches or inter-processor interrupts or similar.)
In general, there are three different approaches to solve the merge/broadcast/distribute problem (following the local sort). This same problem occurs in distributed molecular dynamics simulators as well, when the simulated volume is spatially divided among the computational nodes.
- You find out the pivot points in the sorted data (first and last entry in the sorted data per node), and have each node distribute their data directly to the target nodes
- You merge the data e.g. as a binary tree (in pairs), with the root of the tree broadcasting or distributing the completely sorted and merged data to all others
- You have each node broadcast the data (as all-recipient messages) to each other node.
The first one has the least amount of I/O, but the problem of finding the pivot points in the distributed data is nontrivial. However, usually the pivot points do not need to be exactly correct, as the balance between nodes does not need to be exact, and a few percent difference in the dataset sizes is perfectly acceptable. Thus, various histogram methods with perhaps a post-distribution balancing pass between neighboring nodes (in the sort order) can be quite effective, but complex to implement.
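As a rough sketch of the histogram variant (assuming the bin edges were already agreed on from a known global minimum and maximum; bin and node counts are illustrative), each node can combine the per-node histograms with MPI_Allreduce and pick the approximate pivot bins locally:

```c
#include <mpi.h>

/* Combine per-node histograms and pick approximate pivot bins, so that
 * each of 'nodes' nodes ends up with roughly the same element count.
 * 'pivot_bin' receives nodes - 1 bin indices (the cut points). */
void pick_pivots(const long *local_hist, int bins, int nodes, int *pivot_bin)
{
    long global_hist[bins];            /* C99 variable-length array */
    MPI_Allreduce(local_hist, global_hist, bins, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);

    long total = 0;
    for (int i = 0; i < bins; i++)
        total += global_hist[i];

    /* Walk the cumulative histogram, cutting whenever the running
     * count passes the next 1/nodes share of the data. */
    long running = 0;
    int  cut = 0;
    for (int i = 0; i < bins && cut < nodes - 1; i++) {
        running += global_hist[i];
        while (cut < nodes - 1 && running >= total * (long)(cut + 1) / nodes)
            pivot_bin[cut++] = i;
    }
}
```

Since every node sees the same combined histogram, they all compute the same cut points without any further communication.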
The second one has log_k(nodes) levels (k = 2 for pairwise merging), but on each level, the messages are passed pairwise (or within the merge group, to the shared parent node). This means that most messages are independent, and don't need to be routed through a common choke point, so better use of the theoretical bandwidth available is achieved. The downside is that only a few nodes will do the merge passes, and most nodes will only do the local sort, then send the sorted data to their parent, and then just sit and wait to receive the completed dataset from the root; and the passing of data from root to each node may be a severe performance bottleneck.
(A binary tree merge will have half the nodes idle already at the initial level, so in real life, you might wish to reuse nodes and transfer extra data, just so that as many nodes as possible are doing useful work at each level, and the level count stays as small as possible.)
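A hedged sketch of the pairwise variant (inplace_merge() is an assumed helper that merges two adjacent sorted runs; data[] is assumed large enough to hold the entire dataset):

```c
#include <mpi.h>

/* Assumed helper: merges the sorted runs data[0..mid) and
 * data[mid..total) into a single sorted run data[0..total). */
extern void inplace_merge(int *data, int mid, int total);

/* Pairwise (binary tree) merge: on each level, half the remaining
 * nodes send their run to a partner, which merges it and moves up a
 * level.  Rank 0 ends up with everything, and still has to broadcast
 * or distribute the result afterwards. */
void tree_merge(int *data, int *count, int rank, int nodes)
{
    for (int stride = 1; stride < nodes; stride *= 2) {
        if (rank % (2 * stride) == stride) {
            /* Sender: ship our run to the partner and drop out. */
            MPI_Send(data, *count, MPI_INT, rank - stride, 0,
                     MPI_COMM_WORLD);
            *count = 0;
            return;
        }
        if (rank % (2 * stride) == 0 && rank + stride < nodes) {
            /* Receiver: probe for the partner's run size, then merge. */
            MPI_Status status;
            int incoming;
            MPI_Probe(rank + stride, 0, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &incoming);
            MPI_Recv(data + *count, incoming, MPI_INT, rank + stride, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            inplace_merge(data, *count, *count + incoming);
            *count += incoming;
        }
    }
}
```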
The third one is the obvious one if the sorted data is to be duplicated on all nodes, assuming the message passing interface supports "broadcast", or all-recipient messages. For MPI implementations, the details vary even at runtime, but it is a rather common operation, so something to consider when implementing one's own message passing stuff. The simplest "broadcast" implementation is to use a ring, with each node receiving only from the previous node and sending only to the next node: each node sends its locally sorted data first, then forwards the (node count - 1) sorted subsets it receives to the next one, and while doing the I/O, does a merge pass on its current datasets, ending up with the complete dataset at the end. Next simplest is a tree/graph approach, similar to the second option, where each node distributes its own data to two or more designated nodes; this has the same number of transfers, but fewer steps/levels. Best is kernel or hardware routing support, of course.
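A sketch of that ring, under similar assumptions as before (equal block sizes for brevity, and merge_in() as an assumed helper that merges a sorted block into a sorted, growing result):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Assumed helper: merges the 'n' sorted values in src[] into the
 * sorted dst[0..have), which has room for have + n values. */
extern void merge_in(int *dst, int have, const int *src, int n);

/* Ring "broadcast": each node sends its own sorted block first, then
 * forwards each block it receives, merging arrivals in between the
 * transfers.  result[] needs room for nodes * n values. */
void ring_merge(const int *block, int n, int *result, int rank, int nodes)
{
    const int next = (rank + 1) % nodes;
    const int prev = (rank + nodes - 1) % nodes;
    int *send_buf = malloc((size_t)n * sizeof *send_buf);
    int *recv_buf = malloc((size_t)n * sizeof *recv_buf);
    if (!send_buf || !recv_buf)
        MPI_Abort(MPI_COMM_WORLD, 1);

    memcpy(result, block, (size_t)n * sizeof *block);
    memcpy(send_buf, block, (size_t)n * sizeof *block);

    for (int step = 0; step < nodes - 1; step++) {
        MPI_Request req;
        MPI_Isend(send_buf, n, MPI_INT, next, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(recv_buf, n, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* Merge the block that just arrived; it is also what we
         * forward on the next step. */
        merge_in(result, (step + 1) * n, recv_buf, n);
        int *tmp = send_buf; send_buf = recv_buf; recv_buf = tmp;
    }
    free(send_buf);
    free(recv_buf);
}
```

After nodes - 1 steps, every node has seen every block exactly once, and holds the fully merged dataset.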
A real-world MPI example should really be run on real distributed hardware, since e.g. OpenMPI on a single x86-64 or ARM computer implements the message passing via shared memory; I'm not aware of OpenMPI support for artificially delayed I/O for testing, although that'd be real nice. (That is, an adjustable millisecond/sub-millisecond delay, in wall-clock time, before the recipient is notified that the transfer is complete, to simulate network latencies.)