Author Topic: Books on Basic and Advanced Algorithms to program in C/C ++  (Read 11670 times)


Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #50 on: July 09, 2018, 11:04:51 pm »
Hey, concurrency and parallelism are not the same thing. While objects (OO) have already been shoehorned in one way or another into nearly every existing language, neither concurrency nor parallelism has got good enough (if any) support in mainstream languages yet, imo.

The web/JS is, out of necessity, trying, and making slow but steady progress in these matters of concurrency, asynchronous execution and parallelism. Good or bad IDK, time will tell, but I keep an eye on that, it's very interesting.
The further a society drifts from truth, the more it will hate those who speak it.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #51 on: July 09, 2018, 11:46:27 pm »
neither concurrency nor parallelism has got good enough (if any) support in mainstream languages yet, imo.

Intel's C compiler tries to parallelize loops then uses SIMD instructions to perform parallel execution. You don't need to do anything special in the language - it's the same C, but it results in parallel execution.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #52 on: July 10, 2018, 05:55:45 am »
Usable compilers and schedulers are harder problems than all of those, however, so YMMV. Consider scheduling even a trivial concurrent system such as a map and reduce operation. How do you schedule that, including IO bottlenecks, data paths and throughput optimisation? You now have n! problems to solve.

I’m very sceptical of such systems, even with CSP thrown in.

CSP doesn't have much to do with performance; it is about correctness. OTOH, if you are prepared to accept an incorrect result, I can significantly speed up any algorithm :)

Ahead-of-time schedulability - which I think is what you are after - requires that all components of a system are predictable. That's unobtainable in the large-scale cases you quote; but XMOS does achieve it in constrained embedded systems.

Note that CSP is closely related to message-passing, which is the best hope for scalability of large-scale coarse-granularity systems you mention. Hence some CSP concepts appearing in Go and Rust.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #53 on: July 10, 2018, 06:00:16 am »
Hey, concurrency and parallelism are not the same thing. While objects (OO) have already been shoehorned in one way or another into nearly every existing language, neither concurrency nor parallelism has got good enough (if any) support in mainstream languages yet, imo.

Some languages have useful concurrency concepts, e.g. Ada, Java, xC, Go, Rust. (And there are other less well known ones)

Other languages have a "head in the sand" "let's build castles on sand" mentality, notably C and C++. (It has taken them 40 years to admit that they need a memory model, quarter of a century too late)
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #54 on: July 10, 2018, 06:57:49 am »
Aren't there some Fortran compilers that are very good at parallelizing the sort of problems that people still use Fortran to solve?
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #55 on: July 10, 2018, 07:10:22 am »
This is why I say that "it's a recent problem": https://en.wikipedia.org/wiki/Parallel_computing#Background

Quote
Frequency scaling was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Maintaining everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction. An increase in frequency thus decreases runtime for all compute-bound programs.[10] However, power consumption P by a chip is given by the equation P = C × V² × F, where C is the capacitance being switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the processor frequency (cycles per second).[11] Increases in frequency increase the amount of power used in a processor. Increasing processor power consumption led ultimately to Intel's May 8, 2004 cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.[12]

To deal with the problem of power consumption and overheating the major central processing unit (CPU or processor) manufacturers started to produce power efficient processors with multiple cores. The core is the computing unit of the processor and in multi-core processors each core is independent and can access the same memory concurrently. Multi-core processors have brought parallel computing to desktop computers. Thus parallelisation of serial programmes has become a mainstream programming task. In 2012 quad-core processors became standard for desktop computers, while servers have 10 and 12 core processors. From Moore's law it can be predicted that the number of cores per processor will double every 18–24 months. This could mean that after 2020 a typical processor will have dozens or hundreds of cores.[13]

An operating system can ensure that different tasks and user programmes are run in parallel on the available cores. However, for a serial software programme to take full advantage of the multi-core architecture the programmer needs to restructure and parallelise the code. A speed-up of application software runtime will no longer be achieved through frequency scaling, instead programmers will need to parallelise their software code to take advantage of the increasing computing power of multicore architectures.
« Last Edit: July 10, 2018, 07:13:28 am by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline CJay

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #56 on: July 10, 2018, 09:17:03 am »
I liked Programming Pearls by Jon Bentley and found it useful.

https://www.amazon.co.uk/dp/B01EAW7XXU/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #57 on: July 10, 2018, 11:08:04 am »
Intel's C compiler tries to parallelize loops then uses SIMD instructions to perform parallel execution. You don't need to do anything special in the language - it's the same C, but it results in parallel execution.
Have you looked at the generated code? In all non-trivial cases, I can do a better job by using intrinsics (<immintrin.h>) and compiler-provided vector type extensions.

The sequence-point model the C standard uses does not lend itself well to vectorization.  Fortran array notation is much easier to vectorize, because there is no implicit order in the assignments. (By vectorization, I mean using architecture-specific single-instruction, multiple-data extensions.)
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #58 on: July 10, 2018, 02:02:13 pm »
Intel's C compiler tries to parallelize loops then uses SIMD instructions to perform parallel execution. You don't need to do anything special in the language - it's the same C, but it results in parallel execution.
Have you looked at the generated code? In all non-trivial cases, I can do a better job by using intrinsics (<immintrin.h>) and compiler-provided vector type extensions.

Sure, you can do better than anything writing in assembler. However, most cases are simple, such as:

Code: [Select]
for (j = s = 0; j < n; j++) s += a[j];
Re-writing this with intrinsics for AVX will take a while - you need to take care of alignment, figure out what to do with odd values of "n", etc. The compiler can do all that work for you.

Of course, you can gain more efficiency by aligning the array, perhaps adding some extra "zero" space at the end of the array making sure you do not need partial operations at the end of the loop. Then your assembler/intrinsics code will roll. However, if the array is long enough, the gain won't be that big.
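
For a sense of what the hand-written version involves, here is a minimal AVX2 sketch of the same loop (assuming 32-bit elements and unaligned data, with a scalar tail for odd "n"; compile with -mavx2):

Code: [Select]
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: sum of an int32 array using AVX2 intrinsics. */
static int32_t sum_avx2(const int32_t *a, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t j = 0;
    for (; j + 8 <= n; j += 8)   /* 8 x int32 per 256-bit register */
        acc = _mm256_add_epi32(acc,
                  _mm256_loadu_si256((const __m256i *)(a + j)));
    int32_t lane[8], s = 0;
    _mm256_storeu_si256((__m256i *)lane, acc);
    for (int k = 0; k < 8; k++) s += lane[k];  /* horizontal sum of lanes */
    for (; j < n; j++) s += a[j];              /* scalar tail for odd n */
    return s;
}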

As to the parallelization with the numbers of cores, I think we're already at the end of this road - Intel cores share the same memory, and memory access becomes a bottleneck with a relatively small number of cores. 100 cores sharing the same memory won't do very well even with caches.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #59 on: July 11, 2018, 10:06:41 am »
Sure, you can do better than anything writing in assembler.
No, that's not what I meant.  What I meant was that using C constructs describing explicit vectorization, I can do a better job than a C compiler can.

However, most cases are simple
I've dealt with a lot of C code, and that is just not true.  It is only true if your C program is horribly inefficient or imprecise to start with, at the algorithmic level, and no compiler magick will help then.
For example, if you need to calculate a sum of more than a couple of dozen floating-point terms in practice, you'll find you'll either need to ensure they are in monotonic order, or do a Kahan sum, or you'll lose precision.
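
(For reference, a minimal Kahan summation sketch; the compensation variable recovers the low-order bits a naive sum would discard. Note that a compiler allowed to re-associate floats, e.g. under -ffast-math, may legally optimize the compensation away.)

Code: [Select]
#include <stddef.h>

/* Minimal sketch of Kahan (compensated) summation. */
double kahan_sum(const double *x, size_t n)
{
    double s = 0.0, c = 0.0;   /* c holds the running compensation */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;
        double t = s + y;      /* low-order bits of y are lost here... */
        c = (t - s) - y;       /* ...and recovered into c */
        s = t;
    }
    return s;
}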

The most typical operations needed tend to be vector operations, or sums of products. (I'm not talking about computer graphics, either. Think of any physical models, operations on audio or video, animation, or any kind of packing problem, for example. A lot of geometric and physical problems need to be solved in everyday programs.)

A very typical operation is normalizing vectors to unit length. (Which, by the way, might not have anything to do with geometry, because a large number of problems can be simplified this way.) It turns out that if the components of a vector are consecutive in memory, it is especially hard for the compiler to vectorize the norm squared, and the compiler-generated code tends to do the square root operation separately.  However, if you write the C code such that you explicitly load vectors so that each vector register has one component from consecutive vectors, you get near-optimal code, even with GCC (which is terrible at vectorization, compared to Intel C/C++).  Because C is defined via an underlying virtual machine with sequence points, it is very, very hard for a C compiler to detect this pattern in code and optimize it.  (I think ICC does detect the most typical pattern -- i.e., brute-force detection of each pattern is the only way those can be optimized in C -- but GCC does not.)

Similarly, if you want to do e.g. matrix-matrix multiplication between large matrices (a common problem in many optimization problems), the bottleneck is the cache pattern. The solution is to use temporary memory, copying the contents of at least one of the matrices so that both will have very linear access patterns. Ulrich Drepper's 2007 article is a very detailed look into this, including the data when vectorized using SSE2 instructions. The C language itself is such that it does not allow a compiler to do this automatically (and it is not even clear if one would want a compiler to do this automatically).
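
A minimal sketch of the copy-for-linear-access idea (just the transposed-copy part, not the full blocked version Drepper describes; matmul_bt is an invented name):

Code: [Select]
#include <stdlib.h>

/* C = A*B for n x n row-major matrices. B is copied transposed into
   scratch memory so both inner-loop streams advance with stride 1. */
void matmul_bt(const double *A, const double *B, double *C, size_t n)
{
    double *Bt = malloc(n * n * sizeof *Bt);
    if (!Bt) return;                       /* error handling elided */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            Bt[j*n + i] = B[i*n + j];      /* one-off O(n^2) copy */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * Bt[j*n + k];  /* both strides are 1 */
            C[i*n + j] = s;
        }
    free(Bt);
}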

As to the parallelization with the numbers of cores, I think we're already at the end of this road
Again, I sort of disagree.  I agree that with symmetric multiprocessing, increasing the number of cores will not help with a small number of concurrent tasks -- say, typical laptop, tablet, or desktop computer usage, where you have a browser open, maybe a word processor, and such.

Asymmetric multiprocessing, where you have different types of cores, some optimized for specific tasks (like calculating checksums of data, DMA'ing it around based on logical filter expressions), is growing like crazy. Phones tend to have a separate processor doing all the network/audio stuff, graphics are handled by dedicated processors (that are still highly programmable via CUDA or OpenCL), and so on.  (In fact, I have an Odroid HC1 single-board Linux computer, which has a Samsung Exynos 5422 processor. It has four fast Cortex A15 cores, and four slower and simpler Cortex A7 cores.)

Coroutines, message passing (especially various "mailbox" approaches), atomic operations, and explicit concurrency (or, rather, explicit notation of memory ordering and visibility of side effects) are very important tools for writing asymmetric multiprocessing code, since typically, the intent is to keep data flowing at maximum bandwidth with minimum latencies. C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).
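
For concreteness, here is roughly what the C11 <stdatomic.h> interface looks like in use - a minimal single-producer, single-consumer "mailbox" sketch with release/acquire ordering:

Code: [Select]
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                   /* message body, plain memory */
static atomic_bool ready = false;

void producer(void)
{
    payload = 42;                     /* write the data first... */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                             /* acquire pairs with the release */
    return payload;                   /* guaranteed to observe 42 */
}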

A related problem (to coroutines and message passing) is that we kind of need a construct even lighter than threads (nanothreads or fibres or similar). CPU cores have so much state that context switches even within the same process tend to be expensive. Coroutines and closures (especially if combined with message passing) could really use some sort of nanothreads that have their own stack, but otherwise share the state with their parent thread. Hardware-wise there is no problem (and stack switching thus is actually quite common in POSIXy operating systems, for signal delivery), but such concepts have thus far only been incorporated in very high-level languages like Python; we do not really know how to frame/describe/define them for low level code that actually works and does not cause more harm than good.  In particular, it is not clear whether coroutines should be implemented as functions (or with similar restrictions), or if they should be treated specially, similar to e.g. POSIX signal handler functions (where many common operations do lead to undefined behaviour, and are therefore not really allowed).
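
To make the idea concrete, here is a rough sketch of a generator-style "fiber" running on its own stack, using the POSIX ucontext(3) calls (obsolescent, but still available on Linux); the names are invented for illustration:

Code: [Select]
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fib_ctx;
static int value;

static void fiber_fn(void)              /* runs on its own stack */
{
    for (int i = 0; i < 3; i++) {
        value = i * i;                  /* "yield" one value... */
        swapcontext(&fib_ctx, &main_ctx);
    }
}

int main(void)
{
    static char stack[64 * 1024];       /* the fiber's private stack */
    getcontext(&fib_ctx);
    fib_ctx.uc_stack.ss_sp   = stack;
    fib_ctx.uc_stack.ss_size = sizeof stack;
    fib_ctx.uc_link = &main_ctx;        /* where to go if the fiber returns */
    makecontext(&fib_ctx, fiber_fn, 0);

    for (int i = 0; i < 3; i++) {
        swapcontext(&main_ctx, &fib_ctx);  /* resume the fiber... */
        printf("got %d\n", value);         /* ...and consume its value */
    }
    return 0;
}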

There is a lot of research done on all of these concepts as we speak, down to new types of hardware architectures. On the programming language development side, a major problem is that by the time you've learned enough to develop a language, all that knowledge will steer you towards a specific paradigm/approach to programming languages, so it is terribly hard to create something completely novel.  (Plus, even if you do, the sheer inertia to overcome is more than one human can bear.) Higher level languages are easier, because there is less work to get a minimal working implementation for examples, further research, and obtaining results (for example, implementing typical programs, and comparing them to similar programs written in other languages).  To implement a new low-level language, you'll also need to implement the basic runtime library from scratch, consider the Application Binary Interfaces, and whether you'll model your base library on top of C/POSIX (and thus system calls on most OSes), or try for something different.  Even just mapping a minimal subset of POSIX (to get filesystem access and so on) to your new base library is a huge task; it is a complex specification with a very long history.  (And without POSIX, you basically condemn yourself to not just creating a new programming language, but to create a whole new OS to go with it.)

It would take someone like Elon Musk donating a million or so to a number of oddball developers like myself, to see what they come up with. (Most of us would fail, but there is a small probability that someone could come up with a new low-level language, better suited for current and future hardware architectures than C is, but on the same order of simplicity of the core language itself.)
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #60 on: July 11, 2018, 10:31:41 am »
Much of your very sensible and knowledgeable message deleted for brevity. It horrifies me that many people have the misunderstandings you describe.

I'll merely make comments on this bit...

Coroutines, message passing (especially various "mailbox" approaches), atomic operations, and explicit concurrency (or, rather, explicit notation of memory ordering and visibility of side effects) are very important tools for writing asymmetric multiprocessing code, since typically, the intent is to keep data flowing at maximum bandwidth with minimum latencies.

All those are valuable for more than "just" asymmetric multiprocessing code.

Quote
C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).

Yes, that attitude is negligent. C/C++ is only getting a memory model 20 years after Java.

I note that, even though they were starting with a clean sheet of paper, the Java designers had to revise their memory model. Given that, what's the probability that the C/C++ "designers" and compiler implementers will produce something that is usable by a typical software writer?
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #61 on: July 11, 2018, 10:36:52 am »
Closures are great: instead of spawning multiple processes, each with its own parameters and state, with closures you can do the same in a single process, because there can be multiple, simultaneous, different program states alive at the same time. Closures also very often save you from the need to create and administer objects just to hold state.
« Last Edit: July 14, 2018, 09:52:40 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8635
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #62 on: July 11, 2018, 11:03:21 am »
Quote
C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).

Yes, that attitude is negligent. C/C++ is only getting a memory model 20 years after Java.

I note that, even though they were starting with a clean sheet of paper, the Java designers had to revise their memory model. Given that, what's the probability that the C/C++ "designers" and compiler implementers will produce something that is usable by a typical software writer?
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #63 on: July 11, 2018, 11:11:31 am »
Quote
C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).

Yes, that attitude is negligent. C/C++ is only getting a memory model 20 years after Java.

I note that, even though they were starting with a clean sheet of paper, the Java designers had to revise their memory model. Given that, what's the probability that the C/C++ "designers" and compiler implementers will produce something that is usable by a typical software writer?
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....

I disagree, since your statement appears very x86 centric.

I disagree, since Java faces the same problems, and successfully addressed them >20 years ago.

C/C++ claims to be low-level, close to the hardware, efficient, available on (almost) all architectures. That's their downfall, since it is trying to be all things to all people.

Until recently C explicitly stated that multiprocessing type things were part of the library and outside the language. That's a problem for library developers, since they are forced to rely on things the language doesn't guarantee.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8635
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #64 on: July 11, 2018, 11:33:15 am »
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....

I disagree, since your statement appears very x86 centric.

I disagree, since Java faces the same problems, and successfully addressed them >20 years ago.

C/C++ claims to be low-level, close to the hardware, efficient, available on (almost) all architectures. That's their downfall, since it is trying to be all things to all people.

Until recently C explicitly stated that multiprocessing type things were part of the library and outside the language. That's a problem for library developers, since they are forced to rely on things the language doesn't guarantee.
What is x86 specific about what I said? People have struggled trying to make things like lock free algorithms rock solid on most machines, because of iffy things in the memory model.

I do agree that having no reliable memory model in C and C++ hasn't been very helpful. However, the inertia of those languages means it's hard to compare them to the freely ripped up and replaced nature of early Java.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4026
  • Country: nz
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #65 on: July 11, 2018, 11:37:08 am »
In particular, it is not clear whether coroutines should be implemented as functions (or with similar restrictions), or if they should be treated specially

Hmm. I think this is a solved problem. It's called "continuation-passing style". Functions never return .. they just tail-call the thing that wanted their result.

It's a pain in the arse to write manually, but compilers can automatically transform normal functions into CPS. It's been a standard thing for compilers for languages such as Scheme or ML to do for decades, but you can do it for C too.

CPS makes switching micro-threads into simply calling the current continuation of the other thread instead of calling the current continuation of your thread.

In the simplest implementation, local variables that live past a call to another function have to be heap-allocated -- it's basically turning each function into a class, with each basic block in the function being a member function in the class. But it's possible for a smart compiler to do it more efficiently than this.
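
A hypothetical sketch of that transformation in C (all names invented for illustration): each function takes an explicit continuation and never returns a value.

Code: [Select]
#include <stdio.h>

typedef void (*cont)(int result, void *env);

/* CPS versions: instead of returning, tail-call the continuation. */
static void add_cps(int a, int b, cont k, void *env)  { k(a + b, env); }
static void square_cps(int a, cont k, void *env)      { k(a * a, env); }

static void print_result(int r, void *env) { (void)env; printf("%d\n", r); }

/* continuation for the intermediate sum: square it, then print */
static void then_square(int sum, void *env)
{
    square_cps(sum, print_result, env);
}

int main(void)
{
    /* compute (3 + 4)^2 without any function ever "returning" a value */
    add_cps(3, 4, then_square, NULL);
    return 0;
}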
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #66 on: July 11, 2018, 12:36:44 pm »
CPS makes switching micro-threads into simply calling the current continuation of the other thread instead of calling the current continuation of your thread.

In the simplest implementation, local variables that live past a call to another function have to be heap-allocated -- it's basically turning each function into a class, with each basic block in the function being a member function in the class. But it's possible for a smart compiler to do it more efficiently than this.

And that's exactly why and where closures shine. Impure functional languages FTW. Closures are blocks in C (https://en.wikipedia.org/wiki/Blocks_(C_language_extension)), although almost nobody uses them, I believe.
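
For the curious, a minimal Blocks sketch (clang only: compile with -fblocks, and link -lBlocksRuntime on non-Apple platforms):

Code: [Select]
#include <stdio.h>

int main(void)
{
    int base = 10;
    /* a block literal capturing "base" from its lexical scope */
    int (^add)(int) = ^(int x) { return base + x; };
    printf("%d\n", add(32));    /* prints 42 */
    return 0;
}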
« Last Edit: July 11, 2018, 01:03:35 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #67 on: July 11, 2018, 12:41:31 pm »
All those are valuable for more than "just" asymmetric multiprocessing code.
Oh, absolutely agreed.  My favourite example is using coroutines (say, a Python generator) to provide an input stream (file or socket) split into lines.

A lot of common tasks are easy to describe/implement using coroutines.  You could even argue that most event models used in GUIs are better described using classes and coroutines, rather than as actions on objects.  (The main difference being, exactly who "owns" or controls the "event loop", and the contexts in which each action occurs.)

To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices.
The Spectre and Meltdown bugs are good examples of how device developers don't really have a good picture of the repercussions the device behaviour has in the real world.  Just look at the Linux Kernel Mailing List, to see how often the kernel developers need to take GCC devs to task (when the generated code is absolutely silly or useless, and the GCC devs' defence is that the standard allows it).  Or at the number of questions forwarded to hardware developers when architecture-specific details like locking are implemented (or how locking and cache concurrency interact).

Hmm. I think this is a solved problem. It's called "continuation-passing style". Functions never return .. they just tail-call the thing that wanted their result.
Yes, that is one way to implement it. However, what I meant was a distinction similar to the one Pascal and Fortran make between functions (which return a value) and procedures/subroutines (which do not); even a low-level language could easily have functions and coroutines that follow different sets of language rules.

It is not clear to me that CPS is the best syntactic construct to describe coroutines. (That is, I find it human-error-prone in practice.) For example, if you compare Python generators with code using CPS, you'll find that Python generators, with yield producing one value at a time for a "caller" to consume, yield easier-to-maintain code.  In a low-level language, such a construct can be implemented via CPS, or by creating a nanothread/fiber: a context that has its own stack and local variables, but otherwise shares state with its parent thread (and in general, is non-reentrant).  While the two are not that different in the underlying implementation using current hardware architectures, at the syntactic level (how they are expressed in human-readable code) they're very, very different. I've always thought of CPS as a large step in the direction of functional languages, myself, so perhaps that colors my view unduly, though.

I'm also not a language developer, myself; other than script-type domain-specific tool languages, I haven't developed my own, or even looked at the state of the art.  So I could easily be wrong here.



Back on topic: Even if one is most interested in algorithms in C or C++, I've found it extremely rewarding to explore various algorithms in different programming languages.  I believe (but have no proof!) that it widens one's mental focus, i.e. helps see "outside the box", when encountering new problems.  Weiss's book helped me look at algorithms separately from their implementations, at a somewhat abstract level, based on the descriptions of the intent behind each algorithm.  Then again, I was very familiar with trees and lists when I first read the book.

Right now, I'd focus much more on memory access patterns and cache behaviour -- and getting data flowing continuously, as opposed to being treated as a single in-memory set that is operated on --, as that in my experience helps with the biggest issues with existing programs: latencies.  I find it horrific that a single program can take seconds to start up on current desktop systems; the amount of work done on every single invocation seems not only a waste, but terribly aggravating to users, too.  Similarly for a GUI stuttering or becoming unresponsive, because the program is too busy to respond to user action.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #68 on: July 11, 2018, 02:08:17 pm »
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....

I disagree, since your statement appears very x86 centric.

I disagree, since Java faces the same problems, and successfully addressed them >20 years ago.

C/C++ claims to be low-level, close to the hardware, efficient, available on (almost) all architectures. That's their downfall, since it is trying to be all things to all people.

Until recently C explicitly stated that multiprocessing type things were part of the library and outside the language. That's a problem for library developers, since they are forced to rely on things the language doesn't guarantee.
What is x86 specific about what I said? People have struggled trying to make things like lock free algorithms rock solid on most machines, because of iffy things in the memory model.

You mentioned Intel and AMD, and didn't mention any other manufacturer. It is a long time since they produced anything other than x86 machines.

Quote
I do agree that having no reliable memory model in C and C++ hasn't been very helpful.

It is more than "hasn't been helpful"; it is a killer omission. And a blindingly obvious omission at that!

Quote
However, the inertia of those languages means its hard to compare them to the freely ripped up and replaced nature of early Java.

Shrug. I think you meant "inertia of those language designers".

Presumably having a C/C++ memory model is possible, and has been possible for >20 years. Why, then, did the language designers repeatedly avoid creating a memory model?

None of the possible answers reflect well on the language designers.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #69 on: July 11, 2018, 02:28:38 pm »
All those are valuable for more than "just" asymmetric multiprocessing code.
Oh, absolutely agreed.  My favourite example is using coroutines (say, Python generator) to provide an input stream (file or socket) split into lines.

A lot of common tasks are easy to describe/implement using coroutines.  You could even argue that most event models used in GUIs are better described using classes and coroutines, rather than as actions on objects.  (The main difference being, exactly who "owns" or controls the "event loop", and the contexts in which each action occurs.)

Yes indeed. I first did that in C in 1982, for a hard realtime system. There was a little bit of processor-dependent assembler in the RTOS, but that's inevitable and of no significant consequence.

Quote
I'm also not a language developer, myself; other than script-type domain-specific tool languages, I haven't developed my own, or even looked at the state of the art.  So I could easily be wrong here.

The trouble with domain-specific languages is that they grow like cancer, there is no off-the-shelf tool support, and no off-the-shelf developers. OTOH, domain-specific libraries are essential and don't have any of those drawbacks.

Quote
Back on topic: Even if one is most interested in algorithms in C or C++, I've found it extremely rewarding to explore various algorithms in different programming languages.  I believe (but have no proof!) that it widens one's mental focus, i.e. helps see "outside the box", when encountering new problems.

IMNSHO there is no doubt about that.

When I first looked at Java in 1996, the white paper was impressive since it said, in effect: "we have incorporated T and U, which have previously been shown to work in languages J and K respectively; T and U work well together to achieve..." In contrast, most C/C++ papers were incestuous and ignorant, referring only to other C/C++ papers.

Quote
Right now, I'd focus much more on memory access patterns and cache behaviour -- and getting data flowing continuously, as opposed to being treated as a single in-memory set that is operated on --, as that in my experience helps with the biggest issues with existing programs: latencies.  I find it horrific that a single program can take seconds to start up on current desktop systems; the amount of work done on every single invocation seems not only a waste, but terribly aggravating to users, too.  Similarly for a GUI stuttering or becoming unresponsive, because the program is too busy to respond to user action.

L1/2/3 cache is the new RAM, RAM is the new disk.

Have a look at xC/xCORE. Start with https://www.xmos.com/published/xmos-programming-guide which gives a rapid tutorial introduction to the principal features. The first part is about multicore programming (none of the "this is a loop..." detritus) and getting i/o to occur on known clock cycles, and the second part gives longer examples.

Even if you never use xC, you'll begin to understand the underpinnings of some language constructs in Rust and Go :)
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #70 on: July 11, 2018, 03:42:54 pm »
Sure, you can do better than anything writing in assembler.
No, that's not what I meant.  What I meant was that using C constructs describing explicit vectorization, I can do a better job than a C compiler can.

The intrinsics you use in C are just mutilated assembler instructions.

However, most cases are simple
I've dealt with a lot of C code, and that is just not true.  It is only true if your C program is horribly inefficient or imprecise to start with, at the algorithmic level, and no compiler magick will help then.

I don't quite understand what you're saying. Are you saying that a program which calculates a sum of an integer array is horribly inefficient?

For example, if you need to calculate a sum of more than a couple of dozen floating-point terms in practice, you'll find you'll either need to ensure they are in monotonic order, or do a Kahan sum, or you'll lose precision.

You just use adequate precision (say double instead of float), or, if that is not enough, you go to longer integers. It's much easier and more efficient than sorting the arrays. Adequate precision means the minimum precision which fits your requirements.

Of course, C99 says the compiler cannot re-arrange floating point expressions. But there must be a compiler flag to circumvent this (GCC's -ffast-math, for example, allows such re-arrangement).

A very typical operation is normalizing vectors to unit length. (Which, by the way, might not have anything to do with geometry, because a large number of problems can be simplified this way.) It turns out that if the components of a vector are consecutive in memory, it is especially hard for the compiler to vectorize the norm squared, and the compiler-generated code tends to do the square root operation separately.  However, if you write the C code such that you explicitly load vectors so that each vector register has one component from consecutive vectors, you get near-optimal code, even with GCC (which is terrible at vectorization, compared to Intel C/C++).  Because of how C is defined via an underlying virtual machine with sequence points, this is very, very hard for a C compiler to detect the pattern in code and optimize it.  (I think ICC does detect the most typical pattern -- i.e., brute-force detection of each pattern is the only way those can be optimized in C --, but GCC does not.)

Calculating a sum of squares is not much more complex than a simple sum. If your SIMD register supports 4 numbers, you simply calculate 4 sums (one sum for elements 0,4,8,..., another for 1,5,9, etc.) then you sum them up at the end. Thus, the calculation for a single array can be vectorized. You do not need multiple arrays to get the benefit. The Intel compiler should be able to do it.

Similarly, if you want to do e.g. matrix-matrix multiplication between large matrices (a common problem in many optimization problems), the bottleneck is the cache pattern. The solution is to use temporary memory, and copying the contents of at least one of the matrices, so that they both will have very linear access patterns. Ulrich Drepper's 2007 article is a very detailed look into this, including the data when vectorized using SSE2 instructions. The C language itself is such that it does not allow a compiler to do this automatically (and it is not even clear if one would want a compiler to do this automatically).

You're mixing two different operations - vectorization and cache optimization.

You can do vectorization without moving elements. Newer Intel CPUs even have scatter/gather instructions for this (although I don't think these instructions add much speed).

or

You can re-arrange your matrices for cache access and then do vectorization.

In either case, the vectorization case is simple, and the Intel compiler should be able to do vectorization for you. You don't need to write SSE code (and then a few years later re-write it for AVX, and then for AVX-512, and then who knows what). This is the same logic as using C instead of assembler. You give up a little bit of efficiency, but you get convenience in return. Of course, if you cannot give up any efficiency, then assembler it is.

As to the parallelization with the numbers of cores, I think we're already at the end of this road
Again, I sort of disagree <snip>

This has nothing to do with ever-growing number of Intel cores.
« Last Edit: July 11, 2018, 03:58:18 pm by NorthGuy »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #71 on: July 11, 2018, 04:15:57 pm »
Sure, you can do better than anything writing in assembler.
No, that's not what I meant.  What I meant was that using C constructs describing explicit vectorization, I can do a better job than a C compiler can.

The intrinsics you use in C are just mutilated assembler instructions.

So you agree with him.

If you talk to the high performance computing (HPC) community (who have always been on the bleeding edge), you will find there are numerous cases where C compilers have to be pessimising. Typical examples involve the inability of the compiler to prove there is no aliasing.

And in that area, C had a long-running dilemma in the early 90s:
  • should it be impossible to "cast away constness", thereby allowing the compiler to make optimisations
  • must it be possible to "cast away constness", thereby allowing some forms of higher level programming and debugging
both of which are perfectly valid positions to take.

I remember watching the committee(s) debate that endlessly for well over a year - until I realised they were attempting to square a circle, decided C/C++ was becoming too complex for its own good, and realised that C/C++ was part of the problem rather than part of the solution. Fortunately Java came along for higher-level applications.

Bjarne Stroustrup on C++ (but C has many of the same problems):
"It is worth remembering that many in the C++ community still have not completely caught up with C++98."
"Even I can’t answer every question about C++ without reference to supporting material (e.g. my own books, online documentation, or the standard). I’m sure that if I tried to keep all of that information in my head, I’d become a worse programmer."
from http://www.stroustrup.com/japanese2010.pdf

If even he had that problem back in 2010, what chance have mere mortals?
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #72 on: July 11, 2018, 05:08:15 pm »
The intrinsics you use in C are just mutilated assembler instructions.
No, they are C functions that implement single-instruction, multiple-data operations that the underlying hardware also supports.  The form and restrictions are often not the same, either. (In particular, with regards to the result register.)

They can be mapped to multiple architectures (see GCC, which can do multi-arch shuffles quite efficiently already), although aside from fused multiply-add, there hasn't been any push in that direction yet.

Are you saying that the program which contains calculations of a sum of an integer array is horribly inefficient?
I am saying that having to calculate a sum of an integer array is rare.  Having to calculate a straight sum of floating-point values is even more rare.  I am also saying that whenever I've seen that done in a real life application (for an array with more than a couple of dozen elements), it has usually been a sign of an inefficient approach to the underlying problem. Usually, there has either been a way of avoiding said array and computation altogether, or a more efficient data structure.  (Disjoint sets come up surprisingly often.)

You just use adequate precision (say double instead of float)
No! Precision is not the issue. The set may contain cancelling terms.  As an extreme example, consider 1e30 + 1e-30 - 1e30 = 1e-30. Both sorting according to decreasing magnitude, and Kahan sum, yield the correct result for both IEEE-754 binary32 (float) and binary64 (double).

Many sets one might need to sum often contain such cancelling terms.  (It is especially typical in geometric problems, where each term is a result of several products. Because the magnitude doubles for each product, having vectors both shorter and longer than 1, easily lead to data sets with such wildly varying magnitudes.)

Thus, the calculation for a single array can be vectorized.
You don't understand. Let me show a practical example.

Let v1, v2, and v3 be vector registers, each containing a 3-component vector. (For illustration, this architecture has 3-component vectors.)
To calculate the three norms into v4 one needs to do
Code: [Select]
    v1 = v1 * v1
    v2 = v2 * v2
    v3 = v3 * v3
    v4.1 = v1.1 + v1.2 + v1.3    !
    v4.2 = v2.1 + v2.2 + v2.3    !
    v4.3 = v3.1 + v3.2 + v3.3    !
    v4 = sqrt(v4)
The lines marked with ! are problematic. First, they contain a horizontal sum, which is rarely supported (as it requires interaction between elements in the same vector register); when it is supported, it tends to be slow, too.  Second, they contain an assignment to a specific element in a vector register, which is often slow or requires more than one operation. The exceptions that support the above just fine are CUDA and OpenCL.

However, if you have the x coordinates in v1 for the three vectors, the y coordinates in v2, and the z coordinates in v3, you can do
Code: [Select]
    v1 = v1 * v1
    v2 = v2 * v2
    v3 = v3 * v3
    v1 = v1 + v2 + v3
    v4 = sqrt(v1)
In fact, GCC supports the above in GNU C using architecture-independent vector types, except for the square root.
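
For reference, a rough sketch of that in GNU C vector notation (vector_size is a GCC/Clang extension; the scalar loop at the end is the square-root part the compiler is not given a vector form for):

Code: [Select]
typedef float v4sf __attribute__((vector_size(16)));  /* 4 floats */

/* x, y, z each hold one component of four different vectors (SoA). */
void norms4(v4sf x, v4sf y, v4sf z, float out[4])
{
    v4sf n2 = x*x + y*y + z*z;      /* elementwise, no horizontal ops */
    for (int i = 0; i < 4; i++)
        out[i] = __builtin_sqrtf(n2[i]);   /* sqrt done separately */
}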

You might think that the two do not differ that much, but in reality, for SSE2/AVX, they do. So much so that it is better to write the code as
Code: [Select]
    v1 = (x0 y0 z0 x1)
    v2 = (y1 z1 x2 y2)
    v3 = (z2 x3 y3 z3)
    Register shuffling, using two temp registers, yielding
    v1 = (x0 x1 x2 x3)
    v2 = (y0 y1 y2 y3)
    v3 = (z0 z1 z2 z3)
    v1 = v1*v1
    v2 = v2*v2
    v3 = v3*v3
    v1 = v1+v2
    v3 = v3+v1
    v3 = sqrt(v3)
    Store v3
but for 6, 9, or 12 vectors at a time (staggered). The efficiency gains are significant.

You're mixing two different operations - vectorization and cache optimization.
One might think so, but I am not.  You see, the two are intricately intertwined, because the vector registers do not support a fast scatter/gather for similar reasons why horizontal summing is rare or slow even when supported. 

In either case, the vectorization case is simple, and the Intel compiler should be able to do vectorization for you.
No! If it does, it breaks the C standard!

Don't you see, the problem is threefold. One is that data access patterns are hugely important, and C does not provide any construct to allow the compiler to choose the one that best matches the underlying hardware.  The second is that there is no way to relax the sequence point rules in C (i.e., the order in which the side effects of each operation occur).  The third is that you cannot specify a sequence of operations to be done in parallel to another, for example to relax the order in which a loop is iterated in (compare to forall loop construct in Fortran).

You give up a little bit of efficiency, but you get convenience in return. Of course, if you cannot give up any efficiency, then assembler it is.
I thought I was explaining the features lacking in C, and all other low level languages (Forth?), that would allow one to concisely write much more efficient programs.

This has nothing to do with ever-growing number of Intel cores.
I was pointing out how other manufacturers use asymmetric cores to achieve much better performance per watt used.  Intel is not exactly a forerunner here.  Just because Intel does something stupid, does not mean the entire world is stupid.

I remember watching the committee(s) debate that endlessly for well over a year - until I realised they were attempting to square a circle, decided C/C++ was becoming too complex for its own good, and realised that C/C++ was part of the problem rather than part of the solution.
I agree.  I do believe it would be possible to construct a low-level programming language better suited for current architectures than C, and attempted to highlight the major differences to C in my posts.  (Other details include non-binary integer types, floating-point types other than Binary16, Binary32, Binary64, and Binary128, fixed-point arithmetic, exposing arithmetic flags (overflow, carry, zero) as a result of integer operations, and so on.) I do not believe it is possible to design such a language by a committee, however.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19458
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #73 on: July 11, 2018, 06:07:02 pm »

Thus, the calculation for a single array can be vectorized.
You don't understand. Let me show a practical example.

I would make that statement in a wider context!

Quote
I remember watching the committee(s) debate that endlessly for well over a year - until I realised they were attempting to square a circle, decided C/C++ was becoming too complex for its own good, and realised that C/C++ was part of the problem rather than part of the solution.
I agree.  I do believe it would be possible to construct a low-level programming language better suited for current architectures than C, and attempted to highlight the major differences to C in my posts.  (Other details include non-binary integer types, floating-point types other than Binary16, Binary32, Binary64, and Binary128, fixed-point arithmetic, exposing arithmetic flags (overflow, carry, zero) as a result of integer operations, and so on.) I do not believe it is possible to design such a language by a committee, however.

There was a period, in say the early 90s, when C/C++ could have chosen to be either a high level language for large scale applications or a low level systems language. Either choice would have been reasonable, and arguably either would have succeeded. But it tried to be both, and failed at both. That's a prime example of it being better to do one thing well than two things poorly.

Fortunately Java came along for the large scale applications, and it is finally being realised that C++ is the new COBOL. Whether new languages will gradually supplant C for systems programming is yet to be determined.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #74 on: July 11, 2018, 06:43:02 pm »
The intrinsics you use in C are just mutilated assembler instructions.
No, they are C functions that implement single-instruction, multiple-data operations that the underlying hardware also supports.  The form and restrictions are often not the same, either. (In particular, with regards to the result register.)

Looks exactly the same to me. You know as they say: " ... don't multiply entities beyond necessity".

Are you saying that the program which contains calculations of a sum of an integer array is horribly inefficient?
I am saying that having to calculate a sum of an integer array is rare.  Having to calculate a straight sum of floating-point values is even more rare.  I am also saying that whenever I've seen that done in a real life application (for an array with more than a couple of dozen elements), it has usually been a sign of an inefficient approach to the underlying problem. Usually, there has either been a way of avoiding said array and computation altogether, or a more efficient data structure.  (Disjoint sets come up surprisingly often.)

Depends on what you're doing. May be rare for some people and frequent for others. It's silly to assign labels without reviewing the particular situation.

Besides, the two examples you had - calculating the vector length and multiplying matrices are not that far from the sum of the array in terms of complexity.

You just use adequate precision (say double instead of float)
No! Precision is not the issue. The set may contain cancelling terms.  As an extreme example, consider 1e30 + 1e-30 - 1e30 = 1e-30. Both sorting according to decreasing magnitude, and Kahan sum, yield the correct result for both IEEE-754 binary32 (float) and binary64 (double).

Many sets one might need to sum often contain such cancelling terms.  (It is especially typical in geometric problems, where each term is a result of several products. Because the magnitude doubles for each product, having vectors both shorter and longer than 1, easily lead to data sets with such wildly varying magnitudes.)

It definitely is about precision. Say you need to know distances with 1 mm precision and your biggest distance is 100 km. You realize that you will lose some precision if you use 32-bit floats, but you'll be OK with 64-bit doubles. With 64-bit doubles you will never lose a single mm, even if you have "cancelling" terms.

BTW: that's when you start to realize the superiority of integers. Using 32-bit integers would be enough (100 km is 10^8 mm, well within the ±2^31 range of a 32-bit int) - a two-fold reduction in memory and better calculation speed. Not to mention a possibility of using 64-bit integer accumulators for products.

Thus, the calculation for a single array can be vectorized.
You don't understand. Let me show a practical example.

Let v1, v2, and v3 be vector registers, each containing a 3-component vector. (For illustration, this architecture has 3-component vectors.)

I had in mind the likes of 1000-dimensional vectors. With 3-dimensional vectors, square-rooting dominates the performance, so your performance will be similar as soon as you vectorize the square-rooting. The square-rooting takes really long. Everything else will be done in parallel with the square-rooting (unless, of course, you insert some stupid dependency).

I would guess the compiler would figure 3-dimensional vectors out. Graphics manipulation is a very common task; I guess they would attack it first.

BTW: Intel already had horizontal addition/subtraction in SSE3.

In either case, the vectorization case is simple, and the Intel compiler should be able to do vectorization for you.
No! If it does, it breaks the C standard!

Don't you see, the problem is threefold. One is that data access patterns are hugely important, and C does not provide any construct to allow the compiler to choose the one that best matches the underlying hardware.  The second is that there is no way to relax the sequence point rules in C (i.e., the order in which the side effects of each operation occur).  The third is that you cannot specify a sequence of operations to be done in parallel to another, for example to relax the order in which a loop is iterated in (compare to forall loop construct in Fortran).

The C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As soon as this holds, the compiler is free to do whatever it pleases.

One "bummer" here is floating point - because the things you describe - A + B - A may not yield B. This holds the compiler optimizations - it cannot replace (A+B-A) with B. There are two solutions - allow compiler to re-arrange floats, or use integers wherever possible.

This has nothing to do with ever-growing number of Intel cores.
I was pointing out how other manufacturers use asymmetric cores to achieve much better performance per watt used.  Intel is not exactly a forerunner here.  Just because Intel does something stupid, does not mean the entire world is stupid.

For some reason you quoted my post where I was talking about Intel (although you did cut it off right before the "Intel" word). I mistakenly assumed you were responding to my post. My apologies.

 

