Author Topic: Books on Basic and Advanced Algorithms to program in C/C++


Online luiHS (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 592
  • Country: es
 
Hello.

I am looking for good books on basic and advanced algorithms for programming in C/C++: structures, sorting, searching, graphics, etc.

Any suggestions for good books on these subjects? In my case it is for programming microcontrollers, but I guess that does not matter much; the same material applies to C/C++ programming on any hardware, such as a PC or a Raspberry Pi.

At the moment I am interested in hashing/checksum algorithms to detect low-resolution images, for example using CRC32, MD5, etc.
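For reference, a minimal sketch of the kind of thing I mean: the standard bit-by-bit CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and Ethernet). A table-driven version would be faster; this just shows the idea:

Code: [Select]
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;              /* standard initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)          /* one bit at a time */
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;                             /* standard final XOR */
}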

I've been looking on Amazon, and I've selected these books for the time being, but I do not know if they're good books, and some are expensive if they do not really offer what I'm looking for.

https://www.amazon.es/gp/product/8429126627/ref=ox_sc_sfl_title_14?ie=UTF8&psc=1&smid=A1AT7YVPFBWXBL
https://www.amazon.es/gp/product/032157351X/ref=ox_sc_sfl_title_15?ie=UTF8&psc=1&smid=A1AT7YVPFBWXBL
https://www.amazon.es/gp/product/0321751043/ref=ox_sc_sfl_title_16?ie=UTF8&psc=1&smid=A1AT7YVPFBWXBL
https://www.amazon.es/gp/product/B00M0OE4GA/ref=ox_sc_sfl_title_17?ie=UTF8&psc=1&smid=AD51AJ46QU3GJ


Best Regards
« Last Edit: July 05, 2018, 10:45:44 am by luiHS »
 

Offline photovore

  • Newbie
  • Posts: 9
  • Country: de
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #1 on: July 05, 2018, 12:56:23 pm »
I enjoyed reading through "Matters Computational". Available as a PDF from the author's website: https://www.jjj.de/fxt/fxtbook.pdf
 
The following users thanked this post: Kjelt, hamster_nz

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #2 on: July 05, 2018, 01:17:59 pm »
The Knuth books are considered classics, but they are quite old. The algorithms considered of high importance in the 1970s aren't necessarily the ones of high importance today. In general, beware of "algorithm" books. They can be rather unfocussed, giving you a sprinkling of algorithms from different disciplines, without enough coverage of any one discipline to be very useful. Most people need books that cover a specific area thoroughly, like:
  • A book on very low level bit and arithmetic manipulations
  • A structures, sorting, and searching book
  • A graphics book
  • An image processing book (although there are books on graphics + image processing)
  • A book on cryptography and related security areas
 

Offline Kjelt

  • Super Contributor
  • ***
  • Posts: 6460
  • Country: nl
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #3 on: July 05, 2018, 01:31:20 pm »
Most programmers these days look on sites like stackoverflow; if you have some experience you can easily see which solutions are gold and which are garbage.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #4 on: July 05, 2018, 01:44:30 pm »
Most programmers these days look on sites like stackoverflow; if you have some experience you can easily see which solutions are gold and which are garbage.
I thought stackoverflow was where information went to die.  :) How often have you actually solved a problem using information from stackoverflow?
 

Offline Kjelt

  • Super Contributor
  • ***
  • Posts: 6460
  • Country: nl
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #5 on: July 05, 2018, 01:58:08 pm »
Most programmers these days look on sites like stackoverflow; if you have some experience you can easily see which solutions are gold and which are garbage.
I thought stackoverflow was where information went to die.  :) How often have you actually solved a problem using information from stackoverflow?
It was an example; there are more sites out there.
How often? I think perhaps 8 times or so in the last 10 years.
Mostly I use the examples/insights to code something different, more suitable to my problem, but it does help from time to time to see some different angles/ideas for a problem.
Discussing with colleagues can also be a great help, of course.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #6 on: July 05, 2018, 02:00:25 pm »
Any decent C/C++ book should include consideration of the facilities available in the different standards - because they define what you can and can't (in theory) do.

You might like to include:
Frequently Questioned Answers (sic) http://yosefk.com/c++fqa/
C Is Not a Low-level Language https://queue.acm.org/detail.cfm?id=3212479
Into the Depths of C: Elaborating the De Facto Standards https://www.cl.cam.ac.uk/~km569/into_the_depths_of_C.pdf
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #7 on: July 05, 2018, 02:20:27 pm »
I like this one: http://publications.gbdirect.co.uk/c_book/ . It's mostly about C only: the language, not the libraries. And "The Unix Haters Handbook" is good reading too, e.g.:

#include <unistd.h>   /* for fork() */

int main (void) {
  /* the classic fork bomb: every process endlessly spawns more */
  while (1) fork();
}
« Last Edit: July 05, 2018, 02:37:36 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline mathsquid

  • Regular Contributor
  • *
  • Posts: 247
  • Country: us
  • I like math.
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #8 on: July 05, 2018, 02:49:59 pm »
I'd avoid the Art of Computer Programming. It's a great book, but it is very theoretical, and the implementations of algorithms are given in an assembly language that Knuth describes in the book. I don't think it would be the best way to get practical advice for writing algorithms in C.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #9 on: July 05, 2018, 03:07:23 pm »
You might consider looking for books at Alibris.com.  I get many of my older books there, but every once in a while they will have an attractive price on a new book.  OTOH, some of the sellers think their books are gold plated.

There's a huge difference between C++ on a PC, where memory and speed abound, and the code you might run on a small microcontroller.  I don't think the Standard Template Library (STL) is the kind of thing I would use on a uC, but I would certainly use it on a PC.  All the work for structures and methods is already done.  If I need a queue, all I need to do is say so.  Very powerful stuff!

https://www.amazon.com/C-Standard-Template-Library/dp/0134376331
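To illustrate the "all I need to do is say so" point, a minimal sketch (not from the book above, just a generic example):

Code: [Select]
#include <iostream>
#include <queue>

int main()
{
    std::queue<int> q;            // the STL does all the structure work
    q.push(1);
    q.push(2);
    q.push(3);
    while (!q.empty()) {          // drained in FIFO order: prints 1 2 3
        std::cout << q.front() << ' ';
        q.pop();
    }
    std::cout << '\n';
}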

Graphics is an entirely different animal.  At one level, it is all matrix algebra.  Rotating, scaling, and translating graphic elements is strictly a matrix problem.  It gets more involved as you move up from the pixel level, because images for video games are generally built from vertices (triangles), and you can spend a lifetime working on this idea.

https://www.google.com/search?q=game+graphics+vertices
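As a minimal sketch of the matrix idea, rotating a 2D point about the origin (the 3D and homogeneous-coordinate versions are the same idea with bigger matrices):

Code: [Select]
#include <math.h>

typedef struct { double x, y; } point;

/* Rotate p about the origin by angle a (radians):
 *   [x']   [cos a  -sin a] [x]
 *   [y'] = [sin a   cos a] [y]
 */
point rotate(point p, double a)
{
    point r;
    r.x = p.x * cos(a) - p.y * sin(a);
    r.y = p.x * sin(a) + p.y * cos(a);
    return r;
}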

There are a lot of books.  I would tend to sort on publication date.  Newer would be better.

I haven't done it but I understand it can be fun to use graphics cards as numeric processors.

https://www.google.com/search?q=cuda+programming
 

Offline bsfeechannel

  • Super Contributor
  • ***
  • Posts: 1667
  • Country: 00
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #10 on: July 05, 2018, 04:22:33 pm »
Has anyone mentioned "Mastering Algorithms With C" by Kyle Loudon?
 

Offline ralphrmartin

  • Frequent Contributor
  • **
  • Posts: 480
  • Country: gb
    • Me
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #11 on: July 05, 2018, 05:01:33 pm »
Sedgewick's algorithm books are excellent. I thoroughly recommend them as top books on the topic.

(For what my opinion is worth, I was a professor of computer science until I retired.)
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #12 on: July 05, 2018, 06:32:15 pm »
Has anyone mentioned "Mastering Algorithms With C" by Kyle Loudon?

Excellent book but a little heavy in malloc for embedded systems.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #13 on: July 05, 2018, 07:24:48 pm »
Sedgewick's algorithm books are excellent. I thoroughly recommend them as top books on the topic.

(For what my opinion is worth, I was a professor of computer science until I retired.)

Robert Sedgewick's books are all over Alibris.com and the prices are really low (mostly).
https://www.alibris.com/booksearch?keyword=Sedgewick

As an author, Sedgewick has been prolific!

Retirement is good!  I've been kicking back for 14 years.


 

Offline agehall

  • Frequent Contributor
  • **
  • Posts: 383
  • Country: se
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #14 on: July 05, 2018, 07:43:10 pm »
I'd avoid the Art of Computer Programming. It's a great book, but it is very theoretical, and the implementations of algorithms are given in an assembly language that Knuth describes in the book. I don't think it would be the best way to get practical advice for writing algorithms in C.
If you want to do more than copy&paste code into your software, that isn't a problem. Knuth will give you a fundamental understanding of what you are actually doing and how the algorithms really work. Once you understand that, implementing them in C is trivial.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #15 on: July 06, 2018, 02:55:28 am »
Quote
Sedgewick's algorithm books are excellent. I thoroughly recommend them as top books on the topic.
I second the recommendation. Also, his (free!) online classes on Algorithms are very, very good - among the best online classes I've seen: https://www.coursera.org/learn/algorithms-part1
The classes are currently using Java, but ... "basic and advanced algorithms" ought to be pretty language independent.  In fact, the class FAQ says:
Quote
I have no familiarity with Java.  Can I still take this course?
Our central thesis is that algorithms are best understood by implementing and testing them. Our use of Java is essentially expository, and we shy away from exotic language features, so we expect you would be able to adapt our code to your favorite language. However, we require that you submit the programming assignments in Java.


Quote
a little heavy in malloc for embedded systems.
I do not believe that you will be able to find a book or class in algorithms that does not make heavy use of dynamic allocation.  And recursion, as well.  After you understand the algorithms in their "pure" "computer science" form, you can figure out how to remove these evil dependencies using one of the "how to appease some organization who thinks the way to avoid the large number of bugs in today's software is to write code like it was written 40 years ago" books.
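For what it's worth, the usual embedded substitute for malloc is a fixed-size static pool; a minimal sketch, with made-up sizes:

Code: [Select]
#include <stddef.h>

#define BLOCK_SIZE  32            /* example sizes, tune for the application */
#define BLOCK_COUNT 16

static unsigned char pool[BLOCK_COUNT][BLOCK_SIZE];  /* all storage is static */
static unsigned char used[BLOCK_COUNT];

void *pool_alloc(void)
{
    for (size_t i = 0; i < BLOCK_COUNT; i++)
        if (!used[i]) { used[i] = 1; return pool[i]; }
    return NULL;                  /* pool exhausted: caller must handle it */
}

void pool_free(void *p)
{
    for (size_t i = 0; i < BLOCK_COUNT; i++)
        if (p == pool[i]) { used[i] = 0; return; }
}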

 


Online BravoV

  • Super Contributor
  • ***
  • Posts: 7547
  • Country: 00
  • +++ ATH1
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #17 on: July 06, 2018, 04:54:42 am »
The Knuth books are considered classics, but they are quite old. The algorithms considered of high importance in the 1970s aren't necessarily the ones of high importance today.

Curious about this, please share just one example.


Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #19 on: July 06, 2018, 06:45:23 am »
Quote
Quote
The algorithms considered of high importance in the 1970s [in Knuth] aren't necessarily the ones of high importance today.
Curious about this, please share just one example.
These days, there are a lot more kinds of trees (and/or more tree algorithms) than there are in the original Knuth v2.  (I can't comment on the revised edition, or v3, alas.)  When I was looking at Sedgewick's class, I did a lot of "how come they never mentioned this back in MY data structures class?  Oh.  It wasn't invented yet." :-(
Also, algorithms tend to be described as performing relative to the number of things being operated on:  t = k1 * f(n) + k2
To a large extent, f(n) describes how good your algorithm is, and is of much interest to Computer Scientists ("f(n) = k3 * log2(n)" is SO much better than "f(n) = k3 * n^3"). The k's come largely from how fast your computer is, and how well you code the algorithm (of much interest to Computer Engineers and coders). As computers have gotten faster and n's have gotten larger, algorithms are considered, described, and implemented in ways that they wouldn't have been in the 70s (commonly using recursion and dynamic allocation, for example).
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #20 on: July 06, 2018, 06:48:49 am »
Quote
a little heavy in malloc for embedded systems.
I do not believe that you will be able to find a book or class in algorithms that does not make heavy use of dynamic allocation.  And recursion, as well.  After you understand the algorithms in their "pure" "computer science" form, you can figure out how to remove these evil dependencies using one of the "how to appease some organization who thinks the way to avoid the large number of bugs in today's software is to write code like it was written 40 years ago" books.

Agree. The problem is that it’s virtually impossible to find anyone who actually understands that in 2018.

The right approach is to build what you need within the constraints you have, and thanks to Moore’s law most software engineers have been given way too much slack. We have the same problem when there is so much going on inside one physical compute node that it’s like having each thread run on a low-end ARM core.
 

Offline MattSR

  • Regular Contributor
  • *
  • Posts: 95
  • Country: au
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #21 on: July 06, 2018, 06:53:17 am »
Great thread guys, I have gleaned plenty of info and suggestions!

Thanks!
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #22 on: July 06, 2018, 07:35:21 am »
Most programmers these days look on sites like stackoverflow; if you have some experience you can easily see which solutions are gold and which are garbage.
I thought stackoverflow was where information went to die.  :) How often have you actually solved a problem using information from stackoverflow?

Probably 3 to 10 times a week, on average. I've been programming professionally for 33 years, but there's still plenty I don't know, especially as I continually explore new areas to learn.

Not usually for algorithms though -- I know the standard ones and am pretty good at inventing new ones. For obscure commands or options on commands, mostly.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #23 on: July 06, 2018, 08:45:19 am »
Quote
a little heavy in malloc for embedded systems.
I do not believe that you will be able to find a book or class in algorithms that does not make heavy use of dynamic allocation.  And recursion, as well.  After you understand the algorithms in their "pure" "computer science" form, you can figure out how to remove these evil dependencies using one of the "how to appease some organization who thinks the way to avoid the large number of bugs in today's software is to write code like it was written 40 years ago" books.
Agree. The problem is that it’s virtually impossible to find anyone who actually understands that in 2018.
Not really. There are a huge number of embedded developers who understand it very well. The problem is that the people teaching and writing books rarely understand it, or treat it as a lightweight issue of little consequence.
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #24 on: July 06, 2018, 08:53:15 am »
Yeah, that was the second part of my comment about Moore's Law. The prevailing attitude is "just blow things away in cycles and meh".

We have one other ex-embedded guy (we are in the fintech sector) and everyone considers us nuts compared to the status quo :(
 


Offline rhb

  • Super Contributor
  • ***
  • Posts: 3483
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #26 on: July 06, 2018, 09:49:32 pm »
Personal recommendations:

Algorithms in C
Robert Sedgewick

Introduction to Algorithms
Cormen, Leiserson & Rivest

The Art of Computer Programming
Knuth

I have both editions of Knuth.  I have used it very seldom over the years, but when you need it, there is no substitute.  But I'd suggest you start with the current editions of the first two.  Once you've read those, you can sit down with a copy of Knuth and decide very quickly if it's a reference you want to have.

Other recommendations:

Code Complete
McConnell

Rapid Development
McConnell

A copy of the C and C++ language standards.  Read the relevant section any time you do anything unusual.  Pay particular attention to where it says the behavior is "undefined" or "implementation dependent", and tread very carefully in that area.
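A classic example of the kind of trap those words mark (signed overflow is undefined, so the compiler may assume it never happens and delete the test):

Code: [Select]
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int i = INT_MAX;
    /* Undefined behaviour: signed integer overflow. An optimizer may
       assume i + 1 > i always holds and remove this branch entirely. */
    if (i + 1 < i)
        printf("wrapped\n");
    return 0;
}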

I have 80 ft of computer books.  There are a lot of books that I read and then referenced once every few years.  But having them allowed me to get a much better rate as a contractor than most.
 

Offline Kjelt

  • Super Contributor
  • ***
  • Posts: 6460
  • Country: nl
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #27 on: July 07, 2018, 06:53:10 am »
The other recommendations are pretty old books from Microsoft Press, from the end of the 90s when they were hot. I still have them myself, but I don't know if they are relevant today or whether I would recommend them; so much has changed.

Not for algorithms, but Uncle Bob with his Clean Coder and Clean Architecture series is worthwhile. Although his videos really bug me, he gets the right points across. You don't make friends with a lot of SW archs by showing them his work  :-DD
 

Offline rhb

  • Super Contributor
  • ***
  • Posts: 3483
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #28 on: July 07, 2018, 05:01:09 pm »
The other recommendations are pretty old books from Microsoft Press, from the end of the 90s when they were hot. I still have them myself, but I don't know if they are relevant today or whether I would recommend them; so much has changed.

Not for algorithms, but Uncle Bob with his Clean Coder and Clean Architecture series is worthwhile. Although his videos really bug me, he gets the right points across. You don't make friends with a lot of SW archs by showing them his work  :-DD

I wasn't aware that the concept of writing correct code had become obsolete, though I'll certainly admit it doesn't seem to be as popular as using the latest fad language or development methodology.  The software engineering section of my library is 20 ft or so of shelf space.  The McConnell books were the standouts, with Jon Bentley and a long list of shorter and more obscure books close behind.

After almost 40 years of programming in multiple languages on multiple platforms, the thing that amazes me the most is the degree to which the same mistakes are repeated every 10 years or so.  The typical programmer today has not the least clue as to what is actually going on.  Sadly this was true of most over my working career.  I was the person they came to to sort out why things would not compile or link, or, if they did, did not work.

I've lost count of the number of times I've said, "Section 8.3.5 of the FORTRAN 77 standard states that upon return from a function or subroutine, NAMED COMMON may be undefined.  This was included in the standard to accommodate the Burroughs company which wanted to offer a standard compliant FORTRAN compiler for the B5000 successors.  As it was a stack architecture machine, performance was very dependent upon allocating variables on the stack." 

The early SPARC and MIPS compilers broke a lot of old FORTRAN, so I spent a lot of time explaining this to people I worked with.

I also have an annoying habit of pointing out that some construct in the C or C++ standards is either "undefined" or "implementation dependent".   Of course, we become "friends" again when their code quits working.
 
The following users thanked this post: SiliconWizard

Offline Kjelt

  • Super Contributor
  • ***
  • Posts: 6460
  • Country: nl
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #29 on: July 07, 2018, 10:07:41 pm »
Well, the problem today is more the definition of done, and making sure the SW teams on three continents working on the same stack have the same definition  :palm:
Rapid development has become SAFe agile, and it takes more time for all the SAFe consultants to explain the concept to the project leaders and management: that they should let the SW teams plan for themselves and not meddle with the planning.
A lot has changed; however, you're right in the sense that a lot of the mistakes that were made 30 years ago are still being made today.
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #30 on: July 07, 2018, 10:17:51 pm »
It’s more complicated than that. We’re also still in the steam age of software. Designing software is still a pretty new thing that we haven’t quite sussed yet as a species.

We can all make little things work, but anything large is a disaster in the making.
 
The following users thanked this post: GeorgeOfTheJungle

Offline Kjelt

  • Super Contributor
  • ***
  • Posts: 6460
  • Country: nl
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #31 on: July 07, 2018, 10:24:13 pm »
I agree. What frustrates me the most is that new languages pop up like mushrooms but they do not solve anything meaningful. Anyway, at the end of the 90s a common thought was that by 2020 human programmers would be extinct; computers would program themselves  :-DD
I sometimes think that programmers make new languages and new development and simulation tools just to keep on having work.
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #32 on: July 07, 2018, 10:37:41 pm »
That would have been good. I remember the dreams of sipping cocktails on a Caribbean island while the machines do the work for us. What actually happened was we built DSLs, tools and meta languages so the unskilled business analysts could describe the systems they wanted instead. Unfortunately it turns out they had no idea what they wanted so delegated it back to the programmers who didn’t have time for all that shit and didn’t want to do themselves out of a job. Ergo, wincing and slow head shaking at the analysts whenever they want something and say “that’ll take at least three months as we have to shovel it onto the pile of shit you changed your mind half way through last time”. Actually what happens is the engineers become a poorly educated crap filter neither understanding the problem domain adequately nor the tools they chose because someone had some loud marketing. So we’re all still chipping away at rocks slaving away with no real improvements.

And yes you’re right. I think new programming languages are invented so someone can showboat that they read Knuth and implement it poorly, again, touting advantage X of a language whilst hiding disadvantages Y,Z and A-W.

Problem is programmers are like flies around shit. A new pile of shit appears and they’re on it.

This post was sponsored by 20 years of working in the software sector and half a bottle of red :)
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #33 on: July 07, 2018, 11:29:32 pm »
This post was sponsored by 20 years of working in the software sector and half a bottle of red :)

I watch people play golf. Most of them have very inefficient golf swings; they cannot hit the ball straight or far. New clubs appear all the time, promising instant improvement if only you buy the club. People buy new clubs hoping that the club will help them hit the ball better. It doesn't. But somehow, by some magical way of thinking, they believe that they hit balls better with their new clubs. They truly believe that the new club has helped them a lot. So, when a new club comes up, they buy it again.

Same thing with programmers, and I suspect everywhere else.

About books - I found "Numerical Recipes in C" to be useful. It is not exactly C related, but it covers a wide range of algorithms.

I don't think it is a good idea to build your program from algorithms as if from Lego blocks. Every task is unique and thus requires a unique approach. Most of the time, you need to invent something by yourself.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #34 on: July 07, 2018, 11:48:04 pm »
I don't think it is a good idea to build your program from algorithms as if from Lego blocks. Every task is unique and thus requires a unique approach. Most of the time, you need to invent something by yourself.
Right on. Components are for wimps. Build every solution from raw transistors up, to be finely tuned for purpose.  ;)
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #35 on: July 08, 2018, 12:04:43 am »
DSLs will always exist because the linux kernel or a device driver is one thing (C, systems programming), a web app is another (javascript), a hardware description language another (vhdl), a page description is another (postscript), a database another, the presentation/display+human interface another... etc, etc.

What current (application) languages can't yet do very well is help exploit concurrency and parallelism with ease, in part because it's a relatively recent need (multiple cores), in part because for many years we were saved by ever-faster clock speeds coming to the rescue, in part because due to inertia we as programmers still mostly "think linearly and single-threadedly", and also because simply nobody has yet been clever enough to devise "the right way to do it", which threads (alone) clearly are not.
« Last Edit: July 08, 2018, 12:17:20 am by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #36 on: July 08, 2018, 02:12:22 am »
What current (application) languages can't yet do very well is help exploit concurrency and parallelism with ease, in part because it's a relatively recent need (multiple cores)
This is not true at all. There has been extensive research since the early 1960s into finding and exploiting parallelism in algorithms, from the micro-instruction level up to the most macroscopic level. Machines with 2 to many thousands of cores are not at all new. They are only new to home users. The problem is that a few things are trivial to parallelize, like rendering animation cels, while most things hit "B can't be done before we know the result of A" issues so often that most algorithms won't even allow a reasonable amount of ILP from today's very complex cores.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #37 on: July 08, 2018, 09:49:10 am »
First thing to work out is to find out what you want to achieve.
The second thing is to work out how to do it.
The third thing is to find out what to do when things go wrong.
The fourth thing is to work it all out in a way that you can fix, when you find out you were wrong.
If you do not document your intentions and ideas at each step, you screw someone, usually the later version of yourself.

I liked Weiss' old Data Structures and Algorithm Analysis in C, because not only does it show good examples of various data structures, it also shows you how to evaluate and compare them along different properties (time and space complexity, best/worst/average case).

It seems that a lot of developers do not realize that on many architectures (possibly all non-microcontroller ones), typical tasks are constrained by memory bandwidth, and not CPU speed. Even in things like Discrete Cosine Transform (used in most audio and video compression schemes), the heavy computation itself is easy to optimize; the hard part is to efficiently decode and extract the components out of a binary stream.

Current server and desktop architectures use multi-level caches to mitigate this, which means that to maximize memory efficiency, the memory access patterns must be such that the CPU can predict them to a high degree, transferring data between cache levels and memory efficiently, without unnecessary stalling.

A possibly surprising result is that if you implement some algorithm that behaves optimally in a microbenchmark, using that same algorithm in a real program on real data may be slower than an implementation that behaves worse in that same microbenchmark.  This is because in microbenchmarks, the cache is only used by the implementation being tested. In real-life programs, the cache is shared by the entire task, and can even be evicted every now and then due to process switches on that core.  So, if you design an algorithm for a CPU with specific sizes at specific cache levels, it is really unlikely to perform that way in a real program. However, that does not seem to hinder people from developing just such algorithms and implementations...

Whenever parallel (or distributed) programming is discussed, the most important part for actual software development (the level at which parallelism yields the best results) is typically skipped completely.

You can develop a 3D dot product for three CPU cores, and optimize it to perform optimally even cache-wise, before you realize that saving and restoring the internal hardware state needed for each execution thread costs several orders of magnitude more than any speedup such an approach might gain, for any number of vectors.

The nastiest part about that is that it is usually a human-level question, not something you can compute or automatically find out or even assume.  Sometimes the user wants a task to be done in the background, using the fewest resources needed, since it can take a few minutes longer while the user is doing something more important.  Other times, the CPU resources needed are not important, as long as the results are obtained with as little latency (in as little wall clock time) as possible.

Then, there is the problem of what "performance" means.  When optimizing computation for speed, do you optimize wall clock time taken or CPU time taken? Those two are very, very different things.  Or by performance, do you mean bandwidth? What about latencies? Or bounded execution times (any sort of guarantees on performance)?

Many HPC simulations involve the Monte Carlo method. One of my favourite gripes is when a large simulation is spatially divided to allow the same simulation to be calculated in parallel, i.e. distributed over multiple computational cores.  If each random change affects a nonzero neighborhood, and they always do if the simulation is at all useful, changes near the CPU core boundaries need to be communicated between cores, or the core boundaries will show up in the final results.  In most scientific papers on Molecular Dynamics where the base energy state of some chunk of matter is found using Monte Carlo simulations and atom swaps, you can clearly see the boundaries, because communication between cores simply slows down the computation way too much -- several orders of magnitude slower, in fact.  As far as I know, the solution, sliding windows, is still not proven to retain detailed balance (meaning, whether it yields the same results as a single simulation), but it definitely beats the results containing core boundaries.  My own solution, using a space-filling curve and transmitting information during computation, at least one window step ahead of the computation, is apparently too complicated for typical scientists to grok.

During that research, about a decade ago, I came to the conclusion that most programmers (or developers) do not understand that to get high performance, data must flow constantly.  You cannot do computation, then communication, in turns: you need to do them at the same time to get your money's worth out of the hardware you have.

Using larger buffers does not help, either. At each buffer, additional latency is incurred while the data flows to fill the buffer before it continues onwards. This is known as bufferbloat in packet-switched networks, but it affects all software, really.  After all, when you read data from storage, do some kind of work on it, then write the data out, you do exactly the same thing a network switch does to packets.

Apologies for the ranty wall of text.
 
The following users thanked this post: GeorgeOfTheJungle

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #38 on: July 08, 2018, 10:27:09 am »
What current (application) languages can't yet do very well is help exploit concurrency and parallelism with ease, in part because it's a relatively recent need (multiple cores), in part because for many years we were saved by ever-faster clock speeds coming to the rescue, in part because due to inertia we as programmers still mostly "think linearly and single-threadedly", and also because simply nobody has yet been clever enough to devise "the right way to do it", which threads (alone) clearly are not.

Not true.

One significant example is:
1970s: Tony Hoare (of Quicksort and monitors fame, and null-reference infamy) invents Communicating Sequential Processes (CSP) as a formalism for expressing and analysing parallel processing
1980s: CSP embodied in Occam running on Transputers
2000s: xC and XCORE are the spiritual descendants of Occam and the Transputer, for embedded systems.

Today you can buy a 4000MIPS chip with 32 processors and an IDE that guarantees min/max loop and i/o timings to a 4ns resolution. Cost: £22/23 one off, from Digikey https://www.digikey.co.uk/products/en/integrated-circuits-ics/embedded-microcontrollers/685?k=&pkeyword=&pv1989=0&FV=fffc0370%2Cffe002ad%2Cac4001b&quantity=0&ColumnSort=0&page=1&stock=1&pageSize=25

Are they suitable for all problems? Of course not.
Are they perfect? Of course not.
Are they useful? Demonstrably yes, given the commercial investment and products.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #39 on: July 08, 2018, 10:54:09 am »
Most software projects depend on staff availability these days so that kind of shoots that down.
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #40 on: July 08, 2018, 10:56:45 am »
IMO the focus has been mostly on objects (OO): it's been the fad, objects everywhere, in many, many cases unnecessarily, and not so much interest in concurrency. Also, a generation has lost their time with Java and irreparably damaged their brains forever with classical inheritance  >:D
« Last Edit: July 09, 2018, 11:11:23 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #41 on: July 08, 2018, 11:13:00 am »
OO is fine if you program to interfaces and favour composition over inheritance.

Again, going back to an earlier comment, most of the theoretical stuff works fine in a very small problem domain but doesn't scale well. OO sucks in very small problem domains but scales to large ones.
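A minimal C++ sketch of that idea (all names made up): the service is composed with a logger that it only knows through an interface.

Code: [Select]
#include <iostream>
#include <memory>
#include <utility>

// Program to an interface...
struct Logger {
    virtual void log(const char *msg) = 0;
    virtual ~Logger() = default;
};

struct ConsoleLogger : Logger {
    void log(const char *msg) override { std::cout << msg << '\n'; }
};

// ...and favour composition: Service *has a* Logger, it does not inherit one.
class Service {
    std::unique_ptr<Logger> logger;
public:
    explicit Service(std::unique_ptr<Logger> l) : logger(std::move(l)) {}
    void run() { logger->log("service running"); }
};

int main()
{
    Service s(std::make_unique<ConsoleLogger>());
    s.run();
}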
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #42 on: July 08, 2018, 11:51:57 am »

Hello.

I am looking for good books on basic and advanced algorithms for programming in C/C++: structures, sorting, searching, graphics, etc.
'C++ Data structures' by Dale Teague is a nice book to read about how to deal with data. It shows some inner workings of the STL data storage containers.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #43 on: July 08, 2018, 10:39:46 pm »
Quote
the concept of writing correct code had become obsolete ... "Section 8.3.5 of the FORTRAN 77 standard states
... some construct in the C or C++ standards is either "undefined" or "implementation dependent".
Yeah, sure, but now we're well out of the context of "algorithms and data structures", at least as commonly taught.  Good coding practices, standards adherence, requirements specification, effective debugging...  Not subjects for an algorithms book.  (And I'm sure we've all known a CS PhD with recognizable brilliance in some segment who "can't code his way out of a paper bag", or who "accidentally" blew the memory or performance budget of some less-popular product he wasn't familiar with...)  (I remember ... Smalltalk on Sun-1 systems, and how unimpressed I was at how much you had to deck out the system to get "moderately acceptable" performance.  Well beyond standard memory configurations, for instance.)


Quote
the focus has been mostly on objects
Now THAT'S an interesting point.   Has anyone seen a book that specifically covers "Objects" as a data type and/or target for standard algorithms?

 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #44 on: July 09, 2018, 07:38:31 pm »
What current (application) languages can't yet do very well is help exploit concurrency and parallelism with ease, in part because it's a relatively recent need (multiple cores)
This is not true at all. There has been extensive research since the early 1960s into finding and exploiting parallelism in algorithms, from the micro-instruction level up to the most macroscopic level. Machines with 2 to many thousands of cores are not at all new.

Absolutely, and for good reasons.
At the time, processors were so limited that it soon became very logical to try and use many processors at once. Multi-processor designs and algorithms appeared way sooner than multi-core processors for technological reasons, but the underlying principles were basically the same.

And looking at it on a larger scale, parallelizing activities (and the underlying constraints and approaches) is a concept that is almost as old as life itself. And it's definitely not trivial except in very obvious cases where the tasks at hand are completely independent.

The fact that programming languages don't include more specific parallel processing stuff is not because the need is recent. It's fundamentally because it's hard to even devise what would be useful in the general case.
« Last Edit: July 09, 2018, 07:40:50 pm by SiliconWizard »
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #45 on: July 09, 2018, 07:43:30 pm »
This always looked interesting to me: http://www.greenarraychips.com/index.html
 
The following users thanked this post: GeorgeOfTheJungle

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #46 on: July 09, 2018, 08:32:46 pm »
This always looked interesting to me: http://www.greenarraychips.com/index.html
To me too, it's irresistible from a nerd's perspective.
The further a society drifts from truth, the more it will hate those who speak it.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #47 on: July 09, 2018, 09:06:31 pm »
The fact that programming languages don't include more specific parallel processing stuff is not because the need is recent. It's fundamentally because it's hard to even devise what would be useful in the general case.

Sigh. Some commercially significant languages do just that - and have done since the 80s.

Hoare's CSP is from the 70s, embodied directly in Occam in the 80s, and in xC in the 00s to date. Plus key concepts appear in other fashionable languages such as Go and Rust.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #48 on: July 09, 2018, 09:09:20 pm »
This always looked interesting to me: http://www.greenarraychips.com/index.html

Yes - but only until you start to think about how you might use it and program it. Hardware is easy; software and usable hardware are difficult.

And that's why xCORE plus xC is so significant: hardware and software working well together in real-life hard realtime commercial applications.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online bd139

  • Super Contributor
  • ***
  • Posts: 23030
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #49 on: July 09, 2018, 10:11:03 pm »
Usable compilers and schedulers are harder than all of these problems, however, so YMMV. Consider scheduling even a trivial concurrent system such as a map and reduce operation. How do you schedule that, including IO bottlenecks, data paths and throughput optimisation? You now have n! problems to solve.

I’m very sceptical of such systems, even with CSP thrown in.

 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #50 on: July 09, 2018, 11:04:51 pm »
Hey, concurrency and parallelism are not the same thing. While objects (OO) have already been shoehorned in one way or another into ~ every existing language, neither concurrency nor parallelism has got good enough (if any) support in mainstream languages yet, imo.

The web/JS is, out of necessity, trying and making a slow but steady progress in these matters of concurrency, asynchronous execution and parallelism. Good or bad IDK, time will tell, but I keep an eye on that, it's very interesting.
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #51 on: July 09, 2018, 11:46:27 pm »
neither concurrency nor parallelism has got good enough (if any) support in mainstream languages yet, imo.

Intel's C compiler tries to parallelize loops then uses SIMD instructions to perform parallel execution. You don't need to do anything special in the language - it's the same C, but it results in parallel execution.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #52 on: July 10, 2018, 05:55:45 am »
Usable compilers and schedulers are harder than all of these problems, however, so YMMV. Consider scheduling even a trivial concurrent system such as a map and reduce operation. How do you schedule that, including IO bottlenecks, data paths and throughput optimisation? You now have n! problems to solve.

I’m very sceptical of such systems, even with CSP thrown in.

CSP doesn't have much to do with performance; it is about correctness. OTOH, if you are prepared to accept an incorrect result, I can significantly speed up any algorithm :)

Ahead-of-time schedulability - which I think is what you are after - requires that all components of a system are predictable. That's unobtainable in the large-scale cases you quote; but XMOS does achieve it in constrained embedded systems.

Note that CSP is closely related to message-passing, which is the best hope for scalability of large-scale coarse-granularity systems you mention. Hence some CSP concepts appearing in Go and Rust.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #53 on: July 10, 2018, 06:00:16 am »
Hey, concurrency and parallelism are not the same thing. While objects (OO) have already been shoehorned in one way or another into ~ every existing language, neither concurrency nor parallelism has got good enough (if any) support in mainstream languages yet, imo.

Some languages have useful concurrency concepts, e.g. Ada, Java, xC, Go, Rust. (And there are other less well known ones)

Other languages have a "head in the sand", "let's build castles on sand" mentality, notably C and C++. (It has taken them 40 years to admit that they need a memory model, a quarter of a century too late.)
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #54 on: July 10, 2018, 06:57:49 am »
Aren't there some Fortran compilers that are very good at parallelizing the sort of problems that people still use Fortran to solve?
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #55 on: July 10, 2018, 07:10:22 am »
This is why I say that "it's a recent problem": https://en.wikipedia.org/wiki/Parallel_computing#Background

Quote
Frequency scaling was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Maintaining everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction. An increase in frequency thus decreases runtime for all compute-bound programs.[10] However, power consumption P by a chip is given by the equation P = C × V² × F, where C is the capacitance being switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the processor frequency (cycles per second).[11] Increases in frequency increase the amount of power used in a processor. Increasing processor power consumption led ultimately to Intel's May 8, 2004 cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.[12]

To deal with the problem of power consumption and overheating the major central processing unit (CPU or processor) manufacturers started to produce power efficient processors with multiple cores. The core is the computing unit of the processor and in multi-core processors each core is independent and can access the same memory concurrently. Multi-core processors have brought parallel computing to desktop computers. Thus parallelisation of serial programmes has become a mainstream programming task. In 2012 quad-core processors became standard for desktop computers, while servers have 10 and 12 core processors. From Moore's law it can be predicted that the number of cores per processor will double every 18–24 months. This could mean that after 2020 a typical processor will have dozens or hundreds of cores.[13]

An operating system can ensure that different tasks and user programmes are run in parallel on the available cores. However, for a serial software programme to take full advantage of the multi-core architecture the programmer needs to restructure and parallelise the code. A speed-up of application software runtime will no longer be achieved through frequency scaling, instead programmers will need to parallelise their software code to take advantage of the increasing computing power of multicore architectures.
« Last Edit: July 10, 2018, 07:13:28 am by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline CJay

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #56 on: July 10, 2018, 09:17:03 am »
I liked Programming Pearls by Jon Bentley and found it useful

https://www.amazon.co.uk/dp/B01EAW7XXU/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #57 on: July 10, 2018, 11:08:04 am »
Intel's C compiler tries to parallelize loops then uses SIMD instructions to perform parallel execution. You don't need to do anything special in the language - it's the same C, but it results in parallel execution.
Have you looked at the generated code? In all non-trivial cases, I can do a better job by using intrinsics (<immintrin.h>) and compiler-provided vector type extensions.

The sequence-point model the C standard uses does not lend itself well to vectorization.  Fortran array notation is much easier to vectorize, because there is no implicit order in the assignments. (By vectorization, I mean using architecture-specific single-instruction, multiple-data extensions.)
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #58 on: July 10, 2018, 02:02:13 pm »
Intel's C compiler tries to parallelize loops then uses SIMD instructions to perform parallel execution. You don't need to do anything special in the language - it's the same C, but it results in parallel execution.
Have you looked at the generated code? In all non-trivial cases, I can do a better job by using intrinsics (<immintrin.h>) and compiler-provided vector type extensions.

Sure, you can do better than anything writing in assembler. However, most cases are simple, such as:

Code: [Select]
for (j = s = 0; j < n; j++) s += a[j];
Re-writing this with intrinsics for AVX will take a while - you need to take care of alignment, figure out what to do with odd values of "n", etc. The compiler can do all that work for you.
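For comparison, a rough sketch of what that rewrite might look like, assuming float data, n a multiple of 8, and a 32-byte-aligned array:

Code: [Select]
#include <immintrin.h>
#include <stddef.h>

/* AVX version of the loop above; assumes n % 8 == 0 and
   that "a" is 32-byte aligned. */
float sum_avx(const float *a, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t j = 0; j < n; j += 8)
        acc = _mm256_add_ps(acc, _mm256_load_ps(a + j));

    float part[8];                       /* horizontal reduction */
    _mm256_storeu_ps(part, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += part[k];
    return s;
}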

Of course, you can gain more efficiency by aligning the array, perhaps adding some extra "zero" space at the end of the array to make sure you do not need partial operations at the end of the loop. Then your assembler/intrinsics code will roll. However, if the array is long enough, the gain won't be that big.

As to parallelization with the number of cores, I think we're already at the end of this road - Intel cores share the same memory, and memory access becomes a bottleneck with a relatively small number of cores. 100 cores sharing the same memory won't do very well, even with caches.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #59 on: July 11, 2018, 10:06:41 am »
Sure, you can do better than anything writing in assembler.
No, that's not what I meant.  What I meant was that using C constructs describing explicit vectorization, I can do a better job than a C compiler can.

However, most cases are simple
I've dealt with a lot of C code, and that is just not true.  It is only true if your C program is horribly inefficient or imprecise to start with, at the algorithmic level, and no compiler magick will help then.
For example, if you need to calculate a sum of more than a couple of dozen floating-point terms in practice, you'll find you either need to ensure they are in monotonic order, or do a Kahan sum, or you'll lose precision.
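For reference, a minimal sketch of Kahan (compensated) summation:

Code: [Select]
#include <stddef.h>

/* Kahan summation: the running compensation c recovers the
   low-order bits that a plain "s += a[i]" would lose. */
double kahan_sum(const double *a, size_t n)
{
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = a[i] - c;
        double t = s + y;
        c = (t - s) - y;     /* the part of y that did not make it into s */
        s = t;
    }
    return s;
}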

The most typical operations needed tend to be vector operations, or sums of products. (I'm not talking about computer graphics, either. Think of any physical models, operations on audio or video, animation, or any kind of packing problem, for example. A lot of geometric and physical problems need to be solved in everyday programs.)

A very typical operation is normalizing vectors to unit length. (Which, by the way, might not have anything to do with geometry, because a large number of problems can be simplified this way.) It turns out that if the components of a vector are consecutive in memory, it is especially hard for the compiler to vectorize the norm squared, and the compiler-generated code tends to do the square root operation separately.  However, if you write the C code such that you explicitly load vectors so that each vector register has one component from consecutive vectors, you get near-optimal code, even with GCC (which is terrible at vectorization, compared to Intel C/C++).  Because of how C is defined via an underlying virtual machine with sequence points, it is very, very hard for a C compiler to detect this pattern in code and optimize it.  (I think ICC does detect the most typical pattern -- i.e., brute-force detection of each pattern is the only way these can be optimized in C -- but GCC does not.)
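In other words, a structure-of-arrays layout. A minimal sketch of the idea (names are made up):

Code: [Select]
#include <math.h>
#include <stddef.h>

/* Structure-of-arrays layout: x[i], y[i], z[i] are the components of
   vector i. A vector register can then hold one component of several
   consecutive vectors, which vectorizes far better than an
   interleaved x,y,z,x,y,z layout. Assumes no zero-length vectors. */
void normalize_all(float *x, float *y, float *z, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float inv = 1.0f / sqrtf(x[i]*x[i] + y[i]*y[i] + z[i]*z[i]);
        x[i] *= inv;
        y[i] *= inv;
        z[i] *= inv;
    }
}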

Similarly, if you want to do e.g. matrix-matrix multiplication between large matrices (a common problem in many optimization problems), the bottleneck is the cache access pattern. The solution is to use temporary memory, copying the contents of at least one of the matrices so that both have very linear access patterns. Ulrich Drepper's 2007 article is a very detailed look into this, including data for a version vectorized using SSE2 instructions. The C language itself is such that it does not allow a compiler to do this automatically (and it is not even clear whether one would want a compiler to do this automatically).
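A minimal sketch of that copying trick, assuming square row-major matrices and caller-provided scratch space:

Code: [Select]
#include <stddef.h>

/* C = A * B for n-by-n row-major matrices. Bt is caller-provided
   scratch of n*n doubles; B is copied transposed into it so that
   the inner loop reads both operands sequentially. */
void matmul(const double *A, const double *B, double *C,
            double *Bt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            Bt[j*n + i] = B[i*n + j];

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * Bt[j*n + k];   /* both run linearly */
            C[i*n + j] = s;
        }
}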

As to the parallelization with the numbers of cores, I think we're already at the end of this road
Again, I sort of disagree.  I agree that with symmetric multiprocessing, increasing the number of cores will not help with a small number of concurrent tasks -- say, typical laptop, tablet, or desktop computer usage, where you have a browser open, maybe a word processor, and such.

Asymmetric multiprocessing, where you have different types of cores, some optimized for specific tasks (like calculating checksums of data, DMA'ing it around based on logical filter expressions), is growing like crazy. Phones tend to have a separate processor doing all the network/audio stuff, graphics are handled by dedicated processors (that are still highly programmable via CUDA or OpenCL), and so on.  (In fact, I have an Odroid HC1 single-board Linux computer, which has a Samsung Exynos 5422 processor. It has four fast Cortex A15 cores, and four slower and simpler Cortex A7 cores.)

Coroutines, message passing (especially various "mailbox" approaches), atomic operations, and explicit concurrency (or, rather, explicit notation of memory ordering and visibility of side effects) are very important tools for writing asymmetric multiprocessing code, since typically, the intent is to keep data flowing at maximum bandwidth with minimum latencies. C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).
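C11 does give you the raw atomics, though. A minimal single-producer, single-consumer mailbox sketch (type and function names are mine), showing the explicit memory-ordering notation:
Code: [Select]
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int         payload;
    atomic_bool full;
} mailbox;

/* Producer: the release store guarantees the payload write is visible
   before the flag; returns false if the previous message is unread. */
bool mailbox_post(mailbox *m, int value)
{
    if (atomic_load_explicit(&m->full, memory_order_acquire))
        return false;
    m->payload = value;
    atomic_store_explicit(&m->full, true, memory_order_release);
    return true;
}

/* Consumer: the acquire load guarantees the payload is read only
   after the flag has been observed set. */
bool mailbox_take(mailbox *m, int *value)
{
    if (!atomic_load_explicit(&m->full, memory_order_acquire))
        return false;
    *value = m->payload;
    atomic_store_explicit(&m->full, false, memory_order_release);
    return true;
}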

A related problem (to coroutines and message passing) is that we kind of need a construct even lighter than threads (nanothreads or fibres or similar). CPU cores have so much state that context switches even within the same process tend to be expensive. Coroutines and closures (especially if combined with message passing) could really use some sort of nanothreads that have their own stack, but otherwise share the state with their parent thread. Hardware-wise there is no problem (and stack switching thus is actually quite common in POSIXy operating systems, for signal delivery), but such concepts have thus far only been incorporated in very high-level languages like Python; we do not really know how to frame/describe/define them for low level code that actually works and does not cause more harm than good.  In particular, it is not clear whether coroutines should be implemented as functions (or with similar restrictions), or if they should be treated specially, similar to e.g. POSIX signal handler functions (where many common operations do lead to undefined behaviour, and are therefore not really allowed).
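On POSIXy systems you can already experiment with the stack-switching part using the (obsolescent, but still widely available) ucontext interfaces. A minimal sketch of a fiber that yields back to its parent; all names are mine:
Code: [Select]
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_body(void)
{
    puts("fiber: first part");
    swapcontext(&fiber_ctx, &main_ctx);  /* yield to parent */
    puts("fiber: second part");
}                                        /* falls through to uc_link */

int main(void)
{
    static char stack[64 * 1024];        /* the fiber's own stack */

    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp   = stack;
    fiber_ctx.uc_stack.ss_size = sizeof stack;
    fiber_ctx.uc_link          = &main_ctx;  /* resumed when fiber ends */
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx);  /* run fiber until first yield */
    puts("main: fiber yielded");
    swapcontext(&main_ctx, &fiber_ctx);  /* resume fiber to completion */
    puts("main: fiber finished");
    return 0;
}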

There is a lot of research done on all of these concepts as we speak, down to new types of hardware architectures. On the programming language development side, a major problem is that by the time you've learned enough to develop a language, all that knowledge will steer you towards a specific paradigm/approach to programming languages, so it is terribly hard to create something completely novel.  (Plus, even if you do, the sheer inertia to overcome is more than one human can bear.) Higher level languages are easier, because there is less work to get a minimal working implementation for examples, further research, and obtaining results (for example, implementing typical programs, and comparing them to similar programs written in other languages).  To implement a new low-level language, you'll also need to implement the basic runtime library from scratch, consider the Application Binary Interfaces, and whether you'll model your base library on top of C/POSIX (and thus system calls on most OSes), or try for something different.  Even just mapping a minimal subset of POSIX (to get filesystem access and so on) to your new base library is a huge task; it is a complex specification with a very long history.  (And without POSIX, you basically condemn yourself to not just creating a new programming language, but to create a whole new OS to go with it.)

It would take someone like Elon Musk donating a million or so to a number of oddball developers like myself, just to see what they come up with. (Most of us would fail, but there is a small probability that someone could come up with a new low-level language, better suited for current and future hardware architectures than C is, but on the same order of simplicity of the core language itself.)
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #60 on: July 11, 2018, 10:31:41 am »
Much of your very sensible and knowledgeable message deleted for brevity. It horrifies me that many people have the misunderstandings you describe.

I'll merely make comments on this bit...

Coroutines, message passing (especially various "mailbox" approaches), atomic operations, and explicit concurrency (or, rather, explicit notation of memory ordering and visibility of side effects) are very important tools for writing asymmetric multiprocessing code, since typically, the intent is to keep data flowing at maximum bandwidth with minimum latencies.

All those are valuable for more than "just" asymmetric multiprocessing code.

Quote
C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).

Yes, that attitude is negligent. C/C++ is only getting a memory model 20 years after Java.

I note that, even though they were starting with a clean sheet of paper, the Java designers had to revise their memory model. Given that, what's the probability that the C/C++ "designers" and compiler implementers will produce something that is usable by a typical software writer?
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #61 on: July 11, 2018, 10:36:52 am »
Closures are great: instead of spawning multiple processes, each with its own parameters and state, with closures you can do the same in a single process, because there can be multiple, simultaneous, different program states alive at the same time. Closures also very often save you from having to create and administer objects just to hold state.
« Last Edit: July 14, 2018, 09:52:40 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #62 on: July 11, 2018, 11:03:21 am »
Quote
C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).

Yes, that attitude is negligent. C/C++ is only getting a memory model 20 years after Java.

I note that, even though they were starting with a clean sheet of paper, the Java designers had to revise their memory model. Given that, what's the probability that the C/C++ "designers" and compiler implementers will produce something that is usable by a typical software writer?
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #63 on: July 11, 2018, 11:11:31 am »
Quote
C does not provide those (even in C11, the memory model and atomics were just copied over from C++, with the assumption that although the two languages differ a lot in their approach, C++ atomics should work in C too).

Yes, that attitude is negligent. C/C++ is only getting a memory model 20 years after Java.

I note that, even though they were starting with a clean sheet of paper, the Java designers had to revise their memory model. Given that, what's the probability that the C/C++ "designers" and compiler implementers will produce something that is usable by a typical software writer?
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....

I disagree, since your statement appears very x86 centric.

I disagree, since Java faces the same problems, and successfully addressed them >20 years ago.

C/C++ claims to be low-level, close to the hardware, efficient, available on (almost) all architectures. That's their downfall, since it is trying to be all things to all people.

Until recently C explicitly stated that multiprocessing type things were part of the library and outside the language. That's a problem for library developers, since they are forced to rely on things the language doesn't guarantee.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #64 on: July 11, 2018, 11:33:15 am »
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....

I disagree, since your statement appears very x86 centric.

I disagree, since Java faces the same problems, and successfully addressed them >20 years ago.

C/C++ claims to be low-level, close to the hardware, efficient, available on (almost) all architectures. That's their downfall, since it is trying to be all things to all people.

Until recently C explicitly stated that multiprocessing type things were part of the library and outside the language. That's a problem for library developers, since they are forced to rely on things the language doesn't guarantee.
What is x86 specific about what I said? People have struggled trying to make things like lock free algorithms rock solid on most machines, because of iffy things in the memory model.

I do agree that having no reliable memory model in C and C++ hasn't been very helpful. However, the inertia of those languages means it's hard to compare them to the freely ripped-up-and-replaced nature of early Java.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4039
  • Country: nz
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #65 on: July 11, 2018, 11:37:08 am »
In particular, it is not clear whether coroutines should be implemented as functions (or with similar restrictions), or if they should be treated specially

Hmm. I think this is a solved problem. It's called "continuation-passing style". Functions never return .. they just tail-call the thing that wanted their result.

It's a pain in the arse to write manually, but compilers can automatically transform normal functions into CPS. It's been a standard thing for compilers for languages such as Scheme or ML to do for decades, but you can do it for C too.

CPS makes switching micro-threads into simply calling the current continuation of the other thread instead of calling the current continuation of your thread.

In the simplest implementation, local variables that live past a call to another function have to be heap-allocated -- it's basically turning each function into a class, with each basic block in the function being a member function in the class. But it's possible for a smart compiler to do it more efficiently than this.
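A minimal hand-written CPS sketch in C (names are mine). Because C has no closures, any state the continuation needs beyond its one argument has to be carried in an explicit (typically heap-allocated) context, which is exactly the cost described above:
Code: [Select]
#include <stdio.h>

/* A continuation consumes the result; control never comes back. */
typedef void (*cont)(int result);

/* CPS add: instead of returning a+b, tail-call the continuation
   that wanted the result. */
static void add_cps(int a, int b, cont k)
{
    k(a + b);
}

static void print_result(int r)
{
    printf("result: %d\n", r);
}

int main(void)
{
    add_cps(3, 4, print_result);  /* "returns" 7 by calling print_result */
    return 0;
}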
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #66 on: July 11, 2018, 12:36:44 pm »
CPS makes switching micro-threads into simply calling the current continuation of the other thread instead of calling the current continuation of your thread.

In the simplest implementation, local variables that live past a call to another function have to be heap-allocated -- it's basically turning each function into a class, with each basic block in the function being a member function in the class. But it's possible for a smart compiler to do it more efficiently than this.

And that's exactly why and where closures shine. Impure functional languages FTW. Closures are blocks in C (https://en.wikipedia.org/wiki/Blocks_(C_language_extension)), although almost nobody uses them, I believe.
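For the curious, a minimal sketch of what they look like (clang-specific; compile with -fblocks, and link against the BlocksRuntime on non-Apple systems):
Code: [Select]
#include <Block.h>
#include <stdio.h>

typedef int (^counter)(void);

/* Returns a closure over its own mutable copy of n. Block_copy moves
   the block (and the __block variable) from the stack to the heap. */
static counter make_counter(int start)
{
    __block int n = start;
    return Block_copy(^int(void) { return n++; });
}

int main(void)
{
    counter c = make_counter(10);
    printf("%d %d %d\n", c(), c(), c());  /* prints: 10 11 12 */
    Block_release(c);
    return 0;
}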
« Last Edit: July 11, 2018, 01:03:35 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #67 on: July 11, 2018, 12:41:31 pm »
All those are valuable for more than "just" asymmetric multiprocessing code.
Oh, absolutely agreed.  My favourite example is using coroutines (say, Python generator) to provide an input stream (file or socket) split into lines.

A lot of common tasks are easy to describe/implement using coroutines.  You could even argue that most event models used in GUIs are better described using classes and coroutines, rather than as actions on objects.  (The main difference being exactly who "owns" or controls the "event loop", and the contexts in which each action occurs.)

To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices.
The Spectre and Meltdown bugs are good examples of how device developers don't really have a good picture of the repercussions the device behaviour has in the real world.  Just look at the Linux Kernel Mailing List to see how often the kernel developers need to take GCC devs to task (when the generated code is absolutely silly or useless, and the GCC devs' defence is that the standard allows it).  Or at the number of questions forwarded to hardware developers when architecture-specific details like locking are implemented (or how locking and cache coherency interact).

Hmm. I think this is a solved problem. It's called "continuation-passing style". Functions never return .. they just tail-call the thing that wanted their result.
Yes, that is one way to implement it. However, what I meant was a distinction similar to the one Pascal or Fortran make between functions (which return a value) and procedures/subroutines (which do not); even a low-level language could easily have functions and coroutines that follow different sets of language rules.

It is not clear to me that CPS is the best syntactic construct to describe coroutines. (That is, I find it human-error-prone in practice.) For example, if you compare Python generators with code using CPS, you'll find that Python generators, where yield produces one value at a time for a "caller" to consume, yield easier-to-maintain code.  In a low-level language, such a construct can be implemented via CPS, or by creating a nanothread/fiber: a context that has its own stack and local variables, but otherwise shares the state with its parent thread (and, in general, is non-reentrant).  While the two are not that different in the underlying implementation on current hardware architectures, at the syntactic level (how they are expressed in human-readable code) they're very, very different. I've always thought of CPS as a large step in the direction of functional languages, myself, so perhaps that colors my view unduly, though.

I'm also not a language developer, myself; other than script-type domain-specific tool languages, I haven't developed my own, or even looked at the state of the art.  So I could easily be wrong here.



Back on topic: Even if one is most interested in algorithms in C or C++, I've found it extremely rewarding to explore various algorithms in different programming languages.  I believe (but have no proof!) that it widens one's mental focus, i.e. helps one see "outside the box" when encountering new problems.  Weiss's book helped me look at algorithms separately from their implementations, at a somewhat abstract level, based on the descriptions of the intent behind each algorithm.  Then again, I was very familiar with trees and lists when I first read the book.

Right now, I'd focus much more on memory access patterns and cache behaviour -- and getting data flowing continuously, as opposed to being treated as a single in-memory set that is operated on --, as that in my experience helps with the biggest issues with existing programs: latencies.  I find it horrific that a single program can take seconds to start up on current desktop systems; the amount of work done on every single invocation seems not only a waste, but terribly aggravating to users, too.  Similarly for a GUI stuttering or becoming unresponsive, because the program is too busy to respond to user action.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #68 on: July 11, 2018, 02:08:17 pm »
To be fair to the developers of the C and C++ specs, they've struggled for years to get enough information out of device developers to figure out the complete memory behaviour of their devices. I've heard a number of anecdotes of people asking, say, Intel or AMD what happens in some circumstance, and getting a blank stare in response because it had never been thought through in development. If you are going to specify a practical memory model within a language, it has to map sensibly to what the hardware can do and do efficiently, and if even the hardware developers haven't got that completely nailed.....

I disagree, since your statement appears very x86 centric.

I disagree, since Java faces the same problems, and successfully addressed them >20 years ago.

C/C++ claims to be low-level, close to the hardware, efficient, available on (almost) all architectures. That's their downfall, since it is trying to be all things to all people.

Until recently C explicitly stated that multiprocessing type things were part of the library and outside the language. That's a problem for library developers, since they are forced to rely on things the language doesn't guarantee.
What is x86 specific about what I said? People have struggled trying to make things like lock free algorithms rock solid on most machines, because of iffy things in the memory model.

You mentioned Intel and AMD, and didn't mention any other manufacturer. It is a long time since they produced anything other than x86 machines.

Quote
I do agree that having no reliable memory model in C and C++ hasn't been very helpful.

It is more than "hasn't been helpful"; it is a killer omission. And a blindingly obvious omission at that!

Quote
However, the inertia of those languages means its hard to compare them to the freely ripped up and replaced nature of early Java.

Shrug. I think you meant "inertia of those language designers".

Presumably having a C/C++ memory model is possible, and has been possible for >20 years. Why, then, did the language designers repeatedly avoid creating a memory model?

None of the possible answers reflect well on the language designers.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #69 on: July 11, 2018, 02:28:38 pm »
All those are valuable for more than "just" asymmetric multiprocessing code.
Oh, absolutely agreed.  My favourite example is using coroutines (say, Python generator) to provide an input stream (file or socket) split into lines.

A lot of common tasks are easy to describe/implement using coroutines.  You could even argue that most event models used in GUIs are better described using classes and coroutines, rather than as actions on objects.  (The main difference being exactly who "owns" or controls the "event loop", and the contexts in which each action occurs.)

Yes indeed. I first did that in C in 1982, for a hard realtime system. There was a little bit of processor-dependent assembler in the RTOS, but that's inevitable and of no significant consequence.

Quote
I'm also not a language developer, myself; other than script-type domain-specific tool languages, I haven't developed my own, or even looked at the state of the art.  So I could easily be wrong here.

The trouble with domain-specific languages is that they grow like cancer, there is no off-the-shelf tool support, and there are no off-the-shelf developers. OTOH, domain-specific libraries are essential and don't have any of those drawbacks.

Quote
Back on topic: Even if one is most interested in algorithms in C or C++, I've found it extremely rewarding to explore various algorithms in different programming languages.  I believe (but have no proof!) that it widens one's mental focus, i.e. helps one see "outside the box" when encountering new problems.

IMNSHO there is no doubt about that.

When I first looked at Java in 1996, the white paper was impressive, since it said "we have incorporated T and U, which have previously been shown to work in languages J and K respectively; T and U work well together to achieve...". In contrast, most C/C++ papers were incestuous and ignorant, referring only to other C/C++ papers.

Quote
Right now, I'd focus much more on memory access patterns and cache behaviour -- and getting data flowing continuously, as opposed to being treated as a single in-memory set that is operated on --, as that in my experience helps with the biggest issues with existing programs: latencies.  I find it horrific that a single program can take seconds to start up on current desktop systems; the amount of work done on every single invocation seems not only a waste, but terribly aggravating to users, too.  Similarly for a GUI stuttering or becoming unresponsive, because the program is too busy to respond to user action.

L1/2/3 cache is the new RAM, RAM is the new disk.

Have a look at xC/xCORE. Start with https://www.xmos.com/published/xmos-programming-guide which gives a rapid tutorial introduction to the principal features. The first part is about multicore programming (none of the "this is a loop..." detritus) and getting i/o to occur on known clock cycles, and the second part gives longer examples.

Even if you never use xC, you'll begin to understand the underpinnings of some language constructs in Rust and Go :)
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #70 on: July 11, 2018, 03:42:54 pm »
Sure, you can do better than anything writing in assembler.
No, that's not what I meant.  What I meant was that using C constructs describing explicit vectorization, I can do a better job than a C compiler can.

The intrinsics you use in C are just mutilated assembler instructions.

However, most cases are simple
I've dealt with a lot of C code, and that is just not true.  It is only true if your C program is horribly inefficient or imprecise to start with, at the algorithmic level, and no compiler magick will help then.

I don't quite understand your point. Are you saying that a program which calculates the sum of an integer array is horribly inefficient?

For example, if you need to calculate a sum of more than a couple of dozen floating-point terms in practice, you'll find you either need to ensure they are summed in order of decreasing magnitude, or do a Kahan sum, or you'll lose precision.

You just use adequate precision (say double instead of float), or, if that is not enough, you go to longer integers. It's much easier and more efficient than sorting the arrays. Adequate precision means the minimum precision which fits your requirements.

Of course, C99 says the compiler cannot re-arrange floating-point expressions, but there are compiler flags to circumvent this (GCC's -ffast-math, for example).

A very typical operation is normalizing vectors to unit length. (Which, by the way, might not have anything to do with geometry, because a large number of problems can be simplified this way.) It turns out that if the components of a vector are consecutive in memory, it is especially hard for the compiler to vectorize the norm squared, and the compiler-generated code tends to do the square root operation separately.  However, if you write the C code such that you explicitly load vectors so that each vector register has one component from consecutive vectors, you get near-optimal code, even with GCC (which is terrible at vectorization, compared to Intel C/C++).  Because of how C is defined via an underlying virtual machine with sequence points, it is very, very hard for a C compiler to detect this pattern in code and optimize it.  (I think ICC does detect the most typical pattern -- i.e., brute-force detection of each pattern is the only way those can be optimized in C -- but GCC does not.)

Calculating a sum of squares is not much more complex than a simple sum. If your SIMD register holds 4 numbers, you simply calculate 4 sums (one sum for elements 0, 4, 8, ..., another for 1, 5, 9, ..., etc.), then you sum them up at the end. Thus, the calculation for a single array can be vectorized; you do not need multiple arrays to get the benefit. The Intel compiler should be able to do it.
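A minimal sketch of that multiple-accumulator idea in plain C (the function name is mine). Note that with floats this changes the summation order, so strictly a compiler may only do this transformation for you under relaxed floating-point rules:
Code: [Select]
#include <stddef.h>

/* Four independent accumulators break the loop-carried dependency,
   so they can be mapped onto the four lanes of a SIMD register. */
float sum_of_squares(const float *a, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * a[i];
        s1 += a[i + 1] * a[i + 1];
        s2 += a[i + 2] * a[i + 2];
        s3 += a[i + 3] * a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += a[i] * a[i];
    return (s0 + s1) + (s2 + s3);
}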

Similarly, if you want to do e.g. matrix-matrix multiplication between large matrices (a common problem in many optimization problems), the bottleneck is the cache pattern. The solution is to use temporary memory, copying the contents of at least one of the matrices, so that both will have very linear access patterns. Ulrich Drepper's 2007 article is a very detailed look into this, including the data when vectorized using SSE2 instructions. The C language itself is such that it does not allow a compiler to do this automatically (and it is not even clear if one would want a compiler to do this automatically).

You're mixing two different operations - vectorization and cache optimization.

You can do vectorization without moving elements. Newer Intel CPUs even have scatter/gather instructions for this (although I don't think these instructions add much speed).

or

You can re-arrange your matrices for cache access and then do vectorization.

In either case, the vectorization case is simple, and the Intel compiler should be able to do vectorization for you. You don't need to write SSE code (and then a few years later re-write it for AVX, and then for AVX-512, and then who knows what). This is the same logic as using C instead of assembler. You give up a little bit of efficiency, but you get convenience in return. Of course, if you cannot give up any efficiency, then assembler it is.

As to the parallelization with the numbers of cores, I think we're already at the end of this road
Again, I sort of disagree <snip>

This has nothing to do with ever-growing number of Intel cores.
« Last Edit: July 11, 2018, 03:58:18 pm by NorthGuy »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #71 on: July 11, 2018, 04:15:57 pm »
Sure, you can do better than anything writing in assembler.
No, that's not what I meant.  What I meant was that using C constructs describing explicit vectorization, I can do a better job than a C compiler can.

The intrinsics you use in C are just mutilated assembler instructions.

So you agree with him.

If you talk to the high performance computing (HPC) community (who have always been on the bleeding edge), you will find there are numerous cases where C compilers have to pessimise. Typical examples involve the inability of the compiler to prove there is no aliasing.
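The classic illustration is C99's restrict: a promise from the programmer that the compiler cannot prove for itself. A minimal sketch:
Code: [Select]
#include <stddef.h>

/* Without the restrict qualifiers, out might alias a or b, so the
   compiler must assume every store may change later loads, and emit
   pessimistic scalar code. With them, it is free to vectorize. */
void vec_add(float *restrict out,
             const float *restrict a,
             const float *restrict b,
             size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}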

And in that area, C had a long-running dilemma in the early 90s:
  • should it be impossible to "cast away constness", thereby allowing the compiler to make optimisations
  • must it be possible to "cast away constness", thereby allowing some forms of higher level programming and debugging
both of which are perfectly valid positions to take.

I remember watching the committee(s) debate that endlessly for well over a year - until I realised they were attempting to square a circle, decided C/C++ was becoming too complex for its own good, and realised that C/C++ was part of the problem rather than part of the solution. Fortunately Java came along for higher-level applications.

Bjarne Stroustrup on C++ (but C has many of the same problems):
"It is worth remembering that many in the C++ community still have not completely caught up with C++98."
"Even I can’t answer every question about C++ without reference to supporting material (e.g. my own books, online documentation, or the standard). I’m sure that if I tried to keep all of that information in my head, I’d become a worse programmer."
from http://www.stroustrup.com/japanese2010.pdf

If even he had that problem back in 2010, what chance have mere mortals?
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #72 on: July 11, 2018, 05:08:15 pm »
The intrinsics you use in C are just mutilated assembler instructions.
No, they are C functions that implement single-instruction, multiple-data operations that the underlying hardware also supports.  The form and restrictions are often not the same, either. (In particular, with regards to the result register.)

They can be mapped to multiple architectures (see GCC, which can do multi-arch shuffles quite efficiently already), although aside from fused multiply-add, there hasn't been any push in that direction yet.

Are you saying that the program which contains calculations of a sum of an integer array is horribly inefficient?
I am saying that having to calculate a sum of an integer array is rare.  Having to calculate a straight sum of floating-point values is even more rare.  I am also saying that whenever I've seen that done in a real life application (for an array with more than a couple of dozen elements), it has usually been a sign of an inefficient approach to the underlying problem. Usually, there has either been a way of avoiding said array and computation altogether, or a more efficient data structure.  (Disjoint sets come up surprisingly often.)

You just use adequate precision (say double instead of float)
No! Precision is not the issue. The set may contain cancelling terms.  As an extreme example, consider 1e30 + 1e-30 - 1e30 = 1e-30. Sorting according to decreasing magnitude yields the correct result for both IEEE-754 binary32 (float) and binary64 (double), as does a compensated sum (strictly, Neumaier's variant of the Kahan sum; the classical Kahan sum still loses the small term in this extreme case).

Many sets one might need to sum often contain such cancelling terms.  (It is especially typical in geometric problems, where each term is the result of several products. Because magnitudes multiply in each product, having vectors both shorter and longer than 1 easily leads to data sets with such wildly varying magnitudes.)

Thus, the calculation for a single array can be vectorized.
You don't understand. Let me show a practical example.

Let v1, v2, and v3 be vector registers, each containing a 3-component vector. (For illustration, this architecture has 3-component vectors.)
To calculate the three norms into v4 one needs to do
Code: [Select]
    v1 = v1 * v1
    v2 = v2 * v2
    v3 = v3 * v3
    v4.1 = v1.1 + v1.2 + v1.3    !
    v4.2 = v2.1 + v2.2 + v2.3    !
    v4.3 = v3.1 + v3.2 + v3.3    !
    v4 = sqrt(v4)
The lines marked with ! are problematic. First, they contain a horizontal sum, which is rarely supported (as it requires interaction between elements in the same vector register); when it is supported, it tends to be slow, too.  Second, they contain an assignment to a specific element in a vector register, which is often slow or requires more than one operation. The exceptions that support the above just fine are CUDA and OpenCL.

However, if you have the x coordinates of the three vectors in v1, the y coordinates in v2, and the z coordinates in v3, you can do
Code: [Select]
    v1 = v1 * v1
    v2 = v2 * v2
    v3 = v3 * v3
    v1 = v1 + v2 + v3
    v4 = sqrt(v1)
In fact, GCC supports the above in GNU C using architecture-independent vector types, except for the square root.

You might think that the two do not differ that much, but in reality, for SSE2/AVX, they do. So much so that it is better to write the code as
Code: [Select]
    v1 = (x0 y0 z0 x1)
    v2 = (y1 z1 x2 y2)
    v3 = (z2 x3 y3 z3)
    Register shuffling, using two temp registers, yielding
    v1 = (x0 x1 x2 x3)
    v2 = (y0 y1 y2 y3)
    v3 = (z0 z1 z2 z3)
    v1 = v1*v1
    v2 = v2*v2
    v3 = v3*v3
    v1 = v1+v2
    v3 = v3+v1
    v3 = sqrt(v3)
    Store v3
but for 6, 9, or 12 vectors at a time (staggered). The efficiency gains are significant.
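For reference, the GNU C vector-type notation mentioned above looks like this minimal sketch (the square root still has to be done with intrinsics or per element):
Code: [Select]
/* GCC/clang architecture-independent vector types: four floats per
   register, coordinates already in structure-of-arrays layout. The
   compiler picks the SSE/AVX/NEON instructions. */
typedef float v4sf __attribute__((vector_size(16)));

static inline v4sf norm_squared4(v4sf x, v4sf y, v4sf z)
{
    return x * x + y * y + z * z;  /* element-wise vector operations */
}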

You're mixing two different operations - vectorization and cache optimization.
One might think so, but I am not.  You see, the two are intricately intertwined, because the vector registers do not support fast scatter/gather for reasons similar to why horizontal summing is rare, or slow even when supported.

In either case, the vectorization case is simple, and the Intel compiler should be able to do vectorization for you.
No! If it does, it breaks the C standard!

Don't you see, the problem is threefold. One is that data access patterns are hugely important, and C does not provide any construct to allow the compiler to choose the one that best matches the underlying hardware.  The second is that there is no way to relax the sequence point rules in C (i.e., the order in which the side effects of each operation occur).  The third is that you cannot specify a sequence of operations to be done in parallel with another, for example to relax the order in which a loop is iterated (compare to the forall loop construct in Fortran).

You give up a little bit of efficiency, but you get convenience in return. Of course, if you cannot give up any efficiency, then assembler it is.
I thought I was explaining the features lacking in C, and all other low level languages (Forth?), that would allow one to concisely write much more efficient programs.

This has nothing to do with ever-growing number of Intel cores.
I was pointing out how other manufacturers use asymmetric cores to achieve much better performance per watt used.  Intel is not exactly a forerunner here.  Just because Intel does something stupid, does not mean the entire world is stupid.

I remember watching the committee(s) debate that endlessly for well over a year - until I realised they were attempting to square a circle, decided C/C++ was becoming too complex for its own good, and realised that C/C++ was part of the problem rather than part of the solution.
I agree.  I do believe it would be possible to construct a low-level programming language better suited for current architectures than C, and attempted to highlight the major differences to C in my posts.  (Other details include non-binary integer types, floating-point types other than Binary16, Binary32, Binary64, and Binary128, fixed-point arithmetic, exposing arithmetic flags (overflow, carry, zero) as a result of integer operations, and so on.) I do not believe it is possible to design such a language by a committee, however.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #73 on: July 11, 2018, 06:07:02 pm »

Thus, the calculation for a single array can be vectorized.
You don't understand. Let me show a practical example.

I would make that statement in a wider context!

Quote
I remember watching the committee(s) debate that endlessly for well over a year - until I realised they were attempting to square a circle, decided C/C++ was becoming too complex for its own good, and realised that C/C++ was part of the problem rather than part of the solution.
I agree.  I do believe it would be possible to construct a low-level programming language better suited for current architectures than C, and attempted to highlight the major differences to C in my posts.  (Other details include non-binary integer types, floating-point types other than Binary16, Binary32, Binary64, and Binary128, fixed-point arithmetic, exposing arithmetic flags (overflow, carry, zero) as a result of integer operations, and so on.) I do not believe it is possible to design such a language by a committee, however.

There was a period, in say the early 90s, when C/C++ could have chosen to be either a high level language for large scale applications or a low level systems language. Either choice would have been reasonable, and arguably either would have succeeded. But it tried to be both, and failed at both. That's a prime example of it being better to do one thing well than two things poorly.

Fortunately Java came along for the large scale applications, and it is finally being realised that C++ is the new COBOL. Whether new languages will gradually supplant C for systems programming is yet to be determined.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #74 on: July 11, 2018, 06:43:02 pm »
The intrinsics you use in C are just mutilated assembler instructions.
No, they are C functions that implement single-instruction, multiple-data operations that the underlying hardware also supports.  The form and restrictions are often not the same, either. (In particular, with regards to the result register.)

Looks exactly the same to me. You know as they say: " ... don't multiply entities beyond necessity".

Are you saying that the program which contains calculations of a sum of an integer array is horribly inefficient?
I am saying that having to calculate a sum of an integer array is rare.  Having to calculate a straight sum of floating-point values is even more rare.  I am also saying that whenever I've seen that done in a real life application (for an array with more than a couple of dozen elements), it has usually been a sign of an inefficient approach to the underlying problem. Usually, there has either been a way of avoiding said array and computation altogether, or a more efficient data structure.  (Disjoint sets come up surprisingly often.)

Depends on what you're doing. May be rare for some people and frequent for others. It's silly to assign labels without reviewing the particular situation.

Besides, the two examples you had - calculating the vector length and multiplying matrices - are not that far from the sum of an array in terms of complexity.

You just use adequate precision (say double instead of float)
No! Precision is not the issue. The set may contain cancelling terms.  As an extreme example, consider 1e30 + 1e-30 - 1e30 = 1e-30. Sorting according to decreasing magnitude yields the correct result for both IEEE-754 binary32 (float) and binary64 (double), as does a compensated sum (strictly, Neumaier's variant of the Kahan sum; the classical Kahan sum still loses the small term in this extreme case).

Many sets one might need to sum often contain such cancelling terms.  (It is especially typical in geometric problems, where each term is the result of several products. Because magnitudes multiply in each product, having vectors both shorter and longer than 1 easily leads to data sets with such wildly varying magnitudes.)

It definitely is about precision. Say you need to know distances with 1 mm precision and your biggest distance is 100 km. You realize that you will lose some precision if you use 32-bit floats, but you'll be OK with 64-bit doubles. With 64-bit doubles you will never lose a single mm, even if you have "cancelling" terms.

BTW: that's when you start to realize the superiority of integers. Using 32-bit integers would be enough - a two-fold reduction in memory and better calculation speed. Not to mention the possibility of using 64-bit integer accumulators for products.

Thus, the calculation for a single array can be vectorized.
You don't understand. Let me show a practical example.

Let v1, v2, and v3 be vector registers, each containing a 3-component vector. (For illustration, this architecture has 3-component vectors.)

I had in mind the likes of 1000-dimensional vectors. With 3-dimensional vectors, square-rooting dominates the performance, so your performance will be similar as soon as you vectorize the square-rooting. The square-rooting takes a really long time. Everything else will be done in parallel with the square-rooting (unless, of course, you insert some stupid dependency).

I would guess the compiler would figure 3-dimensional vectors out. Graphics manipulation is a very common task, so I guess they would attack it first.

BTW: Intel already had horizontal addition/subtraction in SSE3.

In either case, the vectorozation case is simple, and the Intel compiler should be able to do vectorization for you.
No! If it does, it breaks the C standard!

Don't you see, the problem is threefold. One is that data access patterns are hugely important, and C does not provide any construct to allow the compiler to choose the one that best matches the underlying hardware.  The second is that there is no way to relax the sequence point rules in C (i.e., the order in which the side effects of each operation occur).  The third is that you cannot specify a sequence of operations to be done in parallel with another, for example to relax the order in which a loop is iterated (compare to the forall loop construct in Fortran).

The C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As soon as this holds, the compiler is free to do whatever it pleases.

One "bummer" here is floating point - because the things you describe - A + B - A may not yield B. This holds the compiler optimizations - it cannot replace (A+B-A) with B. There are two solutions - allow compiler to re-arrange floats, or use integers wherever possible.

This has nothing to do with ever-growing number of Intel cores.
I was pointing out how other manufacturers use asymmetric cores to achieve much better performance per watt used.  Intel is not exactly a forerunner here.  Just because Intel does something stupid, does not mean the entire world is stupid.

For some reason you quoted my post where I was talking about Intel (although you did cut it off right before the "Intel" word). I mistakenly assumed you were responding to my post. My apologies.

 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #75 on: July 11, 2018, 08:38:31 pm »
Whether new languages will gradually supplant C for systems programming is yet to be determined.
I'm mostly interested in systems programming (that is, kernels and low level libraries), and HPC (molecular dynamics simulations with classical potentials).

Since we are talking about algorithms, one of my core interests is related to Verlet lists (or neighbour lists), and in particular, enhancing them for both vectorization and cache locality.  The key point about neighbour/Verlet lists is that rather than calculating all particle pairs, you maintain a list of neighbours for each particle. The classical potentials have a cut-off distance, above which a pair does not interact. If you (re)construct the neighbour/Verlet list to include all neighbours within the cut-off distance plus twice some delta, you can keep using the same list until at least one particle has moved by delta.

The simplest enhancement I've developed is to allow reordering the neighbours in the list, while keeping a parallel array of distances (and (unit) direction vectors to the neighbours within interaction distance). The list is not kept sorted by distance; instead, the maximum interaction distance acts like the pivot in Quicksort. Interactions are then evaluated for the initial section only: those particles that are within interaction distance.  This allows increasing the delta (at the cost of increasing the neighbour list size), but also calculating the actual interactions only for truly interacting particle pairs, rather than for all those in the neighbour list.  (So, you calculate the distance, or rather the distance squared, for all neighbours, but the actual interaction for only the neighbours the particle really interacts with.)
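The reordering itself is just one Quicksort-style partitioning pass over the parallel arrays. A hypothetical sketch (all names are mine):
Code: [Select]
#include <stddef.h>

/* Partition one particle's neighbour list, using the squared
   interaction cut-off as the pivot. On return, entries [0, head)
   are the truly interacting neighbours; interactions are evaluated
   for that initial section only. */
size_t partition_neighbours(size_t *nbr, double *dist2,
                            size_t n, double cutoff2)
{
    size_t head = 0, tail = n;
    while (head < tail) {
        if (dist2[head] <= cutoff2) {
            head++;                  /* already in the initial section */
        } else {
            tail--;                  /* swap to the far end */
            size_t tn = nbr[tail];   nbr[tail]   = nbr[head];   nbr[head]   = tn;
            double td = dist2[tail]; dist2[tail] = dist2[head]; dist2[head] = td;
        }
    }
    return head;
}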

In theory, that's all there is to that algorithm. It's not "new" or "useful" enough to get published. Implementing it in Fortran is relatively straightforward, although the extra arrays needed (for keeping the distances and/or delta vectors) are a bit annoying.  Implementing it in C opens up a whole lot of new possibilities, like using an ASIC/FPGA to calculate the distances at double precision, another to look up the potential and force using piecewise cubic polynomials with very large (but constant) coefficient tables.. perfect for asymmetric multiprocessing.



Looks exactly the same to me. You know as they say: " ... don't multiply entities beyond necessity".
They are normal C constructs. Using intrinsics in C code is very different from, say, inlining assembly code.

Or, put another way, you can teach a new C programmer to use the intrinsics and vectorize computation, without teaching them assembly.

Depends on what you're doing. May be rare for some people and frequent for others. It's silly to assign labels without reviewing the particular situation.
It was an opinion/estimate based on the C sources of dozens, if not hundreds, of open-source projects.  I do believe I've written half a million lines of C code myself, too.

Besides, the two examples you had - calculating the vector length and multiplying matrices - are not that far from the sum of an array in terms of complexity.
And yet, no C compiler can properly vectorize such code, which was my point.  Writing the vectorization explicitly, via functions that operate on specific types that contain multiple data values, yields a significant speedup.  In my opinion, that is because the model underlying C does not allow, and cannot allow, the transformations necessary. Even when C compilers can do the transformation, it typically requires a very specific C pattern to be used; essentially, the C compiler detects the pattern, and implements it using a pre-prepared code template. That is like translating between languages with a phrasebook: you are limited to the phrases in the book.

Compiler options that tell the compiler to break C rules just emphasize that point, in my opinion.

Say you need to know distances with 1mm precision and your biggest distance is 100 km
Then I'd use 32-bit integers, not floating-point types. Just like you do not use floating-point types for financial calculations either.

A better example is when you are told to e.g. calculate the area of a non-self-intersecting 2D polygon to within some number of significant figures. For a polygon with n vertices, this is a sum of 2n products. If you use floating-point numbers, and the polygon is located far from the origin, you'll lose most of the precision to domain cancellation errors; that is, each pair of products contains values very similar in magnitude, but of different signs.

There are several ways of fixing that. One is to find the center of the polygon (the mean of the vertex coordinates), and subtract it from all vertices before calculating the products. Another is to use a Kahan sum to calculate the sum of the products. Yet another is to save the products in an array and sort them, so you can calculate the sum in descending order of magnitude of the summands.  Of these, the Kahan sum is the cheapest to implement (essentially free on current architectures), and very likely to provide sufficient accuracy.
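A minimal sketch of the first fix (the shoelace formula with the common bias removed; all names are mine):
Code: [Select]
#include <math.h>    /* fabs */
#include <stddef.h>  /* size_t */

/* Area of a non-self-intersecting polygon. Subtracting the mean
   vertex first keeps each product small, so the summands no longer
   cancel catastrophically when the polygon is far from the origin. */
double polygon_area(const double *x, const double *y, size_t n)
{
    if (n < 3)
        return 0.0;

    double cx = 0.0, cy = 0.0;
    for (size_t i = 0; i < n; i++) {
        cx += x[i];
        cy += y[i];
    }
    cx /= (double)n;
    cy /= (double)n;

    double sum = 0.0;
    for (size_t i = 0, j = n - 1; i < n; j = i++)
        sum += (x[j] - cx) * (y[i] - cy)
             - (x[i] - cx) * (y[j] - cy);
    return 0.5 * fabs(sum);
}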

I had in mind the likes of 1000-dimensional vectors.
Those tend to be rare in my experience, unless the vectors are sparse (as in sets of (dimension, value) pairs). Operations on them tend to be memory bandwidth limited.

The square-rooting takes a really long time.
On AMD Ryzen, a gathering load into a vector register takes longer than a vectorized square root; the square root takes roughly the same time as two vectorized divisions.

Obviously, it varies a lot between implementations, but in general terms, square root is just a couple of times to a few times as expensive as a division, with addition and multiplication being only limited by memory bandwidth.

I would guess the compiler would figure 3-dimensional vectors out.
Why would you guess that? They haven't yet, even though we've had SSE and support for 4-component floats in a single vector register for almost two decades now.

BTW: Intel already had horizontal addition/subtraction in SSE3.
HADD[PS][SD] is strictly speaking half or quarter of a horizontal addition; similarly for [V]HSUB[PS][SD]. On most implementations, it is as slow as vectorized division, much slower than vectorized addition or multiplication.  This is not just a quirk of the implementations; it is very hard to make it fast.

The C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As soon as this holds, the compiler is free to do whatever it pleases.
Exactly.  One of the behaviours required by the abstract machine is sequence points: points in time at which the side effects of each operation must be visible in memory.  This is a severe limitation, because there is no way of specifying that two or more operations (or sequences of operations) can be done in any order.

There are two solutions - allow the compiler to re-arrange floats, or use integers wherever possible.
A much better solution than either of those is to use a construct similar to the C99 cast, and specify that, without such casts, the compiler is free to simplify expressions using arithmetic rules. In C99 and later, (type)(expression) restricts the value of expression to the range and precision of type type.

This means that in such a programming language, (B - A) + A is equivalent to B, but (double)(B - A) + A requires the compiler to ensure that the difference, before the addition, is limited to double range and precision.  The notation could be better, however: the cast expressions seem to be pretty hard for humans to parse correctly.

For some reason you quoted my post where I was talking about Intel (although you did cut it off right before the "Intel" word).
I apologize if I caused confusion: my only intent is to help here.  (Please, do not read anything in my "tone" either.  I strive for clarity, precision, and usefulness; that is all.  Consider me autistic in my communications that way.)

I probably should explain that I view the SSE2/3/4/AVX/AVX512 "extensions" to the x86-64 hardware architecture as a separate arithmetic-logic unit, one which is pretty accurately modeled as a helper coprocessor, rather than an integral part of the CPU.

Especially if you look at kernel-level C code, it becomes a useful way of considering things.  The SSE/AVX register state alone is huge, and storing it and loading it back takes a considerable number of cycles.  One reason the Linux kernel in particular avoids it is that doing so makes context switches (between userspace and kernel space) faster: if the kernel does not touch the SSE/AVX state, the state only needs to be stored/restored when switching from one userspace thread/process to another.
« Last Edit: July 12, 2018, 05:23:15 am by Nominal Animal »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #76 on: July 11, 2018, 11:50:35 pm »
A better example is when you are told to e.g. calculate the area of a non-self-intersecting 2D polygon to within some number of significant figures. For a polygon with n vertices, this is a sum of 2n products. If you use floating-point numbers, and the polygon is located far from the origin, you'll lose most of the precision to domain cancellation errors; that is, each pair of products contains values very similar in magnitude, but of different signs.

There are several ways of fixing that. One is to find the centroid of the polygon, and subtract it from all vertices before calculating the products. Another is to use Kahan sum to calculate the sum of the products. Yet another is to save the products in an array and sort them, so you can calculate the sum in descending order of magnitude of the summands.  Of these, the Kahan sum is the cheapest to implement (essentially free on current architectures), and very likely to provide sufficient accuracy.

First, if you have a polygon with an area of a few square millimetres located 100 km from the origin, and the coordinates of the vertices are stored in 32-bit floats, the information is already lost, and there's absolutely no way to recover the precision needed to calculate the area of the polygon.

Then, you do not have to calculate the position of the centroid. You can just calculate everything relative to the position of any (arbitrary) vertex. Or, if you do want something "central", then instead of the centroid you can simply calculate the average vertex coordinate - sum all the X-coordinates and divide by the number of points, then do the same with the Y-coordinates (but be careful - this approach is rare and will make your program horribly inefficient  :-DD).

I had in mind the likes of 1000-dimensional vectors.
Those tend to be rare in my experience, unless the vectors are sparse (as in sets of (dimension, value) pairs). Operations on them tend to be memory bandwidth limited.

It depends on your application. 30 years ago I used to work with really big multi-dimensional arrays and matrices. Would be so easy now :)

I would guess the compiler would figure 3-dimensional vectors out.
Why would you guess that?

I don't know. I always thought that one of the SIMD targets is graphics. I may be wrong, of course.

The C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As soon as this holds, the compiler is free to do whatever it pleases.
Exactly.  One of the results required by the abstract machine is sequence points: points in time at which the side effects of each operation must be visible in memory.  This is a severe limitation, because there is no way of specifying that two or more operations (or sequences of operations) can be done in any order.

Memory is not a concern - until you access a variable, it doesn't matter what it contains or even if it exists at all. That's why they have "volatile" - to tell the compiler that the variable may be accessed by external processes which the compiler doesn't know anything about.
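For illustration, a minimal sketch of what "volatile" buys you (the register address and names here are made up):
Code: [Select]
#include <stdint.h>

/* Hypothetical memory-mapped status register, updated by hardware
   the compiler knows nothing about. */
static volatile uint32_t *const STATUS = (volatile uint32_t *)0x40001000u;

void wait_until_ready(void)
{
    /* Without volatile, the compiler could load *STATUS once and spin
       forever on the cached value; volatile forces a fresh load on
       every iteration. */
    while ((*STATUS & 1u) == 0u)
        ;
}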

(double)(B - A) + A

if A and B are "float" then this cast doesn't prevent big A from cancelling small B.

I probably should explain that I view the SSE2/3/4/AVX/AVX512 "extensions" to the x86-64 hardware architecture as a separate arithmetic-logic unit, one which is pretty accurately modeled as a helper coprocessor, rather than an integral part of the CPU.

This is how it was historically - the 8087 was physically a separate unit. Then they took it in, the 8087 donated its registers to MMX, then they replicated the MMX functionality with SSE and so on. But these things are still optional. I have a small fanless mini-server running Linux on an AMD processor. It is relatively new, but it doesn't have SSE.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #77 on: July 12, 2018, 07:16:22 am »
First, if you have a polygon with an area of a few square millimetres located 100 km from the origin
You don't need to.  Consider a triangle with vertices at (1000, 1000), (1002, 1000), and (1000, 1002). The area is computed as
Code: [Select]
A = 0.5*( 1000*1000 - 1000*1002 + 1002*1002 - 1000*1000 + 1000*1000 - 1002*1000 )
  = 0.5*( 1000000 - 1002000 + 1004004 - 1000000 + 1000000 - 1002000 )
  = 2
If your coordinate type has four significant digits, you need eight significant digits to describe the product. Domain cancellation occurs when you sum such products and some terms cancel each other. It is a very well known numerical problem, and as I already explained, easily avoided by removing the common bias before calculating the products.
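To make that concrete, here is a minimal C sketch of the bias-removal approach (the function name and coordinate layout are mine, purely for illustration):
Code: [Select]
#include <stddef.h>

/* Shoelace formula with the first vertex subtracted from every
   other vertex, so the products stay small even when the polygon
   sits far from the origin.  x[] and y[] hold n vertices in order;
   the result is the signed area. */
double polygon_area(const double *x, const double *y, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        size_t j = (i + 1) % n;              /* next vertex, wrapping */
        double xi = x[i] - x[0], yi = y[i] - y[0];
        double xj = x[j] - x[0], yj = y[j] - y[0];
        sum += xi * yj - xj * yi;            /* twice the signed edge area */
    }
    return 0.5 * sum;
}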

Also, I corrected a mistake I made in my previous post: I did mean center when I referred to the centroid; i.e. the mean of vertex coordinates. (It was late, and I was careless. Sorry.) 

With regard to floating-point numbers, precision refers to significant figures, i.e. relative precision. Absolute precision is usually referred to as accuracy or, since ISO 5725, "trueness".

It depends on your application. 30 years ago I used to work with really big multi-dimensional arrays and matrices. Would be so easy now :)
It is cheaper to throw more hardware at a computational problem, yes.  But to truly take advantage of the hardware, you cannot rely on the C compiler.  Yes yes, I know the argument: the time it takes me to write and run my computation is worth more than the resources wasted by inefficient code.  (In most cases, yes.  That is why it is perfectly okay to use e.g. Matlab for numerical computations on your own machine.)

That is not at all the point here.  The point is that programming languages, even low-level ones, have not kept up with the possibilities provided by the hardware.  We could do better.

I always thought that one of the SIMD targets is graphics.
I'm not sure what they are targeted at, really.  I do know that graphics cards' approach is very, very different. Both CUDA and OpenCL (C-like languages allowing one to do general purpose numerical computing with graphics cards) are optimized for four-component vectors, but usually limited to single precision.

if A and B are "float" then this cast doesn't prevent big A from cancelling small B.
Right; and that was my point. If we had a language where cancellation is allowed (because the computer can optimize the entire mathematical expression in a way that drops the reference to A altogether), we would only need a way to specify that some sub-expression must be evaluated at some specific precision and range.  That is exactly how e.g. Kahan summation works.

It would mean that if you wrote A + B - A, some compilers would evaluate it as B, and others might evaluate it as if it were written (double)(A + B) - A or A + (double)(B - A). The point is, the compiler would be allowed to choose, unless you used a specific notation (the cast here is just an example) to restrict the rules. Lazily written formulae would not reproduce the exact same results on all compilers, but there would always be a way to specify the formula exactly (and you only need the precision-and-range restriction operator to do that with floating-point types).
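A small, self-contained C illustration of the difference the intermediate precision makes (the values are arbitrary, chosen only to make the cancellation visible):
Code: [Select]
#include <stdio.h>

int main(void)
{
    float a = 1.0e8f;   /* large term */
    float b = 1.0f;     /* small term */

    /* In float, a absorbs b: (a + b) rounds back to a, so we get 0. */
    float lossy = (a + b) - a;

    /* Forcing the intermediate sum to double preserves b. */
    float kept = (float)(((double)a + b) - a);

    printf("float only: %g, double intermediate: %g\n", lossy, kept);
    return 0;
}
On any IEEE 754 implementation this prints 0 for the first expression and 1 for the second.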

This is how it was historically - the 8087 was physically a separate unit.
Yes. The point was that asymmetric multiprocessing (using different types of "cores" simultaneously) is nothing new, but it is cropping up everywhere nowadays, much more than symmetric multiprocessing.

(I would love to have an AMD Ryzen Threadripper, though, just to do scientific computation at home. But HPC is kind of a not-so-common target; most devices nowadays are phones, tablets, and laptops, or desktop machines of comparable performance.)

(SSE2 is included in the x86-64 instruction set. If you run on a 64-bit Intel/AMD x86 architecture processor, you always have SSE2.)



Getting back to the topic at hand, none of the current programming languages have a natural way of implementing asymmetric multiprocessing.  Only some high-level languages support coroutines with human-friendly expressions, although they are quite easy to implement at the hardware level. Our low-level languages (compiled languages without runtime binaries, supporting freestanding environments like kernels) do not have easy, natural ways to express these constructs at all.  Heck, even atomic access and the memory model are a recent (less than a decade old) addition to C, and arguably look like a tacked-on bulge rather than a natural part of the language.

When studying algorithms and their implementation, it is (in my opinion) important to also realize how the language used affects the implementation, behaviour, and efficiency of the algorithm.  To do that, you do need to understand the benefits and limitations of the language used.  (You could say you need to perceive the box before you can think outside the box, but I'm not sure if that is an accurate way of putting it.)

For example, the filesystem access in C is very rudimentary. You can open files, creating them if they do not exist; read and write those files; rename files; and delete files.  There is no way to e.g. list the contents of a directory in standard C.  You need to either use operating system specific extensions, or rely on another standard, usually POSIX.1.  However, because POSIX.1 is not taught as a complete spec on its own, developers are often taught to use opendir()/readdir() etc. to scan directory contents, which is problematic on current systems where the contents of directories may change mid-scan.  POSIX.1 provides nftw(), scandir(), and glob(), which are perfect for scanning directory contents (as in they should behave in a predictable manner when the directory contents change during scanning), and aside from the DOS-era Borland C library, are available whenever opendir()/readdir()/etc. are.
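For instance, a minimal POSIX.1-2008 scandir() sketch (error handling kept to the bare minimum; this lists the current directory in alphabetical order):
Code: [Select]
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct dirent **names;

    /* scandir() reads the whole directory into an array up front,
       so the listing is not confused by entries appearing or
       vanishing mid-scan the way a readdir() loop can be. */
    int n = scandir(".", &names, NULL, alphasort);
    if (n < 0) {
        perror("scandir");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; i++) {
        puts(names[i]->d_name);
        free(names[i]);
    }
    free(names);
    return EXIT_SUCCESS;
}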

Even the standard C I/O is quite deficient; especially how the scanf() family of functions behaves on incorrect numerical input. (I still don't get how Microsoft got their "safe" function variants into an annex of C11, with no mention of getline() anywhere.)
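As a sketch of the getline() approach (POSIX.1-2008), paired with strtol() instead of scanf() for the numeric parsing:
Code: [Select]
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *line = NULL;
    size_t size = 0;

    /* getline() grows the buffer as needed, so there is no fixed-size
       buffer to overflow, and a malformed number never leaves the
       stream in the half-consumed state scanf("%d") can. */
    while (getline(&line, &size, stdin) > 0) {
        char *end;
        errno = 0;
        long value = strtol(line, &end, 10);
        if (end == line || errno == ERANGE)
            fprintf(stderr, "Not a valid number: %s", line);
        else
            printf("Read %ld\n", value);
    }
    free(line);
    return EXIT_SUCCESS;
}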
« Last Edit: July 12, 2018, 07:18:15 am by Nominal Animal »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #78 on: July 12, 2018, 07:53:33 am »
When studying algorithms and their implementation, it is (in my opinion) important to also realize how the language used affects the implementation, behaviour, and efficiency of the algorithm.  To do that, you do need to understand the benefits and limitations of the language used.  (You could say you need to perceive the box before you can think outside the box, but I'm not sure if that is an accurate way of putting it.)

Even if not strictly accurate, it is pithy, thought provoking - and no worse than the wretched "think outside the box" mantra!

More importantly, the first part of that paragraph is accurate.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #79 on: July 12, 2018, 10:17:05 am »
I don't know. I always thought that one of the SIMD targets is graphics. I may be wrong, of course.
SIMD in things like GPUs is obviously targeted at graphics, but I don't think the SIMD in most CPUs ever was. In the PPC world the additions seemed to be targeted at general high throughput computing. In the x86 world I have no idea what the original integer SIMD additions were trying to achieve. They were called MMX (multi-media extension), but only seemed to support some limited image processing (something rather different from graphics) reasonably well. They did a really poor job with anything else considered multi-media at the time, as Intel's own example code showed clearly. The early SSE stuff seemed so focussed on building upon the original integer MMX that the only really effective thing it brought was an efficient replacement for the x87 for SISD computing. It took them years to respond to the needs of most people doing a lot of computation, and to get the alignment issues under control. I don't remember seeing any code for early versions of SSE where I felt the algorithm was something the hardware designers had really tried to support well. I don't think these people were ever in touch with their inner product.  :)
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #80 on: July 12, 2018, 10:40:24 am »
If I recall correctly, one of the use cases for x86 SIMD extensions was speeding up discrete cosine transforms and inverse discrete cosine transforms; both 1D (used in audio compression) and 2D (used in video compression and JPEG images).
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #81 on: July 12, 2018, 01:31:54 pm »
SIMD in things like GPUs is obviously targeted at graphics, but I don't think the SIMD in most CPUs ever was. In the PPC world the additions seemed to be targeted at general high throughput computing. In the x86 world I have no idea what the original integer SIMD additions were trying to achieve. They were called MMX (multi-media extension) ...

In AMD, the first SIMD extension was called "3DNow".
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #82 on: July 12, 2018, 05:15:39 pm »
Also, I corrected a mistake I made in my previous post: I did mean center when I referred to the centroid; i.e. the mean of vertex coordinates. (It was late, and I was careless. Sorry.)

Center is the one requiring the sum, which you frown upon :)

Getting back to the topic at hand, none of the current programming languages have a natural way of implementing asymmetric multiprocessing.  Only some high-level languages support coroutines with human-friendly expressions, although they are quite easy to implement at the hardware level. Our low-level languages (compiled languages without runtime binaries, supporting freestanding environments like kernels) do not have easy, natural ways to express these constructs at all.  Heck, even atomic access and the memory model are a recent (less than a decade old) addition to C, and arguably look like a tacked-on bulge rather than a natural part of the language.

A long time ago people wrote in assembler. It is quite tedious, so people who did it a lot started using macros. If you looked at their assembler programs, you would see more macros than instructions. But macros were not that flexible in some areas; for example, it was still very tedious to calculate expressions. So someone came up with the idea of writing a compiler which would allow more complex macros. The compiler would read your macro file and literally write the assembler for you. Of course, you could insert raw assembler where needed. This is how C came to life. C might be considered a high-level language, but it still has the ebb and flow of assembler, unlike languages such as Algol, FORTRAN, PL/1, and Ada, which started from abstract models.  Because of that flow, C gives you the possibility to write practically anything. This explains C's popularity.

All the high level concepts can be traced back to simple things. For example, C++ objects are structures containing a pointer to an array of virtual functions. But there may be other representations - such as Objective-C, where each object provides a set of functions to access and manage its member functions. Even though C++ and Objective-C are both object oriented, the underlying structure is very different. If you use C++, you must subscribe to the C++ object model. C gives you more flexibility, because you are not tied to any particular model and can implement your objects as you please. For example, look at the Linux driver code. Is it really wise to force a certain model on the programmer through language extensions, such as C++?
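As a concrete (if much simplified) sketch of that C++-style representation in plain C - all names invented for illustration:
Code: [Select]
#include <stdio.h>

struct shape;                              /* forward declaration */

/* The "vtable": a struct of function pointers shared by every
   instance of the same "class". */
struct shape_vtable {
    double (*area)(const struct shape *self);
};

/* Every object starts with a pointer to its class's vtable,
   much like a C++ vptr. */
struct shape {
    const struct shape_vtable *vtbl;
};

/* A "derived class": the base comes first, then its own fields. */
struct circle {
    struct shape base;
    double r;
};

static double circle_area(const struct shape *self)
{
    const struct circle *c = (const struct circle *)self;
    return 3.141592653589793 * c->r * c->r;
}

static const struct shape_vtable circle_vtbl = { circle_area };

int main(void)
{
    struct circle c = { { &circle_vtbl }, 2.0 };
    struct shape *s = &c.base;
    printf("area = %f\n", s->vtbl->area(s));   /* "virtual" dispatch */
    return 0;
}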

Co-routines may be represented as structures containing an execution pointer and a set of local variables. But there may be other representations, special needs, etc. Is it wise to force a particular model upon the programmer by making co-routines a part of the language?
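And a minimal sketch of that co-routine representation in C, using the well-known switch-on-state trick (in the style of Duff's device and protothreads; names invented):
Code: [Select]
#include <stdio.h>

/* The co-routine "object": an execution pointer (here an integer
   state) plus the locals that must survive across yields. */
struct counter_co {
    int state;   /* 0 = start from the beginning */
    int i;       /* loop counter, preserved between calls */
};

/* Returns the next value, or -1 when the co-routine has finished. */
static int counter_next(struct counter_co *co, int limit)
{
    switch (co->state) {
    case 0:
        for (co->i = 0; co->i < limit; co->i++) {
            co->state = 1;
            return co->i;   /* "yield" the current value */
    case 1:;                /* the next call resumes here, inside the loop */
        }
    }
    return -1;
}

int main(void)
{
    struct counter_co co = { 0, 0 };
    for (int v; (v = counter_next(&co, 3)) >= 0; )
        printf("%d\n", v);  /* prints 0, 1, 2 */
    return 0;
}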
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #83 on: July 12, 2018, 06:02:24 pm »
SIMD in things like GPUs is obviously targeted at graphics, but I don't think the SIMD in most CPUs ever was. In the PPC world the additions seemed to be targeted at general high throughput computing. In the x86 world I have no idea what the original integer SIMD additions were trying to achieve. They were called MMX (multi-media extension) ...

In AMD, the first SIMD extension was called "3DNow".
True, but was that anything but a good name for publicity? They put two 32 bit reals in a 64 bit word, and made a fairly general instruction set for computing with them. I used 3DNow for a little while, and I don't recall anything that looked application specific. I do recall that it was a heck of a lot easier to get some performance from 3DNow than it was later on to get some performance out of the four reals per word SSE scheme.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19513
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #84 on: July 12, 2018, 07:56:20 pm »
A long time ago people wrote in assembler. It is quite tedious, so people who did it a lot started using macros. If you looked at their assembler programs, you would see more macros than instructions. But macros were not that flexible in some areas; for example, it was still very tedious to calculate expressions. So someone came up with the idea of writing a compiler which would allow more complex macros. The compiler would read your macro file and literally write the assembler for you. Of course, you could insert raw assembler where needed. This is how C came to life. C might be considered a high-level language, but it still has the ebb and flow of assembler, unlike languages such as Algol, FORTRAN, PL/1, and Ada, which started from abstract models.  Because of that flow, C gives you the possibility to write practically anything. This explains C's popularity.

There are many mistakes in that, but I'll merely note
  • C did not come from assembler, it came from BCPL
  • while C was once almost a low level language, that stopped being true a generation ago. For further information, start with https://queue.acm.org/detail.cfm?id=3212479
  • C is popular because it was given away for free with UNIX, at a time when other language implementations cost real money. That ensured it was taught to students, and the rest is history

Quote
Co-routines may be represented as structures containing an execution pointer and a set of local variables. But there may be other representations, special needs, etc. Is it wise to force a particular model upon the programmer by making co-routines a part of the language?

In modern multicore systems, it is wise to allow the programmer to use coroutines if they simplify the application. Coroutines are impossible in pure C unless you accept significant limitations, e.g. as in FreeRTOS.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #85 on: July 13, 2018, 08:29:59 am »
Center is the one requiring the sum, which you frown upon :)
Yes, and it is a pretty good example of why.  Calculating the center usually requires a loop over all vertices.  However, you can achieve the same result by calculating the area using Kahan summation, without an extra pass over the vertices; and on current machines (especially x86 and x86-64 architectures) the extra work done in the Kahan sum is basically free anyway, because the area calculation loop is limited by memory bandwidth.
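For reference, a plain C sketch of Kahan (compensated) summation - the naming is mine:
Code: [Select]
#include <stddef.h>

/* Kahan summation: 'c' carries the low-order bits that are lost
   whenever a small term is added to a large running sum. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;

    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;   /* apply the stored correction */
        double t = sum + y;    /* big + small: low bits of y are lost */
        c = (t - sum) - y;     /* recover exactly what was lost */
        sum = t;
    }
    return sum;
}
Note that this relies on strict IEEE 754 semantics: compile it without -ffast-math (or equivalent), or the compiler is allowed to reassociate the expression and optimize the compensation away.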

Sure, there are cases where you want a straight sum over some data.  My point is that if you do that, you'll want to take a step backwards, and look at the task at hand as a whole, and see if there are algorithmic improvements one can do, or combine the sum with some other work.  It is not evil or bad in itself, it's just one of those things I've learned to see as an indicator. Like a headache in summer heat is often a sign of dehydration.

A long time ago people wrote in assembler.
My very first paid programming job, in the very early nineties, was to incorporate a serial number (for tracking, in case copying became an issue) into a DOS binary I was not given the source code for.  I used Turbo Pascal to write a simple TUI that incorporated the serial number into the code, and combined it with my own relocator code (which extracted a copy of the same serial number from the relocation data).  Sneaky stuff.

I wrote the EXE relocator in DEBUG.EXE, the debugger/assembler that came with MS-DOS.  It had neither macro nor label support. Good times.

If you look at freestanding C, it is a surprisingly simple language: a very small number of keywords, operators, and core data types.  (It is a common exercise to write a parser for C source code using e.g. flex and bison.)  It is also very procedural.  It could be that C simply hit the sweet spot between human readability, simplicity, and the massive complexity it allows if the programmer can handle it.

For example, look at the Linux driver code.
The Linux kernel is written in GNU C, which is basically C99 with GNU extensions.  The freestanding environment (without the standard C library) is very different from the hosted environment (where the standard C library is available). A similar difference exists when programming microcontrollers in e.g. the Arduino environment: the compiler supports C++ syntax, but only a subset of C++ features is available (and almost none of the standard C++ library).

To me, this is a good indicator that we are missing a low-level systems programming language that exposes the capabilities of the hardware, but has a minimal footprint and no runtime (that is, one capable of running in a freestanding environment).

Co-routines may be represented as structures containing an execution pointer and a set of local variables.
A better low-level implementation, in my opinion, is to have coroutines run on their own stacks.  The tricky bit is how to express (in human-readable language) the two scopes used simultaneously: the coroutine-local and the caller-local.  At the hardware level, the switch to/from a coroutine is then as cheap as a normal function call, and because the local variable references go to a stack, there is no indirect access overhead (which can be significant otherwise).

The yield idiom (returning a value, but not exiting from the coroutine; i.e. the next "receive"/"get" continues execution in the coroutine at that position) is then simple to implement, too, as the coroutine state is completely described by the pair (top of coroutine stack, instruction pointer in coroutine). During coroutine execution, only the corresponding pair (top of caller stack, instruction pointer in caller) needs to be maintained (and can simply be kept in the coroutine stack, if the coroutine stack is per-thread).
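On POSIX systems one can sketch the own-stack approach with <ucontext.h> (obsolescent since POSIX.1-2001 and removed in POSIX.1-2008, but still provided by e.g. glibc; all names here are mine):
Code: [Select]
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static int value;                       /* channel for yielded values */

static void co_yield(int v)
{
    value = v;
    swapcontext(&co_ctx, &main_ctx);    /* save co-routine, resume caller */
}

static void coroutine(void)
{
    for (int i = 0; i < 3; i++)
        co_yield(i);                    /* execution resumes here later */
}

int main(void)
{
    static char stack[64 * 1024];       /* the co-routine's own stack */

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link          = &main_ctx;
    makecontext(&co_ctx, coroutine, 0);

    for (int i = 0; i < 3; i++) {
        swapcontext(&main_ctx, &co_ctx);  /* run/resume the co-routine */
        printf("got %d\n", value);        /* prints got 0, got 1, got 2 */
    }
    return 0;
}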

Is it wise to force a particular model upon the programmer by making co-routines a part of the language?
Good question.  I claim that every language "forces" a particular model upon the programmer anyway.

I definitely want a low-level or systems programming language that does provide coroutines (or generators) as a part of the language.  I know of quite a few practical examples of how they would be extremely useful, both in low-level libraries (almost all I/O, really) and in microcontroller firmware (especially I/O there, too).
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8651
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #86 on: July 13, 2018, 09:59:40 am »
    A long time ago people wrote in assembler. It is quite tedious, so people who did it a lot started using macros. If you looked at their assembler programs, you would see more macros than instructions. But macros were not that flexible in some areas; for example, it was still very tedious to calculate expressions. So someone came up with the idea of writing a compiler which would allow more complex macros. The compiler would read your macro file and literally write the assembler for you. Of course, you could insert raw assembler where needed. This is how C came to life. C might be considered a high-level language, but it still has the ebb and flow of assembler, unlike languages such as Algol, FORTRAN, PL/1, and Ada, which started from abstract models.  Because of that flow, C gives you the possibility to write practically anything. This explains C's popularity.
    There are many mistakes in that, but I'll merely note
    • C did not come from assembler, it came from BCPL
    C was explicitly described as a derivative of BCPL, as were a number of other languages targeted at systems programming. I used Coral extensively (a mostly UK language) before I used C. Coral came from the same school as BCPL, and the transition to C was almost seamless. Coral did support fixed point numbers, though. C never quite got them. :)
    "Almost" is right. As soon as you have basic things like a function call model, you've started to move away from the low level.
    • C is popular because it was given away for free with UNIX, at a time when other language implementations cost real money. That ensured it was taught to students, and the rest is history
    I don't agree with that. PL/1 derivatives were pushed by several of the early MPU makers (Motorola, Intel and others), as it was seen by some as a great basis for systems programming on bare metal. Most users didn't entirely agree. Some Pascal compilers appeared with extensions to make them more flexible (notably Turbo Pascal) and became big hits, because for a while they were the best of a poor bunch of options. Then C compilers appeared for MPUs, and seemed to be a great fit for what people were actually doing with MPUs. C became a huge hit at that point, quite separate from Unix. In the 80s and 90s the vast majority of people programming in C had never touched a Unix machine.
    « Last Edit: July 13, 2018, 10:01:28 am by coppice »
     

    Offline GeorgeOfTheJungle

    • Super Contributor
    • ***
    • Posts: 2699
    • Country: tr
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #87 on: July 13, 2018, 10:48:29 am »
    C is low level imo, as low level as any good macro assembler. Perhaps the abundance of C libraries is what makes people think otherwise, but C libraries are not the C language.
    « Last Edit: July 13, 2018, 08:47:24 pm by GeorgeOfTheJungle »
    The further a society drifts from truth, the more it will hate those who speak it.
     

    Online tggzzz

    • Super Contributor
    • ***
    • Posts: 19513
    • Country: gb
    • Numbers, not adjectives
      • Having fun doing more, with less
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #88 on: July 13, 2018, 11:11:10 am »
    Co-routines may be represented as structures containing an execution pointer and a set of local variables.
    A better low-level implementation, in my opinion, is to have coroutines run on their own stacks.  The tricky bit is how to express (in human-readable language) the two scopes used simultaneously: the coroutine-local and the caller-local.  At the hardware level, the switch to/from a coroutine is then as cheap as a normal function call, and because the local variable references go to a stack, there is no indirect access overhead (which can be significant otherwise).

    The yield idiom (returning a value, but not exiting from the coroutine; i.e. the next "receive"/"get" continues execution in the coroutine at that position) is then simple to implement, too, as the coroutine state is completely described by the pair (top of coroutine stack, instruction pointer in coroutine). During coroutine execution, only the corresponding pair (top of caller stack, instruction pointer in caller) needs to be maintained (and can simply be kept in the coroutine stack, if the coroutine stack is per-thread).

    While not coroutines per se, xC manages all that by:
    • at compile time preventing multiple access by different cores
    • extra keywords for parallel execution
    • a select keyword, like a case statement except the core suspends until the condition (timeout, input, output, message) occurs
    • extra keywords for input/output of messages to other cores or i/o ports, including timestamps

    The resulting code is simple to read and write, is explicit, and can be analysed to determine the precise execution time.

    Very impressive.
    There are lies, damned lies, statistics - and ADC/DAC specs.
    Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
    Having fun doing more, with less
     

    Offline coppice

    • Super Contributor
    • ***
    • Posts: 8651
    • Country: gb
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #89 on: July 13, 2018, 03:48:31 pm »
    While not coroutines per se, xC manages all that by:
    • at compile time preventing multiple access by different cores
    • extra keywords for parallel execution
    • a select keyword, like a case statement except the core suspends until the condition (timeout, input, output, message) occurs
    • extra keywords for input/output of messages to other cores or i/o ports, including timestamps

    The resulting code is simple to read and write, is explicit, and can be analysed to determine the precise execution time.

    Very impressive.
    It sounds like you have really been working with these xCore processors. I find them interesting, just as I found the preceding lineage like the Transputer interesting. The snag always seems to be finding a good fit with an application where xCore makes sense. It feels like their approach is always going to remain a small niche, although when the match is good they should really shine. What has your experience been?
     

    Online tggzzz

    • Super Contributor
    • ***
    • Posts: 19513
    • Country: gb
    • Numbers, not adjectives
      • Having fun doing more, with less
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #90 on: July 13, 2018, 05:38:20 pm »
    While not coroutines per se, xC manages all that by:
    • at compile time preventing multiple access by different cores
    • extra keywords for parallel execution
    • a select keyword, like a case statement except the core suspends until the condition (timeout, input, output, message) occurs
    • extra keywords for input/output of messages to other cores or i/o ports, including timestamps

    The resulting code is simple to read and write, is explicit, and can be analysed to determine the precise execution time.

    Very impressive.
    It sounds like you have really been working with these xCore processors. I find them interesting, just as I found the preceding lineage like the Transputer interesting. The snag always seems to be finding a good fit with an application where xCore makes sense. It feels like their approach is always going to remain a small niche, although when the match is good they should really shine. What has your experience been?

    My experience is limited, but far more positive than I was expecting. They are really easy to use for hard realtime embedded systems. I regard that as a significant niche, and one that other processors can't touch. Effectively they are eating into territory that normally requires FPGAs.

    I was extremely impressed with the language and tools, traditionally the weak point of previous multicore systems. To that they had added excellent simple i/o support that was neatly tied into the language.

    Overall it enables high-level and low-level concepts to be expressed simply and in a unified fashion. And you don't need an RTOS: those functions are all in hardware.

    If it reminds you of Transputers, then it might not come as a surprise that Prof David May is a prime instigator in both Transputers and xCORE. He's learned from experience and improved the system - whereas most other systems are a simple boring derivation of stuff that I was doing in the early 80s. (I'm horrified at how little has changed since then!)

    The PDF at https://www.xmos.com/published/xmos-programming-guide is easy to read and understand. I have the same good feelings I had as when I read the Java whitepaper in 1995.
    There are lies, damned lies, statistics - and ADC/DAC specs.
    Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
    Having fun doing more, with less
     

    Online bd139

    • Super Contributor
    • ***
    • Posts: 23030
    • Country: gb
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #91 on: July 13, 2018, 06:46:51 pm »
    The only problem vs FPGAs is latency. A state machine can have somewhat tighter timings.

    Hence why TCP offload exists.

    Edit: I had good feelings when I saw Java the first time. Then I got my hands dirty on Java EE ;)
     

    Online tggzzz

    • Super Contributor
    • ***
    • Posts: 19513
    • Country: gb
    • Numbers, not adjectives
      • Having fun doing more, with less
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #92 on: July 13, 2018, 08:22:54 pm »
    Only problem vs FPGAs is latency. State machine can have somewhat tighter timings.

    Hence why TCP offload exists.

    Edit: I had good feelings when I saw Java the first time. Then I got my hands dirty on Java EE ;)

    Sometimes latency can be tolerated, provided it is consistent. Otherwise you are correct.

    TCP usually isn't the bottleneck; Van Jacobson demonstrated that in '88/'89.

    TCP offload has a very chequered history, especially with asymmetric "big-little" multiprocessor systems. Offload often doesn't remove a bottleneck. I first saw that in the late 80s, where an Appaling Domain DN10K processor had a 68020 as a network processor. The comms between the DN10K and the 68020 were of equivalent complexity/timing to the networking protocols; the real win is arranging it so that there are zero memory copies as the data goes up the stack to the application.

    Java EE: I managed to avoid that, might have used Rod Johnson's Spring, but ended up using the telecom equivalent of EE: JAIN SLEE.

    I demonstrated a prototype of a telecom system using plain Java plus Tangosol's Coherence as a distributed replicated in-memory database. I really liked Cameron Purdy's mentality, but Tangosol was later borged by Oracle.
    « Last Edit: July 13, 2018, 08:27:28 pm by tggzzz »
    There are lies, damned lies, statistics - and ADC/DAC specs.
    Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
    Having fun doing more, with less
     

