Whether new languages will gradually supplant C for systems programming is yet to be determined.

I'm mostly interested in systems programming (that is, kernels and low level libraries), and HPC (molecular dynamics simulations with classical potentials).

Since we are talking about algorithms, one of my core interests is related to

Verlet lists (or neighbour lists); and in particular, enhancing them for both vectorization and cache locality. The key point about neighbour/Verlet lists is that rather than calculate all particle pairs, you maintain a list of neighbours for each particle. The classical potentials have a cut-off distance, above which a pair does not interact. If you (re)construct the neighbour/Verlet list to include all neigbours within the cut-off distance plus twice some delta, you can keep using the same list until at least one particle has moved by delta.

The simplest enhancement I've developed is to allow reordering the neighbours in the list, while keeping a parallel array of distances (and (unit) direction vectors to the neighbors within interaction distance). The list is not kept sorted by distance; instead, the maximum interaction distance acts like the pivot in Quicksort. Interactions are then evaluated for the initial section only, those particles that are within interaction distance. This allows increasing the delta (at cost of increasing the neighbor list size), but also only calculate the actual interactions for truly interacting particle pairs, rather than all those in the neighbour list. (So, you calculate distance, or rather distance squared, for all neighbours, but the actual interaction for only the neighbours the particle really interacts with.)

In theory, that's all there is to that algorithm. It's not "new" or "useful" enough to get published. Implementing it in Fortran is relatively straightforward, although the extra arrays needed (for keeping the distances and/or delta vectors) are a bit annoying. Implementing it in C opens up a whole lot of new possibilities, like using an ASIC/FPGA to calculate the distances at double precision, another to look up the potential and force using piecewise cubic polynomials with very large (but constant) coefficient tables.. perfect for asymmetric multiprocessing.

Looks exactly the same to me. You know as they say: " ... don't multiply entities beyond necessity".

They are normal C constructs. Using intrinsics in C code is very different to say inlining the assembly code.

Or, put another way, you can teach a new C programmer to use the intrinsics and vectorize computation, without teaching them assembly.

Depends on what you're doing. May be rare for some people and frequent for others. It's silly to assign labels without reviewing the particular situation.

It was an opinion/estimate based on the C sources for dozens, if not hundreds of open-source projects. I do believe I've written over

~~a million lines~~ half a million lines of C code myself, too.

Besides, the two examples you had - calculating the vector length and multiplying matrices are not that far from the sum of the array in terms of complexity.

And yet, no C compiler can properly vectorize such code, which was my point. Writing the vectorization explicitly, via functions that operate on specific types that contain multiple data values, yields a significant speedup. In my opinion, that is because the model underlying C does not allow, and cannot allow, the transformations necessary. Even when C compilers can do the transformation, it typically requires a very specific C pattern to be used; essentially, the C compiler detects the pattern, and implements it using a pre-prepared code template. That is like translating between languages with a frasebook: you are limited to the phrases in the book.

Compiler options that tell the compiler to break C rules just emphasize that point, in my opinion.

Say you need to know distances with 1mm precision and your biggest distance is 100 km

Then I'd use 32-bit integers, not floating-point types. Just like you do not use floating-point types for financial calculations either.

A better example is when you are told to e.g. calculate

the area of a non-self-intersecting 2D polygon to within some number of significant figures. For a polygon with n vertices, this is a sum of 2n products. If you use floating-point numbers, and the polygon is located far from the origin, you'll lose most of the precision in domain cancellation errors; that is, each pair of products contain values very similar in magnitude, but of different signs.

There are several ways of fixing that. One is to find the

~~centroid~~ center of the polygon, and subtract it from all vertices before calculating the products. Another is to use Kahan sum to calculate the sum of the products. Yet another is to save the products in an array and sort them, so you can calculate the sum in descending order of magnitude of the summands. Of these, the Kahan sum is the cheapest to implement (essentially free on current architectures), and very likely to provide sufficient accuracy.

I had in mind the likes of 1000-dimentional vectors.

Those tend to be rare in my experience, unless the vectors are sparse (as in sets of (

`dimension`,

` value`) pairs. Operations on them tend to be memory bandwidth limited.

The square-rooting takes real long.

On AMD Ryzen, gathering load to a vector register takes longer than a vectorized square root; square root taking roughly the same as two vectorized divisions.

Obviously, it varies a lot between implementations, but in general terms, square root is just a couple of times to a few times as expensive as a division, with addition and multiplication being only limited by memory bandwidth.

I would guess the compiler would figure 3-dimentional vectors out.

Why would you guess that? They haven't yet, even though we've had SSE and support for 4-component floats in a single vector register for almost two decades now.

BTW: Intel already had horizontal addition/subtraction in SSE3.

HADD[PS][SD] is strictly speaking

*half* or

*quarter* of a horizontal addition; similarly for [V]HSUB[PS][SD]. On most implementations, it is as slow as vectorized division, much slower than vectorized addition or multiplication. This is not just a quirk of the implementations; it is very hard to make it fast.

C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As son as this holds, the compiler is free to do whatever it pleases.

Exactly. One of the results required by the abstract machine are sequence points: points in time at which side effects of each operation must be visible in memory. This is a severe limitation, because there is no way of specifying that two or more operations (or sequences of operations) can be done in any order.

There are two solutions - allow compiler to re-arrange floats, or use integers wherever possible.

Much better solution than either of those, is to use a construct similar to C99 cast, and specify that without such casts, the compiler is free to simplify the expressions using arithmetic rules. In C99 and later,

` (`*type*)(*expression*) restricts the value of

` `*expression* to the range and precision of type

` `*type*.

This means that in such a programming language,

` (B - A) + A ` is equivalent to

` B`, but

` (double)(B - A) + A `requires the compiler to ensure that the difference, before the addition, is limited to

` double `range and precision. The notation could be better, however: the cast expressions seem to be pretty hard for humans to parse correctly.

For some reason you quoted my post where I was talking about Intel (although you did cut it off right before the "Intel" word).

I apologize if I caused confusion: my only intent is to help here. (Please, do not read anything in my "tone" either. I strive for clarity, precision, and usefulness; that is all. Consider me autistic in my communications that way.)

I probably should explain that I view the SSE2/3/4/AVX/AVX512 "extensions" to the x86-64 hardware architecture as a separate arithmetic-logic unit, one which is pretty accurately modeled as a helper coprocessor, rather than an integral part of the CPU.

Especially if you look at kernel-level C code, it becomes an useful way of considering things. The SSE/AVX register state alone is huge, and storing and loading it back takes a considerable number of cycles. One reason the Linux kernel in particular avoids it, is because it makes context switches (between userspace and kernel space) faster: if the kernel does not touch the SSE/AVX state, the state only needs to be stored/restored when switching from one userspace thread/process to another.