Sure, you can do better than anything by writing in assembler.
No, that's not what I meant. What I meant was that using C constructs that describe explicit vectorization, I can do a better job than a C compiler can.
However, most cases are simple
I've dealt with a lot of C code, and that is just not true. It is only true if your C program is horribly inefficient or imprecise to start with, at the algorithmic level, and no compiler magic will help then.
For example, if you need to calculate a sum of more than a couple of dozen floating-point terms, in practice you'll find you either need to ensure they are summed in order of increasing magnitude, or do a Kahan (compensated) sum, or you'll lose precision.
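For reference, here is what that compensated sum looks like -- a minimal textbook Kahan summation, not any particular library's version:

```c
#include <stddef.h>

/* Kahan (compensated) summation: a second accumulator tracks the
   low-order bits that each addition rounds away, so the result is far
   more accurate than a naive left-to-right sum of the same terms. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0;              /* running compensation for lost bits */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;     /* re-inject the previously lost bits */
        double t = sum + y;      /* low-order bits of y are lost here... */
        c = (t - sum) - y;       /* ...and recovered here */
        sum = t;
    }
    return sum;
}
```

Four extra flops per term, but the error stays bounded instead of growing with the number of terms.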
The most typical operations needed tend to be vector operations, or sums of products. (I'm not talking about computer graphics, either. Think of any physical models, operations on audio or video, animation, or any kind of packing problem, for example. A lot of geometric and physical problems need to be solved in everyday programs.)
A very typical operation is normalizing vectors to unit length. (Which, by the way, might not have anything to do with geometry, because a large number of problems can be simplified this way.) It turns out that if the components of a vector are consecutive in memory, it is especially hard for the compiler to vectorize the norm-squared computation, and the compiler-generated code tends to do the square root operation separately. However, if you write the C code so that you explicitly load the vectors with each vector register holding one component from consecutive vectors, you get near-optimal code, even with GCC (which is terrible at vectorization compared to Intel C/C++). Because C is defined in terms of an underlying abstract machine with sequence points, it is very, very hard for a C compiler to detect this pattern in the code and optimize it. (I think ICC does detect the most typical patterns -- brute-force recognition of each individual pattern is the only way these can be optimized in C -- but GCC does not.)
Similarly, if you want to do e.g. matrix-matrix multiplication between large matrices (a common problem in many optimization problems), the bottleneck is the cache access pattern. The solution is to use temporary memory and copy the contents of at least one of the matrices, so that both will have very linear access patterns.
Ulrich Drepper's 2007 article, "What Every Programmer Should Know About Memory", is a very detailed look into this, including measurements for a version vectorized using SSE2 instructions. The C language itself does not allow a compiler to do this automatically (and it is not even clear whether one would want a compiler to do it).
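A minimal sketch of the copy-one-matrix idea (not Drepper's actual SSE2 code; the function name is mine, and a real implementation would additionally block for the cache):

```c
#include <stddef.h>
#include <stdlib.h>

/* C = A * B for n-by-n row-major matrices.  B is first copied into a
   transposed temporary, so the innermost loop walks BOTH operands with
   unit stride instead of striding through B column-wise. */
void matmul_bt(const double *A, const double *B, double *C, size_t n)
{
    double *Bt = malloc(n * n * sizeof *Bt);
    if (!Bt)
        abort();
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            Bt[j*n + i] = B[i*n + j];       /* transposed copy of B */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)  /* two linear streams */
                s += A[i*n + k] * Bt[j*n + k];
            C[i*n + j] = s;
        }
    free(Bt);
}
```

The O(n^2) copy is trivially repaid by the O(n^3) multiply no longer missing the cache on every access to B.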
As to the parallelization with the number of cores, I think we're already at the end of this road
Again, I sort of disagree. I agree that with symmetric multiprocessing, increasing the number of cores will not help with a small number of concurrent tasks -- say, typical laptop, tablet, or desktop computer usage, where you have a browser open, maybe a word processor, and such.
Asymmetric multiprocessing, where you have different types of cores, some optimized for specific tasks (like calculating checksums of data, or DMA'ing it around based on logical filter expressions), is growing like crazy. Phones tend to have a separate processor doing all the network/audio stuff, graphics are handled by dedicated processors (which are still highly programmable via CUDA or OpenCL), and so on. (In fact, I have an Odroid HC1 single-board Linux computer, which has a Samsung Exynos 5422 processor: four fast Cortex-A15 cores and four slower, simpler Cortex-A7 cores.)
Coroutines, message passing (especially various "mailbox" approaches), atomic operations, and explicit concurrency (or, rather, explicit notation of memory ordering and visibility of side effects) are very important tools for writing asymmetric multiprocessing code, since typically the intent is to keep data flowing at maximum bandwidth with minimum latency. C does not provide those well (even in C11, the memory model and atomics were essentially copied over from C++, on the assumption that although the two languages differ a lot in their approach, C++-style atomics should work in C too).
A related problem (to coroutines and message passing) is that we need a construct even lighter than threads (nanothreads, fibres, or similar). CPU cores have so much state that context switches, even within the same process, tend to be expensive. Coroutines and closures (especially if combined with message passing) could really use some sort of nanothreads that have their own stack but otherwise share state with their parent thread. Hardware-wise there is no problem (stack switching is actually quite common in POSIXy operating systems, for signal delivery), but such concepts have thus far only been incorporated into very high-level languages like Python; we do not really know how to frame/describe/define them for low-level code in a way that actually works and does not cause more harm than good. In particular, it is not clear whether coroutines should be implemented as functions (or with similar restrictions), or treated specially, similar to e.g. POSIX signal handler functions (where many common operations lead to undefined behaviour, and are therefore not really allowed).
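To show that the hardware side really is a non-problem, here is a sketch of the "own stack, shared everything else" idea using the POSIX ucontext API (deprecated in newer POSIX, but still widely available on Linux); the names and the fixed 64 KiB stack are just illustrative choices of mine:

```c
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];   /* the coroutine's private stack */
static int steps;                    /* shared with the parent */

/* Runs on coro_stack, but sees the same globals as the parent; each
   swapcontext() is a cheap user-space stack switch, no kernel round
   trip and no full thread context. */
static void coro(void)
{
    for (int i = 0; i < 3; i++) {
        steps++;
        swapcontext(&coro_ctx, &main_ctx);   /* yield to the parent */
    }
}

int run_coro(void)
{
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link = &main_ctx;            /* where to go if coro returns */
    makecontext(&coro_ctx, coro, 0);
    for (int i = 0; i < 3; i++)
        swapcontext(&main_ctx, &coro_ctx);   /* resume the coroutine */
    return steps;
}
```

The mechanism is trivial; what we lack is a language-level definition of what such a nanothread may and may not do, which is exactly the signal-handler-like problem I mentioned.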
There is a lot of research being done on all of these concepts as we speak, down to new types of hardware architectures. On the programming-language development side, a major problem is that by the time you've learned enough to develop a language, all that knowledge will have steered you towards a specific paradigm/approach to programming languages, so it is terribly hard to create something completely novel. (Plus, even if you do, the sheer inertia to overcome is more than one human can bear.) Higher-level languages are easier, because there is less work to get a minimal working implementation for examples, further research, and results (for example, implementing typical programs and comparing them to similar programs written in other languages). To implement a new low-level language, you also need to implement the basic runtime library from scratch, consider the application binary interface (ABI), and decide whether to model your base library on top of C/POSIX (and thus system calls on most OSes), or try for something different. Even just mapping a minimal subset of POSIX (to get filesystem access and so on) onto your new base library is a huge task; it is a complex specification with a very long history. (And without POSIX, you basically condemn yourself not just to creating a new programming language, but to creating a whole new OS to go with it.)
It would take someone like Elon Musk donating a million or so to a number of oddball developers like myself, just to see what they come up with. (Most of us would fail, but there is a small probability that someone could come up with a new low-level language better suited to current and future hardware architectures than C is, yet with a core language on the same order of simplicity.)