We have translated more than 300 LLVM optimizations into Alive and, in the process, found that eight of them were wrong.
%1 = xor i32 %x, -1
%2 = add i32 %1, 3333
  =>
%2 = sub i32 3332, %x
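As a quick sanity check of the transformation above (not Alive itself, just a spot check of the underlying two's-complement identity: ~x == -x - 1, hence ~x + 3333 == 3332 - x in 32-bit wraparound arithmetic):

#include <stdint.h>
#include <stdio.h>

/* Spot check of the rewrite above: (x ^ -1) + 3333 == 3332 - x in 32-bit
   wraparound arithmetic. Only an illustration; Alive proves such
   transformations correct for all inputs. */
int main(void)
{
    const uint32_t tests[] = { 0u, 1u, 3332u, 3333u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        uint32_t x = tests[i];
        if (~x + 3333u != 3332u - x)
            printf("mismatch at x = 0x%08X\n", (unsigned)x);
    }
    return 0;
}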
Generate good code for the frequent case; don't optimize the exceptional case
Generate good code from the beginning; don't generate bad code and optimize later
Generate predictable code
Quote: Generate good code from the beginning; don't generate bad code and optimize later
Brandis and others [3] report that non-optimizing Oberon compilers for the Motorola 68020, the Sun SPARC, and the DEC MIPS architectures generated code that was as efficient as the code generated by the standard C compiler with the optimization option switched on (cc -O). Only for the IBM RS/6000 was the optimizing C compiler able to generate more efficient code than the Oberon compiler.
Yeah - this probably falls into the category of "stating the bleeding obvious". I suspect it is based on experience of working with substandard code generators.

Quote: Generate good code from the beginning; don't generate bad code and optimize later
I'm not really buying the argument of this one; the others make sense, though.
Quote: How is that any different than just enabling optimization by default?

Not all un-optimised code is equal. There can be many different ways of compiling code "as written"; some can be better than others due to the author's better knowledge of the instruction set and more careful design and implementation.
Isn't the value of having a flag to compile un-optimized that it attempts to compile the code "as written", which we know might not be optimal?
Some words of wisdom regarding compiler code optimisation can be found in Hanspeter Mössenböck's article Compiler Construction - The Art of Niklaus Wirth.
For example, these are some of the headings:
Quote: Generate good code for the frequent case; don't optimize the exceptional case
Quote: Generate predictable code
Good luck with that; the interactions between data access patterns and caches are significant and unknowable at compile time.
If you want to laugh: in my-c I have to force a cache flush on every read/write attempt to the shared RAM, because some hardware-designer idiot put the shared RAM in the same area addressed by the cache controller (which can't be disabled, because ... nobody knows why). So when cpu0.board0 needs to talk to cpu1.board1 (on the other mezzanine), there is a cache in the middle without a bloody cache-coherency protocol. Not only are timings non-deterministic, but you risk a cache-miss machine exception (which would be problematic if it happens under interrupt: an ISR exception inside a kernel exception), and, worse still, without cache-miss exceptions the shared cell content is not guaranteed to be correct; you are not sure it has been updated, so you could read old information.
What kind of crazy hardware madness is that?!? It's pure garbage, but it happens even with big companies ... an example is the Indigo2/Impact by SGI ... its MIPS4 R10000 has similar cache problems.
All of that is BAD hardware design, and you cannot fix it; the best you can do is ... try to find a workaround.
So, for my second MIPS5++ prototype board, I created a special memory model in my-c to handle that madness. Simple trick: the machine layer always puts a "cache flush" before every IO on the shared RAM.
Job done! It just makes talking about "optimization" really funny, since performance literally sucks, but at least it works correctly.
(I wish I could unsolder the cache, anyway)
Nah, it is good hardware design because it pushes the hard rare issues to be solved by software.
Quote: I can understand those guys, because my-c needs to force an extra "cache invalidation" op after each DMA. It makes code-optimization ... all a joke.

On i.MX RT1062, forgetting the cache invalidation after DMA isn't catastrophic; it can just give "wrong" data. (If the area is already cached, then reading will yield stale data; if the area is already cached, a write will overwrite the DMA'd data. No interrupt or such is generated. The cache is also small enough – 32k for code, 32k for data – that just doing an unconditional data cache eviction in DMA write-completion interrupts and before DMA reads has acceptable overhead.)
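For reference, a minimal sketch of that unconditional eviction around a DMA receive buffer, assuming CMSIS-Core on a Cortex-M7 part; the device header name and the rx_buf buffer are hypothetical, not taken from the post above.

#include <stdint.h>
#include "MIMXRT1062.h"   /* hypothetical device header; pulls in the CMSIS-Core SCB_*DCache* calls */

/* Hypothetical DMA receive buffer, 32-byte aligned to match the Cortex-M7 D-cache line size. */
static uint8_t rx_buf[512] __attribute__((aligned(32)));

void dma_rx_start(void)
{
    /* Drop any cached copies before the DMA write starts, so dirty lines
       cannot later be evicted on top of the DMA'd data. */
    SCB_InvalidateDCache_by_Addr((void *)rx_buf, sizeof rx_buf);
    /* ... configure and start the DMA transfer here ... */
}

void dma_rx_complete_irq(void)
{
    /* Invalidate again in the completion interrupt, before the CPU reads the
       buffer, in case speculation refilled some of the lines meanwhile. */
    SCB_InvalidateDCache_by_Addr((void *)rx_buf, sizeof rx_buf);
    /* ... hand rx_buf to the consumer ... */
}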
P.S. Also, the ***32 Mbyte*** KSEG0-as-"cached coherent" hardware trick is great and unique!!! Other systems only give you 4-8 Mbyte. With 32 Mbyte I can have the Linux kernel (~6 Mbyte) + mini-ram-rootfs (~16 Mbyte) all running without any issue!
Great paper. Sure, some points are a bit outdated where they relate to the limitations of computers from the '80s and '90s.
The discussion about definition modules is interesting.
As to the "quality" of the IR impacting the output of optimizers, this can be experimented with relatively easily. These days, with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.
For someone implementing a compiler, I would recommend defining your own IR, fully adapted to the language itself, and then implementing various generators from this IR – it can emit C code, LLVM IR, or assembly if you feel so inclined.
While LLVM IR looks decently general, I would recommend not basing your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.
I currently know very little about LLVM; it's something I've started to explore (and there's a managed class library (https://github.com/dotnet/LLVMSharp) that can be used from C# too, which is important).
Quote: Caches are a hack, which work and fail randomly.

Well, enjoy the performance of a cacheless PC :popcorn:
Caches are a pretty useful hack. ;D
Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speed :'(
/*
* branch if the first register operand is less than the second
*/
slt %t0, %t1, %t2 // %t0 = 1 if %t1 < %t2; otherwise %t0 = 0
bne %t0, %0, L1 // branch to L1 if %t0 != 0
nop             // branch delay slot: on pipelined implementations, this is always executed
...
L1:
Quote: Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.

The other end of the spectrum is high-performance computing, where performance is money.
If we limit to embedded, then many appliances – routers, file servers and NAS boxes – benefit more from performance than predictability of every single operation; we do want some limits, but the average/typical/expected performance is what the humans care about.
This, too, is a good example of how you want the decision made by a human designer, instead of a compiler: the exact same thing, say an IP stack, may require predictable (adequate) performance in one use case, and variable but as high performance as possible in another. It is not something you can leave for the tools to decide.
Quote: While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.

Exactly. For compiler optimizations, that means the ability to enable/disable each as desired.
The traditional hierarchical control works pretty well: you have "optimization levels" or "types", each including a set of optimizations.
A particular example of this is how clang on ARMv7e-m is a bit too aggressive at unrolling loops at -O2, so one might wish to use -Os (optimize for size) instead, or -O2 -fno-unroll-loops.
The most important thing to users is that these are well documented (for clang, see Clang command line argument reference (https://clang.llvm.org/docs/ClangCommandLineReference.html)). While numerically most use cases just use -O2, -Og, or -Os, there are those important cases where one wants finer control. The way clang allows one to enable/disable optimization for a particular function via #pragma in C and C++ is also quite useful and important, even though rarely needed.
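As a sketch of that per-function control (the function here is a made-up example; the pragma spelling is clang's, while GCC uses #pragma GCC optimize / push_options instead):

#include <stddef.h>
#include <stdint.h>

/* Everything between the two pragmas is left unoptimized by clang
   (as if marked optnone), regardless of the command-line -O level. */
#pragma clang optimize off
uint32_t checksum_debug(const uint8_t *p, size_t n)   /* hypothetical example function */
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
#pragma clang optimize on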
I personally also heavily use Compiler Explorer (https://godbolt.org/), which can compile a number of languages, from Ada to Zig, for many target architectures, showing the generated machine code. It is how I made sure that this C function for Cortex-M7 (https://www.eevblog.com/forum/microcontrollers/this-code-is-so-broken-it-some-how-fixes-itself/msg4675780/#msg4675780) computes
$$S = \sum_{i=0}^{N-1} x_i C_i, \quad N \text{ even}$$
for 16-bit signed integer or fixed-point \$x_i\$ and \$C_i\$ at about four cycles per iteration, at full precision and without overflow as the accumulator \$S\$ is a 64-bit integer (using the SMLALD SIMD instruction available on Cortex-M7), when compiled with either GCC or Clang (and suitable optimization). It is obviously useful for FIR and IIR filters, as well as FFT/DFT, when the data is 16-bit signed integer or fixed-point format.
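For context, a minimal sketch of that kind of inner loop (hypothetical names, not the linked function itself); whether the compiler actually emits SMLALD depends on the exact loop form and flags, which is exactly what Compiler Explorer is used to verify.

#include <stdint.h>
#include <stddef.h>

/* S = sum of x[i]*c[i] over 16-bit signed values, accumulated in a 64-bit
   integer so it cannot overflow; n is assumed even, matching the text. */
int64_t dot16(const int16_t *x, const int16_t *c, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i += 2) {
        s += (int32_t)x[i]     * (int32_t)c[i];
        s += (int32_t)x[i + 1] * (int32_t)c[i + 1];
    }
    return s;
}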
Quote: In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.

True, and as I mentioned earlier, as an initial approximation for understanding the interactions and issues, subsystems and peripherals like DMA can be modeled as separate threads/cores running in parallel.
Quote: The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.

I do agree that twiddling with the source code to obtain the desired machine code for a specific processor and compiler combination is not something one does for general code; it's just that I do a lot of low-level computationally intensive stuff, where the performance is important. The other option, for me, would be to write the actual machine code in extended assembly, letting the compiler choose the registers to be used.
The properties of C (and to a lesser extent, C++) as languages are such that it is very difficult, sometimes impossible, for the compiler to SIMD-vectorize the code, and to more generally convert data-sequential loops to data-parallel ones (for the backend optimizer to optimize for the target processor). So, to make sure your most computationally intensive innermost loops are as efficient as possible, this Compiler Explorer investigation is unfortunately necessary.
Regarding my example in my previous message, it actually did not target any individual processor, but any Cortex-M4 (with FPv5) or -M7 microcontroller. Consider it an equally low-level function as, say, memcpy() is: intended to be used as the core operation when implementing higher-level functions like FIR and IIR filters. (For DFT/FFT on 16-bit real data with a power-of-two number of samples, one probably wants a slightly different core operation.)
In that particular case, the odd loop iteration form was necessary, because on the ARMv7e-m target architectures (Thumb/Thumb2 instruction set), neither GCC nor Clang could combine the loop iterator and data pointers when the loop is written using a separate iterator variable. This is actually a rather common issue with C compilers – and a good example of how both GCC and Clang suffer from the same problem, even though they have completely different backends and intermediate representations for the code! – and I've used the same pattern even on x86-64 with SSE/AVX SIMD vectorization.
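To illustrate the pattern (again a hypothetical sketch, not the exact code referred to above): the loop condition is expressed on one of the data pointers instead of a separate index variable, so the compiler only has to track the pointers themselves.

#include <stdint.h>
#include <stddef.h>

/* Same operation as the earlier sketch, written with an end pointer instead
   of an index variable; n is assumed even. */
int64_t dot16_ptr(const int16_t *x, const int16_t *c, size_t n)
{
    const int16_t *const x_end = x + n;
    int64_t s = 0;
    while (x < x_end) {
        s += (int32_t)x[0] * (int32_t)c[0];
        s += (int32_t)x[1] * (int32_t)c[1];
        x += 2;
        c += 2;
    }
    return s;
}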
Which, funnily enough, leads back to the topic at hand: optimization. When one uses a tool like Compiler Explorer to compare such tight loops, including which languages and which compilers can SIMD-vectorize such loops effectively for the same architecture (say, x86-64), it well illustrates how the language model itself limits what kind of optimizations can be done. The data-sequential to data-parallel loop conversion is the main one I'm aware of (simply because it affects HPC so often), but I'm sure there are others.
(When we include distributed computing (non-uniform memory architectures on larger machines, and distributed computing on multiple machines), we get to one of my own pet peeves: the difficulties related to ensuring communication and computation can be done at the same time. Technically, the solution is asynchronous message passing, and is well known, but for whatever reason, it seems to be quite difficult for some people to truly grok; it is mostly a human problem. Similar, but lesser issues affect parallel shared memory accesses, mostly related to atomicity and caching and interactions between multiple separate locking primitives. We seem to generally just not have the necessary mental machinery to manage the complexity and rules, I guess.)
Caches are useful every time you have storage devices with different access latencies and/or speeds in a single system. This goes beyond just RAM.
No doubt some people are dreaming of a unified space with everything accessed at top speeds and in a single address space, but this is not realistic and would come with other issues as well.
Until then, caches and translation units serve us well.
One useful approach is to be able to "lock" some fast-memory area (could be cache or something else) for blocks of code or data that you don't want to ever be swapped out of cache / go back and forth.
Of course, while this works well in tightly constrained systems, it's hardly applicable for a general-purpose system.
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(
Quote: Proper security schemes? See CHERI and The Mill.

Most recently, in the highest level of userspace – browsers and such – we (the part of the software development world that actually cares about this stuff) are right now discovering that what we actually need is a hierarchy of access controls under each human user, as well as the hierarchy of access controls at the human user level. Things like "human (H) needs to access processes (A) and (B), but we do not want (A) and (B) to know anything about each other, and definitely not to exfiltrate data".
That sort of stuff is not something you bolt on top of something; it has to be supported at the core kernel level, and to do that, it needs hardware support (proper MMU in particular).
Over a decade ago, when I was doing research and HPC admin stuff, I constructed a similar hierarchical model for maintaining projects' web pages on the same server, using dedicated local group assignments for access controls, because there were a lot of "student admins", and maintainers changed quite often.
The same scheme could be used to secure web sites like forums and WordPress-type stuff, by ensuring that code running on the server could neither create new executables or interpretable binaries, nor rewrite themselves, and that pages only have the minimum set of privileges they need to do their thing.
Why aren't those used, and why doesn't WP et al. support such access partitioning schemes at all? Because management software like Plesk, cPanel, and Webmin does not support that: it only supports one user account per site. Besides, WP and SMF and other such software are designed to be able to upgrade themselves, which requires them to be able to rewrite their own code. The entire ecosystem is designed to be insecure!
There is a reason BSD created Jails so early, and why Linux has seccomp and cgroups.
I personally would prefer a sub-hierarchy under user accounts (a third ID type, in addition to UID and GID, that defaults to 0, and has a similar sub-hierarchy as UIDs have) instead of cgroups, but hey, we work with what we have.
Circling back to code optimization, we only need to consider the relatively recent discovery of the security issues in speculative execution, and other side-channel attacks (like exfiltrating information based on how much time a known computational operation on it takes), to see that, depending on developer intent, the same source code should compile to either fixed-resource machine code (to stop the timing attacks) or to maximum-performance code (because all processes having access to the timing information would already have access to the plaintext data).
So, when we talk about "optimization", we also need to specify the purpose.
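As a concrete example of that purpose-dependence (a generic sketch, not tied to any particular library): a comparison routine written so its timing does not depend on where the first mismatch occurs, which is precisely the property a purely speed-oriented optimizer has no reason to preserve.

#include <stddef.h>
#include <stdint.h>

/* Generic "constant-time" comparison: every byte is examined, so the run time
   does not reveal the position of the first difference. A compiler optimizing
   purely for speed is, in principle, free to reintroduce an early exit and
   break that intent, unless told that timing is part of the contract. */
int equal_consttime(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    return diff == 0;
}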
Quote: Do modern processors allow cache invalidation of relatively small memory sections?

For ARM Cortex-M7, it is part of the ARM core, so it's documented in the Armv7-M Architecture Reference Manual (https://developer.arm.com/documentation/ddi0403/ee/), separately from e.g. the ST and NXP manuals for their Cortex-M7 MCUs, even though ARM explicitly documents some of these as implementation defined! Relatively opaque? Byzantine, I'd say. Anyway, section B2.2.7, and especially Table B2-1, describes the possible operations.
Quote: Setting [proper security scheme] up correctly is an expert task, and possibly not something a simple tool could achieve.

Definitely; and it is not something one should let random developers at, as there is lots of experience in security schemes already available, and most developers just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.
Just as the intermediate representation can subtly limit the code optimization options and approaches possible, our web services are not insecure because the operating systems do not support better, more secure approaches; they are insecure because of arbitrary (easy but bad) choices made in the design and implementation of the services themselves, and of the services managing them. The inertia of "this is how everybody else does it" is enormous, and very difficult to change, even if it constantly leads to user information being leaked and exploited illegally.