A secondary use case is to interleave unrelated operations because their order and side effects are irrelevant
I think one way to do that is with ',' instead of ';'.
No, comma is a sequence point in C: the side effects of the left side must be complete before the side effects of the right side start.
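For illustration (a made-up snippet):
int i = 0;
int j = (++i, 10 * i);  /* ++i is complete before 10 * i is evaluated, so j is 10 */
The comma operator forces that ordering whenever side effects are involved, so it gives the compiler no extra freedom.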
However, there is one workaround with GCC: splitting individual operations into (often static inline) functions marked pure or const, and using local-scope const variables to hold the temporary results. Then GCC knows there are no side effects, and is free to reorder and interleave the machine code implementation. For example:
static inline double op1(const double arg1, const double arg2) __attribute__((const));
static inline double op1(const double arg1, const double arg2)
{
    /* Implementation depends only on arg1 and arg2, and does not modify any variables. */
}
static inline double op2(double arg1, double arg2, double arg3) __attribute__((pure));
static inline double op2(double arg1, double arg2, double arg3)
{
    /* Implementation may depend on arg1, arg2, arg3, and optionally on global variables,
       but may only modify arg1, arg2, arg3 (its own local copies). */
}
then a sequence like, say,
/* double arg1, arg2, arg3; */
double result;
{
    const double temp1 = op1(arg1, arg2);
    const double temp2 = op1(arg1, arg3);
    const double temp3 = op2(arg1, arg2, arg3);
    result = op2(temp1, temp2, temp3);
}
has maximal opportunities for optimizing the code used to calculate result, since GCC knows the only side effect is the value of result itself.
(It might be better to declare result in the same scope with current versions of GCC; the above code is for illustration only, and not from a real-world project.)
However, as you can see, when you split up a complicated formula this way, say something like the one used in the embedded atom model (EAM, often used in molecular dynamics simulations involving metals), the code becomes completely undebuggable and write-only. Especially so when you vectorize the calculation (for two or four pairs of atoms) using SSE or AVX.
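Just to give the flavour, a hypothetical fragment of such a vectorized kernel, computing the squared distances of four atom pairs at once with AVX (names and data layout invented for this sketch; compile with -mavx), might look like:
#include <immintrin.h>

/* Squared distances for four atom pairs at once; dx/dy/dz hold the
   per-pair coordinate differences (hypothetical layout). */
static inline __m256d pair_dist2(const double *dx, const double *dy,
                                 const double *dz)
{
    const __m256d x = _mm256_loadu_pd(dx);
    const __m256d y = _mm256_loadu_pd(dy);
    const __m256d z = _mm256_loadu_pd(dz);
    return _mm256_add_pd(_mm256_mul_pd(x, x),
           _mm256_add_pd(_mm256_mul_pd(y, y),
                         _mm256_mul_pd(z, z)));
}
And that is only the easy distance part of the formula; layering the rest of the EAM expression on top is where it turns into write-only code.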
After all, we could write the code in assembly right now. It just isn't worth the time and especially the maintenance effort. I believe, but do not know for sure, that a new low-level programming language, not completely different from C, could solve this and other system-level programming issues.
Another problem occurs when the calculation does have side effects (for example, you keep a min/max tally, reorder the neighbour list to keep pairs within the interaction distance at the beginning of the slice for a particular atom, and so on), but the programmer knows their order does not matter. I can cheat by implementing the function calls in a separate compilation unit while declaring them as const/pure in the compilation unit where they're used, but then the compiler cannot inline the machine code.
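A minimal sketch of that cheat, with hypothetical names, would look something like this:
/* tallies.c -- the real implementation, with a benign side effect */
double min_r = 1.0e300;

double pair_term(double r)
{
    if (r < min_r)
        min_r = r;          /* min tally; the order of updates does not matter */
    return 1.0 / r;
}

/* user.c -- the cheat: declare the function pure here, hiding the side effect,
   so GCC may reorder or merge the calls; but because the definition lives in
   another compilation unit, the calls cannot be inlined. */
double pair_term(double r) __attribute__((pure));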
I'd rather not write an assembly generator and test harness to implement optimized versions of these hot code paths, as they're not worth it: maintainability is much more important than a few percentage points of speed increase. But if there were a low-level language in which such concepts could be expressed, with the compiler itself understanding the SIMD vector concept and atomic ops (including support for both CAS and LL/SC, and preprocessing/choosing an implementation at compile time depending on which one the target uses), this kind of high-performance code might become somewhat more readable.
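For the atomic op part, C11 atomics (and the GCC __atomic builtins) already leave the CAS-versus-LL/SC choice to the compiler; a sketch of an order-independent min tally (made-up names, relaxed ordering) would be:
#include <stdatomic.h>

/* Atomically lower *target to value if value is smaller. On x86 the
   compare-exchange becomes a CAS (LOCK CMPXCHG); on ARM/POWER it becomes
   an LL/SC loop: the compiler picks the implementation for the target. */
static void atomic_min(_Atomic double *target, double value)
{
    double old = atomic_load_explicit(target, memory_order_relaxed);
    while (value < old &&
           !atomic_compare_exchange_weak_explicit(target, &old, value,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;  /* 'old' is refreshed on failure; retry while value is still smaller */
}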
This is not, however, just for HPC, but also for low-level libraries, and kernel or freestanding code. Stuff like privilege boundaries (like between the kernel and userspace, and between userspace processes) needs atomics and exact control of when side effects are to be visible. Only the compiler knows exactly which hardware architecture and processor type the code is compiled for (and ELF binaries at least have a native way for functions to provide the linkage for other functions at run time), so all this stuff really needs to be part of the language provided by the compiler, and not a "library" in the C sense.
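(One such ELF mechanism is the GNU IFUNC: a resolver function runs when the dynamic linker binds the symbol and returns the pointer all later calls will use. A made-up sketch, assuming GCC on an x86 ELF target:)
typedef double dot_fn(const double *, const double *, long);

static double dot_generic(const double *a, const double *b, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Stand-in for a hand-vectorized version. */
static double dot_avx(const double *a, const double *b, long n)
{
    return dot_generic(a, b, n);  /* would use AVX intrinsics in reality */
}

/* Resolver: runs once at symbol binding time. */
static dot_fn *resolve_dot(void)
{
    __builtin_cpu_init();   /* the C runtime may not be fully up yet */
    return __builtin_cpu_supports("avx") ? dot_avx : dot_generic;
}

double dot(const double *a, const double *b, long n)
    __attribute__((ifunc("resolve_dot")));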
I do not know the "best" feature set myself yet, but looking at other languages (and especially their syntax: how they express intent, side effects, and so on) gives me new ideas, at least.
A similar-ish issue exists in userspace programming, especially GUI applications. Desktop environments and GUIs seem to be best implemented using an event-based approach, and threads and queues can be used to make responsive, efficient applications. GTK+ shows that you do not need to have an OOP language to implement one, even C suffices, but Python shows that you can do it in simple, concise code. (Python's downside is that it has slowish I/O, and is an interpreted language where the standard interpreter can only run bytecode in one thread at a time; I'd much rather see a compiled language with similar features, and a modular, lightweight runtime.)
In this case 'best' is defined as using the minimum number of cycles in the operating thread, including statistical latency estimates, possibly interleaving the operations.
Perhaps I'm not quite understanding your requirement here or where/how you see this being done without some form of execution analysis.
No, I thought you were referring to standard runtime analysis tools, which instead of analysing the machine code simply run it, measuring the time taken and sampling the instruction pointer.
On superscalar processors, C's sequence points (in the presence of side effects) mean that the pipelines are not fully utilized, especially for simple code.
I would like to mark code so the compiler gets to ignore the order of side effects, and thus sequence points altogether, and try its hardest (even use extraordinary amounts of time) to optimize these rare code sequences.
Also, parallelism would look at functional blocks and it's not clear if that's what you are referring to by operation.
I am referring to parallelism; specifically, instruction-level parallelism on superscalar architectures, not thread- or process-based parallelism, for chunks of code whose order of execution and order of side effects are irrelevant to the correctness of the program; guaranteeing that is up to the programmer, and not for the compiler to detect.
(I do not know how much you know about the subject, and anyway always try to write in a way that keeps anyone interested in the subject along with the ride. So, please do not read my tone as condescending or anything. I have no subtext, me no English wrote that good.)
A secondary use case is to interleave unrelated operations because their order and side effects are irrelevant, typical in initialization. There, the intent is for the compiler to choose how to implement the machine code;
I'm a little lost here. Instruction sequencing (which includes scheduling, dependencies, instruction groups, available execution units, store queues, etc.) is considered during optimization.
Yes, but the first problem occurs at the higher stage. For example, how do you tell the compiler that you have two or more program sequences that may be executed in any order, even simultaneously (instruction-level parallelized on superscalar architectures, not using different threads of execution)?
The secondary issue is that when you have a very hot (taking much of the total CPU time) set of e.g. mathematical operations, there are many different ways to turn that sequence of operations into machine code. A brute-force, i.e. trial-and-error, method applied to the entire program would make the compiler insufferably slow, but it might be worth spending that kind of effort on specifically marked blocks of code. I'm not even sure how one would implement such an optimizer, but I'm pretty sure it'd have to include generating variants of the relevant abstract syntax tree, and so on.
The mechanism used for this could also be used e.g. to tell the compiler that code doing graphics with single-precision floating-point numbers does not need to be fully IEEE-754 compliant (only finite normal values, and errors larger than 0.5 ULP allowed), while some other code doing e.g. statistics in the same compilation unit does need to be. (I know: why would you have them both in the same compilation unit in the first place? But this is just the example that popped into my mind.)
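GCC can already get partway there with the optimize function attribute, which relaxes the floating-point rules for one function while the rest of the unit stays strict; a made-up illustration:
/* Graphics helper: may be contracted and reassociated, no NaN/Inf care. */
__attribute__((optimize("fast-math")))
float gfx_blend(float a, float b, float t)
{
    return a + t * (b - a);
}

/* Statistics helper: compiled with the unit's normal, strict IEEE-754 options. */
double stats_mean(const double *x, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}
(GCC's documentation warns that the optimize attribute is intended mainly for debugging, so treat this as an illustration rather than a recommendation.)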
Neither of these involves branches (conditionals sometimes, yes, but only very rarely branches), so runtime profiling is useless here.
Perhaps, but as I said, my use of/interest in FPDR was primarily to look at branching. Run-time analysis can provide a lot more insight than that use case, even though it is not fed back to the optimizer.
Sure, I didn't mean to imply it is not useful; again, just that run-time analysis in this particular case shows that the block(s) of code at hand are indeed very "hot", with well over 50% of the CPU time spent there.
Optimization is kind of a funny problem: trying to find an efficient expression, without using excessive resources (especially time) while doing so. There are some very rare, often complex but branchless, chunks of code I'd like the compiler to go beast-mode on when generating the code, because I already know that the CPU will spend most of its time in those particular chunks of code, no matter what the dataset.
I wonder how many optimization methods have been set aside as inapplicable because they were too slow in practice? Or how many have not been explored, because they are known not to scale well enough to be practical?