EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: Sherlock Holmes on January 31, 2023, 03:44:50 pm

Title: Code Optimization
Post by: Sherlock Holmes on January 31, 2023, 03:44:50 pm
This is an interesting paper. Given that many of the compilers we might use today leverage LLVM as their back end, it's useful to get some insight into this, as it affects much of what we do. So the question to ask is: is my code correct?

Provably Correct Peephole Optimizations with Alive (https://web.ist.utl.pt/nuno.lopes/pubs/alive-pldi15.pdf)

Quote
We have translated more than 300 LLVM optimizations into Alive and, in the process, found that eight of them were wrong.

Interesting to read that transforming this sequence of LLVM IR:

Code: [Select]
%1 = xor i32 %x, -1
%2 = add i32 %1, 3333

into this optimized form:

Code: [Select]
%2 = sub i32 3332, %x

relies on some 160 lines of C++ analysis and optimization code!
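
The rewrite itself is easy to sanity-check: in two's complement, ~x == -x - 1, so (~x) + 3333 == 3332 - x, with both sides wrapping identically. A brute-force check in C (my own quick sketch, not code from the paper):

Code: [Select]
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* exhaustive check of (~x) + 3333 == 3332 - x over all 2^32 values,
       using unsigned arithmetic so the wraparound is well defined */
    uint32_t x = 0;
    do {
        if ((~x) + 3333u != 3332u - x) {
            printf("counterexample: x = %u\n", (unsigned)x);
            return 1;
        }
    } while (++x != 0);
    printf("identity holds for all 32-bit x\n");
    return 0;
}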

Title: Re: Code Optimization
Post by: cfbsoftware on January 31, 2023, 10:38:43 pm
Some words of wisdom regarding compiler code optimisation can be found in Hanspeter Mössenböck's article Compiler Construction - The Art of Niklaus Wirth.

For example, these are some of the headings:

Quote
Generate good code for the frequent case; don't optimize the exceptional case

Quote
Generate good code from the beginning; don't generate bad code and optimize later

Quote
Generate predictable code

The complete article can be downloaded from:

http://pascal.hansotten.com/uploads/books/art3.pdf (http://pascal.hansotten.com/uploads/books/art3.pdf)

Hanspeter Mössenböck designed the object-oriented extensions to Oberon that became Oberon-2, and also created Coco/R - A Generator for Fast Compiler Front-Ends.

C#, Java, C++, F#, VB.Net, Delphi, Swift, TypeScript and Oberon versions of the Coco/R tool are available:

https://ssw.jku.at/Research/Projects/Coco/ (https://ssw.jku.at/Research/Projects/Coco/)

Title: Re: Code Optimization
Post by: eutectique on January 31, 2023, 11:56:59 pm
Have you heard of superoptimization? You can find an interesting example in this work: https://web.stanford.edu/class/cs343/resources/superoptimizer.pdf
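
The famous example in that paper is the signum function, which the superoptimizer reduced to a few branch-free 68020 instructions. A branchless C rendering of the same idea (my sketch, not the paper's code):

Code: [Select]
/* branchless signum: returns -1, 0, or +1.
   In C, (x > 0) and (x < 0) each evaluate to 0 or 1. */
int sign(int x)
{
    return (x > 0) - (x < 0);
}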
Title: Re: Code Optimization
Post by: thm_w on February 01, 2023, 12:34:30 am
For example, these are some of the headings:
Quote
Generate good code from the beginning; don't generate bad code and optimize later

I'm not really buying the argument for this one; the others make sense, though.

Quote
Brandis and others [3] report that non-optimizing Oberon compilers for the Motorola 68020, for the Sun SPARC and for the DEC MIPS architectures generated code that was as efficient as the code generated by the standard C compiler with the optimization option switched on (cc -O). Only for the IBM RS/6000 the optimizing C compiler was able to generate more efficient code than the Oberon compiler.


How is that any different from just enabling optimization by default?
Isn't the value of having a flag to compile un-optimized that it attempts to compile the code "as written", which we know might not be optimal?
Title: Re: Code Optimization
Post by: cfbsoftware on February 01, 2023, 04:07:46 am
For example, these are some of the headings:
Quote
Generate good code from the beginning; don't generate bad code and optimize later

I'm not really buying the argument for this one; the others make sense, though.
Yeah - this probably falls into the category of "stating the bleeding obvious".  I suspect it is based on experience of working with substandard code generators.
Quote
How is that any different from just enabling optimization by default?
Isn't the value of having a flag to compile un-optimized that it attempts to compile the code "as written", which we know might not be optimal?
Not all un-optimised code is equal. There can be many different ways of compiling code "as written"; some can be better than others due to the author's better knowledge of the instruction set and more careful design and implementation.
Title: Re: Code Optimization
Post by: SiliconWizard on February 01, 2023, 05:46:53 am
Great paper. Sure, some points are a bit outdated where they relate to the limitations of computers from the '80s and '90s.

The discussion about definition modules is interesting.

As to the "quality" of the IR impacting the output of optimizers, this can be experimented relatively easily. These days with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.

For someone implementing a compiler, I would recommend defining your own IR, fully adapted to the language itself, and then implementing various generators from this IR - these can emit C code, LLVM IR, or assembly if you feel so inclined.

While LLVM IR looks decently general, I would recommend not basing your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.
Title: Re: Code Optimization
Post by: tggzzz on February 01, 2023, 09:38:53 am
Some words of wisdom regarding compiler code optimisation can be found in Hanspeter Mössenböck's article Compiler Construction -
The Art of Niklaus Wirth
.

For example, these are some of the headings:

Quote
Generate good code for the frequent case; don't optimize the exceptional case

Determining the "frequent case" is, um, challenging in modern programs.

One of the more interesting methods of optimisation is given in a series of papers where a research compiler accidentally gave production quality results.
https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html (https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html)

Quote
Generate predictable code

Good luck with that; the interactions between data access patterns and caches are significant and unknowable at compile time.
Title: Re: Code Optimization
Post by: DiTBho on February 01, 2023, 10:36:51 am
Good luck with that; the interactions between data access patterns and caches are significant and unknowable at compile time.

if you want to laugh: in my-c I have to force a cache flush on every read/write attempt to the shared ram, because some hw-designer idiot put the shared ram in the same area addressed by the cache controller (which can't be disabled, because ... nobody knows). So when cpu0.board0 needs to talk to cpu1.board1 (on the other mezzanine) ... well, there is cache in the middle, without a bloody cache coherency protocol. Not only are the timings non-deterministic, but you risk a cache miss machine exception (which would be problematic if it happens under interrupt -> ISR exception inside a kernel exception) and - worse still - without cache miss exceptions the shared cell content is not guaranteed to be correct: you are not sure it has been updated, so you could read old information.

What's that crazy hardware madness?!? It's pure garbage, but it happens even with big companies ... an example is the Indigo2/Impact by SGI ... its MIPS4 R10000 has similar cache problems.

All of that is BAD hardware design, and you cannot fix it; the best you can do is ... try to find a workaround.

So, for my second MIPS5++ prototype board, I created a special memory model in my-c to handle that madness. Simple trick: the machine layer always puts a "cache flush" before every IO on the shared ram.

Job done! It just makes talking about "optimization" really funny, since performance literally sucks, but at least it works correctly.

(I wish I could unsolder the cache, anyway)
Title: Re: Code Optimization
Post by: tggzzz on February 01, 2023, 11:27:54 am
Good luck with that; the interactions between data access patterns and caches are significant and unknowable at compile time.

if you want to laugh: in my-c I have to force a cache flush on every read/write attempt to the shared ram, because some hw-designer idiot put the shared ram in the same area addressed by the cache controller (which can't be disabled, because ... nobody knows). So when cpu0.board0 needs to talk to cpu1.board1 (on the other mezzanine) ... well, there is cache in the middle, without a bloody cache coherency protocol. Not only are the timings non-deterministic, but you risk a cache miss machine exception (which would be problematic if it happens under interrupt -> ISR exception inside a kernel exception) and - worse still - without cache miss exceptions the shared cell content is not guaranteed to be correct: you are not sure it has been updated, so you could read old information.

I wonder if the OP will realise that's why a memory model is necessary. He had difficulty with that in his other thread on A New Hardware Oriented Programming Language "The Imperium programming language - IPL" https://www.eevblog.com/forum/programming/a-new-hardware-oriented-programming-language/msg4661023/#msg4661023 (https://www.eevblog.com/forum/programming/a-new-hardware-oriented-programming-language/msg4661023/#msg4661023)

I don't know why that thread was locked, but I imagine the OP requested it.

Quote
What's that crazy hardware madness?!? It's pure garbage, but it happens even with big companies ... an example is the Indigo2/Impact by SGI ... its MIPS4 R10000 has similar cache problems.

All of that is BAD hardware design, and you cannot fix it; the best you can do is ... try to find a workaround.

So, for my second MIPS5++ prototype board, I created a special memory model in my-c to handle that madness. Simple trick: the machine layer always puts a "cache flush" before every IO on the shared ram.

Nah, it is good hardware design because it pushes the hard rare issues to be solved by software.

HP/Intel had the same approach with Itanium Itanic.

Oh.

Quote
Job done!, just it makes talking about "optimization" really funny since performance literally sucks, but at least it works correctly.

(I wish I could unsolder the cache, anyway)

If you are allowed to get incorrect answers, you can employ much more aggressive optimisation.  ;)
Title: Re: Code Optimization
Post by: Nominal Animal on February 01, 2023, 12:15:10 pm
Even on NXP i.MX RT1062 (as used on Teensy 4.x) one has to do a manual cache flush between DMA and non-DMA accesses to the same memory.  It's not too expensive, but it is basically impossible to optimize at the machine code generation level. (Teensyduino Core for these does such flushes around DMA operations, which isn't optimal, but not too bad either.)

In other words, some features cannot be automated, as they depend on the programmer's intent and are not really decidable by the compiler.  (A single-threaded interpreter could track it and only do a cache flush when needed, but it turns out that on these, the cache flush is much cheaper than tracking all accesses.)
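
For reference, the usual pattern on these Cortex-M7 parts uses the CMSIS-Core cache maintenance calls; roughly like this (a sketch only: the buffer name and sizes are mine, and the buffer must be aligned to the 32-byte data cache line size):

Code: [Select]
#include <stdint.h>
#include "core_cm7.h"   /* CMSIS-Core cache maintenance for Cortex-M7 */

/* DMA buffer, aligned to the 32-byte data cache line size */
static uint8_t dma_buf[512] __attribute__((aligned(32)));

void start_dma_tx(void)
{
    /* CPU wrote dma_buf: push dirty cache lines out to RAM before the DMA reads it */
    SCB_CleanDCache_by_Addr((uint32_t *)dma_buf, sizeof dma_buf);
    /* ... start the DMA transfer from dma_buf ... */
}

void on_dma_rx_complete(void)
{
    /* DMA wrote RAM behind the cache's back: drop any stale cached copies */
    SCB_InvalidateDCache_by_Addr((uint32_t *)dma_buf, sizeof dma_buf);
    /* ... now the CPU can safely read dma_buf ... */
}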
 _ _ _

There is one very useful intermediate representation detail that nicely illustrates the subtle constraints (as mentioned by SiliconWizard) the IR may put on language implementations, somewhat related to the above: data-parallel loop optimizations.

Backends that support Fortran typically support at least two different types of loops with completely different semantics: FOR loops, with similar semantics and requirements as C/C++/etc. loops normally have, essentially sequential data loops; and FORALL for parallel data loops with associated limitations and semantics.  Essentially, FORALL loops have no predetermined order of iterations, but occur in parallel, with an arbitrary or undefined order of side effects (like memory access pattern).  In particular, FORALL loop bodies cannot refer to results calculated by other iterations in the same loop.

If your backend IR supports both data-sequential "FOR" loops and data-parallel "FORALL" loops, even if your language does not have a similar concept, it is often possible to convert data-sequential loops into data-parallel loops as an algorithmic optimization, when the language constructs used within the loop body permit such conversion.  In practice, the handling of the iterator variable and the iteration direction for "FORALL" loops varies depending on the target architecture, and is implemented using a machine code pattern best suited for that type of instruction set architecture.

If we take a wider view, we can see that this difference between FOR and FORALL loops is essentially algorithmic –– data-sequential versus data-parallel loops ––, and if the IR does not support data-parallel loops, to achieve similar benefits, one would need to convert the loop representation to one that best suits each target architecture, requiring quite a lot of effort.

To circle back to the things like cache flush patterns that compilers may not in all cases be able to deduce for themselves, the FOR/FORALL difference is opposite: it is not too difficult for the compiler to determine if the data references in the loop body require sequential data access, or whether the data accesses would still be valid if they were done all in parallel, and thus determine when FOR loops can be turned into FORALL loops.

Thus, it is not necessary for the programming language itself to differentiate between FOR and FORALL loops –– even I can argue both for and against! ––, but it is necessary for the compiler to understand the difference between data-sequential and data-parallel loops, and the IR/backend to support both, to truly leverage this.
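
To make the distinction concrete, here is what the two kinds of loops look like in C (illustrative only; C spells both the same way, which is exactly why the compiler has to infer the difference):

Code: [Select]
#include <stddef.h>

void forall_like(float *c, const float *a, const float *b, size_t n)
{
    /* data-parallel: iterations are independent, so they may run in any
       order or simultaneously ("FORALL" semantics) */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void for_only(float *a, size_t n)
{
    /* data-sequential: iteration i consumes the result of iteration i-1
       (a loop-carried dependency), so the order is fixed ("FOR" semantics) */
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}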
 _ _ _

In high-performance computing Fortran is still used, because it is easier to write efficient computation in Fortran than in C or C++.  Comparing the microbenchmarks exhibiting the differences, and the machine code generated, the main reasons are surprisingly few, and all involve arrays: efficient array slicing, data-parallel loops, and filling an array with a constant through duplication of the initial member without interpreting its value.

On x86-64, data-parallel loops involving floating-point data are particularly easy to vectorize, leading to more efficient machine code.  To achieve the same in C or C++, one needs to write the loop in a very specific form (specific to the compiler version), or use vector intrinsics provided by the compiler (<immintrin.h> for x86 and x86-64).
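
For illustration, a hand-vectorized sum using those intrinsics (a minimal sketch: assumes n is a multiple of 8, and needs -mavx or equivalent):

Code: [Select]
#include <immintrin.h>
#include <stddef.h>

/* sum of n floats, 8 lanes at a time with AVX; assumes n % 8 == 0 */
float sum_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(&x[i]));

    /* horizontal reduction of the eight partial sums */
    float lane[8];
    _mm256_storeu_ps(lane, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += lane[k];
    return s;
}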

It is exactly for this kind of thing that I insist one must look at the machine code generated, and at wildly different programming languages, when designing a new language.  True efficiency requires understanding and applying many such concepts and details, and I personally have not yet seen any books or articles that describe many of them – in particular because single-instruction multiple-data (SIMD) vectorization is quite "new" in computer science terms, and a lot of the findings are still in peer-reviewed articles, not yet collected and their meaning distilled into books that describe the best current understanding and approaches.

Concepts like the difference between FOR and FORALL loops (or rather, data-sequential and data-parallel loops) definitely look insignificant on the surface, but when you look into how they actually affect the code generated for different hardware architectures, and how much it helps with SIMD vectorization (so much so that I personally don't think SIMD vectorization outside FORALL/data-parallel loops makes much sense!), you realize they can make a very significant difference on certain architectures –– we're talking 2× to 8× computational performance in some cases.

And while FOR/FORALL is something the compiler can handle by itself without the language requiring explicit constructs, some other constructs, like cache flushes and even customized memory prefetch patterns, may have to be exposed to the human developer.  This means that a "single universal language fit for all purposes" is not only unrealistic, it would have very suboptimal performance.  There are not that many situations where one is willing to give up, say, nine tenths of the performance just to achieve exactly predictable behaviour; personally, I care about the limits of the behaviour, and need those to be well defined, but I do not care at all about the distribution of the behaviour as long as it is within those known limits.
Title: Re: Code Optimization
Post by: Sherlock Holmes on February 01, 2023, 02:59:35 pm
Great paper. Sure, some points are a bit outdated where they relate to the limitations of computers from the '80s and '90s.

The discussion about definition modules is interesting.

As to the "quality" of the IR impacting the output of optimizers, this can be experimented relatively easily. These days with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.

For someone implementing a compiler, I would recommend defining your own IR, fully adapted to the language itself, and then implementing various generators from this IR - these can emit C code, LLVM IR, or assembly if you feel so inclined.

While LLVM IR looks decently general, I would recommend not basing your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.

I currently know very little about LLVM; that's something I've started to explore (and there's a managed class library (https://github.com/dotnet/LLVMSharp) that can be used from C# too, which is important).

As for a language-specific IR, that's noted. I've just started to look at creating an AST from the Antlr CST, and that's going quite well; once I get the basic pattern I'm going to use established, this will become a very meaty (yet extremely interesting) phase of the work. It's already reporting some initial diagnostics too, but the overall structure of the tree is still very fluid.

You're right too, it might be prudent to consider an IR decoupled from LLVM. I never did that in my earlier work, but it could pay dividends. The ANSI spec for PL/I Subset G is actually a fascinating document - possibly the first language to have a formally defined concrete syntax and abstract syntax; the abstract syntax is in fact described in terms of an imaginary "PL/I machine", so I might dig into that for ideas. Sadly the ANSI standard is not public, a shame because it's a first-class example of a formal language definition (I have a printed copy I purchased decades ago).

I also joined a compiler design online discussion group which has been most helpful.

You sound like someone who knows a thing or two about all this, have you coded with LLVM much at all?


Title: Re: Code Optimization
Post by: DiTBho on February 01, 2023, 04:00:37 pm
Nah, it is good hardware design because it pushes the hard rare issues to be solved by software.

hehe  ;D

right, I think the hardware guys must have thought something like, "oh, well, it'll be up to the software guys to fix it, so let's ignore it because we are terribly late with the project deadline"  :o :o :o

And so they did, and commercial guys just said "approved", and then the result was so bad that the next product has a different approach.

Yup, there are commercial MIPS4 systems where { R10k, R12K, R14K, R16k } also expose several problems with load/store instructions after a mispredicted branch.

What?

"cache-coherent system" vs "non-cache-coherent system"


That madness was commercialized and, worse still, Gcc never had a solution, so you cannot run Linux on I2/R10K. There are patches (cache-flush after each I/O), sure, but ... they don't really work, and they have never been accepted mainstream.

I can understand those guys, because my-c needs to force an extra "cache invalidation" op after each DMA. It makes code-optimization ... all a joke.

Anyway, on my Atlas board there is a hw-feature that causes any store to all but the first 32M of KSEG0 addresses to fail, and that allows you to configure KSEG0 as "cached coherent": you only need to put the kernel code into the first 32M of RAM and use HMEM for the rest. Any memory that is the target of a DMA transfer will be removed from the TLB, and is thus inaccessible while the DMA is in progress.

Just, you have to make sure it isn't brought back into the TLB by an accidental reference.

Oh, but! BUT!!! on the second MIPS5++ cpu-board prototype I can completely disable the cache and remap it as kseg0 ram, an incredible pleasure!!!! What can you do with 2Mbyte of synchronous ram running at 400Mhz with ZERO wait cycles!!!  :o :o :o
Title: Re: Code Optimization
Post by: Nominal Animal on February 01, 2023, 06:10:18 pm
I can understand those guys, because my-c needs to force an extra "cache invalidation" op after each DMA. It makes code-optimization ... all a joke.
On i.MX RT1062, forgetting the cache invalidation after DMA isn't catastrophic, it can just give "wrong" data.  (If the area is already cached, then reading will yield stale data; if the area is already cached, a write will overwrite the DMA'd data.  No interrupt or such is generated.  The cache is also small enough – 32k for code, 32k for data – so that just doing an unconditional data cache eviction on DMA write completion interrupts and before DMA reads, has acceptable overhead.)

There is a lot of talk about cache coherency on the linux-kernel mailing list amongst key kernel developers.  When you have more than one processor, it gets quite complicated.  In particular, you must resolve cache conflicts at the point when one processor loads the cache line (and we can model DMA on these architectures by treating it as a separate processor, just without its own cache per se); it is basically impossible to resolve cache conflicts when more than one processor has the data cached.  (You can imagine why: both processors think they have valid data, and have already done work on said data.  If one wins, then the work done by the other would have to be undone somehow.  Any other resolution leads to work having been done with the wrong data.)

This is related to code optimization and to how IR can subtly limit a compiler, because the Linux kernel itself is based on axiomatic assumptions (especially wrt. cache coherency and locking primitives), and porting the kernel to a system where those assumptions do not hold is extremely painful.
Similarly, trying to implement language features not supported by the intermediate representation, is often quite painful; like stuffing a rectangular peg through a smaller round hole.
Title: Re: Code Optimization
Post by: DiTBho on February 01, 2023, 07:43:34 pm
On i.MX RT1062, forgetting the cache invalidation after DMA isn't catastrophic, it can just give "wrong" data.

Linux/MIPS on systems like O2/{R10K, R12K} and Indigo2/{R8K, R10K} crashes in less than 5 seconds from bootstrap.

That's why I say "catastrophic"  :D

Usually the severe failure happens after a DMA transfer on the SCSI sub-unit, but it also happens with the *UMA unit, so even with the ram-rootfs (kernel + ram-rootfs, to exclude the SCSI issue) some data is just wrong.

Sure, there are file systems for non-cache-coherent multi-cores (e.g. Hare-fs), but Linux "as-is" has no support and crashes.

So, running Linux on SGI O2/MIPS4, and/or on Indigo2, means "kernel crash *ALWAYS* guaranteed within 5 seconds from bootstrap"(1), unless you "patch gcc" to force cache-invalidation/flush.

I gave up years ago  :-//


(1) hey? can we say this is a feature?
Guaranteed to crash within 5 sec  ;D ;D ;D
Title: Re: Code Optimization
Post by: DiTBho on February 01, 2023, 07:50:02 pm
p.s.
also the ***32Mbyte*** KSEG0 as "cached coherent" hardware trick is great and unique!!! Other systems only give you 4-8Mbyte. With 32Mbyte I can have the Linux kernel(~6Mbyte) + mini-ram-rootfs(~16Mbyte) all running without any issue!
Title: Re: Code Optimization
Post by: tggzzz on February 01, 2023, 07:55:53 pm
p.s.
also the ***32Mbyte*** KSEG0 as "cached coherent" hardware trick is great and unique!!! Other systems only give you 4-8Mbyte. With 32Mbyte I can have the Linux kernel(~6Mbyte) + mini-ram-rootfs(~16Mbyte) all running without any issue!

Caches are a hack, which work and fail randomly.

Basic principle: if you know how to optimise, do it - but when you don't, randomise it. Hence caches, spread spectrum, ...
Title: Re: Code Optimization
Post by: SiliconWizard on February 01, 2023, 08:05:20 pm
Great paper. Sure some points are a bit outdated, when related to limitations of computers from the 80's and 90's.

The discussion about definition modules is interesting.

As to the "quality" of the IR impacting the output of optimizers, this can be experimented relatively easily. These days with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.

For someone implementing a compiler, I would recommend defining its own IR, fully adapted to the language itself, and then implementing various generators from this IR - can be C code, LLVM IR, or assembly if you feel inclined to that.

While LLVM IR looks decently general, I would recommend not to base your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.

I know very little currently about LLVM that's something I've started to explore (and there's a managed class library (https://github.com/dotnet/LLVMSharp) that can be used from C# too which is important).

LLVM can take its native IR code in pure text form, and I'd recommend using that for starters. No need to fiddle with the LLVM libraries directly until you want to generate a fully "standalone" compiler. One thing at a time. You can absolutely start with just the LLVM binaries.
Title: Re: Code Optimization
Post by: gf on February 01, 2023, 08:57:39 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:
Title: Re: Code Optimization
Post by: SiliconWizard on February 01, 2023, 09:10:31 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D
Title: Re: Code Optimization
Post by: gf on February 01, 2023, 09:22:23 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speed :'(
Title: Re: Code Optimization
Post by: tggzzz on February 01, 2023, 10:04:26 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

You are referring to the "mean" or "expected" performance of a PC, which can drop by orders of magnitude when least convenient. Hell's teeth, even on a 486 with its minuscule caches, it was possible to see mean:worst interrupt times of 8:1. It ain't got better :)

For a mere hint at worst-case performance, look at any benchmark which shows performance as a function of array size. The benchmark articles spend lots of words discussing where the performance cliffs are, and why.

Then add in the effects that other programs have on yours.

And don't forget to add in the effects of having to refill TLBs and other caches from disk :0
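
You can get a crude look at such a cliff with something like this (my quick sketch, not from any particular benchmark article): time a fixed number of accesses over growing arrays and watch the per-access time step up as the working set overflows each cache level. Hardware prefetch smooths a simple linear walk, so the steps are muted; pointer-chasing shows them far more brutally.

Code: [Select]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* one access per 64-byte cache line, same access count for every size */
    const size_t touches = 1u << 26;
    for (size_t size = 1u << 12; size <= 1u << 26; size <<= 1) {
        volatile char *buf = malloc(size);
        if (!buf) return 1;
        for (size_t i = 0; i < size; i++) buf[i] = (char)i;  /* warm/fill */

        clock_t t0 = clock();
        size_t idx = 0;
        for (size_t i = 0; i < touches; i++) {
            buf[idx] += 1;
            idx = (idx + 64) % size;       /* size is a power of two */
        }
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%8zu KiB: %.2f ns per access\n",
               size >> 10, dt * 1e9 / (double)touches);
        free((void *)buf);
    }
    return 0;
}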
Title: Re: Code Optimization
Post by: tggzzz on February 01, 2023, 10:05:55 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speed :'(

You mean you can't use the TI9900's trick of having the registers located in RAM? Shame; that made context switching trivial :)
Title: Re: Code Optimization
Post by: DiTBho on February 01, 2023, 11:20:38 pm
@tggzzz
there is even a further hardware option  :o

so, tell me, if you were me, would you run the MIPS5++ Atlas board ***always*** with the cache disabled and the resource reassigned as static synchronous ram?  ;D

The CPU clock is 400Mhz
The cache clock is 400Mhz
The cache re-assigned as simple-linear-ram works at 400Mhz, with ZERO wait state
(there is a small fpga to handle chips as "cache"(small L1 + L2, or large L1) or as "linear ram")


You can choose one of these two operating modes!

For me, 32Mbyte of fast-ram is a huge space!!! Especially for small kernels like XINU (1Mbyte) or ucOS/2 (256Kbyte)!!!

Physically these 32Mbyte are on two large boards populated with 512 Kbyte synchronous ram chips. Sure, that's a lot of chips (on each ram-board, 16 chips on side A and 16 chips on side B; 66 chips in total, including the fpga-logic to address them), but it doesn't look too bad with high-density smd.

And! you can always ask the CPU to disable the branch delay slot (only 1 bit to be set), so after a branch instruction the CPU will automatically insert a "bubble" (works like a NOP)! WOW, that's awesome for a pipelined MIPS!!! Because it means you no longer have to care about "what" the compiler puts after a branch.

Code: [Select]
       /*
        * branch if the first register operand is less than the second
        */
       slt %t0, %t1, %t2 // %t0 = 1 if %t1 < %t2; otherwise %t0 = 0
       bne %t0, %0, L1   // branch to L1 if %t0 != 0
       nop               // on a pipelined implementation, this is always executed

       ...

L1:

The m88100 prototype has a similar option. I like the Motorola 88k, and I love that idea with MIPS5++!!! But the Atlas, as a platform, with that "disable cache and re-use the resource as ram", really looks close to what an ideal computer should look like!

Also, it makes writing an HL compiler easier  ;D
Title: Re: Code Optimization
Post by: tggzzz on February 01, 2023, 11:37:23 pm
Cache or no cache? Hmmm. I think it would depend on the objectives and constraints.

If mean performance is more important, then caches might help.

If predictable or repeatable performance is more important, then caches might hinder and fast RAM might be useful.

If the code and data cannot fit into fast RAM and cannot be partitioned between fast and slow RAM, then caches might help.

No surprises there, I think.

Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.
Title: Re: Code Optimization
Post by: Nominal Animal on February 02, 2023, 12:06:53 am
Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.
The other end of the spectrum is high-performance computing, where performance is money.

If we limit to embedded, then many appliances – routers, file servers and NAS boxes – benefit more from performance than predictability of every single operation; we do want some limits, but the average/typical/expected performance is what the humans care about.

This, too, is a good example of how you want the decision made by a human designer, instead of a compiler: the exact same thing, say an IP stack, may require predictable (adequate) performance in one use case, and variable but as high performance as possible in another.  It is not something you can leave for the tools to decide.
Title: Re: Code Optimization
Post by: tggzzz on February 02, 2023, 08:16:23 am
Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.
The other end of the spectrum is high-performance computing, where performance is money.

If we limit to embedded, then many appliances – routers, file servers and NAS boxes – benefit more from performance than predictability of every single operation; we do want some limits, but the average/typical/expected performance is what the humans care about.

This, too, is a good example of how you want the decision made by a human designer, instead of a compiler: the exact same thing, say an IP stack, may require predictable (adequate) performance in one use case, and variable but as high performance as possible in another.  It is not something you can leave for the tools to decide.

Agreed.

HPC is an interesting case where they have always pushed the boundaries of the possible (e.g. was MIPS, now MIPS/watt) and stressed processors, memory systems, and compilers to breaking points. It isn't uncommon that people on the sharp end know exactly where the skeletons are buried, even though other people deny there are even skeletons.

The HPC mob also knows what they don't need. One famous example from the Cray era is exact correct floating point arithmetic. They sensibly take the attitude that FP numbers can only be approximate, and that your algorithm plus input data define the output precision.

While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.
Title: Re: Code Optimization
Post by: Nominal Animal on February 02, 2023, 10:44:33 am
While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.
Exactly.  For compiler optimizations, that means the ability to enable/disable each as desired.

The traditional hierarchical control works pretty well: you have "optimization levels" or "types", each including a set of optimizations.

A particular example of this is how clang on ARMv7e-m is a bit too aggressive at unrolling loops at -O2, so one might wish to use -Os (optimize for size) instead, or -O2 -fno-unroll-loops.

The most important thing to users is that these are well documented (for clang, see Clang command line argument reference (https://clang.llvm.org/docs/ClangCommandLineReference.html)).  While numerically most use cases just use -O2, -Og, or -Os, there are those important cases where one wants a finer control.  The way clang allows one to enable/disable optimization for a particular function via #pragma in C and C++, is also quite useful and important, even though rarely needed.
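
For example, the per-function control looks like this in clang (a minimal sketch; GCC offers similar control via function attributes):

Code: [Select]
/* clang: compile just this function without optimization, e.g. to keep
   an easily debuggable or bit-exact reference version of a routine */
#pragma clang optimize off
int checksum_reference(const unsigned char *p, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
#pragma clang optimize on   /* restore the command-line optimization level */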

I personally also heavily use Compiler Explorer (https://godbolt.org/), which can compile a number of languages from Ada to Zip to many target architectures, showing the generated machine code.  It is how I made sure that this C function for Cortex-M7 (https://www.eevblog.com/forum/microcontrollers/this-code-is-so-broken-it-some-how-fixes-itself/msg4675780/#msg4675780) computes
$$S = \sum_{i=0}^{N-1} x_i C_i, \quad N \text{ even}$$
for 16-bit signed integer or fixed-point \$x_i\$ and \$C_i\$ at about four cycles per iteration, at full precision and without overflow as the accumulator \$S\$ is a 64-bit integer (using the SMLALD SIMD instruction available on Cortex-M7), when compiled with either GCC or Clang (and suitable optimization).  It is obviously useful for FIR and IIR filters, as well as FFT/DFT, when the data is 16-bit signed integer or fixed-point format.
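
The core of that function is essentially the following loop, here in simplified sketch form (not the exact code from the linked post):

Code: [Select]
#include <stddef.h>
#include <stdint.h>

/* S = sum of x[i]*c[i] over an even number n of 16-bit terms,
   accumulated in 64 bits so it cannot overflow */
int64_t dot16(const int16_t *x, const int16_t *c, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += (int32_t)x[i] * (int32_t)c[i];
    return s;
}

Whether the two 16×16 products per iteration actually get fused into an SMLALD depends on the compiler version and flags (say, -O2 -mcpu=cortex-m7), which is exactly why I verify the generated code on Compiler Explorer rather than assume it.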
Title: Re: Code Optimization
Post by: tggzzz on February 02, 2023, 02:31:21 pm
While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.
Exactly.  For compiler optimizations, that means the ability to enable/disable each as desired.

The traditional hierarchical control works pretty well: you have "optimization levels" or "types", each including a set of optimizations.

A particular example of this is how clang on ARMv7e-m is a bit too aggressive at unrolling loops at -O2, so one might wish to use -Os (optimize for size) instead, or -O2 -fno-unroll-loops.

The most important thing to users is that these are well documented (for clang, see Clang command line argument reference (https://clang.llvm.org/docs/ClangCommandLineReference.html)).  While numerically most use cases just use -O2, -Og, or -Os, there are those important cases where one wants a finer control.  The way clang allows one to enable/disable optimization for a particular function via #pragma in C and C++, is also quite useful and important, even though rarely needed.

I personally also heavily use Compiler Explorer (https://godbolt.org/), which can compile a number of languages from Ada to Zip to many target architectures, showing the generated machine code.  It is how I made sure that this C function for Cortex-M7 (https://www.eevblog.com/forum/microcontrollers/this-code-is-so-broken-it-some-how-fixes-itself/msg4675780/#msg4675780) computes
$$S = \sum_{i=0}^{N-1} x_i C_i, \quad N \text{ even}$$
for 16-bit signed integer or fixed-point \$x_i\$ and \$C_i\$ at about four cycles per iteration, at full precision and without overflow as the accumulator \$S\$ is a 64-bit integer (using the SMLALD SIMD instruction available on Cortex-M7), when compiled with either GCC or Clang (and suitable optimization).  It is obviously useful for FIR and IIR filters, as well as FFT/DFT, when the data is 16-bit signed integer or fixed-point format.


In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.

The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.
Title: Re: Code Optimization
Post by: Nominal Animal on February 02, 2023, 03:55:05 pm
In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.
True, and as I mentioned earlier, as an initial approximation for understanding the interactions and issues, subsystems and peripherals like DMA can be modeled as separate threads/cores running in parallel.

The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.
I do agree that twiddling with the source code to obtain the desired machine code for a specific processor and compiler combination is not something one does for general code; it's just that I do a lot of low-level computationally intensive stuff, where the performance is important.  The other option, for me, would be to write the actual machine code in extended assembly, letting the compiler choose the registers to be used.

The properties of C (and to a lesser extent, C++) as languages are such that it is very difficult, sometimes impossible, for the compiler to SIMD-vectorize the code, and to more generally convert data-sequential loops to data-parallel ones (for the backend optimizer to optimize for the target processor).  So, to make sure your most computationally intensive innermost loops are as efficient as possible, this Compiler Explorer investigation is unfortunately necessary.

Regarding my example in my previous message, it actually did not target any individual processor, but any Cortex-M4 (with FPv5) or -M7 microcontroller.  Consider it equally low-level function as say memcpy() is: intended to be used as the core operation when implementing higher-level functions like FIR and IIR filters.  (For DFT/FFT on 16-bit real data on power-of-two number of samples, one probably wants to use a slightly different core operation.)

In that particular case, the odd loop iteration form was necessary, because on the ARMv7e-m target architectures (Thumb/Thumb2 instruction set), neither GCC nor Clang could combine the loop iterator and data pointers when the loop is written using a separate iterator variable.  This is actually a rather common issue with C compilers – and a good example of how both GCC and Clang suffer from the same, even though having completely different backends and intermediate representation for the code! –, and I've used the same pattern even on x86-64 with SSE/AVX SIMD vectorization.
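
Schematically, the loop form that worked is the following (a sketch, not the original function): iterate on the data pointers themselves, so the end pointer is the only loop-control state and no separate index register is needed.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

int64_t dot16_ptr(const int16_t *x, const int16_t *c, size_t n)
{
    int64_t s = 0;
    /* the pointers are the iterators: nothing else needs updating per loop */
    for (const int16_t *end = x + n; x < end; x++, c++)
        s += (int32_t)*x * (int32_t)*c;
    return s;
}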

Which, funnily enough, leads back to the topic at hand: optimization.  When one uses a tool like Compiler Explorer to compare such tight loops, including which languages and which compilers can SIMD-vectorize such loops effectively to the same architecture (say, x86-64), it well illustrates how the language model itself limits what kind of optimizations can be done.  The data-sequential to data-parallel loop conversion is the main one I'm aware (simply because it affects HPC so often), but I'm sure there are others.

(When we include distributed computing (non-uniform memory architectures on larger machines, and distributed computing on multiple machines), we get to one of my own pet peeves: the difficulties related to ensuring communication and computation can be done at the same time.  Technically, the solution is asynchronous message passing, and is well known, but for whatever reason, it seems to be quite difficult for some people to truly grok; it is mostly a human problem.  Similar, but lesser issues affect parallel shared memory accesses, mostly related to atomicity and caching and interactions between multiple separate locking primitives.  We seem to generally just not have the necessary mental machinery to manage the complexity and rules, I guess.)
Title: Re: Code Optimization
Post by: SiliconWizard on February 03, 2023, 02:59:41 am
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speed :'(

Caches are useful every time you have storage devices with different access latencies and/or speeds in a single system. This goes beyond just RAM.
No doubt some people are dreaming of a unified space with everything accessed at top speeds and in a single address space, but this is not realistic and would come with other issues as well.
Until then, caches and translation units serve us well.

One useful approach is to be able to "lock" some fast-memory area (could be cache or something else) for blocks of code or data that you don't want to ever be swapped out of cache / go back and forth.
Of course, while this works well in tightly constrained systems, it's hardly applicable for a general-purpose system.
Title: Re: Code Optimization
Post by: tggzzz on February 03, 2023, 10:54:38 am
In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.
True, and as I mentioned earlier, as an initial approximation for understanding the interactions and issues, subsystems and peripherals like DMA can be modeled as separate threads/cores running in parallel.

Agreed.

I'll add unifying hardware interrupts and software interrupts such as exception traps and operating system context switches.

Quote
The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.
I do agree that twiddling with the source code to obtain the desired machine code for a specific processor and compiler combination is not something one does for general code; it's just that I do a lot of low-level computationally intensive stuff, where the performance is important.  The other option, for me, would be to write the actual machine code in extended assembly, letting the compiler choose the registers to be used.

I frequently look at the generated code as a sanity check. It is sad that many people don't grok how HLL code maps to machine code. Too many can't even outline how a function call with arguments maps to pushing arguments on the stack (with registers as an optimisation).

I still remember the light bulb illuminating when, in 1972, I realised what Tony Hoare's famous Algol-60 compiler was doing :)

Quote
The properties of C (and to a lesser extent, C++) as languages are such that it is very difficult, sometimes impossible, for the compiler to SIMD-vectorize the code, and to more generally convert data-sequential loops to data-parallel ones (for the backend optimizer to optimize for the target processor).  So, to make sure your most computationally intensive innermost loops are as efficient as possible, this Compiler Explorer investigation is unfortunately necessary.

C the language forces pessimism, e.g. w.r.t. optimisation in the face of pointer aliasing. To get around that, C tools have added workarounds along the lines of "I promise you can do this, and if I'm lying then you can legitimately empty my bank account" options. Too many people incorrectly think they understand all the pitfalls. I know I don't.
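
The best known of those "I promise" options is C99's restrict qualifier: the programmer asserts that the pointers don't alias, the compiler optimises on that assumption, and if the promise is false the behaviour is undefined. A minimal sketch:

Code: [Select]
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap and
   keep the loads and stores in order.  With restrict, the programmer
   promises no aliasing, which permits reordering and vectorization. */
void scale_add(float *restrict dst, const float *restrict src,
               float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += a * src[i];
}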

Quote
Regarding my example in my previous message, it actually did not target any individual processor, but any Cortex-M4 (with FPv5) or -M7 microcontroller.  Consider it equally low-level function as say memcpy() is: intended to be used as the core operation when implementing higher-level functions like FIR and IIR filters.  (For DFT/FFT on 16-bit real data on power-of-two number of samples, one probably wants to use a slightly different core operation.)

In that particular case, the odd loop iteration form was necessary, because on the ARMv7e-m target architectures (Thumb/Thumb2 instruction set), neither GCC nor Clang could combine the loop iterator and data pointers when the loop is written using a separate iterator variable.  This is actually a rather common issue with C compilers – and a good example of how both GCC and Clang suffer from the same, even though having completely different backends and intermediate representation for the code! –, and I've used the same pattern even on x86-64 with SSE/AVX SIMD vectorization.

Which, funnily enough, leads back to the topic at hand: optimization.  When one uses a tool like Compiler Explorer to compare such tight loops, including which languages and which compilers can SIMD-vectorize such loops effectively to the same architecture (say, x86-64), it well illustrates how the language model itself limits what kind of optimizations can be done.  The data-sequential to data-parallel loop conversion is the main one I'm aware (simply because it affects HPC so often), but I'm sure there are others.

I would want to check that such "microbenchmark optimisations" would survive in a larger program. C the language ain't good at that.

Quote
(When we include distributed computing (non-uniform memory architectures on larger machines, and distributed computing on multiple machines), we get to one of my own pet peeves: the difficulties related to ensuring communication and computation can be done at the same time.  Technically, the solution is asynchronous message passing, and is well known, but for whatever reason, it seems to be quite difficult for some people to truly grok; it is mostly a human problem.  Similar, but lesser issues affect parallel shared memory accesses, mostly related to atomicity and caching and interactions between multiple separate locking primitives.  We seem to generally just not have the necessary mental machinery to manage the complexity and rules, I guess.)

Yes indeed.

My hypothesis is that it is more natural for hardware engineers. Evidence: traditional softies can't get their head around the semantics of assignment in Verilog/VHDL.

The standard problem with message passing is choosing the granularity. Too fine and comms latency dominates computation latency. Too coarse and the size of the context in the messages becomes a problem. It doesn't take much understanding to realise that the choice depends critically on the underlying memory and comms channels.
Title: Re: Code Optimization
Post by: tggzzz on February 03, 2023, 11:02:04 am
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speed :'(

Caches are useful every time you have storage devices with different access latencies and/or speeds in a single system. This goes beyond just RAM.

DRAM is the new disc, and disc is the new mag tape. RPC is the new station wagon full of tapes :)

Quote
No doubt some people are dreaming of a unified space with everything accessed at top speeds and in a single address space, but this is not realistic and would come with other issues as well.

That is cretinous for several reasons.

At a small scale, NUMA highlights the DRAM/disc analogy, and screws up when memories have to be kept coherent.

At a large scale, hiding parallel computation will fail when one of the nodes or comms channels fails. Detecting and recovering cannot be general purpose; it has to be part of the application. If it is hidden, then the application cannot deal with it appropriately.

In general, partial failure is something that is usually ignored or swept under the carpet.

Quote
Until then, caches and translation units serve us well.

One useful approach is to be able to "lock" some fast-memory area (could be cache or something else) for blocks of code or data that you don't want to ever be swapped out of cache / go back and forth.
Of course, while this works well in tightly constrained systems, it's hardly applicable for a general-purpose system.

Just so.

IIRC the Intel 960 was the first MCU that could lock things in fast memory. I believe there have been others since, but I haven't used them.
Title: Re: Code Optimization
Post by: Nominal Animal on February 03, 2023, 12:00:02 pm
Daydreaming:

Just think how nice it would be if you had a lots-of-cores processor with just local "cache" RAM, and a central memory arbitrator and scheduler, so that you could request sets of cachelines or pages to be mapped to the local "cache" at a desired location, work on it, and then release or return any changes, with an interrupt generated by the memory arbitrator if there is contention on that same cacheline/page.  The basic interconnect would include the cache bursts, as well as cacheline-sized or larger messages.

Cacheline ping-pong would have annoying latencies, sure, because there would be first the interrupt and its response, then the decision from the arbitrator, and only then the memory transferred.  It would be much more distributed processing than symmetric parallel processing.

Basically, your central memory wouldn't even need virtual memory mapping, but could be managed by the arbitrator as an allocator, and referred to using handles that do not need to map actual memory addresses at all.  On the writeback, it could relocate the cacheline(s)/page(s) to minimise fragmentation, too.

For an architecture where all the cores are the same kind, that doesn't sound too interesting, but consider the case where you have some DSP cores that are optimized to do just very fast arithmetic on the data, some cores that are better suited for general-purpose programming, heck, even cores that run bytecode.
Make the interconnects simple but fast enough (PCIe lanes?), optimised for messages (full cachelines or larger units, burst RAM transfers), and you'd have a truly modular architecture.  Need more computational power?  Add new cores.

Tile and xCore do already exist, but they tend to have identical cores in regular arrays, and I do not believe their memory access scheme allows full software control as described above.

Then again, I'm no hardware designer.  I'd just love to program on such an architecture, and would looove to design proper security schemes from the hardware up.  The ones we have are so ... flimsy and ad-hoc and bolted-on, they give me the willies.
Title: Re: Code Optimization
Post by: tggzzz on February 03, 2023, 12:18:04 pm
Interesting dreams :)

xCORE has the advantage that it doesn't need such "fancy" cache schemes. "Not needing" is always better than complex satisfaction of needs :)

I, and computer science, understand and deal with only three numbers: zero, one, and many. Anything else is a hack which is difficult to get right and is therefore fragile. I apply that to the concept of different types of cores.

Proper security schemes? See CHERI and The Mill.

The former is "minor" alterations to existing stuff.

The latter is a complete rethink of existing stuff, based on what has worked well in the past. Uniquely, it comes with the necessary constraint that all existing code must run better than it does now.
Title: Re: Code Optimization
Post by: Nominal Animal on February 03, 2023, 12:43:55 pm
Proper security schemes? See CHERI and The Mill.
Most recently, in the highest level of userspace – browsers and such – we (the part of the software development world that actually cares about this stuff) are right now discovering that what we actually need is a hierarchy of access controls under each human user, as well as the hierarchy of access controls at the human user level.  Things like "human (H) needs to access processes (A) and (B), but we do not want (A) and (B) to know anything about each other, and definitely not to exfiltrate data".

That sort of stuff is not something you bolt on top of something; it has to be supported at the core kernel level, and to do that, it needs hardware support (proper MMU in particular).

Over a decade ago, when I was doing research and HPC admin stuff, I constructed a similar hierarchical model for maintaining projects' web pages on the same server, using dedicated local group assignments for access controls, because there were a lot of "student admins", and maintainers changed quite often.
The same scheme could be used to secure web sites like forums and WordPress-type stuff, by ensuring that code running on the server can neither create new executables or interpretable binaries nor rewrite itself, and that pages only have the minimum set of privileges they need to do their thing.
Why aren't those used, and why don't WP et al. support such access partitioning schemes at all?  Because management software like Plesk, cPanel, and Webmin does not support that; they only support one user account per site.  Besides, WP and SMF and other such software are designed to be able to upgrade themselves, which requires them to be able to rewrite their own code.  The entire ecosystem is designed to be insecure!

There is a reason BSD created Jails so early, and why Linux has seccomp and cgroups.
I personally would prefer a sub-hierarchy under user accounts (a third ID type, in addition to UID and GID, that defaults to 0, and has a similar sub-hierarchy as UIDs have) instead of cgroups, but hey, we work with what we have.
_ _ _

Circling back to code optimization: we only need to consider the relatively recent discovery of the security issues in speculative execution, and other side-channel attacks (like exfiltrating information based on how much time a known computational operation takes), to see that, depending on the developer intent, the same source code should compile to either fixed-resource machine code (to stop the timing attacks) or to maximum-performance code (because all processes having access to the timing information would already have access to the plaintext data).
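
A standard example of compiling "to stop the timing attacks" is comparing secrets in constant time, so the run time does not leak how many leading bytes matched (a generic sketch, not from any particular library):

Code: [Select]
#include <stddef.h>

/* returns 1 if a and b are equal, 0 otherwise, always touching all n
   bytes: unlike memcmp(), it does not exit early at the first mismatch */
int ct_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];
    return diff == 0;
}

Note the irony: once diff is nonzero the final result is already decided, so a sufficiently clever optimizer could legally reintroduce the early exit – which is exactly why the purpose of the optimization has to be specified.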

So, when we talk about "optimization", we also need to specify the purpose.
Title: Re: Code Optimization
Post by: westfw on February 04, 2023, 03:28:53 am
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(
Title: Re: Code Optimization
Post by: DiTBho on February 04, 2023, 11:24:35 am
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(

For MIPS, it "depends".
It's rather chip-specific.

edit:
the same applies to embedded PowerPC.
Title: Re: Code Optimization
Post by: tggzzz on February 04, 2023, 11:32:01 am
Proper security schemes? See CHERI and The Mill.
Most recently, in the highest level of userspace – browsers and such – we (the part of the software development world that actually cares about this stuff) are right now discovering that what we actually need is a hierarchy of access controls under each human user, in addition to the hierarchy of access controls at the human-user level.  Things like "human (H) needs to access processes (A) and (B), but we do not want (A) and (B) to know anything about each other, and definitely not to exfiltrate data".

As you note below, side channels will remain a pig w.r.t. exfiltrating data: very simple and general-purpose mechanisms for them exist.

I remember a major company demonstrating their secure unix system, claiming (in effect) that each terminal/process was completely isolated from the others. Overnight a co-worker covered himself in glory by writing a short program that copied anything typed in one terminal to another. The trick: the source terminal created/deleted a large file, and the receiving terminal repeatedly checked the remaining disk space; the characters were (slowly!) exfiltrated one bit at a time as ASCII codes.
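For flavour, here is roughly what such a channel looks like in code. This is a made-up reconstruction, not the original program; the file name, chunk size, and timing are arbitrary:

Code: [Select]
/* Hypothetical sketch of a disk-space covert channel.  The sender signals
   a 1 by creating a large file and a 0 by deleting it; the receiver polls
   the free space on the shared filesystem.  Slow, but needs no shared
   memory, pipes, sockets, or files readable by the receiver. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/statvfs.h>

#define CHANNEL_FILE "/tmp/covert.bin"   /* any shared filesystem works */
#define CHUNK        (64L * 1024 * 1024) /* 64 MiB: easy to see in df */
#define BIT_PERIOD   2                   /* seconds per bit: slow! */

/* Sender side: signal one bit by allocating or freeing CHUNK bytes. */
static void send_bit(int bit)
{
    if (bit) {
        FILE *f = fopen(CHANNEL_FILE, "wb");
        char *buf = calloc(1, CHUNK);
        if (f && buf)
            fwrite(buf, 1, CHUNK, f);
        if (f)
            fclose(f);
        free(buf);
    } else {
        remove(CHANNEL_FILE);
    }
    sleep(BIT_PERIOD);
}

/* Receiver side: a 1 shows up as a drop in free space vs. the baseline. */
static int recv_bit(unsigned long long baseline_free_bytes)
{
    struct statvfs sv;
    if (statvfs("/tmp", &sv) != 0)
        return -1;
    unsigned long long free_bytes =
        (unsigned long long)sv.f_bavail * sv.f_frsize;
    return free_bytes + CHUNK / 2 < baseline_free_bytes;
}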

Quote
That sort of stuff is not something you can bolt on afterwards; it has to be supported at the core kernel level, and to do that, it needs hardware support (a proper MMU in particular).

Unix and Windows both have ACLs, and that should be 90% sufficient I believe.

Setting them up correctly is an expert task, and possibly not something a simple tool could achieve.

Quote
Over a decade ago, when I was doing research and HPC admin stuff, I constructed a similar hierarchical model for maintaining projects' web pages on the same server, using dedicated local group assignments for access controls, because there were a lot of "student admins", and maintainers changed quite often.
The same scheme could be used to secure web sites like forums and WordPress-type stuff, by ensuring that code running on the server can neither create new executables or interpretable binaries nor rewrite itself, and that pages have only the minimum set of privileges they need to do their thing.
Why aren't those used, and why don't WP et al. support such access partitioning schemes at all?  Because management software like Plesk, cPanel, and Webmin does not support that; they support only one user account per site.  Besides, WP and SMF and other such software are designed to be able to upgrade themselves, which requires them to be able to rewrite their own code.  The entire ecosystem is designed to be insecure!

There is a reason BSD created Jails so early, and why Linux has seccomp and cgroups.
I personally would prefer a sub-hierarchy under user accounts (a third ID type, in addition to UID and GID, that defaults to 0, and has a similar sub-hierarchy as UIDs have) instead of cgroups, but hey, we work with what we have.

Are you aware of Joanna Rutkowska's "Qubes OS: A reasonably secure operating system" https://www.qubes-os.org (https://www.qubes-os.org) ?

I know someone who uses it successfully. I'd try it, but it won't (for good reasons) support the drivers for my Nvidia graphics card.

Quote
Circling back to code optimization, we only need to consider the relatively recent discovery of the security issues in speculative execution, and other side-channel attacks (like exfiltrating information based on how much time a known computational operation on it takes), to see that, depending on developer intent, the same source code should compile to either fixed-resource machine code (to stop the timing attacks) or to maximum-performance code (because any process having access to the timing information would already have access to the plaintext data).

So, when we talk about "optimization", we also need to specify the purpose.

Yup :)
Title: Re: Code Optimization
Post by: tggzzz on February 04, 2023, 11:35:28 am
I've found the documentation of microcontroller caches to be relatively opaque :-(

Yup, and very processor specific. Plus, in the case of Intel and AMD x86 processors, the behaviour can change (via microcode updates) after the processor has been installed at a customer's site.

(And still people claim there are simple tests to define whether something is hardware or software :) )
Title: Re: Code Optimization
Post by: Nominal Animal on February 04, 2023, 11:53:31 am
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(
For ARM Cortex-M7, it is part of the ARM core, so it's documented in the Armv7-M Architecture Reference Manual (https://developer.arm.com/documentation/ddi0403/ee/), separately from e.g. the ST and NXP manuals for their Cortex-M7 MCUs, even though ARM explicitly documents some of these operations as implementation defined!  Relatively opaque? Byzantine, I'd say.  Anyway, section B2.2.7, and especially Table B2-1, describes the possible operations.

Essentially, on ARM Cortex-M7, there are 32-bit, write-only, memory-mapped ARM core registers at 0xE000EF50..0xE000EF78.  Data cache invalidate by address (DCIMVAC) is at 0xE000EF5C: you write any address within the target cache line to invalidate that line.  To invalidate a region, you do that in a loop, incrementing the address by CTR.DMINLINE each iteration.  The cache line size, CTR.DMINLINE, is four times two to the power of bits 16..19 of the Cache Type Register, CTR (0xE000ED7C), i.e.

Code: [Select]
DMINLINE = 4 << (((*(volatile uint32_t *)0xE000ED7C) >> 16) & 15);
I believe; although I think @ataradov knows more about these details than I do.  (The Set/Way approach, where you specify the cache level, the cache line set, and the way number within the set, is downright arcane, and requires quite detailed knowledge of the processor cache architecture that I haven't seen documented anywhere; I suppose it would allow complete software control of the data cache.)
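Putting that together, a minimal, untested sketch of invalidating an address range (register addresses as above; the macro and function names are mine):

Code: [Select]
/* Minimal, untested sketch: invalidate every data-cache line that
   overlaps [addr, addr+len) on a Cortex-M7, using DCIMVAC and CTR as
   described above. */
#include <stddef.h>
#include <stdint.h>

#define SCB_DCIMVAC (*(volatile uint32_t *)0xE000EF5CUL) /* invalidate by address */
#define SCB_CTR     (*(volatile uint32_t *)0xE000ED7CUL) /* Cache Type Register */

static void dcache_invalidate_region(uintptr_t addr, size_t len)
{
    const uint32_t line = 4u << ((SCB_CTR >> 16) & 15u); /* DMINLINE, in bytes */
    const uintptr_t end = addr + len;

    addr &= ~(uintptr_t)(line - 1);        /* round down to a line boundary */
    __asm__ volatile ("dsb" ::: "memory"); /* complete prior memory accesses */
    for (; addr < end; addr += line)
        SCB_DCIMVAC = (uint32_t)addr;      /* invalidate one cache line */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("isb" ::: "memory"); /* flush the pipeline */
}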

NXP and ST Cortex-M7 parts also have an MPU that supports a number (typically eight) of memory regions.  Each memory region is a power-of-two size in bytes (32 bytes minimum) and has its own attributes, including access protection and cache policy.  You can even disable caching completely.
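For example, here is a minimal, untested sketch of marking a 32 KiB DMA buffer region as normal, non-cacheable memory via the Armv7-M MPU (register addresses from the Armv7-M ARM; the region number, size, and names are made-up examples):

Code: [Select]
/* Minimal, untested sketch: configure one Armv7-M MPU region as normal,
   shareable, non-cacheable memory, e.g. for a DMA buffer.  The base
   address must be aligned to the region size (32 KiB here). */
#include <stdint.h>

#define MPU_CTRL (*(volatile uint32_t *)0xE000ED94UL)
#define MPU_RNR  (*(volatile uint32_t *)0xE000ED98UL)
#define MPU_RBAR (*(volatile uint32_t *)0xE000ED9CUL)
#define MPU_RASR (*(volatile uint32_t *)0xE000EDA0UL)

static void mpu_uncached_32k(uint32_t base, uint32_t region)
{
    MPU_RNR  = region;          /* select the region to configure */
    MPU_RBAR = base & ~0x1Fu;   /* region base address */
    MPU_RASR = (3u  << 24)      /* AP = 011: full access */
             | (1u  << 19)      /* TEX = 001: normal, non-cacheable */
             | (1u  << 18)      /* S = 1: shareable */
             | (14u << 1)       /* SIZE = 14: 2^(14+1) = 32 KiB */
             | 1u;              /* enable this region */
    MPU_CTRL |= (1u << 2) | 1u; /* PRIVDEFENA + enable the MPU */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("isb" ::: "memory");
}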
Title: Re: Code Optimization
Post by: Nominal Animal on February 04, 2023, 12:08:49 pm
Setting [proper security scheme] up correctly is an expert task, and possibly not something a simple tool could achieve.
Definitely; and it is not something one should let random developers at, as there is lots of experience in security schemes already available, and most developers just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.

Just as the intermediate representation can subtly limit the code optimization options and approaches possible, our web services are not insecure because the operating systems do not support better, more secure approaches; they are insecure because of arbitrary (easy but bad) choices made in the design and implementation of the services themselves, and of the software managing those services.  The inertia of "this is how everybody else does it" is enormous, and very difficult to change, even though it constantly leads to user information being leaked and exploited illegally.

Title: Re: Code Optimization
Post by: tggzzz on February 04, 2023, 12:20:33 pm
Setting [proper security scheme] up correctly is an expert task, and possibly not something a simple tool could achieve.
Definitely; and it is not something one should let random developers at, as there is lots of experience in security schemes already available, and most developers just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.

My version of that would be....
Definitely; and it is not something one should let random [developers] people at, as there is lots of experience in security schemes already available, and most [developers] people just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.

Not helped by job agencies and HR writing PSBD on people's CVs. An excellent techie we employed had previously seen that on his CV; after he asked several times what it meant, they revealed it stood for "past sell-by date".

Quote
Just as the intermediate representation can subtly limit the code optimization options and approaches possible, our web services are not insecure because the operating systems do not support better, more secure approaches; they are insecure because of arbitrary (easy but bad) choices made in the design and implementation of the services themselves, and of the software managing those services.  The inertia of "this is how everybody else does it" is enormous, and very difficult to change, even though it constantly leads to user information being leaked and exploited illegally.

It is much cheaper and faster, and equally effective, simply to say "the security and privacy of our customers' information is our (highest) priority".