Author Topic: Code Optimization (Read 4106 times)

Sherlock Holmes · « **on:** January 31, 2023, 03:44:50 pm »

This is an interesting paper, given that many of the compilers we might use today leverage LLVM as their back end, this is useful to get some insight into as it affects much of what we do, so ask is my code correct...

Provably Correct Peephole Optimizations with Alive

Quote

Quote

We have translated more than 300 LLVM optimizations into Alive and, in the process, found that eight of them were wrong.

Interesting to read that this sequence of LLVM IR:

Code: [Select]

%1 = xor i32 %x, -1
%2 = add i32 %1, 3333

when optimized to:

Code: [Select]

%2 = sub i32 3332, %x

Itself relies on some 160 bytes of C++ analysis and optimization code!

cfbsoftware · « **Reply #1 on:** January 31, 2023, 10:38:43 pm »

Some words of wisdom regarding compiler code optimisation can be found in Hanspeter Mössenböck's article Compiler Construction -
The Art of Niklaus Wirth.

For example, these are some of the headings:

Quote

Generate good code for the frequent case; don't optimize the exceptional case

Quote

Generate good code from the beginning; don't generate bad code and optimize later

Quote

Generate predictable code

The complete article can be download from:

http://pascal.hansotten.com/uploads/books/art3.pdf

Hanspeter Mössenböck designed the Object Oriented extensions to the language Oberon, to become Oberon-2, and also created Coco/R - A Generator for Fast Compiler Front-Ends.

C#, Java, C++, F#, VB.Net, Delphi, Swift, TypeScript and Oberon versions of the Coco/R tool are available:

https://ssw.jku.at/Research/Projects/Coco/

eutectique · « **Reply #2 on:** January 31, 2023, 11:56:59 pm »

Have you heard of superoptimization? You can find An Interesting Example in this work: https://web.stanford.edu/class/cs343/resources/superoptimizer.pdf

thm_w · « **Reply #3 on:** February 01, 2023, 12:34:30 am »

Quote from: cfbsoftware on January 31, 2023, 10:38:43 pm

For example, these are some of the headings:
Quote
Generate good code from the beginning; don't generate bad code and optimize later

I'm not really buying the argument of this one, others make sense though.

Quote

Brandis and others [3] report that non-optimizing Oberon compilers for the Motorola 68020, for the Sun SPARC and for the DEC MIPS architectures generated code that was as
efficient as the code generated by the standard C compiler with the optimization option switched on (cc -O). Only for the IBM RS/6000 the optimizing C compiler was able to generate more efficient code than the Oberon compiler.

How is that any different than just enabling optimization by default?
Isn't the value of having a flag that you can compile un-optimized, to attempt to compile the code "as written", which we know might not be optimal.

cfbsoftware · « **Reply #4 on:** February 01, 2023, 04:07:46 am »

Quote from: thm_w on February 01, 2023, 12:34:30 am

Quote from: cfbsoftware on January 31, 2023, 10:38:43 pm
For example, these are some of the headings:
Quote
Generate good code from the beginning; don't generate bad code and optimize later

I'm not really buying the argument of this one, others make sense though.

Yeah - this probably falls into the category of "stating the bleeding obvious". I suspect it is based from experience of working with substandard code generators.

Quote

How is that any different than just enabling optimization by default?
Isn't the value of having a flag that you can compile un-optimized, to attempt to compile the code "as written", which we know might not be optimal.

Not all un-optimised code is equal. There can be many different ways of compiling code "as written". Some can be better than others due to the author's better knowledge of the instruction set; more careful design and implementation.

SiliconWizard · « **Reply #5 on:** February 01, 2023, 05:46:53 am »

Great paper. Sure some points are a bit outdated, when related to limitations of computers from the 80's and 90's.

The discussion about definition modules is interesting.

As to the "quality" of the IR impacting the output of optimizers, this can be experimented relatively easily. These days with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.

For someone implementing a compiler, I would recommend defining its own IR, fully adapted to the language itself, and then implementing various generators from this IR - can be C code, LLVM IR, or assembly if you feel inclined to that.

While LLVM IR looks decently general, I would recommend not to base your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.

tggzzz · « **Reply #6 on:** February 01, 2023, 09:38:53 am »

Quote from: cfbsoftware on January 31, 2023, 10:38:43 pm

Some words of wisdom regarding compiler code optimisation can be found in Hanspeter Mössenböck's article Compiler Construction -
The Art of Niklaus Wirth.

For example, these are some of the headings:

Quote
Generate good code for the frequent case; don't optimize the exceptional case

Determining the "frequent case" is, um, challenging in modern programs.

One of the more interesting methods of optimisation is given in a series of papers where a research compiler accidentally gave production quality results.
https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html

Quote

Quote
Generate predictable code

Good luck with that; the interactions between data access patterns and caches is significant and unknowable at compile time.

DiTBho · « **Reply #7 on:** February 01, 2023, 10:36:51 am »

Quote from: tggzzz on February 01, 2023, 09:38:53 am

Good luck with that; the interactions between data access patterns and caches is significant and unknowable at compile time.

if you want to laugh, in my-c i have to force a cache flush on every read/write attempt to the shared ram, because some hw-designer idiot put the shared ram in the same area addressed by the cache controller (which can't be disabled, because ... nobody knows), so when cpu0.board0 needs to talk to cpu1.board1 (on the other mezzanine) ... well, there is cache in the middle, without a bloody cache coherency protocol, so not only timings are not deterministic, but you risk a cache miss machine exception (which would be problematic if it happens under interrupt -> ISR exception inside a kernel exception) and - worse still - without cache miss exceptions - the shared cell content is not guaranteed to be correct, and you are not sure it's has been updated, so you could read old information.

What's that madness crazy hardware?!? It's pure garbage, but it happenes even with big companies ... an examples is Indigo2/Impact by SGI ... its MIPS4 R10000 has similar cache problems.

All of that is BAD hardware design, and you cannot fix, the best you can do is ... trying to find a workaround.

So, for my second MIPS5++ prototype board, I created a special memory model in my-c to handle that madness stuff. Simple trick: the machine layer always puts a "cache flush" before every IO on the shared ram.

Job done!, just it makes talking about "optimization" really funny since performance literally sucks, but at least it works correctly.

(I wish I could unsolder the cache, anyway)

tggzzz · « **Reply #8 on:** February 01, 2023, 11:27:54 am »

Quote from: DiTBho on February 01, 2023, 10:36:51 am

Quote from: tggzzz on February 01, 2023, 09:38:53 am
Good luck with that; the interactions between data access patterns and caches is significant and unknowable at compile time.

if you want to laugh, in my-c i have to force a cache flush on every read/write attempt to the shared ram, because some hw-designer idiot put the shared ram in the same area addressed by the cache controller (which can't be disabled, because ... nobody knows), so when cpu0.board0 needs to talk to cpu1.board1 (on the other mezzanine) ... well, there is cache in the middle, without a bloody cache coherency protocol, so not only timings are not deterministic, but you risk a cache miss machine exception (which would be problematic if it happens under interrupt -> ISR exception inside a kernel exception) and - worse still - without cache miss exceptions - the shared cell content is not guaranteed to be correct, and you are not sure it's has been updated, so you could read old information.

I wonder if the OP will realise that's why a memory model is necessary. He had difficulty with that in his other thread on ~~A New Hardware Oriented Programming Language~~ "The Imperium programming language - IPL" https://www.eevblog.com/forum/programming/a-new-hardware-oriented-programming-language/msg4661023/#msg4661023

I don't know why that thread was locked, but I imagine the OP requested it.

Quote

What's that madness crazy hardware?!? It's pure garbage, but it happenes even with big companies ... an examples is Indigo2/Impact by SGI ... its MIPS4 R10000 has similar cache problems.

All of that is BAD hardware design, and you cannot fix, the best you can do is ... trying to find a workaround.

So, for my second MIPS5++ prototype board, I created a special memory model in my-c to handle that madness stuff. Simple trick: the machine layer always put a "cache flush" before every IO on the shared ram.

Nah, it is good hardware design because it pushes the hard rare issues to be solved by software.

HP/Intel had the same approach with ~~Itanium~~ Itanic.

Oh.

Quote

Job done!, just it makes talking about "optimization" really funny since performance literally sucks, but at least it works correctly.

(I wish I could unsolder the cache, anyway)

If you are allowed to get incorrect answers, you can employ much more aggressive optimisation.

Nominal Animal · « **Reply #9 on:** February 01, 2023, 12:15:10 pm »

Even on NXP i.MX RT1062 (as used on Teensy 4.x) one has to do a manual cache flush between DMA and non-DMA accesses to the same memory. It's not too expensive, but it is basically impossible to optimize at the machine code generation level. (Teensyduino Core for these does such flushes around DMA operations, which isn't optimal, but not too bad either.)

In other words, some features cannot be automated, as they depends on the programmer intent, and is not really decidable by the compiler. (A single-threaded interpreter could track it and only do a cache flush when needed, but it turns out that on these, the cache flush is much cheaper than tracking all accesses.)
_ _ _

There is one very useful intermediate representation detail that nicely illustrates the subtle constraints (as mentioned by SiliconWizard) the IR may put on language implementations, somewhat related to the above: data-parallel loop optimizations.

Backends that support Fortran typically support at least two different types of loops with completely different semantics: FOR loops, with similar semantics and requirements as C/C++/etc. loops normally have, essentially sequential data loops; and FORALL for parallel data loops with associated limitations and semantics. Essentially, FORALL loops have no predetermined order of iterations, but occur in parallel, with an arbitrary or undefined order of side effects (like memory access pattern). In particular, FORALL loop bodies cannot refer to results calculated by other iterations in the same loop.

If your backend IR supports both data-sequential "FOR" loops and data-parallel "FORALL" loops, even if your language does not have a similar concept, it is often possible to convert data-sequential loops into data-parallel loops as an algorithmic optimization, when the language constructs used within the loop body permit such conversion. In practice, the handling of the iterator variable and the iteration direction for "FORALL" loops varies depending on the target architecture, and is implemented using a machine code pattern best suited for that type of instruction set architecture.

If we take a wider view, we can see that this difference between FOR and FORALL loops is essentially algorithmic –– data-sequential versus data-parallel loops ––, and if the IR does not support data-parallel loops, to achieve similar benefits, one would need to convert the loop representation to one that best suits each target architecture, requiring quite a lot of effort.

To circle back to the things like cache flush patterns that compilers may not in all cases be able to deduce for themselves, the FOR/FORALL difference is opposite: it is not too difficult for the compiler to determine if the data references in the loop body require sequential data access, or whether the data accesses would still be valid if they were done all in parallel, and thus determine when FOR loops can be turned into FORALL loops.

Thus, it is not necessary for the programming language itself to differentiate between FOR and FORALL loops –– even I can argue both for and against! ––, but it is necessary for the compiler to understand the difference between data-sequential and data-parallel loops, and the IR/backend to support both, to truly leverage this.
_ _ _

In high-performace computing Fortran is still used, because it is easier to write efficient computation in Fortran than in C or C++. Comparing the microbenchmarks exhibiting the differences, and the machine code generated, the main reasons are surprisingly few, and all involve arrays: efficient array slicing, data-parallel loops, and filling an array with a constant through duplication of the initial member without interpreting the value of the initial member.

On x86-64, data-parallel loops involving floating-point data are particularly easy to vectorize, leading to more efficient machine code. To achieve the same in C or C++, one needs to write the loop in a very specific form (specific to the compiler version), or use vector intrinsics provided by the compiler (<immintrin.h> for x86 and x86-64).

It is exactly for this kind of thing that I insist one must look at the machine code generated, and at wildly different programming languages, when designing a new language. True efficiency requires understanding and the application of many such concepts and details, and I personally have not yet seen any books or articles that describes many of them – in particular, because single-instruction multiple-data vectorization (SIMD) is quite "new" in computer science terms, and a lot of the findings are still in peer-reviewed articles and not yet collected and their meaning distilled into books that describe the best current understanding and approaches.

Concepts like the difference between FOR and FORALL loops (or rather, data-sequential and data-parallel loops) definitely look insignificant on the surface, but when you look into how they actually affect the code generated for different hardware architectures, and how much it helps with SIMD vectorization (so much so that I personally don't think SIMD vectorization outside FORALL/data-parallel loops makes much sense!), you realize they can make a very significant difference on certain architectures –– we're talking 2× to 8× computational performance in some cases.

And while FOR/FORALL is something the compiler can handle by itself without the language requiring explicit constructs, some other constructs like cache flushes, even customized memory prefetch patterns, may have to be exposed to the human developer. This means that a "single universal language fit for all purposes" is not only unrealistic, it would have very un-optimal performance. There are not that many situations where one is willing to give up say nine tenths of the performance just to achieve exactly predictable behaviour; personally, I care about the limits of the behaviour, and need those to be well defined, but do not care at all about the distribution of the behaviour as long as it is within those known limits.

Sherlock Holmes · « **Reply #10 on:** February 01, 2023, 02:59:35 pm »

Quote from: SiliconWizard on February 01, 2023, 05:46:53 am

Great paper. Sure some points are a bit outdated, when related to limitations of computers from the 80's and 90's.

The discussion about definition modules is interesting.

As to the "quality" of the IR impacting the output of optimizers, this can be experimented relatively easily. These days with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.

For someone implementing a compiler, I would recommend defining its own IR, fully adapted to the language itself, and then implementing various generators from this IR - can be C code, LLVM IR, or assembly if you feel inclined to that.

While LLVM IR looks decently general, I would recommend not to base your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.

I know very little currently about LLVM that's something I've started to explore (and there's a managed class library that can be used from C# too which is important).

As for language specific IR that's noted, I've just started to look at creating an AST from the Antlr CST, that's going quite well, once I get the basic pattern I'm going to use, established, this will become a very meaty (yet extremely interesting) phase of the work. It's already reporting some initial diagnostics too but the overall structure of the tree is still very fluid.

You're right too, it might be prudent to consider an IR decoupled from LLVM, never did that in my earlier work but it could pay dividends, the ANSI spec for PL/I Subset G is a fascinating document actually possibly the first language to have a formally defined concrete syntax and abstract syntax, the abstract syntax is in fact described in terms of an imaginary "PL/I machine" so I might dig into that for ideas, sadly the ANSI standard is not public, a shame because its a first class example of a formal language definition (I have a printed copy I purchased decades ago).

I also joined a compiler design online discussion group which has been most helpful.

You sound like someone who knows a thing or two about all this, have you coded with LLVM much at all?

DiTBho · « **Reply #11 on:** February 01, 2023, 04:00:37 pm »

Quote from: tggzzz on February 01, 2023, 11:27:54 am

Nah, it is good hardware design because it pushes the hard rare issues to be solved by software.

hehe

right, I think the hardware guys must have thought something like, "oh, well, it'll be up to the software guys to fix it, so let's ignore it because we are terribly late with the project deadline"

And so they did, and commercial guys just said "approved", and then the result was so bad that the next product has a different approach.

Yup, there are commercial MIPS4 systems where { R10k, R12K, R14K, R16k } also expose several several problems load/store instructions after a mispredicted brunch.

_____w_____h_____a_____t_____?_____

_____"cache-coherent system"_____ vs _____"non-cache-coherent system"_____

In "cache-coherent system", load/store would just cancel this instruction later on, and no one would notice anything
In "non cache-coherent systems" (like that crap I mentioned), it is impossible to cancel loads or stores, and that may result in cache line being marked as dirty, which is catastrophic and causes kernel panic on DMA going on to the same location during that time, since data will be lost after write-back.

That madness was commercialized, and, worse still, Gcc never had a solution, so you cannot run Linux on I2/R10K. There are patches (cache-flush after each I/O), sure, but ... they don't really work and they have never been accepted main-stream.

I can understand those guys, because my-c needs to force extra "cache invalidation" op after each DMA. It makes code-optimization ... all a joke.

Anyway, in my Atlas board, there is an hw-feature that causes any stores to all but first 32M of KSEG0 addresses to fail, and that allows you to configure KSEG0 as "cached coherent", you only need to put the kernel code into first 32M of RAM, and use HMEM for the rest. Any memory that is being target of DMA transfer will be removed from TLB, and thus inaccessible while DMA is in progress.

Just, you have to make sure it isn't brought back into TLB by an accidental reference.

Oh, but! BUT!!! on the second MIPS5++ cpu-board prototype I can completely disable the cache and remap it as kseg0 ram, an incredible pleasure!!!! What can you do with 2Mbyte of synchronous ram chipping at 400Mhz with ZERO wait cycles!!!

Nominal Animal · « **Reply #12 on:** February 01, 2023, 06:10:18 pm »

Quote from: DiTBho on February 01, 2023, 04:00:37 pm

I can understand those guys, because my-c needs to force extra "cache invalidation" op after each DMA. It makes code-optimization ... all a joke.

On i.MX RT1062, forgetting the cache invalidation after DMA isn't catastrophic, it can just give "wrong" data. (If the area is already cached, then reading will yield stale data; if the area is already cached, a write will overwrite the DMA'd data. No interrupt or such is generated. The cache is also small enough – 32k for code, 32k for data – so that just doing an unconditional data cache eviction on DMA write completion interrupts and before DMA reads, has acceptable overhead.)

There is a lot of talk about cache coherency on the linux-kernel mailing list amongst key kernel developers. When you have more than one processor, it gets quite complicated. In particular, you must resolve cache conflicts at the point when one processor loads the cache line (and we can model DMA on these architectures by treating it as a separate processor, just without its own cache per se); it is basically impossible to resolve cache conflicts when more than one processor has the data cached. (You can imagine why: both processors think they have valid data, and have already done work on said data. If one wins, then the work done by the other would have to be undone somehow. Any other resolution leads to work having been done with the wrong data.)

This is related to code optimization and to how IR can subtly limit a compiler, because the Linux kernel itself is based on axiomatic assumptions (especially wrt. cache coherency and locking primitives), and porting the kernel to a system where those assumptions do not hold is extremely painful.
Similarly, trying to implement language features not supported by the intermediate representation, is often quite painful; like stuffing a rectangular peg through a smaller round hole.

DiTBho · « **Reply #13 on:** February 01, 2023, 07:43:34 pm »

Quote from: Nominal Animal on February 01, 2023, 06:10:18 pm

On i.MX RT1062, forgetting the cache invalidation after DMA isn't catastrophic, it can just give "wrong" data.

Linux/MIPS on Systems like O2/{R10K, R12K} and Indigo2/{R8K, R10K} crashes in less than 5 seconds from the bootstrap.

That's why I say "catastrophic"

Usually the severe failure happens after DMA transfer on the SCSI sub-unit, but also with *UMA unit, so even with the ram-rootfs (kernel + ram-rootfs, to exclude the SCSI issue) when some data is just wrong.

Sure, there are file systems for non-cache-coherent multi-cores (e.g. Hare-fs), but Linux "as-is" has no support and crashes.

So, running Linux on SGI O2/MIPS4, and/or on Indigo2 means "kernel crash *ALWAYS* guaranteed within 5 second from the bootstrap"(1), unless you "patch gcc" to force cache-invalidation/flush.

I gave up years ago

(1) hey? can we say this is a feature?
Guaranteed to crash within 5 sec

DiTBho · « **Reply #14 on:** February 01, 2023, 07:50:02 pm »

p.s.
also the ***32Mbyte*** KSEG0 as "cached coherent" hardware trick is great and unique!!! Other systems only give you 4-8Mbyte. With 32Mbyte I can have the Linux kernel(~6Mbyte) + mini-ram-rootfs(~16Mbyte) all running without any issue!

tggzzz · « **Reply #15 on:** February 01, 2023, 07:55:53 pm »

Quote from: DiTBho on February 01, 2023, 07:50:02 pm

p.s.
also the ***32Mbyte*** KSEG0 as "cached coherent" hardware trick is great and unique!!! Other systems only give you 4-8Mbyte. With 32Mbyte I can have the Linux kernel(~6Mbyte) + mini-ram-rootfs(~16Mbyte) all running without any issue!

Caches are a hack, which work and fail randomly.

Basic principle: if you know how to optimise, do it - but when you don't, randomise it. Hence caches, spread spectrum, ...

SiliconWizard · « **Reply #16 on:** February 01, 2023, 08:05:20 pm »

Quote from: Sherlock Holmes on February 01, 2023, 02:59:35 pm

Quote from: SiliconWizard on February 01, 2023, 05:46:53 am
Great paper. Sure some points are a bit outdated, when related to limitations of computers from the 80's and 90's.

The discussion about definition modules is interesting.

As to the "quality" of the IR impacting the output of optimizers, this can be experimented relatively easily. These days with LLVM, you can generate LLVM IR in various ways and check the impact with various optimizations, all with pretty state-of-the-art optimization techniques.

For someone implementing a compiler, I would recommend defining its own IR, fully adapted to the language itself, and then implementing various generators from this IR - can be C code, LLVM IR, or assembly if you feel inclined to that.

While LLVM IR looks decently general, I would recommend not to base your IR directly on it, as it may unnecessarily constrain and direct your language design, even if in subtle ways.

I know very little currently about LLVM that's something I've started to explore (and there's a managed class library that can be used from C# too which is important).

LLVM can take its native IR code in pure text form, and I'd recommend using that for starters. No need to fiddle with the LLVM libraries directly until you want to generate a fully "standalone" compiler. One thing at a time. You can absolutely start with just the LLVM binaries.

gf · « **Reply #17 on:** February 01, 2023, 08:57:39 pm »

Quote from: tggzzz on February 01, 2023, 07:55:53 pm

Caches are a hack, which work and fail randomly.

Well, enjoy the performance of a cacheless PC

SiliconWizard · « **Reply #18 on:** February 01, 2023, 09:10:31 pm »

Quote from: gf on February 01, 2023, 08:57:39 pm

Quote from: tggzzz on February 01, 2023, 07:55:53 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC

Caches are a pretty useful hack.

gf · « **Reply #19 on:** February 01, 2023, 09:22:23 pm »

Quote from: SiliconWizard on February 01, 2023, 09:10:31 pm

Quote from: gf on February 01, 2023, 08:57:39 pm
Quote from: tggzzz on February 01, 2023, 07:55:53 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC

Caches are a pretty useful hack.

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPUs clock speed

tggzzz · « **Reply #20 on:** February 01, 2023, 10:04:26 pm »

Quote from: gf on February 01, 2023, 08:57:39 pm

Quote from: tggzzz on February 01, 2023, 07:55:53 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC

You are referring to the "mean" or "expected" performance of a PC, which can drop by orders of magnitude when least convenient. Hells teeth, even on a 486 with its miniscule caches, it was possible to see mean:worst interrupt times of 8:1. It ain't got better

For a mere hint at worst case performance, look at any benchmark which shows performance as a function of array size. The benchmark articles spend lots of words discussing where the performance cliffs are there, and why.

Then add in the effects that other programs have on yours.

And don't forget to add in the effects of having to refill TLBs and other caches from disk :0

tggzzz · « **Reply #21 on:** February 01, 2023, 10:05:55 pm »

Quote from: gf on February 01, 2023, 09:22:23 pm

Quote from: SiliconWizard on February 01, 2023, 09:10:31 pm
Quote from: gf on February 01, 2023, 08:57:39 pm
Quote from: tggzzz on February 01, 2023, 07:55:53 pm
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC

Caches are a pretty useful hack.

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPUs clock speed

You mean you can't use the TI9900's trick of having the registers located in RAM? Shame; that made context switching trivial

DiTBho · « **Reply #22 on:** February 01, 2023, 11:20:38 pm »

@tgzzz
there is even a further hardware option

so, tell me, if you were me, would you run the MIPS5++ Atlas board ***always*** with cache disabled and resource reassigned as static synchronous ram?

The CPU clock is 400Mhz
The cache clock is 400Mhz
The cache re-assigned as simple-linear-ram works at 400Mhz, with ZERO wait state
(there is a small fpga to handle chips as "cache"(small L1 + L2, or large L1) or as "linear ram")

32Mbyte cache, mappable as KSEG0 "cached coherent" so that *every* write in that area is guaranteed to fail about its dirty state (this is exactly the effect you have with cache flush) to ensure coherency
cache controller disabled, 32Mbyte reusable as linear-fast-ram

You can choose one of these two operating modes!

For me, 32Mbyte of fast-ram is huge space!!! Especially for small kernels like XINU (1Mbyte) or ucOS/2 (256Kbyte)!!!

Physically these 32Mbyte are on two large boards made with 512 Kbyte each synchronous ram chips. Sure, lot of chips (on each ram-board, 16 chips on SideA and 16 chips on sideB, 66 chips in total, including the fpga-logic to address them), but it doesn't look too bad with high-density smd.

And! you can always ask the CPU to disable the delay branch slot (only 1 bit to be set), so after a branch-instruction the CPU will automatically insert a "bubble" (works like NOP)! WOW, that's awesome for a Pipelined MIPS!!! Because it means you no more have to care about "what" the compiler puts after a branch.

Code: [Select]

       /*
        * branch if the first register operand is less than the second
        */
       slt %t0, %t1, %t2 // %t0 = 1 if %t1 < %t2; otherwise %t0 = 0
       bne %t0, %0, L1   // branch to L1 if %t0      nop // this is always exect
       nop               // on pipelined implementation, this is always executed

       ...

L1:

The m88100 prototype has a similar option. I like the Motorola 88k, and I love that idea with MIPS5++!!! But the Atlas, as platform, with that "disable cache and re-use the resource as ram" really looks close to what an ideal computer should look like!

Also, it makes writing an HL compiler easier

tggzzz · « **Reply #23 on:** February 01, 2023, 11:37:23 pm »

Cache or no cache? Hmmm. I think it would depend on the objectives and constraints.

If mean performance is more important, then caches might help.

If predictable or repeatable performance is more important, then caches might hinder and fast RAM might be useful.

If the code and data cannot fit into fast RAM and cannot be partitioned between fast and slow RAM, then caches might help.

No surprises there, I think.

Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.

Nominal Animal · « **Reply #24 on:** February 02, 2023, 12:06:53 am »

Quote from: tggzzz on February 01, 2023, 11:37:23 pm

Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.

The other end of the spectrum is high-performance computing, where performance is money.

If we limit to embedded, then many appliances – routers, file servers and NAS boxes – benefit more from performance than predictability of every single operation; we do want some limits, but the average/typical/expected performance is what the humans care about.

This, too, is a good example of how you want the decision made by a human designer, instead of a compiler: the exact same thing, say an IP stack, may require predictable (adequate) performance in one use case, and variable but as high performance as possible in another. It is not something you can leave for the tools to decide.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Code Optimization (Read 4106 times)

Share me