Author Topic: Code Optimization  (Read 4131 times)


Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #25 on: February 02, 2023, 08:16:23 am »
Having said that, my general preference is for predictably slow in preference to maybe fast maybe slow. And that applies to more than software, e.g. to trains running on different routes to the same destination, or to waiting while making a phone call to a call centre, or to knowing when to go and pick someone up in the car. Predictability gives me the opportunity to prioritise what matters more to me.
The other end of the spectrum is high-performance computing, where performance is money.

If we limit to embedded, then many appliances – routers, file servers and NAS boxes – benefit more from performance than predictability of every single operation; we do want some limits, but the average/typical/expected performance is what the humans care about.

This, too, is a good example of how you want the decision made by a human designer, instead of a compiler: the exact same thing, say an IP stack, may require predictable (adequate) performance in one use case, and variable but as high performance as possible in another.  It is not something you can leave for the tools to decide.

Agreed.

HPC is an interesting case where they have always pushed the boundaries of the possible (e.g. was MIPS, now MIPS/watt) and stressed processors, memory systems, and compilers to breaking points. It isn't uncommon that people on the sharp end know exactly where the skeletons are buried, even though other people deny there are even skeletons.

The HPC mob also knows what they don't need. One famous example from the Cray era is exactly correct floating-point arithmetic. They sensibly take the attitude that FP numbers can only be approximate, and that your algorithm plus input data define the output precision.

While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Code Optimization
« Reply #26 on: February 02, 2023, 10:44:33 am »
While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.
Exactly.  For compiler optimizations, that means the ability to enable/disable each as desired.

The traditional hierarchical control works pretty well: you have "optimization levels" or "types", each including a set of optimizations.

A particular example of this is how clang on ARMv7e-m is a bit too aggressive at unrolling loops at -O2, so one might wish to use -Os (optimize for size) instead, or -O2 -fno-unroll-loops.

The most important thing to users is that these are well documented (for clang, see Clang command line argument reference).  While numerically most use cases just use -O2, -Og, or -Os, there are those important cases where one wants a finer control.  The way clang allows one to enable/disable optimization for a particular function via #pragma in C and C++, is also quite useful and important, even though rarely needed.

I personally also heavily use Compiler Explorer, which can compile a number of languages, from Ada to Zig, for many target architectures, showing the generated machine code.  It is how I made sure that this C function for Cortex-M7 computes
$$S = \sum_{i=0}^{N-1} x_i C_i, \quad N \text{ even}$$
for 16-bit signed integer or fixed-point \$x_i\$ and \$C_i\$ at about four cycles per iteration, at full precision and without overflow as the accumulator \$S\$ is a 64-bit integer (using the SMLALD SIMD instruction available on Cortex-M7), when compiled with either GCC or Clang (and suitable optimization).  It is obviously useful for FIR and IIR filters, as well as FFT/DFT, when the data is 16-bit signed integer or fixed-point format.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #27 on: February 02, 2023, 02:31:21 pm »
While the tools cannot and should not make the performance tradeoffs, the tools must provide the mechanisms for the designer to make the tradeoffs.
Exactly.  For compiler optimizations, that means the ability to enable/disable each as desired.

The traditional hierarchical control works pretty well: you have "optimization levels" or "types", each including a set of optimizations.

A particular example of this is how clang on ARMv7e-m is a bit too aggressive at unrolling loops at -O2, so one might wish to use -Os (optimize for size) instead, or -O2 -fno-unroll-loops.

The most important thing to users is that these are well documented (for clang, see Clang command line argument reference).  While numerically most use cases just use -O2, -Og, or -Os, there are those important cases where one wants a finer control.  The way clang allows one to enable/disable optimization for a particular function via #pragma in C and C++, is also quite useful and important, even though rarely needed.

I personally also heavily use Compiler Explorer, which can compile a number of languages, from Ada to Zig, for many target architectures, showing the generated machine code.  It is how I made sure that this C function for Cortex-M7 computes
$$S = \sum_{i=0}^{N-1} x_i C_i, \quad N \text{ even}$$
for 16-bit signed integer or fixed-point \$x_i\$ and \$C_i\$ at about four cycles per iteration, at full precision and without overflow as the accumulator \$S\$ is a 64-bit integer (using the SMLALD SIMD instruction available on Cortex-M7), when compiled with either GCC or Clang (and suitable optimization).  It is obviously useful for FIR and IIR filters, as well as FFT/DFT, when the data is 16-bit signed integer or fixed-point format.


In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.

The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 
The following users thanked this post: Nominal Animal, DiTBho

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Code Optimization
« Reply #28 on: February 02, 2023, 03:55:05 pm »
In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.
True, and as I mentioned earlier, as an initial approximation for understanding the interactions and issues, subsystems and peripherals like DMA can be modeled as separate threads/cores running in parallel.

The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.
I do agree that twiddling with the source code to obtain the desired machine code for a specific processor and compiler combination is not something one does for general code; it's just that I do a lot of low-level computationally intensive stuff, where the performance is important.  The other option, for me, would be to write the actual machine code in extended assembly, letting the compiler choose the registers to be used.

The properties of C (and to a lesser extent, C++) as languages are such that it is very difficult, sometimes impossible, for the compiler to SIMD-vectorize the code, and to more generally convert data-sequential loops to data-parallel ones (for the backend optimizer to optimize for the target processor).  So, to make sure your most computationally intensive innermost loops are as efficient as possible, this Compiler Explorer investigation is unfortunately necessary.

Regarding my example in my previous message, it actually did not target any individual processor, but any Cortex-M4 (with FPv5) or -M7 microcontroller.  Consider it as low-level a function as, say, memcpy(): intended to be used as the core operation when implementing higher-level functions like FIR and IIR filters.  (For DFT/FFT on 16-bit real data on power-of-two numbers of samples, one probably wants to use a slightly different core operation.)

In that particular case, the odd loop iteration form was necessary, because on the ARMv7e-m target architectures (Thumb/Thumb2 instruction set), neither GCC nor Clang could combine the loop iterator and data pointers when the loop is written using a separate iterator variable.  This is actually a rather common issue with C compilers – and a good example of how both GCC and Clang suffer from the same limitation, even though they have completely different backends and intermediate representations for the code – and I've used the same pattern even on x86-64 with SSE/AVX SIMD vectorization.

Which, funnily enough, leads back to the topic at hand: optimization.  When one uses a tool like Compiler Explorer to compare such tight loops, including which languages and which compilers can SIMD-vectorize such loops effectively to the same architecture (say, x86-64), it well illustrates how the language model itself limits what kind of optimizations can be done.  The data-sequential to data-parallel loop conversion is the main one I'm aware (simply because it affects HPC so often), but I'm sure there are others.

(When we include distributed computing (non-uniform memory architectures on larger machines, and distributed computing on multiple machines), we get to one of my own pet peeves: the difficulties related to ensuring communication and computation can be done at the same time.  Technically, the solution is asynchronous message passing, and is well known, but for whatever reason, it seems to be quite difficult for some people to truly grok; it is mostly a human problem.  Similar, but lesser issues affect parallel shared memory accesses, mostly related to atomicity and caching and interactions between multiple separate locking primitives.  We seem to generally just not have the necessary mental machinery to manage the complexity and rules, I guess.)
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14464
  • Country: fr
Re: Code Optimization
« Reply #29 on: February 03, 2023, 02:59:41 am »
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speeds :'(

Caches are useful every time you have storage devices with different access latencies and/or speeds in a single system. This goes beyond just RAM.
No doubt some people are dreaming of a unified space with everything accessed at top speeds and in a single address space, but this is not realistic and would come with other issues as well.
Until then, caches and translation units serve us well.

One useful approach is to be able to "lock" some fast-memory area (could be cache or something else) for blocks of code or data that you don't want to ever be swapped out of cache / go back and forth.
Of course, while this works well in tightly constrained systems, it's hardly applicable for a general-purpose system.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #30 on: February 03, 2023, 10:54:38 am »
In addition tools must have mechanisms to get information from this process/thread/core to that process/thread/core reliably. I'm agnostic as to how that is achieved; it doesn't have to be solely at the compiler level.
True, and as I mentioned earlier, as an initial approximation for understanding the interactions and issues, subsystems and peripherals like DMA can be modeled as separate threads/cores running in parallel.

Agreed.

I'll add unifying hardware interrupts and software interrupts such as exception traps and operating system context switches.

Quote
The Itanic experience made me wary of twiddling things to suit individual processors. I remember a talk (in 1995?) by someone that fettled with the Itanic's assembler code for benchmarks. Every time there was a trivial change to the processor implementation, he started again from scratch. Now that was supposed to be taken care of by the oracular compiler, but that never arrived.
I do agree that twiddling with the source code to obtain the desired machine code for a specific processor and compiler combination is not something one does for general code; it's just that I do a lot of low-level computationally intensive stuff, where the performance is important.  The other option, for me, would be to write the actual machine code in extended assembly, letting the compiler choose the registers to be used.

I frequently look at the generated code as a sanity check. It is sad that many people don't grok how HLL code maps to machine code. Too many can't even outline how a function call with arguments maps to pushing arguments on the stack (with registers as an optimisation).

I still remember the light bulb illuminating when, in 1972, I realised what Tony Hoare's famous Algol-60 compiler was doing :)

Quote
The properties of C (and to a lesser extent, C++) as languages are such that it is very difficult, sometimes impossible, for the compiler to SIMD-vectorize the code, and to more generally convert data-sequential loops to data-parallel ones (for the backend optimizer to optimize for the target processor).  So, to make sure your most computationally intensive innermost loops are as efficient as possible, this Compiler Explorer investigation is unfortunately necessary.

C the language forces pessimism, e.g. w.r.t. optimisation in the face of pointer aliasing. To get around that C tools have added workarounds along the lines of "I promise you can do this, and if I'm lying then you can legitimately empty my bank account" options. Too many people incorrectly think they understand all the pitfalls. I know I don't.

Quote
Regarding my example in my previous message, it actually did not target any individual processor, but any Cortex-M4 (with FPv5) or -M7 microcontroller.  Consider it equally low-level function as say memcpy() is: intended to be used as the core operation when implementing higher-level functions like FIR and IIR filters.  (For DFT/FFT on 16-bit real data on power-of-two number of samples, one probably wants to use a slightly different core operation.)

In that particular case, the odd loop iteration form was necessary, because on the ARMv7e-m target architectures (Thumb/Thumb2 instruction set), neither GCC nor Clang could combine the loop iterator and data pointers when the loop is written using a separate iterator variable.  This is actually a rather common issue with C compilers – and a good example of how both GCC and Clang suffer from the same, even though having completely different backends and intermediate representation for the code! –, and I've used the same pattern even on x86-64 with SSE/AVX SIMD vectorization.

Which, funnily enough, leads back to the topic at hand: optimization.  When one uses a tool like Compiler Explorer to compare such tight loops, including which languages and which compilers can SIMD-vectorize such loops effectively to the same architecture (say, x86-64), it well illustrates how the language model itself limits what kind of optimizations can be done.  The data-sequential to data-parallel loop conversion is the main one I'm aware (simply because it affects HPC so often), but I'm sure there are others.

I would want to check that such "microbenchmark optimisations" would survive in a larger program. C the language ain't good at that.

Quote
(When we include distributed computing (non-uniform memory architectures on larger machines, and distributed computing on multiple machines), we get to one of my own pet peeves: the difficulties related to ensuring communication and computation can be done at the same time.  Technically, the solution is asynchronous message passing, and is well known, but for whatever reason, it seems to be quite difficult for some people to truly grok; it is mostly a human problem.  Similar, but lesser issues affect parallel shared memory accesses, mostly related to atomicity and caching and interactions between multiple separate locking primitives.  We seem to generally just not have the necessary mental machinery to manage the complexity and rules, I guess.)

Yes indeed.

My hypothesis is that it is more natural for hardware engineers. Evidence: traditional softies can't get their head around the semantics of assignment in Verilog/VHDL.

The standard problem with message passing is choosing the granularity. Too fine and comms latency dominates computation latency. Too coarse and the size of the context in the messages becomes a problem. It doesn't take much understanding to realise that the choice depends critically on the underlying memory and comms channels.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 
The following users thanked this post: Nominal Animal

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #31 on: February 03, 2023, 11:02:04 am »
Caches are a hack, which work and fail randomly.
Well, enjoy the performance of a cacheless PC :popcorn:

Caches are a pretty useful hack. ;D

Unfortunately, Moore's Law didn't work as well for DRAM latency as it did for CPU clock speeds :'(

Caches are useful every time you have storage devices with different access latencies and/or speeds in a single system. This goes beyond just RAM.

DRAM is the new disc, and disc is the new mag tape. RPC is the new station wagon full of tapes :)

Quote
No doubt some people are dreaming of a unified space with everything accessed at top speeds and in a single address space, but this is not realistic and would come with other issues as well.

That is cretinous for several reasons.

At a small scale, NUMA highlights the DRAM/disc analogy, and screws up when memories have to be kept coherent.

At a large scale hiding parallel computation will fail when one of the nodes or comms channels fails. Detecting and recovering cannot be general purpose; it has to be part of the application. If it is hidden then the application cannot deal with it appropriately.

In general, partial failure is something that is usually ignored or swept under the carpet.

Quote
Until then, caches and translation units serve us well.

One useful approach is to be able to "lock" some fast-memory area (could be cache or something else) for blocks of code or data that you don't want to ever be swapped out of cache / go back and forth.
Of course, while this works well in tightly constrained systems, it's hardly applicable for a general-purpose system.

Just so.

IIRC the Intel i960 was the first MCU that could lock things in fast memory. I believe there have been others since, but I haven't used them.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Code Optimization
« Reply #32 on: February 03, 2023, 12:00:02 pm »
Daydreaming:

Just think how nice it would be, if you had a lots-of-cores processor with just local "cache" RAM, and a central memory arbitrator and scheduler, so that you could request sets of cachelines or pages to be mapped to the local "cache" at a desired location, work on it, and then release or return any changes, with an interrupt generated by the memory arbitrator if there is contention on that same cacheline/page.  The basic interconnect would include the cache bursts, as well as cacheline-sized or larger messages.

Cacheline ping-pong would have annoying latencies, sure, because there would be first the interrupt and its response, then the decision from the arbitrator, and only then the memory transferred.  It would be much more distributed processing than symmetric parallel processing.

Basically, your central memory wouldn't even need virtual memory mapping, but could be managed by the arbitrator as an allocator, and referred to using handles that do not need to map actual memory addresses at all.  On the writeback, it could relocate the cacheline(s)/page(s) to minimise fragmentation, too.

For an architecture where all the cores are the same kind, that doesn't sound too interesting, but consider the case where you have some DSP cores that are optimized to do just very fast arithmetic on the data, some cores that are better suited for general-purpose programming, heck, even cores that run bytecode.
Make the interconnects simple but fast enough (PCIe lanes?), optimised for messages (full cachelines or larger units, burst RAM transfers), and you'd have a truly modular architecture.  Need more computational power?  Add new cores.

Tile and xCore do already exist, but they tend to have identical cores in regular arrays, and I do not believe their memory access scheme allows full software control as described above.

Then again, I'm no hardware designer.  I'd just love to program on such an architecture, and would looove to design proper security schemes from the hardware up.  The ones we have are so ... flimsy and ad-hoc and bolted-on, they give me the willies.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #33 on: February 03, 2023, 12:18:04 pm »
Interesting dreams :)

xCORE has the advantage that it doesn't need such "fancy" cache schemes. "Not needing" is always better than complex satisfaction of needs :)

I, and computer science, understand and deal with only three numbers: zero, one, and many. Anything else is a hack which is difficult to get right and is therefore fragile. I apply that to the concept of different types of cores.

Proper security schemes? See CHERI and The Mill.

The former is "minor" alterations to existing stuff.

The latter is a complete rethink of existing stuff based on what has worked well in the past. Uniquely, it comes with the necessary constraint that all existing code must run better than it does now.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Code Optimization
« Reply #34 on: February 03, 2023, 12:43:55 pm »
Proper security schemes? See CHERI and The Mill.
Most recently, in the highest level of userspace – browsers and such –, we (the part of the software development world that actually cares about this stuff) are right now discovering that what we actually need is a hierarchy of access controls under each human user, in addition to the hierarchy of access controls at the human-user level.  Things like "human (H) needs to access processes (A) and (B), but we do not want (A) and (B) to know anything about each other, and definitely not exfiltrate data".

That sort of stuff is not something you bolt on top of something; it has to be supported at the core kernel level, and to do that, it needs hardware support (proper MMU in particular).

Over a decade ago, when I was doing research and HPC admin stuff, I constructed a similar hierarchical model for maintaining projects' web pages on the same server, using dedicated local group assignments for access controls, because there were a lot of "student admins", and maintainers changed quite often.
The same scheme could be used to secure web sites like forums and WordPress-type stuff, by ensuring that code running on the server could neither create new executables or interpretable binaries, nor rewrite themselves, and that pages only have the minimum set of privileges they need to do their thing.
Why aren't those used, and why don't WP et al. support such access partitioning schemes at all?  Because management software like Plesk, cPanel, and Webmin do not support that.  They only support one user account per site.  Besides, WP and SMF and other such software are designed to be able to upgrade themselves, which requires them to be able to rewrite their own code.  The entire ecosystem is designed to be insecure!

There is a reason BSD created Jails so early, and why Linux has seccomp and cgroups.
I personally would prefer a sub-hierarchy under user accounts (a third ID type, in addition to UID and GID, that defaults to 0, and has a similar sub-hierarchy as UIDs have) instead of cgroups, but hey, we work with what we have.
_ _ _

Circling back to code optimization, we only need to consider the relatively recent discovery of the security issues in speculative execution, and other side-channel attacks (like exfiltrating information based on how much time a known computational operation on the data takes), to see that, depending on the developer's intent, the same source code should compile to either fixed-resource machine code (to stop the timing attacks), or to maximum-performance code (because any process with access to the timing information would already have access to the plaintext data).

So, when we talk about "optimization", we also need to specify the purpose.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: Code Optimization
« Reply #35 on: February 04, 2023, 03:28:53 am »
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: Code Optimization
« Reply #36 on: February 04, 2023, 11:24:35 am »
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(

About MIPS it "depends".
It's rather chip-specific.

edit:
the same applies to embedded PowerPC.
« Last Edit: February 04, 2023, 11:27:54 am by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #37 on: February 04, 2023, 11:32:01 am »
Proper security schemes? See CHERI and The Mill.
Most recently, in the highest level of userspace – browsers and such –, we (the part of the software development world that actually cares about this stuff) are right now discovering that what we actually need is a hierarchy of access controls under each human user, in addition to the hierarchy of access controls at the human-user level.  Things like "human (H) needs to access processes (A) and (B), but we do not want (A) and (B) to know anything about each other, and definitely not exfiltrate data".

As you note below, side channels will remain a pig w.r.t. exfiltrating data. Very simple and general purpose mechanisms exist.

I remember a major company demonstrating their secure unix system, claiming (in effect) that each terminal/process was completely isolated from the others. Overnight a co-worker covered himself in congratulations by writing a short program that copied anything typed in one terminal to another. The trick: the source terminal created/deleted a large file, and the receiving terminal repeatedly checked the remaining disk space; the characters were (slowly!) exfiltrated one bit at a time in ASCII code.

Quote
That sort of stuff is not something you bolt on top of something; it has to be supported at the core kernel level, and to do that, it needs hardware support (proper MMU in particular).

Unix and Windows both have ACLs, and that should be 90% sufficient I believe.

Setting them up correctly is an expert task, and possibly not something a simple tool could achieve.

Quote
Over a decade ago, when I was doing research and HPC admin stuff, I constructed a similar hierarchical model for maintaining projects' web pages on the same server, using dedicated local group assignments for access controls, because there were a lot of "student admins", and maintainers changed quite often.
The same scheme could be used to secure web sites like forums and WordPress-type stuff, by ensuring that code running on the server could neither create new executables or interpretable binaries, nor rewrite themselves, and that pages only have the minimum set of privileges they need to do their thing.
Why aren't those used, and why doesn't WP et al. support such access partitioning schemes at all?  Because management software like Plesk, cPanel, Webmin do not support that.  They only support one user account per site.  Besides, WP and SMF and other such software are designed to be able to upgrade themselves, which requires them to be able to rewrite their own code.  The entire ecosystem is designed to be insecure!

There is a reason BSD created Jails so early, and why Linux has seccomp and cgroups.
I personally would prefer a sub-hierarchy under user accounts (a third ID type, in addition to UID and GID, that defaults to 0, and has a similar sub-hierarchy as UIDs have) instead of cgroups, but hey, we work with what we have.

Are you aware of Joanna Rutkowska's "Qubes OS A reasonably secure operating system" https://www.qubes-os.org ?

I know someone that uses it successfully. I'd try it, but it won't (for good reasons) support the drivers for my Nvidia graphics card.

Quote
Circling back to code optimization, we only need to consider the relatively recent discovery of the security issues in speculative execution, and other side-channel attacks (like exfiltrating information based on how much time a known computational operation on the data takes), to see that, depending on the developer's intent, the same source code should compile to either fixed-resource machine code (to stop the timing attacks), or to maximum-performance code (because any process with access to the timing information would already have access to the plaintext data).

So, when we talk about "optimization", we also need to specify the purpose.

Yup :)
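One concrete instance of the timing issue quoted above: an early-exit memcmp() leaks, through its running time, how many leading bytes of a secret matched.  A hedged sketch of the usual fixed-time alternative (the function name is mine, not from any particular library):

```c
#include <stddef.h>
#include <stdint.h>

/* Compare n bytes in time independent of where (or whether) they differ.
   Returns 0 if equal, nonzero otherwise.  The volatile accumulator is a
   simple (not bulletproof) way to discourage the compiler from turning
   the loop back into an early-exit comparison. */
int ct_compare(const uint8_t *a, const uint8_t *b, size_t n)
{
    volatile uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    return diff;
}
```

Note that this is exactly the kind of code an optimizer must *not* "improve": compiled for maximum performance, it may legitimately become the early-exit version again, which is why the purpose has to be specified.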
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #38 on: February 04, 2023, 11:35:28 am »
I've found the documentation of microcontroller caches to be relatively opaque :-(

Yup, and very processor specific.  Plus, in the case of Intel and AMD x86 processors, changeable even after the processor has been installed at a customer's site.

(And still people claim there are simple tests to determine whether something is hardware or software :) )
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Code Optimization
« Reply #39 on: February 04, 2023, 11:53:31 am »
Do modern processors allow cache invalidation of relatively small memory sections?
Dedicate some memory for DMA-based IO, and don't bother the other things in cache?
I've found the documentation of microcontroller caches to be relatively opaque :-(
For ARM Cortex-M7, the cache is part of the ARM core, so it is documented in the Armv7-M Architecture Reference Manual, separately from e.g. the ST and NXP manuals for their Cortex-M7 MCUs, even though ARM explicitly documents some of these details as implementation defined!  Relatively opaque? Byzantine, I'd say.  Anyway, section B2.2.7, and especially Table B2-1, describes the possible operations.

Essentially, on ARM Cortex-M7, there are write-only, memory-mapped 32-bit ARM core registers at 0xE000EF50..0xE000EF78.  Data cache invalidate by address (DCIMVAC) is at 0xE000EF5C: you write any address within the target cache line to invalidate that line.  To invalidate a region, you do that in a loop, incrementing the address by CTR.DMINLINE each iteration.  The cache line size, CTR.DMINLINE, is four times two to the power of bits 16..19 of the Cache Type Register, CTR (0xE000ED7C), i.e.
    DMINLINE = 4 << (((*(volatile uint32_t *)0xE000ED7C) >> 16) & 15);
I believe; although I think @ataradov knows more about these details than I do.  (The Set/Way approach (specifying the cache level, cache line set, the way number in the set) is downright arcane, and requires quite detailed knowledge of the processor cache architecture that I haven't seen documented anywhere; I suppose it would allow complete software control of the data cache.)
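The loop described above can be sketched as follows.  This is an illustrative sketch, assuming a Cortex-M7 with the SCB registers at their architectural addresses; the required DSB/ISB barriers are only noted in comments, since the CMSIS `__DSB()`/`__ISB()` intrinsics are target-only:

```c
#include <stdint.h>

/* Armv7-M SCB registers (architectural addresses). */
#define SCB_CTR      (*(volatile uint32_t *)0xE000ED7CUL) /* Cache Type Register      */
#define SCB_DCIMVAC  (*(volatile uint32_t *)0xE000EF5CUL) /* D-cache invalidate by MVA */

/* Pure helper: number of cache lines overlapping [addr, addr+len).
   `line` must be a power of two; on the target it comes from CTR.DMINLINE. */
static uint32_t cacheline_count(uint32_t addr, uint32_t len, uint32_t line)
{
    uint32_t first = addr & ~(line - 1u);               /* align start down */
    uint32_t last  = (addr + len - 1u) & ~(line - 1u);  /* align last byte down */
    return (last - first) / line + 1u;
}

/* Invalidate every D-cache line overlapping [addr, addr+len).  Target-only:
   dereferences the MMIO registers above.  Real code needs a DSB before the
   loop and DSB + ISB after it (e.g. CMSIS __DSB()/__ISB()). */
void dcache_invalidate_region(uint32_t addr, uint32_t len)
{
    const uint32_t line = 4u << ((SCB_CTR >> 16) & 15u); /* CTR.DMINLINE, bytes */
    uint32_t mva = addr & ~(line - 1u);
    for (uint32_t n = cacheline_count(addr, len, line); n--; mva += line)
        SCB_DCIMVAC = mva;
}
```

Aligning the start address down matters: invalidating an unaligned buffer can discard pending writes of whatever else happens to share its first and last cache lines, which is one reason DMA buffers are usually cache-line aligned and padded.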

NXP and ST Cortex-M7 parts also have an MPU that supports a number (typically 8) of memory regions.  Each region is a power-of-two number of bytes in size (32 bytes minimum) and has its own attributes, including access protection and cache policy.  You can even disable caching completely.
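A hedged sketch of marking one such region non-cacheable, e.g. for DMA buffers.  This assumes the PMSAv7 MPU register layout (RNR/RBAR/RASR at 0xE000ED98..0xE000EDA0) and the TEX/C/B encoding for Normal non-cacheable memory; always check the specific MCU's reference manual:

```c
#include <stdint.h>

/* PMSAv7 MPU registers (Armv7-M architectural addresses). */
#define MPU_RNR   (*(volatile uint32_t *)0xE000ED98UL) /* region number              */
#define MPU_RBAR  (*(volatile uint32_t *)0xE000ED9CUL) /* region base address        */
#define MPU_RASR  (*(volatile uint32_t *)0xE000EDA0UL) /* region attributes and size */

/* RASR.SIZE encodes a region of 2^(SIZE+1) bytes; pure helper, host-testable. */
static uint32_t mpu_size_field(uint32_t bytes)
{
    uint32_t size = 0;
    while ((2u << size) < bytes)  /* 2u << size == 2^(size+1) */
        ++size;
    return size;
}

/* Mark a power-of-two region (>= 32 bytes, base aligned to its size) as
   Normal, non-cacheable, shareable, read/write.  Target-only: real code also
   disables the MPU around reconfiguration and issues DSB/ISB afterwards. */
void mpu_set_noncacheable(uint32_t region, uint32_t base, uint32_t bytes)
{
    MPU_RNR  = region;
    MPU_RBAR = base;                         /* must be aligned to `bytes`            */
    MPU_RASR = (3u << 24)                    /* AP=011: full access                   */
             | (1u << 19)                    /* TEX=001, C=0, B=0: Normal, non-cacheable */
             | (1u << 18)                    /* S=1: shareable                        */
             | (mpu_size_field(bytes) << 1)  /* SIZE field, bits 5:1                  */
             | 1u;                           /* ENABLE                                */
}
```

With such a region covering the DMA buffers, the invalidate/clean dance can be skipped entirely for that memory, at the cost of slower CPU access to it.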
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6255
  • Country: fi
    • My home page and email address
Re: Code Optimization
« Reply #40 on: February 04, 2023, 12:08:49 pm »
Setting [proper security scheme] up correctly is an expert task, and possibly not something a simple tool could achieve.
Definitely; and it is not something one should let random developers at, as there is lots of experience in security schemes already available, and most developers just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.

Just as the intermediate representation can subtly limit the code optimization options and approaches available, our web services are not insecure because the operating systems lack better, more secure approaches; they are insecure because of arbitrary (easy but bad) choices made in the design and implementation of the services themselves, and of the software that manages them.  The inertia of "this is how everybody else does it" is enormous and very difficult to overcome, even though it constantly leads to user information being leaked and exploited illegally.

 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19493
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Code Optimization
« Reply #41 on: February 04, 2023, 12:20:33 pm »
Setting [proper security scheme] up correctly is an expert task, and possibly not something a simple tool could achieve.
Definitely; and it is not something one should let random developers at, as there is lots of experience in security schemes already available, and most developers just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.

My version of that would be....
Definitely; and it is not something one should let random [s]developers[/s] people at, as there is lots of experience in security schemes already available, and most [s]developers[/s] people just aren't diligent and careful enough to learn from history, ending up repeating the same mistakes again and again.

Not helped by job agencies and HR writing PSBD on people's CVs.  An excellent techie we employed had previously seen that on his CV; after he asked several times what it meant, they revealed it stood for "past sell by date".

Quote
Just as the intermediate representation can subtly limit the code optimization options and approaches available, our web services are not insecure because the operating systems lack better, more secure approaches; they are insecure because of arbitrary (easy but bad) choices made in the design and implementation of the services themselves, and of the software that manages them.  The inertia of "this is how everybody else does it" is enormous and very difficult to overcome, even though it constantly leads to user information being leaked and exploited illegally.

It is much cheaper and faster, and equally effective, simply to say "the security and privacy of our customers' information is our (highest) priority".
 

