Author Topic: Neat algorithms (Read 28565 times)

BravoV · « **Reply #125 on:** August 25, 2019, 07:04:21 am »

Quote from: mfro on August 25, 2019, 06:57:39 am

If you have a nice recursive algorithm but fear of overrunning stack, you can easily keep track of your recursion depth with an additional parameter you decrement each call and compare to a (hopefully reasonably choosen) maximum recursion depth (exiting with an error when the threshold has been reached).

This obviously uses extra stack space (and a little performance), so its up to you to to decide if you want to keep it in production code or remove it and call your testing (and additional safety marging due to its removal) good enough to be safe.

Not an expert in programming, nor experienced.

I've been wondering about the "control" on stack usage in recursion programming, seeing people raised lots of concerns.

My simple question is, if running a recursive program, without any control or aware of the stack limitation, isn't that like using a cpu by designed, that can't handle say like div by zero (eg. NMI or etc) from the 1st place ?

Edit : Simpler analogy, driving a tank with eyes closed, that you can't really be sure that you're crushing your enemy (expected & wanted) or your friends.

Cmiiw.

RoGeorge · « **Reply #126 on:** August 25, 2019, 07:13:12 am »

There is another topic about following or not the programming guidelines.

Quoting from the attached PDF: https://www.eevblog.com/forum/chat/dogmatic-coding-standards/msg2636658/#msg2636658

Quote

Avoiding recursion results in having an acyclic function call graph, which code analyzers can exploit to prove limits on stack use and boundedness of exe-cutions.

brucehoult · « **Reply #127 on:** August 25, 2019, 07:24:17 am »

Quote from: RoGeorge on August 25, 2019, 07:13:12 am

There is another topic about following or not the programming guidelines.

Quoting from the attached PDF: https://www.eevblog.com/forum/chat/dogmatic-coding-standards/msg2636658/#msg2636658
Quote
Avoiding recursion results in having an acyclic function call graph, which code analyzers can exploit to prove limits on stack use and boundedness of exe-cutions.

All that does is moves the problem from proving that you won't recurse too deep to proving that you won't run out of heap space or that the array you statically allocated for your hand-managed stack is large enough (and you didn't write buggy code doing it).

mfro · « **Reply #128 on:** August 25, 2019, 07:32:26 am »

Quote from: BravoV on August 25, 2019, 07:04:21 am

... I've been wondering about the "control" on stack usage in recursion programming, seeing people raised lots of concerns.

My simple question is, if running a recursive program, without any control or aware of the stack limitation, isn't that like using a cpu by designed, that can't handle say like div by zero (eg. NMI or etc) from the 1st place ?

I think that is more a question of how to avoid/detect stack overflow in any program, not just those that use recursive algorithms.

Any program where the call stack depth is dependend on run time values is potentially prone to stack overflow, i.e. refraining from recursion is no guarantee to avoid it (it's just that recursion will make it more likely as it increases stack load).

There are methods to detect a stack overflow; ranging from the poor man's stack guard where you vaccinate bottom of stack with a 'magic value' you regularily check for overwrites to unmapped pages below the stack on architectures that have a MMU.

Small µCs usually do not have the luxury of hardware supported faults on stack overflow, leading to the usual 'common sense' approach to ban everything that might cause it from µC code (MISRA-C, for example, explicitly prohibits recursion).

RoGeorge · « **Reply #129 on:** August 25, 2019, 07:56:07 am »

My intention was to move the talk about guidelines to its own topic, but I failed, sorry.

T3sl4co1l · « **Reply #130 on:** August 25, 2019, 09:46:39 am »

Quote from: mfro on August 25, 2019, 06:57:39 am

If you have a nice recursive algorithm but fear of overrunning stack, you can easily keep track of your recursion depth with an additional parameter you decrement each call and compare to a (hopefully reasonably choosen) maximum recursion depth (exiting with an error when the threshold has been reached).

This obviously uses extra stack space (and a little performance), so its up to you to to decide if you want to keep it in production code or remove it and call your testing (and additional safety marging due to its removal) good enough to be safe.

Or use a static variable and increment it on every call and decrement it on every return.

(Disclaimer: may break tail recursion optimization?)

Tim

coppice · « **Reply #131 on:** August 25, 2019, 11:48:35 am »

Quote from: BravoV on August 25, 2019, 07:04:21 am

Quote from: mfro on August 25, 2019, 06:57:39 am
If you have a nice recursive algorithm but fear of overrunning stack, you can easily keep track of your recursion depth with an additional parameter you decrement each call and compare to a (hopefully reasonably choosen) maximum recursion depth (exiting with an error when the threshold has been reached).

This obviously uses extra stack space (and a little performance), so its up to you to to decide if you want to keep it in production code or remove it and call your testing (and additional safety marging due to its removal) good enough to be safe.

Not an expert in programming, nor experienced.

I've been wondering about the "control" on stack usage in recursion programming, seeing people raised lots of concerns.

My simple question is, if running a recursive program, without any control or aware of the stack limitation, isn't that like using a cpu by designed, that can't handle say like div by zero (eg. NMI or etc) from the 1st place ?

Edit : Simpler analogy, driving a tank with eyes closed, that you can't really be sure that you're crushing your enemy (expected & wanted) or your friends.

Cmiiw.

The whole issue of blowing the stack is a big deal for any program. Recursive routines are just a notoriously troublesome example.

In most cases it is very difficult to work out the worst case biggest possible stack requirement, and make sure you have allowed enough stack space for that. On bigger machines most people do little more than allow lots of stack space and hope for the best. On small embedded machines you have to get your stack allocation right, as there is little RAM to waste. You might get some help from static analysis tools, but those tools are not generally foolproof. For example, they don't usually have the ability to work out the worst case program call depth + the worst case mix of nested interrupts.

NorthGuy · « **Reply #132 on:** August 25, 2019, 02:50:00 pm »

Quote from: coppice on August 25, 2019, 11:48:35 am

In most cases it is very difficult to work out the worst case biggest possible stack requirement, and make sure you have allowed enough stack space for that.

I don't think so. At any rate, it's much harder to predict heap behaviour, which, unlike stack, can be fragmented, possibly after days or months of running.

SiliconWizard · « **Reply #133 on:** August 25, 2019, 02:57:45 pm »

Quote from: NorthGuy on August 25, 2019, 02:50:00 pm

Quote from: coppice on August 25, 2019, 11:48:35 am
In most cases it is very difficult to work out the worst case biggest possible stack requirement, and make sure you have allowed enough stack space for that.

I don't think so. At any rate, it's much harder to predict heap behaviour, which, unlike stack, can be fragmented, possibly after days or months of running.

Agree with that.

coppice · « **Reply #134 on:** August 25, 2019, 04:08:10 pm »

Quote from: NorthGuy on August 25, 2019, 02:50:00 pm

Quote from: coppice on August 25, 2019, 11:48:35 am
In most cases it is very difficult to work out the worst case biggest possible stack requirement, and make sure you have allowed enough stack space for that.

I don't think so. At any rate, it's much harder to predict heap behaviour, which, unlike stack, can be fragmented, possibly after days or months of running.

Whilst its true that heap analysis is harder, most small embedded designs completely avoid using a heap. They can't help using the stack, and they can't afford to waste more than a few bytes. When your worst case stack use is an usually deep point in some calls, and all the interrupts having nested, hitting worst case can be a very rare event. It has to be analysed, but most development tool chains doesn't really take that need seriously, and try to help.

RoGeorge · « **Reply #135 on:** August 26, 2019, 09:57:13 am »

Feynman's algorithm for logarithms

Quote from: doi:10.1063/1.881196

Feynman worked out in some detail the program for
computing Hopfield's network on the Connection Ma-
chine. The part that he was proudest of was the
subroutine for computing a logarithm. I mention it here
not only because it is a clever algorithm, but also
because it is a specific contribution Richard made to the
mainstream of computer science. He had invented it at
Los Alamos.

Consider the problem of finding the logarithm of a
fractional number between 1 and 2. (The algorithm can be
generalized without too much difficulty.) Feynman ob-
served that any such number can be uniquely represented
as a product of numbers of the form 1 + 2 ^-k, where k is an
integer. Testing for the presence of each of these factors in
a binary representation is simply a matter of a shift and a
subtraction. Once the factors are determined, the loga-
rithm can be computed by adding together the precomput-
ed logarithms of the factors. The algorithm fit the
Connection Machine especially well because the small
table of the logarithms of 1 + 2 ^-k could be shared by all
the processors. The entire computation took less time
than doing a division.

Nominal Animal · « **Reply #136 on:** August 26, 2019, 12:08:01 pm »

Quarter square multiplication, one of the shift-and-add algorithms: Let's say you know S_n = floor(n²/4), i.e. S_n is a quarter of n squared, rounded towards zero. If x ≥ y ≥ 0, then x·y = S_x+y - S_x-y.

I wonder if this could be used to implement a fast multiplication on an FPGA? I dunno.

SiliconWizard · « **Reply #137 on:** August 26, 2019, 01:54:03 pm »

Quote from: Nominal Animal on August 26, 2019, 12:08:01 pm

Quarter square multiplication, one of the shift-and-add algorithms: Let's say you know S_n = floor(n²/4), i.e. S_n is a quarter of n squared, rounded towards zero. If x ≥ y ≥ 0, then x·y = S_x+y - S_x-y.

I wonder if this could be used to implement a fast multiplication on an FPGA? I dunno.

For small integers, probably. Now for larger ones, I think the size of the required look-up table would quickly get impractical. Ideas?

Nominal Animal · « **Reply #138 on:** August 26, 2019, 04:09:21 pm »

Quote from: SiliconWizard on August 26, 2019, 01:54:03 pm

for larger ones, I think the size of the required look-up table would quickly get impractical. Ideas?

I experimented with some approaches, but for larger ones, a binary shift-and-add ("barrel multiplier"?) wins. (It is difficult to keep N×N-bit product with 2N-bit result under 2N ops.) So, this only makes sense if you have S_i precalculated -- actually, S_i = floor(abs(i)²/4) for i = 1-2^N .. 2^N+1-1, so that for 0≤a,b≤2^N-1, a×b becomes S_a+b - S_a-b. That makes the lookup table size just under 6 N 2^N bits, turning the multiplication into two lookups, two subtractions, and one addition. I do not know how expensive such ROM tables are in general.

For example, if one has only 8x8 to 16-bit multiplication, the naïve/direct method is to compute 32×32 to 64-bit product using (a₀ + 2⁸a₁ + 2¹⁶a₂ + 2²⁴a₃)(b₀ + 2⁸b₁ + 2¹⁶b₂ + 2²⁴b₃) = 2⁰ a₀ b₀ + 2⁸ a₀ b₁ + 2¹⁶ a₀ b₂ + 2²⁴ a₀ b₃ + 2⁸ a₁ b₀ + 2¹⁶ a₁ b₁ + 2²⁴ a₁ b₂ + 2³² a₁ b₃ + 2¹⁶ a₂ b₀ + 2²⁴ a₂ b₁ + 2³² a₂ b₂ + 2⁴⁰ a₂ b₃ + 2²⁴ a₃ b₀ + 2³² a₃ b₁ + 2⁴⁰ a₃ b₂ + 2⁴⁸ a₃ b₃.
That is, with K words to construct each N-bit unsigned integer value, you have K² terms. However, a barrel multiplier can do it in O(KN), using just one-bit shifts (through carry), and additions; in hardware, several of them can be done in parallel. (For K=4, a 32×32=64-bit multiplication can be done in seven 64-bit additions, where each summand is constructed of at most three multiplications affecting non-overlapping bits.)

Anyway, knowing such mathematical tricks exist, may be useful for some odd purposes. Neat, in my opinion.

coppice · « **Reply #139 on:** August 26, 2019, 07:24:51 pm »

Quote from: Nominal Animal on August 26, 2019, 12:08:01 pm

Quarter square multiplication, one of the shift-and-add algorithms: Let's say you know S_n = floor(n²/4), i.e. S_n is a quarter of n squared, rounded towards zero. If x ≥ y ≥ 0, then x·y = S_x+y - S_x-y.

I wonder if this could be used to implement a fast multiplication on an FPGA? I dunno.

The critical importance of fast, silicon efficient, multiplication for DSP applications means a huge number of avenues to faster multiplication have been tried. You need to try something a bit more novel than a well established mathematical identity to have a good hope of breaking new ground. There are quite a few tricks for finding approximate answers, that people generally only know about if they work in an area where the approximation is effective. There aren't many obscure tricks for generating exact results, because everyone wants those.

legacy · « **Reply #140 on:** August 26, 2019, 09:40:34 pm »

Quote from: westfw on August 25, 2019, 06:13:55 am

how much code complexity does the recursion save you?

it saves a lot of debugging and team-working time, and it saves a lot of complexity in the test-report activity.

legacy · « **Reply #141 on:** August 26, 2019, 09:51:57 pm »

Quote from: coppice on August 25, 2019, 11:48:35 am

In most cases it is very difficult to work out the worst case biggest possible stack requirement, and make sure you have allowed enough stack space for that.

That's what I showed in my first post talking about *how* we do usually measure how much stack the algorithm is going to consume in both normal and abnormal operation. But it's just to make an idea, useful when the code is draft, and it cannot be used for any ufficial test-report.

Quote from: coppice on August 25, 2019, 11:48:35 am

On bigger machines most people do little more than allow lots of stack space and hope for the best. On small embedded machines you have to get your stack allocation right, as there is little RAM to waste. You might get some help from static analysis tools, but those tools are not generally foolproof. For example, they don't usually have the ability to work out the worst case program call depth + the worst case mix of nested interrupts.

We are on big embedded machines! PPC405, PPC440, PPC460, e500, etc, running RTOSes, but what not-avionic people do find hard seem the understanding of *how* do you test things.

Worst cases happen in abnormal conditions, where the "funniest" surprises greet you with their hands

legacy · « **Reply #142 on:** August 26, 2019, 10:18:13 pm »

Quote from: brucehoult on August 25, 2019, 05:53:05 am

We're talking about B-trees here. With disk-based blocks. What's your block size?

I am on B+tree, the block size is 4Kbyte, and block-pointers (iblock) are only stored on leaves.

I have rewritten the 90% of the library recursion-free, and I can easily show hot-points to guys in the testing-squad, e.g. btree_key_inserts and node_insert_after_splitting are two examples of "write-one-read-many" critical functions (they can be used with SMP) that insert a new { key, iblock } into a leaf so as to exceed the tree's order, causing the leaf to be split in half.

This code is critical, but I have a way to express the exact use of memory that it consumes

Code: [Select]

    btree_key_t      key[DEFAULT_ORDER];
    btree_iblock_t   iblock[DEFAULT_ORDER];

Those two arrays are a private "temporary" sets of key and block-pointer to hold everything in order, including the new key and iblock. This stuff would be on the stack with the recursion approach, and writing a test case would be more complex by several orders of magnitude, including the fact that you cannot test the quality on any running hardware

Howardlong · « **Reply #143 on:** August 26, 2019, 10:35:11 pm »

Regarding recursion, my anecdote is about a cheque printing algorithm (to print the words for the numbers) that one of my ex-colleagues presented as his own work. He was always quite full of himself, and to be honest we didn’t always see eye to eye on things: he was the young a blue sky thinker, whereas I’m a cynical old fart pragmatist with decades of battle wounds. His code was a heady mix of quite complex recursive self-referring SQL and XML. I even congratulated him on his ingenuity.

So part of me was really quite impressed with what he’d produced, but part of me was dreading supporting it as it was very write only code. But it did seem to work. One day we had a much larger than normal cheque batch, but not huge, about 30,000 cheques or so, which took down the database server, and with it all the call centre users.

The way the implementation worked under the hood is that the recursion objects were repeatedly allocated in RAM, so when the batch size hit a certain size, the machine was starved of memory, and ISTR it had 768GB RAM.

While I was figuring out how to resolve it, I took his code apart, and I suddenly realised while Googling part of it that my ex-colleague had pinched the whole code verbatim off the internet.

There are two morals. The first is that one should be very careful when re-using someone else’s code for your use case. The second is that some people can come over as smart, but aren’t.

westfw · « **Reply #144 on:** August 27, 2019, 04:58:23 am »

Quote

In most cases it is very difficult to work out the worst case biggest possible stack requirement

Yeah, that's the claim. But I don't think it's true.In many cases, you're dealing with standard algorithms whose stack depth requirements have been analyzed to death, providing many published papers and textbooks and theses. That leaves only the stack requirements for each level as "unknown", and that's easy to measure for a particular compiler/cpu.Sure, "not all recursive algorithms" are that well analyzed. But you're throwing away the ones that are for no good reason.

Nominal Animal · « **Reply #145 on:** August 27, 2019, 05:38:51 am »

Quote from: coppice on August 26, 2019, 07:24:51 pm

Quote from: Nominal Animal on August 26, 2019, 12:08:01 pm
Quarter square multiplication, one of the shift-and-add algorithms: Let's say you know S_n = floor(n²/4), i.e. S_n is a quarter of n squared, rounded towards zero. If x ≥ y ≥ 0, then x·y = S_x+y - S_x-y.

I wonder if this could be used to implement a fast multiplication on an FPGA? I dunno.
The critical importance of fast, silicon efficient, multiplication for DSP applications means a huge number of avenues to faster multiplication have been tried.

Sure, but the solution space is even more huge. This is just one of those utterly simple mathematical identities that people overlook, exactly because it is too simple.

I wondered whether it was a viable approach, and whether people were aware of it, considering that 1) FPGAs have LUTs (look-up tables) anyway, 2) here the look-up table contains at most 3×2^N entries of 2N bits for N×N=2N-bit multiplication, unlike N² for direct multiplication look-up, and 3) I haven't seen this mentioned in places that talk about multiplications on FPGAs (like here).

Again, I have zero experience with FPGAs, so I do not know. I do know that if one assumes that everything interesting has already been found and examined, one will be sorely disappointed in the reality of things.

mfro · « **Reply #146 on:** August 27, 2019, 07:18:20 am »

Quote from: Nominal Animal on August 27, 2019, 05:38:51 am

Again, I have zero experience with FPGAs, so I do not know.

FPGAs usually have dedicated DSP units on silicon and vendors usually do not tell you about implementation details.

Anyway, I would not expect anybody to be able to compete with 'direct silicon' no matter how smart and simple the implementation might be.

RoGeorge · « **Reply #147 on:** August 27, 2019, 08:34:11 am »

Any dedicated hardware is in fact the "direct silicon" implementation of an algorithm, so a hardware MAC could still be improved by a more performant algorithm. (MAC == multiplier followed by an adder/accumulator, the basic block for hardware DSP in FPGAs)

An example from the former ALTERA FPGAs (now Intel):

Implementing Multipliers in FPGA Devices
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an306.pdf

A Survey and Comparative Analysis of Multiply-Accumulate (MAC) Block for Digital Signal Processing Application on ASIC and FPGA
https://scialert.net/fulltext/?doi=jas.2015.934.946

coppice · « **Reply #148 on:** August 27, 2019, 12:58:20 pm »

Quote from: mfro on August 27, 2019, 07:18:20 am

Quote from: Nominal Animal on August 27, 2019, 05:38:51 am
Again, I have zero experience with FPGAs, so I do not know.
FPGAs usually have dedicated DSP units on silicon and vendors usually do not tell you about implementation details.

The big FPGAs are mostly designed as DSP engines, so they have lots of dedicated hardware multipliers. In smaller FPGAs its still common to use macros to create multipliers out of logic blocks.

There is little point in the FPGA makers telling you the implementation details. You can easily tell from the specs. Its going to be based around a Booth or a Wallace tree, and if it has a latency greater than 1 clock cycle there will be some pipelining to help with the propagation delays in the tree.

legacy · « **Reply #149 on:** August 27, 2019, 08:39:07 pm »

Quote from: Nominal Animal on August 27, 2019, 05:38:51 am

about multiplications

My Arise-v2 core does implement MAC by a booth module. I am not considering any specific hardware, but rather a generic piece of VHDL code to be tested and simulated before it gets synthesized, so in this working conditions, "booth" is the best solution for me, and I would recommend it to everyone who wants to study how a fast MUL works.

Anyway, the division module is still problematic for me because it still takes 32 clocks for computing 32 bit. The reson is that I haven't yet found an algorithm with an acceptable complexity.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Neat algorithms (Read 28565 times)

Share me