Author Topic: Neat algorithms (Read 28557 times)

Nominal Animal · « **Reply #150 on:** August 28, 2019, 04:52:50 am »

Quote from: legacy on August 27, 2019, 08:39:07 pm

Anyway, the division module is still problematic for me because it still takes 32 clocks for computing 32 bit. The reson is that I haven't yet found an algorithm with an acceptable complexity.

What are you using now? Binary long division (i.e., shift and subtract)?

Berni · « **Reply #151 on:** August 28, 2019, 05:24:48 am »

Division is going to be slow no matter how you do it. It's normal for even modern CPUs to take many times more cycles do to a divide compared to a multiply. So whenever possible you usually masquerade the divide as a multiply operation with bit shift. Tho its not always possible to do, but is in most signal processing use cases.

Yeah these "DSP" blocks in FPGAs are mostly just glorified multiply accumulate units. The only reason they are called DSP is because multiply accumulate performance is the primary performance metric for DSPs as that's what they are good at (And most DSP algorithms need it). So the point is that these DSP blocks in a FPGA are typically significantly faster than a multiplier implemented in the logic fabric and it uses less silicon area so more of them can be packed in. Having more is very useful because you typically in FPGAs dedicate multiply units to one task in a large algorithm so you usually need many of them to implement it, as well as the typical running of many copies of the algorithms in parallel to take advantage of a FPGA as otherwise you might as well just be using a CPU or normal DSP in a lot of cases.

legacy · « **Reply #152 on:** August 28, 2019, 01:00:44 pm »

Quote from: Nominal Animal on August 28, 2019, 04:52:50 am

What are you using now? Binary long division (i.e., shift and subtract)?

Old-school is slow, but it's also simple

I have also tried to implement a newton-raphson module, which is fast but damn too complex

So, this is what I have currently implemented (in the simulation, the division is 8bit)

SiliconWizard · « **Reply #153 on:** August 28, 2019, 03:40:15 pm »

But do you have a working ADA compiler for your Arise-v2 core yet?

legacy · « **Reply #154 on:** August 28, 2019, 05:04:29 pm »

Quote from: SiliconWizard on August 28, 2019, 03:40:15 pm

But do you have a working ADA compiler for your Arise-v2 core yet?

LOL. Unfortunately, only an asm-compiler

coppice · « **Reply #155 on:** August 28, 2019, 08:10:03 pm »

Quote from: Berni on August 28, 2019, 05:24:48 am

Division is going to be slow no matter how you do it.

Yeah. Come up with a way to divide that's as fast as multiplying and generations of algorithms working around the slowness of divide go out the window.

jpanhalt · « **Reply #156 on:** August 28, 2019, 10:49:20 pm »

Quote from: coppice on August 28, 2019, 08:10:03 pm

Quote from: Berni on August 28, 2019, 05:24:48 am
Division is going to be slow no matter how you do it.
Yeah. Come up with a way to divide that's as fast as multiplying and generations of algorithms working around the slowness of divide go out the window.

That's like saying you need a shorter piece of string.

What size data are you dealing with? 128/64 bit? 8/8 bit? Or something else? How many Tcy does your current algorithm take? In other words, there are some fairly fast algorithms for division. The so-called Kenyan (aka Russian and a lot of other names) and PicList has some others.

coppice · « **Reply #157 on:** August 29, 2019, 12:48:38 am »

Quote from: jpanhalt on August 28, 2019, 10:49:20 pm

Quote from: coppice on August 28, 2019, 08:10:03 pm
Quote from: Berni on August 28, 2019, 05:24:48 am
Division is going to be slow no matter how you do it.
Yeah. Come up with a way to divide that's as fast as multiplying and generations of algorithms working around the slowness of divide go out the window.

That's like saying you need a shorter piece of string.

What size data are you dealing with? 128/64 bit? 8/8 bit? Or something else? How many Tcy does your current algorithm take? In other words, there are some fairly fast algorithms for division. The so-called Kenyan (aka Russian and a lot of other names) and PicList has some others.

Obviously for really small numbers of bits you can often use a lookup approach. As the bits expand you are quickly into successive approximation approaches, typically with a skip ahead for small values, to avoid the logic exploding to unrealistic numbers of gates. The Russian division technique is basically successive approximation, although its often expressed in a way that doesn't immediately look like it is. After so many years of people failing to find the division equivalent of Wallace or Booth, that can be implemented in a realistic number of gates, with all the carry look ahead needed to avoid severe ripple delays, it seem unlikely we are going to do better than we do today.

SiliconWizard · « **Reply #158 on:** August 29, 2019, 12:52:05 am »

The fastest way of dividing, of course, being not to divide at all.

coppice · « **Reply #159 on:** August 29, 2019, 12:54:48 am »

Quote from: SiliconWizard on August 29, 2019, 12:52:05 am

The fastest way of dividing, of course, being not to divide at all.

The fastest way to divide is typically to reformulate the problems until it turns into a multiply.

legacy · « **Reply #160 on:** August 29, 2019, 05:50:16 am »

Quote from: jpanhalt on August 28, 2019, 10:49:20 pm

What size data are you dealing with?

The current VHDL code is parametric, and the data_size for all the ALU's modules, and for all the Cop1's module can be { 8, 16, 32, 64 } bit. This is not a problem for the simulator, neither for the synthesizers (Xilinx ISE v10.1 for Spartan2, Xilinx ISE v14 for Spartan3-6), but concerning the ISA ... well, the Arise-v2-r2 core is ... 32bit; there is also a 64bit branch, but people inside the group (four friends including me) have 4*3*2 (n!) different opinions regarding details (d'oh

)

So, let's assume: 32bit

Quote from: jpanhalt on August 28, 2019, 10:49:20 pm

How many Tcy does your current algorithm take?

Currently, it evolves 1 clock to compute 1 bit, so 32 clock edges to compute quotient and reminder, both of 32bit of size, plus 2 clocks for handling the sign and the fsm.

Quote from: jpanhalt on August 28, 2019, 10:49:20 pm

Kenyan (aka Russian and a lot of other names) and PicList has some others.

can you tell me more?

jpanhalt · « **Reply #161 on:** August 29, 2019, 10:53:02 am »

@legacy

Re: "Kenyan" Here you go: http://www.piclist.com/techref/method/math/muldiv.htm

jpanhalt · « **Reply #162 on:** August 29, 2019, 12:35:21 pm »

Quote from: legacy

So, let's assume: 32bit

Quote from: jpanhalt on August 28, 2019, 10:49:20 pm
How many Tcy does your current algorithm take?

Currently, it evolves 1 clock to compute 1 bit, so 32 clock edges to compute quotient and reminder, both of 32bit of size, plus 2 clocks for handling the sign and the fsm.

I was thinking embedded microcontrollers for which (assuming 8-bit MCU's) the fastest routine is considerably slower than the product of the two arguments, e.g., 24-bit divided by 16-bit (384). For example:

http://www.piclist.com/techref/microchip/math/div/24by16.htm

Quote

;Inputs:
; Dividend - AARGB0:AARGB1:AARGB2 (0 - most significant!)
; Divisor - BARGB0:BARGB1
;Temporary:
; Counter - LOOPCOUNT
; Remainder- REMB0:REMB1
;Output:
; Quotient - AARGB0:AARGB1:AARGB2
;
; Size: 28
; Max timing: 4+24*(6+6+4+3+6)-1+3+2=608 cycles (with return)
; Min timing: 4+24*(6+6+5+6)-1+3+2=560 cycles (with return)

NorthGuy · « **Reply #163 on:** August 29, 2019, 01:05:51 pm »

Quote from: legacy on August 29, 2019, 05:50:16 am

Currently, it evolves 1 clock to compute 1 bit, so 32 clock edges to compute quotient and reminder, both of 32bit of size, plus 2 clocks for handling the sign and the fsm.

16-bit PICs do 16-bit division in 18 cycles (16 + 2). What a coincidence

mrflibble · « **Reply #164 on:** August 29, 2019, 02:13:48 pm »

On the subject of Kenyan/Karatsuba/etc multiplication & division, this book might be of interest:
* https://cosec.bit.uni-bonn.de/science/mca/
* https://cosec.bit.uni-bonn.de/fileadmin/user_upload/science/mca/mcaintro.pdf
* https://assets.cambridge.org/97811070/39032/excerpt/9781107039032_excerpt.pdf
* https://assets.cambridge.org/97811070/39032/index/9781107039032_index.pdf

Specifically chapters 5,6,8,9,14,15. Or probably for the topics at hand in this order: chapters 8, 9, 5, 6, 14, 15.

mrflibble · « **Reply #165 on:** August 29, 2019, 05:20:30 pm »

Quote from: RoGeorge on August 25, 2019, 07:56:07 am

My intention was to move the talk about guidelines to its own topic, but I failed, sorry.

Quote from: RoGeorge on August 26, 2019, 09:57:13 am

Feynman's algorithm for logarithms

Quote from: doi:10.1063/1.881196
Feynman worked out in some detail the program for
computing Hopfield's network on the Connection Ma-
chine.
...

Well, you at least tried to keep it somewhat on topic, which counts for something. Oh, and bonus points for providing the doi:10.1063/1.881196.

I wish people would use that more often on here. And especially certain people who bang on about providing references ... and then proceed to not provide references themselves. Or they do provide references, but forget to actually check them. Thus ending up with a reference that resolves to, yes the correct author & subject, but alas the wrong publication. Which of course happens to be a condensed version of the original paper, and equally of course, does not contain the particular snippet pertaining to the subject in the forum post. Takes serious practice, that!

Also, thanks for the link you posted in some other thread: https://www.av8n.com/. There's lots of interesting tidbits there. Especially the bits about ill posed problems and metacognition.

And before I forget, another (IMO) awesome resource is the Euler Archive. In particular the How Euler Did It section. From the main page:

Quote from: http://eulerarchive.maa.org/hedi/

How Euler Did It is an online MAA column, written by Ed Sandifer of Western Connecticut State University from 2003 to 2010. Each article examines a specific work or concept developed by Leonhard Euler, with the topics ranging from number theory to geography to fluid mechanics.

The Euler Archive, in collaboration with the MAA, hosts the article collection for the How Euler Did It series. The series was published in two volumes, both volumes are available from the MAA: How Euler Did It (2007), How Euler Did Even More (2014).

And while we're at it Project Euler provides a nice collection of Euler related problems. There are plenty to choose from to provide a bit of a challenge.

Quote from: https://projecteuler.net/

What is Project Euler?

Project Euler is a series of challenging mathematical/computer programming problems that will require more than just mathematical insights to solve. Although mathematics will help you arrive at elegant and efficient methods, the use of a computer and programming skills will be required to solve most problems.

The motivation for starting Project Euler, and its continuation, is to provide a platform for the inquiring mind to delve into unfamiliar areas and learn new concepts in a fun and recreational context.

jpanhalt · « **Reply #166 on:** August 29, 2019, 05:44:29 pm »

Quote from: NorthGuy on August 29, 2019, 01:05:51 pm

Quote from: legacy on August 29, 2019, 05:50:16 am
Currently, it evolves 1 clock to compute 1 bit, so 32 clock edges to compute quotient and reminder, both of 32bit of size, plus 2 clocks for handling the sign and the fsm.

16-bit PICs do 16-bit division in 18 cycles (16 + 2). What a coincidence

That's a good point, and every time I struggle with multiply or divide with my my little PIC16F chips , I wonder about making the jump to a PIC24F. However, I would not consider such slick hardware solutions, "neat algorithms."

coppice · « **Reply #167 on:** August 29, 2019, 05:47:11 pm »

Quote from: jpanhalt on August 29, 2019, 05:44:29 pm

Quote from: NorthGuy on August 29, 2019, 01:05:51 pm
Quote from: legacy on August 29, 2019, 05:50:16 am
Currently, it evolves 1 clock to compute 1 bit, so 32 clock edges to compute quotient and reminder, both of 32bit of size, plus 2 clocks for handling the sign and the fsm.

16-bit PICs do 16-bit division in 18 cycles (16 + 2). What a coincidence

That's a good point, and every time I struggle with multiply or divide with my my little PIC16F chips , I wonder about making the jump to a PIC24F. However, I would not consider such slick hardware solutions, "neat algorithms."

You would if you were a hardware engineer, looking for the fastest or smallest way to implement an instruction set.

mrflibble · « **Reply #168 on:** September 02, 2019, 06:16:10 pm »

Quote from: legacy on August 19, 2019, 12:29:28 pm

Tango trees? Is there any good C implementation? Where is it used?

"Is there any good C implementation?" ... Only several ways to find out!

"Where is it used?" ... Anywhere you'd use a balanced binary search tree, but need better performance than said balanced binary search tree. For a quick overview, see for example this here link. For more in depth info, just google "Erik Demaine tango trees". That'll give you plenty of info of various levels of detail to choose from.

westfw · « **Reply #169 on:** September 05, 2019, 07:20:40 am »

Quote

Tango trees?

"Preferred paths ... most recently accessed..."
Sort-of sounds like a formalized and well-analyzed caching scheme, which is sort-of neat...

legacy · « **Reply #170 on:** September 06, 2019, 11:56:50 am »

Quote from: mrflibble on September 02, 2019, 06:16:10 pm

Anywhere you'd use a balanced binary search tree

yeah, it's a log(log(n)). But it looks adding more complexity, therefore it seems that it makes sense only for large data-set, when "n" is a very big number, and so big that log(n) >> log(log(n)).

brucehoult · « **Reply #171 on:** September 06, 2019, 10:31:46 pm »

Quote from: legacy on September 06, 2019, 11:56:50 am

Quote from: mrflibble on September 02, 2019, 06:16:10 pm
Anywhere you'd use a balanced binary search tree

yeah, it's a log(log(n)). But it looks adding more complexity, therefore it seems that it makes sense only for large data-set, when "n" is a very big number, and so big that log(n) >> log(log(n)).

It looks horribly complex. The code will not be small. For useful data sizes (between 64k nodes and 4G nodes) you need 5 flag bits in every node. That's going to turn into 8 bits if not 16 or 32 in reality.

If you've got that much data then you're probably dealing with storage with either cache lines or TLB or actual mass storage "disk blocks". Something with the property that you have a relatively long latency but then get a large chunk of data very quickly. When you have that kind of characteristic a binary tree is a very poor fit. Once you get that data block locally you should make as much use of it as possible. That means you should be using a B{*,+}-tree not a binary tree for vastly reduced tree depth and hence number of access time delays.

For things small enough to be in local memory with truly random access in constant time I use Scapegoat trees. That's a balanced binary tree with *zero* extra flag data in each node and extremely small and simple code that I can write in about ten minutes from memory in an exam/contest setting. They outperform AVR or red-black trees on an amortised (i.e. throughput) basis. However they are not suitable for real-time uses as on the few occasions when a rebalance is needed it is O(N) on some subtree. Splay trees also have similar performance characteristics, but much more complex code.

brucehoult · « **Reply #172 on:** September 06, 2019, 11:21:04 pm »

Here's a complete program that uses a scapegoat tree to sort lines of text. The whole program is 85 lines. The basic scapegoat tree code including find(), insert() and rebalancing is 43 lines of code, 31 of which are not blank or just "}".

Fifty entries in the global minNodesForDepth table is good for over half a billion nodes in the tree. You might want to replace the init_table() function with a static array initializer.

Code: [Select]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct node {char *key; node *left; node *right;};
int node_cnt;

#define BAL_OK 0
#define ALPHA 0.666667
#define TABSZ 50
int minNodesForDepth[TABSZ];

void init_table(){
  double A = 1.0/ALPHA;
  for (int n=0; n<TABSZ; ++n, A*=1.0/ALPHA){
    minNodesForDepth[n] = A;
  }
}

void printtree(node *root, int depth=0){
  if (!root) return;
  printtree(root->left, depth+1);
  for (int i=0; i<depth; ++i) printf("  ");
  printf("%s", root->key);
  printtree(root->right, depth+1);
}

node *find(node *root, char *key){
  int cmp = strcmp(key, root->key);
  if (cmp == 0) return root;
  return find(cmp<0 ? root->left : root->right, key);
}

node *makelinear(node *root, node *lst){
  if (!root) return lst;
  root->right = makelinear(root->right, lst);
  node *left = root->left;
  root->left = 0;
  return makelinear(left, root);
}

node *maketree(node *&lst, int n){
  if (n == 0) return 0;
  node *left = maketree(lst, (n-1)/2);
  node *root = lst; lst = lst->right;
  root->left = left;
  root->right = maketree(lst, n/2);
  return root;
}
  
int cntSz(node *root){
  return !root ? 0 : 1 + cntSz(root->left) + cntSz(root->right);
}

int insert(node *&root, node *n, int depth=0){
  if (root == 0){
    root = n;
    return ++node_cnt < minNodesForDepth[depth] ? 1 : BAL_OK;
  }

  int cmp = strcmp(n->key, root->key);
  int sz = insert(cmp<0 ? root->left : root->right, n, depth+1);
  if (sz == BAL_OK) return sz;

  int tot = sz + 1 + cntSz(cmp<0 ? root->right : root->left);
  if (sz <= tot*ALPHA) return tot;
  node *lst = makelinear(root, 0);
  root = maketree(lst, tot);
  return BAL_OK;
}

int main(){
  char buf[1000];
  init_table();
  node *tree = 0;
  while (fgets(buf, 1000, stdin)){
    node *n = (node*)malloc(sizeof(node));
    n->key = strdup(buf);
    n->left = n->right = 0;
    insert(tree, n);
  }

  printtree(tree);
  return 0;
}

legacy · « **Reply #173 on:** September 07, 2019, 09:08:09 am »

Quote from: brucehoult on September 06, 2019, 10:31:46 pm

Once you get that data block locally you should make as much use of it as possible.

Yup, it's what layer0 and layer1 do.

Layer0: DMA and low-level primitives to read/write a disk-block
Layer1: caching and LRU

Layer1 uses ram as disk-cache, while Layer0 tries to reduce the number of disk IOp by aggregating them into a large chain. This also impacts on the "i-block" allocator (find_empty_block), which tries to aggregate i-blocks as close as possible (not always possible) to facilitate the Layer1's job.

Quote from: brucehoult on September 06, 2019, 10:31:46 pm

That means you should be using a B{*,+}-tree not a binary tree for vastly reduced tree depth and hence number of access time delays.

Yup, in Layer2 I use "b+tree" because in "binary-trees" a node stores only a key and two pointers to the child nodes, whereas I need that each leaf stores as most i-blocks-entries (entries, glue between folders and files) or metadata (folders) or chunk (files' content) as possible, and seeks are always too expensive!

Each trunk and leaf is of 4Kbyte of size in my current implementation, and the layer0 consider it as the atomic size of each disk I/O operation (while the PHY layer might only consider 512 or 1024 bytes per command).

SiliconWizard · « **Reply #174 on:** September 07, 2019, 03:45:21 pm »

Note that many software implementations of trees in general, when manipulated in RAM, are a lot less efficient than expected from the algorithms due to accesses all over the place that lead to many cache misses. So in that respect, the choice of algorithm and data structures are key. It's not just a problem with RAM either, any medium that can be accessed more efficiently as consecutive blocks (hard drives being also that way for instance) are concerned.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Neat algorithms (Read 28557 times)

Share me