
Benchmark: C/C++ Efficient Searching


udok:
To contribute some tangible programming info:

In the attached pictures you can find benchmarks for different
common C/C++ search methods.

In principle nothing new, but this is actual measured data,
which you can use as a basis for your programming decisions.

OS is Win10, compiler is VS2019, system is an i7-8850H laptop
(L1 cache 32 kByte, L2 = 256 kByte, L3 = 1.5 MByte).

Searched elements are 32-byte structures in an array.
The search index is an integer, so that the compare function has as little influence as possible.

The array with the N structures is always sorted in ascending order.
The search values come from a second array of integer values.
This second array is either initialized with random numbers,
or sorted linearly ascending from 1 to N.

The first picture shows the results when the elements to be searched for are random.
The second picture shows the results when the search values ascend linearly.
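
Roughly, the setup looks like this (the names here are only illustrative, not the actual benchmark code):

    #include <stdint.h>

    /* Illustrative only: a 32-byte element keyed by an integer. */
    struct element {
        int64_t key;          /* search key; the table is sorted ascending by this */
        char    payload[24];  /* filler so the structure is 32 bytes in total      */
    };

    /* table[0 .. N-1]  : the N sorted 32-byte structures               */
    /* lookups[0 .. N-1]: the keys to search for, either random numbers */
    /*                    or linearly ascending 1 .. N                  */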

Search methods:

Direct-Access:
Here we access table[map[index]] directly.
In some cases we are talking about 250 picoseconds :-)
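
In code this is just the following (illustrative sketch, using the struct element from above):

    /* Illustrative: map[] translates the search key directly into the
     * position of the element in table[], so a lookup is one indexed load. */
    static inline struct element *direct_lookup(struct element *table,
                                                const int32_t *map,
                                                int32_t index)
    {
        return &table[map[index]];
    }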

Robin-Hood Hash Map:
A self-made hash map, which uses the Robin-Hood method.
CRC32 is used as the hash function.  There are some
problems with collisions (at about 1000 elements),
which I still have to look into.
A good reference is https://codecapsule.com/2013/11/11/robin-hood-hashing/
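
The lookup core looks roughly like this (a simplified sketch of the Robin-Hood idea, not my actual code; the hash value is taken as a parameter here instead of calling CRC32):

    #include <stdint.h>

    struct rh_slot {
        uint32_t key;
        uint32_t value;
        int32_t  dist;   /* probe distance from the home slot, -1 = empty slot */
    };

    /* Simplified Robin-Hood lookup: stop as soon as the stored element is
     * closer to its home slot than we are to ours, because then the searched
     * key cannot appear further along the probe sequence. */
    static int rh_find(const struct rh_slot *slots, uint32_t mask,
                       uint32_t hash, uint32_t key, uint32_t *value)
    {
        uint32_t pos  = hash & mask;   /* table size is a power of two */
        int32_t  dist = 0;

        for (;;) {
            if (slots[pos].dist < 0 || slots[pos].dist < dist)
                return 0;                       /* not in the table */
            if (slots[pos].key == key) {
                *value = slots[pos].value;
                return 1;
            }
            pos = (pos + 1) & mask;
            dist++;
        }
    }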

AVL Tree:
Adelson Velskii Landis (AVL) Balanced Binary Tree.
Standard method from Russia, dating back to 1962.
Amazingly many and complicated special cases.
I don't want to program this a second time.
Start with https://en.wikipedia.org/wiki/AVL_tree

std::map:
Supposedly uses a Red-Black Balanced Binary Tree.
If someone has more info, please tell me.

std::unordered_map:
C++ hash table.
Unfortunately I don't know anything more about it.

Linear-Search:
Is a simple for() loop.
According to
https://dirtyhandscoding.github.io/posts/performance-comparison-linear-search-vs-binary-search.html
linear search should be faster with fewer than 256 entries.
I have reproduced this with the code from that page,
but not with this more complex data; I have to do more tests.
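
For completeness, this is the kind of loop meant (sketch, again using the illustrative struct element; since the table is sorted, the loop can also break early):

    #include <stddef.h>
    #include <stdint.h>

    /* Simple linear search over the sorted table; returns the index or -1.
     * Breaking early once table[i].key exceeds the searched key is possible
     * because the table is sorted ascending. */
    static ptrdiff_t linear_find(const struct element *table, size_t n,
                                 int64_t key)
    {
        for (size_t i = 0; i < n; i++) {
            if (table[i].key == key)
                return (ptrdiff_t)i;
            if (table[i].key > key)
                break;
        }
        return -1;
    }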

If you have any questions, feel free to discuss and expand on this!

Best regards,
Udo

Nominal Animal:
I find this stuff interesting.
Note: The following should not be read as criticism or anything like that; just observations and opinions based on past experience when processing a lot of data.
The intended "tone" is shop talk among colleagues.


--- Quote from: udok on November 25, 2021, 08:59:45 am ---Searched elements are 32-byte structures in an array.
The search index is an integer, so that the compare function has as little influence as possible.

--- End quote ---
If I understood correctly, the entries you compare are small – integers – but not consecutive in memory.  If so, your cache utilization may be poor, and some of the approaches you tried will be bounded by memory bandwidth.

In my experience, a hash table will provide the best lookup performance.  Personally, I use an array of pointers as the hash table itself, with a singly-linked list of colliding entries hanging off it.  A few (3-8) rounds of Xorshift32 or Xorshift64 to better distribute integer keys works well; for string keys, I prefer DJB2-xor variant computed when the key is acquired.  My entries always contain the original hash of the key very near the beginning of the entry, i.e. in C,
    struct entry {
        struct entry *next;
        size_t        hash;
        /* Key, and either data or a reference to the data */
    };
with preferably the data included in the entry itself.  I do often use C99 flexible array member at end to hold the string key.
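
As a rough sketch of the shape (the exact constants and the round count are a matter of taste; this is illustrative rather than a drop-in implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* A few rounds of Xorshift64 to spread integer keys over the table. */
    static inline uint64_t mix64(uint64_t x)
    {
        for (int i = 0; i < 4; i++) {   /* anywhere from 3 to 8 rounds works well */
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
        }
        return x;
    }

    /* DJB2-xor for string keys, computed once when the key is acquired. */
    static inline size_t djb2_xor(const char *s)
    {
        size_t h = 5381;
        while (*s)
            h = (h * 33) ^ (unsigned char)*s++;
        return h;
    }

    /* Entry with the string key held in a C99 flexible array member. */
    struct sentry {
        struct sentry *next;    /* singly-linked chain of colliding entries */
        size_t         hash;    /* original hash, kept near the beginning   */
        /* data, or a reference to the data, would go here */
        char           key[];   /* flexible array member holding the key    */
    };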

(The cache behaviour is simple: the hash table access hits one cacheline, followed by another cacheline per entry.  If the linked list pointer and hash value are within that cacheline, then the cost of each collision is just one cacheline.)

For arbitrary range lookups – when you need to find multiple entries, say entries where the key is within some min and max –, I use a separate pair of arrays as a search index.  One array contains only the keys in consecutive order, and the other contains only a pointer to the data entry.  The reason they are separate and not interleaved is to keep the keys as compact as possible.  I use a binary search to find the initial limit, and either a linear pass (if copying the keys) or a binary search to find the final limit (if not copying the keys).  It is probably not optimal, but performs very well.  Furthermore, it does not preclude also using a hash table for exact key searches.
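
A sketch of what that index looks like (illustrative names; a lower-bound binary search finds the first key >= min, then a linear pass runs up to max):

    #include <stddef.h>
    #include <stdint.h>

    /* Parallel arrays: keys[] kept compact for the binary search,
     * entries[] holding one pointer to the data per key. */
    struct range_index {
        size_t          count;
        const int64_t  *keys;      /* sorted ascending              */
        void          **entries;   /* entries[i] belongs to keys[i] */
    };

    /* Index of the first key >= min (lower bound). */
    static size_t first_at_least(const struct range_index *ix, int64_t min)
    {
        size_t lo = 0, hi = ix->count;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (ix->keys[mid] < min)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    /* Visit all entries with min <= key <= max, walking linearly from the
     * lower bound; a second binary search could find the end instead. */
    static void visit_range(const struct range_index *ix,
                            int64_t min, int64_t max,
                            void (*visit)(void *entry))
    {
        for (size_t i = first_at_least(ix, min);
             i < ix->count && ix->keys[i] <= max; i++)
            visit(ix->entries[i]);
    }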

It can seem odd, but for over a couple of decades now, cache behaviour on desktop- and server-class machines has been the dominant factor in data-processing performance in HPC.
The algorithms are very hard to compare, because microbenchmarking can yield misleading data: a cache-intensive algorithm can slow down other code, because the data that code needs is evicted from the cache during the search.  Testing a whole simulation or computation using real data, refactoring the search part for different cases, is the way to go; but one does need to remember that the results can still differ for even just slightly different data processing.

As to results, I prefer to report median wall clock access times myself, because the median gives the duration within which at least half of the searches complete.  If the operation is nontrivial, the minimum may only be achieved at very low loads (so that the task is not rescheduled) and with basically no cache pressure (from other code); and the maximum often includes reschedulings and other noise.  The mean (average) is similarly "polluted" by reschedulings and other noise.  Sometimes, depending on the exact distribution of the access times, I report a higher percentile instead of the median: the wall clock access time within which at least N% of searches complete, with N typically about 90% to 95%, thus only excluding the "rescheduling" and "cold cache" noise.

SiliconWizard:
Agree with the above. Cache is most often the determining factor on modern systems.
Micro-benchmarking, as you said, is often pointless. It just gives us a micro-reality. :)

IMO, what should matter is the usual: algorithm complexity for large N, and ease of implementation/cache considerations/particular tradeoffs for a given application, for smaller N.

Just for the record - but this is for the record and by no means a general approach - I almost always use binary search by default, unless there's an excellent reason for not doing so, which I've personally found rare. I've more than once considered using hash tables for a particular application, just to end up with binary search: hash tables are not the perfect tool either; the probability of collisions increases as N gets large, and the overhead can sometimes be hard to justify. Hash tables are also not always very good for cache use. So, it depends.

Note that using the 'bsearch()' standard function can be slightly misleading. First there is of course the overhead of it calling a callback 'order' function for each comparison, and also it may actually be implemented in various ways depending on N. I know that qsort() is not necessarily a quicksort for small N, for instance, on some implementations. Could be the same for bsearch().
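
To illustrate the difference (a sketch only: the standard bsearch() with its comparison callback versus a hand-rolled binary search the compiler can inline):

    #include <stdlib.h>

    /* Comparison callback that bsearch() calls once per probe. */
    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Standard bsearch(): generic, but pays an indirect call per compare. */
    static int *find_bsearch(int *base, size_t n, int key)
    {
        return bsearch(&key, base, n, sizeof *base, cmp_int);
    }

    /* Hand-rolled binary search: the compare is inlined by the compiler. */
    static const int *find_inline(const int *base, size_t n, int key)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (base[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (lo < n && base[lo] == key) ? &base[lo] : NULL;
    }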

TheCalligrapher:
Cache behavior and branch prediction are two factors that might easily affect the results of such comparisons unless these comparisons are deliberately set up to suppress the effects of such factors. Of course, in real life you are supposed to take advantage of these effects, not suppress them. However, the magnitude of their influence is heavily platform-dependent.

brucehoult:
In addition to what others have said...

Direct access puts heavy limitations on the range of the keys.

Hash tables put heavy limitations on how you can use the data -- you can't find keys in a range, or visit data items in order.

Trees add flexibility that often justifies their extra costs.

AVL and Red-Black trees are old tech. There are better options now. In particular, both of those require storing some kind of balance indicator in each node along with the key and pointers. This is only a bit or two of information, but alignment means it will often bloat the node structure by 32 or 64 bits. The best current option is, I believe, the scapegoat tree. It has similar average and worst case performance, but doesn't bloat each node, and is also simpler to program.

However! All of those are binary trees. Binary trees are not good (at N large enough that the whole tree is a lot bigger than cache) in the case of modern processors where a cache line is much bigger than a tree node. On a 64 bit CPU your binary tree node is likely to be 24 bytes (two pointers plus a 4 or 8 byte key). You can fit almost three tree nodes into a 64 byte cache line (same size on Intel *Lake and AMD Zen). This wastes resources and time bringing 64 bytes of data into cache and then only using 24 of them.

You'd be better to use tree nodes with, for example, structure {p0, key0, p1, key1, p2, key2, p3} which with 8 byte pointers and keys lets you use 56 out of the 64 bytes in a cache line. Do a linear search or an unrolled binary search of the keys within the node -- it doesn't really matter.

If your keys are only 32 bits then you can fit 4 keys and 5 pointers, also using 56 of the 64 bytes. In 32 bit code you can fit 7 keys and 8 pointers, using 60 of the 64 bytes.
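
For concreteness, sketches of those 64-byte layouts (kept as two small arrays here rather than the interleaved {p0, key0, p1, ...}; the size per cache line is the same, and unused key slots are assumed to be filled with a sentinel so the sketch needs no key count):

    #include <stdint.h>

    /* 64-bit keys: 3 keys + 4 child pointers = 56 of the 64 bytes. */
    struct node64 {
        struct node64 *child[4];
        int64_t        key[3];    /* sorted; unused slots hold INT64_MAX */
    };

    /* 32-bit keys on a 64-bit CPU: 4 keys + 5 child pointers = 56 bytes. */
    struct node32 {
        struct node32 *child[5];
        int32_t        key[4];
    };

    /* Linear scan within one node, then descend.  Assumes sentinel-filled
     * unused key slots, so no separate key count is stored in the node. */
    static const struct node64 *descend64(const struct node64 *n,
                                          int64_t target, int *found)
    {
        int i = 0;
        while (i < 3 && target > n->key[i])
            i++;
        *found = (i < 3 && target == n->key[i]);
        return n->child[i];
    }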

It's not easy to extend AVL or Red/Black trees to this idea, but the Scapegoat tree algorithm can be easily extended.

For really large data you have to consider TLB misses (including page faults) as a major cost. The answer to this is to make each tree node the same size as your VM page (typically 4 KB), and put a lot of keys and pointers into each node.

This brings us to classical disk-based database data structures ... the sundry variations on B-trees (B+, B*). With the characteristics of modern computers, they are relevant for very large data structures in RAM, not only on disk.
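
Just to make the arithmetic concrete: with 8-byte keys and child pointers, a 4 KB node holds on the order of 255 keys (sketch):

    #include <stdint.h>

    /* One node per 4 KB VM page: 255 keys + 256 children = 4088 bytes,
     * leaving room for a key count.  One TLB entry covers the whole node. */
    #define BNODE_KEYS 255

    struct bnode {
        uint32_t      nkeys;                  /* keys in use          */
        uint32_t      leaf;                   /* nonzero if leaf node */
        int64_t       key[BNODE_KEYS];        /* sorted ascending     */
        struct bnode *child[BNODE_KEYS + 1];  /* unused in leaf nodes */
    };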
