Author Topic: Benchmark: C/C++ Efficient Searching (Read 3744 times)

udok · « **on:** November 25, 2021, 08:59:45 am »

To bring some tangible programming info:

In the attached pictures you can find benchmarks for different
common C/C++ search methods.

In principle nothing new, but this is actual measured data,
which you can use as a basis for your programming decisions.

OS is Win10, compiler is VS2019, system is a i7-8850H laptop
(L1 cache 32kByte, L2=256k, L3=1.5MByte).

Searched elements are 32 byte structures in a field.
The search index is an integer, so that the compare function has as little influence as possible.

The field with the N structures is always sorted in ascending order.
The search values come from a second field with integer values.
This second field is either initialized with random numbers,
or sorted linearly ascending from 1 - N.

The first picture shows the results, if the element to be searched are random.
The second picture shows the result for a linear search.

Search methods:

Direct-Access:
Here we access table[map[index]] directly.
In some cases we are talking about 250 picoseconds :-)

Robin-Hood Hash Map:
A self-made hash map, which uses the Robin-Hood method.
CRC32 is used as hash function. There are some
problems (at about 1000 Elements) with collisions,
which i have to look into.
A good reference is https://codecapsule.com/2013/11/11/robin-hood-hashing/

AVL Tree:
Adelson Velskii Landis (AVL) Balanced Binary Tree.
Standard method since 1962 from Russia.
Amazingly many and complicated special cases.
I don't want to program this a seoncd time.
Start with https://en.wikipedia.org/wiki/AVL_tree

std::map:
Supposedly uses a Red-Black Balanced Binary Tree.
If someone has more info, please tell me.

std::unordered_map:
C++ hash table.
Unfortunately I don't know anything more about it.

Linear-Search:
Is a simple for() loop.
According to
https://dirtyhandscoding.github.io/posts/performance-comparison-linear-search-vs-binary-search.html
linear search should be faster with less than 256 entries.
I have reproduced this with the code from that page.
But not with this more complex data, but i have to do more tests.

If you have any questions, feel free to discuss and expand on this!

Best regards,
Udo

Nominal Animal · « **Reply #1 on:** November 25, 2021, 04:42:17 pm »

I find this stuff interesting.
Note: The following should not be read as criticism or anything like that; just observations and opinions based on past experience when processing a lot of data.
The intended "tone" is shop talk among colleagues.

Quote from: udok on November 25, 2021, 08:59:45 am

Searched elements are 32 byte structures in a field.
The search index is an integer, so that the compare function has as little influence as possible.

If I understood correctly, the entries you compare are small – integers – but not consecutive in memory. If so, your cache utilization may be poor, and some of the approaches you tried will be bounded by memory bandwidth.

In my experience, a hash table will provide the best lookup performance. Personally, I use an array of pointers as the hash table itself, with a singly-linked list of colliding entries hanging off it. A few (3-8) rounds of Xorshift32 or Xorshift64 to better distribute integer keys works well; for string keys, I prefer DJB2-xor variant computed when the key is acquired. My entries always contain the original hash of the key very near the beginning of the entry, i.e. in C,
struct entry {
struct entry *next;
size_t hash;
/* Key, and either data or a reference to the data */
};
with preferably the data included in the entry itself. I do often use C99 flexible array member at end to hold the string key.

(The cache behaviour is simple: the hash table access hits one cacheline, followed by another cacheline per entry. If the linked list pointer and hash value are within that cacheline, then the cost of each collision is just one cacheline.)

For arbitrary range lookups – when you need to find multiple entries, say entries where the key is within some min and max –, I use a separate pair of arrays as a search index. One array contains only the keys in consecutive order, and the other contains only a pointer to the data entry. The reason they are separate and not interleaved is to keep the keys as compact as possible. I use a binary search to find the initial limit, and either a linear pass (if copying the keys) or a binary search to find the final limit (if not copying the keys). It is probably not optimal, but performs very well. Furthermore, it does not preclude also using a hash table for exact key searches.

It can seem odd, but for over a couple of decades now, cache behaviour on desktop and server class machines has been dominant in data performance in HPC.
The algorithms are very hard to compare, because microbenchmarking can yield misleading data: a cache-intensive algorithm can slow up other code because the data they need is evicted from the cache during the search. Testing a whole simulation or computation using real data, refactoring the search part for different cases, is the way to go; but one does need to remember that the results can still differ for even just slightly different data processing.

As to results, I prefer to report median wall clock access times myself, because it gives the duration in which at least half of searches complete. If the operation is nontrivial, minimum may only be achieved at very low loads (so that the task is not rescheduled) and basically no cache pressure (from other code); and maximum often includes reschedulings and other noise. Mean (average) is similarly "polluted" by reschedulings and other noise. Sometimes, depending on the exact distribution of the access times, I provide a non-median confidence interval instead: the wall clock access time within which at least N% of searches complete, with N typically about 90% to 95%, thus only excluding the "rescheduling" and "cold cache" noise.

SiliconWizard · « **Reply #2 on:** November 25, 2021, 07:16:20 pm »

Agree with above. Cache is most often the determinant on modern systems.
Micro-benchmarking, as you said, is often pointless. It just gives us a micro-reality.

IMO, what should matter is the usual: algorithm complexity for large N, and ease of implementation/cache considerations/particular tradeoffs for a given application, for smaller N.

Just for the record - but this is for the record and by no means a general approach - I almost always use binary search by default, unless there's an excellent reason for not doing so, which I've personally found rare. I've more than once considered using hash tables for a particular application, just to end up with binary search, because hash tables are not the perfect tool either, the probability of collisions increases as N gets large and the overhead can sometimes be hard to justify. Hash tables are also not always very good for cache use. So, it depends.

Note that using the 'bsearch()' standard function can be slightly misleading. First there is of course the overhead of it calling a callback 'order' function for each comparison, and also it may actually be implemented in various ways depending on N. I know that qsort() is not necessarily a quicksort for small N, for instance, on some implementations. Could be the same for bsearch().

TheCalligrapher · « **Reply #3 on:** November 25, 2021, 08:44:58 pm »

Cache behavior and branch prediction are two factors that might easily affect the results of such comparisons unless these comparisons are deliberately set up to suppress the effects of such factors. Of course, in real life you are supposed to take advantage of these effects, not suppress them. However, the magnitude of their influence is heavily platform-dependent.

brucehoult · « **Reply #4 on:** November 26, 2021, 01:03:43 am »

In addition to what others have said...

Direct access puts heavy limitations of the range of the keys.

Hash tables put heavy limitations on how you can use the data -- you can't find keys in a range, or visit data items in order.

Trees add flexibility that often justifies their extra costs.

AVL and Red-Black trees are old tech. There are better options now. In particular, both of those require storing some kind of balance indicator in each node along with the key and pointers. This is only a bit or two, but will often bloat the node structure by 32 or 64 bits. The best current option is, I believe, the scapegoat tree. It has similar average and worst case performance, but doesn't bloat each node, and also is simpler to program.

However! All of those are binary trees. Binary trees are not good (at N large enough that the whole tree is a lot bigger than cache) in the case of modern processors where a cache line is much bigger than a tree node. On a 64 bit CPU your binary tree node is likely to be 24 bytes (two pointers plus a 4 or 8 byte key). You can fit almost three tree nodes into a 64 byte cache line (same size on Intel *Lake and AMD Zen). This wastes resources and time bringing 64 bytes of data into cache and then only using 24 of them.

You'd be better to use tree nodes with, for example, structure {p0, key0, p1, key1, p2, key2, p3} which with 8 byte pointers and keys lets you use 56 out of the 64 bytes in a cache line. Do a linear search or an unrolled binary search of the keys within the node -- it doesn't really matter.

If your keys are only 32 bits then you can fit 4 keys and 5 pointers, also using 56 of the 64 bytes. In 32 bit code you can fit 7 keys and 8 pointers, using 60 of the 64 bytes.

It's not easy to extend AVL or Red/Black trees to this idea, but the Scapegoat tree algorithm can be easily extended.

For really large data you have to consider TLB misses (including page faults) as a major cost. The answer to this is to made each tree node the same as your VM page size (typically 4 KB), and put a lot of keys and pointers into each node.

This brings us to classical disk-based database data structures ... the sundry variations on B-trees. (B+, B*). With the characteristic of modern computers they are relevant for very large data structures in RAM not only on disk.

golden_labels · « **Reply #5 on:** November 26, 2021, 07:26:11 am »

One more thing: comparing apples to oranges.

std::find has guarantees that none other tested method offers. That directly affects its performance, because the default implementation must respect some requirements even in the most general case. The guarantees are: works for any InputIterator (only 4 unique operations, only one way of changing position), works for any type that has the equality relation defined and works with immutable collections. Comparing it to solutions requiring objects to be hashable, comparable types or using pre-filled or mutable trees is drawing a wrong picture. Of course it will lose, but mind that competitors’ performance is not counted in cases they can’t even handle.

Hash-based structures are tricky, as their performance for larger number of items is heavily dependent on the quality of the hashing algorithm used. And choosing a good one is not a trivial task. Many think they can do that easily, in particular in an ad-hoc manner; reality shows how wrong they are, in particular with variable-length inputs. Fail to use a right hash and your performance would plunge. Use a non-randomized one for data under adversary’s control and your software is susceptible to denial of service.

Trees are reasonably fast. But, if you keep adding and removing many elements over a long period of time in a complex program, even for balanced trees memory usage may grow way beyond what you expected. Things get worse if the tree is not balanced. The situation is caused by trees — depending on the implementation style — requiring O(lg(n)²) memory or causing address space fragmentation. The latter can be remedied a bit, but at the cost of complicating the implementation. Referential locality will also suffer.

Neither of the above paragraphs indicate that performance comparisons are wrong. But they may be meaningless if they deal with generic data. Each of those algorithms may be good or bad, depending on the situation at hand and benchmarks will be meaningful if performed for those actual cases.

brucehoult · « **Reply #6 on:** November 26, 2021, 09:23:09 am »

Quote from: golden_labels on November 26, 2021, 07:26:11 am

Trees are reasonably fast. But, if you keep adding and removing many elements over a long period of time in a complex program, even for balanced trees memory usage may grow way beyond what you expected. Things get worse if the tree is not balanced. The situation is caused by trees — depending on the implementation style — requiring O(lg(n)²) memory or causing address space fragmentation. The latter can be remedied a bit, but at the cost of complicating the implementation. Referential locality will also suffer.

A tree should be strictly O(N) in size and this does not depend on whether it is balanced. Balance only affects lookup (/insert/delete) speed.

Trees don't cause fragmentation. Bad memory allocators cause fragmentation. You should use a malloc() that doesn't grow the heap indefinitely under repeated allocation and deallocation. Pools or free lists of objects of a given size are two ways to achieve this. Splitting up or merging freed blocks is done at your peril! (It *can* be made to work safely, but it's not always).

Yes, locality of reference eventually suffers. At some point you need to expect every access from one level of the tree to the next to cause any or all of a cache miss, a TLB miss, and a page fault. That's why hash tables win if your data and access patterns are compatible with them -- they normally only need one "random" memory access. The next best thing is, as in my previous post, wide and shallow trees such as the B-tree family used in disk-based databases for 50 years (or IBM's even earlier ISAM)

udok · « **Reply #7 on:** November 26, 2021, 03:10:13 pm »

I try to answer your questions:

Quote from: Nominal Animal on November 25, 2021, 04:42:17 pm

If I understood correctly, the entries you compare are small – integers – but not consecutive in memory. If so, your cache utilization may be poor, and some of the approaches you tried will be bounded by memory bandwidth.

The entries are 32 Bytes long and are in an array, (named "tbl" with "tbl_size" entries).
They represent the unknown user data.

Each entry has a unique number as search key (the "id"), and the entries are sorted in ascending order.
Otherwise binary searching would not work.

The test generates an array of integers, the "map". The map contains integers from 1 to N
in random order (result are in first picture) or in ascending order (results are in second picture).

The test code iterates through the map, reads the integer from the map,
and then searches for an entry who's id is the value from the map.

The reported time is the run time for the whole test divided by (map_size * runs).
Note that the pictures have a logarithmic scale.

Code: [Select]

        // start time
        t0 = PerfCountStamp();

        for (k = 0; k < runs; k++)
        {
            for (n = 0; n < map_size; n++)
            {
                index = map[n];
                ASSERT(index < tbl_size);
                // Search function is inlined, "data" is the search data structure (std::map or hash table ...)
                // Return value is a pointer with the user data entry.
                lp = search_fp(tbl, tbl_size, index, data);  
                ASSERT(lp);
                id = lp->id;
                ASSERT(id == index); 
                // prevents the compiler from optimizing the loop away
                sum += id;
            }
        }
        
        // end time
        t1 = PerfCountStamp();
        td = t1 - t0;
        sum += td;

udok · « **Reply #8 on:** November 26, 2021, 03:34:40 pm »

Quote from: SiliconWizard on November 25, 2021, 07:16:20 pm

Agree with above. Cache is most often the determinant on modern systems.
Micro-benchmarking, as you said, is often pointless. It just gives us a micro-reality.

With some guessing you can even read the cache size from the pictures with the random keys.
The whole point is to show the performance of a real machine with different search algorithms.
Clearly the cache is often more important than the CPU speed.

Micro-benchmarking is not pointless as it is the base for selecting a search structure.
It is the performance which is guaranteed that you will not reach in practice

.

But if your application has no need for fast searches, then this special benchmark is not
relevant for your application :-)

Quote

IMO, what should matter is the usual: algorithm complexity for large N, and ease of implementation/cache considerations/particular tradeoffs for a given application, for smaller N.

What matters is the resulting speed. Normally this algorithms are implemented in a standard library.

Quote

Just for the record - but this is for the record and by no means a general approach - I almost always use binary search by default, unless there's an excellent reason for not doing so, which I've personally found rare. I've more than once considered using hash tables for a particular application, just to end up with binary search, because hash tables are not the perfect tool either, the probability of collisions increases as N gets large and the overhead can sometimes be hard to justify. Hash tables are also not always very good for cache use. So, it depends.

Note that using the 'bsearch()' standard function can be slightly misleading. First there is of course the overhead of it calling a callback 'order' function for each comparison, and also it may actually be implemented in various ways depending on N. I know that qsort() is not necessarily a quicksort for small N, for instance, on some implementations. Could be the same for bsearch().

Binary search is OK for most applications where searches are not the limiting factor.
But a hash table is almost always at least 10 times faster.

Note that the "bumps" in the hash table results are due to hash collisions.
A better hash function would give constant time access of less than 2 nS when
the entries fit in L1 cache and less than 10 nS for L2 cache.

Note for example, that the AVL Tree search is extremly fast if the values fit in L1 or L2 cache in
comparison to bsearch.
If the values do not fit into L2 cache, bsearch wins.
This is due to "branchless" implementation of the AVL Tree search.

udok · « **Reply #9 on:** November 26, 2021, 03:41:53 pm »

Quote from: TheCalligrapher on November 25, 2021, 08:44:58 pm

Cache behavior and branch prediction are two factors that might easily affect the results of such comparisons unless these comparisons are deliberately set up to suppress the effects of such factors. Of course, in real life you are supposed to take advantage of these effects, not suppress them. However, the magnitude of their influence is heavily platform-dependent.

This tests mainly show cache effects, branch predictions, and implementation details.
For example in theory the hash search should be a straigth line because it is O(1).
It praxis it is not.

udok · « **Reply #10 on:** December 04, 2021, 10:42:28 pm »

I have made an update to the benchmarks.
Pictures are attached.

The first picture shows the search results for random integer keys.

Now I tested some "branchless" search methods.

With the branchless versions, an "if-else" is replaced by a
a "conditional-move".

"AVL-tree" and "bsearch-inline" are partially branchless.
"bsearch-branchless" is completely branchless,
which means that the search is always terminated after log2(N) comparisons.

Binary search methods jump through the array quite random,
and the CPU branchpredictor does not work well with random jumps.
The conditional move instructions does not have this problem, because
the same code path is always executed, and thus the CPU pipeline stays full.

This brings a considerable speed advantage, provided that
the data fits into the L2 cache.

The second picture shows the search results for random zero terminated strings.

Here you can see, that in practice the differences between the
search methods are much smaller, because the relatively slow strcmp()
function dominates the time.

Only the hash search methods stand out positively, because the strcmp()
function is executed only once.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: Benchmark: C/C++ Efficient Searching (Read 3744 times)

udok

Benchmark: C/C++ Efficient Searching

Nominal Animal

Re: Benchmark: C/C++ Efficient Searching

SiliconWizard

Re: Benchmark: C/C++ Efficient Searching

TheCalligrapher

Re: Benchmark: C/C++ Efficient Searching

brucehoult

Re: Benchmark: C/C++ Efficient Searching

golden_labels

Re: Benchmark: C/C++ Efficient Searching

brucehoult

Re: Benchmark: C/C++ Efficient Searching

udok

Re: Benchmark: C/C++ Efficient Searching

udok

Re: Benchmark: C/C++ Efficient Searching

udok

Re: Benchmark: C/C++ Efficient Searching

udok

Re: Benchmark: C/C++ Efficient Searching

Share me