I find this stuff interesting.
Note: The following should not be read as criticism or anything like that; just observations and opinions based on past experience when processing a lot of data.
The intended "tone" is shop talk among colleagues.
The searched elements are 32-byte structures in an array.
The search key is an integer, so that the comparison function has as little influence on the results as possible.
If I understood correctly, the entries you compare are small – integers – but not consecutive in memory. If so, your cache utilization may be poor, and some of the approaches you tried will be bounded by memory bandwidth.
In my experience, a hash table will provide the best lookup performance. Personally, I use an array of pointers as the hash table itself, with a singly-linked list of colliding entries hanging off each slot. A few (3-8) rounds of Xorshift32 or Xorshift64 to better distribute integer keys works well; for string keys, I prefer the DJB2-xor variant, computed when the key is acquired. My entries always contain the original hash of the key very near the beginning of the entry, i.e. in C,
    struct entry {
        struct entry *next;
        size_t        hash;
        /* Key, and either the data or a reference to the data */
    };

preferably with the data included in the entry itself. For string keys, I often use a C99 flexible array member at the end of the entry to hold the key.
(The cache behaviour is simple: the hash table access hits one cacheline, followed by another cacheline per entry. If the linked list pointer and hash value are within that cacheline, then the cost of each collision is just one cacheline.)
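For concreteness, here is a minimal sketch of that layout and lookup for integer keys. The four mixing rounds, the power-of-two table size, and all the names are my own choices for illustration, not anything canonical:

    #include <stddef.h>
    #include <stdint.h>

    struct entry {
        struct entry *next;  /* Collision chain */
        size_t        hash;  /* Original hash of the key */
        int64_t       key;   /* The integer key itself */
        /* Data, or a reference to the data, follows */
    };

    /* A few rounds of Xorshift64 to spread integer keys across buckets. */
    static inline size_t hash_key(uint64_t x)
    {
        for (int round = 0; round < 4; round++) {
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
        }
        return (size_t)x;
    }

    /* table is an array of 'size' bucket head pointers; this sketch
       assumes size is a power of two, so that masking picks a bucket. */
    static struct entry *lookup(struct entry **table, size_t size, int64_t key)
    {
        const size_t  hash = hash_key((uint64_t)key);
        struct entry *e;

        for (e = table[hash & (size - 1)]; e != NULL; e = e->next)
            if (e->hash == hash && e->key == key)
                return e;

        return NULL;
    }

Comparing the stored hash before the key costs nothing extra for integer keys, but for string keys it skips almost every non-matching chain entry without touching the key bytes at all.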
For arbitrary range lookups – when you need to find multiple entries, say all entries whose key falls between some min and max – I use a separate pair of arrays as a search index. One array contains only the keys in sorted order, and the other contains only a pointer to each data entry. The reason they are separate and not interleaved is to keep the keys as compact as possible. I use a binary search to find the initial limit, and either a linear pass (if copying the keys out anyway) or a second binary search (if not) to find the final limit. It is probably not optimal, but it performs very well. Furthermore, it does not preclude also using a hash table for exact-key searches.
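A sketch of that index, with the same naming caveat as before (lower_bound and upper_bound here are the usual half-open binary searches written out by hand, not library calls):

    #include <stddef.h>
    #include <stdint.h>

    /* Search index: keys[] sorted ascending, data[i] points to the
       entry whose key is keys[i]. Kept separate, not interleaved,
       so the key array stays as compact as possible. */
    struct index {
        size_t    count;
        int64_t  *keys;   /* Sorted keys only */
        void    **data;   /* Pointer to each data entry */
    };

    /* First position with keys[i] >= min. */
    static size_t lower_bound(const int64_t *keys, size_t n, int64_t min)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            const size_t mid = lo + (hi - lo) / 2;
            if (keys[mid] < min)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    /* First position with keys[i] > max. */
    static size_t upper_bound(const int64_t *keys, size_t n, int64_t max)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            const size_t mid = lo + (hi - lo) / 2;
            if (keys[mid] <= max)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    /* Visit every entry with min <= key <= max. */
    static void range_visit(const struct index *ix, int64_t min, int64_t max,
                            void (*visit)(void *entry))
    {
        size_t       i   = lower_bound(ix->keys, ix->count, min);
        const size_t end = upper_bound(ix->keys, ix->count, max);
        for (; i < end; i++)
            visit(ix->data[i]);
    }

The two binary searches only ever touch the compact key array; the data pointers are loaded only for entries actually in range.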
It may seem odd, but for over two decades now, cache behaviour on desktop- and server-class machines has dominated data-processing performance in HPC.
The algorithms are very hard to compare, because microbenchmarking can yield misleading data: a cache-intensive algorithm can slow down other code, because the data that code needs gets evicted from the cache during the search. Testing the whole simulation or computation on real data, refactoring the search part for the different cases, is the way to go; but one does need to remember that the results can still differ for even slightly different data processing.
As to results, I prefer to report median wall-clock access times myself, because the median gives the duration within which at least half of the searches complete. If the operation is nontrivial, the minimum may only be achieved at very low load (so that the task is not rescheduled) and under essentially no cache pressure (from other code); and the maximum often includes reschedulings and other noise. The mean (average) is similarly "polluted" by reschedulings and other noise. Sometimes, depending on the exact distribution of the access times, I report a higher percentile instead of the median: the wall-clock access time within which at least N% of the searches complete, with N typically around 90 to 95, thus excluding only the "rescheduling" and "cold cache" noise.
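If it helps, such a quantile is trivial to compute from raw timing samples with the nearest-rank method; this sketch is my own phrasing, not any particular library's API:

    #include <stdlib.h>
    #include <string.h>

    static int cmp_double(const void *a, const void *b)
    {
        const double x = *(const double *)a;
        const double y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Wall-clock time within which at least 'frac' of the sampled
       operations completed: frac = 0.5 for the median, 0.95 for the
       95% level. Nearest-rank method; sorts a copy of the samples.
       Returns -1.0 on empty input or allocation failure. */
    static double time_quantile(const double *samples, size_t count,
                                double frac)
    {
        if (count == 0)
            return -1.0;

        double *sorted = malloc(count * sizeof *sorted);
        if (!sorted)
            return -1.0;
        memcpy(sorted, samples, count * sizeof *sorted);
        qsort(sorted, count, sizeof *sorted, cmp_double);

        size_t i = (size_t)(frac * (double)(count - 1) + 0.5);
        if (i >= count)
            i = count - 1;

        const double result = sorted[i];
        free(sorted);
        return result;
    }

The one thing to be careful about is collecting the raw per-operation samples cheaply enough that the measurement itself does not perturb the cache behaviour being measured.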