-
Do you know a good C hash library, or have you written your own?
The libs and examples I looked at are toys, with wasted memory
and mediocre performance.
I do not want two pointers, to key and value, as usual.
One pointer to the user data is enough, as the key is always
included in the user data.
I have coded my own textbook open-addressing hash table,
but I am not satisfied with the result.
I used lazy deletion (tombstones), and I have experienced that this
is a no-go in situations with regular deletions and insertions.
The table fills up rather quickly with tombstones.
I have looked at the Rust hash implementation (the Rust code is mostly excellent),
but the code is complicated and too long for my brain.
Rust uses the Google SwissTable idea; in the past they used Robin Hood hashing.
The hash function I want to use is the hardware-accelerated Intel CRC32C function.
In my tests it is faster and has fewer collisions than most of the other
functions I have tested.
I definitely want something simple, in a single C file with no more than 500 lines.
Any opinions?
-
I would expect that any well written generic hash table implementation in C would have separate key and value pointers. That makes it possible to implement dictionaries between arbitrary types even when you can't change the value type to include the key. It also reduces how often you have to write a custom wrapper function around the key hash and comparison functions.
In my experience, if C programmers are looking for more optimization, they implement a custom hash table for their datatype.
-
I use lookup3. Small, fast, simple.
CRC would be a bit suspect as a hash function. A desired property is that, on average, changing one bit in the input changes half the bits in the hash. I don't think CRC satisfies this and you will get some weird distribution.
-
I tested the hash functions with a modified version of Peter Kankowski's test program (https://www.strchr.com/hash_functions).
Compiler is the Microsoft WinDDK 64-bit cl.exe with -O2.
lookup3 was one of the better functions, but it has small problems with the dic_common_words.txt test case
(words like: adult, man, woman, child, boy, girl, thing, cube, ball).
My test results:
[0]$ ./HashOpenAddress.exe -r500 dic_common_words450.txt
# --------------------------------------------------------------------------------
# Lines read: 450
# Table size: 512 (9 bits)
# Fill-factor: 0.88
# Mean chars in line: 6
# --------------------------------------------------------------------------------
# Id  Hash-Function    Insert-Cycles  Search-Cycles  Coll-Rate  Max-Coll  Total-Coll  Used-RAM
# --------------------------------------------------------------------------------
1 K&R 34 58 2.99 69 1345 4096
2 FNV1A 43 46 2.46 65 1107 4096
3 FNV1A_unsigned 43 46 2.46 65 1107 4096
4 FNV4 51 52 3.71 62 1669 4096
5 FNV1A_unrolled8 35 38 1.87 37 843 4096
6 Paul_Hsieh 30 53 2.77 63 1248 4096
7 Lookup3 49 52 4.41 134 1985 4096
8 Murmur2 27 37 2.60 70 1170 4096
9 Murmur2A 41 42 3.11 64 1401 4096
10 SBox 32 54 2.55 65 1147 4096
11 CRC32_SW 41 42 2.69 90 1211 4096
12 CRC32C_HW64 36 39 2.78 71 1250 4096
The fill factor is high (0.88) to add some stress.
The Insert- and Search-Cycles columns are the number of clock cycles measured
with the rdtsc assembler instruction.
Max-Coll is the maximum number of collisions; lookup3 has the highest
collision rate here, even higher than Kernighan & Ritchie, but it is very fast to calculate.
All of the hash functions shown are good for a large number of test cases,
except K&R (Kernighan & Ritchie), which is only usable for some test cases.
-
Here is a second test case with
100000 strings of the form N$0, N$1, N$2, up to N$99999.
[0]$ ./HashOpenAddress.exe dic_nnn.txt
# --------------------------------------------------------------------------------
# Lines read: 100000
# Table size: 131072 (17 bits)
# Fill-factor: 0.76
# Mean chars in line: 6
# --------------------------------------------------------------------------------
# Id  Hash-Function    Insert-Cycles  Search-Cycles  Coll-Rate  Max-Coll  Total-Coll  Used-RAM
# --------------------------------------------------------------------------------
1 K&R 1132 975 138.34 4265 13834390 1048576
2 FNV1A 70 91 1.52 111 151813 1048576
3 FNV1A_unsigned 70 91 1.52 111 151813 1048576
4 FNV4 77 99 1.60 157 159604 1048576
5 FNV1A_unrolled8 78 100 1.69 129 168592 1048576
6 Paul_Hsieh 158 170 6.89 619 688701 1048576
7 Lookup3 72 93 1.59 107 159348 1048576
8 Murmur2 69 89 1.65 117 164728 1048576
9 Murmur2A 73 92 1.58 107 158448 1048576
10 SBox 68 86 1.65 186 164985 1048576
11 CRC32_SW 68 76 1.98 136 197640 1048576
12 CRC32C_HW64 34 48 0.81 35 81457 1048576
Clearly not a test case that the K&R or the Paul Hsieh function was designed for.
-
I would expect that any well written generic hash table implementation in C would have separate key and value pointers. That makes it possible to implement dictionaries between arbitrary types even when you can't change the value type to include the key. It also reduces how often you have to write a custom wrapper function around the key hash and comparison functions.
In my experience, if C programmers are looking for more optimization, they implement a custom hash table for their datatype.
Yes, I have not found anything that comes close to my wish list.
My typical C programs are not like Python scripts, and the key
is always included in the data structure.
-
What is your typical use case for a hash library (with a little more detail)? That in itself would certainly help determine the better option, and whether a "hash map" is even the best approach. Just a thought.
-
I always use custom data types and pick hash functions for the use case. For keywords (using ASCII alphanumeric characters), I often use DJB2 (xor variant), because it is fast and good enough, and easy to incorporate into the lexer so that each token becomes a triplet (pointer, length, hash).
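As an illustration only (the names and the delimiter rule are made up for the example), the DJB2 xor variant folded into a scanner that produces such a triplet might look like:
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: DJB2 (xor variant) computed while scanning a token,
 * yielding the (pointer, length, hash) triplet. */
struct token_ref {
    const char *ptr;
    size_t      len;
    uint32_t    hash;
};

static struct token_ref scan_word(const char *p)
{
    struct token_ref t = { p, 0, 5381u };          /* DJB2 initial value */
    while (p[t.len] && p[t.len] != ' ') {          /* toy delimiter rule */
        t.hash = (t.hash * 33u) ^ (uint8_t)p[t.len];
        t.len++;
    }
    return t;
}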
The code I write typically uses hash tables for human interface stuff; parsing text and such. It tends to be an insignificant part of the total runtime and resource use.
I use various trees, heaps, and disjoint set implementations much more often. (For molecular simulations, I even have my own data structures based on Verlet lists, which I probably should document one day; could be useful for other point-like interaction models, where the maximum interaction distance is limited, but the points move with respect to each other.)
I'm sure the use patterns vary a lot based on exactly what kind of problems one works on, so perhaps providing more information about your use cases would focus the discussion a bit.
-
The Bernstein hash function is not the fastest one,
but it works very well in all of my test cases.
But to clarify, the hash function is not my problem.
The hash function I want to use is the Intel CRC32C one,
because it has few collisions and is fast.
And it is small and simple, only 88 bytes.
The drawback is that it runs only on x64, but nothing is perfect.
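For illustration, a hardware CRC32C built on the SSE4.2 intrinsics might look roughly like this (just a sketch, not the exact routine from my tests):
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u64, _mm_crc32_u8 */

/* Sketch of a hardware-accelerated CRC32C over a byte buffer (x64 only). */
static uint32_t crc32c_hw(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t crc = 0xFFFFFFFFu;              /* conventional initial value */
    while (len >= 8) {                       /* 8 bytes per instruction */
        uint64_t v;
        memcpy(&v, p, sizeof v);
        crc = _mm_crc32_u64(crc, v);
        p += 8;
        len -= 8;
    }
    while (len--)                            /* trailing bytes */
        crc = _mm_crc32_u8((uint32_t)crc, *p++);
    return (uint32_t)crc ^ 0xFFFFFFFFu;      /* conventional final inversion */
}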
My problem is the hash table data structure.
I want something like the C++ STL unordered map in C,
but without the code bloat and syntactic bloat of the STL.
I have looked at the FreeBSD code, and searched on GitHub,
and found nothing in plain C.
Maybe the STL is the real reason for the C++ success.
-
You should probably try Google's "Cityhash" and the newer "Metrohash". Probably "spookyhash" too, a newer hash from the lookup3 guy.
I think metro is the fastest of those, with no quality difference.
-
To clarify: you want the key stored in your own struct (for example, myitem_s):
struct myitem_s {
    char key[10];
    int datac;
    int datab;
};
You are happy to have a separate table that just contains pointers to keys. Remember that you know the offset of the key into your struct, so from the key pointer you can get the address of your myitem container (see the sketch after the table struct below). It also means that your container can be in different tables at the same time, using different keys.
struct tableItem_s {
    void *key;
} htable[TSIZE];
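Recovering the container from a stored key pointer is just pointer arithmetic, for example (illustrative; here key happens to be the first member, so the offset is 0):
#include <stddef.h>   /* offsetof */

/* Given the key pointer stored in the table, get back the containing item. */
static struct myitem_s *item_from_key(void *key)
{
    return (struct myitem_s *)((char *)key - offsetof(struct myitem_s, key));
}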
For the hash table code to be generic and reusable with different items, you want to pass in functions that know how to extract and compare keys, be they numeric, string, etc. Sounds like a fun exercise.
Oooh, here is a neat trick: see that void *? You will probably find that you don't use all of its bits, so you can use some as flags. If the pointer you are referencing is always even, you get a bit flag for free in the LSB ;)
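A tiny sketch of that trick (assuming the stored pointers are at least 2-byte aligned; names made up):
#include <stdint.h>

#define FLAG_BIT ((uintptr_t)1)   /* low bit is free when pointers are even */

static void *tag_set(void *p)   { return (void *)((uintptr_t)p | FLAG_BIT); }
static void *tag_clear(void *p) { return (void *)((uintptr_t)p & ~FLAG_BIT); }
static int   tag_isset(void *p) { return ((uintptr_t)p & FLAG_BIT) != 0; }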
Why doesn't a generic solution like this already exist in under 500 lines?
I guess that these problems balloon as the solution is made more generic and able to handle solutions to other problems.
-
Let's say I was writing a code editor for different programming languages, and had a lexer/tokenizer for each language that produces a (string, length, hash) triplet for each token. Each token should be looked up primarily for syntax highlighting, but occasionally for references as well, so we'll use a hash table. For e.g. C, we actually use a stack of hash tables, with the bottom table holding the C keywords, and each table on top holding the variables, functions, macros etc. in each scope.
I've observed that data locality is important for efficiency, and prefer a C99 flexible array member to store the token string, with each hash table slot a pointer to a structure. Since each structure involves a pointer dereference anyway, making the hash table slot into a singly-linked list incurs no extra penalty. When adding a new member into a slot, it is the maximum length of the list (instead of how many of the hash table entries are populated) that determines whether the table should be extended or not.
Let's assume the following data structure suffices for each token:
struct token {
    struct token *next;   /* Next entry in the same hash table slot */
    token_type_t  type;   /* Token classification (function, variable, reserved keyword, etc.) */
    highlight_t   high;   /* Highlighting information (index to font+color table, themeable) */
    struct refs  *refs;   /* References and other only occasionally needed data */
    hash_t        hash;   /* Actual hash of the data */
    size_t        size;   /* Length of the data */
    char          data[]; /* Data */
};
The struct refs *refs; member could even be NULL or only partially populated, until the user first selects the token, since it is not that common for the user to be interested in the original definition of the token and so on.
Although I'd need the hash tables to form a stack, array, or list, let's just concentrate on the single hash table case for now. My initial unit tests and experiments to see if this data structure works (and how to make the editor as efficient as possible by delaying slow operations, like looking up the references (where a function was originally defined, other places where it is used), until they are actually needed) would start with the following hash table data structure:
struct table {
    size_t         size;  /* Number of slots */
    struct token **slot;  /* Each slot is a singly-linked list. */
    uint32_t       cost;  /* Filtered average cost of access. */
};

#define TABLE_INIT { 0, NULL, 0 }
As usual, the slot index is always (hash % size).
The cost field is the best adaptable heuristic I've found useful. The idea is that a value of 0 indicates every access was resolved immediately (meaning all slots have at most a single entry), and some arbitrary limit, say 4096, indicates the hash table should be resized, because access cost is too high (due to the slot lists being too long).
Each access uses some kind of heuristic to estimate the cost of the access. A simple possibility would be just use the number of entries in the linked list in the target slot needed to traverse before a resolution was found, multiplied by some number. For example, multiplied by 1024 would indicate that if many consecutive accesses need to traverse four or more slots, the table needs to be resized. This multiplier would be a tunable, obviously.
To average the value, a simple filter is used. If cost is the cost in the hash table structure, and curr is the cost of the current access, then the general formula is cost = (cost * (N - 1) + curr) / N; where N is a suitable power of two for efficiency; or cost = (cost * (N - A) + curr * A) / N; where A is a (small) odd positive number, and N is a suitable power of two. This means that cost changes slowly, during multiple accesses, and is not affected too much by a rare single high-cost access.
Again, the cost approach is a heuristic, because in my experience, using only the list length as an absolute limit does not yield the kind of balance between table size and access cost that I'm after: sometimes you just hit a bad table size wrt. the hash function and get some collisions, but those collisions don't affect performance much.
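To make the idea concrete, a lookup with the filtered cost update might look roughly like this (a sketch using the structures above; the constants are the tunables mentioned earlier, and <string.h> is assumed for memcmp):
#define COST_MULT   1024u   /* cost added per extra list node traversed */
#define COST_FILTER 16u     /* N: filter length, a power of two */
#define COST_LIMIT  4096u   /* consider resizing when the filtered cost exceeds this */

struct token *table_find(struct table *t, hash_t hash, const char *data, size_t size)
{
    if (!t || !t->size)
        return NULL;                          /* empty table: nothing to find */

    uint32_t      curr = 0;                   /* cost of this particular access */
    struct token *p = t->slot[hash % t->size];
    for (; p; p = p->next) {
        if (p->hash == hash && p->size == size && !memcmp(p->data, data, size))
            break;
        curr += COST_MULT;                    /* one more list node traversed */
    }

    /* Filtered average: cost = (cost*(N-1) + curr) / N */
    t->cost = (t->cost * (COST_FILTER - 1u) + curr) / COST_FILTER;
    /* The caller resizes the table when t->cost >= COST_LIMIT. */
    return p;
}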
In this particular case, I'd probably observe during testing that the cost approach is not needed, and because each hash table slot only takes a single pointer, I can simply keep the hash table size between, say, (number of entries) and 3*(number of entries). So, I'd end up replacing all the cost stuff with a simple size_t entries; member, incremented whenever a new entry is inserted, decremented when deleted, and both operations resize the hash table when needed. Note that the resizing must have hysteresis (Schmitt triggering): you want the target resized size (2*(number of entries)) to be in the middle of the "keep this size" range, so that a sequence of adding and then deleting an entry does not cause the table to be resized every time.
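A sketch of that hysteresis, assuming the cost member has been replaced by the size_t entries; member described above:
/* Keep size within [entries, 3*entries]; when outside, resize to 2*entries,
 * the middle of the band, so adds/deletes near the edge do not thrash. */
static int table_maybe_resize(struct table *t)
{
    if (t->size >= t->entries && t->size <= 3 * t->entries)
        return 0;                      /* inside the band: keep the current size */
    size_t target = 2 * t->entries;
    if (target < 1024)
        target = 1024;                 /* illustrative lower bound */
    if (target == t->size)
        return 0;
    return table_resize(t, target);
}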
To resize the hash table, all I need is to save the old slot array in a temporary variable, allocate a new one, move each entry to the new one –– because the hash is already in the linked list structure, it's all only pointer work –– and finally free the old array. This means the resize cost is relatively small. Untested initial implementation:
#include <stdlib.h>
#include <errno.h>

int table_resize(struct table *t, size_t newsize)
{
    /* Sanity checks; I like these.. */
    if (!t || newsize < 1)
        return errno = EINVAL;

    struct token **newslot;
    newslot = malloc(newsize * sizeof *newslot);
    if (!newslot)
        return errno = ENOMEM;

    /* Initialize all entries to NULL. We could have also used calloc() above. */
    for (size_t i = 0; i < newsize; i++)
        newslot[i] = NULL;

    struct token **oldslot = t->slot;
    size_t oldsize = t->size;

    /* Move every entry into the new slot array; since the hash is stored in
     * each entry, this is pointer work only. */
    for (size_t o = 0; o < oldsize; o++) {
        struct token *list = oldslot[o];
        while (list) {
            struct token *curr = list;
            list = list->next;
            const size_t i = curr->hash % newsize;
            curr->next = newslot[i];
            newslot[i] = curr;
        }
    }

    t->size = newsize;
    t->slot = newslot;
    free(oldslot);
    return 0;
}
In the application code, each table variable is declared and defined as
struct table example = TABLE_INIT;
so that there is no special "initialize table" call. Lookups will notice size == 0 and always return "none found", and first insertion will use a compile-time default size for the slot array. (Since each slot is only a pointer in size, it makes sense to start with relatively large tables, say 1024 entries or so, when each application will have a relatively small number of such hash tables. The compile-time default size thus needs to be user-overridable (#ifndef DEFAULT_SLOTS, #define DEFAULT_SLOTS 1024, #endif), and have a comment about sizing it depending more on how many such hash tables ones own application expects to have in use at the same time, as opposed to trying to guess how large it will eventually grow.)
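An insert along those lines might look like this (another untested sketch in the same spirit as the resize function above, assuming the same headers; the first insertion allocates the slots lazily using the compile-time default):
#ifndef DEFAULT_SLOTS
#define DEFAULT_SLOTS 1024   /* see the sizing comment above; user-overridable */
#endif

int table_insert(struct table *t, struct token *tok)
{
    if (!t || !tok)
        return errno = EINVAL;

    /* First insertion: allocate the slot array lazily. */
    if (!t->size) {
        int err = table_resize(t, DEFAULT_SLOTS);
        if (err)
            return err;
    }

    const size_t i = tok->hash % t->size;
    tok->next = t->slot[i];          /* push onto the slot's singly-linked list */
    t->slot[i] = tok;
    /* With the size_t entries; variant described above, t->entries++ goes here. */
    return 0;
}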
TL;DR: Apologies for the verbosity, but I thought that showing how my development process would work, and how I'd end up with whatever I'd end up using, including a wrinkle – "access cost" – that might sometimes be useful but in this particular case is better replaced with something else – would be useful. At minimum, it should show why I don't really reach for a generic solution as my first step.
In development terms, I must emphasize that when I am considering this stuff, I am also thinking a lot about what the application needs, and how it could do it efficiently. You could say that I'm even getting twice my money's worth for the time I spend writing the above code, because as I write the code, I'm basing it on my understanding of the problem. Similar to the Rubber Duck method, I've found that this helps me better understand the needs and use cases in my target application or service; and that understanding is often more valuable to me than the code I produce at this phase.
-
OK, how about the following. Please don't crucify me, I haven't tested the code so it may need some special touching.
htable.h
#ifndef __HTABLE_H__
#define __HTABLE_H__

#include <stdint.h>

#define HTABLEDELETED UINT32_MAX

struct hitem_s {
    union {
        void *userItem;
    };
};

struct htable_s {
    int (*hash)(void* a, int size);
    int (*cmp)(void* a, void* b);
    int free;
    int size;
    struct hitem_s* htable;
};

void hashInsert(void* userItem, struct htable_s* htable);
struct hitem_s* hashFind(void* userItem, struct htable_s* htable);
void hashDelete(struct htable_s* htable, struct hitem_s* hitem);
void hashCopy(struct htable_s* htable, struct htable_s* newTable);

#endif /* __HTABLE_H__ */
htable.c
#include <stdint.h>

#include "htable.h"

void
hashInsert(void* userItem, struct htable_s* htable) {
    /* This function MUST only be called if free > 0.
     * This function MUST only be called if the key does not already exist in the table.
     */
    struct hitem_s* hitem = &htable->htable[htable->hash(userItem, htable->size)];
    struct hitem_s* litem = &htable->htable[htable->size];

    while (1) {
        if ((hitem->userItem == (void*)0) || (hitem->userItem == (void*)HTABLEDELETED)) {
            hitem->userItem = userItem;
            htable->free--;
            break;
        }
        /* Collision, move to the next slot (wrapping around) and repeat until
         * we find a free slot.
         */
        if (++hitem == litem) {
            hitem = htable->htable;
        }
    }
}

struct hitem_s*
hashFind(void* userItem, struct htable_s* htable) {
    /* This function returns 0 if the item cannot be found,
     * or the address of the table slot holding it if it is found.
     */
    struct hitem_s* hitem = &htable->htable[htable->hash(userItem, htable->size)];
    struct hitem_s* litem = &htable->htable[htable->size];

    while (1) {
        if (hitem->userItem == (void*)0) {
            return (struct hitem_s*)0;
        }
        if ((hitem->userItem != (void*)HTABLEDELETED) && (htable->cmp(hitem->userItem, userItem) == 0)) {
            return hitem;
        }
        /* We didn't find our item yet, keep searching.
         */
        if (++hitem == litem) {
            hitem = htable->htable;
        }
    }
}

void
hashDelete(struct htable_s* htable, struct hitem_s* hitem) {
    hitem->userItem = (void*)HTABLEDELETED;
    htable->free++;
}

void
hashCopy(struct htable_s* htable, struct htable_s* newTable) {
    /* newTable MUST be initialised, sized and allocated,
     * and all of its entries MUST be zeroed.
     */
    int i;
    for (i = 0; i < htable->size; i++) {
        struct hitem_s* item = &htable->htable[i];
        if ((item->userItem == (void*)0) || (item->userItem == (void*)HTABLEDELETED)) {
            continue;
        }
        hashInsert(item->userItem, newTable);
    }
}
main.c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#include "htable.h"

struct myItem_s {
    char name[10];
    int age;
};
typedef struct myItem_s myitem_t;

int myHashFunc(void *a, int size) {
    char* ptr = ((struct myItem_s*)a)->name;
    int hash = 0;
    while (*ptr) {
        hash += *ptr;
        ptr++;
    }
    return hash % size;
}

int myCmpFunc(void* a, void* b) {
    return strcmp(((struct myItem_s*)a)->name, ((struct myItem_s*)b)->name);
}

int main()
{
#define TSIZE 5
    static struct hitem_s items[TSIZE];
    static struct htable_s htable = { myHashFunc, myCmpFunc, TSIZE - 1, TSIZE, items };
    struct myItem_s people[4] = {
        { "bill", 42 },
        { "bob",  31 },
        { "dave", 52 },
        { "fred", 10 }
    };

    hashInsert(&people[0], &htable);
    hashInsert(&people[1], &htable);
    hashInsert(&people[2], &htable);
    hashInsert(&people[3], &htable);

    {
        struct myItem_s person[1] = { { "dave", 0 } };
        struct hitem_s* found = hashFind(person, &htable);
        if (found) {
            struct myItem_s* ptr = found->userItem;
            printf("%s %d\n", ptr->name, ptr->age);
        }
    }
    return 0;
}
Think about how you will call each function; don't jump in blindly.
I.e. you want to add an item: think about whether you know if the item is already in the table.
You could search for it if you are not sure. Then check to see if there will be space; you may need to grow the table by creating a second one and calling hashCopy() to populate it (see the sketch below). It looks fairly flexible, but the more you look at it the more you wonder if you should just implement it inline with the code you want and not bother with it at all :-//
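A rough, untested sketch of that grow-by-copy step (assuming the old items array was allocated on the heap):
#include <stdlib.h>

/* Allocate a bigger (zeroed) table, re-insert everything, swap it in. */
static int hashGrow(struct htable_s* htable, int newSize) {
    struct hitem_s* items = calloc(newSize, sizeof(*items));
    if (!items)
        return -1;
    struct htable_s newTable = { htable->hash, htable->cmp, newSize - 1, newSize, items };
    hashCopy(htable, &newTable);
    free(htable->htable);   /* only valid if the old array was heap-allocated */
    *htable = newTable;
    return 0;
}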
Have fun, MIT license, enjoy.
NB: the code is for 32-bit pointers; if you want 16 or 64 bit, edit the header and pick an appropriate value for HTABLEDELETED.
-
Added a github project https://github.com/amwales-888/htable
-
If I were asked to do a generic hash table interface, then its use for a string dictionary would probably look like this:
typedef struct {
    size_t len;
    char *val;
} dict_value;

#define HASHLIB_VALUE  dict_value
#define HASHLIB_PREFIX dict_
#include "hashlib.h"
The interface exposed by that would be approximately
typedef struct dict_entry {
    struct dict_entry *next;
    size_t hash;
    dict_value val;
    size_t len;
    char key[];
} dict_entry;

typedef struct dict_table {
    size_t size;
    size_t used;
    dict_entry **slot;
    hash_function hash;
} dict_table;

#define TABLE_INIT { 0, 0, NULL, hash_func_default }
typedef size_t (*hash_function)(const void *, size_t);
size_t hash_func_default (const void *, size_t);
/* Run-time discovery of other hash functions. */
hash_function hash_func_find (const char *);
size_t hash_func_count (void);
hash_function hash_func_func (size_t);
const char *hash_func_name (size_t);
int dict_table_init (dict_table *, size_t, hash_function);
int dict_table_rehash (dict_table *, size_t, hash_function);
int dict_table_resize (dict_table *, size_t);
int dict_table_free (dict_table *);
dict_entry * dict_table_add (dict_table *, const char *, size_t, dict_value);
dict_entry * dict_table_add_entry (dict_table *, dict_entry *);
int dict_table_del (dict_table *, const char *, size_t);
int dict_table_del_entry (dict_table *, dict_entry *);
int dict_table_del_all (dict_table *, dict_value);
dict_entry * dict_table_find (dict_table *, const char *, size_t);
dict_entry * dict_table_find_hash (dict_table *, const char *, size_t, size_t);
dict_entry * dict_table_find_entry (dict_table *, dict_entry *);
dict_entry * dict_table_find_any (dict_table *, dict_value);
dict_entry * dict_table_detach (dict_table *, const char *, size_t);
dict_entry * dict_table_detach_hash (dict_table *, const char *, size_t, size_t);
dict_entry * dict_table_detach_entry (dict_table *, dict_entry *);
dict_entry * dict_table_detach_any (dict_table *, dict_value);
/* These are for constructing entries dynamically, before they're added to any tables. */
dict_entry * dict_entry_new (size_t);
dict_entry * dict_entry_resize (dict_entry *, size_t);
dict_entry * dict_entry_finish (dict_entry *, dict_table *);
#undef HASHLIB_VALUE
#undef HASHLIB_PREFIX
and allows multiple different hash tables to be included in the same compilation unit. (This is one of the rare situations where I'd use this kind of preprocessor-assisted polymorphism in C.)
The dict_ prefix functions would be static inline and marked optional, but internally use a set of helper functions implemented in the static or dynamic library. If a dynamic library is used, then adding new hash functions to the dynamic library will immediately make them available to binaries without recompiling them.
If I wanted an additional hash table to look up say functions operating on a stack of floating-point numbers, I could add say
typedef struct {
    size_t size;
    size_t used;
    double *data;
} calc_stack;

#define CALC_STACK_INIT { 0, 0, NULL }

typedef int (*calc_op)(calc_stack *);

#define HASHLIB_VALUE  calc_op
#define HASHLIB_PREFIX calc_
#include "hashlib.h"
and it would expose an additional hash table of type calc_table, with entries of type calc_entry, and the same set of functions as for dict above, but for these new types. The preprocessor magick to make this all work without collisions is a bit messy, but it does work just fine.
Including hashlib.h automatically undefines HASHLIB_VALUE and HASHLIB_PREFIX. If you include it without those defined, all you get is the TABLE_INIT macro, the hash_function function pointer type, and the hash_func*() functions.
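The core of that name-pasting magick might, roughly, look like the sketch below (illustrative only; the HASHLIB_CAT/HASHLIB_NAME helpers are made-up names, and a real hashlib.h would contain much more):
/* Sketch of the prefix pasting inside hashlib.h. */
#define HASHLIB_CAT_(a, b) a ## b
#define HASHLIB_CAT(a, b)  HASHLIB_CAT_(a, b)
#define HASHLIB_NAME(name) HASHLIB_CAT(HASHLIB_PREFIX, name)

/* With HASHLIB_PREFIX defined as dict_, this declares struct dict_entry / dict_entry. */
typedef struct HASHLIB_NAME(entry) {
    struct HASHLIB_NAME(entry) *next;
    size_t hash;
    HASHLIB_VALUE val;
    size_t len;
    char key[];
} HASHLIB_NAME(entry);

/* ... the table type and the static inline wrappers are generated the same way ... */

#undef HASHLIB_NAME
#undef HASHLIB_CAT
#undef HASHLIB_CAT_
#undef HASHLIB_VALUE
#undef HASHLIB_PREFIX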
-
OK; I'm confused. Admittedly, my education predates a lot of the current fancy table algorithms, and long predates the "find a good library that implements it" age, but...
Isn't hashing essentially just a key-shortener? You pass your key (perhaps a string) into a hashing function, and you get back a much shorter key (9 bit integer.) Then you use the shorter key to (more rapidly) access some other data structure. But that data structure could be ANYTHING: array, linked list, some kind of tree... Pick as appropriate for your desired insert/delete/find behavior, probably.
If your hash function is "perfect" for your data, and you have fewer data items than can be counted by the bits in the shorter key, you could directly index an array from the short key. If you have collisions (different long key yields same short key), then the secondary data structure has to have a mechanism for insert/delete/find based on the long key. But that could be independent of the mechanism/datastructure used to access via the short key...
Sort of: item = findItemInCollisionStruct( findCollisionStructure( myHash(longKey) ), longKey );
(Code I've worked with used (bits of an?) IP checksum as the hash, picked up a linked list from an array, and did a linear search through the LL. We "assumed" (maybe we measured) that collisions would be relatively rare and the LLs would be short.)
So it seems like what OP is really looking for has little to do with the hashing, and more to do with the data structures thereafter. Which depends mostly on things other than the hashing itself.
(oohh. The secondary access could be another hash table - are there "mutually exclusive hashes", such that if there is a collision using one hashing algorithm, the 2nd hash is guaranteed NOT to collide?)
I also don't quite understand the desire to avoid the "memory inefficiency of an extra pointer" at the same time we seem to be talking about an ia64 processor/environment, but I'm willing to assume that the OP has "reasons."
-
OK; I'm confused. Admittedly, my education predates a lot of the current fancy table algorithms, and long predates the "find a good library that implements it" age, but...
Me too. And thinking on this has definitely changed a lot.
The biggest change is that long ago you typically made your hash table a prime number size, and did a mod operation to get from the hash value to the table index. Which was slow. Although as the hash table size doesn't change often, you can figure out the multiplicative reciprocal of the table size and store that with the table, and multiply instead of dividing. I've done that as recently as 2006 or 2007 in the "Alchemo" Java to native (ARM, typically ARM7TDMI) compiler.
The modern thing is to use something that is close to being a cryptographically secure hash function, in that any 1 bit change in the large key has about a 50% (but uncorrelated) chance of changing each bit in the short output. i.e. something like MD5 or SHA-1. Since all the bits in the output are in theory just as good as each other, or as good as a combination, you can shorten the hash value further to your actual table size just by having a power of two table size and masking off the extra bits in the hash value.
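In code, the reduction step for the two approaches is just (a trivial sketch, assuming a 64-bit hash):
#include <stddef.h>
#include <stdint.h>

/* Prime-sized table: the slot index needs a (relatively slow) modulo. */
static size_t index_mod(uint64_t hash, size_t prime_size)
{
    return (size_t)(hash % prime_size);
}

/* Power-of-two table with a well-mixed hash: masking off the high bits is enough. */
static size_t index_mask(uint64_t hash, size_t pow2_size)
{
    return (size_t)(hash & (pow2_size - 1));
}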
MD5 is long since broken for security purposes, but it's still perfectly fine as a hash function in algorithms. The problem is MD5 and SHA-1 are pretty slow to calculate. Unless you have dedicated hardware...
The trend since 2000 or so has been finding hash functions that statistically look a lot like MD5 or SHA-1 but are a lot faster to calculate.
This was all started by Bob Jenkins with his "one_at_a_time" and then "lookup2" functions in 1997, followed by "lookup3" (mentioned in this thread) in 2006 and then "spookyhash" in 2011. Austin Appleby took up the baton in 2008 with "MurmurHash" -- explicitly trying to make a faster replacement for lookup3 -- followed by Murmur2 and Murmur3. Murmur3 is used (or at least was at one time) in a LOT of programming language implementations, including Python, Go, C#, D, Lua, Perl, Ruby, Rust, Scala, Swift, and by software such as libstdc++, nginx, and hadoop.
Google put the big brains on to the task and published CityHash in 2011. It uses the x86 CRC32 instruction as part of it. The authors themselves say that SpookyHash is approximately as good, and better if you don't have CRC32 hardware. FarmHash (Google 2014) and MetroHash (J Andrew Rogers 2015) continue the trend to faster hashes that are just as good.
In my defense with the hash tables in Alchemo, all the really good hash functions starting with murmur and lookup3 postdate that work.
Around 2008-2011 is when it became definitively the thing to use power of two table sizes instead of prime number table sizes.
Isn't hashing essentially just a key-shortener? You pass your key (perhaps a string) into a hashing function, and you get back a much shorter key (9 bit integer.) Then you use the shorter key to (more rapidly) access some other data structure. But that data structure could be ANYTHING: array, linked list, some kind of tree... Pick as appropriate for your desired insert/delete/find behavior, probably.
Exactly. Though 9 bits would be a strange size. Typically a hash function for use in algorithms returns a key the same size as a pointer, or at least the same size as an array index, i.e. 32 or 64 bits. There's no point in bigger, and anything much smaller comes up short with unnecessary duplicates if you have a really big data structure.
While you *can* use hashed values in a list or tree, that doesn't give a lot of speedup -- mostly it just helps if your long keys have a significant danger of having long matching initial parts. And being fixed size can help you to design a more cache-friendly data structure. You still need to go off and follow a pointer to compare the real key if the hash matches, but that's hopefully just once, and meanwhile you rejected all the non-matches quickly.
The ideal use for a hashed value is to directly access an array, so (almost) everything is found first attempt.
(oohh. The secondary access could be another hash table - are there "mutually exclusive hashes", such that if there is a collision using one hashing algorithm, the 2nd hash is guaranteed NOT to collide?)
No, that's not a thing. At least not for arbitrary data. It's just low probability, much the same as if you'd just concatenated the two hashes.
I also don't quite understand the desire to avoid the "memory inefficiency of an extra pointer" at the same time we seem to be talking about an ia64 processor/environment, but I'm willing to assume that the OP has "reasons."
ia64? I missed that. That's a super-rare beast!
-
ia64? I missed that. That's a super-rare beast!
That probably means I used the wrong abbreviation...
Compiler is the Microsoft WinDDK 64 bit cl.exe
So, 64-bit x86 architecture. I guess ia64 was the ill-fated Itanium and I should have said AMD64 or x86_64 (except I thought AMD64 was more of an ABI spec than a CPU instruction set?)
My oops.
-
Oh ok yes, amd64, the instruction set Intel did not want to exist but were forced to adopt due to AMD's market success with Opteron and Athlon64 and their own abject failure with ia64 (which was designed to finally kill off AMD's second source license as well as all the RISC vendors)
Designed from the marketing and legal point of view, that is. Technically the design was awful. iAPX 432 all over again.
-
I want something like the C++ STL unordered map in C,
but without the code bloat and syntactic bloat of the STL.
I too seriously dislike the STL.
But if your key is always in the data (a very agreeable practice IMO!) you could use a set.
-
For hashing large amounts of data I like xxhash (https://github.com/Cyan4973/xxHash). It's pretty fast.
-
Depending on what your requirements are in terms of hashing quality and resource constraints, you might want to consider libsodium. It's a proven crypto grade library.
https://libsodium.org
-
This popped up on my FB feed recently:
https://mollyrocket.com/meowhash
It’s the fastest hash function we know of, and we have benchmarked all the ones we could find. On modern Intel x64 CPUs, it hashes 16 bytes per cycle single-threaded. This means in cache it can hash at a rate of 64 gigabytes per second on a 4.2gHz machine.
-
This popped up on my FB feed recently:
https://mollyrocket.com/meowhash
It’s the fastest hash function we know of, and we have benchmarked all the ones we could find. On modern Intel x64 CPUs, it hashes 16 bytes per cycle single-threaded. This means in cache it can hash at a rate of 64 gigabytes per second on a 4.2gHz machine.
Notes:
- requires AES hardware. If you don't have it then it will be slow slow slow
- claims to support arm64 as well as amd64. I can't find any evidence of arm64 code in the current version
- the source files mentioned in the README do not exist. They were deleted on 19 May 2019. Those files *did* have ARM support.
I don't know what's going on there.