Products > Programming
C only: What hash library are you using?
udok:
Do you know a good C hash library, or do you have written your own?
The libs and examples i looked at are toys with memory waste
and mediocre performance.
I do not want two pointers to keys and values as usual.
One pointer to the user data is enough, as the key is always
included in the user data.
I have coded my own text book open addressing hash functions,
but i am not satisfied with the result.
I used lazy deletion (tombstones), and i have experienced that this
is a no go in situations with regular deletions and insertions.
The table fills up rather quickly with the thombstones.
I have looked at the rust hash implementation (the rust code is mostly excellent),
but the code is complicated and too long for my brain.
Rust uses the google Swisscode idea, in the past they used Robin Hood hasing.
The hash function i want to use is the crc32c hardware accelerated intel function.
In my tests it is faster and has fewer collisions than most of the other
functions i have tested.
I definitly want something simple in a single C file, with not more than 500 lines.
Any opinions?
ejeffrey:
I would expect that any well written generic hash table implementation in C would have separate key and value pointers. That makes it possible to implement dictionaries between arbitrary types even when you can't change the value type to include the key. It also reduces how often you have to write a custom wrapper function around the key hash and comparison functions.
In my experience, if C programmers are looking for more optimization, they implement a custom hash table for their datatype.
bson:
I use lookup3. Small, fast, simple.
CRC would be a bit suspect as a hash function. A desired property is that, on average, changing one bit in the input changes half the bits in the hash. I don't think CRC satisfies this and you will get some weird distribution.
udok:
I tested the hash functions with a modified version of Peter Kankowski test program (https://www.strchr.com/hash_functions).
Compiler is the Microsoft WinDDK 64 bit cl.exe with -O2.
lookup3 was one of the better functions, but it has small problems with the dic_common_words.txt test case
(words like: adult, man, woman, child, boy, girl, thing, cube, ball ).
My tests results:
--- Code: ---[0]$ ./HashOpenAddress.exe -r500 dic_common_words450.txt
# --------------------------------------------------------------------------------
# Lines read: 450
# Table size: 512 (9 bits)
# Fill-factor: 0.88
# Mean chars in line: 6
# --------------------------------------------------------------------------------
# Id Hash-Function Insert-Search-Cycles Coll-Rate Max-Coll Total-Coll Used-RAM
# --------------------------------------------------------------------------------
1 K&R 34 58 2.99 69 1345 4096
2 FNV1A 43 46 2.46 65 1107 4096
3 FNV1A_unsigned 43 46 2.46 65 1107 4096
4 FNV4 51 52 3.71 62 1669 4096
5 FNV1A_unrolled8 35 38 1.87 37 843 4096
6 Paul_Hsieh 30 53 2.77 63 1248 4096
7 Lookup3 49 52 4.41 134 1985 4096
8 Murmur2 27 37 2.60 70 1170 4096
9 Murmur2A 41 42 3.11 64 1401 4096
10 SBox 32 54 2.55 65 1147 4096
11 CRC32_SW 41 42 2.69 90 1211 4096
12 CRC32C_HW64 36 39 2.78 71 1250 4096
--- End code ---
The fill factor is high (0.9) to make some stress.
The Search- and Insert-Cycle count is the number of clock cycles measured
with rdtsc assembler instruction.
The Max-Coll is the maximum number of collisions, lookup3 has the highest
collision rate, even higher as Kernighan&Ritchie, but it is a very fast to calculate.
All shown hash functions are good for large number of test cases.
Except K&R (Kernighan&Ritchie) which is only useable for some test cases.
udok:
Here is a second test case with
100000 Strings in the form N$0, N$1, N$2 up to N$99999.
--- Code: ---[0]$ ./HashOpenAddress.exe dic_nnn.txt
# --------------------------------------------------------------------------------
# Lines read: 100000
# Table size: 131072 (17 bits)
# Fill-factor: 0.76
# Mean chars in line: 6
# --------------------------------------------------------------------------------
# Id Hash-Function Insert-Search-Cycles Coll-Rate Max-Coll Total-Coll Used-RAM
# --------------------------------------------------------------------------------
1 K&R 1132 975 138.34 4265 13834390 1048576
2 FNV1A 70 91 1.52 111 151813 1048576
3 FNV1A_unsigned 70 91 1.52 111 151813 1048576
4 FNV4 77 99 1.60 157 159604 1048576
5 FNV1A_unrolled8 78 100 1.69 129 168592 1048576
6 Paul_Hsieh 158 170 6.89 619 688701 1048576
7 Lookup3 72 93 1.59 107 159348 1048576
8 Murmur2 69 89 1.65 117 164728 1048576
9 Murmur2A 73 92 1.58 107 158448 1048576
10 SBox 68 86 1.65 186 164985 1048576
11 CRC32_SW 68 76 1.98 136 197640 1048576
12 CRC32C_HW64 34 48 0.81 35 81457 1048576
--- End code ---
Clearly not a test case for which the K&R or the Paul Hsieh function was designed for.
Navigation
[0] Message Index
[#] Next page
Go to full version