
C only: What hash library are you using?


udok:
Do you know a good C hash library, or have you written your own?

The libs and examples I have looked at are toys, with wasted memory
and mediocre performance.

I do not want two pointers, to key and value, as usual.
One pointer to the user data is enough, since the key is always
included in the user data.

I have coded my own textbook open-addressing hash functions,
but I am not satisfied with the result.

I used lazy deletion (tombstones), and I have found that this is
a no-go in situations with regular deletions and insertions.
The table fills up rather quickly with tombstones.

I have looked at the Rust hash implementation (the Rust code is mostly excellent),
but the code is complicated and too long for my brain.
Rust uses Google's SwissTable idea; in the past it used Robin Hood hashing.

The hash function I want to use is the hardware-accelerated Intel CRC32C.
In my tests it is faster and has fewer collisions than most of the other
functions I have tested.

I definitely want something simple in a single C file, with no more than 500 lines.

Any opinions?

ejeffrey:
I would expect that any well written generic hash table implementation in C would have separate key and value pointers.  That makes it possible to implement dictionaries between arbitrary types even when you can't change the value type to include the key.  It also reduces how often you have to write a custom wrapper function around the key hash and comparison functions.

In my experience, if C programmers are looking for more optimization, they implement a custom hash table for their datatype.

bson:
I use lookup3.  Small, fast, simple.

CRC would be a bit suspect as a hash function.  A desired property is that, on average, changing one bit in the input changes half the bits in the hash.  I don't think CRC satisfies this and you will get some weird distribution.

udok:
I tested the hash functions with a modified version of Peter Kankowski's test program (https://www.strchr.com/hash_functions).
The compiler is the Microsoft WinDDK 64-bit cl.exe with -O2.

lookup3 was one of the better functions, but it has small problems with the dic_common_words.txt test case
(words like: adult, man, woman, child, boy, girl, thing, cube, ball).

My test results:

--- Code: ---[0]$ ./HashOpenAddress.exe -r500 dic_common_words450.txt
# --------------------------------------------------------------------------------
# Lines read: 450
# Table size: 512 (9 bits)
# Fill-factor: 0.88
# Mean chars in line: 6
# --------------------------------------------------------------------------------
# Id   Hash-Function     Insert     Search Coll-Rate Max-Coll Total-Coll  Used-RAM
# --------------------------------------------------------------------------------
  1              K&R         34         58      2.99       69       1345      4096
  2            FNV1A         43         46      2.46       65       1107      4096
  3   FNV1A_unsigned         43         46      2.46       65       1107      4096
  4             FNV4         51         52      3.71       62       1669      4096
  5  FNV1A_unrolled8         35         38      1.87       37        843      4096
  6       Paul_Hsieh         30         53      2.77       63       1248      4096
  7          Lookup3         49         52      4.41      134       1985      4096
  8          Murmur2         27         37      2.60       70       1170      4096
  9         Murmur2A         41         42      3.11       64       1401      4096
 10             SBox         32         54      2.55       65       1147      4096
 11         CRC32_SW         41         42      2.69       90       1211      4096
 12      CRC32C_HW64         36         39      2.78       71       1250      4096

--- End code ---

The fill factor is high (0.88) to create some stress.

The Insert and Search columns are the number of clock cycles per operation,
measured with the rdtsc assembler instruction.

Max-Coll is the maximum number of collisions. In this test lookup3 has the highest
collision rate, even higher than Kernighan & Ritchie, but it is very fast to calculate.

All of the hash functions shown do well across a large number of test cases,
except K&R (Kernighan & Ritchie), which is only usable for some of them.

udok:
Here is a second test case with
100000 strings of the form N$0, N$1, N$2 up to N$99999.


--- Code: ---[0]$ ./HashOpenAddress.exe dic_nnn.txt
# --------------------------------------------------------------------------------
# Lines read: 100000
# Table size: 131072 (17 bits)
# Fill-factor: 0.76
# Mean chars in line: 6
# --------------------------------------------------------------------------------
# Id   Hash-Function     Insert     Search Coll-Rate Max-Coll Total-Coll  Used-RAM
# --------------------------------------------------------------------------------
  1              K&R       1132        975    138.34     4265   13834390   1048576
  2            FNV1A         70         91      1.52      111     151813   1048576
  3   FNV1A_unsigned         70         91      1.52      111     151813   1048576
  4             FNV4         77         99      1.60      157     159604   1048576
  5  FNV1A_unrolled8         78        100      1.69      129     168592   1048576
  6       Paul_Hsieh        158        170      6.89      619     688701   1048576
  7          Lookup3         72         93      1.59      107     159348   1048576
  8          Murmur2         69         89      1.65      117     164728   1048576
  9         Murmur2A         73         92      1.58      107     158448   1048576
 10             SBox         68         86      1.65      186     164985   1048576
 11         CRC32_SW         68         76      1.98      136     197640   1048576
 12      CRC32C_HW64         34         48      0.81       35      81457   1048576

--- End code ---

Clearly not a test case that the K&R or Paul Hsieh functions were designed for.
