Let's say I was writing a code editor for different programming languages, and had a lexer/tokenizer for each language that produces a (string, length, hash) triplet for each token. Each token should be looked up primarily for syntax highlighting, but occasionally for references as well, so we'll use a hash table. For C, for example, we'd actually use a stack of hash tables, with the bottom table holding the C keywords, and each table above it holding the variables, functions, macros, etc. in each scope.
I've observed that data locality is important for efficiency, so I prefer a C99 flexible array member to store the token string, with each hash table slot holding a pointer to such a structure. Since reaching each structure involves a pointer dereference anyway, making each hash table slot a singly-linked list incurs no extra penalty. When adding a new member to a slot, it is the maximum length of the list (rather than how many of the hash table slots are populated) that determines whether the table should be extended or not.
Let's assume the following data structure suffices for each token:
struct token {
    struct token *next;  /* Next entry in the same hash table slot */
    token_type_t  type;  /* Token classification (function, variable, reserved keyword, etc.) */
    highlight_t   high;  /* Highlighting information (index to font+color table, themeable) */
    struct refs  *refs;  /* References and other only occasionally needed data */
    hash_t        hash;  /* Actual hash of the data */
    size_t        size;  /* Length of the data */
    char          data[]; /* Data */
};
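To make the flexible array member concrete, allocation might look something like this sketch (token_new() is a hypothetical helper name of mine; I'm assuming token_type_t and highlight_t are plain integer/enum types so zero works as an "unset" value, with malloc() and memcpy() from <stdlib.h> and <string.h>):

struct token *token_new(const char *data, size_t size, hash_t hash)
{
    /* One allocation holds both the header and the string. */
    struct token *t = malloc(sizeof *t + size + 1);
    if (!t)
        return NULL;
    t->next = NULL;
    t->type = 0;    /* Not yet classified. */
    t->high = 0;    /* Default highlight. */
    t->refs = NULL; /* Populated lazily; see below. */
    t->hash = hash;
    t->size = size;
    memcpy(t->data, data, size);
    t->data[size] = '\0'; /* Terminator, so data also works as a C string. */
    return t;
}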
The struct refs *refs; member could even be left NULL or only partially populated until the user first selects the token, since it is not that common for the user to be interested in the original definition of a token and so on.
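As a sketch of that lazy population, an accessor could fill it in on first use (refs_lookup() is a hypothetical slow-path function that would do the actual reference search):

struct refs *token_refs(struct token *t)
{
    if (!t->refs)
        t->refs = refs_lookup(t->data, t->size); /* Slow path; runs at most once per token. */
    return t->refs;
}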
Although I'd need the hash tables to form a stack, array, or list, let's just concentrate on the single hash table case for now. My initial unit tests and experiments, to see whether this data structure works and how to make the editor as efficient as possible by delaying slow operations (like looking up where a function was originally defined, or the other places where it is used) until they are actually needed, would start with the following hash table data structure:
struct table {
    size_t         size;  /* Number of slots */
    struct token **slot;  /* Each slot is a singly-linked list. */
    uint32_t       cost;  /* Filtered average cost of access. */
};
#define TABLE_INIT { 0, NULL, 0 }
As usual, the slot index is always (hash % size).
The cost field is the most adaptable heuristic I've found useful. The idea is that a value of 0 indicates every access was resolved immediately (meaning all slots have at most a single entry), while some arbitrary limit, say 4096, indicates the hash table should be resized, because access cost is too high (due to the slot lists being too long).
Each access uses some kind of heuristic to estimate its cost. A simple possibility is the number of linked-list entries in the target slot that had to be traversed before a resolution was found, multiplied by some number. For example, a multiplier of 1024 would mean that if many consecutive accesses need to traverse four or more entries, the table needs to be resized. This multiplier would be a tunable, obviously.
To average the value, a simple filter is used. If cost is the cost in the hash table structure, and curr is the cost of the current access, then the general formula is cost = (cost * (N - 1) + curr) / N, where N is a suitable power of two for efficiency; or cost = (cost * (N - A) + curr * A) / N, where A is a (small) odd positive number and N is a suitable power of two. This means that cost changes slowly over multiple accesses, and is not affected too much by a rare single high-cost access.
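To illustrate, a lookup that maintains the filtered cost might look roughly like this, with N = 16 and A = 1, the 1024 multiplier, and the 4096 limit from above (the constant names and table_lookup() itself are mine, not a fixed interface; memcmp() is from <string.h>):

#define COST_STEP  1024  /* Cost per entry traversed without a match (tunable). */
#define COST_LIMIT 4096  /* Filtered cost at which the table should be resized. */

struct token *table_lookup(struct table *t, hash_t hash,
                           const char *data, size_t size)
{
    struct token *curr = NULL;
    uint32_t misses = 0;

    if (t->size) {
        for (curr = t->slot[hash % t->size]; curr; curr = curr->next) {
            if (curr->hash == hash && curr->size == size &&
                !memcmp(curr->data, data, size))
                break;
            misses++;
        }
    }

    /* cost = (cost * (N - 1) + curr) / N, with N = 16. */
    t->cost = (t->cost * 15 + misses * COST_STEP) / 16;
    if (t->cost >= COST_LIMIT) {
        /* Signal or perform a resize here. */
    }
    return curr;
}

An immediate hit contributes 0, so a table where every slot has at most one entry keeps cost at 0, exactly as described above.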
Again, the cost approach is a heuristic: in my experience, using only the list length as an absolute limit does not yield a good balance between table size and access cost, because sometimes you just hit a bad table size with respect to the hash function and get collisions, but those collisions don't actually affect performance much.
In this particular case, I'd probably observe during testing that the cost approach is not needed: because each hash table slot only takes a single pointer, I can simply keep the hash table size between, say, (number of entries) and 3*(number of entries). So, I'd end up replacing all the cost stuff with a simple size_t entries; member, incremented whenever a new entry is inserted, decremented when one is deleted, with both operations resizing the hash table when needed. Note that the sizing must have Schmitt triggering: you want the resize target (2*(number of entries)) to be in the middle of the "keep this size" range, so that a sequence of adding and then deleting an entry does not cause the table to be resized every time.
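With a hypothetical size_t entries; member in place of cost, the Schmitt-triggered policy might be as small as this, reusing the table_resize() shown below (a real version would likely also enforce a minimum size):

/* Keep entries <= size <= 3*entries; resize to 2*entries when outside
   that band. The target sits mid-band, so a single insert followed by
   a single delete cannot bounce the table between two sizes. */
static void table_autosize(struct table *t)
{
    if (!t->entries)
        return;
    if (t->size < t->entries || t->size > 3 * t->entries)
        table_resize(t, 2 * t->entries); /* Error handling elided. */
}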
To resize the hash table, all I need is to save the old slot array in a temporary variable, allocate a new one, move each entry across (because the hash is already stored in the linked-list structure, it's all just pointer work), and finally free the old array. This means the resize cost is relatively small. Untested initial implementation:
#include <stdlib.h> /* malloc(), free() */
#include <errno.h>  /* errno, EINVAL, ENOMEM */

int table_resize(struct table *t, size_t newsize)
{
    /* Sanity checks; I like these.. */
    if (!t || newsize < 1)
        return errno = EINVAL;

    struct token **newslot;
    newslot = malloc(newsize * sizeof *newslot);
    if (!newslot)
        return errno = ENOMEM;

    /* Initialize all slots to NULL. We could have also used calloc() above. */
    for (size_t i = 0; i < newsize; i++)
        newslot[i] = NULL;

    struct token **oldslot = t->slot;
    size_t oldsize = t->size;

    /* Relink every token into the new slot array; no data is copied. */
    for (size_t o = 0; o < oldsize; o++) {
        struct token *list = oldslot[o];
        while (list) {
            struct token *curr = list;
            list = list->next;
            const size_t i = curr->hash % newsize;
            curr->next = newslot[i];
            newslot[i] = curr;
        }
    }

    t->size = newsize;
    t->slot = newslot;
    free(oldslot);
    return 0;
}
In the application code, each table variable is declared and defined as
struct table example = TABLE_INIT;
so that there is no special "initialize table" call. Lookups will notice size == 0 and always return "none found", and the first insertion will use a compile-time default size for the slot array. (Since each slot is only a pointer in size, it makes sense to start with relatively large tables, say 1024 slots or so, when each application will only have a relatively small number of such hash tables. The compile-time default size thus needs to be user-overridable (#ifndef DEFAULT_SLOTS / #define DEFAULT_SLOTS 1024 / #endif), and carry a comment about sizing it based on how many such hash tables one's own application expects to have in use at the same time, rather than on trying to guess how large any one table will eventually grow.)
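A first-insertion path matching that description might look like this (table_insert() is a hypothetical helper of mine, reusing table_resize() from above; it assumes the token was already allocated, e.g. as in the token_new() sketch earlier):

#ifndef DEFAULT_SLOTS
/* Size this for how many tables the application keeps alive at once,
   not for how large any one table may eventually grow. */
#define DEFAULT_SLOTS 1024
#endif

int table_insert(struct table *t, struct token *tok)
{
    if (!t || !tok)
        return errno = EINVAL;
    if (!t->size) {
        /* First insertion into a TABLE_INIT table: allocate slots lazily. */
        const int err = table_resize(t, DEFAULT_SLOTS);
        if (err)
            return err;
    }
    const size_t i = tok->hash % t->size;
    tok->next = t->slot[i];
    t->slot[i] = tok;
    return 0;
}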
TL;DR: Apologies for the verbosity, but I thought it would be useful to show how my development process works, and how I'd end up with whatever I'd end up using, including a wrinkle ("access cost") that might sometimes be useful but in this particular case is better replaced with something else. At minimum, it should show why I don't really reach for a generic solution as my first step.
In development terms, I must emphasize that when I am considering this stuff, I am also thinking a lot about what the application needs and how it could do that efficiently. You could say that I get twice my money's worth for the time I spend writing the above code, because as I write it, I'm basing it on my understanding of the problem. Similar to the Rubber Duck method, I've found that this helps me better understand the needs and use cases of my target application or service; and that understanding is often more valuable to me than the code I produce at this phase.