Let's say I was writing a code editor for different programming languages, and had a lexer/tokenizer for each language that produces a (string, length, hash) triplet for each token. Each token should be looked up primarily for syntax highlighting, but occasionally for references as well, so we'll use a hash table. For C, for example, we'd actually use a stack of hash tables, with the bottom table holding the C keywords, and each table above it holding the variables, functions, macros, etc. in each scope.
I've observed that data locality is important for efficiency, so I prefer a C99 flexible array member to store the token string, with each hash table slot holding a pointer to such a structure. Since reaching each structure involves a pointer dereference anyway, making each hash table slot a singly-linked list incurs no extra penalty. When adding a new member to a slot, it is the maximum length of the list (rather than how many of the hash table slots are populated) that determines whether the table should be extended or not.
Let's assume the following data structure suffices for each token:
struct token {
    struct token *next;  /* Next entry in the same hash table slot */
    token_type_t  type;  /* Token classification (function, variable, reserved keyword, etc.) */
    highlight_t   high;  /* Highlighting information (index to font+color table, themeable) */
    struct refs  *refs;  /* References and other only occasionally needed data */
    hash_t        hash;  /* Actual hash of the data */
    size_t        size;  /* Length of the data */
    char          data[]; /* Data */
};
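To make the flexible array member concrete, allocation might look something like this sketch (token_new() is a hypothetical helper name of mine; I'm assuming token_type_t and highlight_t are plain integer/enum types so zero works as an "unset" value, with malloc() and memcpy() from <stdlib.h> and <string.h>):

struct token *token_new(const char *data, size_t size, hash_t hash)
{
    /* One allocation holds both the header and the string. */
    struct token *t = malloc(sizeof *t + size + 1);
    if (!t)
        return NULL;
    t->next = NULL;
    t->type = 0;    /* Not yet classified. */
    t->high = 0;    /* Default highlight. */
    t->refs = NULL; /* Populated lazily; see below. */
    t->hash = hash;
    t->size = size;
    memcpy(t->data, data, size);
    t->data[size] = '\0'; /* Terminator, so data also works as a C string. */
    return t;
}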
The struct refs *refs; member could even be left NULL or only partially populated until the user first selects the token, since it is not that common for the user to be interested in the original definition of a token and so on.
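As a sketch of that lazy population, an accessor could fill it in on first use (refs_lookup() is a hypothetical slow-path function that would do the actual reference search):

struct refs *token_refs(struct token *t)
{
    if (!t->refs)
        t->refs = refs_lookup(t->data, t->size); /* Slow path; runs at most once per token. */
    return t->refs;
}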
Although I'd need the hash tables to form a stack, array, or list, let's just concentrate on the single hash table case for now. My initial unit tests and experiments, to see whether this data structure works and how to make the editor as efficient as possible by delaying slow operations (like looking up where a function was originally defined, or the other places where it is used) until they are actually needed, would start with the following hash table data structure:
struct table {
    size_t         size;  /* Number of slots */
    struct token **slot;  /* Each slot is a singly-linked list. */
    uint32_t       cost;  /* Filtered average cost of access. */
};
#define TABLE_INIT { 0, NULL, 0 }
As usual, the slot index is always (hash % size).
The cost field is the most adaptable heuristic I've found useful. The idea is that a value of 0 indicates every access was resolved immediately (meaning all slots have at most a single entry), while some arbitrary limit, say 4096, indicates the hash table should be resized, because access cost is too high (due to the slot lists being too long).
Each access uses some kind of heuristic to estimate its cost. A simple possibility is the number of linked-list entries in the target slot that had to be traversed before a resolution was found, multiplied by some number. For example, a multiplier of 1024 would mean that if many consecutive accesses need to traverse four or more entries, the table needs to be resized. This multiplier would be a tunable, obviously.
To average the value, a simple filter is used. If cost is the cost in the hash table structure, and curr is the cost of the current access, then the general formula is cost = (cost * (N - 1) + curr) / N, where N is a suitable power of two for efficiency; or cost = (cost * (N - A) + curr * A) / N, where A is a (small) odd positive number and N is a suitable power of two. This means that cost changes slowly over multiple accesses, and is not affected too much by a rare single high-cost access.
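To illustrate, a lookup that maintains the filtered cost might look roughly like this, with N = 16 and A = 1, the 1024 multiplier, and the 4096 limit from above (the constant names and table_lookup() itself are mine, not a fixed interface; memcmp() is from <string.h>):

#define COST_STEP  1024  /* Cost per entry traversed without a match (tunable). */
#define COST_LIMIT 4096  /* Filtered cost at which the table should be resized. */

struct token *table_lookup(struct table *t, hash_t hash,
                           const char *data, size_t size)
{
    struct token *curr = NULL;
    uint32_t misses = 0;

    if (t->size) {
        for (curr = t->slot[hash % t->size]; curr; curr = curr->next) {
            if (curr->hash == hash && curr->size == size &&
                !memcmp(curr->data, data, size))
                break;
            misses++;
        }
    }

    /* cost = (cost * (N - 1) + curr) / N, with N = 16. */
    t->cost = (t->cost * 15 + misses * COST_STEP) / 16;
    if (t->cost >= COST_LIMIT) {
        /* Signal or perform a resize here. */
    }
    return curr;
}

An immediate hit contributes 0, so a table where every slot has at most one entry keeps cost at 0, exactly as described above.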
Again, the cost approach is a heuristic: in my experience, using only the list length as an absolute limit does not yield a good balance between table size and access cost, because sometimes you just hit a bad table size with respect to the hash function and get collisions, but those collisions don't actually affect performance much.
In this particular case, I'd probably observe during testing that the cost approach is not needed: because each hash table slot only takes a single pointer, I can simply keep the hash table size between, say, (number of entries) and 3*(number of entries). So, I'd end up replacing all the cost stuff with a simple size_t entries; member, incremented whenever a new entry is inserted, decremented when one is deleted, with both operations resizing the hash table when needed. Note that the sizing must have Schmitt triggering: you want the resize target (2*(number of entries)) to be in the middle of the "keep this size" range, so that a sequence of adding and then deleting an entry does not cause the table to be resized every time.
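With a hypothetical size_t entries; member in place of cost, the Schmitt-triggered policy might be as small as this, reusing the table_resize() shown below (a real version would likely also enforce a minimum size):

/* Keep entries <= size <= 3*entries; resize to 2*entries when outside
   that band. The target sits mid-band, so a single insert followed by
   a single delete cannot bounce the table between two sizes. */
static void table_autosize(struct table *t)
{
    if (!t->entries)
        return;
    if (t->size < t->entries || t->size > 3 * t->entries)
        table_resize(t, 2 * t->entries); /* Error handling elided. */
}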
To resize the hash table, all I need is to save the old slot array in a temporary variable, allocate a new one, move each entry across (because the hash is already stored in the linked-list structure, it's all just pointer work), and finally free the old array. This means the resize cost is relatively small. Untested initial implementation:
#include <stdlib.h> /* malloc(), free() */
#include <errno.h>  /* errno, EINVAL, ENOMEM */

int table_resize(struct table *t, size_t newsize)
{
    /* Sanity checks; I like these.. */
    if (!t || newsize < 1)
        return errno = EINVAL;

    struct token **newslot;
    newslot = malloc(newsize * sizeof *newslot);
    if (!newslot)
        return errno = ENOMEM;

    /* Initialize all slots to NULL. We could have also used calloc() above. */
    for (size_t i = 0; i < newsize; i++)
        newslot[i] = NULL;

    struct token **oldslot = t->slot;
    size_t oldsize = t->size;

    /* Relink every token into the new slot array; no data is copied. */
    for (size_t o = 0; o < oldsize; o++) {
        struct token *list = oldslot[o];
        while (list) {
            struct token *curr = list;
            list = list->next;
            const size_t i = curr->hash % newsize;
            curr->next = newslot[i];
            newslot[i] = curr;
        }
    }

    t->size = newsize;
    t->slot = newslot;
    free(oldslot);
    return 0;
}
In the application code, each table variable is declared and defined as
struct table example = TABLE_INIT;
so that there is no special "initialize table" call. Lookups will notice size == 0 and always return "none found", and the first insertion will use a compile-time default size for the slot array. (Since each slot is only a pointer in size, it makes sense to start with relatively large tables, say 1024 slots or so, when each application will only have a relatively small number of such hash tables. The compile-time default size thus needs to be user-overridable (#ifndef DEFAULT_SLOTS / #define DEFAULT_SLOTS 1024 / #endif), and carry a comment about sizing it based on how many such hash tables one's own application expects to have in use at the same time, rather than on trying to guess how large any one table will eventually grow.)
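A first-insertion path matching that description might look like this (table_insert() is a hypothetical helper of mine, reusing table_resize() from above; it assumes the token was already allocated, e.g. as in the token_new() sketch earlier):

#ifndef DEFAULT_SLOTS
/* Size this for how many tables the application keeps alive at once,
   not for how large any one table may eventually grow. */
#define DEFAULT_SLOTS 1024
#endif

int table_insert(struct table *t, struct token *tok)
{
    if (!t || !tok)
        return errno = EINVAL;
    if (!t->size) {
        /* First insertion into a TABLE_INIT table: allocate slots lazily. */
        const int err = table_resize(t, DEFAULT_SLOTS);
        if (err)
            return err;
    }
    const size_t i = tok->hash % t->size;
    tok->next = t->slot[i];
    t->slot[i] = tok;
    return 0;
}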
TL;DR: Apologies for the verbosity, but I thought it would be useful to show how my development process works, and how I'd end up with whatever I'd end up using, including a wrinkle ("access cost") that might sometimes be useful but in this particular case is better replaced with something else. At minimum, it should show why I don't really reach for a generic solution as my first step.
In development terms, I must emphasize that when I am considering this stuff, I am also thinking a lot about what the application needs and how it could do that efficiently. You could say that I get twice my money's worth for the time I spend writing the above code, because as I write it, I'm basing it on my understanding of the problem. Similar to the Rubber Duck method, I've found that this helps me better understand the needs and use cases of my target application or service; and that understanding is often more valuable to me than the code I produce at this phase.