Author Topic: hash of the string, preserving ordering? (Read 2130 times)

legacy · « **on:** August 25, 2019, 12:51:00 pm »

constraints

strings have only these allowed chars: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_~"
the string size is 20 char (plus '\0', not counted in the size)
each hash uses 4 bytes (uint32_t) of storage

I am trying to convert a string into a integer, the so-called "hash of the string". I am looking for an algorithm which generates a minimal (perfect?) hash that preserves ordering, e.g. if string "a" < string 'b", then hash(string "a") < hash(string "b")).

it seems impossible with the above constraints, and I am not able to calculate the collision probability

I need it for handling a b+tree algorithm which sometimes needs to use string keys rather than integer keys, so ... instead of having to make the algorithm polymorphic (able to accept and handle both integers and strings, it can be done, even in C, but ... it requires a lot of effort), I would like to collapse a string into a number so the algorithm can naturally process it as a number.

Ideas? Suggestions?

legacy · « **Reply #1 on:** August 25, 2019, 12:58:12 pm »

Quote from: legacy on August 25, 2019, 12:51:00 pm

collision probability

string_t A, B;
boolean_t is_collision;
is_collision = ((A isNotEqualTo B) and (hash(A) isEqualTo hash(B));

randomly given n strings (e.g. grabbing filenames inside the same folder of a hard drive), how many times will you see is_collsion = True?

SiliconWizard · « **Reply #2 on:** August 25, 2019, 01:32:48 pm »

You may take a look at this if you haven't already:
https://www.researchgate.net/publication/221301015_Order_Preserving_Minimal_Perfect_Hash_Functions_and_Information_Retrieval
https://jingdongwang2017.github.io/Pubs/ACMMM13-OrderPreservingHashing.pdf

ejeffrey · « **Reply #3 on:** August 25, 2019, 02:11:37 pm »

I don't think a locality preserving hash will really do what you want. They only preserve local ordering with some probability. They explicitly don't preserve long range order and it is only statistical anyway. If you want your string keys in order then you need to use the strings or truncated versions of them.

You could always convert your numeric values to digit strings....

legacy · « **Reply #4 on:** August 25, 2019, 02:39:47 pm »

Quote from: ejeffrey on August 25, 2019, 02:11:37 pm

You could always convert your numeric values to digit strings....

Code: [Select]

 char_t    word[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_~";

only 64 chars are allowed for strings, so 6bit/char instead of 8bit is ok, but this means that only (32bit / 6) 5 chars per each uint32_t hash are possible, while I need *somehow* a way to rappresent 20 chars with a single uint32_t hash.

Code: [Select]

sizeof(hash1)=4 bytes, hash2_get
sizeof(hash2)=8 bytes, pearson16

hash1=3927903123 hash2=10429985209355937563 filename[app00000000000000001]
hash1=3927898083 hash2=15694057609022770940 filename[app00000000000000100]
hash1=3041402986 hash2=15462748812045223140 filename[app200]
hash1=3204864491 hash2=09388938726757435289 filename[app100]
hash1=3222460756 hash2=15312885477942118663 filename[bpp100]
hash1=2640722377 hash2=03919188176753789008 filename[a]
hash1=2658318642 hash2=16516498363500941832 filename[b]
hash1=2675914907 hash2=12962826904521526098 filename[c]
hash1=2824123552 hash2=17410942629007529365 filename[~~~~~~~~~~~~~~~~~~~~]

Code: [Select]

/*
 * convert a string into a integer, the so-called hash of the string
 * prime numbers
 * {
 *   2 3 5 7
 *   11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
 *   101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181 191
 * }
 */
uint64_t hash2_get
(
    chart_t msg[],
    uint32_t msg_size
)
{
    uint64_t  ans;
    uint32_t  hash;
    uint32_t  p;
    uint32_t  m;
    uint32_t  hpow;
    uint8_t   ch;
    uint8_t   key;
    uint32_t  i;
    uint32_t  j;

    p = 71;  /* magic prime number, close to 64, which is the number of allowed chars */
    m = 4e8; /* magic number horizon, don't make it too large or things will blow up */

    /*
     * read in reverse order
     * from the last to the first
     */
    hpow = 1;
    hash = 0;
    for (i = 0; i < msg_size - 1; i++)
    {
        j  = ((msg_size - 1) - 1 - i);
        ch = uc_char_to_uint8(msg[j]);

        if (ch isEqualTo '\0')
        {
            ch = '~';
        }

        key   = (65 - get_chkey64(ch)); /* maps the subset from a ASCII string */
        hash += ((key * hpow) remaind m);
        hpow  = ((hpow * p) remaind m);
    }

    ans = cc_uint64_to_uint32(hash); /* here it checks for overflow */
    return ans;
}

This, or something like this, is what I am going to use. It's experimental.

sokoloff · « **Reply #5 on:** August 25, 2019, 02:40:38 pm »

Quote from: legacy on August 25, 2019, 12:51:00 pm

it seems impossible with the above constraints, and I am not able to calculate the collision probability

The original space has 64 characters and 20 instances, so a size of 64²⁰. The hash space has 4 bytes of 256 possibilities so 256⁴.

64 is 2⁶ and 256 is 2⁸, so that means there are 2^{6 * 20} possibilities being mapped into 2^8*4 hashes, meaning there are 2¹²⁰ / 2³² or 2⁸⁸ unique input strings that map to each possible hash (in the best case scenario).

sokoloff · « **Reply #6 on:** August 25, 2019, 02:50:29 pm »

If you want the best possible preservation of ordering, take the first 32 bits of the input string (the first 5 full characters [30 bits] plus the 2 high order bits of the 6th character) and return that as a 32 bit number [assuming you are treating the input strings definition of "sorted" as being "a comes before b before c before A before B before C before 0 before ..."]

That also makes perfectly clear that you're ignoring the lower 4 bits of the 6th character and all the information in characters 7-20 (84 bits) of the input string, which checks out with the "there are 2⁸⁸ items in the domain for every number in the range" calculation above.

helius · « **Reply #7 on:** August 25, 2019, 04:15:18 pm »

You can't calculate the collision probability without first having a probability distribution of the 20-character strings. But if you assume that the distribution is uniform, you don't even need to know the string length: just that they are being mapped into 32-bit numbers is enough.
With a uniform distribution, the probability of landing in any bucket of the 32-bit hash is 2^(-32), and the probability of two independent strings landing in that same bucket is the square of that, 2^(-64). There are 2^(32) possible buckets in which this collision could happen, so the collision probability of two strings is 2^(32)*2^(-64), or just 2^(-32). You do have a Birthday Problem once you have many strings, which increases the likelihood of a collision. For example, if you have 2^(16) strings (65536), uniformly distributed, the probability of some of them colliding is 2^(-16).

legacy · « **Reply #8 on:** August 26, 2019, 11:08:30 am »

Quote from: helius on August 25, 2019, 04:15:18 pm

if you have 2^(16) strings (65536), uniformly distributed, the probability of some of them colliding is 2^(-16).

dunno if this is acceptable scenario for a real life filesystem's folder, where you have files. Strings in this case are filenames, so the distribution is not uniform since human beings are used to give a meaning to file-names representing the file content.

Code: [Select]

# ls doc/*
bPtree_vs_bAtree.txt
bMtree_vs_bPtree.txt
btree_test.txt
btree_on_disk1.txt
btree_on_disk2.txt
bugs.txt
draft.txt
hash_string.txt
avoid_stack.txt
dead_loops.txt
no_recursion1.txt
no_recursion2.txt
one_ret_value.txt
quality.txt
table_01.txt
table_02.txt
table_03.txt

There are two alternatives to the hash idea:

-1- (neat solution, requires a lot of effort) to make the b+tree algorithm polymorphic on its data record and data manipulation methods, so a string will be handled as a string, preserving order by lexicographic ordering.
-2- (workaround) to convert each char of a string of 20 chars into 6 bit representing the subgroup of allowed chards, to compose a vector of 4 uint32_t, handled as an atomic group of 4 keys by the current implementation of the b+tree

magic · « **Reply #9 on:** August 26, 2019, 02:19:18 pm »

Order and hashing doesn't really mix.

What you are trying to do is order the space of all strings and divide it into 2^32 consecutive buckets. That's it. Any "hashing" tricks, any XORs, prime numbers and whatnot will only destroy monotonicity.

If nothing is known about distribution, the best/only option is to take the first 32bits of each string, as others said. If you know that 50% of filenames start with 'a', devote the first 2^31 buckets to strings starting with 'a' and the rest to the rest. ~~If you know they are of the form a<some_word>NNNN, use "hash" like 0<15 first bits of some_word>NNNN to reduce collisions.~~(wait, not, that's not monotinic) You get the idea.

But seriously, make the tree keyed by binary blobs of runtime-specified length and store either ints or strings or anything else as binary blobs. Then it boils down to designing a monotonic encoding of strings as binary blobs, no rocket science.

legacy · « **Reply #10 on:** August 26, 2019, 02:37:53 pm »

the above workaround has 0 probability of collision, but it takes 128 bit per key to represent 20 of chars in the reduced field of 64 allowed chars.

128 bit is four uint32_t words, this means a group of four keys handled as monolithic block. This requires a few changes in the current code but it doesn't require rewriting too much C code in order to introduce the polymorphism that would be required to handle both integer keys (which are a must for other purposes), and string keys (which are required for filenames).

I will have to consider things carefully, a lot of work means weeks of work.

magic · « **Reply #11 on:** August 26, 2019, 02:45:01 pm »

Not sure what's the idea. Use four nodes to store one key and hack the algorithm so that it only considers every 4th node for lookups and ignores the rest?

I think at this point it's easier to just add a key_length argument and modify node data structures. BTW, the keys may be uint32_t arrays, not necessarily byte arrays. Probably more efficient on 32b systems.

legacy · « **Reply #12 on:** August 26, 2019, 03:34:32 pm »

Quote from: magic on August 26, 2019, 02:45:01 pm

Not sure what's the idea. Use four nodes to store one key and hack the algorithm so that it only considers every 4th node for lookups and ignores the rest?

Precisely! Basicaly, you have to indroduce the hack on the data structure and a method to compare 128bit words to say if A > B, or if A = B. This requires only a fraction of code to be modified.

Quote from: magic on August 26, 2019, 02:45:01 pm

I think at this point it's easier to just add a key_length argument and modify node data structures. BTW, the keys may be uint32_t arrays, not necessarily byte arrays. Probably more efficient on 32b systems.

The same code (physically in the same library, I don't have enough space in the ROM for forking the b+tree, one for integer keys, one for string keys) must handle both integer keys and string keys, so with a group of four integer keys, the data structure is not touched, which is the reason why I was looking at hash(string).

Touching the datastrucutre has a cost.

NorthGuy · « **Reply #13 on:** August 26, 2019, 04:29:23 pm »

I don't know why you want to preserve ordering. If you do, you may get some cases where all the files may fall into the same bucket, since humans (me for example) can name files as:

Code: [Select]

invoice_0001.pdf
invoice_0002.pdf
invoice_0003.pdf
invoice_0004.pdf

...

invoice_2879.pdf

If you want to make your tree shallow, you'd rather work out a hash which would make difficult for humans to skew the hash distribution between buckets too far from uniform.

magic · « **Reply #14 on:** August 26, 2019, 09:10:06 pm »

Good illustration of the difficulty of making a monotonic "hash". Thousands of buckets are required for strings starting with "invoice", which itself is one of 2^42 possible 7-letter prefixes - no way to have buckets with such fine granularity for all of them.

No randomization could solve it, but perhaps some domain knowledge: devote half of all buckets or even more to strings starting with a word and cover the space between words with the remainder. The algorithm would have a dictionary and know how many buckets are assigned to each word so it could compute which bucket any string falls into. A bit impractical maybe but hey, gets the job done better than simply taking the 32-bit prefix

brucehoult · « **Reply #15 on:** August 27, 2019, 05:36:34 am »

A good hash function has the property that changing any one bit in the input will flip a random 50% of the output bits. MD4, MD5, SHA* come incredibly close to achieving this. Even the very fast Murmur Hash, City Hash, Farm Hash series are so close it takes a lot of work to see the imperfections.

This all runs totally contrary to preserving lexicographic ordering!

hamster_nz · « **Reply #16 on:** August 27, 2019, 06:52:01 am »

Each character is about 6 bits of information, so packing 120 bits into 32 bits will be impossible unless there is redundancy.

Just grab any 2^32+1 a tiny part of the possible pool, and you are stuffed.

You need at least 15 bytes for each value.

legacy · « **Reply #17 on:** August 27, 2019, 09:03:36 am »

Quote from: brucehoult on August 27, 2019, 05:36:34 am

MD4, MD5, SHA* come incredibly close to achieving this.

Code: [Select]

private uint8_t pearson_table[256] =
{
    /*
     * 0-255 shuffled in any (random) order suffices
     */
    98,    6,  85, 150,  36,  23, 112, 164, 135, 207, 169,   5,  26,  64, 165, 219,     //  1
    61,   20,  68,  89, 130,  63,  52, 102,  24, 229, 132, 245,  80, 216, 195, 115,     //  2
    90,  168, 156, 203, 177, 120,   2, 190, 188,   7, 100, 185, 174, 243, 162, 10,      //  3
    237,  18, 253, 225,   8, 208, 172, 244, 255, 126, 101,  79, 145, 235, 228, 121,     //  4
    123, 251,  67, 250, 161,   0, 107,  97, 241, 111, 181,  82, 249,  33,  69, 55,      //  5
    59,  153,  29,   9, 213, 167,  84,  93,  30,  46,  94,  75, 151, 114,  73, 222,     //  6
    197,  96, 210,  45,  16, 227, 248, 202,  51, 152, 252, 125,  81, 206, 215, 186,     //  7
    39,  158, 178, 187, 131, 136,   1,  49,  50,  17, 141,  91,  47, 129,  60, 99,      //  8
    154,  35,  86, 171, 105,  34,  38, 200, 147,  58,  77, 118, 173, 246,  76, 254,     //  9
    133, 232, 196, 144, 198, 124,  53,   4, 108,  74, 223, 234, 134, 230, 157, 139,     // 10
    189, 205, 199, 128, 176,  19, 211, 236, 127, 192, 231,  70, 233,  88, 146, 44,      // 11
    183, 201,  22,  83,  13, 214, 116, 109, 159,  32,  95, 226, 140, 220,  57, 12,      // 12
    221,  31, 209, 182, 143,  92, 149, 184, 148,  62, 113,  65,  37,  27, 106, 166,     // 13
    3,    14, 204,  72,  21,  41,  56,  66,  28, 193,  40, 217,  25,  54, 179, 117,     // 14
    238,  87, 240, 155, 180, 170, 242, 212, 191, 163,  78, 218, 137, 194, 175, 110,     // 15
    43,  119, 224,  71, 122, 142,  42, 160, 104,  48, 247, 103,  15,  11, 138, 239      // 16
};

uint64_t hash_pearson16_get
(
    char_t x[],
    uint32_t len
)
{
    uint64_t ans;
    uint64_t magic;
    uint32_t i;
    uint32_t j;
    uint8_t  h;
    uint8_t  hh[8];

    ans = 0;
    for (j = 0; j < 8; j++)
    {
        h = pearson_table[(x[0] + j) remaind 256];
        for (i = 1; i < len; i++)
        {
            h = pearson_table[h bitwiseExOr x[i]];
        }
        hh[j] = h;
        magic = uint8_to_uint64(h);
        magic = (magic shiftLeft(j * 8)); /* in reverse order */
        ans = ans bitwiseOr magic;
    }

    return ans;
}

Even Pearson's algorithm! For a given chunk of data, it produces only an 8-bit byte, 0-255, however, the algorithm makes it extremely easy to generate a hash of whatever length is desired.

It doesn't solve my problem, anyway it's something good I have learned

legacy · « **Reply #18 on:** August 27, 2019, 03:15:46 pm »

Quote from: legacy on August 26, 2019, 11:08:30 am

There are two alternatives to the hash idea:
-1- (neat solution, requires a lot of effort) to make the b+tree algorithm polymorphic on its data record and data manipulation methods, so a string will be handled as a string, preserving order by lexicographic ordering.
-2- (workaround) to convert each char of a string of 20 chars into 6 bit representing the subgroup of allowed chards, to compose a vector of 4 uint32_t, handled as an atomic group of 4 keys by the current implementation of the b+tree

Thinking about planning -1-, I wanted to get an idea on the required changes, so I wrote a sugar-module to abstract all the operations made on the key-field by the B+tree algorithm. Changing the key from uint32_t to string only requires the above methods.

Code: [Select]

void item_clean(p_char_t target);
void item_copy(p_char_t target, p_char_t source);
boolean_t item_cmp_is_grt(p_char_t source0, p_char_t source1); /* A > B */
boolean_t item_cmp_is_get(p_char_t source0, p_char_t source1); /* A >= B*/
boolean_t item_cmp_is_lst(p_char_t source0, p_char_t source1); /* A <B */
boolean_t item_cmp_is_let(p_char_t source0, p_char_t source1); /* A <= B */
boolean_t item_cmp_is_eq(p_char_t source0, p_char_t source1); /* A = B */
boolean_t item_show(p_char_t source0);
void item_input(p_char_t target0);

(methods)

Only the following functions require the above methods:

Code: [Select]

public p_btree_node_t  btree_key_find_leaf /* cmp method */
public btree_ans_t btree_key_find /* cmp method */
public btree_ans_t btree_key_find_star /* cmp method */
private uint32_t  node_insert_into_tree_get_insertion_point /* cmp method */
private p_btree_node_t  node_insert_into_tree /* cmp method */
private p_btree_node_t node_insert_into_tree_after_splitting /* cmp method */

So I have just modified those functions in the b+tree to use the sugar-code.

Code: [Select]

OrangeCube # nano project.list.myprj

#item_uint32
item_string
.

Code: [Select]

OrangeCube # make clean_all
OrangeCube # mybuild-project-makefile-v9
- cleaning all
- touching interface private
- touching interface public
- generating interface private for item_string/lib_btree_item_v1
- generating interface private for app
- generating interface private for app_helper
- generating interface private for lib_block_alloc_v1
- generating interface private for lib_btree_simple_toy_v1
- generating interface public for item_string/lib_btree_item_v1.c
- generating interface public for app.c
- generating interface public for app_helper.c
- generating interface public for lib_block_alloc_v1.c
- generating interface public for lib_btree_simple_toy_v1.c

I have rebuilt the project, so structures are now considering the correct data type for the key (string, instead of integer)

Code: [Select]

OrangeCube # make
- compiling item_string/lib_btree_item_v1 ... done
- compiling app ... done
- compiling app_helper ... done
- compiling lib_block_alloc_v1 ... done
- compiling lib_btree_simple_toy_v1 ... done
- linking to app

The builder has accepted all the changes without complaing.

Code: [Select]

OrangeCube # make run
sandbox (out/OrangeCube-HPPA2/app) ...

> h
i: insert
l: leaves
d: delete
f: find
a: find the closest

> i hallo 1
> i hAllo 2
> i hello 3
> i hEllo 4
> i home 5
> i hOme 6
> i hill 7
> i hillside 8
> i hallway 9
> l
+
{ hAllo (2) hEllo (4) hOme (6) hallo (1) hallway (9) }
{ hello (3) hill (7) hillside (8) home (5) }
> g hi
coin_closest {L,R}={home ,hi }
home (5)
> g ha
coin_closest {L,R}={hallway ,ha }
hallway (9)
>

And the result seems working. But this way I have to seriously think about introducing the polymorphism, which means abstracting the data-structure and introducing a mechanism for overloading operators (let_assign, compare{==, <, >, <=, >=}, show, etc) and methods.

At the moment the above code needs to be recompiled for a specific datatype, so if I wanted to go back to integer-keys I would have to do the following

Code: [Select]

OrangeCube # nano project.list.myprj

item_uint32
#item_string
.

Code: [Select]

OrangeCube # make clean_all
OrangeCube # mybuild-project-makefile-v9
OrangeCube # make

magic · « **Reply #19 on:** August 27, 2019, 07:05:24 pm »

Now add a "key length" argument and write a wrapper type which casts ints to 4-byte strings and stores them on a 4-byte string tree.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: hash of the string, preserving ordering? (Read 2130 times)

Share me