Hashing string issue

**Ramkiller1** · 10-29-2012

Hi All,

I am currently making a hash table to hold strings.

My first guess were to add the ascii values but I was told that

hello and elhol would add up to be the same value (which is expected).

I was told I should do something called weighting but I do not know what that is. Can someone please explain or send a link to where can teach me this.

**qny** · 10-29-2012

Have you tried the programmer's best friend?

**Ramkiller1** · 10-29-2012

Originally Posted by qny

Have you tried the programmer's best friend?

yip but oh well, whatever

**christop** · 10-29-2012

Note that any hash function will necessarily have collisions (two different string inputs give the same hash), and your hash table implementation will have to deal with this somehow.

**Ramkiller1** · 10-29-2012

Originally Posted by christop

Note that any hash function will necessarily have collisions (two different string inputs give the same hash), and your hash table implementation will have to deal with this somehow.

Understood, one question that I am getting problem to understand.

I have to make a program to take words of a file and enter it into a hash table.

I have the solution to getting the words, I made the word into an integer value. But my problem is understand what to do next.

I have the word, I have the location. I would think I would put the word in the table (ex, char table[100]). Lets say the word is hello( in variable word) which gave me an integer of 10. The hash function gave me a location of 0 for the word hello. How do I put hello in the location.

Would it just be like

Code:

while (i!=max)
{
table[i]=word[i];
i++;
}

I was looking at http://research.cs.vt.edu/AVresearch...ng/strings.php where this person got the word in the array in one location. How do I do that?

**Nominal Animal** · 10-29-2012

I'd use a C99 structure for each string, something like

Code:

struct string {
    unsigned long  hash;
    size_t         size;   /* Allocated */
    size_t         used;   /* Used */
    unsigned char  data[];
};

and the XOR variant of the Bernstein hash function, hash_n = (33 hash_n-1) XOR char_n, with hash_-1 = 0. For example,

Code:

static inline unsigned long hash_data(const void *const data, const size_t size)
{
    const unsigned char        *head = (const unsigned char *)data;
    const unsigned char *const  tail = (const unsigned char *)data + size;
    unsigned long               curr = 0UL;

    while (head < tail)
        curr = (33UL * curr) ^ (*(head++));

    return curr;
}

The hash values stored for each string never change, unless the string changes. If you have a hash table with N entries, the index for the entry with hash H is H modulo N, i.e. (H % N) in C. Thus, to grow or shrink a hash table, you only need to move and relink the entries, not rehash any data. (And, when looking for a match, you don't need to compare the contents of most of the strings, as the unsigned long hash is likely to differ, even though the hashes map to the same hash table entry. Very efficient, well worth the 4 or 8 bytes per string it costs.)

If you wish to use a linked list for the hash entries, you could just add a next pointer to the string struct.

You can use the hash value zero, 0UL, as a special "unknown" hash value. Then, your string manipulation functions can simply reset the hash to zero, instead of rehashing the string. Whenever you do need the hash function, you recalculate the hash if it is zero. (The zero hash result is so rare you don't need to worry about it. If you want, you can have your hash functions return a fixed constant instead of zero for non-empty data, but I would not bother.) If you append to a string, you don't need to rehash the existing string; you can just rehash the appended part if you know the hash of the existing part.

When comparing two strings for equivalence, first check if their lengths match. If they don't, then they do not match. If both have a nonzero hash, and the hashes do not match, then the strings do not match. If the string data compares nonzero using memcmp(), the strings do not match. Otherwise, they do match.

Note: this approach supports any binary data, including embedded NUL bytes ('\0'). Very useful if you use POSIX getline() or getdelim() to read input lines.

**Ramkiller1** · 10-29-2012

Actually, I need to do it using open addressing double hashing. I would think the solution you gave was for using Chaining. My only problem right now is how to store a string data into the array.

**Ramkiller1** · 10-29-2012

Actually, I understand now what to do. Sorry it took me some time to understand I had to use a structure

**Nominal Animal** · 10-29-2012

Originally Posted by Ramkiller1

I would think the solution you gave was for using Chaining

You'd think wrong.

For open addressing, also known as closed hashing, each hash table entry has just one string, and collisions are resolved by probing different entries.

Code:

struct string {
    unsigned long   hash;
    size_t          size;   /* Allocated */
    size_t          used;   /* Used */
    unsigned char   data[];
};

struct hash_table {
    size_t          size;
    size_t          used;
    struct string **item;
};

For double hashing, each string is hashed twice, with different algorithms. If the hashes for a string are hash₁ and hash₂, the initial entry is (hash₁ modulo size) in an open addressing double hash table. If that entry does not match (or is already taken if adding an entry into the table), then entries ((hash₁ + n hash₂) modulo size) are checked for n = 1, 2, 3, .. until an empty entry is found, or until ((hash₁ + n hash₂) modulo size == hash₁ modulo size).

The string struct only saves the first hash (hash₁ in the above parlance), and assumes the second hash is recalculated whenever you need it. I've never used double hashing -- it has never fit my use cases --, so I cannot really tell whether it would be worth it to save also the second hash for each string. The hash table is mostly empty, so the second hash is only needed when the first probe fails, and when rehashing the table.

There are other approaches for open-addressing double-hashing string hash tables, of course. For example, if your hashed strings are immutable (never modified, just removed/added), and are only stored within the hash table, you could use

Code:

struct hash_item {
    unsigned long     hash1;
    unsigned long     hash2;
    size_t            offset;
    size_t            length;
};

struct hash_table {
    size_t            size;
    size_t            used;
    struct hash_item *item;
    size_t            area_size;
    size_t            area_used;
    char             *area;
};

where individual strings are saved in a single, contiguous memory area controlled by the hash table. If you were to store the longest strings first in the area, and you use the length instead of relying on an end-of-string '\0', you can trivially compact the string area by reusing the same data for multiple strings. (When adding a new string, you'd use memmem() or strstr() to see if the same data already exists, and if so, reuse it; otherwise append it to the area.)

In the latter example structures, because the individual strings are named using offsets to the hash table string area, the string area can be resized when necessary. If pointers were used, they'd all have to be adjusted after a resize since the start address usually changes then too.

Thread: Hashing string issue

Thread Tools

Search Thread

Display

Hashing string issue

Similar Threads

Hashing a string

hashing issue. Help!

String hashing

String hashing

hashing a string