Thread: Hashing string issue

  1. #1
    Registered User
    Join Date
    Sep 2012
    Posts
    34

    Hashing string issue

    Hi All,

    I am currently making a hash table to hold strings.

    My first guess were to add the ascii values but I was told that

    hello and elhol would add up to be the same value (which is expected).

    I was told I should do something called weighting but I do not know what that is. Can someone please explain or send a link to where can teach me this.

  2. #2
    Registered User
    Join Date
    Sep 2012
    Posts
    357

  3. #3
    Registered User
    Join Date
    Sep 2012
    Posts
    34
    Quote Originally Posted by qny View Post
    yip but oh well, whatever

  4. #4
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    948
    Note that any hash function will necessarily have collisions (two different string inputs give the same hash), and your hash table implementation will have to deal with this somehow.

  5. #5
    Registered User
    Join Date
    Sep 2012
    Posts
    34
    Quote Originally Posted by christop View Post
    Note that any hash function will necessarily have collisions (two different string inputs give the same hash), and your hash table implementation will have to deal with this somehow.
    Understood, one question that I am getting problem to understand.

    I have to make a program to take words of a file and enter it into a hash table.

    I have the solution to getting the words, I made the word into an integer value. But my problem is understand what to do next.

    I have the word, I have the location. I would think I would put the word in the table (ex, char table[100]). Lets say the word is hello( in variable word) which gave me an integer of 10. The hash function gave me a location of 0 for the word hello. How do I put hello in the location.

    Would it just be like
    Code:
    while (i!=max)
    {
    table[i]=word[i];
    i++;
    }
    I was looking at http://research.cs.vt.edu/AVresearch...ng/strings.php where this person got the word in the array in one location. How do I do that?

  6. #6
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    I'd use a C99 structure for each string, something like
    Code:
    struct string {
        unsigned long  hash;
        size_t         size;   /* Allocated */
        size_t         used;   /* Used */
        unsigned char  data[];
    };
    and the XOR variant of the Bernstein hash function, hashn = (33 hashn-1) XOR charn, with hash-1 = 0. For example,
    Code:
    static inline unsigned long hash_data(const void *const data, const size_t size)
    {
        const unsigned char        *head = (const unsigned char *)data;
        const unsigned char *const  tail = (const unsigned char *)data + size;
        unsigned long               curr = 0UL;
    
        while (head < tail)
            curr = (33UL * curr) ^ (*(head++));
    
        return curr;
    }
    The hash values stored for each string never change, unless the string changes. If you have a hash table with N entries, the index for the entry with hash H is H modulo N, i.e. (H % N) in C. Thus, to grow or shrink a hash table, you only need to move and relink the entries, not rehash any data. (And, when looking for a match, you don't need to compare the contents of most of the strings, as the unsigned long hash is likely to differ, even though the hashes map to the same hash table entry. Very efficient, well worth the 4 or 8 bytes per string it costs.)

    If you wish to use a linked list for the hash entries, you could just add a next pointer to the string struct.

    You can use the hash value zero, 0UL, as a special "unknown" hash value. Then, your string manipulation functions can simply reset the hash to zero, instead of rehashing the string. Whenever you do need the hash function, you recalculate the hash if it is zero. (The zero hash result is so rare you don't need to worry about it. If you want, you can have your hash functions return a fixed constant instead of zero for non-empty data, but I would not bother.) If you append to a string, you don't need to rehash the existing string; you can just rehash the appended part if you know the hash of the existing part.

    When comparing two strings for equivalence, first check if their lengths match. If they don't, then they do not match. If both have a nonzero hash, and the hashes do not match, then the strings do not match. If the string data compares nonzero using memcmp(), the strings do not match. Otherwise, they do match.

    Note: this approach supports any binary data, including embedded NUL bytes ('\0'). Very useful if you use POSIX getline() or getdelim() to read input lines.

  7. #7
    Registered User
    Join Date
    Sep 2012
    Posts
    34
    Actually, I need to do it using open addressing double hashing. I would think the solution you gave was for using Chaining. My only problem right now is how to store a string data into the array.
    Last edited by Ramkiller1; 10-29-2012 at 07:38 PM.

  8. #8
    Registered User
    Join Date
    Sep 2012
    Posts
    34
    Actually, I understand now what to do. Sorry it took me some time to understand I had to use a structure

  9. #9
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by Ramkiller1 View Post
    I would think the solution you gave was for using Chaining
    You'd think wrong.

    For open addressing, also known as closed hashing, each hash table entry has just one string, and collisions are resolved by probing different entries.
    Code:
    struct string {
        unsigned long   hash;
        size_t          size;   /* Allocated */
        size_t          used;   /* Used */
        unsigned char   data[];
    };
    
    struct hash_table {
        size_t          size;
        size_t          used;
        struct string **item;
    };
    For double hashing, each string is hashed twice, with different algorithms. If the hashes for a string are hash1 and hash2, the initial entry is (hash1 modulo size) in an open addressing double hash table. If that entry does not match (or is already taken if adding an entry into the table), then entries ((hash1 + n hash2) modulo size) are checked for n = 1, 2, 3, .. until an empty entry is found, or until ((hash1 + n hash2) modulo size == hash1 modulo size).

    The string struct only saves the first hash (hash1 in the above parlance), and assumes the second hash is recalculated whenever you need it. I've never used double hashing -- it has never fit my use cases --, so I cannot really tell whether it would be worth it to save also the second hash for each string. The hash table is mostly empty, so the second hash is only needed when the first probe fails, and when rehashing the table.

    There are other approaches for open-addressing double-hashing string hash tables, of course. For example, if your hashed strings are immutable (never modified, just removed/added), and are only stored within the hash table, you could use
    Code:
    struct hash_item {
        unsigned long     hash1;
        unsigned long     hash2;
        size_t            offset;
        size_t            length;
    };
    
    struct hash_table {
        size_t            size;
        size_t            used;
        struct hash_item *item;
        size_t            area_size;
        size_t            area_used;
        char             *area;
    };
    where individual strings are saved in a single, contiguous memory area controlled by the hash table. If you were to store the longest strings first in the area, and you use the length instead of relying on an end-of-string '\0', you can trivially compact the string area by reusing the same data for multiple strings. (When adding a new string, you'd use memmem() or strstr() to see if the same data already exists, and if so, reuse it; otherwise append it to the area.)

    In the latter example structures, because the individual strings are named using offsets to the hash table string area, the string area can be resized when necessary. If pointers were used, they'd all have to be adjusted after a resize since the start address usually changes then too.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Hashing a string
    By jeanermand in forum C Programming
    Replies: 4
    Last Post: 03-12-2012, 06:58 AM
  2. hashing issue. Help!
    By yapic in forum C++ Programming
    Replies: 5
    Last Post: 05-16-2010, 01:44 AM
  3. String hashing
    By rags123 in forum C Programming
    Replies: 1
    Last Post: 04-03-2010, 07:47 AM
  4. String hashing
    By jack_carver in forum C Programming
    Replies: 19
    Last Post: 09-11-2009, 07:32 PM
  5. hashing a string
    By cozman in forum C++ Programming
    Replies: 2
    Last Post: 12-31-2001, 12:42 PM