I think you need to split your functions into smaller pieces, so that you can test each one separately and verify that it works (unit testing).
Seeing that you can handle dynamic memory management, why don't you create a hash table and hash table entry types, and then the functions needed to manipulate them?
I like C99 flexible array members (the array size is determined at run time, so the structure must be allocated dynamically), so I would personally go with
Code:
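#include <stdio.h>   /* FILE */
#include <stdlib.h>  /* size_t, malloc(), free() */

struct hashentry {
    struct hashentry *next;    /* Next entry in the same chain */
    unsigned long     hash;    /* Full, unmodified hash of data[] */
    size_t            size;    /* Number of hashed bytes in data[] */
    unsigned char     data[];  /* C99 flexible array member */
};

struct hashtable {
    unsigned long     entries; /* Number of chains in entry[] */
    struct hashentry *entry[]; /* Chain heads; C99 flexible array member */
};
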
struct hashentry *hash_getline(FILE *const input);
struct hashentry *hash_string(const char *const string);
struct hashentry *hash_data(const void *const data, const size_t length);
struct hashentry *hash_free(struct hashentry *const hash); /* Always returns NULL */
struct hashtable *hashtable_create(const unsigned long entries);
int hashtable_add(struct hashtable *const table, struct hashentry *const entry);
struct hashentry *hashtable_detach(struct hashtable *const table, const unsigned long hash, const void *const data, const size_t length);
struct hashtable *hashtable_destroy(struct hashtable *const table); /* Always returns NULL */
/* Notes: Bernstein hash, hash = (hash * 33UL) ^ ((unsigned long)(newchar)), is okay.
 * The chain for hash h in table t starts at (t->entry[h % t->entries]).
 * Store hashes unmodified, so you can rehash the table: reallocate it, then relink entries;
 * no need to recalculate any hashes from data. (The h->hash values do not change.)
 * Two hash entries, h1 and h2, are the same if and only if
 *     (h1->size == h2->size &&
 *      h1->hash == h2->hash &&
 *      !memcmp(h1->data, h2->data, h1->size))
 * The hash entries are suitable for binary data. For ease of use, I recommend always
 * reserving room for and appending a '\0' after the hashed data, i.e. h->data[h->size] = '\0'.
 */
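To make the notes above concrete, here is a minimal sketch of how hash_data(), hash_string(), and hash_free() might look. It is one possible reading of the prototypes, not a reference implementation. The starting value 5381 (HASH_SEED) is my assumption, since the notes only give the update step, and I also assume hash_free() releases a single entry rather than a whole chain.

```c
#include <stdlib.h>
#include <string.h>

struct hashentry {
    struct hashentry *next;
    unsigned long     hash;
    size_t            size;
    unsigned char     data[];   /* C99 flexible array member */
};

/* Assumed starting value; the notes only specify the update step. */
#define HASH_SEED 5381UL

/* Copy 'length' bytes into a new entry, hashing them on the way.
 * Returns NULL if the arguments are invalid or memory runs out. */
struct hashentry *hash_data(const void *const data, const size_t length)
{
    const unsigned char *const src = data;
    struct hashentry *entry;
    unsigned long hash = HASH_SEED;
    size_t i;

    if (!src && length > 0)
        return NULL;

    /* Refuse lengths that would overflow the allocation size. */
    if (length > (size_t)-1 - sizeof *entry - 1)
        return NULL;

    /* One allocation: header, payload, and one extra byte for '\0'. */
    entry = malloc(sizeof *entry + length + 1);
    if (!entry)
        return NULL;

    for (i = 0; i < length; i++) {
        hash = (hash * 33UL) ^ (unsigned long)src[i];
        entry->data[i] = src[i];
    }
    entry->data[length] = '\0';   /* Not counted in entry->size */

    entry->next = NULL;
    entry->hash = hash;
    entry->size = length;
    return entry;
}

/* Hash a C string; the terminating '\0' is not part of the hashed data. */
struct hashentry *hash_string(const char *const string)
{
    return string ? hash_data(string, strlen(string)) : NULL;
}

/* Assumption: frees one entry, not the chain behind it. Always returns NULL. */
struct hashentry *hash_free(struct hashentry *const hash)
{
    free(hash);
    return NULL;
}
```

Because the header and the data live in one allocation, a single free() releases the whole entry, and returning NULL lets you write entry = hash_free(entry); so no stale pointer is left behind.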
A test program that does what you described should take about 300 lines. I wrote a test implementation: hash_getline() alone took me 100 lines, because it skips leading and trailing whitespace, calculates the Bernstein hash inline (as it reads the input), and accepts any newline convention. The other functions took about 100 lines altogether. The last 100 lines were main(), which reads the word list from one file into a hash table and removes entries based on another file, and also handles verbose output, command-line parameter checking (including help/usage output if no parameters are given), and so on.
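For illustration only, here is a much shorter sketch of the hash_getline() idea; it is not the 100-line version mentioned above. The assumptions are mine: the hash starts at 5381, both '\r' and '\n' end a line (so the second half of a CR LF pair is skipped as leading whitespace on the next call), empty lines are skipped, and NULL means end of input or an allocation failure. To trim trailing whitespace while hashing inline, it remembers the length and hash as of the last non-whitespace character.

```c
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

struct hashentry {
    struct hashentry *next;
    unsigned long     hash;
    size_t            size;
    unsigned char     data[];
};

struct hashentry *hash_getline(FILE *const input)
{
    struct hashentry *entry = NULL, *temp;
    size_t size = 0;                 /* Bytes allocated for data[] */
    size_t used = 0;                 /* Bytes read into data[] so far */
    size_t kept = 0;                 /* Length up to last non-whitespace */
    unsigned long hash = 5381UL;     /* Running hash (assumed seed) */
    unsigned long kept_hash = 5381UL;/* Hash of the first 'kept' bytes */
    int c;

    if (!input)
        return NULL;

    /* Skip leading whitespace, including leftover newline characters. */
    do {
        c = getc(input);
    } while (c != EOF && isspace(c));
    if (c == EOF)
        return NULL;

    while (c != EOF && c != '\n' && c != '\r') {
        /* Keep room for this character plus the final '\0'. */
        if (used + 1 >= size) {
            size = size ? 2 * size : 64;
            temp = realloc(entry, sizeof *entry + size);
            if (!temp) {
                free(entry);
                return NULL;
            }
            entry = temp;
        }
        entry->data[used++] = (unsigned char)c;
        hash = (hash * 33UL) ^ (unsigned long)c;
        if (!isspace(c)) {
            kept = used;             /* Trailing whitespace never gets here */
            kept_hash = hash;
        }
        c = getc(input);
    }

    entry->next = NULL;
    entry->hash = kept_hash;
    entry->size = kept;
    entry->data[kept] = '\0';
    return entry;
}
```

Whitespace inside a line is kept and hashed; only the leading and trailing runs are dropped. A fuller version would also distinguish end of input from errors (for example via ferror() and errno), which this sketch does not.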
But this is just my personal preference; pick any approach (structures) you like; just write and test the hash table and hash entry functions first.
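If you do go with the structures above, the table functions could be sketched roughly as follows. Treat this as one possible reading of the prototypes rather than the definitive version. My assumptions: hashtable_add() returns 0 on success and -1 on error and does not check for duplicates, and hashtable_rehash() is a hypothetical helper (it is not in the prototype list) that relinks the entries into a new table using the stored hashes.

```c
#include <stdlib.h>
#include <string.h>

struct hashentry {
    struct hashentry *next;
    unsigned long     hash;
    size_t            size;
    unsigned char     data[];
};

struct hashtable {
    unsigned long     entries;
    struct hashentry *entry[];
};

/* Allocate a table with 'entries' empty chains. Returns NULL on failure. */
struct hashtable *hashtable_create(const unsigned long entries)
{
    struct hashtable *table;
    unsigned long i;
    if (entries < 1UL ||
        entries > ((size_t)-1 - sizeof *table) / sizeof table->entry[0])
        return NULL;
    table = malloc(sizeof *table + (size_t)entries * sizeof table->entry[0]);
    if (!table)
        return NULL;
    table->entries = entries;
    for (i = 0UL; i < entries; i++)
        table->entry[i] = NULL;
    return table;
}

/* Prepend the entry to its chain. Assumed: 0 on success, -1 on error. */
int hashtable_add(struct hashtable *const table, struct hashentry *const entry)
{
    if (!table || !entry)
        return -1;
    entry->next = table->entry[entry->hash % table->entries];
    table->entry[entry->hash % table->entries] = entry;
    return 0;
}

/* Unlink and return the first entry matching hash, length, and data. */
struct hashentry *hashtable_detach(struct hashtable *const table, const unsigned long hash,
                                   const void *const data, const size_t length)
{
    struct hashentry **link, *curr;
    if (!table || (!data && length > 0))
        return NULL;
    link = &table->entry[hash % table->entries];
    while ((curr = *link) != NULL) {
        if (curr->hash == hash && curr->size == length &&
            (length == 0 || !memcmp(curr->data, data, length))) {
            *link = curr->next;
            curr->next = NULL;
            return curr;
        }
        link = &curr->next;
    }
    return NULL;
}

/* Hypothetical helper: move all entries to a new table of a different size.
 * On failure it returns NULL and leaves the old table untouched. */
struct hashtable *hashtable_rehash(struct hashtable *const old, const unsigned long entries)
{
    struct hashtable *const table = old ? hashtable_create(entries) : NULL;
    unsigned long i;
    if (!table)
        return NULL;
    for (i = 0UL; i < old->entries; i++) {
        struct hashentry *curr = old->entry[i];
        while (curr) {
            struct hashentry *const next = curr->next;
            hashtable_add(table, curr);   /* Uses the stored curr->hash */
            curr = next;
        }
    }
    free(old);
    return table;
}

/* Free every entry, then the table itself. Always returns NULL. */
struct hashtable *hashtable_destroy(struct hashtable *const table)
{
    unsigned long i;
    for (i = 0UL; table && i < table->entries; i++)
        while (table->entry[i]) {
            struct hashentry *const next = table->entry[i]->next;
            free(table->entry[i]);
            table->entry[i] = next;
        }
    free(table);
    return NULL;
}
```

Each function is small enough to test on its own: add an entry and check which chain it landed in, rehash and check that it moved to hash % new size, detach it and check that the chain is empty again.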