If you happen to have a C99 compiler and a POSIX.1-2008-compatible C library (basically all of them are, except Microsoft's), the code for reading hash table entries as lines from any standard C stream is pretty straightforward:
Code:
#define _ISOC99_SOURCE
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <errno.h>
struct entry {
    struct entry *next;  /* Next entry in the same hash chain */
    long          key;   /* Parsed key */
    size_t        size;  /* Length of data, excluding the '\0' */
    char          data[];
};
struct entry *fentry(char **const dataptr, size_t *const sizeptr,
                     FILE *const input)
{
    ssize_t len;
    size_t datalen;
    long key;
    int keyend, head;
    struct entry *ent;

    len = getline(dataptr, sizeptr, input);
    if (len < (ssize_t)1) {
        if (ferror(input))
            errno = EIO;    /* Read error */
        else
        if (!feof(input))
            errno = ENOMEM; /* Out of memory */
        else
            errno = 0;      /* End of input */
        return NULL;
    }

    /* Parse key. */
    keyend = head = -1;
    (void)sscanf(*dataptr, " %ld%n %n", &key, &keyend, &head);
    if (head == keyend) {
        errno = EDOM; /* Input formatting error */
        return NULL;
    }

    /* Trim trailing whitespace off from the line.
       Note the cast: plain char may be signed, and isspace() has
       undefined behavior for negative values other than EOF. */
    while (len > (ssize_t)0 && isspace((unsigned char)(*dataptr)[len - 1]))
        len--;

    /* Data is all that follows head, up till the newline. */
    if ((ssize_t)head < len)
        datalen = (size_t)len - (size_t)head;
    else
        datalen = 0;

    /* Allocate memory for the new entry, data, and end-of-string mark. */
    ent = malloc(datalen + 1 + sizeof *ent);
    if (!ent) {
        errno = ENOMEM;
        return NULL;
    }

    /* Fill in the entry structure. */
    ent->next = NULL;
    ent->key = key;
    ent->size = datalen;
    if (datalen > 0)
        memmove(ent->data, (*dataptr) + head, datalen);
    ent->data[datalen] = '\0';

    return ent;
}
The above code can be made C89-compliant, but I like C99 and POSIX.1-2008 better. It ignores leading and trailing whitespace on each line, and because the data length is stored explicitly in the size field, it handles embedded NUL bytes ('\0') correctly.
The
" %ld%n %n"
scanning pattern may require a bit of an explanation. It basically reads: skip any amount of whitespace (possibly none), convert and assign a long integer, store the number of characters consumed so far into an int, skip any amount of whitespace (possibly none), and store the number of characters consumed so far into another int.
Because different implementations of sscanf() may or may not count %n in the result, I cannot rely on the sscanf() return value. I instead preset keyend (the variable corresponding to the first %n) and head (the variable corresponding to the second %n) to -1. If these are unchanged, the conversion failed.
If you look closely, I actually only verify that head != keyend. Because %n stores a nonnegative value, and head can never be smaller than keyend when both are assigned, that test verifies that the initial long was matched and assigned (as otherwise both would still be -1, and therefore equal), and that at least one whitespace character followed it (as otherwise head would equal keyend).
I did this because I didn't want to accept data such as '1234abc'; I require some kind of space in between the key and the value.
To read entries (lines) from, say, standard input, you can first try the following main():
Code:
int main(void)
{
    char *cache_data = NULL;
    size_t cache_size = 0;
    struct entry *curr = NULL;

    while ((curr = fentry(&cache_data, &cache_size, stdin))) {
        printf("Key %ld, data '%s' (length %lu)\n",
               curr->key, curr->data, (unsigned long)curr->size);
        free(curr);
    }

    if (errno) {
        const char *const errmsg = strerror(errno);
        fprintf(stderr, "Error reading standard input: %s.\n", errmsg);
        return 1;
    }

    free(cache_data);
    cache_data = NULL;
    cache_size = 0;

    return 0;
}
To create a hash table, first allocate an array of hash table slots, i.e. pointers to struct entry:
Code:
struct entry **hashtable_entry; /* Array of pointers */
unsigned long  hashtable_size;  /* Number of pointers allocated */
Initialize all the pointers to NULL, when you create the (empty) hash table.
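A sketch of that allocation and initialization (the function name hashtable_create is mine; the entry structure is repeated here only to keep the snippet self-contained):

```c
#include <stdlib.h>

struct entry {
    struct entry *next;
    long key;
    size_t size;
    char data[];
};

/* Allocate a hash table of 'size' chains, all initially empty.
   Returns NULL if the allocation fails. */
static struct entry **hashtable_create(unsigned long size)
{
    struct entry **table = malloc(size * sizeof *table);
    unsigned long i;
    if (!table)
        return NULL;
    for (i = 0; i < size; i++)
        table[i] = NULL; /* explicit NULL init; portable and cheap */
    return table;
}
```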
Because your hash keys are already integer numbers, your hash function can be as simple as
Code:
((unsigned long)key) % hashtable_size
in other words, the key converted to unsigned long (negative keys wrap around, which is well-defined in C), modulo the hash table size. All possible key values will map to [0, hashtable_size-1] this way.
Inserting a new entry curr into the hash table is as simple as
Code:
{
    const unsigned long i = ((unsigned long)curr->key) % hashtable_size;
    curr->next = hashtable_entry[i];
    hashtable_entry[i] = curr;
}
Note that the above actually prepends the new entry in front of existing hashed entries, because it's simpler. (It turns out that in practice, it does not affect performance.)
A loop over all hash table entries having key key is as simple as
Code:
{
    struct entry *ent = hashtable_entry[((unsigned long)key) % hashtable_size];
    while (ent) {
        if (ent->key == key) {
            /* Yes, ent is one of the entries with the desired key */
        }
        ent = ent->next;
    }
}
When the number of entries is not known beforehand, the hash table size is basically a guess. Since the actual hash table does not use that much memory (4+4N bytes on 32-bit architectures, and 8+8N bytes on 64-bit architectures), it makes sense to start with a large hash table, say one with 4093 or 8191 entries.
Having a hash table "too large" is never an issue; the only cost is the memory used (not much!) and the CPU time used to initialize the pointers to NULL (even less!).
When you start having collisions (multiple entries chained on the same hash slot), you can recreate the hash table at a larger size. You simply create a new array, and insert each entry from the old array into the new one.
The point where rehashing becomes useful depends heavily on the data. In this case, I suspect an acceptable starting point would be to increase the hash table size by, say, 50% whenever there are more entries in total than there are hash table slots. That approach should scale well to millions of entries.
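A growth pass could look like the following sketch (the function name rehash is mine; the entry structure is repeated only for self-containedness). Each entry is detached from its old chain and prepended to its new one, so no entry memory is copied or reallocated:

```c
#include <stdlib.h>

struct entry {
    struct entry *next;
    long key;
    size_t size;
    char data[];
};

/* Move all entries from table (old_size slots) into a newly allocated
   table of new_size slots.  Returns the new table, or NULL (leaving
   the old table intact) if the allocation fails. */
static struct entry **rehash(struct entry **table,
                             unsigned long old_size,
                             unsigned long new_size)
{
    /* calloc() zero-fill yields NULL pointers on all platforms POSIX
       cares about; use an explicit loop if you want strict C purity. */
    struct entry **newtable = calloc(new_size, sizeof *newtable);
    unsigned long i;

    if (!newtable)
        return NULL;

    for (i = 0; i < old_size; i++) {
        while (table[i]) {
            struct entry *curr = table[i];
            const unsigned long j = ((unsigned long)curr->key) % new_size;
            table[i] = curr->next;    /* detach from the old chain */
            curr->next = newtable[j]; /* prepend to the new chain */
            newtable[j] = curr;
        }
    }
    free(table);
    return newtable;
}
```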
If the order of the entries in the file was also meaningful, you could either add an index variable, or a second pointer, to the entry structure. I'd prefer a second pointer. When reading new entries, I'd prepend the new entry into the singly-linked list via those additional pointers. (If you have to, you can always reverse the list afterwards in a single pass, if you need them in the original order; prepending causes the list to be in reverse order.)
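As a sketch of that second-pointer idea (the field name order_next and the function name reverse_order are mine, not from the code above): the entry grows one extra link, and the insertion-order list can be put back into file order with a single-pass reversal:

```c
#include <stddef.h>

/* The entry extended with a second link that chains all entries in
   (reversed) input order, independently of the hash chains. */
struct entry2 {
    struct entry2 *next;       /* hash chain, as before */
    struct entry2 *order_next; /* insertion-order chain */
    long key;
    size_t size;
    char data[];
};

/* Reverse the insertion-order list in a single pass.  Since reading
   prepends entries, this restores the original file order.
   Returns the new head of the order list. */
static struct entry2 *reverse_order(struct entry2 *head)
{
    struct entry2 *prev = NULL;
    while (head) {
        struct entry2 *next = head->order_next;
        head->order_next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```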
So, as you can see, the main thing here is to have a robust, reliable function that reads your input correctly, into a useful data structure. Test it first, so you know it works. You should test it not only for correctness (that it parses your data correctly), but also that it is fast enough for your needs. I/O is surprisingly often the bottleneck!
Adding the hash table stuff, and manipulating the data read, is pretty simple afterwards.