If you happen to have a C99 compiler and a POSIX.1-2008-compatible C library (basically all of them are, except Microsoft's), the code for reading hash table entries as lines from any standard C stream is pretty straightforward:
Code:
#define _ISOC99_SOURCE
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <errno.h>
struct entry {
    struct entry *next;  /* Next entry in the same hash chain */
    long          key;   /* Parsed key */
    size_t        size;  /* Length of data, excluding the '\0' */
    char          data[];
};
struct entry *fentry(char **const dataptr, size_t *const sizeptr,
                     FILE *const input)
{
    ssize_t len;
    size_t datalen;
    long key;
    int keyend, head;
    struct entry *ent;

    len = getline(dataptr, sizeptr, input);
    if (len < (ssize_t)1) {
        if (ferror(input))
            errno = EIO;    /* Read error */
        else
        if (!feof(input))
            errno = ENOMEM; /* Out of memory */
        else
            errno = 0;      /* End of input */
        return NULL;
    }

    /* Parse key. */
    keyend = head = -1;
    (void)sscanf(*dataptr, " %ld%n %n", &key, &keyend, &head);
    if (head == keyend) {
        errno = EDOM; /* Input formatting error */
        return NULL;
    }

    /* Trim trailing whitespace off from the line.
       Note the cast: plain char may be signed, and isspace() has
       undefined behavior for negative values other than EOF. */
    while (len > (ssize_t)0 && isspace((unsigned char)(*dataptr)[len - 1]))
        len--;

    /* Data is all that follows head, up till the newline. */
    if ((ssize_t)head < len)
        datalen = (size_t)len - (size_t)head;
    else
        datalen = 0;

    /* Allocate memory for the new entry, data, and end-of-string mark. */
    ent = malloc(datalen + 1 + sizeof *ent);
    if (!ent) {
        errno = ENOMEM;
        return NULL;
    }

    /* Fill in the entry structure. */
    ent->next = NULL;
    ent->key = key;
    ent->size = datalen;
    if (datalen > 0)
        memmove(ent->data, (*dataptr) + head, datalen);
    ent->data[datalen] = '\0';

    return ent;
}
The above code can be made C89-compliant, but I like C99 and POSIX.1-2008 better. It ignores leading and trailing whitespace on each line, and because the data length is stored explicitly in the size field, it handles embedded NUL bytes ('\0') correctly.
The
" %ld%n %n"
scanning pattern may require a bit of an explanation. It basically reads: skip any amount of whitespace (possibly none), convert and assign a long integer, store the number of characters consumed so far into an int, skip any amount of whitespace (possibly none), and store the number of characters consumed so far into another int.
Because different implementations of sscanf() may or may not count %n in the result, I cannot rely on the sscanf() return value. I instead preset keyend (the variable corresponding to the first %n) and head (the variable corresponding to the second %n) to -1. If these are unchanged, the conversion failed.
If you look closely, I actually only verify that head != keyend. Because %n stores a nonnegative value, and head can never be smaller than keyend when both are assigned, that test verifies that the initial long was matched and assigned (as otherwise both would still be -1, and therefore equal), and that at least one whitespace character followed it (as otherwise head would equal keyend).
I did this because I didn't want to accept data such as '1234abc'; I require some kind of space in between the key and the value.
To read entries (lines) from, say, standard input, you can first try the following main():
Code:
int main(void)
{
    char *cache_data = NULL;
    size_t cache_size = 0;
    struct entry *curr = NULL;

    while ((curr = fentry(&cache_data, &cache_size, stdin))) {
        printf("Key %ld, data '%s' (length %lu)\n",
               curr->key, curr->data, (unsigned long)curr->size);
        free(curr);
    }

    if (errno) {
        const char *const errmsg = strerror(errno);
        fprintf(stderr, "Error reading standard input: %s.\n", errmsg);
        return 1;
    }

    free(cache_data);
    cache_data = NULL;
    cache_size = 0;

    return 0;
}
To create a hash table, first allocate an array of hash table slots, i.e. pointers to struct entry:
Code:
struct entry **hashtable_entry; /* Array of pointers */
unsigned long  hashtable_size;  /* Number of pointers allocated */
Initialize all the pointers to NULL, when you create the (empty) hash table.
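A sketch of that allocation and initialization (the function name hashtable_create is mine; the entry structure is repeated here only to keep the snippet self-contained):

```c
#include <stdlib.h>

struct entry {
    struct entry *next;
    long key;
    size_t size;
    char data[];
};

/* Allocate a hash table of 'size' chains, all initially empty.
   Returns NULL if the allocation fails. */
static struct entry **hashtable_create(unsigned long size)
{
    struct entry **table = malloc(size * sizeof *table);
    unsigned long i;
    if (!table)
        return NULL;
    for (i = 0; i < size; i++)
        table[i] = NULL; /* explicit NULL init; portable and cheap */
    return table;
}
```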
Because your hash keys are already integer numbers, your hash function can be as simple as
Code:
((unsigned long)key) % hashtable_size
in other words, the key converted to unsigned long (negative keys wrap around, which is well-defined in C), modulo the hash table size. All possible key values will map to [0, hashtable_size-1] this way.
Inserting a new entry curr into the hash table is as simple as
Code:
{
    const unsigned long i = ((unsigned long)curr->key) % hashtable_size;
    curr->next = hashtable_entry[i];
    hashtable_entry[i] = curr;
}
Note that the above actually prepends the new entry in front of existing hashed entries, because it's simpler. (It turns out that in practice, it does not affect performance.)
A loop over all hash table entries having key key is as simple as
Code:
{
    struct entry *ent = hashtable_entry[((unsigned long)key) % hashtable_size];
    while (ent) {
        if (ent->key == key) {
            /* Yes, ent is one of the entries with the desired key */
        }
        ent = ent->next;
    }
}
When the number of entries is not known beforehand, the hash table size is basically a guess. Since the actual hash table does not use that much memory (4+4N bytes on 32-bit architectures, and 8+8N bytes on 64-bit architectures), it makes sense to start with a large hash table, say one with 4093 or 8191 entries.
Having a hash table "too large" is never an issue; the only cost is the memory used (not much!) and the CPU time used to initialize the pointers to NULL (even less!).
When you start having collisions (multiple entries chained on the same hash slot), you can recreate the hash table at a larger size. You simply create a new array, and insert each entry from the old array into the new one.
The point where rehashing becomes useful depends heavily on the data. In this case, I suspect an acceptable starting point would be to increase the hash table size by, say, 50% whenever there are more entries in total than there are hash table slots. That approach should scale well to millions of entries.
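A growth pass could look like the following sketch (the function name rehash is mine; the entry structure is repeated only for self-containedness). Each entry is detached from its old chain and prepended to its new one, so no entry memory is copied or reallocated:

```c
#include <stdlib.h>

struct entry {
    struct entry *next;
    long key;
    size_t size;
    char data[];
};

/* Move all entries from table (old_size slots) into a newly allocated
   table of new_size slots.  Returns the new table, or NULL (leaving
   the old table intact) if the allocation fails. */
static struct entry **rehash(struct entry **table,
                             unsigned long old_size,
                             unsigned long new_size)
{
    /* calloc() zero-fill yields NULL pointers on all platforms POSIX
       cares about; use an explicit loop if you want strict C purity. */
    struct entry **newtable = calloc(new_size, sizeof *newtable);
    unsigned long i;

    if (!newtable)
        return NULL;

    for (i = 0; i < old_size; i++) {
        while (table[i]) {
            struct entry *curr = table[i];
            const unsigned long j = ((unsigned long)curr->key) % new_size;
            table[i] = curr->next;    /* detach from the old chain */
            curr->next = newtable[j]; /* prepend to the new chain */
            newtable[j] = curr;
        }
    }
    free(table);
    return newtable;
}
```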
If the order of the entries in the file was also meaningful, you could either add an index variable, or a second pointer, to the entry structure. I'd prefer a second pointer. When reading new entries, I'd prepend the new entry into the singly-linked list via those additional pointers. (If you have to, you can always reverse the list afterwards in a single pass, if you need them in the original order; prepending causes the list to be in reverse order.)
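As a sketch of that second-pointer idea (the field name order_next and the function name reverse_order are mine, not from the code above): the entry grows one extra link, and the insertion-order list can be put back into file order with a single-pass reversal:

```c
#include <stddef.h>

/* The entry extended with a second link that chains all entries in
   (reversed) input order, independently of the hash chains. */
struct entry2 {
    struct entry2 *next;       /* hash chain, as before */
    struct entry2 *order_next; /* insertion-order chain */
    long key;
    size_t size;
    char data[];
};

/* Reverse the insertion-order list in a single pass.  Since reading
   prepends entries, this restores the original file order.
   Returns the new head of the order list. */
static struct entry2 *reverse_order(struct entry2 *head)
{
    struct entry2 *prev = NULL;
    while (head) {
        struct entry2 *next = head->order_next;
        head->order_next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```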
So, as you can see, the main thing here is to have a robust, reliable function that reads your input correctly, into a useful data structure. Test it first, so you know it works. You should test it not only for correctness (that it parses your data correctly), but also that it is fast enough for your needs. I/O is surprisingly often the bottleneck!
Adding the hash table stuff, and manipulating the data read, is pretty simple afterwards.