Thread: Read and search CSV

  1. #16
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    you can use strcmp (or memcmp) to compare lines
    They aren't the same. strcmp() will compare strings lexicographically - memcmp() isn't guaranteed to do so. It only cares if n bytes of this is less than, equal to, or greater than n bytes of that.
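
    A minimal sketch of one concrete way the two calls differ, assuming data with an embedded NUL byte. The names sign(), left, right, compare_as_strings() and compare_as_bytes() are hypothetical and exist only for this illustration.

```c
#include <string.h>

/* Hypothetical helper: reduce a comparison result to -1, 0 or 1. */
static int sign(int v) { return (v > 0) - (v < 0); }

/* Two 5-byte arrays that match up to an embedded NUL and differ after it. */
static const char left[5]  = { 'a', 'b', '\0', 'x', '\0' };
static const char right[5] = { 'a', 'b', '\0', 'y', '\0' };

/* strcmp() stops at the first NUL, so it sees "ab" on both sides. */
int compare_as_strings(void)
{
    return sign(strcmp(left, right));
}

/* memcmp() compares exactly n bytes (as unsigned char), NUL included. */
int compare_as_bytes(void)
{
    return sign(memcmp(left, right, sizeof left));
}
```

    Here compare_as_strings() reports the arrays as equal, while compare_as_bytes() reports left as smaller, because 'x' sorts before 'y' in the fourth byte.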

  2. #17
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote: Originally Posted by whiteflags
    If I adopted this approach I would worry about becoming complacent with working code, and using it all the time
    I'm personally way too paranoid for that
    Quote: Originally Posted by phantomotap
    Yes, in C, there are certainly cases where `goto' is the better solution because it isolates otherwise duplicated code without wonky cleanup functions having several references to local functions. (Don't quote me on it though; if the situation comes up where a newbie is using `goto' poorly I will simply tell them "You should not be using `goto' as it is evil." rather than take the chance on explaining rare circumstances when it can be used properly. The newbie set already writes bad code on average and `goto' will only make that worse.)
    Hm. I fully concur.

    And, you have an even stronger point than I do. Most of the readers of this thread are newbies, especially if they're having difficulty parsing CSV in C. (I don't mean any of the posters are newbies. I mean that the thread title is such that this thread is likely to be read by many newbies later on.)
    Therefore, the suggestions made here should be directed more towards newbies.

    After taking a step back, and re-reading this thread, I'm convinced the non-goto version is better in this case. In particular, there will be less risk of new programmers misunderstanding and learning an unintended bad habit. Therefore, please allow me to replace my suggestion with this version of the csv_field() function:
    Code:
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <errno.h>
    
    /* RFC4180-compatible CSV field reader. Does not consume the separator.
     * Hash function is DJB2 XOR-variant:
     *     hash(0) = 5318
     *     hash(i) = (hash(i-1) * 33U) ^ character(i)
     * If you are not interested in the hash, just supply a NULL pointer.
     *
     * Returns the length of the field read.
     * If the function returns zero, also check errno for errors. (0 is OK; empty field.)
    */
    size_t csv_field(char **const dataptr, size_t *const sizeptr, unsigned int *const hashptr, FILE *const input)
    {
        char        *data, *temp;
        size_t       size;
        size_t       used = 0;
        unsigned int hash = 5318U;
        int          quoted = 0;
        int          c;
    
        /* Invalid parameters? */
        if (!dataptr || !sizeptr || !input) {
            errno = EINVAL;
            return 0;
        }
    
        /* Initialize field content buffer. Same logic as POSIX.1-2008 getline(). */
        if (*dataptr) {
            data = *dataptr;
            size = *sizeptr;
        } else {
            data = NULL;
            size = 0;
        }
    
        c = getc(input);
    
        /* Skip leading whitespace. This is not strictly RFC4180-compliant,
         * but it allows the use of both \n and \r\n newline convention.
         * Quoted values will retain their leading whitespace, of course. */
        while (c == '\t' || c == '\v' || c == '\f' || c == '\r' || c == ' ')
            c = getc(input);
    
        /* Non-empty field? */
        if (c != EOF && c != '\n' && c != ',') {
    
            /* Is the field quoted? */
            if (c == '"') {
                quoted = 1;
                c = getc(input);
            }
    
            while (c != EOF) {
    
                /* If the field is not quoted, newline or comma ends the field. */
                if (!quoted && (c == '\n' || c == ','))
                    break;
    
                if (quoted && c == '"') {
                    /* " in a quoted value is special. */
                    c = getc(input);
    
                    /* Did the " end the quoted field? */
                    if (c == EOF || c == '\n' || c == ',')
                        break;
    
                    /* It really should be ", then. */
                    if (c != '"') {
                        /* Un-escaped " within field text; this is really an error.
                         * However, we're robust, and treat as if it was escaped.
                        */
                        ungetc(c, input);
                        c = '"';
                    }
                }
    
                /* Enough room for the new character? */
                if (used >= size) {
                    if (used < 4096)
                        size = 4096; /* Minimum 4096 */
                    else
                    if (used < 1048576)
                        size = (used * 5) / 4; /* Add 25%, up to one megabyte */
                    else
                        size = (used | 131071) + 130944; /* Pad to next (128k-128). */
        
                    temp = realloc(data, size);
                    if (!temp) {
                        errno = ENOMEM;
                        return 0;
                    }
    
                    data = temp;
    
                    *dataptr = temp;
                    *sizeptr = size;
                }
    
                hash = (33U * hash) ^ (unsigned int)c;
    
                data[used++] = c;
    
                c = getc(input);
            }
        }
    
        /* Do not consume the delimiter, if there was a delimiter. */
        if (c != EOF)
            ungetc(c, input);
    
        /* Enough room for the end-of-string mark? */
        if (used >= size) {
            size = (used | 7) + 1; /* Next multiple of 8. */
    
            temp = realloc(data, size); /* realloc, not malloc: keep the bytes already stored */
            if (!temp) {
                errno = ENOMEM;
                return 0;
            }
    
            data = temp;
    
            *dataptr = temp;
            *sizeptr = size;
        }
    
        /* Terminate field value, */
        data[used] = '\0';
    
        /* save hash, if asked, */
        if (hashptr)
            *hashptr = hash;
    
        /* and return the length of the field. */
        errno = 0;
        return used;
    }
    The only difference from the previous version is that the goto has been replaced with an if clause.
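
    For readers who only want the hash described in the comment at the top of csv_field(), the same recurrence can be sketched as a standalone function. The name field_hash() is hypothetical; the seed 5318 and the multiplier 33 are copied from the code above rather than from any reference DJB2 implementation.

```c
#include <stddef.h>

/* Hypothetical standalone version of the hash computed inside csv_field():
 *     hash(0) = 5318
 *     hash(i) = (hash(i-1) * 33U) ^ character(i)
 * Takes an explicit length, so embedded NUL bytes are hashed too. */
unsigned int field_hash(const char *data, size_t length)
{
    unsigned int hash = 5318U;   /* seed taken from csv_field() above */
    size_t       i;

    for (i = 0; i < length; i++) {
        /* getc() yields unsigned char values, so convert the same way here. */
        hash = (33U * hash) ^ (unsigned int)(unsigned char)data[i];
    }

    return hash;
}
```

    An empty field hashes to the seed itself, which is why csv_field() can report a hash even when it returns a length of zero.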
    Last edited by Nominal Animal; 05-25-2013 at 06:08 PM. Reason: readers here -> readers of this thread.

  3. #18
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote: Originally Posted by whiteflags
    They aren't the same. strcmp() will compare strings lexicographically - memcmp() isn't guaranteed to do so. It only cares if n bytes of this is less than, equal to, or greater than n bytes of that.
    There is another issue as well, one I didn't consider: identical lines. To get strcmp to work, you'd need to replace all occurrences of '\n' with '\0' in the buffer holding the file. To get memcmp to work, you'd need to set n to the size of the longer of the two lines, and rely on the fact that '\n' is less than '0' through '9'. memcmp is usually faster than strcmp, but the program would need to generate an array of line sizes in order to use it.
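
    The strcmp route described above could look roughly like this. It is only a sketch, assuming the whole file is already in a writable, NUL-terminated buffer; split_lines() and same_line() are hypothetical names, not functions from this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split a writable, NUL-terminated buffer into lines
 * in place by overwriting each '\n' with '\0'. Stores up to max_lines
 * pointers to line starts in lines[] and returns how many were stored. */
size_t split_lines(char *buffer, char **lines, size_t max_lines)
{
    size_t count = 0;
    char  *p = buffer;

    while (*p != '\0' && count < max_lines) {
        char *end = strchr(p, '\n');

        lines[count++] = p;
        if (end == NULL)
            break;              /* last line has no trailing newline */
        *end = '\0';            /* terminate this line for strcmp() */
        p = end + 1;
    }

    return count;
}

/* Returns 1 if two split lines are identical, 0 otherwise. */
int same_line(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

    Because every line is now its own C string, identical lines compare equal without any array of line sizes. The cost is that the buffer is modified.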

  4. #19
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Oh. Forgot to mention. The csv_field() function above handles even embedded NUL bytes correctly. (You could easily amend the function to replace them with another value, or skip them altogether.)
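
    As a rough sketch of the "replace them with another value" idea, here is a post-processing helper rather than a change to csv_field() itself. The name replace_nuls() is hypothetical, and it assumes the caller passes the length that csv_field() returned.

```c
#include <stddef.h>

/* Hypothetical post-processing helper: replace each embedded NUL in the
 * first length bytes of data with replacement. The terminating NUL at
 * data[length] is left alone. Returns the number of bytes changed. */
size_t replace_nuls(char *data, size_t length, char replacement)
{
    size_t changed = 0;
    size_t i;

    for (i = 0; i < length; i++) {
        if (data[i] == '\0') {
            data[i] = replacement;
            changed++;
        }
    }

    return changed;
}
```

    Calling it right after csv_field() keeps the field printable with ordinary string functions, at the price of no longer being able to tell a replaced NUL from a genuine replacement character.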

    If you run e.g.
    Code:
    printf 'First,Second\0field,Third\n' | ./example
    the output will be
    Code:
    Record 1, field 1: 'First' (5 chars, hash 0x7916bf1c)
    Record 1, field 2: 'Second' (12 chars, hash 0xbf1b2374)
    Record 1, field 3: 'Third' (5 chars, hash 0x7aa2ba05)
    Note the second field: only the output of it is truncated; it is still internally known to be 12 chars long. It's just that to printf(), the seventh char, being NUL (\0), acts like an end-of-string mark.

    You can get the full (binary) output by replacing the output printf() in the main() with
    Code:
            /* Output this field. */
            printf("Record %ld, field %ld: '", record, field);
            fwrite(data, length, 1, stdout);
            printf("' (%lu chars, hash 0x%x)\n", (unsigned long)length, hash);
