Thread: Read and search CSV

  1. #1
    Registered User
    Join Date
    May 2013
    Posts
    1

    Read and search CSV

    HI All,

    I was hoping someone could assist with the following coding;

    I am trying to search a very large csv file (approx 200,000 lines) for a array of strings and get the output as the line where there text is found.
    E.g. CSV file:
    123,A,B,C
    456,D,E,F
    123,G,H,I
    Now if I search for 123
    my output array should give me
    A,C
    G,I

    Could anyone please help with the coding for this, I need a pure C code for this? I am concerned with the very large csv I am using which will prob slow everything down.

    Would appreciate any help on this,

    Kind Regards,
    Last edited by electroon; 05-24-2013 at 11:16 PM.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    If you are going to search for many different keys, then it might make sense to parse the entire CSV file once into a structure that facilitates such key finding, e.g., a hash table. A related option if the number of rows might be even greater would be to parse the entire file once into say a SQLite database, then make use of database indices and such for the searching.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    How many searches would you be making through this data? Would the searches be simple - like the example here - or would they sometimes include more complex target keys to be searched for?

    Do you have a run-time target you need to beat? If so, what is it? What PC or system is being used to run this?

    Would you be able to post 30 lines and a typical batch of 30 target keys - something to look at for actual run time?

    200,000 lines is not a large amount of lines to be searched through, today.

  4. #4
    Registered User
    Join Date
    May 2012
    Posts
    505
    Quote Originally Posted by electroon View Post
    HI All,

    I was hoping someone could assist with the following coding;

    I am trying to search a very large csv file (approx 200,000 lines) for a array of strings and get the output as the line where there text is found.
    E.g. CSV file:
    123,A,B,C
    456,D,E,F
    123,G,H,I
    Now if I search for 123
    my output array should give me
    A,C
    G,I

    Could anyone please help with the coding for this, I need a pure C code for this? I am concerned with the very large csv I am using which will prob slow everything down.

    Would appreciate any help on this,

    Kind Regards,
    Do you know the maximum line length?
    If so, just call fgets() in a loop, and strstr() to look for your sub-string. If strstr() reports a hit, process the line properly, occasionally you might need to reject it if the string is in the wrong place. If not, quickly reject it and go on to the next line.
    I'm the author of MiniBasic: How to write a script interpreter and Basic Algorithms
    Visit my website for lots of associated C programming resources.
    https://github.com/MalcolmMcLean


  5. #5
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by Malcolm McLean View Post
    Do you know the maximum line length?
    If so, just call fgets() in a loop, and strstr() to look for your sub-string. If strstr() reports a hit, process the line properly, occasionally you might need to reject it if the string is in the wrong place. If not, quickly reject it and go on to the next line.
    Or, if you have a C library that supports POSIX.1-2008, use getline():
    Code:
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    
    int scan(FILE *const input, const char *const substring[], const size_t substrings)
    {
        char   *line = NULL;
        size_t  size = 0;
        ssize_t len;
        unsigned long linenum = 0UL;
    
        /* Invalid parameters? */
        if (!input || !substring || substrings < 1)
            return errno = EINVAL;
    
        while ((len = getline(&line, &size, input)) > (ssize_t)0) {
            size_t i = substrings;
            linenum++;
    
            while (i-->0)
                if (strstr(line, substring[i]) {
    
                    /*
                     * This line (linenum) might have a match, process it!
                    */
    
                    break;
                }
        }
    
        /* Input line buffer is no longer needed. Discard it. */
        free(line);
    
        /* Error? */
        if (ferror(input))
            return errno = EIO; /* Read error */
        else
        if (!feof(input))
            return errno = ENOMEM; /* No I/O error, but not all input read */
        else
            return 0; /* Success */
    }
    The downside of this approach is that quoted newlines are not allowed in fields. (You could do a quick commas-and-quotes check to see if you need to append the next line to the current line, but the code gets complicated.)

    This is by far not the fastest option, if you have lots and lots of data. But 200,000 lines of data is not much, especially if you only have a few substrings you are looking for.

    Tools like grep use techniques like building a finite state machine that consumes each incoming character (or actually, pre-processed character; they usually have special characters like start-of-entry, start-of-field, end-of-field, end-of-entry, with escapes and special characters already interpreted), and notifies if there is a match. Building the finite state machine is not difficult, but tends to be slow (I mean, building the state machine is slow; the machine itself will run very, very fast).

    If the CSV data contains a lot of repeated entries -- say numbers --, you can keep a cache of the field values, and only do the matching on cached field values. When reading a new field, you'd first check if each value already exists in the cache or not. If it does, you already know whether that field matches your substrings or not. If the value is not yet in the cache, you add it, and check if it contains your substrings or not. While not as fast as a finite state machine, this is pretty simple to implement, and a simple hash table for the known field values makes it very fast to check if a field value is already known or not. Using "slow" substring searches only for the new values tends to be fast, as the values are usually short (within one or two cache lines), and current CPUs can do such searches very fast.

    If your rules can contain e.g. numerical comparisons in addition to substring searches, the cache approach is a very efficient one. Some quick testing I did hints that even a simple implementation of the cache approach beats a single-pass database approach, when rules include both substring and numerical value comparisons. (I did some test code based on RFC4180 CSV format (sans a header line), and it ends up being about 400 lines or so; not including command-line parsing for strings to search for and numerical comparisons to apply.)
    Last edited by Nominal Animal; 05-25-2013 at 12:15 PM.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Nominal Animal
    Or, if you have a C99 compiler, use getline():
    I don't know about C11, but getline is not part of the C standard library as described in C99.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  7. #7
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by laserlight View Post
    getline is not part of C99
    Right; fixed. It's defined in POSIX.1-2008 instead.

  8. #8
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Since reading CSV in C seems to be such a frequent question, I thought I'd post a simple and robust implementation:
    Code:
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <errno.h>
    
    /* RFC4180-compatible CSV field reader. Does not consume the separator.
     * Hash function is DJB2 XOR-variant:
     *     hash(0) = 5318
     *     hash(i) = (hash(i-1) * 33U) ^ character(i)
     * If you are not interested in the hash, just supply a NULL pointer.
     *
     * Returns the length of the field read.
     * If the function returns zero, also check errno for errors. (0 is OK; empty field.)
    */
    size_t csv_field(char **const dataptr, size_t *const sizeptr, unsigned int *const hashptr, FILE *const input)
    {
        char        *data, *temp;
        size_t       size;
        size_t       used = 0;
        unsigned int hash = 5318U;
        int          quoted = 0;
        int          c;
    
        /* Invalid parameters? */
        if (!dataptr || !sizeptr || !input) {
            errno = EINVAL;
            return 0;
        }
    
        /* Initialize field content buffer. Same logic as POSIX.1-2008 getline(). */
        if (*dataptr) {
            data = *dataptr;
            size = *sizeptr;
        } else {
            data = NULL;
            size = 0;
        }
    
        c = getc(input);
    
        /* Skip leading whitespace. This is not strictly RFC4180-compliant,
         * but it allows the use of both \n and \r\n newline convention.
         * Quoted values will retain their leading whitespace, of course. */
        while (c == '\t' || c == '\v' || c == '\f' || c == '\r' || c == ' ')
            c = getc(input);
    
        /* End of input/record/field? */
        if (c == EOF || c == '\n' || c == ',')
            goto done;
    
        /* Quoted value? */
        if (c == '"') {
            quoted = 1;
            c = getc(input);
        }
    
        while (c != EOF) {
    
            /* If the field is not quoted, newline or comma ends the field. */
            if (!quoted && (c == '\n' || c == ','))
                break;
    
            if (quoted && c == '"') {
                /* " in a quoted value is special. */
                c = getc(input);
    
                /* Did the " end the quoted field? */
                if (c == EOF || c == '\n' || c == ',')
                    break;
    
                /* It really should be ", then. */
                if (c != '"') {
                    /* Un-escaped " within field text; this is really an error.
                     * However, we're robust, and treat as if it was escaped.
                    */
                    ungetc(c, input);
                    c = '"';
                }
            }
    
            /* Enough room for the new character? */
            if (used >= size) {
                if (used < 4096)
                    size = 4096; /* Minimum 4096 */
                else
                if (used < 1048576)
                    size = (used * 5) / 4; /* Add 25%, up to one megabyte */
                else
                    size = (used | 131071) + 130944; /* Pad to next (128k-128). */
    
                temp = realloc(data, size);
                if (!temp) {
                    errno = ENOMEM;
                    return 0;
                }
    
                data = temp;
    
                *dataptr = temp;
                *sizeptr = size;
            }
    
            hash = (33U * hash) ^ (unsigned int)c;
    
            data[used++] = c;
    
            c = getc(input);
        }
    
    done:
    
        /* Do not consume the delimiter. */
        if (c != EOF)
            ungetc(c, input);
    
        /* Enough room for the end-of-string mark? */
        if (used >= size) {
            size = (used | 7) + 1; /* Next multiple of 8. */
    
            temp = malloc(size);
            if (!temp) {
                errno = ENOMEM;
                return 0;
            }
    
            data = temp;
    
            *dataptr = temp;
            *sizeptr = size;
        }
    
        /* Terminate field value, */
        data[used] = '\0';
    
        /* save hash, if asked, */
        if (hashptr)
            *hashptr = hash;
    
        /* and return the length of the field. */
        errno = 0;
        return used;
    }
    The function reads one (RFC4180-like) CSV field into the dynamically allocated buffer. It does not consume the separator (comma or newline).

    It's also one of the better examples how to sanely use goto in C. Sure, you can easily replace it with an if clause, but I do believe this is simpler and easier to understand. (I mostly used the goto because I often use goto for the error/abnormal code paths. Here, the three cases -- end-of-input read as the first non-whitespace character, newline read as the first non-whitespace character, and comma read as the first non-whitespace character, just happened to fold nicely into one if clause.)

    Because csv_field() reads the input stream character by character, it is not very fast. However, it's very robust, and handles even stuff like embedded newlines correctly.

    The pointer to the hash is optional: you can supply NULL, if you are not interested what the DJB2 hash of the field value is.

    Here's an example of how to use the function to read a CSV file:
    Code:
    int main(void)
    {
        char  *data = NULL;
        size_t size = 0;
        size_t length;
        long   record = 0L, field = 0L;
        unsigned int hash;
        int    c;
    
        while (1) {
    
            length = csv_field(&data, &size, &hash, stdin);
            if (!length && errno) {
                const char *const errmsg = strerror(errno);
                fprintf(stderr, "Error reading standard input: %s.\n", errmsg);
                return 1;
            }
    
            /* Advance field count, and record count if this is first field */
            if (!field++)
                record++;
    
            /* Don't bother printing empty fields. */
            if (!length)
                return 0;
    
            /* Output this field. */
            printf("Record %ld, field %ld: '%s' (%lu chars, hash 0x%x)\n", record, field, data, (unsigned long)length, hash);
    
            /* Get delimiter. */
            c = getc(stdin);
            if (c == EOF) {
                /* EOF: end of input */
                break;
    
            } else
            if (c == '\n') {
                /* Newline: new record */
                field = 0L;
                continue;
    
            } else
            if (c == ',') {
                /* Comma: new field */
                continue;
    
            } else {
                /* Invalid separator */
                if (c != '\'' && c > 32 && c < 127)
                    fprintf(stderr, "Invalid delimiter ('%c', ASCII %d) in standard input!\n", c, c);
                else
                    fprintf(stderr, "Invalid delimiter (ASCII %d) in standard input!\n", c);
                return 1;
            }
        }
    
        printf("Read %ld records.\n", record);
    
        /* Discard dynamically allocated field buffer. */
        free(data);
        data = NULL;
        size = 0;
    
        return 0;
    }
    If you need faster CSV handling, write your own functions. You can use the above code as a benchmark and unit test: you can verify your code parses some difficult test CSV (containing quoted values, escaped quotes, and embedded newlines et cetera) the same way.

  9. #9
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    It's also one of the better examples how to sanely use goto in C. Sure, you can easily replace it with an if clause, but I do believe this is simpler and easier to understand.
    O_o

    I disagree completely.

    You can trivially invert the condition so that the normal and expected path is further nested leaving the common cleanup code to follow as expected.

    You agree that the replacement is easy; let's consider the semantic value.

    For the sake of brevity, let's replace the condition with a simple name `empty_record'.

    Code:
    if(!empty_record)
    {
    // process record
    }
    // cleanup code common to both empty records and successfully read records
    I see the example code as being far easier to reason about.

    I'm not speaking about the evils of `goto' or commenting otherwise on your code.

    [Edit]
    Actually, I will say one thing about your code: I'm happy to see someone else use constant parameters for the benefit of the implementation. I don't see that nearly often enough to suite me.
    [/Edit]

    Soma

  10. #10
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by phantomotap View Post
    [Edit]
    Actually, I will say one thing about your code: I'm happy to see someone else use constant parameters for the benefit of the implementation. I don't see that nearly often enough to suite me.
    [/Edit]
    True, and for the length of the function it makes it jusdt that little bit easier to reason about.

    I would try and make sure it complied with the CSV file format according to Wikipedia. E.g. newlines can appear inside columns.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  11. #11
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by phantomotap View Post
    I disagree completely.
    I understand your point, and agree with your reasoning.

    I just don't think you are considering the full picture here.

    First, it was just a coincidence that the three cases happened to coalesce into a single one. While I was writing the code, the three cases (and goto's) were interspersed with other code. There was no single conditional; replacing the gotos would have nested the inner while loop into at least three scopes deep. (I did try to explain this as a reason why I thought the goto was a good example, but I guess it backfired.)

    Second, and perhaps more importantly, I only use goto for error/abnormal code paths. Whenever I see a goto, I immediately recognize that that case is an abnormal one, and the goto indicates that cleanup is needed. (Not only for that single case, but within the current scope in general.) It changes my mental gears, if you will.

    I personally benefit from the extra information from seeing goto (wrt. error cases and abnormal code paths). At least with the Linux kernel source code, it helps me understand the expected code flow much better, even if the individual operations done as cleanup after various error cases is sometimes hard to decipher (I often need external notes to keep track). Perhaps that's why the habit has stuck on me?

    If the goto is replaced by an if clause, it is much easier to overlook that there is common cleanup done. Future programmers may for example add code that simply returns early from within the while loop, ignoring the cleanup. I have seen this often in code, but less often with code that contains gotos in the error paths.

    If you consider the code "complete", then I must agree with your reasoning, and replacing the goto with an if clause is warranted. (The function is robust enough so it does not fail with an error after that point, so you don't have to treat the three cases as errors, really.)

    I do not consider the code "complete". If anybody uses the code at all, I hope they tinker with it; modify it to suit their needs, or modify it just to see how those modifications affect its behaviour. So, I find that having the goto there helps understand my original intent, and is basically an extra hint to those programmers.

    Granted, there is always the possibility that I'm just very conceited, and in reality the goto does not convey any extra information to other programmers. I hope not, though

  12. #12
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by iMalc View Post
    I would try and make sure it complied with the CSV file format according to Wikipedia. E.g. newlines can appear inside columns.
    Assuming there are no bugs lurking in it, it does. For example, given input
    Code:
    First, Second, Third
    """Bah,"" she said.", pretty"robust, "multiple
    lines
    are not a problem"
    foo" bar "baz
    the example program outputs
    Code:
    Record 1, field 1: 'First' (5 chars, hash 0x7916bf1c)
    Record 1, field 2: 'Second' (6 chars, hash 0xc7a3daf6)
    Record 1, field 3: 'Third' (5 chars, hash 0x7aa2ba05)
    Record 2, field 1: '"Bah," she said.' (16 chars, hash 0xca2a566e)
    Record 2, field 2: 'pretty"robust' (13 chars, hash 0x639862b7)
    Record 2, field 3: 'multiple
    lines
    are not a problem' (32 chars, hash 0x95d21f02)
    Record 3, field 1: 'foo" bar "baz' (13 chars, hash 0xad6be28)
    In the output, the field value is printed within single quotes; the outermost single quotes are not part of the field value returned by the csv_field() function.

    I'm sure the function could be enhanced quite a bit. If not otherwise, then at least you could get rid of the goto, if you wanted to. Feel free to post any fixes or enhancements!

  13. #13
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    Second, and perhaps more importantly, I only use goto for error/abnormal code paths. Whenever I see a goto, I immediately recognize that that case is an abnormal one, and the goto indicates that cleanup is needed. (Not only for that single case, but within the current scope in general.) It changes my mental gears, if you will.
    If I adopted this approach I would worry about becoming complacent with working code, and using it all the time, when a little bit of thinking would have made the control flow completely obvious. And while you say it communicates something extra to you, the way that goto works is basically just a jump. So from a completely objective view, every use you have to figure out where you're going. Plus there are other guidelines that can help you organize your function. Like prefering/enforcing a single exit point.

  14. #14
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Second, and perhaps more importantly, I only use goto for error/abnormal code paths.
    O_o

    I didn't overlook, by virtue of not seeing the big picture, that at all.

    Yes, in C, there are certainly cases where `goto' is the better solution because it isolates otherwise duplicated code without wonky cleanup functions having several references to local functions. (Don't quote me on it though; if the situation comes up where a newbie is using `goto' poorly I will simply tell them "You should not be using `goto' as it is evil." rather than take the chance on explaining rare circumstances when it can be used properly. The newbie set already rights bad code on average and `goto' will only make that worse.) I'm not suggesting otherwise.

    You have just done the job, isolating the common cleanup code, without `goto' being necessary. Certainly, if the code would still have otherwise duplicated that common cleanup, I would agree.

    First, it was just a coincidence that the three cases happened to coalesce into a single one. While I was writing the code, the three cases (and goto's) were interspersed with other code. There was no single conditional; replacing the gotos would have nested the inner while loop into at least three scopes deep.
    I'm intentionally quoting this after the other because it rather fits perfectly what I intend.

    There was no coincidence. None. You kept working the code.

    I'm not saying that you intentionally coalesced the common cleanup code. I'm not saying you worked to remove the `goto'.

    You did work at the implementation. You aren't incompetent. You weren't lazy.

    So, sure, you may not be a lazy or incompetent programmer, but that says nothing about every other person.

    The common thread of `goto' causing spaghetti code is, naturally, false. The spaghetti code comes from the lazy or incompetent. Look at my above comment about newbies; the use of `goto' doesn't necessarily make their code worse, but if they just reach for `goto' when they code themselves into a corner, they will not spend the effort it takes to consider, to refactor the code.

    I guess it backfired.
    *shrug*

    It sounds like early on `goto' may have been reasonable, but you worked at the implementation until it became unnecessary.

    If the `goto' is the better approach, so be it, but it sounds to me like the code had some issues with or without the `goto' which you improved by virtue of the effort you spent.

    I don't know if this "backfired" or not, but that's exactly what I desire from people.

    Soma

  15. #15
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    If the only key to be searched is the numeric string at the beginning of each line, you could just read in the entire file and create an array or vector of pointers to each line. Optionally you could sort the array or vector of pointers to each line and use a binary search via the array or vector to find a line. Since ',' is less than '0' through '9', then you can use strcmp (or memcmp) to compare lines ("123,", will be considered less than "1230,").

    What I tend to do for stuff like this is one large allocation of memory, like 1GB, then set all the pointers I need as offsets from the base pointer to the allocated memory.
    Last edited by rcgldr; 05-25-2013 at 05:49 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Binary Search Tree-search method help?
    By shocklightning in forum C++ Programming
    Replies: 5
    Last Post: 03-25-2012, 10:57 PM
  2. Advanced Search -> Search Multiple Content Types
    By phantomotap in forum Tech Board
    Replies: 2
    Last Post: 05-21-2011, 07:28 AM
  3. Difference Between A Linear Search And Binary Search
    By ImBack92 in forum C Programming
    Replies: 4
    Last Post: 05-12-2011, 08:47 AM
  4. Allowing my search function to search sub directories!
    By Queatrix in forum Windows Programming
    Replies: 10
    Last Post: 09-30-2005, 04:54 PM
  5. Search Engine - Binary Search Tree
    By Gecko2099 in forum C Programming
    Replies: 9
    Last Post: 04-17-2005, 02:56 PM