Thread: fastest way of processing a file

  1. #1
    Just kidding.... fnoyan's Avatar
    Join Date
    Jun 2003
    Location
    Still in the egg
    Posts
    275

    fastest way of processing a file

    Hi,

    I have huge CSV files that I need to import a database. For this approach, I have to write my own import function. The code works fine, except... it takes time to import files (sometimes an hour!)

    Basically, the process is
    Code:
    1 ) open file
    2 ) read file - fgets()
    3 ) split each line into tokens
    4 ) for each read token
         4.1 ) check if there are any "bad characters" which will cause SQL errors while executing SQL queries
              4.1.1 ) if there is, remove it
         4.2 ) construct SQL query
         4.3 ) update database
    I have to perform step 4 since I had errors because of bad characters before. But evaluating each character really takes time.

    To improve the overall performance, I was thinking to map the file into memory before fgets() but mmap() is not portable

    any recommendations are welcome....

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Please post up a small raw input file - just 5 lines will do - and what you would need done to those 5 lines of code, so it's ready to use to construct your SQL query.

    Specify the "bad characters", so we know more about them.

    Also, post the code you are using to handle this input, currently. (Just that bit of code). That will give us a baseline to use for comparisons.

    Fun!

  3. #3
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    An hour is ridiculous, unless the file is something like 50GB in size.

    Have you checked what part is taking all the time? The actual reading in and parsing of the CSV file should be pretty fast and I would expect the slowest part by far to be the actual inserting into the database. If this turns out not to be the case then you must have gone about things in a very inefficient manner. That would mean you'd be best to post your actual code on here for us to help you optimise.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  4. #4
    Just kidding.... fnoyan's Avatar
    Join Date
    Jun 2003
    Location
    Still in the egg
    Posts
    275
    to remove bad chars
    Code:
    (like ":;.%$#@[]{}|'<>*")
    int zChrSearch(char *token,char s){
    	int i=0;
    	
    	if (!token || s=='\0')
    		return 0;
    
    	while(token[i]) {
    		if (token[i]==s)
    			return 1;
    		i++;
    	}
    	return 0;
    }
    
    
    char *zStrrmv(char *str,char *delim) {
    	int index=0;	
    
    	if (str==0 || delim==0)
    		return 0;
    
    	while(str[index]){;
    		if (zChrSearch(delim,str[index])) {
    			memmove (str+index,str+index+1,strlen(str)-index);
    			index--;
    		}
    		index++;
    	}
    	return str;
    }
    to insert database
    Code:
    int sqlite3_import_file (sqlite3 *db, char *table, char *fp, const char *sep) {
        char *element;                      /* Array to store a single element */
        char line[MAX_LINE_LENGTH];         /* Array to store each line read from import file */
        char *badCharacters=".:;`~*\\?=)(/&%+^'!\"|}][{$ #@\t"; /* list of characters to remove */
        char sqlCmd[MAX_LINE_LENGTH];       /* SQL Comamnd string */
        FILE *fd;                           /* File descriptor */
        int cnt, j,i;                       /* Loop counters */
        sqlite3_stmt *stmt;                 /* sqlite3 statement */
    
        if ((fd=fopen(fp,"rb"))==NULL)
            return -1;
    
        if (fgets(line,sizeof(line),fd)==NULL)
            return -2;
    /*
    * Construct SQL Statement to create import table
    */
        cnt = 0;
        for (j=0 ; j<strlen(line);j++)
            if (line[j]==sep[0])
                cnt++;
    
        if (cnt>MAX_COLUMN_COUNT)
            cnt = MAX_COLUMN_COUNT;
    
        sprintf(sqlCmd,"CREATE TABLE %s (",table);
        strncat(sqlCmd,zStrrmv(zStrtok(line,sep),badCharacters),ELEMENT_LENGTH);
        strcat(sqlCmd," CHAR(64),");
    
        for(j=1; j<cnt; j++){
                strncat(sqlCmd,zStrrmv(zStrtok(NULL,sep),badCharacters),ELEMENT_LENGTH);
                strcat(sqlCmd," CHAR(64),");
        }
    
        element = zStrrmv(zStrtok(NULL,sep),badCharacters);
        strncat(sqlCmd,element,(((strlen(element)-2)>ELEMENT_LENGTH)?ELEMENT_LENGTH:(strlen(element)-2)));
        strcat(sqlCmd," CHAR(64));");
    
        if (sqlite3_prepare_v2(db,sqlCmd,strlen(sqlCmd)+1,&stmt,NULL)!=SQLITE_OK)
            return sqlite3_errcode(db);
    
        if (sqlite3_step(stmt)!=SQLITE_DONE)
            return sqlite3_errcode(db);
    
        if (sqlite3_finalize(stmt)!=SQLITE_OK)
            return sqlite3_errcode(db);
    
    /*
    * Prepared statement for INSERT
    */
        sprintf(sqlCmd,"INSERT INTO %s VALUES(?",table);
        i=strlen(sqlCmd);
        for(j=0; j<cnt; j++){
            sqlCmd[i++] = ',';
            sqlCmd[i++] = '?';
        }
        sqlCmd[i++] = ')';
        sqlCmd[i++] = ';';
        sqlCmd[i] = 0;
    
        if (sqlite3_prepare_v2(db,sqlCmd,strlen(sqlCmd)+1,&stmt,NULL)!=SQLITE_OK)
            return sqlite3_errcode(db);
    
        while(fgets(line,sizeof(line),fd)){
            element = zStrrmv(zStrtok(line,sep),badCharacters);
            sqlite3_bind_text(stmt, 1,(strcmp(element,sep)==0)?"NULL":element , -1, SQLITE_STATIC);
            for (j=2;j<cnt+1;j++) {
                    element = zStrrmv(zStrtok(NULL,sep),badCharacters);
                    sqlite3_bind_text(stmt, j, (strcmp(element,sep)==0)?"NULL":element, -1, SQLITE_STATIC);
            }
            element = zStrrmv(zStrtok(NULL,sep),badCharacters);
            if (element[strlen(element)-1] == '\n')
                element[strlen(element)-1] = 0;
            sqlite3_bind_text(stmt, j, element, -1, NULL);
    
            if (sqlite3_step(stmt)!=SQLITE_DONE)
                return sqlite3_errcode(db);
    
            sqlite3_reset(stmt);
        }
    
        sqlite3_finalize(stmt);
        fclose(fd);
        return 0;
    }
    input is CSV file
    Last edited by fnoyan; 12-12-2012 at 11:33 PM.

  5. #5
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Okay, here's a fast replacement for zStrrmv which will be far faster than what you have.
    Heck I even tested this:
    Code:
    unsigned char isBad[256] = {};
    
    void initBadChars(const char *rejects)
    {
        assert(rejects != NULL);
        while (*rejects != '\0')
            isBad[*rejects++] = 1;
    }
    
    char *zStrrmv(char *str)
    {
        assert(str != NULL);
        char *src = str, *dst = str;
        while (*src != '\0')
        {
            if (isBad[*src])
                ++src;
            else
                *dst++ = *src++;
        }
        *dst = '\0';
        return str;
    }
    Call initBadChars once to set up the array of which chars to filter out.
    Then call zStrrmv with just the string to filter, as you don't need to pass the string of characters to filter out each time any more.

    The idea is to set up a lookup table which will tell you immediately if a char is good/bad, and then remove those from the string by using destination and source iterators.

    One thing to consider is that you may be better off with a list of characters to include rather than exclude.

    I tested it with this string: "a.$ #!b\"cd|}]e;f`~:g&%+hi^'j/[k{*l\\m?=)n(@o\tp" which gives abcdefghijklmnop"

    If you like how much faster that makes it, perhaps post your zStrtok function as well etc and see what we can do with that.
    Last edited by iMalc; 12-13-2012 at 12:27 AM.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  6. #6
    Registered User
    Join Date
    Jul 2012
    Posts
    51
    Here's a way I did it using built in string.h functions. Not sure about the performance, don't have any large csv files.

    iMalc code looks better.

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main (void)
    {
        char *p, csv[] = "Mr. John Doe,[email protected],!not$200?which&%,$200.00,${CODE}";
        char rmc[] = ".:;`~*\\?=)(/&%+^'!\"|}][{$ #@\t";
        char pstr[10][100];
        char ostr[10][100];
        int i = 0, c = 0;
        p = strtok (csv, ",");
        while (p)
        {
            strcpy (pstr[i], p);
            i++;
            p = strtok (NULL, ",");
        }
        c = i;
        for (i = 0; i < c; i++)
        {
            p = strtok (pstr[i], rmc);
            strcpy (ostr[i], "");
            while (p)
            {
                strcat (ostr[i], p);
                strcat (ostr[i], " ");
                p = strtok (NULL, rmc);
            }
        }
        for (i = 0; i < c; i++)
        {
            printf ("%d: %s\n", i, ostr[i]);
        }
        return 0;
    }

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > memmove (str+index,str+index+1,strlen(str)-index);
    This is bad.
    You're copying the WHOLE string tail, just to remove each bad character.

    Code:
    for ( i = 0, j = 0 ; string[i] != '\0' ; i++ ) {
      if ( good(string[i]) ) string[j++] = string[i];
    }
    string[j] = '\0';
    j will never overtake i, and with each bad character, it falls further behind.
    Each char is only tested and copied once.

    > for (j=0 ; j<strlen(line);j++)
    This is awful.
    Use line[j] != '\0' as the test.

    > One thing to consider is that you may be better off with a list of characters to include rather than exclude.
    Good idea - I would suggest isalnum as being a good starting point.


    > MAX_COLUMN_COUNT
    > ELEMENT_LENGTH
    Are these symbols consistent with MAX_LINE_LENGTH

    > if ((fd=fopen(fp,"rb"))==NULL)
    Why are you opening the file in binary mode?

    Being pedantic, fopen returns a file pointer, not a file descriptor.

    > zStrtok
    Does this take into account adjacent separators?
    If this is just a thin layer over regular strtok, then "hello,,,,,,world" is just two tokens, not seven.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    You will be looking at a big speed up!

    I made up a 655MB text file, with at least 3 bad chars on each line of text, and 1,491,006 lines of CSV's.

    This is the program:
    Code:
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <ctype.h>
    
    
    //int zChrSearch(char *token,char s);
    //void zStrrmv(char *str);
    
    int main(void) {
       clock_t timer;
       unsigned long int count; 
       int i,j;
       FILE *fpIn,*fpOut;
       char str[100];
       char dup[100];
          
       fpIn=fopen("temp.txt", "r");
       fpOut=fopen("temp2.txt","w");
        
       if(!fpIn || !fpOut) {
          printf("Error opening file\n");
          return 1;
       }
       timer=clock();
       count=0;
       j=i=0;
       while(fgets(str, sizeof(str),fpIn)) {
          i=j=0;
          while(str[i]) {
             if(isalpha(str[i]) || isdigit(str[i]) || str[i]==',' || str[i]==' ') {
                dup[j]=str[i];
                ++j;
             }
             ++i;
          }
          dup[j]='\0';
          fprintf(fpOut,"%s\n",dup);
    
          ++count;
       }
       timer=clock()-timer;
       fclose(fpIn);
       
       printf("Strings Processed: %d  Elapsed time: %.3f\n",count,(double)timer/CLOCKS_PER_SEC);
       return 0;
    }
    Strings Processed: 1491006 Elapsed time: 1.124 seconds
    < sweet! >

    Your program will need more time, because it does a lot more, but if use a basic logic similar to this, you'll be flying.
    Last edited by Adak; 12-13-2012 at 05:17 AM.

  9. #9
    Just kidding.... fnoyan's Avatar
    Join Date
    Jun 2003
    Location
    Still in the egg
    Posts
    275
    zStrtok() code needs much more improvement but regarding to Salem's question, it takes consecutive separators into account (returns separator itself in case of consecutive separators). The one included in string.h does not work for me since it does not take consecutive separators into account.

    below is the code for zStrtok()
    Code:
    char *zStrtok(char *str, const char *delim) {
        static char *static_str=0;    	/* var to store last address */
        int index=0, strlength=0;       	/* integers for indexes */
        int found = 0;              	/* check if delim is found */
    
        /* delimiter cannot be NULL
        * if no more char left, return NULL as well
        */
        if (delim==0 || (str == 0 && static_str == 0))
            return 0;
    
        if (str == 0)
            str = static_str;
    
            /* get length of string */
            while(str[strlength])
                    strlength++;
    
        /* find the first occurance of delim */
        for (index=0;index<strlength;index++)
            if (str[index]==delim[0]) {
                found=1;
                break;
            }
    
        /* if delim is not contained in str, return str */
        if (!found) {
            static_str = 0;
            return str;
        }
    
        /* check for consecutive delimiters
        *if first char is delim, return delim
        */
        if (str[0]==delim[0]) {
            static_str = (str + 1);
            return (char *)delim;
        }
    
        /* terminate the string
        * this assignmetn requires char[], so str has to
        * be char[] rather than *char which will cause SIGSEGV
        */
        str[index] = '\0';
    
        /* save the rest of string */
        if ((str + index + 1)!=0)
            static_str = (str + index + 1);
        else
            static_str = 0;
    
            return str;
    }
    @Salem,
    Does having a for() loop with increment makes a lot of difference compared to just evaluating "end of string" case?

    @Adak,
    As iMalc indicated in his first post, the database insert also takes time (I may be considering INSERT'ing not for each line but bunch of lines with semicolum separated SQL commands, so only once call for sqlite3_* APIs). So, we better test the code embedded in file import function. But, yes, the result is pretty good

    I will try the recommendation and let you know about the import times for the same file
    Last edited by fnoyan; 12-13-2012 at 06:56 AM.

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > /* get length of string */
    This is a waste of effort.

    > for (index=0;index<strlength;index++)
    If you just change this to
    for (index=0;str[index]!='\0';index++)
    then the whole problem of working out the string length in advance goes away.

    > *if first char is delim, return delim
    Shouldn't you return a pointer to an empty string?
    Like say
    return "";

    > Does having a for() loop with increment makes a lot of difference with just evaluating "end of string" case?
    I don't know what you mean. You'd have to post two example loops for us to comment on the relative differences.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  11. #11
    Just kidding.... fnoyan's Avatar
    Join Date
    Jun 2003
    Location
    Still in the egg
    Posts
    275
    Quote Originally Posted by Salem View Post
    >
    > *if first char is delim, return delim
    Shouldn't you return a pointer to an empty string?
    Like say
    return "";

    > Does having a for() loop with increment makes a lot of difference with just evaluating "end of string" case?
    I don't know what you mean. You'd have to post two example loops for us to comment on the relative differences.
    like
    > for (j=0 ; j<strlen(line);j++)
    This is awful.
    Use line[j] != '\0' as the test

    I am returning delim because I am testing the case in the code (if it is delim, I'will INSERT into SQL table "NULL". It is probable to we have valid " " as a contained values between two delimiters)
    Last edited by fnoyan; 12-13-2012 at 07:30 AM.

  12. #12
    Registered User ledow's Avatar
    Join Date
    Dec 2011
    Posts
    435
    I would also suggest, as a general troubleshooting technique under similar circumstances that may crop up, to familiarise yourself with some basic profiling. Having people fix your problem for you is all well and good, but if you couldn't even see where the main performance hit was coming from then how did you ever expect to be able to answer it yourself? It seems you were looking into mmap'ing before you even thought to profile the app and see where the bottleneck actually was.

    - Honestly, compiling and running the program through gprof would immediately tell you where you're spending 99.9% of your time anyway.
    - Hell, even a primitive printf on entering and leaving functions using some crude timings would show you quite a lot about where the bottleneck is (I'm a big advocate of "simple debugging" like this because people whinge so much when you say "just learn how to use this quite complex tool that may not even be available for your platform", and simple debugging eliminates those complaints before you even start. Seriously, a printf with the current time in it, copy/pasted to the critical parts of your program - e.g. "Entering func() at 112487354.2 seconds. Leaving func after spending 89.2 seconds inside it" - with maybe even a counter "I've entered this function 32748237489273 times since the start of the program" - that's within ANYONE'S range of programming skill, requires no external tools, and is good enough to help you get a hint at what's happening. Hell, I have a 150,000 line program that I'm working on at the moment and I have a macro that just sticks that printf with the information into any function I paste it into - even with the function name - so I can work out what happened without having to churn through reams of debugger/profiler output for a single answer).

    And once you know where the problem is, you can just sit and tweak then time, tweak then time until you can get it down to the lowest you can until you can narrow the problem to a particular function or even line of code and then post it for optimisation help.

    Help yourself, and put some simple debugging or run a proper profiler (which costs nothing but a few characters in your compile line if you're using gcc, or MinGW, or another derivative of it already)

    - Compiler warnings are like "Bridge Out Ahead" warnings. DON'T just ignore them.
    - A compiler error is something SO stupid that the compiler genuinely can't carry on with its job. A compiler warning is the compiler saying "Well, that's bloody stupid but if you WANT to ignore me..." and carrying on.
    - The best debugging tool in the world is a bunch of printf()'s for everything important around the bits you think might be wrong.

  13. #13
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > > for (j=0 ; j<strlen(line);j++)
    If line is 10 chars long, you call strlen() 10 times, and it always returns 10
    If line is 1000 chars long, you call strlen() 1000 times, and it always returns 1000
    Not only does each strlen() call take longer and longer to do (with increasing line length), you call it more and more times (twice the hit!)
    On the whole, it's much better to not call it at all.

    Given Adak's post #8, and ledow's timely introduction of a bit of profiling sanity, my gut feel is that you're going to find that the SQL is the elephant in the room all of a sudden.


    > I am returning delim because I am testing the case in the code (if it is delim, I'will INSERT into SQL table "NULL". It is probable to we have valid " " as a contained values between two delimiters)
    Note that I said
    return ""; // empty string
    not
    return " "; // single space
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  14. #14
    Registered User
    Join Date
    Jul 2012
    Posts
    51
    > I made up a 655MB text file, with at least 3 bad chars on each line of text, and 1,491,006 lines of CSV's.
    @Adak - How did you make your test file? Just random text with some "bad" chars thrown in?

  15. #15
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by fnprintf View Post
    > I made up a 655MB text file, with at least 3 bad chars on each line of text, and 1,491,006 lines of CSV's.
    @Adak - How did you make your test file? Just random text with some "bad" chars thrown in?
    I do a LOT of folding for Folding@Home (world-wide distributed computer type disease research run by Stanford University).
    Folding@Home: Folding@home - HomePage

    Me: Adak - User Summary - EXTREME Overclocking Folding @ Home Stats

    The huge data file was the download of every folders stats, that Stanford makes available. I used that, wrote some code to give it some extra bad chars I knew it had a few on each line, because it uses tabs '\t' to separate the fields of each record. These tabs are listed as "bad" in the Original Posters program, and wrote that out to a file. To make it closer to the OP's needs, I added commas to separate each field.

    That "bad" file, was input for the timed run. The lines aren't long, but overall, it was a good fit, and a huge file. Input data looked like this:

    Sun Jul 10 09:20:00 PDT 2011
    name, ;newcredit, sum(total), team
    anonymous, 2743076847, 16093991, 0
    PS3, .2436132886, 10113108, 0
    PDC, 920381552, 15965, 1
    Angra, :427916763, 108859, 36362
    kennish, 375378324, 105099, 39340
    AtlasFolder, ;356534004, 695128, 36167
    Scott_H, 342966259, 437344, 2654
    awachs, `341500090, 77431, 181223
    SACO, 308934471, 15098, 3213
    HayesK, ~302161623, 179545, 32
    Michael_McCord,_M.D., 246149204, 248070, 11108
    Leganfuh, *243263626, 147521, 75255
    ChasR, 236938861, 310008, 32
    [Zebulon.fr]_Cobra, \233830170, 144926, 51
    mmillion, 229920411, 140418, 111065
    mutsu, ?228498654, 145086, 11108
    OverClocking-Masters.com, 213583169, 183699, 51
    dropper, =212694672, 50968, 33
    phoenicis, 211662473, 93610, 35947
    zz9pzza, )211014076, 15793, 35947
    Computekinc.us, 203181956, 204244, 32
    musky, (200650606, 6715, 33
    EVGApes, 200215225, 147386, 111065
    OlivierZ, /180898994, 43221, 51
    So each line of text had 3 bad tabs that needed to be removed, and every other line had an extra (and rotating) bad char from the bad char array.

    Output file looked like this:
    Sun Jul 10 092000 PDT 2011
    name, newcredit, sumtotal, team
    anonymous, 2743076847, 16093991, 0
    PS3, 2436132886, 10113108, 0
    PDC, 920381552, 15965, 1
    Angra, 427916763, 108859, 36362
    kennish, 375378324, 105099, 39340
    AtlasFolder, 356534004, 695128, 36167
    ScottH, 342966259, 437344, 2654
    awachs, 341500090, 77431, 181223
    SACO, 308934471, 15098, 3213
    HayesK, 302161623, 179545, 32
    MichaelMcCord,MD, 246149204, 248070, 11108
    Leganfuh, 243263626, 147521, 75255
    ChasR, 236938861, 310008, 32
    ZebulonfrCobra, 233830170, 144926, 51
    mmillion, 229920411, 140418, 111065
    mutsu, 228498654, 145086, 11108
    OverClockingMasterscom, 213583169, 183699, 51
    dropper, 212694672, 50968, 33
    phoenicis, 211662473, 93610, 35947
    zz9pzza, 211014076, 15793, 35947
    Computekincus, 203181956, 204244, 32
    musky, 200650606, 6715, 33
    EVGApes, 200215225, 147386, 111065
    OlivierZ, 180898994, 43221, 51
    Last edited by Adak; 12-13-2012 at 03:44 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. File processing: How to read a part of a text file?
    By HardOverdrive in forum C Programming
    Replies: 4
    Last Post: 12-05-2011, 12:33 PM
  2. processing file?
    By ripspinner in forum Windows Programming
    Replies: 4
    Last Post: 12-13-2009, 10:38 PM
  3. Problem with Creating File for File Processing
    By Dampecram in forum C Programming
    Replies: 2
    Last Post: 12-07-2008, 01:26 AM
  4. What's the fastest way to write a file?
    By pingpangpang in forum C++ Programming
    Replies: 12
    Last Post: 06-08-2007, 11:58 AM
  5. Reading a File - FASTEST WAY POSSIBLE??
    By Mastadex in forum C++ Programming
    Replies: 15
    Last Post: 06-19-2004, 03:19 PM