fastest way of processing a file

**fnoyan** · 12-12-2012

Hi,

I have huge CSV files that I need to import a database. For this approach, I have to write my own import function. The code works fine, except... it takes time to import files (sometimes an hour!)

Basically, the process is

Code:

1 ) open file
2 ) read file - fgets()
3 ) split each line into tokens
4 ) for each read token
     4.1 ) check if there are any "bad characters" which will cause SQL errors while executing SQL queries
          4.1.1 ) if there is, remove it
     4.2 ) construct SQL query
     4.3 ) update database

I have to perform step 4 since I had errors because of bad characters before. But evaluating each character really takes time.

To improve the overall performance, I was thinking to map the file into memory before fgets() but mmap() is not portable

any recommendations are welcome....

**Adak** · 12-12-2012

Please post up a small raw input file - just 5 lines will do - and what you would need done to those 5 lines of code, so it's ready to use to construct your SQL query.

Specify the "bad characters", so we know more about them.

Also, post the code you are using to handle this input, currently. (Just that bit of code). That will give us a baseline to use for comparisons.

Fun!

**iMalc** · 12-12-2012

An hour is ridiculous, unless the file is something like 50GB in size.

Have you checked what part is taking all the time? The actual reading in and parsing of the CSV file should be pretty fast and I would expect the slowest part by far to be the actual inserting into the database. If this turns out not to be the case then you must have gone about things in a very inefficient manner. That would mean you'd be best to post your actual code on here for us to help you optimise.

**fnoyan** · 12-12-2012

to remove bad chars

Code:

(like ":;.%$#@[]{}|'<>*")
int zChrSearch(char *token,char s){
	int i=0;
	
	if (!token || s=='\0')
		return 0;

	while(token[i]) {
		if (token[i]==s)
			return 1;
		i++;
	}
	return 0;
}


char *zStrrmv(char *str,char *delim) {
	int index=0;	

	if (str==0 || delim==0)
		return 0;

	while(str[index]){;
		if (zChrSearch(delim,str[index])) {
			memmove (str+index,str+index+1,strlen(str)-index);
			index--;
		}
		index++;
	}
	return str;
}

to insert database

Code:

int sqlite3_import_file (sqlite3 *db, char *table, char *fp, const char *sep) {
    char *element;                      /* Array to store a single element */
    char line[MAX_LINE_LENGTH];         /* Array to store each line read from import file */
    char *badCharacters=".:;`~*\\?=)(/&%+^'!\"|}][{$ #@\t"; /* list of characters to remove */
    char sqlCmd[MAX_LINE_LENGTH];       /* SQL Comamnd string */
    FILE *fd;                           /* File descriptor */
    int cnt, j,i;                       /* Loop counters */
    sqlite3_stmt *stmt;                 /* sqlite3 statement */

    if ((fd=fopen(fp,"rb"))==NULL)
        return -1;

    if (fgets(line,sizeof(line),fd)==NULL)
        return -2;
/*
* Construct SQL Statement to create import table
*/
    cnt = 0;
    for (j=0 ; j<strlen(line);j++)
        if (line[j]==sep[0])
            cnt++;

    if (cnt>MAX_COLUMN_COUNT)
        cnt = MAX_COLUMN_COUNT;

    sprintf(sqlCmd,"CREATE TABLE %s (",table);
    strncat(sqlCmd,zStrrmv(zStrtok(line,sep),badCharacters),ELEMENT_LENGTH);
    strcat(sqlCmd," CHAR(64),");

    for(j=1; j<cnt; j++){
            strncat(sqlCmd,zStrrmv(zStrtok(NULL,sep),badCharacters),ELEMENT_LENGTH);
            strcat(sqlCmd," CHAR(64),");
    }

    element = zStrrmv(zStrtok(NULL,sep),badCharacters);
    strncat(sqlCmd,element,(((strlen(element)-2)>ELEMENT_LENGTH)?ELEMENT_LENGTH:(strlen(element)-2)));
    strcat(sqlCmd," CHAR(64));");

    if (sqlite3_prepare_v2(db,sqlCmd,strlen(sqlCmd)+1,&stmt,NULL)!=SQLITE_OK)
        return sqlite3_errcode(db);

    if (sqlite3_step(stmt)!=SQLITE_DONE)
        return sqlite3_errcode(db);

    if (sqlite3_finalize(stmt)!=SQLITE_OK)
        return sqlite3_errcode(db);

/*
* Prepared statement for INSERT
*/
    sprintf(sqlCmd,"INSERT INTO %s VALUES(?",table);
    i=strlen(sqlCmd);
    for(j=0; j<cnt; j++){
        sqlCmd[i++] = ',';
        sqlCmd[i++] = '?';
    }
    sqlCmd[i++] = ')';
    sqlCmd[i++] = ';';
    sqlCmd[i] = 0;

    if (sqlite3_prepare_v2(db,sqlCmd,strlen(sqlCmd)+1,&stmt,NULL)!=SQLITE_OK)
        return sqlite3_errcode(db);

    while(fgets(line,sizeof(line),fd)){
        element = zStrrmv(zStrtok(line,sep),badCharacters);
        sqlite3_bind_text(stmt, 1,(strcmp(element,sep)==0)?"NULL":element , -1, SQLITE_STATIC);
        for (j=2;j<cnt+1;j++) {
                element = zStrrmv(zStrtok(NULL,sep),badCharacters);
                sqlite3_bind_text(stmt, j, (strcmp(element,sep)==0)?"NULL":element, -1, SQLITE_STATIC);
        }
        element = zStrrmv(zStrtok(NULL,sep),badCharacters);
        if (element[strlen(element)-1] == '\n')
            element[strlen(element)-1] = 0;
        sqlite3_bind_text(stmt, j, element, -1, NULL);

        if (sqlite3_step(stmt)!=SQLITE_DONE)
            return sqlite3_errcode(db);

        sqlite3_reset(stmt);
    }

    sqlite3_finalize(stmt);
    fclose(fd);
    return 0;
}

input is CSV file

**iMalc** · 12-13-2012

Okay, here's a fast replacement for zStrrmv which will be far faster than what you have.
Heck I even tested this:

Code:

unsigned char isBad[256] = {};

void initBadChars(const char *rejects)
{
    assert(rejects != NULL);
    while (*rejects != '\0')
        isBad[*rejects++] = 1;
}

char *zStrrmv(char *str)
{
    assert(str != NULL);
    char *src = str, *dst = str;
    while (*src != '\0')
    {
        if (isBad[*src])
            ++src;
        else
            *dst++ = *src++;
    }
    *dst = '\0';
    return str;
}

Call initBadChars once to set up the array of which chars to filter out.
Then call zStrrmv with just the string to filter, as you don't need to pass the string of characters to filter out each time any more.

The idea is to set up a lookup table which will tell you immediately if a char is good/bad, and then remove those from the string by using destination and source iterators.

One thing to consider is that you may be better off with a list of characters to include rather than exclude.

I tested it with this string: "a.$ #!b\"cd|}]e;f`~:g&%+hi^'j/[k{*l\\m?=)n(@o\tp" which gives abcdefghijklmnop"

If you like how much faster that makes it, perhaps post your zStrtok function as well etc and see what we can do with that.

**fnprintf** · 12-13-2012

Here's a way I did it using built in string.h functions. Not sure about the performance, don't have any large csv files.

iMalc code looks better.

Code:

#include <stdio.h>
#include <string.h>

int main (void)
{
    char *p, csv[] = "Mr. John Doe,[email protected],!not$200?which&%,$200.00,${CODE}";
    char rmc[] = ".:;`~*\\?=)(/&%+^'!\"|}][{$ #@\t";
    char pstr[10][100];
    char ostr[10][100];
    int i = 0, c = 0;
    p = strtok (csv, ",");
    while (p)
    {
        strcpy (pstr[i], p);
        i++;
        p = strtok (NULL, ",");
    }
    c = i;
    for (i = 0; i < c; i++)
    {
        p = strtok (pstr[i], rmc);
        strcpy (ostr[i], "");
        while (p)
        {
            strcat (ostr[i], p);
            strcat (ostr[i], " ");
            p = strtok (NULL, rmc);
        }
    }
    for (i = 0; i < c; i++)
    {
        printf ("%d: %s\n", i, ostr[i]);
    }
    return 0;
}

**Salem** · 12-13-2012

> memmove (str+index,str+index+1,strlen(str)-index);
This is bad.
You're copying the WHOLE string tail, just to remove each bad character.

Code:

for ( i = 0, j = 0 ; string[i] != '\0' ; i++ ) {
  if ( good(string[i]) ) string[j++] = string[i];
}
string[j] = '\0';

j will never overtake i, and with each bad character, it falls further behind.
Each char is only tested and copied once.

> for (j=0 ; j<strlen(line);j++)
This is awful.
Use line[j] != '\0' as the test.

> One thing to consider is that you may be better off with a list of characters to include rather than exclude.
Good idea - I would suggest isalnum as being a good starting point.

> MAX_COLUMN_COUNT
> ELEMENT_LENGTH
Are these symbols consistent with MAX_LINE_LENGTH

> if ((fd=fopen(fp,"rb"))==NULL)
Why are you opening the file in binary mode?

Being pedantic, fopen returns a file pointer, not a file descriptor.

> zStrtok
Does this take into account adjacent separators?
If this is just a thin layer over regular strtok, then "hello,,,,,,world" is just two tokens, not seven.

**Adak** · 12-13-2012

You will be looking at a big speed up!

I made up a 655MB text file, with at least 3 bad chars on each line of text, and 1,491,006 lines of CSV's.

This is the program:

Code:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <ctype.h>


//int zChrSearch(char *token,char s);
//void zStrrmv(char *str);

int main(void) {
   clock_t timer;
   unsigned long int count; 
   int i,j;
   FILE *fpIn,*fpOut;
   char str[100];
   char dup[100];
      
   fpIn=fopen("temp.txt", "r");
   fpOut=fopen("temp2.txt","w");
    
   if(!fpIn || !fpOut) {
      printf("Error opening file\n");
      return 1;
   }
   timer=clock();
   count=0;
   j=i=0;
   while(fgets(str, sizeof(str),fpIn)) {
      i=j=0;
      while(str[i]) {
         if(isalpha(str[i]) || isdigit(str[i]) || str[i]==',' || str[i]==' ') {
            dup[j]=str[i];
            ++j;
         }
         ++i;
      }
      dup[j]='\0';
      fprintf(fpOut,"%s\n",dup);

      ++count;
   }
   timer=clock()-timer;
   fclose(fpIn);
   
   printf("Strings Processed: %d  Elapsed time: %.3f\n",count,(double)timer/CLOCKS_PER_SEC);
   return 0;
}

Strings Processed: 1491006 Elapsed time: 1.124 seconds
< sweet! >

Your program will need more time, because it does a lot more, but if use a basic logic similar to this, you'll be flying.

**fnoyan** · 12-13-2012

zStrtok() code needs much more improvement

but regarding to Salem's question, it takes consecutive separators into account (returns separator itself in case of consecutive separators). The one included in string.h does not work for me since it does not take consecutive separators into account.

below is the code for zStrtok()

Code:

char *zStrtok(char *str, const char *delim) {
    static char *static_str=0;    	/* var to store last address */
    int index=0, strlength=0;       	/* integers for indexes */
    int found = 0;              	/* check if delim is found */

    /* delimiter cannot be NULL
    * if no more char left, return NULL as well
    */
    if (delim==0 || (str == 0 && static_str == 0))
        return 0;

    if (str == 0)
        str = static_str;

        /* get length of string */
        while(str[strlength])
                strlength++;

    /* find the first occurance of delim */
    for (index=0;index<strlength;index++)
        if (str[index]==delim[0]) {
            found=1;
            break;
        }

    /* if delim is not contained in str, return str */
    if (!found) {
        static_str = 0;
        return str;
    }

    /* check for consecutive delimiters
    *if first char is delim, return delim
    */
    if (str[0]==delim[0]) {
        static_str = (str + 1);
        return (char *)delim;
    }

    /* terminate the string
    * this assignmetn requires char[], so str has to
    * be char[] rather than *char which will cause SIGSEGV
    */
    str[index] = '\0';

    /* save the rest of string */
    if ((str + index + 1)!=0)
        static_str = (str + index + 1);
    else
        static_str = 0;

        return str;
}

@Salem,
Does having a for() loop with increment makes a lot of difference compared to just evaluating "end of string" case?

@Adak,
As iMalc indicated in his first post, the database insert also takes time (I may be considering INSERT'ing not for each line but bunch of lines with semicolum separated SQL commands, so only once call for sqlite3_* APIs). So, we better test the code embedded in file import function. But, yes, the result is pretty good

I will try the recommendation and let you know about the import times for the same file

**Salem** · 12-13-2012

> /* get length of string */
This is a waste of effort.

> for (index=0;index<strlength;index++)
If you just change this to
for (index=0;str[index]!='\0';index++)
then the whole problem of working out the string length in advance goes away.

> *if first char is delim, return delim
Shouldn't you return a pointer to an empty string?
Like say
return "";

> Does having a for() loop with increment makes a lot of difference with just evaluating "end of string" case?
I don't know what you mean. You'd have to post two example loops for us to comment on the relative differences.

**fnoyan** · 12-13-2012

Originally Posted by Salem

>
> *if first char is delim, return delim
Shouldn't you return a pointer to an empty string?
Like say
return "";

> Does having a for() loop with increment makes a lot of difference with just evaluating "end of string" case?
I don't know what you mean. You'd have to post two example loops for us to comment on the relative differences.

like
> for (j=0 ; j<strlen(line);j++)
This is awful.
Use line[j] != '\0' as the test

I am returning delim because I am testing the case in the code (if it is delim, I'will INSERT into SQL table "NULL". It is probable to we have valid " " as a contained values between two delimiters)

**ledow** · 12-13-2012

I would also suggest, as a general troubleshooting technique under similar circumstances that may crop up, to familiarise yourself with some basic profiling. Having people fix your problem for you is all well and good, but if you couldn't even see where the main performance hit was coming from then how did you ever expect to be able to answer it yourself? It seems you were looking into mmap'ing before you even thought to profile the app and see where the bottleneck actually was.

- Honestly, compiling and running the program through gprof would immediately tell you where you're spending 99.9% of your time anyway.
- Hell, even a primitive printf on entering and leaving functions using some crude timings would show you quite a lot about where the bottleneck is (I'm a big advocate of "simple debugging" like this because people whinge so much when you say "just learn how to use this quite complex tool that may not even be available for your platform", and simple debugging eliminates those complaints before you even start. Seriously, a printf with the current time in it, copy/pasted to the critical parts of your program - e.g. "Entering func() at 112487354.2 seconds. Leaving func after spending 89.2 seconds inside it" - with maybe even a counter "I've entered this function 32748237489273 times since the start of the program" - that's within ANYONE'S range of programming skill, requires no external tools, and is good enough to help you get a hint at what's happening. Hell, I have a 150,000 line program that I'm working on at the moment and I have a macro that just sticks that printf with the information into any function I paste it into - even with the function name - so I can work out what happened without having to churn through reams of debugger/profiler output for a single answer).

And once you know where the problem is, you can just sit and tweak then time, tweak then time until you can get it down to the lowest you can until you can narrow the problem to a particular function or even line of code and then post it for optimisation help.

Help yourself, and put some simple debugging or run a proper profiler (which costs nothing but a few characters in your compile line if you're using gcc, or MinGW, or another derivative of it already)

**Salem** · 12-13-2012

> > for (j=0 ; j<strlen(line);j++)
If line is 10 chars long, you call strlen() 10 times, and it always returns 10
If line is 1000 chars long, you call strlen() 1000 times, and it always returns 1000
Not only does each strlen() call take longer and longer to do (with increasing line length), you call it more and more times (twice the hit!)
On the whole, it's much better to not call it at all.

Given Adak's post #8, and ledow's timely introduction of a bit of profiling sanity, my gut feel is that you're going to find that the SQL is the elephant in the room all of a sudden.

> I am returning delim because I am testing the case in the code (if it is delim, I'will INSERT into SQL table "NULL". It is probable to we have valid " " as a contained values between two delimiters)
Note that I said
return ""; // empty string
not
return " "; // single space

**fnprintf** · 12-13-2012

> I made up a 655MB text file, with at least 3 bad chars on each line of text, and 1,491,006 lines of CSV's.
@Adak - How did you make your test file? Just random text with some "bad" chars thrown in?

**Adak** · 12-13-2012

Originally Posted by fnprintf

> I made up a 655MB text file, with at least 3 bad chars on each line of text, and 1,491,006 lines of CSV's.
@Adak - How did you make your test file? Just random text with some "bad" chars thrown in?

I do a LOT of folding for Folding@Home (world-wide distributed computer type disease research run by Stanford University).
Folding@Home: Folding@home - HomePage

Me: Adak - User Summary - EXTREME Overclocking Folding @ Home Stats

The huge data file was the download of every folders stats, that Stanford makes available. I used that, wrote some code to give it some extra bad chars I knew it had a few on each line, because it uses tabs '\t' to separate the fields of each record. These tabs are listed as "bad" in the Original Posters program, and wrote that out to a file. To make it closer to the OP's needs, I added commas to separate each field.

That "bad" file, was input for the timed run. The lines aren't long, but overall, it was a good fit, and a huge file. Input data looked like this:

Sun Jul 10 09:20:00 PDT 2011
name, ;newcredit, sum(total), team
anonymous, 2743076847, 16093991, 0
PS3, .2436132886, 10113108, 0
PDC, 920381552, 15965, 1
Angra, :427916763, 108859, 36362
kennish, 375378324, 105099, 39340
AtlasFolder, ;356534004, 695128, 36167
Scott_H, 342966259, 437344, 2654
awachs, `341500090, 77431, 181223
SACO, 308934471, 15098, 3213
HayesK, ~302161623, 179545, 32
Michael_McCord,_M.D., 246149204, 248070, 11108
Leganfuh, *243263626, 147521, 75255
ChasR, 236938861, 310008, 32
[Zebulon.fr]_Cobra, \233830170, 144926, 51
mmillion, 229920411, 140418, 111065
mutsu, ?228498654, 145086, 11108
OverClocking-Masters.com, 213583169, 183699, 51
dropper, =212694672, 50968, 33
phoenicis, 211662473, 93610, 35947
zz9pzza, )211014076, 15793, 35947
Computekinc.us, 203181956, 204244, 32
musky, (200650606, 6715, 33
EVGApes, 200215225, 147386, 111065
OlivierZ, /180898994, 43221, 51

So each line of text had 3 bad tabs that needed to be removed, and every other line had an extra (and rotating) bad char from the bad char array.

Output file looked like this:

Sun Jul 10 092000 PDT 2011
name, newcredit, sumtotal, team
anonymous, 2743076847, 16093991, 0
PS3, 2436132886, 10113108, 0
PDC, 920381552, 15965, 1
Angra, 427916763, 108859, 36362
kennish, 375378324, 105099, 39340
AtlasFolder, 356534004, 695128, 36167
ScottH, 342966259, 437344, 2654
awachs, 341500090, 77431, 181223
SACO, 308934471, 15098, 3213
HayesK, 302161623, 179545, 32
MichaelMcCord,MD, 246149204, 248070, 11108
Leganfuh, 243263626, 147521, 75255
ChasR, 236938861, 310008, 32
ZebulonfrCobra, 233830170, 144926, 51
mmillion, 229920411, 140418, 111065
mutsu, 228498654, 145086, 11108
OverClockingMasterscom, 213583169, 183699, 51
dropper, 212694672, 50968, 33
phoenicis, 211662473, 93610, 35947
zz9pzza, 211014076, 15793, 35947
Computekincus, 203181956, 204244, 32
musky, 200650606, 6715, 33
EVGApes, 200215225, 147386, 111065
OlivierZ, 180898994, 43221, 51

Thread: fastest way of processing a file

Thread Tools

Search Thread

Display

fastest way of processing a file

Similar Threads

File processing: How to read a part of a text file?

processing file?

Problem with Creating File for File Processing

What's the fastest way to write a file?

Reading a File - FASTEST WAY POSSIBLE??