| | #1 |
| Registered User Join Date: Jun 2009
Posts: 5
| Trying to make this run faster First, I'd like to say that I'm new here and I hope to find the help I need. I wrote a C program that reads a GIGANTIC file (~16GB) line by line using fscanf in a while loop, as shown below. Code: while(fscanf(file, "%d %d %d", &inst_type, &in_delay, &sp_type ) != EOF)
{
// here I do simple processing with the variables read from the file
}
Anyways, if I don't read the file and instead generate those variables randomly, the program runs very quickly (a few minutes to finish). I tried to output something (say, result y) every 100 thousand lines to see how fast it's going, and I noticed the following. The first 10 y's came out quickly (they followed one another at an almost fixed interval of a second or two), but then the y's started to come out slower and slower. It was like it was becoming exponentially SLOWER. I don't know what the problem is. Is it the way I'm reading the file? Is there a buffer somewhere that's getting huge and making things dog slow? I really don't know. Does anyone have any idea about this? Thanks P.S. I also tried to profile the program (using gprof), but did not have any luck getting it to work. gprof generated all kinds of errors, and I'm in no mood to get busy with those. |
| wkohlani is offline | |
| | #2 |
| Guest Join Date: Aug 2001
Posts: 5,249
| Not sure, but fscanf isn't a very efficient function. You might have better luck doing things "manually" (ie: fgets, then parsing the ints out, etc).
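A minimal sketch of the "manual" approach suggested above, assuming every line holds three whitespace-separated integers as in the original post. The names `parse_three_ints` and `count_records` are hypothetical, and the 128-byte line buffer is an assumption about the data, not something stated in the thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse three whitespace-separated ints from one line.
   Returns 1 on success, 0 if the line does not hold three integers. */
int parse_three_ints(const char *line, int *a, int *b, int *c)
{
    int *out[3];
    char *end;
    int i;

    out[0] = a; out[1] = b; out[2] = c;
    for (i = 0; i < 3; i++) {
        long v = strtol(line, &end, 10);
        if (end == line)        /* no digits consumed: malformed line */
            return 0;
        *out[i] = (int)v;
        line = end;             /* continue after the number just read */
    }
    return 1;
}

/* Read the stream line by line with fgets() and count well-formed records. */
long count_records(FILE *file)
{
    char buf[128];              /* assumes lines are shorter than 128 bytes */
    int inst_type, in_delay, sp_type;
    long n = 0;

    while (fgets(buf, sizeof buf, file) != NULL) {
        if (parse_three_ints(buf, &inst_type, &in_delay, &sp_type))
            n++;                /* the per-line processing would go here */
    }
    return n;
}
```

Unlike a loop that compares the `fscanf` result against `EOF`, this version skips a malformed line instead of stalling on it.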
__________________ Code: int main(void){srand(time(0));for(double l=rand(),l0=0,l00=0;;l0+=0.1){for(double l000=0;l000
<1;l000+=.001,l+=((double)rand()/RAND_MAX)/0x64,l00+=((sin(l*0x8*atan(l0)*l000-(l0*0x8*atan
(l)))*0.5)+0.5)){for(size_t l0000=0,l00000=(size_t)(0x50*(l00-floor(l00)));l0000<l00000;++l0000
)putchar(0x20);putchar(0x61+(int)((double)rand()/RAND_MAX*0x1a));putchar('\n');}}return 0;}
|
| Sebastiani is offline | |
| | #3 |
| Registered User Join Date: Sep 2006
Posts: 3,720
| fscanf() is the fastest file reader for text, in the tests I've seen, so that's not your problem. fread() in binary mode is naturally faster, however - no translation for a text file is done. I suspect you're trying to read and write out, in alternating loops, a very small amount of data, to the same hard drive. That will just tear up the performance of the HD. Your HD has a buffer, and you should be reading in data in a tight loop, and putting that data into a large array to process later. That will optimize the HD performance. One or two lines being read every second or two is an absolute travesty! You should have >500 lines read in two seconds, if the data is adjacent on the HD. Then when you've filled up the array, process all the data (do no more HD work until your data is ready to be written to a file). Then write out all the data you've processed from the array, all at once. Last edited by Adak; 06-23-2009 at 11:56 PM. |
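One low-effort way to approximate "read in large chunks" without restructuring the loop is to ask stdio for a bigger buffer with the standard `setvbuf` call. This is only a sketch: the helper name `open_with_big_buffer` is hypothetical, the 1 MB size is an arbitrary assumption, and nothing in the thread establishes that buffer size is the actual cause of the slowdown.

```c
#include <stdio.h>

/* Hypothetical helper: open a file for reading with a large, fully
   buffered stdio buffer so each underlying read fetches a big chunk.
   The 1 MB size is an arbitrary assumption, not a tuned value. */
FILE *open_with_big_buffer(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return NULL;

    /* setvbuf must be called before any other operation on the stream.
       Passing NULL lets the library allocate the buffer itself. */
    if (setvbuf(f, NULL, _IOFBF, 1 << 20) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

The returned stream can be passed to the existing `fscanf` or `fgets` loop unchanged; the same call on the output stream would batch the writes as well.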
| Adak is offline | |
| | #4 | ||
| Registered User Join Date: Sep 2004 Location: California
Posts: 3,029
| Quote:
Quote:
| ||
| bithub is offline | |
| | #5 |
| Registered User Join Date: Jun 2009
Posts: 5
| Thanks guys for this good start! I don't think I'm very worried about "fgets" or "fscanf" because I've used both before and things were the same. The big problem that I don't understand is why at first things run very fast (which I would like to keep), but then gradually the program gets slower and slower. That is, the first 3-4 million lines are processed in less than 12 minutes, but the next 4 million lines take a few hours, and the next few million lines take even longer, etc. That's what I don't understand. Thanks again for the effort, and I hope someone can give me a pointer as to what is slowing things down. |
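To pin down where the slowdown described above begins, it may help to log the elapsed time for each block of lines rather than just watching the output. The sketch below is an illustration only: `time_blocks` is a hypothetical name, and `time()` has one-second resolution, which is coarse but enough for intervals that grow from seconds to hours.

```c
#include <stdio.h>
#include <time.h>

/* Read records and report how long each block of `interval` lines took.
   Returns the number of reports written to `log`. */
long time_blocks(FILE *file, long interval, FILE *log)
{
    int inst_type, in_delay, sp_type;
    long lines = 0, reports = 0;
    time_t last = time(NULL);

    while (fscanf(file, "%d %d %d", &inst_type, &in_delay, &sp_type) == 3) {
        /* the per-line processing would go here */
        if (++lines % interval == 0) {
            time_t now = time(NULL);
            fprintf(log, "%ld lines: +%.0f s\n", lines, difftime(now, last));
            last = now;
            reports++;
        }
    }
    return reports;
}
```

Running it once with the processing in place and once with the loop body empty shows whether the growing intervals come from reading the file or from the work done per line.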
| wkohlani is offline | |
| | #6 | |
| Registered User Join Date: Sep 2006
Posts: 3,720
| Quote:
You can't, period. I never said load all 16 GB into an array - I said load as much as you can into a large array, all at once. It slows down because the HD buffer (which is fast RAM), has been filled up, and now you're reading and writing in small chunks. Between each read and write, your buffer is probably flushing itself out. Your throughput is slowing down so much, I wonder if your OS is starting to use the drive to emulate RAM. That will bring your processing to a near stop! What does Task Manager or System Resources tell you about this, while it's running the program? I'll be glad to check it out, and maybe even tweak it for you, if you like. Upload the program and a data file, to Swoopshare, and give me the url in a private message. A 1 GB file (whatever the biggest that Swoopshare will allow), for testing will be fine. | |
| Adak is offline | |
| | #7 |
| Woof, woof! Join Date: Mar 2007 Location: Australia
Posts: 3,400
| This is not something I would use text-mode for. Read, say, 8 MB or more at a time with fread(), and parse each line yourself. You should see a tremendous increase then. |
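A sketch of the chunked approach described above, assuming three integers per line and a file opened in binary mode (`"rb"`). The names `parse_line` and `count_records_chunked` are hypothetical. The part that needs care is the line that straddles two chunks: the incomplete tail is moved to the front of the buffer before the next `fread`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (8u * 1024u * 1024u)   /* 8 MB, as suggested above */

/* Hypothetical per-line parser: returns 1 if the line holds three ints. */
int parse_line(const char *line, int *a, int *b, int *c)
{
    char *end;
    long v[3];
    int i;

    for (i = 0; i < 3; i++) {
        v[i] = strtol(line, &end, 10);
        if (end == line)
            return 0;                     /* no digits: malformed line */
        line = end;
    }
    *a = (int)v[0]; *b = (int)v[1]; *c = (int)v[2];
    return 1;
}

/* Read the stream in large chunks and split it into lines by hand.
   Returns the number of well-formed records, or -1 on allocation
   failure or if a single line does not fit in one chunk. */
long count_records_chunked(FILE *f)
{
    char *buf = malloc(CHUNK_SIZE + 1);   /* +1 for a terminating NUL */
    size_t len = 0;                       /* bytes currently held in buf */
    long records = 0;
    int a, b, c;

    if (buf == NULL)
        return -1;

    for (;;) {
        size_t got = fread(buf + len, 1, CHUNK_SIZE - len, f);
        char *start = buf, *nl;
        len += got;

        /* Handle every complete line currently in the buffer. */
        while ((nl = memchr(start, '\n', len - (size_t)(start - buf))) != NULL) {
            *nl = '\0';
            if (parse_line(start, &a, &b, &c))
                records++;                /* per-line processing goes here */
            start = nl + 1;
        }

        /* Move the trailing partial line to the front for the next read. */
        len -= (size_t)(start - buf);
        memmove(buf, start, len);

        if (got == 0) {                   /* end of file (or read error) */
            if (len > 0) {                /* last line had no newline */
                buf[len] = '\0';
                if (parse_line(buf, &a, &b, &c))
                    records++;
            }
            break;
        }
        if (len == CHUNK_SIZE) {          /* one line filled the whole chunk */
            records = -1;
            break;
        }
    }
    free(buf);
    return records;
}
```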
| zacs7 is offline | |
| | #8 |
| Registered User Join Date: Jun 2009
Posts: 313
| You say you output every million lines: read the relevant info from 1 million lines into an array, do your operations and output, then continue reading a million lines at a time into this array and repeat. An array of 3 million ints is certainly manageable, and you can keep memory use low by just overwriting it each time. You say that the first million lines are very fast, and that it gets slower after that. I don't know the cause, but doing this might allow you to keep the speed of the first million lines. Last edited by KBriggs; 06-24-2009 at 08:06 AM. |
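The read-a-batch, process, overwrite, repeat loop described above could look roughly like this. It is a sketch under assumptions: the `record` struct and the function names are hypothetical, and `process_batch` only sums one field as a stand-in for the real processing, which the thread never shows.

```c
#include <stdio.h>
#include <stdlib.h>

#define BATCH_LINES 1000000u   /* one million lines per batch, as suggested */

/* Hypothetical record type for the three values on each line. */
struct record {
    int inst_type;
    int in_delay;
    int sp_type;
};

/* Fill `batch` with up to `max` records; returns how many were read.
   Stops early at end of file or at the first malformed line. */
size_t read_batch(FILE *file, struct record *batch, size_t max)
{
    size_t n = 0;
    while (n < max &&
           fscanf(file, "%d %d %d", &batch[n].inst_type,
                  &batch[n].in_delay, &batch[n].sp_type) == 3)
        n++;
    return n;
}

/* Placeholder for the real work: here it only sums in_delay. */
long process_batch(const struct record *batch, size_t n)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += batch[i].in_delay;
    return sum;
}

/* Read, process, and reuse the same array until the input is exhausted.
   Returns 0 on success, -1 if the batch array cannot be allocated. */
int process_file(FILE *file, size_t batch_lines, long *total)
{
    struct record *batch = malloc(batch_lines * sizeof *batch);
    size_t n;

    if (batch == NULL)
        return -1;
    *total = 0;
    while ((n = read_batch(file, batch, batch_lines)) > 0)
        *total += process_batch(batch, n);  /* per-batch output goes here */
    free(batch);
    return 0;
}
```

A call such as `process_file(file, BATCH_LINES, &total)` keeps memory use fixed at one batch regardless of the file size, because the array is overwritten on every pass.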
| KBriggs is offline | |
| | #9 | |
| dat is, vast staat Join Date: Jul 2008 Location: SE Queens
Posts: 6,612
| Quote:
Code: #include <stdio.h>
#include <time.h>
void test_fgets(FILE *fd) {
char buffer[64];
while (fgets(buffer,64,fd));
rewind(fd);
}
void test_fscanf(FILE *fd) {
char buffer[64];
while (fscanf(fd,"%63s",buffer)!=EOF); /* width limit keeps long words from overflowing buffer[64] */
rewind(fd);
}
int main() {
double lapse;
time_t start, end;
FILE *ptr=fopen("input/dictionary.txt","r");
if (ptr==NULL) { perror("fopen"); return 1; } /* bail out if the test file is missing */
int i;
start=time(NULL);
for (i=0;i<1000;i++) { test_fgets(ptr); fprintf(stderr,"."); }
end=time(NULL);
lapse=difftime(end,start);
printf("test_fgets: %d seconds\n",(int)lapse);
start=time(NULL);
for (i=0;i<1000;i++) { test_fscanf(ptr); fprintf(stderr,"."); }
end=time(NULL);
lapse=difftime(end,start);
printf("test_fscanf: %d seconds\n",(int)lapse);
fclose(ptr);
return 0;
}
test_fscanf: 11 seconds "dictionary.txt" is 622 KB. Just watching the output ("......") it is easy to tell fgets is about twice as fast. So depending on what the OP is doing with each line, this could make a HUGE amount of difference.
__________________ C programming resources: GNU C Function and Macro Index -- glibc reference manual The C Book -- nice online learner guide Current ISO draft standard CCAN -- new CPAN like open source library repository GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2 cpwiki -- our wiki on sourceforge | |
| MK27 is offline | |
| | #10 | |
| Tha 1 Sick RAT Join Date: Dec 2003
Posts: 271
| Quote:
Remember, as you start working with such large data, things that normally don't matter so much come into play, like your HD buffer, your working RAM, and all the other hardware and memory management related issues. That's the reason meteorological centers have vast arrays of powerful equipment for their data sets.
__________________ A hundred Elephants can knock down the walls of a fortress... One diseased rat can kill everyone inside | |
| WDT is offline | |
| | #11 |
| and the hat of Destiny Join Date: Aug 2001 Location: The edge of the known universe
Posts: 22,495
| What else is going on inside that loop, besides printing something every so often? Are you allocating any memory at all? Because that will hose your machine unless you actually have a 64-bit OS and substantially more than 16GB of RAM to play with. Try this: Code: int lines = 0;
char buff[BUFSIZ];
while ( fgets( buff, sizeof buff, file ) != NULL ) {
if ( ++lines == 1000000 ) {
putchar('.'); fflush(stdout);
lines = 0;
}
}
If it doesn't, then something else is going on. |
| Salem is offline | |
| | #12 | |
| Registered User Join Date: Sep 2004 Location: California
Posts: 3,029
| Quote:
Ok. I used this program to fill a file with 10 million lines: Code: #include <stdio.h>
int main(void)
{
FILE* f = fopen ("./data", "w");
unsigned int i;
for(i = 0; i < 10000000; ++i)
fprintf(f, "%u %u %u\n", i, i, i); /* i is unsigned, so use %u */
fclose(f);
}
Code: #include <stdio.h>
#include <ctype.h>
void slow(void)
{
FILE* f = fopen("./data", "r");
int first, second, third;
int counter = 0;
while(fscanf(f, "%d %d %d", &first, &second, &third) != EOF)
{
counter++;
}
fclose(f); /* release the stream once the file has been read */
printf("Got %d entries\n", counter);
}
int get_value(const char** pp)
{
const char* p = *pp;
int value = 0;
while(isdigit((unsigned char)*p)) /* cast avoids undefined behavior for negative char values */
{
value *= 10;
value += *p++ - '0';
}
*pp = p + 1; /* Add 1 to skip space */
return value;
}
void fast(void)
{
FILE* f = fopen("./data", "r");
int first, second, third;
int counter = 0;
char buffer[128];
int value;
while(fgets(buffer, sizeof(buffer), f))
{
const char* p = buffer;
first = get_value(&p);
second = get_value(&p);
third = get_value(&p);
counter++;
}
fclose(f); /* release the stream once the file has been read */
printf("Got %d entries\n", counter);
}
int main(void)
{
slow();
return 0;
}
| |
| bithub is offline | |
| | #13 |
| Guest Join Date: Aug 2001
Posts: 5,249
| >> The slow method took 9.52 seconds, the fast method took 1.73 seconds. This is using GCC 4.01 with optimizations turned on. I wasn't quite sure how much of an improvement it would make, but I'm really not surprised that it turned out to be several times faster. fscanf is a huge, monolithic, swiss-army knife of a function that's great for general purpose work but often suboptimal when compared to a special purpose solution. In most cases I would prefer the former, but when it becomes a bottleneck you sometimes have to resort to "hand-rolled" approaches. Having said that, I would agree with Salem that the actual issue is probably somewhere else. But it may be a good idea, anyway, to adopt something like Bithub suggested, in order to squeeze out as much performance as possible.
|
| Sebastiani is offline | |
| | #14 |
| Registered User Join Date: Sep 2006
Posts: 3,720
| Well done, Bithub! I don't consider your fast function anything standard for input from a file, however. Here are the times on my (rather slow) SATA drive: Fast ( fgets() with custom() ): 3.95 seconds (two runs averaged) Slow ( standard fscanf() ): 6.81 seconds (two runs averaged) When you use the standard sscanf() in place of your custom function, you get: Fast ( fgets() with sscanf() ): 7.36 seconds (two runs averaged) I've seen this tested several times and fscanf() always beats other standard text mode input from a file with assignments. |
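For reference, the "fgets() with sscanf()" variant timed above would presumably look something like the sketch below; the exact code used for that measurement is not shown in the thread, so the function name `count_with_sscanf` and the buffer size are assumptions modelled on the earlier `fast()` function.

```c
#include <stdio.h>

/* Count lines holding three ints, using fgets() to read each line and
   the standard sscanf() to convert it. */
long count_with_sscanf(FILE *f)
{
    char buffer[128];          /* assumes lines are shorter than 128 bytes */
    int first, second, third;
    long counter = 0;

    while (fgets(buffer, sizeof buffer, f) != NULL) {
        /* sscanf returns the number of fields it converted */
        if (sscanf(buffer, "%d %d %d", &first, &second, &third) == 3)
            counter++;
    }
    return counter;
}
```

This keeps the input handling entirely in the standard library, at the cost of running the general-purpose format parser once per line.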
| Adak is offline | |
| | #15 |
| Registered User Join Date: Sep 2006
Posts: 3,720
| wkohlani: You know what would be fun for the forum members, and *great* for you, is to put a test data file and your current program up where they can be downloaded; then the members here could have a little fun contest to see who can make it run the fastest, and by how much. I'm sure your program would be significantly faster, and we could also see what might be causing your program/system to continually degrade in throughput. Best of all, it's free. Naturally, don't include any sensitive info in the files. |
| Adak is offline | |