Hi
I have a FASTA file containing up to 2,000,000 lines of sequence strings. I wrote code that works well with smaller files, but as the file grows it gets slower. Can you please help me make it more efficient? I am reading it record by record.
My code is
Code:
#include <zlib.h>
#include <stdlib.h>
#include <string.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int z = 0;
gzFile fp = gzopen(dbFile, "r"); // Read database FASTA file into host memory
kseq_t *seq_d = kseq_init(fp);
int d;
while ((d = kseq_read(seq_d)) >= 0) {
    unsigned char *b = (unsigned char *)malloc(256);
    memcpy(b, seq_d->seq.s, 256); // NOTE: reads past the end of seq.s if the sequence is shorter than 256 bytes
    ....
    do work with b
    ....
    ............
    z++;
    free(b);
}
kseq_destroy(seq_d);
gzclose(fp);
I am confused about why it takes more time when the file holds, say, 100,000 sequences, even for the first iteration, which runs very efficiently in the 10,000 case.
For example, I put a printf statement in each iteration. With 10,000 sequences the first iteration takes less than a millisecond, whereas with 100,000 strings even the first iteration takes more than 30 seconds to print, and so on. Why could it be this slow?