Thread: performance - read huge FASTA file line by line

  1. #1
    Registered User
    Join Date
    Feb 2013
    Posts
    45

    performance - read huge FASTA file line by line

    Hi
I have a FASTA file that contains sequence strings, up to 2,000,000 lines. I wrote code that works well with smaller files, but as the file grows it gets slower. Can you please help me make it more efficient? I am reading it line by line.
    My code is

    Code:
    #include <zlib.h>
    #include <stdlib.h>
    #include <string.h>
    #include "kseq.h"
    KSEQ_INIT(gzFile, gzread)

    gzFile fp;
    kseq_t *seq_d;
    int d;
    int z = 0;

    fp = gzopen(dbFile, "r");    /* read database FASTA file into host memory */
    seq_d = kseq_init(fp);
    while ((d = kseq_read(seq_d)) >= 0) {
            unsigned char *b = (unsigned char *)malloc(sizeof(unsigned char) * 256);
            memcpy(b, seq_d->seq.s, 256);
            /* ... do work with b ... */
            z++;
            free(b);
    }
    kseq_destroy(seq_d);
    gzclose(fp);

    What confuses me is why it takes more time even for the first iteration when the file has, say, 100,000 strings, while the same iteration runs very efficiently with 10,000 strings.
    For example, I put a printf statement in each iteration. With 10,000 strings the first iteration takes less than a millisecond, whereas with 100,000 strings even the first iteration takes more than 30 seconds to print, and so on. Why could it be that slow?
    Last edited by gevni; 05-25-2017 at 05:58 AM.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    It looks like you free(b) at the wrong place. Anyway, with just 256 unsigned char, you probably don't need malloc and the overhead that comes with it: just declare a fixed size array of 256 unsigned char.
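    A minimal sketch of that suggestion, reusing the seq_d/d setup from post #1 and assuming every sequence fits in 256 bytes (n is just a placeholder name here):
    Code:
    unsigned char b[256];                       /* automatic array: no malloc/free per record */

    while ((d = kseq_read(seq_d)) >= 0) {
        size_t n = seq_d->seq.l < sizeof b ? seq_d->seq.l : sizeof b;
        memcpy(b, seq_d->seq.s, n);             /* copy at most 256 bytes of the sequence */
        /* ... do work with b ... */
    }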
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Feb 2013
    Posts
    45
    Sorry, that was a mistake in the post; I actually free(b) inside the loop. Also, I only put 256 here for the sake of the question; the length varies depending on the string length.

    Quote Originally Posted by laserlight View Post
    It looks like you free(b) at the wrong place. Anyway, with just 256 unsigned char, you probably don't need malloc and the overhead that comes with it: just declare a fixed size array of 256 unsigned char.

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by gevni
    This length is varying depending on the string length.
    Ah, then you should keep track of it, and only realloc when you need a longer length than has already been allocated. This way, you avoid allocating memory on each iteration, in exchange for a bit of housekeeping to decide when to reallocate.
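    A rough sketch of that housekeeping, reusing the kseq setup from post #1 (seq.s is the sequence, seq.l its length); b and cap are placeholder names:
    Code:
    unsigned char *b = NULL;     /* buffer reused across iterations */
    size_t cap = 0;              /* bytes currently allocated */

    while ((d = kseq_read(seq_d)) >= 0) {
        size_t need = seq_d->seq.l + 1;         /* room for a terminating '\0' */
        if (need > cap) {                       /* grow only when this record is longer */
            unsigned char *tmp = realloc(b, need);
            if (tmp == NULL) break;             /* allocation failure: bail out */
            b = tmp;
            cap = need;
        }
        memcpy(b, seq_d->seq.s, seq_d->seq.l);
        b[seq_d->seq.l] = '\0';
        /* ... do work with b ... */
    }
    free(b);                                    /* one free, after the loop */
    Growing to exactly the needed size is enough; doubling cap instead would cut the number of realloc calls even further.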

  5. #5
    Registered User
    Join Date
    Feb 2013
    Posts
    45
    What confuses me is why it takes more time even for the first iteration when the file has, say, 100,000 strings, while the same iteration runs very efficiently with 10,000 strings.
    For example, I put a printf statement in each iteration. With 10,000 strings the first iteration takes less than a millisecond, whereas with 100,000 strings even the first iteration takes more than 30 seconds to print, and so on. Why could it be that slow?


    Quote Originally Posted by laserlight View Post
    Ah, then you should keep track of it, and only realloc when you need a longer length than has already been allocated. This way, you avoid allocating memory on each iteration, in exchange for a bit of housekeeping to decide when to reallocate.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by gevni
    For example, I put a printf statement in each iteration. With 10,000 strings the first iteration takes less than a millisecond, whereas with 100,000 strings even the first iteration takes more than 30 seconds to print, and so on.
    Do you mean "100,000 strings" or "100,000 characters"?

    Perhaps you should post the actual code, with the printf calls, from which you obtained this "less than a millisecond" versus "more than 30 seconds" difference.
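    For reference, per-record timing of that kind might look like this on a POSIX system (clock_gettime with CLOCK_MONOTONIC), again reusing the seq_d setup from post #1; this is only an illustration of the instrumentation meant, not the actual code being asked for:
    Code:
    #include <stdio.h>
    #include <time.h>

    struct timespec t0, t1;
    int z = 0;
    for (;;) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (kseq_read(seq_d) < 0) break;        /* end of file or error */
        /* ... per-record work ... */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        fprintf(stderr, "record %d: %.3f ms\n", z++, ms);
    }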

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Reading files is slow, whatever you do.
    Code:
    seq_d = kseq_init(fp);
    while ((d = kseq_read(seq_d)) >= 0) {
    }
    kseq_destroy(seq_d);
    Your code will never be any faster than this, so you need to know what the baseline is.

    Let's say it takes 1m to read your large file.

    If the code you add into the while loop extends that to say 1m5s, then what is there to stress about? Nothing you can do will make your users believe anything other than "it takes about a minute".

    Now if the code you add changes 1m into 10m, then sure, you should (or could) do something about it - it really depends on the amount of work you need to do in the loop. But even if you can't, your users might then think "oh, it takes 10 minutes, time for coffee".
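    A baseline run along those lines could look like this, reusing the includes, dbFile, and kseq setup from post #1 and doing nothing per record; time the whole program with the shell's time command:
    Code:
    gzFile fp = gzopen(dbFile, "r");
    kseq_t *seq_d = kseq_init(fp);
    long records = 0;
    while (kseq_read(seq_d) >= 0)
        records++;                              /* read every record, do no work */
    kseq_destroy(seq_d);
    gzclose(fp);
    printf("read %ld records\n", records);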
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Registered User
    Join Date
    Feb 2013
    Posts
    45
    100,000 strings, not the number of characters in each string. When the number of strings [lines] grows, the code gets slow even for the very first iteration.

    Quote Originally Posted by laserlight View Post
    Do you mean "100,000 strings" or "100,000 characters"?

    Perhaps you should post the actual code, with the printf calls, from which you obtained this "less than a millisecond" versus "more than 30 seconds" difference.

  9. #9
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by gevni
    100,000 strings, not the number of characters in each string. When the number of strings [lines] grows, the code gets slow even for the very first iteration.
    kseq_read reads a line, right? So that doesn't make sense, unless the library is performing some kind of buffering that turns out to be slow; but then subsequent line reads should be fast, so you should check that. If reading subsequent lines turns out to be similarly slow, then perhaps the library you're using has a performance bug when reading large files line by line. (Or... maybe kseq_read is the wrong function to use.)
    Last edited by laserlight; 05-25-2017 at 06:45 AM.

  10. #10
    Registered User
    Join Date
    Feb 2013
    Posts
    45
    Let me explain it again; I think you misunderstood.
    Say there are 5 strings in the FASTA file (so the while loop runs 5 iterations). If each string is the same size, then each iteration should take the same time, right?
    So if I run the code with 5 strings, the first iteration finishes in 2 ms, the second in 2 ms, and so on.
    Now if I run the same code with a FASTA file of 50,000 strings, the first iteration takes more than 2 ms, even though the string size is the same. I know the whole run will be slower with more strings, but why would even a single iteration be slower?

    Quote Originally Posted by Salem View Post
    Reading files is slow, whatever you do.
    Code:
    seq_d = kseq_init(fp);
    while ((d = kseq_read(seq_d)) >= 0) {
    }
    kseq_destroy(seq_d);
    Your code will never be any faster than this, so you need to know what the baseline is.

    Let's say it takes 1m to read your large file.

    If the code you add into the while loop extends that to say 1m5s, then what is there to stress about? Nothing you can do will make your users believe anything other than "it takes about a minute".

    Now if the code you add changes 1m into 10m, then sure, you should (or could) do something about it - it really depends on the amount of work you need to do in the loop. But even if you can't, your users might then think "oh, it takes 10 minutes, time for coffee".

  11. #11
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Just run the empty loop to find out the performance characteristics as the file grows.

    If the behaviour is as you describe, then your real issue is with the people who provided you with the code/library for 'kseq'.

    There isn't anything we (or you) can do about it.

  12. #12
    Registered User
    Join Date
    Feb 2013
    Posts
    45
    Thanks @Salem, that was a very good point. I hadn't noticed that my code had two loops that actually ran to the size of the file and weren't needed. I eliminated them, and now it works perfectly. I figured it out by running the empty loop. Thanks again.
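    For readers hitting the same symptom, the shape of such a bug is roughly the following (total_records is a hypothetical name for illustration): an inner loop whose bound grows with the file size runs once per record, so the total work becomes quadratic, and even the first outer iteration pays the full inner-loop cost.
    Code:
    while ((d = kseq_read(seq_d)) >= 0) {
        for (long i = 0; i < total_records; i++) {   /* bound grows with the file */
            /* ... work repeated needlessly for every record ... */
        }
    }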

  13. #13
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by gevni
    I hadn't noticed that my code had two loops that actually ran to the size of the file and weren't needed. I eliminated them, and now it works perfectly. I figured it out by running the empty loop.
    This is why you should post the smallest and simplest (compilable) code that demonstrates the problem. Your code in post #1 had typo errors, so presumably you didn't actually try to compile and time it, and then you did not take up my suggestion in post #6,
    Quote Originally Posted by laserlight
    Perhaps you should post the actual code, with the printf calls, from which you obtained this "less than a millisecond" versus "more than 30 seconds" difference.
    so all along we have been reasoning about a program that wasn't like the program that you were actually trying to improve, and that is just a waste of time.
