Thread: parsing a FASTA file (protein sequences)

  1. #1
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166

    I am currently rewriting an old biology program of mine.
    I want to parse a FASTA file (contains short protein descriptions and protein sequences) and extract the description line (starting with a '>') and the protein sequence separately.

    Example of a FASTA input file:
    Code:
    >protein1
    MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
    MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
    RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
    GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
    IPVHPNDHVNKSQ
    >protein2
    MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
    LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
    MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
    TKHHEGFTNW*
    >protein3
    MKTLLLLAVIMIFGLLQAHGNLVNFHRMIKLTTGKEAALSYGF
    CHCGVGGRGSPKDATDRCCVTHDCCYKRLEKRGCGTKFLSYKF
    SNSGSRITCAKQDSCRSQLCECDKAAATCFARNKTTY
    I've come up with the following function:
    Code:
    int getProtein(FILE *in, char descr[], char seq[])
    {
    	int d, s, length = 0;
    	
    	if(in == NULL)
    		return -1;
    	
    	d = fscanf(in, "%[^\n]%*c", descr); // read until '\n' is found
    	s = fscanf(in, "%[^>]%n%*c", seq, &length); // read until '>' is found
    	
    	if(d == EOF || s == EOF)
    		return -1;
    	
    	return length;
    }
    I would like to know whether this is indeed a working solution or whether there are some pitfalls left. I also wonder whether I have to differentiate between Linux/Unix and Windows systems regarding '\n' and '\r\n'?
    It would be great if there were a way to filter out the '\n' or '\r\n' with additional conversion specifiers.

    [edit]
    Hmmm ... I have found at least one pitfall: how do I avoid a buffer overflow if the input is longer than the descr/seq arrays? Adding a field width didn't help, or I did it the wrong way.
    [/edit]

    Thank you for your help.
    Last edited by Sargnagel; 04-03-2003 at 06:08 AM.

  2. #2
    Open to suggestions Brighteyes's Avatar
    Join Date
    Mar 2003
    Posts
    204
    I would like to know if this is indeed a working solution or if there are some pitfalls left.
    That'll work, but it'll also be very bad if 'descr' and 'seq' are too small to hold all of the data. You could also do more error checking with fscanf: it returns the number of items it converted, so a result of 0 (nothing matched) is a failure just like EOF. You should also test d immediately after the function call so that you can return right away instead of calling fscanf again as if nothing went wrong.
    I also wonder if I have to differentiate between Linux/Unix and Windows systems regarding '\n' and '\r\n'?
    Not if the file is opened as text. C will convert whatever the operating system uses into C's '\n' so you don't have to worry about it.
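
    A minimal sketch of that advice, assuming hypothetical fixed buffers of 256 and 4096 bytes (the name getProteinChecked and both sizes are illustrative, not taken from the original program). Each fscanf result is tested right away against the one conversion it is expected to make:

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: descr holds at least 256 bytes and seq at least 4096 bytes. */
int getProteinChecked(FILE *in, char descr[], char seq[])
{
    int length = 0;

    assert(descr != NULL && seq != NULL);
    if (in == NULL)
        return -1;

    /* Header line: at most 255 characters, then discard the '\n'. */
    if (fscanf(in, "%255[^\n]%*c", descr) != 1)
        return -1;  /* EOF or an empty line: stop here, do not read on */

    /* Sequence: everything up to (not including) the next '>'.
       %n stores how many characters this call has read so far. */
    if (fscanf(in, "%4095[^>]%n", seq, &length) != 1)
        return -1;

    return length;
}
```

    Because the '>' is left in the stream here, every description, not only the first one, starts with '>'. The sequence still contains its line breaks.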

  3. #3
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166
    Originally posted by Brighteyes
    That'll work, but it'll also be very bad if 'descr' and 'seq' are too small to hold all of the data.
    Yes, that's true. The only thing I can do is check whether the character count stored by %n is greater than the maximum array length. That will not prevent the buffer overflow, but at least I can print out a message and exit the program.

    Originally posted by Brighteyes
    You could also do more error checking with fscanf, and you should test d immediately after the function call so that you can return immediately instead of calling fscanf again as if nothing went wrong.
    Yes, that's true indeed. I will change that. Thanks for pointing this out.

    Originally posted by Brighteyes
    Not if the file is opened as text. C will convert whatever the operating system uses into C's '\n' so you don't have to worry about it.
    Hmm ... after compiling my program for the first time with gcc under Cygwin on W2k, the '\r' characters were still in the string. I had to remove them with a little if statement.
    But if I remember correctly the Borland Command Line Tools do not have such problems. They behave like you've described.
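
    Such a filter could look roughly like the sketch below. The name stripLineBreaks is hypothetical, and it also drops '\n', which is an assumption based on the wish in the first post to get rid of the line breaks inside the sequence:

```c
#include <assert.h>
#include <stddef.h>

/* Removes every '\r' and '\n' from s in place and returns the new length. */
size_t stripLineBreaks(char *s)
{
    size_t from, to = 0;

    assert(s != NULL);
    for (from = 0; s[from] != '\0'; from++) {
        /* Copy only the characters that are not line-break characters. */
        if (s[from] != '\r' && s[from] != '\n')
            s[to++] = s[from];
    }
    s[to] = '\0';
    return to;
}
```

    Calling it on seq after each record gives one contiguous sequence string regardless of how the compiler's text mode treats '\r'.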

  4. #4
    Open to suggestions Brighteyes's Avatar
    Join Date
    Mar 2003
    Posts
    204
    That will not prevent the buffer overflow, but at least I can print out a message and exit the program.
    You can still prevent buffer overflow, but if the size of 'descr' and 'seq' is likely to change it's a lot harder. If they're static arrays that won't change then you can just do this
    Code:
    int getProtein(FILE *in, char descr[], char seq[])
    {
        int length = 0;
    
        if(in == NULL)
            return -1;
    
        /* Assuming 1001 is the array size: 1000 characters plus the '\0' */
        if(fscanf(in, "%1000[^\n]%*c", descr) != 1) /* read until '\n' is found */
            return -1;
        if(fscanf(in, "%1000[^>]%n%*c", seq, &length) != 1) /* read until '>' is found */
            return -1;
    
        return length;
    }
    But if you think the array sizes might change, which shouldn't happen if you choose the sizes well, then you can pass the size of the arrays as arguments and then use sprintf to set up an fscanf format like above
    Code:
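    int getProtein(FILE *in, char descr[], int dlen, char seq[], int slen)
    {
        char dbuf[32];
        char sbuf[32];
        int length = 0;
        
        /* dlen and slen are the array sizes, including room for the '\0' */
        if(in == NULL || dlen < 2 || slen < 2)
            return -1;
        
        /* The field width is one less than the array size. Note "\n", not
           "\\n": the scanset needs a real newline character. */
        sprintf(dbuf, "%%%d[^\n]%%*c", dlen - 1);
        sprintf(sbuf, "%%%d[^>]%%n%%*c", slen - 1);
    
        if(fscanf(in, dbuf, descr) != 1)
            return -1;
        if(fscanf(in, sbuf, seq, &length) != 1)
            return -1;
        
        return length;
    }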

  5. #5
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166
    Hmmm ... I guess I will have to go with a much bigger static array. I hope a size of about 70000 will do ...
    Your last example is very interesting. I will check out if I can somehow implement it.
    Thank you very much for your help.
    Last edited by Sargnagel; 04-03-2003 at 09:16 AM.

  6. #6
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166
    Originally posted by Salem
    Because Cygwin assumes you will be feeding it Unix-format text files, it does not perform the translations on text files.
    Ah, thank you for your explanation.

    Originally posted by Salem
    You're probably better off writing a couple of simple loops with a counter to prevent buffer overflow.
    Hmmm ... I guess, you're right - it shouldn't be too difficult.
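
    A rough sketch of what such loops might look like. The function name readProtein and the size parameters are hypothetical, and silently truncating input that does not fit is an assumption, not something decided in this thread:

```c
#include <assert.h>
#include <stdio.h>

/* Reads one FASTA record. dsize and ssize are the array sizes including
   the '\0'. Returns the stored sequence length, or -1 if no record is left. */
int readProtein(FILE *in, char descr[], size_t dsize, char seq[], size_t ssize)
{
    size_t d = 0, s = 0;
    int c;

    assert(dsize > 0 && ssize > 0);
    if (in == NULL)
        return -1;

    /* Skip anything before the next '>' header marker. */
    while ((c = fgetc(in)) != EOF && c != '>')
        ;
    if (c == EOF)
        return -1;

    /* Description: rest of the header line, stored without the '>'. */
    while ((c = fgetc(in)) != EOF && c != '\n') {
        if (c != '\r' && d < dsize - 1)   /* counter prevents overflow */
            descr[d++] = (char)c;
    }
    descr[d] = '\0';

    /* Sequence: everything up to the next '>' or EOF, minus line breaks. */
    while ((c = fgetc(in)) != EOF && c != '>') {
        if (c != '\n' && c != '\r' && s < ssize - 1)
            seq[s++] = (char)c;
    }
    seq[s] = '\0';

    if (c == '>')
        ungetc(c, in);   /* leave the marker for the next call */

    return (int)s;
}
```

    This also takes care of the '\r' problem, since both '\r' and '\n' are dropped while reading, whatever the compiler's text mode does.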
