Thread: Need help splitting input file into multiple output files

  1. #1
    Registered User
    Join Date
    Feb 2008
    Posts
    77

    Need help splitting input file into multiple output files

    Hello to all,

    I am new to programming and need help with a program. I am trying to split one large file into multiple smaller files. Each smaller file has to start at a line that begins with "LOCUS" and end at a line that begins with "//". This pattern occurs several times in the original file.

    Also, I need the output files to be named a specific way. In each line that begins with "LOCUS", the second word is a unique ID, and I need that ID to be the file name. I know I can use strtok, but I am not sure how to set it up.

    Below is the code I have so far. It's not much. Any help would be appreciated.

    Code:
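    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main( int argc, char **argv )
        {
            FILE *input ;
            FILE *output ;
            char data[100000] ;

            if( argc < 2 || ! ( input = fopen( argv[1], "r" ) ) )
                {
                    printf( "COULD NOT OPEN FILE %s - Exit!\n", argc < 2 ? "(none given)" : argv[1]) ;
                    exit(1) ;
                }

         // TODO: put the accession id as file name
            output = fopen( "id" , "w" ) ;
            if( ! output )
                {
                    printf( "COULD NOT CREATE OUTPUT FILE - Exit!\n" ) ;
                    exit(1) ;
                }

         // copy lines until one starts with "//" (strings need strncmp, not != )
            while( fgets(data, 100000, input) && strncmp( data, "//", 2 ) != 0 )
                {
                   fputs( data, output) ;
                }

            fclose (input) ;
            fclose (output) ;

            return(0) ;

        }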

  2. #2
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by gkoenig (post #1)
    Since you have one input file and multiple output files, you'll need a loop over the output files. Each iteration scans the input lines for your LOCUS tag and, when it finds one, parses the output file name into a new char array variable. I like sscanf for that.

    After that trigger, keep reading lines and writing them to the output file as long as they don't begin with //. When a line does begin with //, close the file and start the next iteration of the loop.

    Well, many more details, but I hope this is enough for a start.
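
    The loop described above could be sketched roughly as follows. This is only an illustration, not the code the poster ended up with: the helper name split_records, the 4096-byte line limit, and the 255-character ID limit are all assumptions, and the "//" line is written to the output file because the original question asks for each file to end on that line. Lines longer than the buffer would be read in pieces, which this sketch does not handle.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 4096   /* assumed upper bound on line length */

/* Hypothetical helper: copies each LOCUS ... // record from `input`
   into its own file, named after the second word of the LOCUS line.
   Returns the number of complete records written, or -1 if an
   output file could not be created. */
int split_records(FILE *input)
{
    char line[LINE_MAX_LEN];
    char id[256];
    FILE *output = NULL;
    int count = 0;

    assert(input != NULL);

    while (fgets(line, sizeof line, input) != NULL) {
        if (output == NULL) {
            /* Not inside a record: look for a line that starts with
               "LOCUS" followed by blanks and then the ID. */
            if (sscanf(line, "LOCUS%*[ \t]%255s", id) == 1) {
                output = fopen(id, "w");
                if (output == NULL)
                    return -1;
                fputs(line, output);
            }
        } else {
            /* Inside a record: copy the line, then check for the end. */
            fputs(line, output);
            if (strncmp(line, "//", 2) == 0) {
                fclose(output);
                output = NULL;
                count++;
            }
        }
    }
    if (output != NULL)     /* input ended without a closing "//" */
        fclose(output);
    return count;
}
```

    A caller would open the input with fopen, pass the FILE pointer to split_records, and check the return value. Tracking "is a file open?" in a single pointer keeps this to one loop instead of two nested ones, which is one of several reasonable ways to arrange it.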

  3. #3
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Thanks for the advice. My next problem is with the title of my files. I managed to get it to put the unique id as the title. What I need is a file extension to go on the end. To illustrate, I need the filename to go from AJ002507 to AJ002507.gb.

    Here's my code; any help would be great.

    Code:
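    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main( int argc, char **argv )
        {
            FILE *input ;
            FILE *output = NULL ;   // no record file is open yet
            char data[1000000] ;

            char *firstWord, *GenBankID ;
            char headerLine[1000] ;

            if( argc < 2 || ! ( input = fopen( argv[1], "r" ) ) )
                {
                    printf( "COULD NOT OPEN FILE %s - Exit!\n", argc < 2 ? "(none given)" : argv[1]) ;
                    exit(1) ;
                }

            while( fgets(data, 1000000, input) )
                {
                if( strncmp( data, "LOCUS", 5 ) == 0 )
                    {
                    // Labeling file with unique ID (second word of the LOCUS line)
                    strncpy( headerLine, data, sizeof headerLine - 1 ) ;
                    headerLine[sizeof headerLine - 1] = '\0' ;
                    firstWord = strtok( headerLine, " \t\n" ) ;   // "LOCUS" itself, not used
                    GenBankID = strtok( NULL, " \t\n" ) ;
                    if( output ) fclose( output ) ;   // previous record had no "//" line
                    output = GenBankID ? fopen( GenBankID, "w" ) : NULL ;
                    if( output ) fputs( data, output ) ;
                    }
                else if( output && strncmp( data, "//", 2 ) == 0 )
                    {
                        // end of record: write the terminator and close this file
                        fputs(data, output) ;
                        fclose (output ) ;
                        output = NULL ;
                    }
                else if( output )
                    {
                        fputs(data, output) ;
                    }
                }

            if( output ) fclose( output ) ;
            fclose (input) ;

            return(0) ;

        }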

  4. #4
    abyss - deep C
    Join Date
    Oct 2007
    Posts
    46
    You can use strcat to concatenate two strings, for example:
    strcat(str1, ".gb");
    assuming str1 is a string with contents "AJ002507" and enough room for the extra characters.
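
    As a rough sketch of the suggestion above, the file name can be built in a separate buffer before calling fopen. The helper names make_filename and make_filename_strcat are made up for illustration. The first uses snprintf rather than strcat, which is a substitution on my part, because snprintf reports when the result would not fit. The second is the plain strcpy/strcat form and relies on the caller to size the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes id followed by ext into dst.
   Returns 0 on success, -1 if the result would not fit in dstsize bytes. */
int make_filename(char *dst, size_t dstsize, const char *id, const char *ext)
{
    int n;

    assert(dst != NULL && id != NULL && ext != NULL);
    n = snprintf(dst, dstsize, "%s%s", id, ext);
    return (n >= 0 && (size_t)n < dstsize) ? 0 : -1;
}

/* Same result with strcpy/strcat. The caller must guarantee that dst
   has room for both strings plus the terminating '\0'. */
void make_filename_strcat(char *dst, const char *id, const char *ext)
{
    strcpy(dst, id);
    strcat(dst, ext);
}
```

    With a buffer such as char name[64], calling make_filename(name, sizeof name, GenBankID, ".gb") and then passing name to fopen would produce AJ002507.gb for the ID in the question.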

    cheers
    maverix

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Code:
            char data[1000000] ;
    Don't use HUGE arrays on the stack - if you run out of stack-space, you will have an immediate crash with no way to determine what went wrong.

    In this particular case, it would make sense to allow some arbitrary large line length (e.g. 1000 or 5000), sufficient for most things that produce "lines". If you check whether the last char in the buffer is a newline, you can tell when a line was too long:
    Code:
       size_t len = strlen(data);
       if (len > 0 && data[len-1] != '\n') { /* line too long */ }
    I have never seen a text file with a line much longer than a few thousand bytes (actually, about 20K, but that's a REALLY strange case in itself) - not even close to a million chars.
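
    One way to wrap the newline check described above is sketched below. The helper name line_is_complete is hypothetical. It adds one assumption of its own: a final line that lacks a newline because the file ended is treated as complete, not as too long.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, to be called right after a successful
   fgets(buf, size, fp). Returns 1 if buf ends in '\n' or if end of
   file has been reached, and 0 if the buffer filled up before a
   newline was seen (the line is longer than the buffer). */
int line_is_complete(const char *buf, FILE *fp)
{
    size_t len;

    assert(buf != NULL && fp != NULL);
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return 1;
    return feof(fp) ? 1 : 0;
}
```

    With a modest buffer such as char data[5000], a caller could report an error or skip the rest of the line whenever this returns 0, instead of reserving a million bytes on the stack.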

    --
    Mats

  6. #6
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Thanks to all for the help. Program is fine. Sorry for the ugly code; newbie. Will modify code so that it is more efficient.
