Thread: Problem collecting large string of data

  1. #1
    Registered User
    Join Date
    Feb 2008
    Posts
    77

    Hello to all,

    I am having difficulty building a large string (approx. 5 million characters) from a large input file (approx. 11 MB). The program compiles fine, but it never stops running. Here is my code:

    Code:
    main( int argc, char **argv )
        {
            FILE *input ; 
            FILE *output ;
            char buffer[1000] ;        
            int i = 0 ;
            char c ;
             
            char *seqData ;
            seqData = (char *)malloc(10000000) ;
            
           
      /*      char t1, t2, t3, t4 ;
            int index ; 
            int tupleCount[4][4][4][4] ; */
                    
            if( ! ( input = fopen( argv[1], "r" ) ) )
                { 
                    printf( "COULD NOT OPEN FILE %s - Exit!\n", argv[1]) ; 
                    exit(1) ; 
                }        
    
            while(fgets(buffer, 1000, input))
                            {
                            // start obtaining bases after ORIGIN
                            if(strstr(buffer, "ORIGIN")) 
                                {                                                  
                                   int i = 0 ;
                                   while((c=getchar()) != '/')
                                        {
                                        if(c >= 'a' && c <= 'z')
                                            {
                                            seqData[i++] = c ;
                                            }
                                        }                            
                                }           
                            }
            
            /*for( index = 0 ; index < strlen(seqData) - 3 ; ++index )
                {
                    t1 = seqData[index] ;
                    t2 = seqData[index + 1] ;
                    t3 = seqData[index + 2] ;
                    t4 = seqData[index + 3] ;
                     count the different 4-mers 
                }*/
                          
            
            printf("Here is the sequence:\n" ) ;     
            return(0) ;
        }
    Any help would be great. Please keep in mind that I am a newbie to programming.

  2. #2
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Please keep in mind that starting new threads on exactly the same topic is unlikely to make you popular. http://cboard.cprogramming.com/showthread.php?t=99741

    Now please, let us continue our discussion in your other thread. What did you not understand about the replies?

    [edit] I'm sorry, I just noticed that you have changed your code to use malloc() now. You still should have used the same thread, but I'll reply to this one now . . . .

    Code:
                                   int i = 0 ;
                                   while((c=getchar()) != '/')
                                        {
                                        if(c >= 'a' && c <= 'z')
                                            {
                                            seqData[i++] = c ;
                                            }
                                        }
    I think perhaps you should declare i outside the loops. As it is, the elements of seqData will be overwritten with each new occurrence of ORIGIN. Perhaps this is what you wanted; in that case, you should do something with seqData before it is overwritten.

    I'd suggest using islower() from <ctype.h> instead of "if(c >= 'a' && c <= 'z')". It's easier to use and read, and more portable too.
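
    As a rough sketch of the islower() suggestion: the helper below copies only the lowercase letters of a string into a destination buffer. The name keep_lowercase and its parameters are made up for illustration and are not part of your program. Note the cast to unsigned char, since passing a negative char value to islower() is undefined behaviour.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the original program): copy only the
   lowercase letters of src into dst. cap is the size of dst, including
   room for the terminating '\0'. Returns the number of letters copied. */
size_t keep_lowercase(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return 0;
    for (; *src != '\0' && n + 1 < cap; ++src) {
        /* Cast to unsigned char: a negative char passed to islower()
           is undefined behaviour. */
        if (islower((unsigned char)*src))
            dst[n++] = *src;
    }
    dst[n] = '\0';
    return n;
}
```

    The capacity check also keeps the loop from writing past the end of the buffer, which the original seqData[i++] = c does not guard against.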

    Don't forget to free() the memory when you're done with it! Try adding
    Code:
    free(seqData);
    right before your return 0.

    Another thing: perhaps your while loop with getchar() should be parsing buffer instead?

    4-mers? twomers will be offended! Your program is twice as big as he is. [/edit]
    Last edited by dwks; 02-29-2008 at 08:52 PM.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  3. #3
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Actually, I think the getchar() problem is why your program hangs.
    Code:
    while((c=getchar()) != '/')
    You just keep reading and reading until you get a '/'. If you happened to encounter EOF (End Of File) before you got a '/', you'd get an infinite loop. A quick fix for this would be:
    Code:
    int c;
    while((c = getchar()) != '/' && c != EOF)
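    I've made c an int because getchar() returns an int, so that EOF can be told apart from every valid character. If you store the result in a char, the comparison with EOF may never succeed. Don't worry, it doesn't affect the rest of your code.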

    In addition -- d'oh! -- getchar() reads from standard input (normally the keyboard), not from your input file, so the program just sits there waiting for you to type a '/'. To read from the file, use getc(input) instead of getchar().
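
    Putting those two fixes together, a minimal sketch might look like the following. The function name read_bases and the capacity parameter are my own additions for illustration; the loop condition and the range test are the ones from the code above.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: read characters from `in` until '/' or end of file,
   storing the lowercase letters in dst. cap is the size of dst, including
   room for the terminating '\0'. Returns the number of letters stored. */
size_t read_bases(FILE *in, char *dst, size_t cap)
{
    size_t n = 0;
    int c;  /* int, not char, so EOF stays distinct from every character */

    if (cap == 0)
        return 0;
    /* getc(in) reads from the file; getchar() would read from stdin. */
    while ((c = getc(in)) != '/' && c != EOF) {
        if (c >= 'a' && c <= 'z' && n + 1 < cap)
            dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return n;
}
```

    With the variables from your program, you would call it as read_bases(input, seqData, 10000000) once the ORIGIN line has been found.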

    But I think the main problem is that you're reading from the file/screen/whatever instead of parsing buffer further. Perhaps you should store the position that "ORIGIN" was found and continue parsing the line with sscanf() or something after that? I'm not sure if you're expecting ORIGIN on a line of its own or what.
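
    If you would rather keep everything line-based, here is one possible sketch of parsing buffer instead. It assumes that ORIGIN sits on a line of its own, that the bases follow on the next lines, and that the data ends at a line starting with '/'; I am guessing at the file layout, so adjust as needed. The name collect_after_origin is hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: skip lines until one contains "ORIGIN", then collect
   the lowercase letters from the following lines into dst, stopping at a
   line that starts with '/' or at end of file. cap is the size of dst,
   including room for the terminating '\0'. Returns the number stored. */
size_t collect_after_origin(FILE *in, char *dst, size_t cap)
{
    char buffer[1000];
    size_t n = 0;
    int in_sequence = 0;  /* becomes 1 once the ORIGIN line has been seen */

    if (cap == 0)
        return 0;
    while (fgets(buffer, sizeof buffer, in)) {
        if (!in_sequence) {
            if (strstr(buffer, "ORIGIN"))
                in_sequence = 1;
            continue;  /* the ORIGIN line itself holds no bases */
        }
        if (buffer[0] == '/')
            break;  /* assumed end-of-sequence marker */
        for (const char *p = buffer; *p != '\0'; ++p) {
            if (islower((unsigned char)*p) && n + 1 < cap)
                dst[n++] = *p;
        }
    }
    dst[n] = '\0';
    return n;
}
```

    Because the index n lives outside the line loop, the letters accumulate across lines instead of being overwritten each time.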

  4. #4
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Thank you for the help, dwks. Everything is running fine. The issue with the ORIGIN tag is that it marks the point in the file where I must start collecting characters. The files the program acts on all have a standardized layout.

    Again, thanks for the help. I'll catch you later, when I expect I will need help with a 4-D array.
