Thread: comparing huge text

  1. #1
    Registered User
    Join Date
    Oct 2011
    Posts
    2

    comparing huge text

    Hi all,

    I am trying to write a program which would compare contents of lines within a file. The text file that I have is very big. A sample of this is below. The first field is time. What I want to do is:

    1. take in as input the time and the second column (full line). increment a counter. move to the next line and do the same.
    2. scan the second line: if the same data appears again within 5 seconds, don't increment the counter. If it appears after 5 seconds, increment it.
    3. Run till the end of the file and output the number of unique pairs for the entire file (i.e. unique pairs separated by more than 5 seconds).

    here is the sample input:
    1319042174.524413 :0:7:3:5:21:48:255:25:255:255:255:255:256:A:u
    1319042174.789191 :0:4:35:16:18:16:255:255:255:255:255:255:256:A:u

    I am clueless on how to handle this. Previously I was using awk to do this but it takes too much time to run. My knowledge of C is limited. Please give me a starting point to tackle this issue.

    Thanks in advance.

  2. #2
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Start with some basic tutorials, like the ones we have here: Cprogramming.com - Programming Tutorials: C++ Made Easy and C Made Easy. Google should turn up plenty more if you need additional help, and grab a good book (recommendations here: C Book Recommendations).

    Look into the fgets function for reading a line. You'll need to loop through each line in the file, so use the return value from fgets for controlling the loop. You can use the sscanf function for parsing that line. It sounds like you need to remember only two lines at a time, so you need two buffers, the "current" line and the "previous" line. Read into the current buffer with fgets, and each time through the loop, copy the current line into the previous buffer before reading the next one.

    A big help when programming is to write very small pieces of code at once, and test as you go. So for starters, just make a program that successfully reads each line of the file and echoes it out to the screen. Then, modify that to read each line and extract the time stamp, and echo that to the screen. Modify it to keep the previous and current line, and echo the time stamps for both, along with the difference. Then add in the checks for whether a line is unique and the counter.
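    To make the fgets/sscanf suggestion concrete, here is a minimal sketch of the "read, parse, echo the difference" stage. The names parse_line, echo_time_gaps, LINE_MAX_LEN and PAYLOAD_MAX are made up for illustration, and it assumes each line is a decimal timestamp, whitespace, then one token with no spaces in it, as in the sample input.

```c
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 256   /* assumed longest input line */
#define PAYLOAD_MAX  128   /* assumed longest second column */

/* Split one line into its timestamp and second column.
   Returns 1 on success, 0 if the line does not have that layout. */
int parse_line(const char *line, double *time_out, char payload_out[PAYLOAD_MAX])
{
    /* %lf reads the timestamp, %127s the next whitespace-free token. */
    return sscanf(line, "%lf %127s", time_out, payload_out) == 2;
}

/* Read every line from `in`, remember the previous and current line, and
   write the time gap between them to `out`. Returns the lines parsed. */
long echo_time_gaps(FILE *in, FILE *out)
{
    char line[LINE_MAX_LEN];
    char cur[PAYLOAD_MAX], prev[PAYLOAD_MAX] = "";
    double cur_time, prev_time = 0.0;
    long parsed = 0;

    /* fgets returns NULL at end of file, which ends the loop. */
    while (fgets(line, sizeof line, in) != NULL) {
        if (!parse_line(line, &cur_time, cur))
            continue;   /* skip lines that do not parse */
        if (parsed > 0)
            fprintf(out, "%.6f -> %.6f (gap %.6f s)%s\n",
                    prev_time, cur_time, cur_time - prev_time,
                    strcmp(prev, cur) == 0 ? " same data" : "");
        /* The current line becomes the previous line for the next pass. */
        prev_time = cur_time;
        strcpy(prev, cur);
        parsed++;
    }
    return parsed;
}
```

    From main you would fopen the file, call echo_time_gaps(fp, stdout), and fclose it. Once the gaps look right, the uniqueness check and the counter can go where the fprintf is.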

  3. #3
    Registered User
    Join Date
    Oct 2011
    Posts
    2
    Hi
    My search had also led me to something very similar to what you've mentioned. I think I should be able to manage it.

    One issue that I haven't solved is that I need to hold all the information for the past 5 seconds in a buffer to compare and make a decision based on the result. I don't know which way is best: string arrays or a linked list? Could you suggest which one is best?

    Thanks

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    You're definitely in the right forum! Seriously, when you see the speed, you'll feel like you've died and gone to heaven.

    Code:
    /*
    1. take in as input the time and the second column (full line). increment a counter. move to the next line and do the same.
    */
    
    /* a quick outline for your program, not a finished program */
    
    #include <stdio.h>
    #include <string.h>
    
    #define MAX 80
    
    struct record {
       unsigned long long time1;  //time stamp (whole seconds)
       char str[MAX];             //the rest of the line
    };
    
    int main(void) {
       int i, count=0;
       char buff[MAX];
       FILE *fp;  //file pointer
       struct record rec[2]; //2 record array of structs
    
       if((fp=fopen("filename", "r")) == NULL) {
          perror("Error: file did not open");
          return 1;
       }
       fgets(buff, MAX, fp); //get the first line of text
       //separate the content to time and non-time here and assign it to rec[0].time1 and rec[0].str (string)
       
       while(fgets(buff, MAX, fp) != NULL) { //get a full line of text from the file, until the end of the file
              //code to handle the input all goes here
              //separate the content of time and non-time here
              //and assign it to rec[1] (time and string)
              
              //output as needed
              
              //move struct in rec[1] to rec[0] preparing for next input
              //to rec[1]
              rec[0]=rec[1];
       
              rec[1].time1=0;  //zero out the last rec's time
              rec[1].str[0]='\0'; //zero out the last rec string
       }
       fclose(fp);
       
       //print a summary of the run if desired
       return 0;
    }
    /*
    2. scan the second line: if the same data appears again within 5 seconds, don't increment the counter. If it appears after 5 seconds, increment it.
    3. Run till the end of the file and output the number of unique pairs for the entire file (i.e. unique pairs separated by more than 5 seconds).
    */
    You might use another function for input (which you would simply call from the appropriate spots above), and perhaps also one for output and/or the summary. If it's very short, I'd just leave it here inside the loop. Otherwise, use another function.
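    For step 2, one possible helper for the two-record outline is sketched below. The name is_new_event is hypothetical, and the struct stores the time as a double instead of unsigned long long, which is an assumption made so the fractional part of the time stamps in the sample input is kept. It only compares a line against the one immediately before it.

```c
#include <string.h>

#define MAX 80

struct record {
    double time1;      /* time stamp in seconds (assumed double, see above) */
    char str[MAX];     /* the rest of the line */
};

/* Returns 1 if `cur` should increment the counter, 0 if it repeats the
   data in `prev` within 5 seconds. */
int is_new_event(const struct record *prev, const struct record *cur)
{
    if (strcmp(prev->str, cur->str) != 0)
        return 1;                              /* different data always counts */
    return cur->time1 - prev->time1 > 5.0;     /* same data: count only after 5 s */
}
```

    Inside the while loop this would be called as is_new_event(&rec[0], &rec[1]) just before rec[1] is moved to rec[0].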
    Last edited by Adak; 10-22-2011 at 03:20 PM.

  5. #5
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote Originally Posted by jamie_123 View Post
    One issue that I haven't solved is that I need to hold all the information for the past 5 seconds in a buffer to compare and make a decision based on the result. I don't know which way is best: string arrays or a linked list? Could you suggest which one is best?
    Linked lists are better suited for situations where the number of records can vary wildly and there is no upper limit. A situation where you may need 5, 500 or 500,000 different records. When you have a fixed size, or a known maximum, arrays usually work best, as they're easier to implement and quicker. One technique that would be well suited for your task is called a circular buffer. You could have a circular buffer of strings or char arrays. That would probably be fastest. Something slightly easier to implement would be to use arrays and "shift down" the strings as you read them, so array[0] is the earliest of the 5 strings you read, and array[4] is the most recent (or vice-versa).
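    A rough sketch of the circular buffer idea, applied to a 5-second window instead of a fixed number of lines. The names ring_observe, RING_CAP and PAYLOAD_MAX are invented for this example. It assumes the time stamps never go backwards, that a repeat inside the window does not restart the window, and that overwriting the oldest entry is acceptable if the buffer ever fills up.

```c
#include <string.h>

#define WINDOW_SECONDS 5.0
#define RING_CAP 1024      /* assumed maximum of distinct lines per window */
#define PAYLOAD_MAX 128    /* assumed longest second column */

struct entry {
    double time;
    char payload[PAYLOAD_MAX];
};

struct ring {
    struct entry items[RING_CAP];
    size_t head;   /* index of the oldest entry */
    size_t len;    /* number of entries in use */
};

/* Returns 1 if the line should be counted (its data was not seen in the
   last WINDOW_SECONDS), 0 if it is a repeat. */
int ring_observe(struct ring *r, double t, const char *payload)
{
    struct entry *slot;
    size_t i;

    /* Drop entries from the front that have fallen out of the window. */
    while (r->len > 0 && t - r->items[r->head].time > WINDOW_SECONDS) {
        r->head = (r->head + 1) % RING_CAP;
        r->len--;
    }

    /* Linear scan of whatever is still inside the window. */
    for (i = 0; i < r->len; i++) {
        const struct entry *e = &r->items[(r->head + i) % RING_CAP];
        if (strcmp(e->payload, payload) == 0)
            return 0;
    }

    /* Buffer full: give up the oldest entry to make room (an assumption). */
    if (r->len == RING_CAP) {
        r->head = (r->head + 1) % RING_CAP;
        r->len--;
    }

    /* Append the new entry at the tail. */
    slot = &r->items[(r->head + r->len) % RING_CAP];
    slot->time = t;
    strncpy(slot->payload, payload, PAYLOAD_MAX - 1);
    slot->payload[PAYLOAD_MAX - 1] = '\0';
    r->len++;
    return 1;
}
```

    Start from a zeroed struct ring (a static or global one is zeroed automatically) and add the return value to the counter for every parsed line. No strings are ever shifted; only head and len move.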

  6. #6
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,738
    Quote Originally Posted by Adak View Post
    You're definitely in the right forum! Seriously, when you see the speed, you'll feel like you've died and gone to heaven.
    Couldn't you be a little more careless? You gave him almost working code with only three syntax errors in it! I'd put in some logical ones to spice things up a little...
    Devoted my life to programming...

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by GReaper View Post
    Couldn't you be a little more careless? You gave him almost working code with only three syntax errors in it! I'd put in some logical ones to spice things up a little...
    I didn't plan on putting in any errors**. My first idea was to make it just pseudo code, with general logic, but then I thought "What good will that do him?"

    Next thing I know (see, I'm learning to feign ignorance, like our politicians do so well - although how much they need to feign, is unclear), the code is in the reply.

    Now I just need to find a scapegoat to blame it all on!

    **Joking aside, I'm solidly against including errors in code I post.

  8. #8
    Registered User
    Join Date
    Aug 2010
    Posts
    231
    You need a string dictionary for this case, like map<string,int> in C++/STL.
    See the simple example below (it works only with a 32-bit int).
    If you need more performance you can
    - replace malloc/free with a memory manager (search the web)
    - replace bsearch/qsort with a search tree (RB, AVL, ...)
    Code:
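    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define MAXSTRING 256
    
    /* each entry is one allocated block: a 4-byte int counter, then the string */
    typedef struct {
      char **a;  /* entries, kept sorted by string */
      int i;     /* number of entries in use */
    } StrDict;
    
    /* compare two entries by their string part, skipping the counter */
    int cmp(const void*a,const void *b) {
      return strcmp(*(char**)a+4,*(char**)b+4); }
    
    void add(StrDict *a,char *s) {
      char tmp[MAXSTRING+4],*x=tmp;  /* +4 so the counter prefix cannot overflow tmp */
      int **p=bsearch((strcpy(x+4,s),&x),a->a,a->i,sizeof*a->a,cmp);
      if( p )
        (**p)++;  /* known string: bump its counter */
      else        /* new string: append (counter starts at 0), then re-sort for bsearch */
        strcpy((a->a[++a->i-1]=calloc(1,5+strlen(s)))+4,s),qsort(a->a,a->i,sizeof*a->a,cmp);
    }
    
    int main()
    {
      StrDict sd={malloc(1000000*sizeof*sd.a)}; /* max 1 mio entries */
      long last = 0;
      char line[MAXSTRING],z[MAXSTRING];
      FILE *f = fopen("test.txt","r");
    
      assert( sizeof(int)==4 ); /* !!! */
      if( !f || !sd.a ) {
        perror("test.txt");
        return 1;
      }
    
      while( fgets(line,MAXSTRING,f) ) {
        /* add a line only if it comes more than 5 seconds after the previous line */
        if( atol(line)-last>5 && 1==sscanf(line,"%*s%s",z) )
          add(&sd,z);
        last=atol(line);
      }
      fclose(f);
    
      /* the stored counter is occurrences minus one, hence the +1 */
      while( sd.i-- )
        printf("%5d %s\n",*(int*)sd.a[sd.i]+1,sd.a[sd.i]+4),free(sd.a[sd.i]);
    
      free(sd.a);
      return 0;
    }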
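
    As an alternative to the bsearch/qsort pair, here is a sketch that swaps in a chained hash table rather than the search tree suggested above, so it is a different technique and not the code from this post. It remembers when each string was last counted, which follows the original question's 5-second rule per string. The names seen_table, seen_observe, seen_free and NBUCKETS are made up, and the table size is a guess.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* assumed table size, tune for your data */

struct node {
    struct node *next;
    double last_counted;   /* time at which this string was last counted */
    char key[];            /* the string itself (C99 flexible array member) */
};

struct seen_table {
    struct node *buckets[NBUCKETS];
    long counted;          /* running total of counted lines */
};

/* djb2-style string hash. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns 1 if the line was counted, 0 if it repeats a string counted
   within `window` seconds, -1 if memory ran out. */
int seen_observe(struct seen_table *t, double time, const char *key, double window)
{
    struct node **slot = &t->buckets[hash_str(key) % NBUCKETS];
    struct node *n;

    /* Walk the chain for this bucket looking for the same string. */
    for (n = *slot; n != NULL; n = n->next) {
        if (strcmp(n->key, key) == 0) {
            if (time - n->last_counted <= window)
                return 0;              /* repeat inside the window */
            n->last_counted = time;    /* old enough: count it again */
            t->counted++;
            return 1;
        }
    }

    /* First time this string is seen: add it to the front of the chain. */
    n = malloc(sizeof *n + strlen(key) + 1);
    if (n == NULL)
        return -1;
    strcpy(n->key, key);
    n->last_counted = time;
    n->next = *slot;
    *slot = n;
    t->counted++;
    return 1;
}

/* Release every node and leave the table empty. */
void seen_free(struct seen_table *t)
{
    size_t i;
    for (i = 0; i < NBUCKETS; i++) {
        struct node *n = t->buckets[i];
        while (n != NULL) {
            struct node *next = n->next;
            free(n);
            n = next;
        }
        t->buckets[i] = NULL;
    }
}
```

    Call seen_observe(&table, time, string, 5.0) for each parsed line, starting from a zeroed table, and read table.counted at the end. Memory grows with the number of distinct strings, not with the number of lines, and nothing is re-sorted on insert.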
