Thread: finding lines of text from one file in another

  1. #1
    Registered User
    Join Date
    Nov 2009
    Posts
    4

    finding lines of text from one file in another

    Hello there,

    I've got the following programming challenge:
    A C program that performs the following text-processing on ascii files: it compares two files and prints out lines of the second file that are not identical to any line in the first file. Assume that the files are large -- about 1 million lines each. Do not use library functions other than those in the standard C library.

    I have the following draft which does the job, however, for one million lines, as it turns out, it takes many hours!

    I was hoping to get some ideas on this list on how to speed up the code.

    Thanks.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    #define LINE_LENGTH 256
    
    int main(int argc, char **argv)
    {
      int j, n1, i1;
      char line[LINE_LENGTH];
      FILE * fp; 
      char * file_name;
      int rvalue;
    
      /* read into memory file 1 */
      fp = fopen("text1.txt","r");
    
      char ** file1 = NULL;
      int * lengths = NULL;
      j = 0;
      while ( fgets(line,LINE_LENGTH,fp) != NULL) {
        file1 = realloc(file1,(j+1) * sizeof(char *)); 
        file1[j] = calloc(sizeof(char), strlen(line)+1);
        strcpy(file1[j],line);
        lengths = (int *)realloc(lengths, (j+1)*sizeof(int)); 
        lengths[j] = strlen(line);
        j++;
      }
      n1 = j;
    
      fclose(fp);
    
      /* open file 2 and compare lines */
      fp = fopen("text2.txt","r");
    
      int has_match;
      int i = 0 ;
      while ( fgets(line,LINE_LENGTH,fp) != NULL) {
        has_match = 0;
        for ( i1 = 0 ; (i1 < n1) && (has_match == 0); i1++) {
          if ( strcmp(file1[i1], line) == 0 ) {
    	has_match = 1;
          }
        }
    
        if ( has_match == 0 ) {
          printf("%s",line);
        }
    
      }
    
      fclose(fp);
    
      return EXIT_SUCCESS;
    }

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    When you're looking for a line of text in the second file, are you:

    1) looking for it only in the same place as it is in the first file or

    2) looking for the line of text anywhere in the second file

    Second question:

    Can you sort the lines of text before we start searching?

    It's easy for a good sorter to handle a big load like this, and it makes searching SO much easier.

    I'd be surprised if it couldn't be finished in 10 minutes or less, on a modern PC.

    malloc and realloc stuff, has *GOT* to go. It's unnecessary and a TERRIBLE slow down.

    Use binary file mode - it's always faster since no newlines have to be expanded
    Last edited by Adak; 11-20-2009 at 11:36 PM.

  3. #3
    Registered User
    Join Date
    Jan 2009
    Posts
    21
    You could try to minimize your disk seeks by reading in large blocks of data and then adding them to your list. I made something like this but as I read in the first file I sorted into a data structure.

  4. #4
    Registered User
    Join Date
    Nov 2009
    Posts
    4
    Thanks for the pointers, I actually read the 2nd file into memory too in theh meanwhile and that improved things quite a bit.

    I think I need to stick to something like calloc since I don't know the length of the text file and the length of the lines. But thanks for pointing it out, I actually found a bug in the code I posted.

    I think to do some sort of sorting and chopping the text into equivalence classes could help further since then only part of the text needs to be searched. Any concrete ideas how to do that?

    I thought about sorting the lines alphabetically, are there any C codes on the web available for that?

    Thanks again.

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quicksort is a part of the standard C library under the name qsort.

    Mergesort is another high performance sorter that is especially good for sorting data that's contained in storage other than RAM (files, tapes, etc.).

    If you can put the strings of text to be sorted into an array, I'd use Quicksort. If it's still on a file/files, then I'd use mergesort.

    iMalc is a forum regular who's really into sorting. I'd seek his advice, amongst others.

    This is the easiest example I have of using Quicksort, (which is called qsort), in the standard C library, on strings:

    Code:
    #include <stdio.h>
    #include <stdlib.h>             //required for qsort()
    
    
    int sort_function( const void *a, const void *b);
    
    int main(void) {
      int i;
      char list[5][8] = { "Ate", "able", "Attest", "ate", "attest" };
    
       qsort((void *)list, 5, sizeof(list[0]), sort_function);
    
       for (i = 0; i < 5; i++)
          printf("%s\n", list[i]);
    
       printf("\n\n\t\t\t     press enter when ready");
       i = getchar();
       return 0;
    }
    
    int sort_function( const void *a, const void *b)
    {
       return( strcmp(a,b) );
    }
    Note that all uppercase letters, sort out lower than any lowercase letters.
    Last edited by Adak; 11-22-2009 at 05:05 AM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How would I go about counting the lines in a text file?
    By mickpc in forum Linux Programming
    Replies: 7
    Last Post: 10-07-2009, 07:21 AM
  2. Replies: 19
    Last Post: 05-30-2007, 05:47 PM
  3. Counting the number of lines in a text file - help
    By Erkan in forum C Programming
    Replies: 11
    Last Post: 11-12-2005, 05:12 PM
  4. Replies: 3
    Last Post: 03-04-2005, 02:46 PM
  5. Finding Values within a text file
    By saeed01 in forum C++ Programming
    Replies: 1
    Last Post: 03-11-2004, 05:13 PM