finding lines of text from one file in another

**Zimbobo** · 11-20-2009

Hello there,

I've got the following programming challenge:
A C program that performs the following text-processing on ascii files: it compares two files and prints out lines of the second file that are not identical to any line in the first file. Assume that the files are large -- about 1 million lines each. Do not use library functions other than those in the standard C library.

I have the following draft which does the job, however, for one million lines, as it turns out, it takes many hours!

I was hoping to get some ideas on this list on how to speed up the code.

Thanks.

Code:

#include <stdio.h>
#include <stdlib.h>

#define LINE_LENGTH 256

int main(int argc, char **argv)
{
  int j, n1, i1;
  char line[LINE_LENGTH];
  FILE * fp; 
  char * file_name;
  int rvalue;

  /* read into memory file 1 */
  fp = fopen("text1.txt","r");

  char ** file1 = NULL;
  int * lengths = NULL;
  j = 0;
  while ( fgets(line,LINE_LENGTH,fp) != NULL) {
    file1 = realloc(file1,(j+1) * sizeof(char *)); 
    file1[j] = calloc(sizeof(char), strlen(line)+1);
    strcpy(file1[j],line);
    lengths = (int *)realloc(lengths, (j+1)*sizeof(int)); 
    lengths[j] = strlen(line);
    j++;
  }
  n1 = j;

  fclose(fp);

  /* open file 2 and compare lines */
  fp = fopen("text2.txt","r");

  int has_match;
  int i = 0 ;
  while ( fgets(line,LINE_LENGTH,fp) != NULL) {
    has_match = 0;
    for ( i1 = 0 ; (i1 < n1) && (has_match == 0); i1++) {
      if ( strcmp(file1[i1], line) == 0 ) {
	has_match = 1;
      }
    }

    if ( has_match == 0 ) {
      printf("%s",line);
    }

  }

  fclose(fp);

  return EXIT_SUCCESS;
}

**Adak** · 11-20-2009

When you're looking for a line of text in the second file, are you:

1) looking for it only in the same place as it is in the first file or

2) looking for the line of text anywhere in the second file

Second question:

Can you sort the lines of text before we start searching?

It's easy for a good sorter to handle a big load like this, and it makes searching SO much easier.

I'd be surprised if it couldn't be finished in 10 minutes or less, on a modern PC.

malloc and realloc stuff, has *GOT* to go. It's unnecessary and a TERRIBLE slow down.

Use binary file mode - it's always faster since no newlines have to be expanded

**seaking1** · 11-21-2009

You could try to minimize your disk seeks by reading in large blocks of data and then adding them to your list. I made something like this but as I read in the first file I sorted into a data structure.

**Zimbobo** · 11-21-2009

Thanks for the pointers, I actually read the 2nd file into memory too in theh meanwhile and that improved things quite a bit.

I think I need to stick to something like calloc since I don't know the length of the text file and the length of the lines. But thanks for pointing it out, I actually found a bug in the code I posted.

I think to do some sort of sorting and chopping the text into equivalence classes could help further since then only part of the text needs to be searched. Any concrete ideas how to do that?

I thought about sorting the lines alphabetically, are there any C codes on the web available for that?

Thanks again.

**Adak** · 11-21-2009

Quicksort is a part of the standard C library under the name qsort.

Mergesort is another high performance sorter that is especially good for sorting data that's contained in storage other than RAM (files, tapes, etc.).

If you can put the strings of text to be sorted into an array, I'd use Quicksort. If it's still on a file/files, then I'd use mergesort.

iMalc is a forum regular who's really into sorting. I'd seek his advice, amongst others.

This is the easiest example I have of using Quicksort, (which is called qsort), in the standard C library, on strings:

Code:

#include <stdio.h>
#include <stdlib.h>             //required for qsort()


int sort_function( const void *a, const void *b);

int main(void) {
  int i;
  char list[5][8] = { "Ate", "able", "Attest", "ate", "attest" };

   qsort((void *)list, 5, sizeof(list[0]), sort_function);

   for (i = 0; i < 5; i++)
      printf("%s\n", list[i]);

   printf("\n\n\t\t\t     press enter when ready");
   i = getchar();
   return 0;
}

int sort_function( const void *a, const void *b)
{
   return( strcmp(a,b) );
}

Note that all uppercase letters, sort out lower than any lowercase letters.

Thread: finding lines of text from one file in another

Thread Tools

Search Thread

Display

finding lines of text from one file in another

Similar Threads

How would I go about counting the lines in a text file?

Simple text file editor - Having some problems - C program

Counting the number of lines in a text file - help

airport Log program using 3D linked List : problem reading from file

Finding Values within a text file