Thread: counting duplicates from one file to another not correct on larger files

  1. #1
    Registered User
    Join Date
    Oct 2020
    Posts
    2

    counting duplicates from one file to another not correct on larger files

    The program takes two txt files as arguments: the text file to be searched and the text file containing the string to search for. We aren't allowed to use any string library functions. For some reason it works just fine on smaller text files, for example:

    textFileToSearch.txt:
    this is fun
    is this fun
    fish
    patternToSearchFor.txt:
    is
    returns:
    Matches: 5
    which is correct.

    but if I use a much larger file like this:
    Video provides a powerful way to help you prove your point.
    When you click Online Video, you can paste in the embed code for the video you want to add.
    You can also type a keyword to search online for the video that best fits your document.
    To make your document look professionally produced,
    Word provides header, footer, cover page, and text box
    designs that complement each other.
    For example, you can add a matching cover page,
    header, and sidebar. Click Insert and then choose
    the elements you want from the different galleries.
    The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.The quick brown fox jumps over the lazy dog.


    Video provides a powerful way to help you prove your point.
    When you click Online Video, you can paste in the embed code for the video you want to add.
    You can also type a keyword to search online for the video that best fits your document.
    To make your document look professionally produced, Word provides header, footer, cover page, and text box designs that comp$
    more Great stuff
    and search for
    er
    , it returns:
    Matches: 62
    when I'm pretty sure there are only 18 "er"s in that file.

    I can't figure out why it doesn't work on bigger files. Any help would be much appreciated.

    (I know the Lines and Columns part isn't correct either, I'm more worried about the counting part right now)

    Code:
    #include<stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdlib.h>
    
    int patternFileLength(FILE *pattern){
      int len = 0;
      int chr;
    
      while((chr = fgetc(pattern)) != EOF){
        if(chr != '\0'){
          len++;
        }
      }
      return len;
    }
    
    int charcmp( char a, char b, int type ){
      if ( a == b ){
        type = 1;
        return 1;
      }
      else if(a == b || a - 32 == b || a + 32 == b){
        type = 2;
        return 1;
      }
      else if((a >= 48 && a <= 57) && (b >= 48 && b <= 57)){
        type = 3;
        return 1;
      }
      else{
        return 0;
      }
    }
    
    int main(int argc, char *argv[])
    {
      FILE *textFile;
      FILE *patternFile;
      char x,y;
      int type = 0, match = 0, count = 0;
      int line = 1, col = -3;
    
      textFile = fopen(argv[1], "r");
      patternFile = fopen(argv[2], "r");
    
      int length = patternFileLength(patternFile);
      rewind(patternFile);
    
      y = fgetc(patternFile);
    
      while((x = fgetc(textFile)) != EOF){
        if(charcmp(x, y, type) == 1){
          y = fgetc(patternFile);
          match++;
          if(match == length){
            count++;
            match = 0;
            printf("LINE: %d    COL: %d\n", line, col);
          }
        }
        else{
          rewind(patternFile);
          y = fgetc(patternFile);
        }
        if(x == '\n' || x == '\0'){
          line++;
          col = -3;
        }
        col++;
      }
    
      printf("\nMatches: %d\n", count);
    
      fclose(textFile);
      fclose(patternFile);
    
      return 0;
    }
    
    

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    In charcmp, you aren't comparing correctly. Your second if statement doesn't check whether the character really is uppercase or lowercase before you do the addition or subtraction, which means you risk false positives, e.g., in ASCII 'A' - 32 == '!', so you're saying that 'A' should be matched with '!'. Your third if statement makes any digit match any other digit, which doesn't make sense.

    Also in charcmp, it looks like you're trying to use type as an output parameter, but because it is an int rather than a pointer to int, your assignments to it inside the function never reach the caller's variable. Additionally, you should use named constants, not magic numbers.
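
    A minimal sketch of how charcmp might look with those two points addressed, assuming ASCII and assuming the assignment allows type to be passed as a pointer. The enum names and the is_upper/is_lower/is_digit helpers are made up for illustration, and the digit rule is kept in case the assignment really requires it:

```c
/* Hypothetical names for the match kinds; the assignment may define its own. */
enum match_type {
  MATCH_NONE = 0,
  MATCH_EXACT = 1,
  MATCH_CASE_INSENSITIVE = 2,
  MATCH_DIGIT = 3
};

/* Hand-written character tests, since library functions are off limits. */
static int is_upper(char c) { return c >= 'A' && c <= 'Z'; }
static int is_lower(char c) { return c >= 'a' && c <= 'z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Returns 1 if a and b match, 0 otherwise. The kind of match is written
   through the pointer so the caller can actually see it. */
int charcmp(char a, char b, int *type) {
  if (a == b) {
    *type = MATCH_EXACT;
    return 1;
  }
  /* Only shift case when the character really is a letter of that case.
     Assumes ASCII, where 'a' - 'A' == 32. */
  if ((is_upper(a) && a + ('a' - 'A') == b) ||
      (is_lower(a) && a - ('a' - 'A') == b)) {
    *type = MATCH_CASE_INSENSITIVE;
    return 1;
  }
  if (is_digit(a) && is_digit(b)) {
    *type = MATCH_DIGIT;
    return 1;
  }
  *type = MATCH_NONE;
  return 0;
}
```

    The caller would declare int type and pass &type. With this version 'A' and '!' no longer compare equal, and neither do '3' and 'S'.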

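    In main, I think your string searching algorithm is flawed: it'll probably be unable to detect matches coming right after a failed partial match. For example, with "er" as the search string and "deer" in the text, the second 'e' is used up by the failed comparison against 'r' and is never retried against the start of the pattern, so the "er" is missed. Also, you don't reset match when there is no match, so partial matches keep accumulating across the file until they add up to length, which inflates the count.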
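
    To illustrate, here is a rough sketch of a search that restarts cleanly at every position, assuming both files have already been loaded into memory. The function name count_matches is hypothetical, plain == stands in for charcmp, and counting overlapping occurrences is an assumption about what the assignment wants:

```c
#include <stddef.h>

/* Counts occurrences of pat[0..pat_len) in text[0..text_len).
   Every starting position is tried on its own, so a failed partial
   match can never swallow the beginning of a real one. */
size_t count_matches(const char *text, size_t text_len,
                     const char *pat, size_t pat_len) {
  size_t count = 0;
  if (pat_len == 0 || pat_len > text_len) {
    return 0;
  }
  for (size_t start = 0; start + pat_len <= text_len; start++) {
    size_t matched = 0;  /* reset for every starting position */
    while (matched < pat_len && text[start + matched] == pat[matched]) {
      matched++;
    }
    if (matched == pat_len) {
      count++;
    }
  }
  return count;
}
```

    With "deer" as the text and "er" as the pattern this finds the one occurrence, because the attempt starting at the second 'e' is made independently of the failed attempt starting at the first.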

    I suggest that you read both files into strings and operate on the strings instead of on the files as it doesn't look like what constitutes a "larger file" is so large as to make it prohibitive to store in memory. This might make it easier for you to do the matching.

  3. #3
    Registered User
    Join Date
    Oct 2020
    Posts
    2
    Ah, I see the problem with the second if statement now, thank you. That doesn't really solve the problem of getting so many extra matches with the current files I'm using, though, because none of them contain anything except letters, and they wouldn't be picking up any punctuation with what I'm searching for, I think? The third if statement is part of the assignment: we are supposed to make numbers match with other numbers. The type parameter in charcmp is also a requirement in the homework that just hasn't been put to use yet.

    I'm not sure how I'd put the files into strings and operate on the strings when we aren't allowed to use string library functions?

  4. #4
    Registered User
    Join Date
    Sep 2020
    Posts
    150
    To read a file into a string:
    1. determine the file size - write a function long file_size(const char *filename)
    2. allocate a buffer with malloc - using the length of the file
    3. read the file with fread into the buffer
    4. close the file
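
    The steps above could be sketched roughly like this. It only uses stdio and stdlib calls, not string functions. The read_file helper name is made up for illustration, and getting the size via fseek/ftell on a file opened in binary mode is one common approach, assumed here to be acceptable for the assignment:

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns the size of the file in bytes, or -1 on error. */
long file_size(const char *filename) {
  FILE *fp = fopen(filename, "rb");
  long size = -1;
  if (fp == NULL) {
    return -1;
  }
  if (fseek(fp, 0, SEEK_END) == 0) {
    size = ftell(fp);
  }
  fclose(fp);
  return size;
}

/* Hypothetical helper: reads the whole file into a malloc'd buffer and
   stores the number of bytes read in *len. Returns NULL on failure.
   The caller must free() the result. */
char *read_file(const char *filename, size_t *len) {
  long size = file_size(filename);
  FILE *fp;
  char *buf;

  if (size < 0) {
    return NULL;
  }
  fp = fopen(filename, "rb");
  if (fp == NULL) {
    return NULL;
  }
  buf = malloc((size_t)size + 1);  /* +1 leaves room for a terminating '\0' */
  if (buf == NULL) {
    fclose(fp);
    return NULL;
  }
  *len = fread(buf, 1, (size_t)size, fp);
  buf[*len] = '\0';  /* terminate so the buffer can be used as a string */
  fclose(fp);
  return buf;
}
```

    One thing to watch for: if the pattern file ends with a newline, that byte ends up in the buffer too, so you may want to trim it before searching.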

    If you can't use the string library functions you need to write your own.
    You basically need a strstr function.
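
    A hand-written version might look something like the sketch below. The name my_strstr is made up, it assumes both arguments are '\0'-terminated, and it compares with plain == where your assignment would presumably call charcmp:

```c
#include <stddef.h>

/* Hand-written stand-in for strstr: returns a pointer to the first
   occurrence of needle in haystack, or NULL if there is none.
   Both arguments must be '\0'-terminated. */
const char *my_strstr(const char *haystack, const char *needle) {
  if (*needle == '\0') {
    return haystack;  /* an empty needle matches at the start */
  }
  for (; *haystack != '\0'; haystack++) {
    const char *h = haystack;
    const char *n = needle;
    /* Walk both strings while they agree; stops at the end of either. */
    while (*n != '\0' && *h == *n) {
      h++;
      n++;
    }
    if (*n == '\0') {
      return haystack;  /* reached the end of needle: full match */
    }
  }
  return NULL;
}
```

    To count matches you could call it in a loop, continuing the search from one character past each hit until it returns NULL.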

    Hope this makes sense. English is not my native language so it might sound a bit clumsy.
