Thread: Searching Binary Files for a Pattern

  1. #1
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45

    Searching Binary Files for a Pattern

    I am having a lot of trouble figuring this out. I am trying to search a binary file for a hex pattern (you can find it in the program source below) and write a list of the files found by matching the pattern to another file. The header (the hex pattern) marks the beginning of each file found. 7 bytes from where the header was found, the next 4 bytes should give the file size. The output file should list each file found by file number, its offset within the input file, and its size.

    I've been at this for a while now and I'm starting to go blind and pull my hair out. My biggest difficulty is implementing fseek and fread, since after much research these appear to be the way to find the pattern I am looking for. I believe I understand how they work in concept and theory, but I am having trouble making them work. You'll notice I have left some sections commented out, simply because I am at my wits' end and commenting things out left and right.

    Any help is appreciated.

    EDIT: I know about gets() and such.. for now I am just trying to make it function correctly. I would then go back and change gets(). If it's a mess, please just locate the while loop and have a looksee. Thanks.

    EDIT 2: Oh, let me just mention what it's doing wrong. When I did have it working, it was producing an output file with only two files found when in fact there are three. Further, the offsets did not match what I located in my hex editor.

    Code:
    // The first three lines make the functions within
    //  the declared libraries available to the program.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    // program entry point
    int main(int argc, char *argv[])
    {
      // used to store the user input which represent
      //  the input and output file name and location
      char cFileInName[255], cFileOutName[255];
      // pointers to input and output files
      FILE *FileIn, *FileOut;
    
      char outStringData[256]; // Character variable to hold data
      // char outFilename[256];   // Another character variable to hold data*/
    
      long int position = 0;   // position counter
      int currentFile = 1;     // file counter
    
      // hex pattern to search for
      int pattern = 0xFF575043; 
      char *patternArray = (char *)&pattern; // Create an array called pattern to hold the bytes read in from the input file
    
      long int tempInteger = 0; // file size accumulator
      char *tempBuffer = (char *)&tempInteger;
    
      printf("Please enter the name and path of the binary file: ");                  
      gets(cFileInName); // Get the user input and store it in the character array
    
      if ( (FileIn = fopen(cFileInName, "rb") ) == NULL)
      { // error reading the file
        printf("The file could not be opened, please check your file and location before trying again.\n");
        return 0;
      }
    
      printf("Enter the name and path for the output file: ");
      gets(cFileOutName);
    
      // create a file and open it in append mode ready for output
      FileOut = fopen(cFileOutName, "a");
    
      // while the end of file has not been reached
      while (!feof(FileIn)) 
      {
        position++;   // increase the character counter
        // Check the next byte in the file to see if it matches the
        //  file header, fgetc takes the file pointer and returns
        //  the byte that was read from the file.
        if (fgetc(FileIn) == patternArray[0])
            {
    
          // When it finds the first byte in patternArray. Set the first
          //  byte in tempBuffer to the first byte in patternArray.
          tempBuffer[0] = patternArray[0];
    
          // read the next 3 bytes into the rest of tempBuffer
          fread(tempBuffer + 1, sizeof(char), 3, FileIn);
    
          // Check if the 4 bytes read from the file are the pattern.
                /*if(tempInteger != pattern)
                { printf("Hello Mofo\n");
            // Skip back three bytes so that it doesnt miss any headers
            //  due to a flase positive being read in
            fseek(FileIn, -3, SEEK_CUR);                                     
            continue;  // Go back to the top of the loop
                }*/
    
                //fseek(FileIn, 7, SEEK_CUR); // Seek ahead 7 bytes and then extract 4
                fread(tempBuffer, sizeof(char), 4, FileIn);
    
          // create the string so we can write to the output file, store it in outStringData
                sprintf(outStringData, "%-5i\tOffset: %-12ld\tSize: %ld\n", currentFile++, position - 1, tempInteger);
          // write the string to the output file
                fwrite(outStringData, sizeof(char), strlen(outStringData), FileOut);   
            } 
        }
    
      // close the files and at the same time flush the buffers
      fclose(FileIn);
      fclose(FileOut);
    
      printf("\n Done!\n Press enter to exit\n");
      getchar();
    
      return 0;
    }
    Last edited by CaptainMorgan; 06-15-2007 at 12:39 AM.

  2. #2
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    Looking for WordPerfect files?

    So basically, what is the real trouble that you are having? You seem to be doing a lot of ridiculously difficult reading operations that you really should not be doing. I think you're making the problem much harder than it really is.

    First of all, I think you should not be using fgetc() on a binary file, but someone may correct me on that. I think you should be using fread() on it all the way. Your while() condition is wrong. You should not be using feof() to check if you've reached EOF.

    And yeah, gets(), like you said....

  3. #3
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    fgetc is binary-safe; it should be fine to use...

    char *tempBuffer = (char *)&tempInteger;
    char *patternArray = (char *)&pattern; // Create an array called pattern to hold the bytes read in from the input file
    Any reason for the redundant casts?

    You can also go 'over the top' with comments. Your comments should explain WHY you've done something, not simply repeat what the code says.

    See the FAQ about feof(). Also, fgetc() returns an int, not a char (EOF won't fit in a char either):
    Code:
    char *patternArray = (char *)&pattern;
    fgetc(FileIn) == patternArray[0]
    Again, stop with the redundant casts.
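
    To illustrate the point about fgetc() returning an int, here is a minimal sketch. count_byte is a hypothetical helper, not code from this thread. Keeping the result in an int lets EOF stay distinct from a real 0xFF byte, and comparing against an unsigned char avoids a mismatch on platforms where plain char is signed:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts how many bytes in the stream equal `wanted`. */
long count_byte(FILE *fp, unsigned char wanted)
{
    long count = 0;
    int c;  /* int, not char: EOF must stay distinct from the byte 0xFF */

    assert(fp != NULL);
    while ((c = fgetc(fp)) != EOF) {
        if (c == wanted)    /* both sides are in the range 0..255 here */
            count++;
    }
    return count;
}
```

    The loop stops on the first EOF that fgetc() reports, so there is no need to call feof() in the loop condition.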
    Last edited by zacs7; 06-15-2007 at 12:53 AM.

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    > EDIT: I know about gets() and such.. for now I am just trying to make it function correctly.
    Perhaps consider
    Code:
    void doConversion ( const char *infile, const char *outfile ) {
      // your code here
    }
    int main ( ) {
      doConversion( "input.dat", "output.txt" );
      return 0;
    }
    And save yourself the problems of reading a filename for later, which is a trivial problem.

    When it works for one file, then expand to
    Code:
    int main ( int argc, char *argv[] ) {
      doConversion( argv[1], argv[2] );
      return 0;
    }
    Or maybe reading filenames from stdin using fgets, then calling the function.
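
    For the fgets route, a minimal sketch might look like the following. read_line is a hypothetical helper name; the only library calls are fgets() and strcspn():

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf (at most size - 1 chars)
   and strips the trailing newline. Returns 1 on success, 0 on EOF or error. */
int read_line(FILE *in, char *buf, size_t size)
{
    assert(in != NULL && buf != NULL && size > 0);
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline that fgets keeps */
    return 1;
}
```

    Something like read_line(stdin, cFileInName, sizeof cFileInName) could then stand in for gets(cFileInName), with the buffer size enforced.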
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45
    Quote Originally Posted by MacGyver View Post
    Looking for WordPerfect files?

    So basically, what is the real trouble that you are having? You seem to be doing a lot of ridiculously difficult reading operations that you really should not be doing. I think you're making the problem much harder than it really is.

    First of all, I think you should not be using fgetc() on a binary file, but someone may correct me on that. I think you should be using fread() on it all the way. Your while() condition is wrong. You should not be using feof() to check if you've reached EOF.

    And yeah, gets(), like you said....
    Ok, so I've changed the code quite a bit thanks to you folks. I guess my underlying problem is understanding this: if I can read in successive characters using fgetc, what then is the strategy within the body of the while loop to actually find and match the pattern, if I don't make those casts? If your answer is fread and fseek, I've been over them and I just can't wrap my head around them... I've seen many examples online, but I've yet to find one that is similar to searching for a matching hex pattern and then reading ahead 7 bytes, etc. If you say I am doing too much reading of the file, should I be able to call fread() only once and accomplish this? And yes, I do have a tendency to make things more difficult than I'd hoped.

    I took out feof() although in my reference manual, "C: A Reference Manual", it makes good note of using it when using fgetc and the other f's.

    Thanks Salem, I have now broken the code up, which I should've done earlier... I was attempting to alter someone else's code. Anyhoo, I'll post the code in a bit, but just wanted to get your reactions from this post. Thanks
    Last edited by CaptainMorgan; 06-15-2007 at 02:27 AM.

  6. #6
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    I should have clarified about feof(). You should use it, just not in that manner. For more information:

    http://faq.cprogramming.com/cgi-bin/...&id=1043284351

    I would just simply suggest

    1. Read in special hex signature via fread().
    2. fseek() to next spot (7 bytes away I think you said).
    3. Read in the filesize via fread().
    4. Anything else you need to do....


    Requires about 2 freads() and an fseek() I think. Am I oversimplifying it?
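
    The steps above can be sketched roughly as follows. This is only an illustration: read_header_at is a hypothetical name, the signature bytes FF 57 50 43 are taken from the pattern in the first post, and the skip distance is left as a parameter because the thread only says "7 bytes":

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Assumes the stream is positioned at a candidate header.
   `skip` is the number of bytes between the end of the signature and the
   size field. Returns 1 and fills size_bytes on a match, 0 otherwise. */
int read_header_at(FILE *fp, long skip, unsigned char size_bytes[4])
{
    static const unsigned char signature[4] = {0xFF, 0x57, 0x50, 0x43};
    unsigned char buf[4];

    assert(fp != NULL);
    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)     /* step 1: read signature */
        return 0;
    if (memcmp(buf, signature, sizeof signature) != 0)   /* compare the bytes */
        return 0;
    if (fseek(fp, skip, SEEK_CUR) != 0)                  /* step 2: skip ahead */
        return 0;
    return fread(size_bytes, 1, 4, fp) == 4;             /* step 3: read size field */
}
```

    Checking the return values of fread() and fseek() is what tells you about a short read near the end of the file, which is where a feof() loop tends to go wrong.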

  7. #7
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45
    Mac, from what I can tell, that link shows why not to use.. but doesn't give an example of successfully using it, in fact it's rewritten to take fgets() in the while condition.

    I was thinking:
    Code:
      while ((c = fgetc(FileIn)) != EOF)
    where c is of type int.
    No?

    Btw, thanks for the rundown above of your strategy.

  8. #8
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    feof() should be used after fgets() or an equivalent I/O operation fails. For example:

    Code:
    if(fgets(szBuffer,sizeof(szBuffer),fIn))
    {
    	/* Use buffer */
    }
    else	/* Find out why the read failed */
    {
    	if(feof(fIn))
    	{
    		/* Simple EOF occurred */
    	}
    	else if(ferror(fIn))
    	{
    		/* An actual error occurred. */
    	}
    }
    Personally, I would stay away from the fgetc() and such when dealing with a situation like this. You know exactly where the data is you want to read. Just read it directly.

    It's only about 3 lines of code or so (or more if you have trouble with endianness and such).
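
    On the endianness point, one way to sidestep it is to read the 4 size bytes into an unsigned char array and assemble the value by hand. The thread does not say which byte order the file uses, so both variants are shown; u32_from_le and u32_from_be are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interprets b[0..3] as a little-endian 32-bit value. */
uint32_t u32_from_le(const unsigned char b[4])
{
    assert(b != NULL);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Hypothetical helper: interprets b[0..3] as a big-endian 32-bit value. */
uint32_t u32_from_be(const unsigned char b[4])
{
    assert(b != NULL);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

    Either function gives the same result on any host, which casting the buffer to an int pointer does not. The result can be printed with "%lu" after a cast to unsigned long.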

  9. #9
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45
    Here's what I have so far. I am simply trying to find the matching hex and output it. It compiles, but it outputs nothing except "Done!....".

    Code:
    // The first three lines make the functions within
    //  the declared libraries available to the program.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    void performFileIO(const char *infile, const char *outfile)
    {
      FILE *FileIn, *FileOut;
    
      if ((FileIn = fopen(infile, "rb")) == NULL)
      { // error reading the file
        printf("The file could not be opened, please check your file and location before trying again.\n");
        exit(1);
      }
    
      // create a file and open it in append mode ready for output
      FileOut = fopen("text.txt", "a");
    
      int c; // char to read every character in file looking for EOF
      long int position = 0;   // file position counter
    
      // hex pattern to search for in the input file,
      //  marks the header of a file found within the input file
      int pattern = 0xFF575043;
    
      char *tempBuffer; // = (char *)&tempInteger;
    
      // read successive chars from file until EOF is reached
      while ((c = fgetc(FileIn)) != EOF)
      {
        position++;   // increase the character-read counter
        // Check the next byte in the file to see if it matches the
        //  file header, fgetc takes the file pointer and returns
        //  the byte that was read from the file.
    
        fread(tempBuffer, sizeof(pattern), 1, FileIn);
        if (tempBuffer == (char *)&pattern)
          printf("Hello %s\n", tempBuffer);
      }
    
      // close the files and at the same time flush the buffers
      fclose(FileIn);
      fclose(FileOut);
    }
    
    // program entry point
    int main(int argc, char *argv[])
    {
      performFileIO("file.bin", "text.txt");
    
      printf("\n Done!\n Press enter to exit\n");
      getchar();
    
      return 0;
    }

  10. #10
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45
    Quote Originally Posted by MacGyver View Post
    I should have clarified about feof(). You should use it, just not in that manner. For more information:

    http://faq.cprogramming.com/cgi-bin/...&id=1043284351

    I would just simply suggest
    1. Read in special hex signature via fread().
    2. fseek() to next spot (7 bytes away I think you said).
    3. Read in the filesize via fread().
    4. Anything else you need to do....

    Requires about 2 freads() and an fseek() I think. Am I oversimplifying it?
    Would the bold be placed in a while loop? or no? wait.. yes.. right? I'm still not understanding how to literally match the pattern to a value found using fread().. notice I tried:
    Code:
        if (tempBuffer == (char *)&pattern)
    
          printf("Hello %s\n", tempBuffer);
    I'm at a loss on this one...

  11. #11
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    Assuming I understand what you're trying to do, no that's not meant to be in a loop.

    The code you put there is most likely not what you want at all. It just compares memory addresses. If they are equal, then it attempts to print what is inside tempBuffer as a string, although I doubt it is a real string, which means you might get some undefined behavior going on in your program.

  12. #12
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45
    Quote Originally Posted by MacGyver View Post
    Assuming I understand what you're trying to do, no that's not meant to be in a loop.

    The code you put there is most likely not what you want at all. It just compares memory addresses. If they are equal, then it attempts to print what is inside tempBuffer as a string, although I doubt it is a real string, which means you might get some undefined behavior going on in your program.
    yea.. I did figure out that it's just comparing addresses... but I don't know how to compare them... if I do tempBuffer == pattern that's comparing a pointer to an int and that doesn't work
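
    For comparing the bytes rather than the addresses, memcmp() from <string.h> is the usual tool. A minimal sketch, assuming the file stores the signature as the bytes FF 57 50 43 in that order (is_signature is a hypothetical helper name):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with the signature bytes. */
int is_signature(const unsigned char *buf, size_t len)
{
    /* Spelled out byte by byte, so the host's byte order does not matter. */
    static const unsigned char signature[] = {0xFF, 0x57, 0x50, 0x43};

    assert(buf != NULL);
    return len >= sizeof signature &&
           memcmp(buf, signature, sizeof signature) == 0;
}
```

    Storing the pattern as a byte array also avoids the question of how an int like 0xFF575043 is laid out in memory.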

  13. #13
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    You might be going way too fast then. Consider slowing down a bit and working a bit slower, focusing more on applying concepts than racing towards the end result of a project.

  14. #14
    Drunken Progammer CaptainMorgan's Avatar
    Join Date
    Feb 2006
    Location
    On The Rocks
    Posts
    45
    Here's what I've come up with thus far. I know that there are three files to be found in the input binary file. This little program finds the three files and writes them to the output file with their corresponding offsets. The pattern (4-byte hex) is what identifies the location of these files. The first offset (of the first file) appears to be correct, yet the following two are off by increasing amounts. Depending on what I try, the second offset will be off by 10, 11, or 22, whereas the third file's offset will be off by 22, 32, or 100+. Something is wrong with my calculation and accumulation, but I can't spot it.

    Lastly, I know that tempBuffer isn't correct within sprintf... I am trying to find the file size, which is located 7 bytes from the file header (as a result of the incorrect calculations I don't think I could find it, except maybe for the first file). I'm trying to find a function for easy conversion from the pointer so that it can be output with sprintf.. any recommendations?

    Code:
    // The first three lines make the functions within
    //  the declared libraries available to the program.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    void performFileIO(const char *infile)
    {
      FILE *FileIn, *FileOut;
      char inBuffer[BUFSIZ];
      char tempBuffer[4];
      char outStringData[256];
      char *filesize, *endptr;                                                     
      int currentFile = 1, position = 0; //, tempInteger = 0;
    
      // properly open the files for reading and appending
      FileIn = fopen(infile, "rb");
    
      if (FileIn == NULL) {
        printf("The file could not be opened, please check your file and location before trying again.\n");
        exit(1);
      }
    
      printf("Enter the name and path for the output file: ");
      gets(inBuffer);
      FileOut = fopen(inBuffer, "a");
    
      // marks the header of a file found within the input file
      unsigned char pattern[] = {0xFF, 0x57, 0x50, 0x43};
    
      while (!feof(FileIn)) {
        position++;
    
        if (fgetc(FileIn) == pattern[0]) {
          tempBuffer[0] = pattern[0];
          fread(tempBuffer + 1, sizeof(pattern) - 1, 1, FileIn);
          //fread(tempBuffer, sizeof(pattern), 1, FileIn);
    
          if (memcmp(pattern, tempBuffer, sizeof(pattern)) == 0) {
            printf("Match Found.\n");
     
            fseek(FileIn, 7, SEEK_CUR);
            fread(tempBuffer, sizeof(pattern), 1, FileIn);
    
            sprintf(outStringData, "%-5i\tOffset: %-12i\tSize: %d\n", currentFile++, position - 1, tempBuffer);
            fwrite(outStringData, sizeof(char), strlen(outStringData), FileOut);
            continue;
          }
    
          fseek(FileIn, -3, SEEK_CUR);
        }
    
      }
      
      // close the files and at the same time flush the buffers
      fclose(FileIn);
      fclose(FileOut);
    }
    
    // program entry point
    int main(int argc, char *argv[])
    {
      char buffer[BUFSIZ];
    
      printf("Please enter the name and path of the binary file: ");
      gets(buffer);
    
      performFileIO(buffer);
    
      printf("\n Done!\n Press enter to exit\n");
      getchar();
    
      return 0;
    }
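
    By my reading of the code above, position goes up by one per loop pass, while a successful match moves the stream a further 3 + 7 + 4 bytes that are never added to position, which would explain why the later offsets drift. One way around that, sketched below with a hypothetical find_signatures helper, is to slide a 4-byte window over the stream and count every byte consumed, so no seeking back is needed:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: stores the offset of each signature found (up to
   max_offsets of them) and returns the total number of matches. */
long find_signatures(FILE *fp, long offsets[], long max_offsets)
{
    static const unsigned char signature[4] = {0xFF, 0x57, 0x50, 0x43};
    unsigned char window[4] = {0};
    long consumed = 0, found = 0;
    int c;

    assert(fp != NULL && offsets != NULL);
    while ((c = fgetc(fp)) != EOF) {
        memmove(window, window + 1, 3);     /* slide the window by one byte */
        window[3] = (unsigned char)c;
        consumed++;                         /* counts every byte read */
        if (consumed >= 4 && memcmp(window, signature, 4) == 0) {
            if (found < max_offsets)
                offsets[found] = consumed - 4;   /* offset of the first byte */
            found++;
        }
    }
    return found;
}
```

    With the offsets in hand, the size field can be read in a second pass using fseek() with SEEK_SET. Calling ftell() at the moment of a match is another way to get a position that always agrees with the stream.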

  15. #15
    Registered User
    Join Date
    Jun 2007
    Location
    Michigan
    Posts
    12
    Attached is an example that works with a memory buffer or file. A couple of things:

    1. In one of your previous attempts you have an int assigned to the search string. The problem with this is going to be byte order on the platform you are on. When dealing with files, always use bytes specifically.

    2. If you want this to be fast, don't use the stdio file functions (fopen, fgetc, fgets, fclose, etc.). The data is buffered by the kernel and again by the C library, then you read it into your own buffer (one byte at a time) and compare that buffer. You are triple processing the data...

    3. Use a search algorithm that is proven, like Boyer-Moore. I did a quick search and used the C code given on the Wikipedia page Boyer-Moore Algorithm

    4. Use mmap for your files. It lets you treat the file as an array of bytes (or whatever you cast the mmap() result to). The kernel buffers the disk pages and you do not copy the data all over the place. This will also be very fast, and on a 32-bit platform it lets you search files up to about 2G. On a 64-bit box, you can open files up to about 18EB (exabytes).
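
    The attachment is not reproduced here, so the following is not the poster's code. It is a minimal sketch of the simpler Boyer-Moore-Horspool variant over a plain byte buffer, which is what an mmap()'d file looks like to the program. horspool_find is a hypothetical name:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first occurrence of pat in
   buf, or (size_t)-1 if there is none. Boyer-Moore-Horspool variant. */
size_t horspool_find(const unsigned char *buf, size_t buf_len,
                     const unsigned char *pat, size_t pat_len)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos = 0;

    assert(buf != NULL && pat != NULL);
    if (pat_len == 0 || pat_len > buf_len)
        return (size_t)-1;

    /* Bytes not in the pattern allow a full-length jump. */
    for (i = 0; i <= UCHAR_MAX; i++)
        shift[i] = pat_len;
    /* Bytes in the pattern (except its last byte) allow a shorter jump. */
    for (i = 0; i + 1 < pat_len; i++)
        shift[pat[i]] = pat_len - 1 - i;

    while (pos <= buf_len - pat_len) {
        if (memcmp(buf + pos, pat, pat_len) == 0)
            return pos;
        pos += shift[buf[pos + pat_len - 1]];   /* jump by the window's last byte */
    }
    return (size_t)-1;
}
```

    To find every match, call it again on the rest of the buffer (buf + match + 1) and add the starting index back to the result.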

    I had the mmap code from work I am currently doing; the Boyer-Moore code is directly from the Wikipedia page. It was written very well, with large files and memory buffers in mind. It fit right in with my mmap code without modification.

    This code compiles for me and runs. I copied the source code to search.bin just for my example. Your OS might not support the madvise() call, so modify as needed. I'm running on Solaris 10 with a SPARC-IV+ CPU (64-bit). Sorry, I use the C++ style // comments, so you might have to change those to /* */ if your compiler complains.

    M@
