algorithm for duplicate file checking help


  1. #1
    Registered User geekoftheweek's Avatar
    Join Date
    Mar 2003
    Location
    maine
    Posts
    8

    algorithm for duplicate file checking help

    hello.

    I'm trying to figure out how most duplicate file finder programs do the actual comparison of files (the algorithm). I tried to start REALLY simple and work out an algorithm on my own. I would like any criticism and suggestions, as I'm new to this type of algorithm.

    My example has 10 files:
    Code:
    file a = 10 bytes
    file b = 100 bytes
    file c = 0 bytes
    file d = 2458 bytes
    file e = 10 bytes
    file f = 100 bytes
    file g = 1024 bytes
    file h = 34 bytes
    file i = 100 bytes
    file j = 1024 bytes
    The first thing I can think of when comparing these files for duplicates is to compare their sizes first (probably with stat()), since two files MUST have the same size in order to be duplicates. So I assume my algorithm would record the size of file a and check it against every other file to find files of the same size (file e), then tie them together somehow so that a later function can compare their contents. Next, I would move on to file b, check its size against all the other files, and so on until every file has been checked. When I'm done, I'll most likely have some linked lists of same-sized files whose contents then get compared against each other. Am I going in the right direction with this? Is this how most file compare programs do their comparisons?

    Thank you.
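
    A minimal sketch of the size-first pass described above, under a few assumptions: the sizes have already been collected (for example from the st_size field filled in by stat()), and the struct file_entry type and group_by_size() function are hypothetical names made up for this illustration. Instead of checking each file against every other file, it sorts the entries by size so that same-sized files end up next to each other, which is one possible way to form the groups, not necessarily what existing duplicate finders do.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record: one entry per file. In a real program the size
   would come from stat(); here it is supplied by the caller. */
struct file_entry {
    const char *name;
    long long size;
};

/* qsort comparator: orders entries by ascending size. */
static int cmp_size(const void *a, const void *b)
{
    const struct file_entry *fa = a, *fb = b;
    return (fa->size > fb->size) - (fa->size < fb->size);
}

/* Sorts the entries by size, then prints every run of two or more
   entries that share a size. Returns the number of such groups.
   Files with a unique size cannot be duplicates and are skipped. */
int group_by_size(struct file_entry *files, size_t n)
{
    int groups = 0;
    size_t i = 0;

    qsort(files, n, sizeof files[0], cmp_size);

    while (i < n) {
        size_t j = i + 1;

        /* Advance j to the end of the run that has the same size. */
        while (j < n && files[j].size == files[i].size)
            j++;

        if (j - i > 1) {
            groups++;
            printf("%lld bytes:", files[i].size);
            for (size_t k = i; k < j; k++)
                printf(" %s", files[k].name);
            printf("\n");
        }
        i = j;
    }
    return groups;
}
```

    Called on the ten example files, this would report three candidate groups (a and e, then b, f and i, then g and j). Only the files inside a group would still need their contents compared.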

  2. #2
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    Compute a hash of each file (MD5, for example) and store it in a look-up table. Two files with the same hash are very probably identical, but this is not guaranteed, so you can then compare files with the same hash byte for byte if you want to be certain.
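
    A rough sketch of the hash-then-verify idea from this reply. To keep it standard C with no external library, it swaps MD5 for a 64-bit FNV-1a hash; that is a substitution made for this example, not what the reply recommends, and FNV-1a is much weaker against collisions than MD5. The function names hash_file() and files_identical() are hypothetical, and the look-up table itself is left out.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Computes a 64-bit FNV-1a hash of a file's contents (a stand-in for
   MD5 in this sketch). Returns 0 on success, -1 on a read/open error. */
int hash_file(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    size_t n;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */
    FILE *fp = fopen(path, "rb");
    int err;

    if (!fp)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;        /* FNV-1a prime */
        }
    }
    err = ferror(fp);
    fclose(fp);
    if (err)
        return -1;
    *out = h;
    return 0;
}

/* Byte-for-byte comparison, for files whose hashes already match.
   Returns 1 if identical, 0 if different, -1 on a read/open error. */
int files_identical(const char *a, const char *b)
{
    unsigned char ba[4096], bb[4096];
    size_t na, nb;
    int same = 1;
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");

    if (!fa || !fb) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -1;
    }
    do {
        na = fread(ba, 1, sizeof ba, fa);
        nb = fread(bb, 1, sizeof bb, fb);
        /* Different chunk lengths or different bytes: not duplicates. */
        if (na != nb || memcmp(ba, bb, na) != 0) {
            same = 0;
            break;
        }
    } while (na > 0);

    if (ferror(fa) || ferror(fb))
        same = -1;
    fclose(fa);
    fclose(fb);
    return same;
}
```

    One way to use these: call hash_file() on each file, and only when two hashes are equal call files_identical() on that pair to rule out a collision.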
