algorithm for duplicate file checking help

**geekoftheweek** · 04-04-2009

hello.

I'm trying to figure out how most duplicate file finder programs do the actual comparison of files (the algorithm). I tried to start REALLY simple and work out an algorithm on my own. I'd welcome any criticism and suggestions, as I'm new to this type of algorithm.

My example has 10 files:

Code:

file a = 10 bytes
file b = 100 bytes
file c = 0 bytes
file d = 2458 bytes
file e = 10 bytes
file f = 100 bytes
file g = 1024 bytes
file h = 34 bytes
file i = 100 bytes
file j = 1024 bytes

The first thing that I can think of when comparing these files for duplicates is to compare their sizes first (probably with stat()), since they MUST have the same size in order to be duplicates. So next, I would assume that my algorithm would have to record the size of file a and check it against every other file to find files of the same size (file e) and tie them together somehow, to later be checked by some function that compares their contents. Next, I would proceed to file b, check its size against all other files, and do the same as I did for file a, until all files are checked. So when I'm done I'll most likely have some linked lists of same-sized files whose contents would then be checked against each other. Am I going in the right direction with this? Is this how most file compare programs do their comparisons?
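For illustration, the size-first idea above can be sketched as follows. This is only a rough sketch, written in Python for brevity, where `os.stat` plays the role of `stat()`. The function name `group_by_size` is made up for the example, and a dictionary keyed by size stands in for the linked lists, so each file is examined once instead of being checked against every other file.

```python
import os
from collections import defaultdict


def group_by_size(paths):
    """Group file paths by size; keep only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path in paths:
        # st_size is the file length in bytes, as reported by stat().
        by_size[os.stat(path).st_size].append(path)
    # A file whose size is unique cannot have a duplicate, so drop it.
    return {size: group for size, group in by_size.items() if len(group) > 1}
```

With the ten example files, this would leave three candidate groups (a and e; b, f and i; g and j), and only those would need their contents compared.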

Thank you.

**Perspective** · 04-04-2009

Compute a hash of each file (like md5) and store it in a look-up table. Files with the same hash are very likely the same file, but that is not guaranteed, since different files can collide. You can then compare files with the same hash byte for byte if you want to be certain.
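A minimal sketch of that suggestion, assuming Python and its standard `hashlib` and `filecmp` modules. The names `file_md5` and `find_duplicates` and the 64 KiB chunk size are illustrative choices, not anything from a particular duplicate finder.

```python
import filecmp
import hashlib
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)  # the look-up table: hash -> paths
    for path in paths:
        by_hash[file_md5(path)].append(path)

    duplicates = []
    for group in by_hash.values():
        # Same hash only means "probably equal", so confirm byte for byte.
        remaining = group
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [
                p for p in rest if filecmp.cmp(first, p, shallow=False)
            ]
            if len(same) > 1:
                duplicates.append(same)
            remaining = [p for p in rest if p not in same]
    return duplicates
```

Hashing reads every file in full, so in practice it may be worth hashing only files that already share a size.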
