Hello.
I'm trying to figure out what algorithm most duplicate file finder programs use to do the actual comparison of files. I tried to start REALLY simple and work out an algorithm on my own. I would like any criticism and suggestions, as I'm new to this type of algorithm.
My example has 10 files:
Code:
file a = 10 bytes
file b = 100 bytes
file c = 0 bytes
file d = 2458 bytes
file e = 10 bytes
file f = 100 bytes
file g = 1024 bytes
file h = 34 bytes
file i = 100 bytes
file j = 1024 bytes
The first thing I can think of when comparing these files for duplicates is to compare their sizes (probably with stat()), since two files MUST have the same size in order to be duplicates. So I assume my algorithm would record the size of file a and check it against every other file to find files of the same size (file e), then tie them together somehow so that some function can compare their contents later. Next, I would proceed to file b, check its size against all the other files, and do the same as I did for file a, and so on until all files are checked. When I'm done, I'll most likely have some linked lists of same-sized files whose contents then need to be compared against each other. Am I going in the right direction with this? Is this how most file compare programs do their comparisons?
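To make the idea concrete, here is a rough sketch of what I have in mind, written in Python. The function names group_by_size and find_duplicates are just names I made up for illustration. It differs from my description in one way: instead of checking each file's size against every other file, it drops each file into a dictionary keyed by size in a single pass. The content check uses the standard library's filecmp.cmp with shallow=False for a byte-for-byte comparison. I'm not claiming this is how real duplicate finders work, only that it is one way to express the steps above.

```python
import filecmp
import os
from collections import defaultdict


def group_by_size(paths):
    """Map each file size to the list of paths that have that size."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.stat(path).st_size].append(path)
    # Only sizes shared by two or more files can contain duplicates.
    return {size: group for size, group in by_size.items() if len(group) > 1}


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    duplicates = []
    for group in group_by_size(paths).values():
        clusters = []  # each cluster holds files with identical contents
        for path in group:
            for cluster in clusters:
                # shallow=False forces a real content comparison
                # instead of trusting the os.stat() signature alone.
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                # No existing cluster matched, so start a new one.
                clusters.append([path])
        duplicates.extend(c for c in clusters if len(c) > 1)
    return duplicates
```

With my 10 example files, group_by_size would leave three candidate groups (a and e at 10 bytes; b, f and i at 100 bytes; g and j at 1024 bytes), and only those 7 files would ever have their contents read. Files c, d and h have unique sizes and are ruled out by size alone.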
Thank you.