Matching strings by a percentage?
Hi there. I haven't been coding for a while, but I think I should get back into it, as I have been away from it for too long.
I have a set of files in two different directories, say DirA and DirB. The problem is that there may be duplicate files across the directories, and the name of a file may not be spelt exactly the same way in both, e.g. it may have a couple of additional spaces, or an apostrophe instead of an underscore.
I really can't remember the term for comparing strings for possible matches (I think one existed at some point, possibly not in Cpp though). Does anyone know what I'm talking about? Reading up on it should help me come up with an algorithm for deleting the duplicate files.
By the way, I did a dir dump of all the files in each directory and removed the extensions, so all I have to do now is figure out how to match the duplicates (file name strings) between the directories, output those names into a text file, and then do a batch delete.
Levenshtein distance algorithm
The Levenshtein distance algorithm is probably what you are looking for. It measures the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one word into another. E.g. cat to caps would have a Levenshtein distance of 2 (change t to p, then add s). The "percentage" change could then be 50% (2 / len(caps)), i.e. the distance divided by the length of the longer string. See http://en.wikipedia.org/wiki/Levenshtein_distance
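In case it helps, here is a rough C++ sketch of the idea, not a finished tool. The function names (levenshtein, percentDifference, findLikelyDuplicates) and the threshold parameter are made up for illustration, and it assumes the two dir dumps have already been read into vectors of plain single-byte strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimum number of single-character insertions, deletions or
// substitutions needed to turn a into b (classic dynamic programming,
// keeping only two rows of the table).
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    // Turning an empty prefix of a into the first j chars of b takes j insertions.
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // i deletions turn the first i chars of a into ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Distance as a percentage of the longer string's length
// (an assumed normalisation; other choices are possible).
double percentDifference(const std::string& a, const std::string& b) {
    std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 0.0;  // two empty names are identical
    return 100.0 * levenshtein(a, b) / longest;
}

// Hypothetical helper: pair up every name in dirA with every name in dirB
// whose difference is at or below maxPercent.
std::vector<std::pair<std::string, std::string>> findLikelyDuplicates(
        const std::vector<std::string>& dirA,
        const std::vector<std::string>& dirB,
        double maxPercent) {
    std::vector<std::pair<std::string, std::string>> matches;
    for (const std::string& nameA : dirA)
        for (const std::string& nameB : dirB)
            if (percentDifference(nameA, nameB) <= maxPercent)
                matches.emplace_back(nameA, nameB);
    return matches;
}
```

Something like findLikelyDuplicates(namesA, namesB, 20.0) would then give you the candidate pairs to write to your text file. The 20% cut-off is only a guess, so it is probably worth eyeballing the list before running the batch delete. Note that this compares every name against every other name, which is fine for a few thousand files but gets slow for very large directories.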