Building a tree from a directory structure


  1. #1
    Registered User geekoftheweek's Avatar
    Join Date
    Mar 2003
    Location
    maine
    Posts
    8

    Building a tree from a directory structure

    Hello.

    I'm designing (or trying) a program that will be command line on linux/unix that
    checks for duplicate files. I'm coding a function that will take argv[1] - argv[n] and build
    a linked list tree out of those directory arguments:

    e.g.

    $ checkdups /bin /home

    this will check for duplicate files in the /bin and /home directories descending into their
    subdirectories also. My question is with my implementation. I have a:
    Code:
    struct dir {
        char name[256];
        struct dir *subdirs;
        struct file *files;
        struct dir *next;
    };

    struct file {
        char name[256];
        off_t size;
        struct file *next;
    };
    This gives the current directory's name, followed by a pointer to a linked list of all its
    sub-directories, a pointer to another linked list of all the current
    directory's files, and a *next pointer in case this directory happens to be on the same level
    as other directories. Using the two structs above I would be able to describe a
    directory structure using linked lists, and traverse it to operate on all files
    below a particular set of directories. Am I going about this in the right way?
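    For illustration, here is a minimal sketch (not code from the thread; the helper name build_dir is made up) of how these structs could be filled in on Linux/Unix using the POSIX opendir/readdir/lstat calls, pushing each entry onto the front of the appropriate list:

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <dirent.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    struct file {
        char name[256];
        off_t size;
        struct file *next;
    };

    struct dir {
        char name[256];
        struct dir *subdirs;
        struct file *files;
        struct dir *next;
    };

    /* Recursively scan `path` and return a freshly allocated struct dir,
     * or NULL if the directory cannot be opened. Error handling is kept
     * minimal for brevity. */
    struct dir *build_dir(const char *path)
    {
        DIR *dp = opendir(path);
        if (dp == NULL)
            return NULL;

        struct dir *d = calloc(1, sizeof *d);
        snprintf(d->name, sizeof d->name, "%s", path);

        struct dirent *ent;
        while ((ent = readdir(dp)) != NULL) {
            /* skip "." and ".." to avoid infinite recursion */
            if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
                continue;

            char full[1024];
            snprintf(full, sizeof full, "%s/%s", path, ent->d_name);

            struct stat st;
            if (lstat(full, &st) != 0)
                continue;

            if (S_ISDIR(st.st_mode)) {
                struct dir *sub = build_dir(full);
                if (sub) {                    /* push onto the subdirs list */
                    sub->next = d->subdirs;
                    d->subdirs = sub;
                }
            } else if (S_ISREG(st.st_mode)) {
                struct file *f = calloc(1, sizeof *f);
                snprintf(f->name, sizeof f->name, "%s", ent->d_name);
                f->size = st.st_size;
                f->next = d->files;           /* push onto the files list */
                d->files = f;
            }
        }
        closedir(dp);
        return d;
    }
    ```

    A checkdups-style main would then just call build_dir() once for each of argv[1]..argv[n] and chain the results together with the *next pointer.
    
    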


    thanks

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    If your implementation gives you a jumping-off point for an efficient algorithm to test for those duplicates, then I'd have to say "Yes - definitely". Otherwise, no.

    Several years back, I wrote a program to do this in DOS, and simply used a dir command with options that redirected its output to a file; my program then used that file to do its work.

    My concern with your lists lies in the case where you're processing an entire hard drive for duplicate files. There might be tens of thousands of files - possibly hundreds of thousands. Can your linked lists handle that quantity of files?

    Do you want it to be able to look for duplicate files across multiple drives? Now the number of files could be very large indeed.

    Will you have enough memory to build all these lists, and still have enough left over to run the rest of your program?

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Adak has valid concerns, although perhaps not as bad as you would think. I did some calculations:
    My C: has 64000 files on it. With your structure, that would be 17 MB. My D: has 240000 files on it [many of which are duplicates, because I have multiple copies of the same project source code in different versions], which would take about 60 MB.

    You may want to use a pointer to the filename instead, as the vast majority of your files will have names far shorter than 256 bytes - most would probably be around 12-20 bytes long + a 4-byte pointer -> 16-24 bytes to store the name, a saving of about 230 bytes per entry. If you have 1000 files, that's 230 KB of memory.
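    The pointer-based variant could look like this sketch (the helper name make_file is made up for illustration; strdup is POSIX and allocates exactly strlen(name)+1 bytes):

    ```c
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    struct file {
        char *name;          /* points to a strdup'd copy of the name */
        off_t size;
        struct file *next;
    };

    /* Allocate a file entry that stores only as many bytes
     * as the name actually needs. */
    struct file *make_file(const char *name, off_t size)
    {
        struct file *f = malloc(sizeof *f);
        f->name = strdup(name);
        f->size = size;
        f->next = NULL;
        return f;
    }
    ```

    The trade-off is one extra malloc per entry and the need to free the name separately when tearing the list down.
    
    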

    With my structure, it would be 16 bytes for a directory record and 12 bytes for a file record + an average filename length of 20 bytes. If we assume there's one directory per ten files, we get an average size of 20 + 12 + (12-16)/10 for the amortised directory record -> about 32.5 bytes per entry -> 7.6 MB for my 240000 files.

    [Of course, in both cases, we don't take into account extra space used by malloc - we probably need to add another half again for the malloc overhead - unless you allocate a bunch of records at a time, which may make sense to do].
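    The difference between the two layouts is easy to check mechanically with sizeof; the exact numbers depend on the platform's pointer size and struct padding, so treat this only as a sanity check of the estimates above:

    ```c
    #include <stdio.h>
    #include <sys/types.h>

    /* The fixed-array structs from the first post. */
    struct file { char name[256]; off_t size; struct file *next; };
    struct dir  { char name[256]; struct dir *subdirs;
                  struct file *files; struct dir *next; };

    /* The pointer-based variants suggested above. */
    struct file_p { char *name; off_t size; struct file_p *next; };
    struct dir_p  { char *name; struct dir_p *subdirs;
                    struct file_p *files; struct dir_p *next; };

    int main(void)
    {
        /* The fixed-array entry always pays for the full 256-byte name;
         * the pointer version pays sizeof(char *) + the actual length. */
        printf("fixed-array file entry:   %zu bytes\n", sizeof(struct file));
        printf("pointer-based file entry: %zu bytes (+ name length)\n",
               sizeof(struct file_p));
        return 0;
    }
    ```
    
    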

    The other factor is, of course, that files can sometimes appear to be duplicates by name and size even when their contents differ - this is particularly common for binary files with a fixed format and name: the file is always the same length and always has the same name, no matter what you do. So once you have found a "match" on name/size, you should compare the actual contents of the two files - a simple "read a block from each and compare, repeat until the whole file has been read" would work.
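    That block-by-block comparison might be sketched like this (the function name same_content is made up for illustration; it returns 1 only when the two files have byte-identical contents):

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Compare two files block by block; returns 1 if their contents
     * are identical, 0 if they differ or either cannot be opened. */
    int same_content(const char *path_a, const char *path_b)
    {
        FILE *a = fopen(path_a, "rb");
        FILE *b = fopen(path_b, "rb");
        int same = 0;

        if (a && b) {
            char buf_a[4096], buf_b[4096];
            size_t na, nb;
            same = 1;
            do {
                na = fread(buf_a, 1, sizeof buf_a, a);
                nb = fread(buf_b, 1, sizeof buf_b, b);
                if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
                    same = 0;
                    break;
                }
            } while (na == sizeof buf_a);  /* a short read means EOF */
        }
        if (a) fclose(a);
        if (b) fclose(b);
        return same;
    }
    ```

    Since name and size have already matched by the time this runs, the expensive byte comparison only happens for a small fraction of the files.
    
    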

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
