Thread: Building a tree from a directory structure

  1. #1
    Registered User geekoftheweek's Avatar
    Join Date
    Mar 2003

    Building a tree from a directory structure


    I'm designing (or trying) a program that will be command line on linux/unix that
    checks for duplicate files. I'm coding a function that will take argv[1] - argv[n] and build
    a linked list tree out of those directory arguments:


    $ checkdups /bin /home

    this will check for duplicate files in the /bin and /home directories descending into their
    subdirectories also. My question is with my implementation. I have a:
    struct dir {
              char name[256];
              struct dir *subdirs;
              struct file *files;
              struct dir *next;
    struct file {
               char name[256];
               off_t size;
               struct file *next;
    this gives the current directory name, followed by a pointer to a linked list of all its
    sub-directories, followed by a pointer to another linked list with all the current
    directories files, and a *next pointer if this directory happens to be on same level
    with other directories. Using the two above structs I would be able to describe a
    directory structure using linked lists and be able to traverse it to operate on all files
    below a particular set of directories. Am I going about this in the right way?


  2. #2
    Registered User
    Join Date
    Sep 2006
    If your implementation leads you to a jump off point for an efficient algorithm to test for those duplicates, then I'd have to say "Yes - definitely". Otherwise, no.

    Several years back, I wrote a program to do this in DOS, and simply found a dir command with the added options that could be output to a file, and then my program used that file to do it's work.

    My concern with your lists lies in the case where you're processing an entire hard drive for duplicate files. There might be tens of thousands of files - possibly hundreds of thousands. Can your linked lists handle that quantity of files?

    Do you want it to be able to look for duplicate files across multiple drives? Now the number of files could be very large indeed.

    Will you have enough memory to build all these lists, and have enough memory left over, to then run the rest of your program?

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Farncombe, Surrey, England
    Adak has valid concerns, alhtough perhaps not as bad as you would think. I did some calculations:
    My C: has 64000 files on it. With your structure, that would be 17MB. My D: has 240000 files on it [many of which are duplicates because I have multiple copies of the same project source code, in different versions], which would take about 60 MB.

    You may want to use a pointer to the filename, as a huge number of your files will have names that are shorter than 256 - most would probably be around 12-20 bytes long + a 4 byte pointer -> 16-24 bytes to store the name - a saving of about 230 bytes per entry. If you have 1000 files, that's 230 KB of memory.

    With my structure, it would be 16 bytes for a directory and 12 bytes for a file + average filename length of 20 bytes. If we assume there's 1 directory per ten files, we get an average structure size of 20 + 12 + (12-16) / 10 -> 32.5 bytes per entry -> 7.6 MB.

    [Of course, in both cases, we don't take into account extra space used by malloc - we probably need to add another half again for the malloc overhead - unless you allocate a bunch of records at a time, which may make sense to do].

    The other factor is of course that sometimes, files can appear to be duplicates even if they don't have the same content - this is particularly common for binary files that have a fixed format and name - the file is always a fixed length and always the same name, no matter what you do. So once you have found a "match" in name/size, you probably should use some more advanced method to compare the content of the file - a simple "read a block from each and compare it, repeat until all the file has been read" would work.

    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. any smarter ways to design this tree structure
    By George2 in forum C++ Programming
    Replies: 5
    Last Post: 04-19-2008, 07:34 AM
  2. BST delete again, but this time I think I'm close
    By tms43 in forum C++ Programming
    Replies: 9
    Last Post: 11-05-2006, 06:24 PM
  3. Binary Search Trees Part III
    By Prelude in forum A Brief History of
    Replies: 16
    Last Post: 10-02-2004, 03:00 PM
  4. C++ Data structure Tree
    By Unregistered in forum C++ Programming
    Replies: 2
    Last Post: 10-23-2001, 10:14 PM