Thread: Listing duplicate filenames

  1. #1
    Registered User
    Join Date
    Jan 2015
    Location
    Rydułtowy, Poland
    Posts
    4

    Question Listing duplicate filenames

    Hi, good people!

    To start with I will tell you the problem I'm dealing with is homework and I am not expecting any of you guys to do it for me. I just need some guidance, because while my brain can do the thing in English I have trouble converting that to C - if that makes any sense.

    Anyway, I'm asked to write a program that searches through a certain directory (with subdirectories) and lists all duplicate file names with full paths in a text file.

    Now, I have managed to get the program to do the first thing - go through the directory with all the subdirectories and I can make it list all the found file names. The problem is finding the duplicates.
    Basically, I thought that since I have those file names in a text file, I could just work on that file: delete whatever is not repeated (leaving only one of each duplicate, of course) and I'm done. However, I found this rather brain-bending, and I have to say I'm not entirely fluent when it comes to file I/O.

    My initial idea was to open the text file, get the first line into some temp string and then copy the rest of the file into a new temp file. Then I wanted to compare my temp string to all the lines in the new text file and if a duplicate was found it would write the temp string into a third file which would be the final list.
    Then, just delete the initial text file and rename the temp file to the initial file's name and repeat the process.

    While this kinda works in my mind, I think a name that appears 3 times would be listed twice in the output file, and that's no good :C

    I also thought it might be a good idea to put the file names in an array of strings rather than a text file, since working on an array may be easier. I can't really get my head around that, though, and I guess the ListDirectoryContents function would have to return the array (right?).
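    For the array idea, one possible sketch (not the code from this thread) is to sort the names and then scan for runs of equal neighbours, which reports each repeated name once no matter how often it occurs. The names report_duplicates and cmp_names are made up for illustration, and strcmp compares case-sensitively, which is an assumption that may not match how Windows treats file names.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array whose elements are char pointers */
static int cmp_names(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/*
 * Hypothetical helper: sorts names[] in place, then prints every name that
 * occurs more than once, one line per name however often it repeats.
 * Returns the number of distinct duplicated names.
 */
size_t report_duplicates(const char **names, size_t count, FILE *out)
{
    size_t found = 0;
    size_t i = 0;

    qsort(names, count, sizeof names[0], cmp_names);

    while (i < count) {
        size_t run = 1;
        /* measure the run of identical names starting at index i */
        while (i + run < count && strcmp(names[i], names[i + run]) == 0)
            run++;
        if (run > 1) {
            fprintf(out, "%s (%zu times)\n", names[i], run);
            found++;
        }
        i += run; /* skip the whole run so it is reported only once */
    }
    return found;
}
```

    To keep the full paths as well, each element could be a small struct holding both the path and the bare name, sorted by name.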

    Anyway, if you guys could give me some advice on how to list the file names and compare them - I would be very grateful.

    Here's my code so far: (I only have one function in it, cause the rest I've just been trying to figure out and cannot really think of anything worth writing in the code)

    Code:
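    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <windows.h>
    
    bool ListDirectoryContents(char *sDir);
    
    const char *filename = "test.txt";
    FILE *fp = NULL; /* opened in main: a file-scope initializer must be a constant in C */
    
    int main()
    {
        fp = fopen(filename, "w");
        if (fp != NULL)
        {
            ListDirectoryContents("C:\\Foldername\\");
            fclose(fp);
        }
        else
        {
            perror("Couldn't open output file");
            system("pause");
        }
        return 0;
    }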
    bool ListDirectoryContents(char *sDir)
    {
        WIN32_FIND_DATA fd;
        HANDLE hFind = NULL;
    
    
        char sPath[2048];
    
    
        sprintf(sPath, "%s\\*.*", sDir);
    
    
        if ((hFind = FindFirstFile(sPath, &fd)) == INVALID_HANDLE_VALUE)
        {
            printf("Path not found: [%s]\n", sDir);
            return false;
        }
    
    
        do
        {
            if (strcmp(fd.cFileName, ".") != 0
                && strcmp(fd.cFileName, "..") != 0)
            {
                sprintf(sPath, "%s\\%s", sDir, fd.cFileName);
    
    
                if (fd.dwFileAttributes &FILE_ATTRIBUTE_DIRECTORY)
                {
                    fprintf(fp, "Directory: %s\n", sPath);
                    ListDirectoryContents(sPath);
                }
                else
                {
                    fprintf(fp, "%s\n", fd.cFileName);
                }
            }
        } while (FindNextFile(hFind, &fd));
    
    
        FindClose(hFind);
    
    
        return true;
    }

  2. #2
    Registered User
    Join Date
    May 2013
    Posts
    228
    I'd maintain some data structure that will count the number of appearances for each and every string added to it.

    Here's some kind of linked list that counts number of insertions.

    List.h:
    Code:
    #ifndef LIST_H_
    #define LIST_H_
    
    #include<stdlib.h>
    #include<string.h>
    #include<stdio.h>
    
    #define MAX_NAME 1024
    
    typedef struct entry {
    
        char key[MAX_NAME];
        int value;
        struct entry* next;
    
    }entry_t;
    
    typedef struct list {
    
        entry_t* head;
        entry_t* tail;
        size_t size;
    
    }list_t;
    
    
    list_t* createList();
    void addToList(list_t* list, const char* key);
    void printList(list_t* list);
    void destroyList(list_t* list);
    
    
    #endif /* LIST_H_ */

    List.c:
    Code:
    #include"List.h"
    
    
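    list_t* createList() {
    
        list_t* new_list=(list_t*)malloc(sizeof(list_t));
        if (!new_list) return NULL; // allocation failed
        new_list->head=NULL; // so printList is safe on an empty list
        new_list->tail=NULL;
        new_list->size=0;
        return new_list;
    
    }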
    
    
    void addToList(list_t* list, const char* key) {
    
        // first entry gets special treatment
        if (!list->size) {
    
            entry_t* new_entry=(entry_t*)malloc(sizeof(entry_t));
            strncpy(new_entry->key,key, MAX_NAME-1);
            new_entry->key[MAX_NAME-1]='\0'; // strncpy does not terminate when key is too long
            new_entry->value=1;
            new_entry->next=NULL;
    
            list->head=new_entry;
            list->tail=new_entry;
    
            list->size++;
            return;
        }
    
    
        entry_t* iterator=list->head;
        do {
            if (!strcmp(iterator->key, key)) {
                iterator->value++;
                return;
            }
    
            iterator=iterator->next;
    
        } while (iterator);
    
        // no match found, add new entry...
        entry_t* new_entry=(entry_t*)malloc(sizeof(entry_t));
        strncpy(new_entry->key, key, MAX_NAME-1);
        new_entry->key[MAX_NAME-1]='\0'; // strncpy does not terminate when key is too long
        new_entry->value=1;
        new_entry->next=NULL;
    
        list->tail->next=new_entry;
        list->tail=list->tail->next;
    
        list->size++;
        return;
    
    }
    
    
    void printList(list_t* list) {
    
        entry_t* iterator=list->head;
    
        while (iterator) {
            printf("key: %s\t\tvalue: %d\n", iterator->key, iterator->value);
            iterator=iterator->next;
        }
    }
    
    
    void destroyList(list_t* list) {
    
        size_t i; // same type as list->size
        entry_t* temp;
    
        for (i=0; i<list->size; ++i) {
            temp=list->head;
            list->head=list->head->next;
            free(temp);
        }
    
        free(list);
    
    }
    And this is how you'd use it:

    main.c:

    Code:
    #include<stdio.h>
    #include"List.h"
    
    
    
    int main() {
    
        list_t* list=createList();
    
        addToList(list, "file1");
        addToList(list, "file2");
        addToList(list, "file3");
        addToList(list, "file2");
        addToList(list, "file2");
        addToList(list, "file3");
    
        printList(list);
    
        destroyList(list);
    
    }
    Output:

    Code:
    key: file1        value: 1
    key: file2        value: 3
    key: file3        value: 2
    This can be also extended in order to support some more operations to your taste and needs.
    If you decide to use it eventually, bear in mind that this was not extensively checked, debugged and tested, and was written willy-nilly just to get you started.
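    As one example of such an extension, here is a minimal standalone sketch that prints only the keys added more than once. It redeclares a compatible entry type so it compiles on its own; countKey, printDuplicates and freeEntries are hypothetical names and are not part of List.h above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAME 1024

typedef struct entry {
    char key[MAX_NAME];
    int value;              /* number of times key was added */
    struct entry *next;
} entry_t;

/* Hypothetical helper: count key in the list at *head, appending it if new.
   Returns 0 on success, -1 if malloc fails. */
int countKey(entry_t **head, const char *key)
{
    entry_t **link = head;
    while (*link) {
        if (strcmp((*link)->key, key) == 0) {
            (*link)->value++;
            return 0;
        }
        link = &(*link)->next;
    }
    entry_t *e = malloc(sizeof *e);
    if (!e)
        return -1;
    strncpy(e->key, key, MAX_NAME - 1);
    e->key[MAX_NAME - 1] = '\0';   /* strncpy may leave the key unterminated */
    e->value = 1;
    e->next = NULL;
    *link = e;                     /* works for both empty and non-empty lists */
    return 0;
}

/* Hypothetical helper: print only keys seen more than once; returns how many. */
int printDuplicates(const entry_t *head, FILE *out)
{
    int n = 0;
    for (; head; head = head->next) {
        if (head->value > 1) {
            fprintf(out, "%s (%d times)\n", head->key, head->value);
            n++;
        }
    }
    return n;
}

/* Release every node of the list. */
void freeEntries(entry_t *head)
{
    while (head) {
        entry_t *next = head->next;
        free(head);
        head = next;
    }
}
```

    Walking the list through a pointer to the link removes the special case for the first entry, at the cost of slightly denser code.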
    Last edited by Absurd; 06-01-2015 at 05:35 PM.

  3. #3
    Registered User
    Join Date
    Jan 2015
    Location
    Rydułtowy, Poland
    Posts
    4
    Thank you for your response!

    Your suggestions are awesome, and hopefully someone will stumble upon this thread in the future and use them well. I, however, managed to come up with a similar idea... I guess my brain just needs some time not 'thinking in C' to get those ideas flowing again ^^
    Actually, yesterday morning it struck me that creating a linked list to store the data would be a lot easier, and since I'm quite familiar with linked lists it went rather smoothly and my program works fine.

    One thing I'm having trouble with, however is passing the directory name from user input into the function ListDirectoryContents. I'll show you how I implemented it after some changes:

    Code:
    void ListDirectoryContents(char *sDir){
    
    
    	WIN32_FIND_DATA fd;
    	HANDLE hFind = NULL;
    
    
    	char sPath[2048];
    
    
    	sprintf(sPath, "%s\\*.*", sDir);
    
    
    	if ((hFind = FindFirstFile(sPath, &fd)) == INVALID_HANDLE_VALUE)
    	{
    		printf("Path not found: [%s]\n", sDir);
    		system("pause");
    		return; // nothing to enumerate; don't fall through with an invalid handle
    	}
    
    
    	do
    	{
    		if (strcmp(fd.cFileName, ".") != 0
    			&& strcmp(fd.cFileName, "..") != 0)
    		{
    			sprintf(sPath, "%s\\%s", sDir, fd.cFileName);
    
    
    			if (fd.dwFileAttributes &FILE_ATTRIBUTE_DIRECTORY)
    			{
    				ListDirectoryContents(sPath);
    			}
    			else
    			{
    				if (start == NULL)
    				{
    					start = AppendItem(NULL, sPath, fd.cFileName);
    					newestFile = start;
    				}
    				else
    				{
    					newestFile = AppendItem(newestFile, sPath, fd.cFileName);
    				}
    			}
    		}
    	} while (FindNextFile(hFind, &fd));
    
    
    	FindClose(hFind);
    }
    Now, when I call it in main, I can hard-code the directory in the argument, but I'm having a hard time passing user input there. It appears to me that the "\n" character at the end of the input is messing it up, but I can't figure out how to get rid of it. I'll show you what I mean:

    Code:
    int main()
    {
    	if (fp != NULL)
    	{
    		char input[MAX_PATH];
    		fgets(input, MAX_PATH, stdin);
    		ListDirectoryContents(input);
    		DuplicatesFound(start);
    		system("pause");
    	}
    	else
    	{
    		perror("Couldn't open directory");
    		system("pause");
    	}
    
    
    	CleanMemory(start);
    	fclose(fp);
    	return 0;
    }
    I think it's the 'newline' character messing it up, because when I enter the directory in the console as follows: "C:\\Foldername\\" (I happen to have such a folder created for testing) I get an error message in the console:

    "Path not found: [C:\\Foldername\\
    ]"

    So there is the "\n" character after my directory name. And by the way, I enter it that way because I know this format works when I hard-code it.

    Any suggestions how to work around this?

  4. #4
    Registered User
    Join Date
    May 2013
    Posts
    228
    According to this:

    Reads characters from stream and stores them as a C string into str until (num-1) characters have been read or either a newline or the end-of-file is reached, whichever happens first.

    A newline character makes fgets stop reading, but it is considered a valid character by the function and included in the string copied to str.

    A terminating null character is automatically appended after the characters copied to str.
    So what I usually do is overwrite the newline character with a null terminator:

    Code:
    int main()
    {
        if (fp != NULL)
        {
            char input[MAX_PATH];
            fgets(input, MAX_PATH, stdin);
            input[strlen(input)-1]='\0'; // eliminate '\n' character
            ListDirectoryContents(input);
            DuplicatesFound(start);
            system("pause");
        }
        else
        {
            perror("Couldn't open directory");
            system("pause");
        }
    
    
        CleanMemory(start);
        fclose(fp);
        return 0;
    }
    You'd have to include string.h for strlen().

  5. #5
    Registered User
    Join Date
    Sep 2014
    Posts
    364
    Quote Originally Posted by Absurd View Post
    According to this:



    So what I usually do is overwriting the newline character with a null terminating character:

    Code:
    int main()
    {
        if (fp != NULL)
        {
            char input[MAX_PATH];
            fgets(input, MAX_PATH, stdin);
            input[strlen(input)-1]='\0'; // eliminate '\n' character
            ListDirectoryContents(input);
            DuplicatesFound(start);
            system("pause");
        }
        else
        {
            perror("Couldn't open directory");
            system("pause");
        }
    
    
        CleanMemory(start);
        fclose(fp);
        return 0;
    }
    You'd have to include string.h for strlen().
    With this, you always overwrite the last character, whether or not it is a newline.
    I use a function like this one:
    Code:
    #include <ctype.h>
    #include <string.h>
    
    int str_trim(char *source, char *dest) {
        size_t i, j;
        if (dest == NULL) dest = source;
        /* cast to unsigned char: passing a negative char to the <ctype.h> functions is undefined */
        for (i = 0 ; isspace((unsigned char)source[i]) ; i++ );
        for (j = 0 ; !iscntrl((unsigned char)source[i]) ; i++, j++) dest[j] = source[i];
        dest[j] = '\0';
        if (strlen(dest) < 1) return -1;
        for (j = strlen(dest) - 1 ; isspace((unsigned char)dest[j]) ; j--) dest[j] = '\0';
        return 0;
    }
    This function deletes leading and trailing whitespace.
    If you give source and destination, the source is unchanged.
    If you give only the source (destination is NULL), then the changes are made in place in source.
    Other have classes, we are class

  6. #6
    Registered User
    Join Date
    Jun 2011
    Posts
    4,513
    Quote Originally Posted by Absurd View Post
    So what I usually do is overwriting the newline character with a null terminating character:

    Code:
    int main()
    {
        if (fp != NULL)
        {
            char input[MAX_PATH];
            fgets(input, MAX_PATH, stdin);
            input[strlen(input)-1]='\0'; // eliminate '\n' character
            ListDirectoryContents(input);
            DuplicatesFound(start);
            system("pause");
        }
        else
        {
            perror("Couldn't open directory");
            system("pause");
        }
    
    
        CleanMemory(start);
        fclose(fp);
        return 0;
    }
    As WoodSTokk said, the last character is not guaranteed to be a newline, if the input exceeds the length passed to "fgets()".

    Furthermore, it is possible that the resulting string might have a length of zero. In that case, your code can exhibit a buffer-underflow bug.

    I put together this little example to demonstrate:

    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define STR_LEN 32
    
    int main(void)
    {
        char string[STR_LEN] = {0};
    
        while(fgets(string,STR_LEN,stdin) != NULL)
            printf(">%s",string);
    
        printf("\nString Length = %zu\n",strlen(string)); /* strlen returns size_t, so %zu rather than %d */
    
        return 0;
    }
    First run, with a valid string (EOF triggered to end the loop):

    Code:
    test
    >test
    ^Z
    
    String Length = 5
    Second run, with no string entered (EOF triggered to end the loop):

    Code:
    ^Z
    
    String Length = 0
    I personally like the approach given in the FAQ: FAQ > Get a line of text from the user/keyboard (C) - Cprogramming.com
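    For reference, here is a minimal sketch of one common way to do this (not necessarily the FAQ's exact code): check the return value of fgets and remove the newline only if it is really there. The name read_line is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read one line from `in` into buf (capacity `size`).
 * Returns buf on success, NULL on end-of-file or read error.
 * The trailing '\n' is removed only if it is actually present.
 */
char *read_line(char *buf, int size, FILE *in)
{
    if (fgets(buf, size, in) == NULL)
        return NULL;                  /* nothing usable in buf */

    /* strcspn returns the index of the first '\n', or the string length
       if there is none, so this never writes before or past the string */
    buf[strcspn(buf, "\n")] = '\0';
    return buf;
}
```

    If the input line is longer than the buffer, the rest of it stays in the stream and is returned by the next call, so a caller that cares may want to detect and discard it.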

  7. #7
    Registered User
    Join Date
    Jan 2015
    Location
    Rydułtowy, Poland
    Posts
    4
    Thank you all very much, you guys are epic and make this forum probably the best place to learn programming on the web!

    I really appreciate your help and I have certainly learnt something from this. Keep it up!

  8. #8
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Quote Originally Posted by Matticus View Post

    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define STR_LEN 32
    
    int main(void)
    {
        char string[STR_LEN] = {0};
    
        while(fgets(string,STR_LEN,stdin) != NULL)
            printf(">%s",string);
    
        printf("\nString Length = %zu\n",strlen(string)); /* strlen returns size_t, so %zu rather than %d */
    
        return 0;
    }
    Technically speaking you make a mistake here

    If a read error occurs, the error indicator (ferror) is set and a null pointer is also returned (but the contents pointed by str may have changed).
    When fgets fails and returns NULL, your initially empty buffer could have changed, so it is not safe to call strlen to verify that it is empty.
    I would make no assumptions about the buffer contents after a failed read and would not call any library function on it, treating it as uninitialized and thus containing garbage.
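    A minimal sketch of what that could look like: the buffer is only inspected when fgets succeeds, and on failure the stream itself is asked whether it was end-of-file or a read error. The enum and the name read_checked are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

enum read_status { READ_OK, READ_EOF, READ_ERROR };

/*
 * Hypothetical helper: read one line and report its length through *len,
 * but only look at buf when fgets actually succeeded.
 */
enum read_status read_checked(char *buf, int size, FILE *in, size_t *len)
{
    if (fgets(buf, size, in) != NULL) {
        *len = strlen(buf);   /* safe: buf holds a valid string here */
        return READ_OK;
    }

    /* fgets failed: leave buf alone and query the stream instead */
    *len = 0;
    return ferror(in) ? READ_ERROR : READ_EOF;
}
```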
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  9. #9
    Registered User
    Join Date
    Jun 2011
    Posts
    4,513
    Quote Originally Posted by vart View Post
    Technically speaking you make a mistake here



    When fgets fails and returns NULL - your initially empty buffer could have changed - so it is not safe to call strlen to verify it is empty.
    I would do no assumption regarding buffer contents after failed read operation and would not call any library function on it treating it as uninitialized and thus containing garbage.
    That code was intended for illustrative purposes only. I often omit "defensive" programming practices for such simple examples, since that can confuse the main point being made.

    Technically speaking, you are correct. But I do not consider it a mistake, since the sample was intended merely to illustrate a point, and as such did not assume a read error.

    The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.
