Thread: Sorting a file with a lot of words.

  1. #1
    Registered User
    Join Date
    Jan 2008
    Posts
    182

    Sorting a file with a lot of words.

    I was thinking of making a program that would take a text file and extract all the words from it into a new file, each word on its own line, in alphabetical order, with no repeats. So if you have a file that says:
    Code:
    The quick brown fox jumped over the lazy dog.
    The dog got really really mad.
    The program would output:
    Code:
    brown
    dog
    fox
    got
    jumped
    lazy
    mad
    over
    quick
    really
    the
    The problem is that I haven't started, because in the planning stage I ran into a lot of walls.

    I think I can use fgetc to read the file char by char (words are separated by anything that is not a letter), check whether each word has already been read, sort the words with something like quicksort, and then write the file.

    But let's say I have a text file with hundreds of thousands of words in it. How will I store so many words? If I use linked lists, how would I sort them? The big question is: how would you do it (some pseudocode or an algorithm would be useful)?

    Thanks!!!

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Note: these comments should only be taken as placeholders until iMalc notices the thread.

    Anyway: for small datasets, insertion sort (using an array/dynamic "array" of pointers, perhaps) seems like the obvious choice. For larger sets, the array of pointers would be necessary (far easier to sort pointers than real data); I might build the table, sort the whole thing, and remove duplicates.

  3. #3
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Well, I guess it is far easier to sort pointers. But how do I keep reading words without wasting memory (by declaring a five-thousand-pointer array like char *words[5000]) while at the same time making sure there is enough room?

  4. #4
    Banned master5001's Avatar
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    This reminds me of the introduction to binary trees in K&R... You could actually just use a binary tree for this application.

  5. #5
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    I have no idea of what a binary tree is...

  6. #6
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    Quote Originally Posted by samus250 View Post
    Well, I guess it is far easier to sort pointers. But how do I keep reading the words without wasting memory (by declaring a 5 thousand pointer array like *words[5000]) and at the same time by making sure there is enough?
    Well for the first version, you could simply use a dynamic array, calling realloc() to expand the array size (say doubling the array size each time) when necessary. Then call the standard qsort() to sort the words. Then once you have that code working, if you prefer, you could convert from an array to some other data structure.

    The only advantage I can think to using some other data structure is if words will need to be added after the initial list is built.

  7. #7
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK, so I should load the FULL text file into RAM (reallocating the char array when necessary), then sort everything with qsort(), and then store it to disk? I think I'll also need a two-dimensional array, right?

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Don't think in terms of "enough". There is NEVER "enough". Any structure or array size you choose can easily be overflowed, short of spilling over to a slow disk file, and IMO that defeats the purpose of reading it in from disk.

    Think in terms of "handling input in chunks" (I would use an array, but use whatever you like). Then sort the chunk you have, and write it out to a file.

    After all the chunks have been read in, then mergesort all the sorted "chunk" files, into one file. It's very handy if you have a unique name for the chunk files, something like:

    dat 1 000001.txt

    Which tells you this is the first level of chunk files, and it's also the number 1 chunk file at that level.

    Now your mergesort program can use that info to find its input and produce its output appropriately.

    I wrote this type of program twice. Used mergesort the first time, and quicksort the second. The first one was a more elegant program, but I do like quicksort and would probably use it again for writing out the first level chunk files.

    Depending on your hardware/software, something between 5 and 10 thousand words in a chunk, should be good.
    Last edited by Adak; 04-25-2008 at 07:41 PM.

  9. #9
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK well, a very miniature scale model: I am scanning words. I stopped scanning at 3, then sorted that chunk. The sorted chunk contains this:
    Code:
    apples
    pears
    torrents
    then, keep reading, and the second 3 sorted words chunk contains:
    Code:
    bananas
    monkeys
    watermelons
    If I just merge them, they will be out of order, so I have to like, put bananas after apples and monkeys before pears and watermelons after torrents.

    How do I do that with two different chunks, so that I can merge them into one 6 word sorted file?

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by samus250 View Post
    ok well, a very miniature scale model: I am scanning words. Stopped scanning at 3, and then sort that chunk. The chunk sorted contains this:
    Code:
    apples
    pears
    torrents
    then, keep reading, and the second 3 sorted words chunk contains:
    Code:
    bananas
    monkeys
    watermelons
    If I just merge them, they will be out of order, so I have to like, put bananas after apples and monkeys before pears and watermelons after torrents.

    How do I do that with two different chunks, so that I can merge them into one 6 word sorted file?
    No No! When I say "merge" them, I mean merge them IN ORDER, ala mergesort. This is the specialty of mergesort, and has been since IBM mainframes ruled the earth.

    Your first level chunks might be named:
    dat 1 1.txt, and dat 1 2.txt, which your program will be looking for, when it's ready to start merging files.

    The basic concept is:
    Open both dat files if they exist, for reading.
    Open one dat level 2 file for output, maybe dat 2 1.txt


    get a word from both files. Compare them, with strcmp(). Write out the lower of the two to the output file.
    Now:
    while neither file has reached the eof marker

    get the next word from the file that had the lower word, the last time.
    if the word from dat 1 1.txt is < the word from dat 1 2.txt (and you'll use strcmp to tell you
    which is less, so include <string.h> in your header files.)

    then write out the lesser word to the output file.

    loop back to while.

    Now write out any words left over in whichever file still has data, until all of its words have been read and written.

    The devilish details come when you have 16 levels of small sorted files that all need to be merged, and you have to tell your program EXACTLY what the next level's filenames will be, etc.

    This works ONLY because the small chunk files, (and all temporary merge files since then), are all in sorted order.

    Hope that helps.
    Last edited by Adak; 04-25-2008 at 09:19 PM.

  11. #11
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK cool, I think I got that more or less.

    I'll start writing the code, and then I will post here if something doesn't work. I also found this: http://en.wikipedia.org/wiki/Merge_sort

    Thanks.

  12. #12
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    MergeSort is a particularly good choice if you might be doing external sorting, i.e. sorting more data than you're holding in RAM at any one point.
    How large might the files of words get? Is it possible that these might be larger than you'd want to try loading into RAM?
    As for how you store it in memory: however you store it, it is going to be less efficient in terms of memory usage than sorting, say, integers, because the lengths of words vary. You pretty much have to use dynamic memory allocation.

    The interesting thing here is that you also want to remove duplicates. You could do that as you go, or you could do another pass afterwards specifically for it. Some sorting methods, such as tree sort, make it easy to remove duplicates as you go. However, you might find it easier to do it as a post-processing step.

    Adak seems to have the right idea, except that you "Write out the lower of the two to the output file" in all cases except where one of them is less than the last item written out, and the other is greater than the last item written out. Otherwise you'd end up with polynomial running time.
    http://en.wikipedia.org/wiki/External_sort
    http://www.google.com/search?hl=en&q...+sort%22&meta=
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  13. #13
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK well, I have a couple of text files larger than 2MB that I want to test with. So first I'll make the "load everything into RAM first and then sort" program. But I'm stuck. This is what I have right now.


    Code:
    #include <stdio.h>
    #include <ctype.h> /* for isalpha() */
    
    int main(int argc, char *argv[])
    {
    	FILE *read_file;
    	char ch;
    	int x = 0; 
    	char word[32]; // word storage
    	
    	read_file = fopen(argv[1],"r"); // open file where specified
    	if(read_file == NULL) {
    		printf("Error reading file\n");
    		return 1;
    	}
    
    	while( (ch = fgetc(read_file)) != EOF) {
    		if(isalpha(ch)) {
    			while(isalpha(ch)) {
    				word[x] = ch;
    				x++;
    				ch = fgetc(read_file);
    			}
    			word[x] = '\0';
    
    		} else {
    			// A non alphabetical char was encountered.
    			// I suppose that I allocate more memory for the next word here
    			// not sure how to do it
    			// allocate_memory()
    			continue;
    		}
    	}
    	
    	// sort_words();
    	// write_to_file();
    	
    	return 0;
    }
    As you can see by the comments, I don't know how I'm going to allocate memory. If I make an array of char pointers, each pointer holding a word, what will I do when I run out of pointers? How do I know how big the word is before scanning it so that I can malloc just enough space to put the word in a char pointer?

    Sorry, I'm messed up. I really need some malloc explanations.

    Thanks.

  14. #14
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    ch = fgetc(read_file)) != EOF
    fgetc returns int because it can read values from 0 to 255 and EOF

    ch should be declared as int

    when declared as char, you cannot distinguish between the character 255 and EOF

    -------
    word[x] = '\0';
    now you have a full word read
    allocate enough memory and copy word there
    After that you can use the word buffer again to read further from the file

    Do not forget to skip empty words and add a check against buffer overrun (no more than 31 chars can be stored in your word buffer).
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  15. #15
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    ... But I've made other programs like this and they work. I thought it might be because I am evaluating what fgetc returns directly, and assigning its value to ch separately.
    Code:
    (ch = fgetc(read_file)) != EOF
    I put everything in parentheses, so I guess that what is being compared is what the function returns directly, not ch... or am I wrong?

    And also, does that mean I should change the array to an int array, or would it convert automatically from int to char? ... Or I could add a typecast.
    Last edited by samus250; 04-26-2008 at 03:30 PM.
