Thread: Sorting a file with a lot of words.

  1. #1
    Registered User
    Join Date
    Jan 2008
    Posts
    182

    Sorting a file with a lot of words.

    I was thinking of making a program that would take a text file and extract all the words from it into a new file, each word on its own line, in alphabetical order, with no repeats. So if you have a file that says:
    Code:
    The quick brown fox jumped over the lazy dog.
    The dog got really really mad.
    The program would output:
    Code:
    brown
    dog
    fox
    got
    jumped
    lazy
    mad
    over
    quick
    really
    the
    The problem is that I haven't started, because in the planning stage I ran into a lot of walls.

    I think I can use fgetc to read the file char by char (words are separated by anything that is not a letter), check whether each word has already been read, sort the words with something like quicksort, and then write the file.

    But let's say I have a text file with hundreds of thousands of words in it. How will I store so many words? If I use linked lists, how would I sort them? The big question is: how would you do it (some pseudocode or an algorithm would be useful)?

    Thanks!!!

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Note: these comments should only be taken as placeholders until iMalc notices the thread.

    Anyway: for small datasets, insertion sort (using an array/dynamic "array" of pointers, perhaps) seems like the obvious choice. For larger sets, the array of pointers would be necessary (far easier to sort pointers than real data); I might build the table, sort the whole thing, and remove duplicates.

  3. #3
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Well, I guess it is far easier to sort pointers. But how do I keep reading words without wasting memory (by declaring a five-thousand-pointer array like char *words[5000]) while at the same time making sure there is enough room?

  4. #4
    Banned master5001's Avatar
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    This reminds me of the introduction to binary trees in K&R... You could actually just use a binary tree for this application.

  5. #5
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    I have no idea of what a binary tree is...

  6. #6
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    Quote Originally Posted by samus250 View Post
    Well, I guess it is far easier to sort pointers. But how do I keep reading the words without wasting memory (by declaring a 5 thousand pointer array like *words[5000]) and at the same time by making sure there is enough?
    Well for the first version, you could simply use a dynamic array, calling realloc() to expand the array size (say doubling the array size each time) when necessary. Then call the standard qsort() to sort the words. Then once you have that code working, if you prefer, you could convert from an array to some other data structure.

    The only advantage I can think to using some other data structure is if words will need to be added after the initial list is built.

  7. #7
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK, so I should load the FULL text file into RAM (reallocating the char array when necessary), then sort everything with qsort(), and then store it to disk? I think I'll also need a two-dimensional array, right?

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Don't think in terms of "enough". There is NEVER "enough". Any structure or array size you choose can easily be overflowed, short of spilling over to a slow disk file, and IMO that defeats the purpose of reading it in from disk.

    Think in terms of "handling input in chunks" (I would use an array, but use whatever you like). Then sort the chunk you have, and write it out to a file.

    After all the chunks have been read in, then mergesort all the sorted "chunk" files, into one file. It's very handy if you have a unique name for the chunk files, something like:

    dat 1 000001.txt

    Which tells you this is the first level of chunk files, and it's also the number 1 chunk file at that level.

    Now your mergesort program can use that info to find its input and produce its output appropriately.

    I wrote this type of program twice. Used mergesort the first time, and quicksort the second. The first one was a more elegant program, but I do like quicksort and would probably use it again for writing out the first level chunk files.

    Depending on your hardware/software, something between 5 and 10 thousand words in a chunk, should be good.
    Last edited by Adak; 04-25-2008 at 07:41 PM.

  9. #9
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK well, a very miniature scale model: I am scanning words. I stopped scanning at 3, then sorted that chunk. The sorted chunk contains this:
    Code:
    apples
    pears
    torrents
    then, keep reading, and the second 3 sorted words chunk contains:
    Code:
    bananas
    monkeys
    watermelons
    If I just merge them, they will be out of order, so I have to like, put bananas after apples and monkeys before pears and watermelons after torrents.

    How do I do that with two different chunks, so that I can merge them into one 6 word sorted file?

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by samus250 View Post
    ok well, a very miniature scale model: I am scanning words. Stopped scanning at 3, and then sort that chunk. The chunk sorted contains this:
    Code:
    apples
    pears
    torrents
    then, keep reading, and the second 3 sorted words chunk contains:
    Code:
    bananas
    monkeys
    watermelons
    If I just merge them, they will be out of order, so I have to like, put bananas after apples and monkeys before pears and watermelons after torrents.

    How do I do that with two different chunks, so that I can merge them into one 6 word sorted file?
    No No! When I say "merge" them, I mean merge them IN ORDER, ala mergesort. This is the specialty of mergesort, and has been since IBM mainframes ruled the earth.

    Your first level chunks might be named:
    dat 1 1.txt, and dat 1 2.txt, which your program will be looking for, when it's ready to start merging files.

    The basic concept is:
    Open both dat files if they exist, for reading.
    Open one dat level 2 file for output, maybe dat 2 1.txt


    get a word from both files. Compare them, with strcmp(). Write out the lower of the two to the output file.
    Now:
    while neither file has reached the eof marker

    get the next word from the file that had the lower word, the last time.
    if the word from dat 1 1.txt is < the word from dat 1 2.txt (and you'll use strcmp to tell you
    which is less, so include <string.h> in your header files.)

    then write out the lesser word to the output file.

    loop back to while.

    Now write out any words left over in whichever file still has data, until all of its words have been read and written.

    The devilish details come when you have 16 levels of small sorted files that all need to be merged, and you have to tell your program EXACTLY what the next level's filenames will be, etc.

    This works ONLY because the small chunk files, (and all temporary merge files since then), are all in sorted order.

    Hope that helps.
    Last edited by Adak; 04-25-2008 at 09:19 PM.

  11. #11
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK cool, I think I got that more or less.

    I'll start writing the code, and then I will post here if something doesn't work. I also found this: http://en.wikipedia.org/wiki/Merge_sort

    Thanks.

  12. #12
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    MergeSort is a particularly good choice if you might be doing external sorting, i.e. sorting more data than you're holding in RAM at any one point.
    How large might the files of words get? Is it possible that these might be larger than you'd want to try loading into RAM?
    As for how you store it in memory: however you store it, it is going to be less efficient in terms of memory usage than sorting, say, integers, because the lengths of words vary. You pretty much have to use dynamic memory allocation.

    The interesting thing here is that you also want to remove duplicates. You could do that as you go, or you could do another pass afterwards specifically for it. Some sorting methods, such as tree sort, make it easy to remove duplicates as you go. However, you might find it easier to do it as a post-processing step.

    Adak seems to have the right idea, except that you "Write out the lower of the two to the output file" in all cases except where one of them is less than the last item written out, and the other is greater than the last item written out. Otherwise you'd end up with polynomial running time.
    http://en.wikipedia.org/wiki/External_sort
    http://www.google.com/search?hl=en&q...+sort%22&meta=
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  13. #13
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK well, I have a couple of text files larger than 2MB that I want to test with. So first I'll make the "load everything into RAM first and then sort" program. But I'm stuck. This is what I have right now.


    Code:
    #include <stdio.h>
    #include <ctype.h> /* for isalpha() */
    
    int main(int argc, char *argv[])
    {
    	FILE *read_file;
    	char ch;
    	int x = 0; 
    	char word[32]; // word storage
    	
    	read_file = fopen(argv[1],"r"); // open file where specified
    	if(read_file == NULL) {
    		printf("Error reading file\n");
    		return 1;
    	}
    
    	while( (ch = fgetc(read_file)) != EOF) {
    		if(isalpha(ch)) {
    			while(isalpha(ch)) {
    				word[x] = ch;
    				x++;
    				ch = fgetc(read_file);
    			}
    			word[x] = '\0';
    
    		} else {
    			// A non alphabetical char was encountered.
    			// I suppose that I allocate more memory for the next word here
    			// not sure how to do it
    			// allocate_memory()
    			continue;
    		}
    	}
    	
    	// sort_words();
    	// write_to_file();
    	
    	return 0;
    }
    As you can see by the comments, I don't know how I'm going to allocate memory. If I make an array of char pointers, each pointer holding a word, what will I do when I run out of pointers? How do I know how big the word is before scanning it so that I can malloc just enough space to put the word in a char pointer?

    Sorry, I'm messed up. I really need some malloc explanations.

    Thanks.

  14. #14
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    ch = fgetc(read_file)) != EOF
    fgetc returns int because it can read values from 0 to 255 and EOF

    ch should be declared as int

    when declared as char, you cannot distinguish between the character 255 and EOF

    -------
    word[x] = '\0';
    now you have a full word read
    allocate enough memory and copy word there
    After that you can use the word buffer again to read further from the file

    Do not forget to skip empty words and add a check against buffer overrun (no more than 31 chars can be stored in your word buffer).
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  15. #15
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    ... But I've made other programs like this and they work. I thought it might be because I am evaluating what fgetc returns directly, and assigning its value to ch separately.
    Code:
    (ch = fgetc(read_file)) != EOF
    I put everything in parentheses, so I guess that what is being compared is what the function returns directly, not ch... or am I wrong?

    And also, does that mean I should change the array to an int array, or would it convert automatically from int to char? ... Or I could add a typecast.
    Last edited by samus250; 04-26-2008 at 03:30 PM.
