Thread: Sorting a file with a lot of words.

  1. #1
    Registered User
    Join Date
    Jan 2008
    Posts
    182

    Sorting a file with a lot of words.

    I was thinking of making a program that would take a text file and extract all the words from it into a new file, each word on its own line, in alphabetical order, and with no repeats. So if you have a file that says:
    Code:
    The quick brown fox jumped over the lazy dog.
    The dog got really really mad.
    The program would output:
    Code:
    brown
    dog
    fox
    got
    jumped
    lazy
    mad
    over
    quick
    really
    the
    The problem is that I haven't started, because in the planning stage I ran into a lot of walls.

    I think that maybe I can use fgetc to read the words character by character (words are separated by anything that is not a letter), check whether each word has already been read, sort the words with something like quicksort, and then write the file.

    But let's say I have a text file with hundreds of thousands of words in it. How will I store so many words? If I use linked lists, how would I sort them? The big question is... how would you do it (some pseudocode or an algorithm would be useful)?

    Thanks!!!
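A rough sketch of the word-scanning step described above, reading character by character with fgetc and treating anything that is not a letter as a separator (next_word and MAX_WORD are made-up names for illustration, not from the thread):

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 32

/* Read the next word (a run of letters) from fp into buf, lowercased.
   Returns the word length, or 0 at end of file.
   Words longer than MAX_WORD-1 chars are silently truncated. */
static int next_word(FILE *fp, char buf[MAX_WORD])
{
    int ch, len = 0;

    /* skip anything that is not a letter */
    while ((ch = fgetc(fp)) != EOF && !isalpha(ch))
        ;
    /* collect letters until a non-letter or EOF */
    while (ch != EOF && isalpha(ch)) {
        if (len < MAX_WORD - 1)
            buf[len++] = (char)tolower(ch);
        ch = fgetc(fp);
    }
    buf[len] = '\0';
    return len;
}
```

Note that ch is an int, not a char, so the EOF comparison works reliably (this matters later in the thread).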

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Note: these comments should only be taken as placeholders until iMalc notices the thread.

    Anyway: for small datasets, insertion sort (using an array/dynamic "array" of pointers, perhaps) seems like the obvious choice. For larger sets, the array of pointers would be necessary (far easier to sort pointers than real data); I might build the table, sort the whole thing, and remove duplicates.
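A minimal sketch of the pointer-sorting idea: insertion sort over an array of string pointers, so only the pointers move, never the strings themselves (insertion_sort is a made-up name for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Insertion-sort an array of string pointers in place.
   Only the pointers are shuffled; the string data stays put. */
static void insertion_sort(const char *words[], size_t n)
{
    for (size_t i = 1; i < n; i++) {
        const char *key = words[i];
        size_t j = i;
        /* shift larger entries right to open a slot for key */
        while (j > 0 && strcmp(words[j - 1], key) > 0) {
            words[j] = words[j - 1];
            j--;
        }
        words[j] = key;
    }
}
```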

  3. #3
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Well, I guess it is far easier to sort pointers. But how do I keep reading words without wasting memory (by declaring a five-thousand-pointer array like char *words[5000]) while making sure there is enough?

  4. #4
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    Quote Originally Posted by samus250 View Post
    Well, I guess it is far easier to sort pointers. But how do I keep reading words without wasting memory (by declaring a five-thousand-pointer array like char *words[5000]) while making sure there is enough?
    Well for the first version, you could simply use a dynamic array, calling realloc() to expand the array size (say doubling the array size each time) when necessary. Then call the standard qsort() to sort the words. Then once you have that code working, if you prefer, you could convert from an array to some other data structure.

    The only advantage I can think to using some other data structure is if words will need to be added after the initial list is built.
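The grow-with-realloc plus qsort approach might look something like this (cmp_words and add_word are made-up names for illustration; error handling is kept minimal):

```c
#include <stdlib.h>
#include <string.h>

/* qsort() hands the comparator pointers to the array elements;
   each element here is a char *, so we get pointers to char *. */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Append a copy of word to a growing pointer array, doubling the
   capacity with realloc() when it fills up.  Returns the (possibly
   moved) array, or NULL on allocation failure. */
static char **add_word(char **words, size_t *count, size_t *cap,
                       const char *word)
{
    if (*count == *cap) {
        size_t newcap = *cap ? *cap * 2 : 16;
        char **tmp = realloc(words, newcap * sizeof *tmp);
        if (tmp == NULL)
            return NULL;       /* old array is still valid in tmp==NULL case */
        words = tmp;
        *cap = newcap;
    }
    words[*count] = malloc(strlen(word) + 1);
    if (words[*count] == NULL)
        return NULL;
    strcpy(words[(*count)++], word);
    return words;
}
```

Once the array is filled, sorting it is one call: qsort(words, count, sizeof *words, cmp_words);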

  5. #5
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK, so I should load the FULL txt file into RAM (reallocating the char array when necessary), then sort everything with qsort(), and then write it to disk? I think I'll also need a two-dimensional array, right?

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Don't think in terms of "enough". There is NEVER "enough". Any in-memory structure or array size you use can easily be overflowed; the only thing that can't is a (slow) disk file, and IMO spilling back to disk defeats the purpose of reading the data in from disk in the first place.

    Think in terms of "handling the input in chunks" (I would use an array, but use whatever you like). Then sort the chunk you have, and write it out to a file.

    After all the chunks have been read in, mergesort all the sorted "chunk" files into one file. It's very handy to have a unique name for each chunk file, something like:

    dat 1 000001.txt

    which tells you this is the first level of chunk files, and that it's chunk file number 1 at that level.

    Now your mergesort program can use that info to find its input and produce its output appropriately.

    I've written this type of program twice: mergesort the first time, quicksort the second. The first was the more elegant program, but I do like quicksort and would probably use it again for writing out the first-level chunk files.

    Depending on your hardware/software, something between 5 and 10 thousand words per chunk should be good.
    Last edited by Adak; 04-25-2008 at 07:41 PM.

  7. #7
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK well, a very miniature-scale model: I am scanning words, I stop scanning after 3, and then I sort that chunk. The sorted chunk contains this:
    Code:
    apples
    pears
    torrents
    Then I keep reading, and the second sorted 3-word chunk contains:
    Code:
    bananas
    monkeys
    watermelons
    If I just concatenate them, they will be out of order, so I have to interleave them: put bananas after apples, monkeys before pears, and watermelons after torrents.

    How do I do that with two different chunks, so that I can merge them into one 6-word sorted file?
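This is the classic two-way merge: keep one line from each sorted file in memory, write out the smaller of the two, and refill from the file it came from. A sketch (merge_files and LINE_MAX_LEN are made-up names for illustration):

```c
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 64

/* Merge two files of sorted, newline-terminated words into out.
   Both inputs must already be sorted for the output to be sorted. */
static void merge_files(FILE *a, FILE *b, FILE *out)
{
    char la[LINE_MAX_LEN], lb[LINE_MAX_LEN];
    char *pa = fgets(la, sizeof la, a);   /* current line of a */
    char *pb = fgets(lb, sizeof lb, b);   /* current line of b */

    while (pa && pb) {
        if (strcmp(la, lb) <= 0) {        /* a's word comes first */
            fputs(la, out);
            pa = fgets(la, sizeof la, a);
        } else {                          /* b's word comes first */
            fputs(lb, out);
            pb = fgets(lb, sizeof lb, b);
        }
    }
    /* copy whatever is left of the non-empty file */
    while (pa) { fputs(la, out); pa = fgets(la, sizeof la, a); }
    while (pb) { fputs(lb, out); pb = fgets(lb, sizeof lb, b); }
}
```

With the two chunks above, the merge produces apples, bananas, monkeys, pears, torrents, watermelons in one pass.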

  8. #8
    Banned master5001's Avatar
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    This reminds me of the introduction to binary trees in K&R... You could actually just use a binary tree for this application.
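For reference, a K&R-style word tree might be sketched like this: each distinct word gets one node, duplicates are simply ignored on insert, and an in-order walk visits the words in alphabetical order, so sorting and dedup come for free (addtree and treeprint follow the K&R naming; error checks are omitted for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct tnode {
    char *word;
    struct tnode *left, *right;
};

/* Insert w into the tree rooted at p; duplicates are ignored. */
static struct tnode *addtree(struct tnode *p, const char *w)
{
    if (p == NULL) {                       /* new word: make a node */
        p = malloc(sizeof *p);
        p->word = malloc(strlen(w) + 1);
        strcpy(p->word, w);
        p->left = p->right = NULL;
    } else {
        int cond = strcmp(w, p->word);
        if (cond < 0)
            p->left = addtree(p->left, w);
        else if (cond > 0)
            p->right = addtree(p->right, w);
        /* cond == 0: duplicate, do nothing */
    }
    return p;
}

/* In-order walk: left subtree, node, right subtree => sorted output. */
static void treeprint(const struct tnode *p, FILE *out)
{
    if (p != NULL) {
        treeprint(p->left, out);
        fprintf(out, "%s\n", p->word);
        treeprint(p->right, out);
    }
}
```

One caveat: if the input words arrive in already-sorted order, a plain binary tree degenerates into a linked list and insertion becomes slow.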

  9. #9
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    I have no idea of what a binary tree is...

  10. #10
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    (ch = fgetc(read_file)) != EOF
    fgetc returns an int because it can return any character value from 0 to 255, plus EOF.

    ch should be declared as an int.

    When ch is declared as a char, you cannot distinguish between 255 and EOF.

    -------
    word[x] = '\0';
    Now you have a full word read.
    Allocate enough memory and copy the word there.
    After that you can reuse the word buffer to read further from the file.

    Do not forget to skip empty words, and add a check against buffer overrun (no more than 31 chars can be stored in your word buffer).
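Those steps might be sketched like this (save_word is a made-up helper name; it assumes the buffer has already been '\0'-terminated as described above, and the overrun check belongs in the read loop itself, e.g. if (x < 31) word[x++] = ch;):

```c
#include <stdlib.h>
#include <string.h>

/* Copy the finished word buffer into freshly allocated storage so
   the same buffer can be reused for the next word.
   Returns NULL for an empty word or on allocation failure. */
static char *save_word(const char *word, size_t len)
{
    char *copy;

    if (len == 0)                /* skip empty words */
        return NULL;
    copy = malloc(len + 1);      /* +1 for the '\0' */
    if (copy != NULL)
        memcpy(copy, word, len + 1);
    return copy;
}
```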
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  11. #11
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    ... But I've made other programs like this and they work. I thought it might be because I am evaluating what fgetc returns directly and assigning its value to ch separately.
    Code:
    (ch = fgetc(read_file)) != EOF
    I put everything in parentheses, so I guess what is being compared is what the function returns directly, not ch... or am I wrong?

    And also, does that mean I should change the array to an int array, or would it convert automatically from int to char? ... Or I could add a typecast.
    Last edited by samus250; 04-26-2008 at 03:30 PM.

  12. #12
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Quote Originally Posted by samus250 View Post
    ... But I've made other programs like this and they work. I thought it might be because I am evaluating what fgetc returns directly and assigning its value to ch separately.
    No...
    Let's see what happens...

    fgetc returns a 0-255 value for each char read from the file, or -1 to indicate the EOF condition;
    you store this value in the char variable ch.

    There are two possibilities.
    1. If char is signed by default on your system:

    - values from 0 to 127 are stored as-is
    - values from 128 to 255 are stored as -128 to -1
    - EOF is stored as -1

    When you compare ch to EOF, its value is converted to int, and you compare values from -128 to 127 with -1.

    So the condition is true for the original value EOF (which is good) or for the value 255 (which is bad, because you stop reading the file too early).
    If the file contains a character with value 255, the file is read only up to that character (not including it).

    2. Second possibility: char is unsigned by default.
    - values from 0 to 255 are stored as-is
    - EOF is stored as 255

    When you compare the ch variable with EOF, the comparison is never true, and you enter an infinite loop.


    ---------
    You say it is working for you; that means that so far you have only used compilers where char is signed by default, and files without a 255 character in them.

    The right thing to do: declare ch as an int and check whether it is EOF. If not, the possible values are 0-255, which can be freely stored in a char array element, so there is no need to declare your storage arrays as int; char is enough.
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  13. #13
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Thanks a lot vart, I'll do that.

    Since I am sure I'll never have a txt file as big as 100MB, I'll try to make one program that either sorts completely in RAM or sorts using Adak's suggestion, depending on the arguments. I'm pretty sure I can easily write the in-RAM version, and I'm pretty sure I can easily code the first steps of Adak's suggestion, but the last step I'll leave to the sort command; I just tried it in Ubuntu and it works! (Each word needs to be on its own line, though, so the files need to be pre-processed anyway.) After that I'll try not to use the sort command, for the challenge, or for any other OS that may not have it.

    The only thing I need to figure out how to do efficiently is removing duplicates. But I think I can manage it with a final pass over the file: since the words will be in order, duplicates will be consecutive. I guess it's not hard to do with a bool switch and a couple of ifs or something.
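That final dedup pass can be even simpler than a bool switch: because the input is sorted, comparing each word with the previous one is enough. A sketch (write_unique is a made-up name for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Write each word once.  The array must be sorted, so duplicates
   are adjacent and a single previous-word comparison suffices.
   Returns the number of distinct words written. */
static size_t write_unique(const char *words[], size_t n, FILE *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || strcmp(words[i], words[i - 1]) != 0) {
            fprintf(out, "%s\n", words[i]);
            written++;
        }
    }
    return written;
}
```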

    Thanks!

  14. #14
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Sure okay, if you can give yourself an upper limit of 100MB then sorting in memory should be okay.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  15. #15
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Quote Originally Posted by vart View Post
    word[x] = '\0';
    Now you have a full word read.
    Allocate enough memory and copy the word there.
    After that you can reuse the word buffer to read further from the file.

    Do not forget to skip empty words, and add a check against buffer overrun (no more than 31 chars can be stored in your word buffer).
    Good idea. So I just reuse my word[] variable over and over as temporary word storage. Then I guess I can count the number of bytes using the value in 'x', and malloc just enough memory for the word (including the terminating '\0'). My other question would be: where will I get enough pointers to point to every single address returned by malloc?

    After that, I guess I could just sort the pointers, check for repeated words, and then write the output file.
    Last edited by samus250; 04-26-2008 at 03:37 PM. Reason: After that, I guess....

