Thread: Sorting a file with a lot of words.

  1. #1
    Registered User
    Join Date
    Jan 2008
    Posts
    182

    Sorting a file with a lot of words.

    I was thinking of making a program that would take a text file and extract all the words from it into a new file, each word on its own line, in alphabetical order, and with no repeats. So if you have a file that says:
    Code:
    The quick brown fox jumped over the lazy dog.
    The dog got really really mad.
    The program would output:
    Code:
    brown
    dog
    fox
    got
    jumped
    lazy
    mad
    over
    quick
    really
    the
    The problem is that I haven't started, because in the planning stage I ran into a lot of walls.

    I think that maybe I can use fgetc to read the words character by character (words are separated by anything that is not a letter), check whether each word has already been read, sort the words with something like quicksort, and then write the file.

    But let's say I have a text file with hundreds of thousands of words in it. How will I store so many words? If I use linked lists, how would I sort them? The big question is... how would you do it (some pseudocode or an algorithm would be useful)?

    Thanks!!!
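A rough sketch of the word-scanning step described above, reading character by character with fgetc and treating anything that is not a letter as a separator (next_word and MAX_WORD are made-up names for illustration, not from the thread):

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 32

/* Read the next word (a run of letters) from fp into buf, lowercased.
   Returns the word length, or 0 at end of file.
   Words longer than MAX_WORD-1 chars are silently truncated. */
static int next_word(FILE *fp, char buf[MAX_WORD])
{
    int ch, len = 0;

    /* skip anything that is not a letter */
    while ((ch = fgetc(fp)) != EOF && !isalpha(ch))
        ;
    /* collect letters until a non-letter or EOF */
    while (ch != EOF && isalpha(ch)) {
        if (len < MAX_WORD - 1)
            buf[len++] = (char)tolower(ch);
        ch = fgetc(fp);
    }
    buf[len] = '\0';
    return len;
}
```

Note that ch is an int, not a char, so the EOF comparison works reliably (this matters later in the thread).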

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Note: these comments should only be taken as placeholders until iMalc notices the thread.

    Anyway: for small datasets, insertion sort (using an array/dynamic "array" of pointers, perhaps) seems like the obvious choice. For larger sets, the array of pointers would be necessary (far easier to sort pointers than real data); I might build the table, sort the whole thing, and remove duplicates.
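A minimal sketch of the pointer-sorting idea: insertion sort over an array of string pointers, so only the pointers move, never the strings themselves (insertion_sort is a made-up name for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Insertion-sort an array of string pointers in place.
   Only the pointers are shuffled; the string data stays put. */
static void insertion_sort(const char *words[], size_t n)
{
    for (size_t i = 1; i < n; i++) {
        const char *key = words[i];
        size_t j = i;
        /* shift larger entries right to open a slot for key */
        while (j > 0 && strcmp(words[j - 1], key) > 0) {
            words[j] = words[j - 1];
            j--;
        }
        words[j] = key;
    }
}
```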

  3. #3
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Well, I guess it is far easier to sort pointers. But how do I keep reading words without wasting memory (by declaring a five-thousand-pointer array like char *words[5000]) while making sure there is enough?

  4. #4
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    Quote Originally Posted by samus250 View Post
    Well, I guess it is far easier to sort pointers. But how do I keep reading words without wasting memory (by declaring a five-thousand-pointer array like char *words[5000]) while making sure there is enough?
    Well for the first version, you could simply use a dynamic array, calling realloc() to expand the array size (say doubling the array size each time) when necessary. Then call the standard qsort() to sort the words. Then once you have that code working, if you prefer, you could convert from an array to some other data structure.

    The only advantage I can think to using some other data structure is if words will need to be added after the initial list is built.
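The grow-with-realloc plus qsort approach might look something like this (cmp_words and add_word are made-up names for illustration; error handling is kept minimal):

```c
#include <stdlib.h>
#include <string.h>

/* qsort() hands the comparator pointers to the array elements;
   each element here is a char *, so we get pointers to char *. */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Append a copy of word to a growing pointer array, doubling the
   capacity with realloc() when it fills up.  Returns the (possibly
   moved) array, or NULL on allocation failure. */
static char **add_word(char **words, size_t *count, size_t *cap,
                       const char *word)
{
    if (*count == *cap) {
        size_t newcap = *cap ? *cap * 2 : 16;
        char **tmp = realloc(words, newcap * sizeof *tmp);
        if (tmp == NULL)
            return NULL;       /* old array is still valid in tmp==NULL case */
        words = tmp;
        *cap = newcap;
    }
    words[*count] = malloc(strlen(word) + 1);
    if (words[*count] == NULL)
        return NULL;
    strcpy(words[(*count)++], word);
    return words;
}
```

Once the array is filled, sorting it is one call: qsort(words, count, sizeof *words, cmp_words);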

  5. #5
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK, so I should load the FULL txt file into RAM (reallocating the char array when necessary), then sort everything with qsort(), and then write it to disk? I think I'll also need a two-dimensional array, right?

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Don't think in terms of "enough". There is NEVER "enough". Any in-memory structure or array size you use can easily be overflowed; the only thing that can't is a (slow) disk file, and IMO spilling back to disk defeats the purpose of reading the data in from disk in the first place.

    Think in terms of "handling the input in chunks" (I would use an array, but use whatever you like). Then sort the chunk you have, and write it out to a file.

    After all the chunks have been read in, mergesort all the sorted "chunk" files into one file. It's very handy to have a unique name for each chunk file, something like:

    dat 1 000001.txt

    which tells you this is the first level of chunk files, and that it's chunk file number 1 at that level.

    Now your mergesort program can use that info to find its input and produce its output appropriately.

    I've written this type of program twice: mergesort the first time, quicksort the second. The first was the more elegant program, but I do like quicksort and would probably use it again for writing out the first-level chunk files.

    Depending on your hardware/software, something between 5 and 10 thousand words per chunk should be good.
    Last edited by Adak; 04-25-2008 at 07:41 PM.

  7. #7
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK well, a very miniature-scale model: I am scanning words, I stop scanning after 3, and then I sort that chunk. The sorted chunk contains this:
    Code:
    apples
    pears
    torrents
    Then I keep reading, and the second sorted 3-word chunk contains:
    Code:
    bananas
    monkeys
    watermelons
    If I just concatenate them, they will be out of order, so I have to interleave them: put bananas after apples, monkeys before pears, and watermelons after torrents.

    How do I do that with two different chunks, so that I can merge them into one 6-word sorted file?
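This is the classic two-way merge: keep one line from each sorted file in memory, write out the smaller of the two, and refill from the file it came from. A sketch (merge_files and LINE_MAX_LEN are made-up names for illustration):

```c
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 64

/* Merge two files of sorted, newline-terminated words into out.
   Both inputs must already be sorted for the output to be sorted. */
static void merge_files(FILE *a, FILE *b, FILE *out)
{
    char la[LINE_MAX_LEN], lb[LINE_MAX_LEN];
    char *pa = fgets(la, sizeof la, a);   /* current line of a */
    char *pb = fgets(lb, sizeof lb, b);   /* current line of b */

    while (pa && pb) {
        if (strcmp(la, lb) <= 0) {        /* a's word comes first */
            fputs(la, out);
            pa = fgets(la, sizeof la, a);
        } else {                          /* b's word comes first */
            fputs(lb, out);
            pb = fgets(lb, sizeof lb, b);
        }
    }
    /* copy whatever is left of the non-empty file */
    while (pa) { fputs(la, out); pa = fgets(la, sizeof la, a); }
    while (pb) { fputs(lb, out); pb = fgets(lb, sizeof lb, b); }
}
```

With the two chunks above, the merge produces apples, bananas, monkeys, pears, torrents, watermelons in one pass.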

  8. #8
    Banned master5001's Avatar
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    This reminds me of the introduction to binary trees in K&R... You could actually just use a binary tree for this application.
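For reference, a K&R-style word tree might be sketched like this: each distinct word gets one node, duplicates are simply ignored on insert, and an in-order walk visits the words in alphabetical order, so sorting and dedup come for free (addtree and treeprint follow the K&R naming; error checks are omitted for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct tnode {
    char *word;
    struct tnode *left, *right;
};

/* Insert w into the tree rooted at p; duplicates are ignored. */
static struct tnode *addtree(struct tnode *p, const char *w)
{
    if (p == NULL) {                       /* new word: make a node */
        p = malloc(sizeof *p);
        p->word = malloc(strlen(w) + 1);
        strcpy(p->word, w);
        p->left = p->right = NULL;
    } else {
        int cond = strcmp(w, p->word);
        if (cond < 0)
            p->left = addtree(p->left, w);
        else if (cond > 0)
            p->right = addtree(p->right, w);
        /* cond == 0: duplicate, do nothing */
    }
    return p;
}

/* In-order walk: left subtree, node, right subtree => sorted output. */
static void treeprint(const struct tnode *p, FILE *out)
{
    if (p != NULL) {
        treeprint(p->left, out);
        fprintf(out, "%s\n", p->word);
        treeprint(p->right, out);
    }
}
```

One caveat: if the input words arrive in already-sorted order, a plain binary tree degenerates into a linked list and insertion becomes slow.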

  9. #9
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    I have no idea of what a binary tree is...

  10. #10
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    (ch = fgetc(read_file)) != EOF
    fgetc returns an int because it can return any character value from 0 to 255, plus EOF.

    ch should be declared as an int.

    When ch is declared as a char, you cannot distinguish between 255 and EOF.

    -------
    word[x] = '\0';
    Now you have a full word read.
    Allocate enough memory and copy the word there.
    After that you can reuse the word buffer to read further from the file.

    Do not forget to skip empty words, and add a check against buffer overrun (no more than 31 chars can be stored in your word buffer).
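Those steps might be sketched like this (save_word is a made-up helper name; it assumes the buffer has already been '\0'-terminated as described above, and the overrun check belongs in the read loop itself, e.g. if (x < 31) word[x++] = ch;):

```c
#include <stdlib.h>
#include <string.h>

/* Copy the finished word buffer into freshly allocated storage so
   the same buffer can be reused for the next word.
   Returns NULL for an empty word or on allocation failure. */
static char *save_word(const char *word, size_t len)
{
    char *copy;

    if (len == 0)                /* skip empty words */
        return NULL;
    copy = malloc(len + 1);      /* +1 for the '\0' */
    if (copy != NULL)
        memcpy(copy, word, len + 1);
    return copy;
}
```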
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  11. #11
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    ... But I've made other programs like this and they work. I thought it might be because I am evaluating what fgetc returns directly and assigning its value to ch separately.
    Code:
    (ch = fgetc(read_file)) != EOF
    I put everything in parentheses, so I guess what is being compared is what the function returns directly, not ch... or am I wrong?

    And also, does that mean I should change the array to an int array, or would it convert automatically from int to char? ... Or I could add a typecast.
    Last edited by samus250; 04-26-2008 at 03:30 PM.

  12. #12
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Quote Originally Posted by samus250 View Post
    ... But I've made other programs like this and they work. I thought it might be because I am evaluating what fgetc returns directly and assigning its value to ch separately.
    No...
    Let's see what happens...

    fgetc returns a 0-255 value for each char read from the file, or -1 to indicate the EOF condition;
    you store this value in the char variable ch.

    There are two possibilities.
    1. If char is signed by default on your system:

    - values from 0 to 127 are stored as-is
    - values from 128 to 255 are stored as -128 to -1
    - EOF is stored as -1

    When you compare ch to EOF, its value is converted to int, and you compare values from -128 to 127 with -1.

    So the condition is true for the original value EOF (which is good) or for the value 255 (which is bad, because you stop reading the file too early).
    If the file contains a character with value 255, the file is read only up to that character (not including it).

    2. Second possibility: char is unsigned by default.
    - values from 0 to 255 are stored as-is
    - EOF is stored as 255

    When you compare the ch variable with EOF, the comparison is never true, and you enter an infinite loop.


    ---------
    You say it is working for you; that means that so far you have only used compilers where char is signed by default, and files without a 255 character in them.

    The right thing to do: declare ch as an int and check whether it is EOF. If not, the possible values are 0-255, which can be freely stored in a char array element, so there is no need to declare your storage arrays as int; char is enough.
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  13. #13
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Thanks a lot vart, I'll do that.

    Since I am sure I'll never have a txt file as big as 100MB, I'll try to make one program that either sorts completely in RAM or sorts using Adak's suggestion, depending on the arguments. I'm pretty sure I can easily write the in-RAM version, and I'm pretty sure I can easily code the first steps of Adak's suggestion, but the last step I'll leave to the sort command; I just tried it in Ubuntu and it works! (Each word needs to be on its own line, though, so the files need to be pre-processed anyway.) After that I'll try not to use the sort command, for the challenge, or for any other OS that may not have it.

    The only thing I need to figure out how to do efficiently is removing duplicates. But I think I can manage it with a final pass over the file: since the words will be in order, duplicates will be consecutive. I guess it's not hard to do with a bool switch and a couple of ifs or something.
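That final dedup pass can be even simpler than a bool switch: because the input is sorted, comparing each word with the previous one is enough. A sketch (write_unique is a made-up name for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Write each word once.  The array must be sorted, so duplicates
   are adjacent and a single previous-word comparison suffices.
   Returns the number of distinct words written. */
static size_t write_unique(const char *words[], size_t n, FILE *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || strcmp(words[i], words[i - 1]) != 0) {
            fprintf(out, "%s\n", words[i]);
            written++;
        }
    }
    return written;
}
```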

    Thanks!

  14. #14
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Sure okay, if you can give yourself an upper limit of 100MB then sorting in memory should be okay.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  15. #15
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Quote Originally Posted by vart View Post
    word[x] = '\0';
    Now you have a full word read.
    Allocate enough memory and copy the word there.
    After that you can reuse the word buffer to read further from the file.

    Do not forget to skip empty words, and add a check against buffer overrun (no more than 31 chars can be stored in your word buffer).
    Good idea. So I just reuse my word[] variable over and over as temporary word storage. Then I guess I can count the number of bytes using the value in 'x', and malloc just enough memory for the word (including the terminating '\0'). My other question would be: where will I get enough pointers to point to every single address returned by malloc?

    After that, I guess I could just sort the pointers, check for repeated words, and then write the output file.
    Last edited by samus250; 04-26-2008 at 03:37 PM. Reason: After that, I guess....

