Sorting a file with a lot of words.

This is a discussion on Sorting a file with a lot of words. within the C Programming forums, part of the General Programming Boards category; Originally Posted by vart word[x] = '

Thread: Sorting a file with a lot of words.

'; now you have a full word read allocate enough memory and copy word ...

  1. #16
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Quote Originally Posted by vart View Post
    word[x] = '\0';
    now you have a full word read
    allocate enough memory and copy word there
    After that you can use the word buffer again to read futher from file

    do not forget to skip empty words and add check agains memory overrun (no more that 31 char is possible to store in your word buffer)
    Good idea. So I just use my word[] var over and over again, for a temporary word storage. Then I guess I could count the number of bytes by using the value in 'x', and just allocate with malloc enough memory for the word (including the NULL). My other question would be... from where will I get enough pointers to point to every single word address returned by malloc?

    After that, I guess I could just sort the pointers, check for any repeated words, and then make the output file.
    Last edited by samus250; 04-26-2008 at 03:37 PM. Reason: After that, I guess....

  2. #17
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Quote Originally Posted by samus250 View Post
    My other question would be... from where will I get enough pointers to point to every single word address returned by malloc?

    After that, I guess I could just sort the pointers, check for any repeated words, and then make the output file.
    You could make an "array" by keeping a char** where you've malloc'ed enough storage for (say) 1000 char *'s. You can then access these with array notation. When you run out, use realloc to give you 2000, then 4000, and so on.

  3. #18
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    int main(int argc, char *argv[])
    {
    	FILE *read_file;
    	char ch;
    	int x = 0, i; 
    	char word[32]; // word storage
    	char *address;
    	
    	read_file = fopen(argv[1],"r"); // open file where specified
    	if(read_file == NULL) {
    		printf("Error reading file\n");
    		return 1;
    	}
    
    	while( (ch = fgetc(read_file)) != EOF) {
    		if(isalpha(ch)) {
    			x = 0;
    			while(isalpha(ch)) {
    				word[x] = ch;
    				x++;
    				ch = fgetc(read_file);
    			}
    			word[x] = '\0';
    			address = (char *)malloc(x);
    
    			for(i=0 ; i<x ; i++) { // copy word to allocated area
    				*(address+i) = word[i];
    			}
    			
    			// print the word, just for now.
    			for(i=0 ; i<=x ; i++) {
    				if(*(address+i) == '\0')	
    					putchar('\n');
    				else
    					printf("&#37;c", *(address+i));
    			}
    
    
    		} else {
    			// A non alphabetical char was encountered.
    			// I suppose that I allocate more memory for the next word here
    			// not sure how to do it
    			// allocate_memory()
    			continue;
    		}
    	}
    	
    	// sort_words();
    	// write_to_file();
    	
    	return 0;
    }
    This code works, it just prints the words the file has, one by line, but it prints them from the allocated data. Now, from where am I going to get all the pointers so that I could sort them?

    And one other thing, take away the 'x=0;' assignment in line 20. Something funny happens, it outputs one word every line, but it outputs the word before it too. Example:
    Code:
    word1
    word1word2
    word1word2word3
    Edit:
    I found out why hehe. The word[] buffer will overflow anyways.. Since the x never initialize, then word will contain both words, the current one and the one before it.
    Last edited by samus250; 04-26-2008 at 04:05 PM.

  4. #19
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Quote Originally Posted by tabstop View Post
    You could make an "array" by keeping a char** where you've malloc'ed enough storage for (say) 1000 char *'s. You can then access these with array notation. When you run out, use realloc to give you 2000, then 4000, and so on.
    Ummmmmm I don't understand very well. So you are saying that I should first declare say 1000 pointers:
    Code:
    char *addresses[1000];
    int a; // the counter
    And every time I malloc I would:
    Code:
    addresses[a] = (char *)malloc(x);
    a++;
    And:
    Code:
    if(a >= 1000)
        reallocate for more space
    Is that what you mean? If it is, how do I reallocate an array? I've never done that before, I don't know how reallocating arrays work.

    Thanks.

  5. #20
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Or do you mean something more like:

    Code:
    char **array;
    char *adresses;
    
    addresses = (char *)malloc((sizeof(char *)) * 1000);
    array = &addresses; // and never modify 'array' since it is the start of all the pointers?
    ? I'm confused

  6. #21
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Quote Originally Posted by samus250 View Post
    Or do you mean something more like:

    Code:
    char **array;
    char *adresses;
    
    addresses = (char *)malloc((sizeof(char *)) * 1000);
    array = &addresses; // and never modify 'array' since it is the start of all the pointers?
    ? I'm confused
    Except that you don't need the array variable, yes. You now have room for 1000 char * (except when malloc fails, of course) at addresses[0] through addresses[999]. If you go over that, use realloc to get more. (You should be able to find 349 -- that is, some large number -- posts about realloc here on the board if you search.)

  7. #22
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by samus250 View Post
    Or do you mean something more like:

    Code:
    char **array;
    char *adresses;
    
    addresses = (char *)malloc((sizeof(char *)) * 1000);
    array = &addresses; // and never modify 'array' since it is the start of all the pointers?
    ? I'm confused
    You can make this as complicated and or elegant as you want, but there is no need to malloc a dam thing here, to simply do the job.

    Use a static 2D array, and you don't give a crap about wasting some memory - you're not going to need to store the whole bloody text in RAM, anyway.

    (If you do want to store the entire text in RAM, then you WILL want an array of pointers, otherwise, you don't need them).

    Store a "chunk", sort it, write it out, and repeat. No malloc, no array of pointers (lovely as they are).

    You can do this rather simply, and I'd recommend it.

  8. #23
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    OK thanks, let me experiment with that complicated malloc function.

    Just one thing. In this line, does addresses need to be a pointer to pointer, or just a normal pointer?:
    Code:
    addresses = (char *)malloc((sizeof(char *)) * 1000);
    And after that how do I access all those pointers? Would it be something like:
    Code:
    // array notation
    *address;
    *(address+1);
    *(address+2);

  9. #24
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Quote Originally Posted by Adak View Post
    You can make this as complicated and or elegant as you want, but there is no need to malloc a dam thing here, to simply do the job.

    Use a static 2D array, and you don't give a crap about wasting some memory - you're not going to need to store the whole bloody text in RAM, anyway.

    (If you do want to store the entire text in RAM, then you WILL want an array of pointers, otherwise, you don't need them).

    Store a "chunk", sort it, write it out, and repeat. No malloc, no array of pointers (lovely as they are).

    You can do this rather simply, and I'd recommend it.
    You say that I should get like 1000 words, sort them, store the chunk, get the other thousands of words and store the sorted chunks and then use mergesort? So if its 20,000 words I'll write 20 different files and sort them all to one, then clean them all up when finished?

  10. #25
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by samus250 View Post
    You say that I should get like 1000 words, sort them, store the chunk, get the other thousands of words and store the sorted chunks and then use mergesort? So if its 20,000 words I'll write 20 different files and sort them all to one, then clean them all up when finished?
    Yeah, but make it 5,000 - 10,000 words at a time, to help limit the number of files, and use the resources of the memory you have.

    Sort them using Quicksort or whatever fast sorter you prefer.

    Write it to the disk as a file.

    Repeat above until the input file, is empty.

    THEN you'll use mergesort on the files you've written out, to simply and efficiently merge them all together, in sorted order.

    You can rather easily write out the logic for the last sort, yourself. The only part that is remotely difficult is just deciding on the filenames protocol you want to use, as I mentioned previously, so your program knows what filenames to look for, and when to begin sorting them together.

    You'll have
    Code:
    level 1 #1 \
               level 2, #1
    level 1 #2 /
    etc.

    Not simpleton, but relatively straightforward, and elegant as well, in a 4x4 kind of way.

    This is not much of a programming method to solving the problem, but another approach, even simpler.

    Some OS's (Like Windows 2000), have special micro-code and guaranteed resources available for sorting. It allows your PC to sort like a banshee from hell, if it's there! You will NEVER write a program remotely able to line sort like it.

    So, using that resource, just do this:

    Write all the words from the text file, into the one word per line format you want to get.

    Now let the OS go to work:
    Code:
    system("sort UnsortedFilename >sortedFilename"); //note the one space before the >, and no space afterward
    And just sit back in awe - it's a wonder to behold. It will use all it's available memory, and all it's hard disk available space, if needed.

    You'll have your sorted list of words, and you may find that the OS can even delete duplicates, but I'm not sure about the latter. It's certainly easy enough to delete duplicates after the name have all been sorted.

    Ez as pie!
    Last edited by Adak; 04-26-2008 at 10:25 PM.

  11. #26
    Banned master5001's Avatar
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    If the average word is lets say 5 letters long, and the average whitespace is lets say one character wide, then theoretically, with 20,000 words you are only talking about like maybe 117k. I appreciate Adak's enthusiasm, but it would probably be advantageous to do something this small in memory. Since file IO is the slowest part of your sort algorithm now.

  12. #27
    CSharpener vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,424
    Quote Originally Posted by samus250 View Post
    ... But, I've made other programs like this and it works. I thought it might be because I am evaluating directly what fgetc returns and assign its value to ch seperately.
    No...
    Let's see what happens...

    fgetc returns 0-255 value for char read from file or -1 to indicate EOF condition
    you store this value in char var ch

    there are 2 possibilities
    1. if char is by default signed on your system:

    - values from 0 to 127 are atored as is
    - values from 128 to 255 stored as -128 to -1
    - EOF stored as -1

    when you go to compare the ch to EOF its value is casted to int and you compare values from
    -128 to 127 with -1

    so the condition is true for original value EOF (which is good) or 255 (which is bad - because you stop reading file too early)
    If file contains 255 character - it is read only up to this char (not including)

    2. second possibility - char is unsigned by default
    - values from 0 to 255 are stored as is
    - EOF is stored as 255

    when you go to compare ch var with EOF you never get the comparison as true - and enter the infinite loop


    ---------
    You say it is working for you, it means that for now you worked with compilers that has signed char by default and files without 255 char in it...

    Right thing to do - declare ch as int, check if it is EOF, if not - possible values are 0-255 that can be freely stored in the char array member, so no need to declare your storage arrays as int - char is enough
    The first 90% of a project takes 90% of the time,
    the last 10% takes the other 90% of the time.

  13. #28
    Registered User
    Join Date
    Jan 2008
    Posts
    182
    Thanks a lot vart, I'll do that.

    Since I am sure that I'll never have a txt file 100MB big, So I'll try to make one program that sorts completely of RAM or sorts using Adak's suggestion, depending on the arguments. I'm pretty sure I can easily make the one that sorts completely off RAM, and I'm pretty sure that I can easily code the first 4 steps of Adak's suggestion, but the last step I'll leave it to the sort command, I just tried it in Ubuntu and it works! (Words need to be in their own line though, so the files need to be pre-processed anyways). After that I'll try to not use the sort command for the challenge, or for any other OS that may not have it.

    The only thing I need to figure out of how to do efficiently, is removing duplicates. But I think I can make it with a final pass of the file. Since the words will be in order, duplicates will be consecutive. I guess it's not a hard thing to do with bool switches and a couple of if's or something.

    Thanks!

  14. #29
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,261
    Sure okay, if you can give yourself an upper limit of 100MB then sorting in memory should be okay.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Page 2 of 2 FirstFirst 12
Popular pages Recent additions subscribe to a feed

Similar Threads

  1. A development process
    By Noir in forum C Programming
    Replies: 37
    Last Post: 07-10-2011, 10:39 PM
  2. Newbie homework help
    By fossage in forum C Programming
    Replies: 3
    Last Post: 04-30-2009, 04:27 PM
  3. File transfer- the file sometimes not full transferred
    By shu_fei86 in forum C# Programming
    Replies: 13
    Last Post: 03-13-2009, 12:44 PM
  4. Basic text file encoder
    By Abda92 in forum C Programming
    Replies: 15
    Last Post: 05-22-2007, 01:19 PM
  5. Encryption program
    By zeiffelz in forum C Programming
    Replies: 1
    Last Post: 06-15-2005, 03:39 AM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21