Word frequency and sorting

This is a discussion on Word frequency and sorting within the C Programming forums, part of the General Programming Boards category.

  1. #16
    CSharpener vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,484
    "t" is a non-standard extension; better to leave it out. Without it, the file is still opened in text mode.
    The first 90% of a project takes 90% of the time,
    the last 10% takes the other 90% of the time.

  2. #17
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    In my compiler, there is a global setup variable. It will be set for either binary or text mode as the default file open mode.

    So the 't' removes any ambiguity, and overrides the global setup variable, if that should differ.

    You can always try just the "w" mode and see what mode is your default mode.

  3. #18
    Registered User
    Join Date
    Dec 2008
    Posts
    24
    I think I put my question badly, sorry. Here is a case study to make it clearer:

    - I have a text file that is in a different folder from my read-text program. As far as I know right now, if I want the program to read or write the text file, I must move the file into the same folder as my program (right?). The problem is that I don't want to move the program; I just want the program to pick the folder (directory) where the text file is, and do the "read" or "write" as usual.

    So the point is: how do I make the program find a text file that is in another directory? Do I still use the fopen function and add something to it?

    - Second, can anyone give me a small example of the isspace function? I've searched the internet and unluckily didn't find any.

    - I'm having a problem with !feof. What does the "!" in front of feof mean, and could the feof function be helpful in my program?

    Thanks
    Last edited by zyxx_66; 12-15-2008 at 05:04 AM.

  4. #19
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Help for isspace():

    Code:
    isspace()   Character classification macro
    
     Syntax:
       int isspace(int c);
    
     Prototype in
     ctype.h
    
     Remarks:
    isspace is a macro that classifies ASCII coded integer values by table
    lookup.
    
    It is a predicate returning nonzero for true and 0 for false. It is defined
    only when isascii(c) is true or c is EOF.
    
    You can make this macro available as a function by undefining (#undef) it.
    
     Return Value:
    isspace returns nonzero if c is a space, tab, carriage return, new line,
    vertical tab, or formfeed (0x09 to 0x0D, 0x20).
    
     Portability:
    isspace is available on UNIX machines and is compatible with ANSI C.
    
    It is compatible with Kernighan and Ritchie.
    
     Example:
     #include <ctype.h>
     #include <stdio.h>
    
     int main(void)
     {
        char c = 'C';
        if (isspace((unsigned char)c))   /* cast: isspace expects an unsigned char value or EOF */
           printf("%c is white space\n",c);
        else
           printf("%c isn't white space\n",c);
        return 0;
     }


    To work with a file in a different directory from your program, you have to tell your program exactly what the directory and the filename are:

    Code:
    while((ch = fgetc(inFileHandle)) != EOF)   {   //ch must be an int, not a char
       //your isspace code goes in here
    }
    
    Instead of just:
    
    if((inFileHandle = fopen("FileName", "rt")) == NULL)  {
    //your file didn't open so print an error message and exit(1);
    //as shown in my previous post
    }
    
    You'll want:
    if((inFileHandle = fopen("yourDirectoryAndYourFilename", "rt")) == NULL)  {
       //your file didn't open, so print an error message, and exit(1);
    }
    Note the extra parentheses around the assignment: without them, == is evaluated before =, and inFileHandle gets the result of the comparison instead of the file pointer.
    Where inFileHandle is replaced with the name of your own file handle, of course.

  5. #20
    Registered User
    Join Date
    Dec 2008
    Posts
    24
    Thanks, Adak, for the information. I had actually read some references and found out how it works before seeing your post, but after reading your post I understand it better. Thanks!

    Well, I've finished the first part: the program now counts words, letters, and newlines (it really perked me up).

    But I think the next part is the hardest: sorting the words. So far my program can only give the total number of words; it cannot show each word and how many times it occurred.

    So, to start with, what should I know about this part? The first post suggested a tree structure, but I don't really know it yet (more reading for me, haha).

    Is this part going to be long? Thanks for all the posts so far; I'm still going to need all of your help.

  6. #21
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    This part will not be long from me!! I've only taken one class in programming, and we didn't learn about trees.

    When I wanted to write a program to analyze common words in files (books, actually), here's what I did:

    After the word was gathered from the file, letter by letter, and counted, then I wrote the word out to my own file, with a newline at the end. So I wound up with something like this:

    wordone
    wordtwo
    wordthree
    wordfour
    etc.

    Now this file got very large, and I didn't have enough memory to sort it, without using a mergesort or quicksort, with a lot of temporary files that all had to be managed. Usually, there were 16 temporary files that had to be merged, for every big word file.

    I found the easiest way to do it was just to use the system command
    "sort filename >newfilename". Two things you need to watch out for:
    (this odd ^^^^^^^^^ spacing is not a typo. It is exact.)

    For Old Turbo C/C++ DOS compilers:

    1) If your current directory is not "in the path" then you may get an error. If that happens you should just do it from the root directory: "C:\". That is always "in the path".

    2) You may need to leave your program, and not shell from inside it, to do the sort. This:

    system("sort filename >newfilename");

    may give you the "Too big for memory" error. If so, then you'll have to open a separate console window to increase the memory.

    Windows has a *wonderful* sorter - with dedicated microcode and system resources available to it. It will run through a big sort job, like s***t through a goose, trust me.

  7. #22
    Registered User
    Join Date
    Dec 2008
    Posts
    24
    Hmm, so is this sorting done outside the C program? Like using a Windows sorter to do the sorting job?

    (I understand editing the text file so that it has one word per line (am I wrong?), but I don't understand how the files got so large (aren't these files only a matter of KB?) or how the sorting mechanism you propose works.)

    Well, what I actually had in mind is that the program counts the words and their frequency of occurrence, and sorts by frequency.

    So, again, with my current code (I've used isspace and if ... else if), how exactly do I group the words and count their frequencies? What's the basic function used?

    Thanks

  8. #23
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    OK. I was collecting data on word usage, in books, on-line blogs, etc., so there were a *........load* of words.

    The sorting was needed to keep the counting and cataloging easy. Right you are, one word per line (otherwise the sort on that line is meaningless).

    The sort I'm speaking of is a command built right into the operating system. Unix will sort, and so will DOS, Windows, and all versions of Linux; they all have a sort command.

    After the sorting, then your program just counts them up:

    a
    a
    a
    a
    a
    add
    add
    again
    against
    aid
    aid

    5 a's, 2 add's, 1 again, 1 against, 2 aid's, etc., and of course, you're counting how many words you have in total. Now you get your percentage for every word, after you have your counting done. Just like a batting average in baseball: in my example "add" has a rating of 0.1818 (2 instances divided by 11 words in total), or 18.18%.

    There is no "basic function" to count words and note their frequency. Well, I'm sure the NSA has "something" for this, but C does not - you roll your own function.

    I don't know about WindowsXP or Vista, but when I looked this up in Windows2000, it stated that the OS had built in microcode and resources that would be readily available and guaranteed, for sorting. Obviously, it had something, because it far outclassed both the iterative and the recursive style of Quicksort, regardless of how I tried to trip it up by rearranging the data it was sorting.

  9. #24
    Registered User
    Join Date
    Dec 2008
    Posts
    24
    I see, so the sorting is done by the OS? I would gladly use that, but I guess in this assignment we must write the sorting ourselves. Do you know about the tree structure (the one the first post was talking about)?

    How do I put the words I got from the previous functions into a structure? I get confused when reading the literature about structures.

    Or do I not need the previous functions with isspace, and should I write new functions that link to the tree structure instead?

  10. #25
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    No, I don't know squat about tree structures, except the kind you find in depth-first searches, in chess and other puzzles. I don't call them "trees", but they are a search tree, although a specialized one like a binary tree, etc.

    OK, structures are just a group of data, that logically fit together: Your name, age, and address might all go together on a student record for you, see?

    First you define a structure like you want it to be:
    Code:
    struct students  {   //this is our structure definition
       char name[30];
       int age;
       char addy1[30];
       char addy2[30];
    };
    
    struct students student[100];  //this is our structure declaration: an array of 100 structures
    
    /* for a word, the struct could be just: */
    struct allwords  {
    
      char word[22];
      int freq;
    
    };
    
    struct allwords words[100];
    And here's where all the options come in. You can "hard code" the size of the word array, as I've shown, or you can make it more flexible, but slightly more complex, by using a pointer to an array of pointers there.

    Not recommended for beginners, but definitely an improvement. Likewise, you can "hard code" the size of the array of structs as I've done here, or malloc() more memory only as it's needed.

    Hard coding is the easiest way, but it's not the Formula 1 racing car way.

    Another way is to use corresponding arrays. One for words[], and one for frequency freq[]. No structure is needed here. The word in words[0], will have the frequency in freq[0], so they always correspond.

    When you need to sort them, you just sort the words, and then swap the frequency array, whenever you swap the word array. (Or do an index sort). Either way, Ez smeazy. Again, you have the choice of "hard coding" the size of these arrays, or of using malloc() (or calloc() ), to set up memory for them, as needed.

    As long as you don't have more things to be sorted, than you have memory to sort with, you'll be fine.

    Any idea on how many words you'll be trying to count and sort?

  11. #26
    Registered User
    Join Date
    Dec 2008
    Posts
    24
    Uhm, the number of words I'll be counting and sorting depends on the text file I input. Like I said, I'll be feeding in the words from the previous function (the one that reads the text file). Any idea how to do that?

    And what do you mean by "hard code" like you did, versus using pointers? I didn't actually get it. (Sorry for being a dimwit.)

    As for malloc, it allocates memory for the program/function, right? So what is the benefit of doing that? Thanks.

    Oh, just out of curiosity, how far am I from finishing this program with my current knowledge of C (poor, I guess)?

  12. #27
    Registered User
    Join Date
    May 2008
    Location
    India
    Posts
    30
    If you want to do the sorting and grouping inside the program, I would suggest taking a look at a tree structure or a map. Both have already been mentioned in the previous posts. I am not sure how to implement a map, but for a tree implementation you should know about structures.
    If you use a tree, you won't need any temporary file or a separate sort to group the words. To be more precise, go for a binary tree.

  13. #28
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I'm not sure how far you are from finishing this program, because I'm not sure what the program requirements are yet.

    I've asked "How many words are we talking about counting and sorting?", and you say "that depends on the text file I input".

    No kidding, huh? What a stunner!

    "Hard code" means that the size of the array is fixed when the program is compiled, and cannot be changed afterwards. The benefit of malloc (or calloc) is that the size of the array can be chosen at the time you run the program. The downside is that the complexity increases slightly as well.

    I'm going to recommend, that you consider using another language. Ruby or Python would be much easier. Both are higher level languages, and have parts to them (like Dictionaries) and memory management, that might make your job much easier. They're both free, have good tutorials, and have a large help forum available, as well. With either language, you'll be done with your program faster than you would in C.

    You can do what you want to do, in C, but you also have to deal with all the little details that are so important in a mid-level language like C. That's very difficult to do if you are a beginner, trying to write a robust program, that may have to deal with millions of words.

  14. #29
    Registered User
    Join Date
    Dec 2008
    Posts
    24
    Sorry for not giving the specifications of my problem earlier but here it goes:

    i) Read in a text from a file. As we read the file, break the file into words.
    ii) Count the number of occurrences of each word.
    iii) Rank the words in order of frequency.

    And it kindly gives some tips on the basic logic of the program :
    a) Define a large array of strings which will contain all our words. Blank them all.
    b) Define an array which is the same size to count the number of occurrences.
    c) Set a counter to the number of words found so far (0).
    d) Open a file and move along it looking for words.
    e) Every time we find a new word:
       i) Move along the array from 0 to the number of words found so far, comparing it with the other words to check if it's new.
       ii) If the word is new then enter it into our array of strings at the position indicated by our counter for the number of words. Set the appropriate element of our occurrence array to 1. Increment the number of words found so far.
       iii) If the word is not new then increment the appropriate element of our occurrence array by 1.
    f) Print the final total of words and number of occurrences.
    g) Use the unix command "sort" to sort the list into order

    Where I am in this process: I have already made a program that can count words, but the output is only the total (for example, Total Words : 5). It cannot show which specific words occurred or how many times each one occurred.

    And I appreciate you recommending Ruby or Python, but I am obligated to use C only.

  15. #30
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    OK, this is small to medium in scale, and easy enough to do in C, for a beginner.

    Please post your most recent version of the program, and include in it the two arrays you'll be using:

    one of strings:
    char words[1000][25], and the other an int array: tally[1000], freq[1000], or whatever you want to call it. Try it with your compiler and make sure these sizes are 1) not too big (the compiler should complain), and 2) not too small. Sort of the Goldilocks size.

    And let us know what you're stuck on. If you're on Unix, you'll need to check the man page for info on using its sort command. You're actually using Unix? Seems odd, but nothing wrong with that.
