Thread: text categorization

  1. #1
    Registered User
    Join Date
    Sep 2003
    Posts
    15

    text categorization

    Hi,

    I am trying to start the following project:

    A system that can look through text documents and be able to classify the document into 1 of 5 categories, similar to a newspaper, which places different articles in different sections like Sports, Finance, etc. Has anyone ever done something like this before? Can anyone advise on how to implement something like this in C? I appreciate any ideas.

    Thanks.

  2. #2
    Registered User linuxdude's Avatar
    Join Date
    Mar 2003
    Location
    Louisiana
    Posts
    926
    I would make an array of terms for each area, e.g.:
    Code:
    const char *sports[]={"saints","voodoo","hornets","lsu"};
    Of course the terms would all be lowercase. Then you would open the file, tolower each word, and compare it with all the arrays. If a match occurs, categorize the document under that array's area.
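
    The lowercase-and-compare step could be sketched roughly as below. The helper names lower_word and is_keyword are made up for illustration, and the keyword values are just the ones from this post stored in lowercase.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Example keyword list, stored in lowercase (values taken from the post). */
static const char *sports[] = { "saints", "voodoo", "hornets", "lsu" };
#define SPORTS_COUNT (sizeof sports / sizeof sports[0])

/* Hypothetical helper: lowercase a word in place. */
void lower_word(char *word)
{
    for (; *word != '\0'; word++)
        *word = (char)tolower((unsigned char)*word);
}

/* Hypothetical helper: return 1 if word is in the keyword list, else 0.
   The word is expected to be lowercase already. */
int is_keyword(const char *word, const char *const *list, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (strcmp(word, list[i]) == 0)
            return 1;
    return 0;
}
```

    With one such list per category, you would call lower_word on each word read from the file and then is_keyword against each list. Counting the hits per category and picking the highest count is probably more robust than categorizing on the first match.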

  3. #3
    Registered User
    Join Date
    Sep 2003
    Posts
    15
    Thanks for the reply. What's the best way to go through each word in the file and compare it? Also, do you know how I could dynamically add new keywords to each category? For example, for each document that is added to a category, I would like to add new keywords to that category, taken from that file.

    thanks.

  4. #4
    Registered User linuxdude's Avatar
    Join Date
    Mar 2003
    Location
    Louisiana
    Posts
    926
    To get a single word from a file and compare it, you could do something like this:
    Code:
    #include <ctype.h>  /* for isspace, ispunct, tolower */
    #include <stdio.h>
    #include <stdlib.h> /* for exit */
    #include <string.h>
    
    int getword(FILE *ffile, char *word, size_t size);
    
    int main(void){
         FILE *myfile;
         const char *example[4]={"saints","voodoo","hornets","lsu"}; /* all lowercase */
         char array[BUFSIZ];
         int x;
    
         myfile=fopen("temp","r");
         if(!myfile){
              fprintf(stderr,"Error opening file\n");
              exit(EXIT_FAILURE);
         }
         /* read one word at a time until the end of the file */
         while(getword(myfile,array,sizeof array)){
              for(x=0;x<4;x++){
                   if(!strcmp(example[x],array)){
                        printf("There is the word!!!!!!\n");
                   }
              }
         }
         printf("End of File\n");
         fclose(myfile);
         return 0;
    }
    
    /* Stores the next word, lowercased, in word.
       Returns 1 if a word was read, 0 at end of file. */
    int getword(FILE *ffile, char *word, size_t size){
          int c;
          size_t y=0;
          /* skip whitespace and punctuation in front of the word */
          while((c=fgetc(ffile))!=EOF && (isspace(c) || ispunct(c)))
               ;
          if(c==EOF)
               return 0;
          /* copy until the next separator; extra characters are dropped if the buffer is full */
          do{
               if(y<size-1)
                    word[y++]=(char)tolower(c);
          }while((c=fgetc(ffile))!=EOF && !isspace(c) && !ispunct(c));
          word[y]='\0';
          return 1;
    }
    I think that will work for a word. But once again I can't check
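
    On the question in post #3 about adding keywords dynamically: one possible approach is a list that grows with realloc. This is only a sketch. The struct keyword_list type and the list_add, list_contains and list_free functions are invented names, not something from a library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable keyword list for one category. */
struct keyword_list {
    char **words;    /* heap-allocated copies of the keywords */
    size_t count;    /* number of keywords stored */
    size_t capacity; /* number of slots allocated */
};

/* Return 1 if word is already in the list, else 0. */
int list_contains(const struct keyword_list *list, const char *word)
{
    size_t i;

    for (i = 0; i < list->count; i++)
        if (strcmp(list->words[i], word) == 0)
            return 1;
    return 0;
}

/* Add a copy of word unless it is already present.
   Returns 0 on success or duplicate, -1 if memory runs out. */
int list_add(struct keyword_list *list, const char *word)
{
    char *copy;

    if (list_contains(list, word))
        return 0;
    if (list->count == list->capacity) {
        /* double the capacity, starting from 8 slots */
        size_t new_cap = list->capacity ? list->capacity * 2 : 8;
        char **grown = realloc(list->words, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        list->words = grown;
        list->capacity = new_cap;
    }
    copy = malloc(strlen(word) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, word);
    list->words[list->count++] = copy;
    return 0;
}

/* Release everything owned by the list and reset it to empty. */
void list_free(struct keyword_list *list)
{
    size_t i;

    for (i = 0; i < list->count; i++)
        free(list->words[i]);
    free(list->words);
    list->words = NULL;
    list->count = list->capacity = 0;
}
```

    Start each category with struct keyword_list sports = { NULL, 0, 0 }; and call list_add for every new keyword you pull out of a document filed under that category. How you decide which words of a document count as keywords is a separate problem that this sketch does not address.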
