Thread: Developing SpamFilter

  1. #1
    Registered User
    Join Date
    Sep 2007
    Posts
    4

    Developing SpamFilter

    I am developing a SpamFilter for my final project, and I'm still confused about two things:
    1. Extracting an email message into a collection of words: is there any algorithm for that?
    2. I am developing the SpamFilter with the "Linear Least Squares Fit" method. Does anybody have an algorithm diagram or something like that?

    Thx..

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    1. Yes - there are probably several, but the simplest one is to scan the text for "delimiters" (whitespace, punctuation, etc.); anything in between two delimiters is a word.

    2. Did you even TRY to google for that?
    Second hit is:
    http://en.wikipedia.org/wiki/Linear_least_squares
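
The delimiter scan from point 1 could look roughly like the sketch below. It is only an illustration: the name `split_words` and the choice of ASCII whitespace plus ASCII punctuation as the delimiter set are assumptions, not something fixed by the thread.

```python
import string

# Assumption: delimiters are ASCII whitespace plus ASCII punctuation.
DELIMITERS = set(string.whitespace) | set(string.punctuation)


def split_words(text):
    """Scan the text once; anything between two delimiters is a word."""
    words = []
    current = []
    for ch in text:
        if ch in DELIMITERS:
            # A delimiter closes the word collected so far, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last word when the text does not end in a delimiter.
    if current:
        words.append("".join(current))
    return words
```

Note that with this delimiter set an apostrophe also splits, so "don't" comes out as two words; whether that matters depends on the filter.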

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Sep 2007
    Posts
    4
    Ok thx..
    1. For extraction, what about common words like "they", "would", "will", etc.? Should I leave those words out of the collection?
    2. I know that LLSF uses matrices, but what about an algorithm diagram? What do the training algorithm and the categorization step look like?
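
One possible reading of the training and categorization steps in point 2, as a minimal sketch only: each message becomes a row of word counts, training solves a least-squares problem for one weight per word, and categorization is a dot product compared against a threshold. The function names, the bias column, the 0.5 threshold, and the small ridge term (added so the normal equations are always solvable) are all assumptions, not part of any reference LLSF description.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def vectorize(words, vocab):
    """Word-count vector with a constant 1.0 appended as a bias term."""
    counts = [0.0] * (len(vocab) + 1)
    for w in words:
        if w in vocab:
            counts[vocab[w]] += 1.0
    counts[-1] = 1.0
    return counts


def train(docs, labels, ridge=1e-3):
    """Training: minimise ||X w - y||^2 (+ ridge * ||w||^2, an assumption)."""
    vocab = {}
    for words in docs:
        for w in words:
            vocab.setdefault(w, len(vocab))
    rows = [vectorize(words, vocab) for words in docs]
    d = len(vocab) + 1
    # Normal equations: (X^T X + ridge * I) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d):
        xtx[i][i] += ridge
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(d)]
    return vocab, solve(xtx, xty)


def score(words, vocab, weights):
    """Categorization: dot product of the message vector and the weights."""
    return sum(a * b for a, b in zip(vectorize(words, vocab), weights))
```

With labels of 1.0 for spam and 0.0 for non-spam, a message would be called spam when `score(...)` is above 0.5. For a real vocabulary the dense matrices here get large quickly, so this is for understanding the flow rather than for production use.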

    Thx..

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Building a list of the 100-500 most common words would be a good start. This should be done on a statistical basis, perhaps based on analyzing thousands of non-spam and spam e-mails from various types of sources (mailing lists, business (intra- & inter-office) e-mail, and personal mail, to give a few different categories).
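
As a rough sketch of how such a list could be built, the code below counts in how many messages each word appears and keeps the most frequent ones. The function names, the whitespace-only splitting, and the default limit of 100 are assumptions for illustration; a real corpus would be the mixed spam and non-spam mail described above.

```python
from collections import Counter


def most_common_words(messages, limit=100):
    """Return the `limit` words that appear in the most messages."""
    doc_freq = Counter()
    for text in messages:
        # Count each word once per message so one long mail cannot dominate.
        doc_freq.update(set(text.lower().split()))
    return [word for word, _ in doc_freq.most_common(limit)]


def drop_common(words, common):
    """Remove the common words from an already extracted word list."""
    common = set(common)
    return [w for w in words if w.lower() not in common]
```

Usage would be something like `common = most_common_words(all_mail, 200)` once over the corpus, then `drop_common(words, common)` on each message before it goes to the classifier.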

    As to the best algorithm for figuring out what's what - I have no idea. I'm sure there are dozens of different ways to decide whether an e-mail is likely to belong to the spam or non-spam category.

    --
    Mats

  5. #5
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by andhikacandra View Post
    1. Extracting an email message into a collection of words: is there any algorithm for that?
    I've written spam filtering software and the tokenization of email can be very difficult. You generally want to keep all email addresses intact as single tokens while breaking URLs into host and path components. You want to separate punctuation from other characters, and probably convert everything to lower case.
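
The rules in the paragraph above (addresses kept whole, URLs broken into host and path, punctuation separated, everything lower-cased) could be sketched with one regular expression. This is not the code from any actual filter: the patterns are deliberately loose, `tokenize` is a made-up name, only `http`/`https` URLs are recognised, and the URL query string is simply dropped.

```python
import re
from urllib.parse import urlsplit

TOKEN_RE = re.compile(
    r"""
    (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)              # address kept whole
  | (?P<url>https?://[^\s<>"']*[^\s<>"'.,;:!?)])         # URL, split below
  | (?P<word>\w+)                                        # letters/digits/_
  | (?P<punct>[^\w\s])                                   # one mark at a time
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Lower-case the text and split it into filter tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        value = match.group()
        if match.lastgroup == "url":
            # Break the URL into host and path components.
            parts = urlsplit(value)
            tokens.append(parts.netloc)
            if parts.path and parts.path != "/":
                tokens.append(parts.path)
        else:
            tokens.append(value)
    return tokens
```

The alternatives are tried in order, so an address or URL wins over the plain word rule at the same position, and sentence punctuation right after a URL is left to be picked up as its own token.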

    Include all the header information; it is very useful.

    As for how exactly you DO the tokenization in C#, I have no idea.

    2. I am developing the SpamFilter with the "Linear Least Squares Fit" method. Does anybody have an algorithm diagram or something like that?
    How do you plan to apply this method to spam categorization? After a few moments of thinking I can't see how it could be used.

    The systems I have worked with have been Naive Bayes filters, and neural networks with clustered inputs, with the clustering determined by a class-conditional-distribution EM method. The neural network method was significantly better overall. See the following paper: Learning Spam: Simple Techniques For Freely-Available Software
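
For readers who have not seen one, a Naive Bayes filter of the kind mentioned above can be sketched in a few lines. This is a generic textbook version with add-one smoothing, not the system described in the post or the paper; the class name, the "spam"/"ham" labels, and the assumption that both classes have at least one training message are all illustrative.

```python
import math
from collections import Counter


class NaiveBayes:
    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        """Record one labelled message given as a list of tokens."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def log_score(self, tokens, label):
        """Log of prior times token likelihoods for one class."""
        counts = self.word_counts[label]
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values())
        # Assumes each class has been trained on at least one message.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            # Add-one smoothing so an unseen token never gives log(0).
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        return score

    def classify(self, tokens):
        """Pick the class with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(tokens, label))
```

Working in log space avoids underflow on long messages, since multiplying many small probabilities directly would quickly round to zero.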
