# Developing SpamFilter

• 09-25-2007
andhikacandra
I'm developing a spam filter for my final project, and I'm still confused about two things:
1. Extracting an email message into a collection of words: is there an algorithm for that?
2. I'm building the filter with the "Linear Least Squares Fit" (LLSF) method. Does anybody have an algorithm diagram or something like that?

Thx..
• 09-25-2007
matsp
1. Yes, there are probably several, but the simplest one is to scan the text for "delimiters" (whitespace, punctuation, etc.); anything between two delimiters is a word.

2. Did you even TRY to google for that?
Second hit is:
http://en.wikipedia.org/wiki/Linear_least_squares
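
The delimiter scan from point 1 can be sketched roughly as follows. This is a minimal Python illustration, not a finished tokenizer: the name `split_words` and the choice of delimiters (ASCII whitespace and punctuation) are assumptions made for the example.

```python
import string

# Assumption: whitespace and ASCII punctuation are the delimiters.
DELIMITERS = set(string.whitespace) | set(string.punctuation)

def split_words(text):
    """Return the runs of non-delimiter characters in text, in order."""
    words = []
    current = []
    for ch in text:
        if ch in DELIMITERS:
            # A delimiter ends the word collected so far, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last word when the text does not end with a delimiter.
    if current:
        words.append("".join(current))
    return words
```

Note that treating every punctuation mark as a delimiter also splits things like "don't" into two pieces, so a real filter would probably need finer rules.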

--
Mats
• 10-01-2007
andhikacandra
Ok thx..
1. For extraction, what about common words like "they", "would", "will", etc.? Should I leave those words out of the collection?
2. I know that LLSF uses matrices, but what about an algorithm diagram? What do the training algorithm and the categorization step look like?

Thx..
• 10-02-2007
matsp
Building a list of the 100-500 most common words would be a good start. This should be done on a statistical basis, perhaps by analyzing thousands of non-spam and spam e-mails from various types of sources (mailing lists, intra- and inter-office business e-mail, and personal mail, to give a few different categories).
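
A rough Python sketch of building such a list: it ranks words by the number of messages they appear in and then drops them from a word collection. The function names, the whitespace-only splitting, and the default `limit` of 100 are assumptions for illustration, and a real list would be built from a far larger corpus.

```python
from collections import Counter

def most_common_words(messages, limit=100):
    """Return up to `limit` words, ranked by how many messages contain them."""
    doc_freq = Counter()
    for text in messages:
        # Count each word once per message (document frequency).
        doc_freq.update(set(text.lower().split()))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]

def remove_common(words, common):
    """Drop every word that appears in the common-word list (case-insensitive)."""
    common = set(common)
    return [w for w in words if w.lower() not in common]
```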

As for the best algorithm for figuring out what's what, I have no idea. I'm sure there are dozens of different ways to decide whether an e-mail is likely to be spam or non-spam.

--
Mats
• 10-02-2007
brewbuck
Quote:

Originally Posted by andhikacandra
1. Extracting an email message into a collection of words: is there an algorithm for that?

I've written spam filtering software and the tokenization of email can be very difficult. You generally want to keep all email addresses intact as single tokens while breaking URLs into host and path components. You want to separate punctuation from other characters, and probably convert everything to lower case.

Include all the header information; it is very useful.
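
The tokenization rules above (addresses kept whole, URLs split into host and path, punctuation separated, everything lower-cased) could be sketched like this in Python. The regular expressions are loose approximations chosen for the example, not complete grammars for addresses or URLs, and the name `tokenize` is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: these patterns are deliberately simple approximations.
TOKEN_RE = re.compile(
    r"(?P<url>https?://[^\s<>\"]*[^\s<>\".,;:!?)])"   # URL, no trailing punctuation
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"       # address kept as one token
    r"|(?P<word>\w+)"                                 # ordinary word
    r"|(?P<punct>[^\w\s])"                            # single punctuation mark
)

def tokenize(text):
    """Split text into lower-case tokens following the rules described above."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        if match.lastgroup == "url":
            # Break the URL into its host and path components.
            parts = urlsplit(match.group())
            tokens.append(parts.netloc)
            if parts.path and parts.path != "/":
                tokens.append(parts.path)
        else:
            tokens.append(match.group())
    return tokens
```

The same function can be run over the raw header lines as well as the body, so header fields end up as tokens too.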

As for how exactly you DO the tokenization in C#, I have no idea.

Quote:

2. I'm building the filter with the "Linear Least Squares Fit" (LLSF) method. Does anybody have an algorithm diagram or something like that?
How do you plan to apply this method to spam categorization? After a few moments of thinking I can't see how it could be used.
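
For reference, this is how LLSF is usually framed for text categorization in general; it is a sketch of the standard formulation, not something taken from this thread. The symbols are assumptions for the sketch: A is the term-by-message matrix of the training set (one column of word weights per message), B is the category-by-message matrix (one column per message, marking spam or non-spam), and x is the word-weight vector of a new message.

```latex
% Training: find the linear map F that best reproduces the known categories.
F_{LS} = \arg\min_{F} \, \lVert F A - B \rVert_F^{2}
% A solution is given by the Moore-Penrose pseudoinverse A^{+} of A:
F_{LS} = B A^{+}
% Categorization: map the new message to a vector of category scores.
y = F_{LS} \, x
```

The message would then be labeled spam when the spam entry of y exceeds some threshold, which has to be chosen separately.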

The systems I have worked with have been Naive Bayes filters, and neural networks with clustered inputs, with the clustering determined by a class-conditional-distribution EM method. The neural network method was significantly better overall. See the following paper: Learning Spam: Simple Techniques For Freely-Available Software
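
To give an idea of what a Naive Bayes filter involves, here is a minimal Python sketch. It is a generic multinomial Naive Bayes with add-one smoothing, not the system described above or in the paper; the class name, the "spam"/"ham" labels, and the smoothing choices are all assumptions for illustration.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Multinomial Naive Bayes over word tokens with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        """Record one training message with its label ("spam" or "ham")."""
        self.word_counts[label].update(tokens)
        self.message_counts[label] += 1

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Smoothed log prior, so an unseen class never gives log(0).
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        # One extra slot in the denominator stands for words never seen in training.
        denominator = total_words + len(vocab) + 1
        for tok in tokens:
            score += math.log((self.word_counts[label][tok] + 1) / denominator)
        return score

    def is_spam(self, tokens):
        """Return True when the spam score beats the non-spam score."""
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

Usage is just a `train(tokens, label)` call for each labeled message, followed by `is_spam(tokens)` on new mail. Working with sums of logarithms instead of products of probabilities avoids underflow on long messages.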