C Board  

09-25-2007, 07:38 AM   #1
Registered User
 
Join Date: Sep 2007
Posts: 4
Developing SpamFilter

I'm developing a spam filter for my final project, and I'm still confused about two things:
1. Extracting an email message into a collection of words: is there an algorithm for that?
2. I'm building the spam filter with the "Linear Least Squares Fit" method. Does anybody have an algorithm diagram or something like that?

Thx..
andhikacandra
09-25-2007, 07:43 AM   #2
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
1. Yes - there are probably several, but the simplest one is to scan the text for "delimiters" (whitespace, punctuation, etc.); anything in between two delimiters is a word.

2. Did you even TRY to google for that?
Second hit is:
http://en.wikipedia.org/wiki/Linear_least_squares
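
The delimiter scan from point 1 can be sketched roughly as follows. This is only a minimal illustration in Python (not C#), and it assumes that whitespace and ASCII punctuation are the delimiters; `DELIMITERS` and `split_words` are made-up names, not part of any library.

```python
import string

# Assumption: whitespace and ASCII punctuation act as the delimiters.
DELIMITERS = set(string.whitespace) | set(string.punctuation)

def split_words(text):
    """Scan the text once; anything between two delimiters is a word."""
    words, current = [], []
    for ch in text:
        if ch in DELIMITERS:
            # A delimiter closes the word collected so far, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last word when the text does not end in a delimiter.
    if current:
        words.append("".join(current))
    return words
```

For example, `split_words("Buy now!!! Cheap meds, click here.")` gives the six words with all punctuation dropped. Real mail would need more care than this (see the later posts about addresses and URLs).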

--
Mats
__________________
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
matsp
10-01-2007, 07:11 PM   #3
Registered User
 
Join Date: Sep 2007
Posts: 4
OK, thanks.
1. For extraction, what about common words like "they", "would", "will", etc.? Should I leave those words out of the collection?
2. I know that LLSF uses a matrix, but what about the algorithm diagram? What do the training algorithm and the categorization step look like?

Thx..
andhikacandra
10-02-2007, 02:10 AM   #4
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
Building a list of the 100-500 most common words would be a good start. This should be done on a statistical basis, perhaps by analyzing thousands of non-spam and spam e-mails from various types of sources (mailing lists, intra- and inter-office business e-mail, and personal mail, to give a few different categories).
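
A rough sketch of building such a list by counting, in Python. It assumes the messages are already available as plain strings and splits on whitespace only; `most_common_words` and its `limit` parameter are hypothetical names used for illustration.

```python
from collections import Counter

def most_common_words(messages, limit=100):
    """Return the `limit` most frequent lower-cased words across all messages."""
    counts = Counter()
    for text in messages:
        # Assumption: a plain whitespace split is good enough for counting.
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(limit)]
```

The returned list could then serve as the set of words to leave out of the collection. Feeding it both spam and non-spam mail, as suggested above, keeps the list from being skewed toward one category.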

As for the best algorithm for figuring out what's what, I have no idea. I'm sure there are dozens of different ways to decide whether an e-mail is likely to belong to the spam or non-spam category.

--
Mats
matsp
10-02-2007, 10:35 AM   #5
Senior software engineer
 
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,753
Quote:
Originally Posted by andhikacandra
1. Email message extraction into collection of word, is there any algorithm for that?
I've written spam filtering software and the tokenization of email can be very difficult. You generally want to keep all email addresses intact as single tokens while breaking URLs into host and path components. You want to separate punctuation from other characters, and probably convert everything to lower case.

Include all the header information; it is very useful.
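
The tokenization rules above might look something like this in Python (a sketch only, not C# and not the code from any actual filter). The regular expressions are simplified assumptions about what an address or URL looks like, and the `host:` / `path:` prefixes and the `tokenize` name are invented for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: simplified patterns; real-world addresses and URLs are messier.
TOKEN_RE = re.compile(
    r"(?P<url>https?://[^\s<>\"']+)"                # http/https URLs
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"     # e-mail addresses, kept whole
    r"|(?P<word>\w+)"                               # runs of word characters
    r"|(?P<punct>[^\w\s])"                          # each punctuation mark alone
)

def tokenize(text):
    """Lower-case the text, keep addresses intact, split URLs into host and path."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        value = match.group()
        if match.lastgroup == "url":
            parts = urlsplit(value)
            tokens.append("host:" + parts.netloc)
            if parts.path and parts.path != "/":
                tokens.append("path:" + parts.path)
        else:
            tokens.append(value)
    return tokens
```

Passing the raw message text, headers included, means header fields get tokenized along with the body. One known weakness of this sketch: punctuation directly after a URL is swallowed into the path token.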

As for how exactly you DO the tokenization in C#, I have no idea.

Quote:
2. I was developing SpamFilter with "Linear Least Squares Fit" method, is there somebody had algorithm diagram or something like that
How do you plan to apply this method to spam categorization? After a few moments of thinking I can't see how it could be used.
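
For what it's worth, one possible reading (an assumption, not something the original poster confirmed) is to treat each message as a vector of word counts, label spam as +1 and non-spam as -1, and fit a weight per word that minimizes the squared error between the weighted sum and the label. The Python sketch below does that fit with plain gradient descent to stay dependency-free; a real least squares fit would more likely use an SVD or normal-equation solver. All names here are hypothetical.

```python
def vectorize(text, vocab):
    """Turn a message into a word-count vector over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def train(examples, epochs=500, rate=0.1):
    """examples: list of (text, label) pairs, label +1 for spam, -1 for non-spam."""
    vocab = {}
    for text, _ in examples:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    rows = [(vectorize(text, vocab), label) for text, label in examples]
    weights = [0.0] * len(vocab)
    # Assumption: a small fixed step size is stable for this toy data;
    # larger or denser data sets may need a smaller rate.
    for _ in range(epochs):
        grad = [0.0] * len(vocab)
        for vec, label in rows:
            error = sum(w * x for w, x in zip(weights, vec)) - label
            for i, x in enumerate(vec):
                grad[i] += error * x
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad)]
    return vocab, weights

def score(text, vocab, weights):
    """Positive score suggests spam, negative suggests non-spam."""
    return sum(w * x for w, x in zip(weights, vectorize(text, vocab)))
```

Training is the fitting loop; categorization is just the sign of `score`. Words never seen in training are ignored.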

The systems I have worked with have been Naive Bayes filters, and neural networks with clustered inputs, with the clustering determined by a class-conditional-distribution EM method. The neural network method was significantly better overall. See the following paper: "Learning Spam: Simple Techniques For Freely-Available Software".
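
For readers who have not seen a Naive Bayes filter, here is a minimal sketch of the general idea in Python. It is a textbook multinomial variant with add-one smoothing, not the implementation referred to above; `train_nb` and `classify_nb` are hypothetical names, and the input is assumed to be already tokenized.

```python
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (tokens, label) pairs; returns the counts needed to classify."""
    word_counts = {}          # label -> Counter of token occurrences
    doc_counts = Counter()    # label -> number of training messages
    for tokens, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    return word_counts, doc_counts, vocab

def classify_nb(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label, counts in word_counts.items():
        logp = math.log(doc_counts[label] / total_docs)   # class prior
        denom = sum(counts.values()) + len(vocab)         # add-one smoothing
        for t in tokens:
            if t in vocab:                                # skip unseen tokens
                logp += math.log((counts[t] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Working in log-probabilities avoids underflow when a message has many tokens.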
brewbuck
All times are GMT -6.

