| | #1 |
| Registered User Join Date: Sep 2007
Posts: 4
| Developing SpamFilter 1. Is there an algorithm for extracting an email message into a collection of words? 2. I am developing the spam filter with the "Linear Least Squares Fit" method. Does anybody have an algorithm diagram or something similar? Thanks. |
| andhikacandra is offline | |
| | #2 |
| Kernel hacker Join Date: Jul 2007 Location: Farncombe, Surrey, England
Posts: 15,686
| 1. Yes - there are probably several, but the simplest one is to scan the text for "delimiters" (whitespace, punctuation, etc.) and treat anything between two delimiters as a word. 2. Did you even TRY to google for that? Second hit is: http://en.wikipedia.org/wiki/Linear_least_squares -- Mats
__________________ Compilers can produce warnings - make the compiler programmers happy: Use them! Please don't PM me for help - and no, I don't do help over instant messengers. |
| matsp is offline | |
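A minimal sketch of the delimiter scan described in post #2, written in Python because the thread itself contains no code. The delimiter set (ASCII whitespace and punctuation) and the lowercasing step are assumptions made for illustration, and `tokenize` is a hypothetical name, not something from the thread.

```python
import string

# Assumption: a "delimiter" is any ASCII whitespace or punctuation character.
DELIMITERS = set(string.whitespace) | set(string.punctuation)


def tokenize(text):
    """Scan text character by character; anything between two delimiters is a word."""
    words = []
    current = []
    for ch in text:
        if ch in DELIMITERS:
            if current:  # a delimiter closes the word being built
                words.append("".join(current).lower())  # lowercasing is an assumption
                current = []
        else:
            current.append(ch)
    if current:  # flush the last word if the text does not end on a delimiter
        words.append("".join(current).lower())
    return words


print(tokenize("Buy now!!! Cheap meds, click here: http://example.com"))
```

Note that this simple scan also breaks URLs and e-mail addresses into pieces; whether that is desirable for a spam filter is a design choice the thread does not settle.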
| | #3 |
| Registered User Join Date: Sep 2007
Posts: 4
| OK, thanks. 1. For extraction, what about common words like "they", "would", "will", etc.? Should I leave those words out of the collection? 2. I know that LLSF uses a matrix, but what about an algorithm diagram? What do the training algorithm and the categorization step look like? Thanks. |
| andhikacandra is offline | |
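For question 2, a rough sketch of what training and categorization could look like for a two-class least-squares fit. Everything here is an assumption made for illustration: the function names, the +1/-1 label encoding, the bias feature, and in particular the small ridge term and the normal-equations solve, which are swapped in to keep the example short and numerically safe. It is not taken from the thread or from any particular LLSF paper.

```python
def vectorize(tokens, index):
    """Turn a token list into a word-count vector with a trailing bias term."""
    vec = [0.0] * (len(index) + 1)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1.0
    vec[-1] = 1.0  # constant bias feature (assumption)
    return vec


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def train(docs, labels, ridge=1e-3):
    """Training: fit weights w minimising ||X w - y||^2 + ridge * ||w||^2.

    docs are token lists; labels are +1 (spam) or -1 (non-spam).
    The ridge term is an assumption that keeps the system solvable.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = [vectorize(doc, index) for doc in docs]
    n = len(index) + 1
    # Normal equations: (X^T X + ridge * I) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(n)]
    return index, solve(xtx, xty)


def categorize(tokens, index, weights):
    """Categorization: the sign of the fitted linear score picks the class."""
    score = sum(v * w for v, w in zip(vectorize(tokens, index), weights))
    return "spam" if score > 0 else "non-spam"


docs = [["cheap", "pills", "buy"], ["buy", "cheap", "watches"],
        ["meeting", "tomorrow", "agenda"], ["lunch", "tomorrow"]]
labels = [1, 1, -1, -1]
index, weights = train(docs, labels)
print(categorize(["cheap", "pills"], index, weights))
print(categorize(["lunch", "tomorrow"], index, weights))
```

With a realistic vocabulary the matrix becomes large, so a library least-squares or SVD routine would be the more practical choice than this hand-rolled solver.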
| | #4 |
| Kernel hacker Join Date: Jul 2007 Location: Farncombe, Surrey, England
Posts: 15,686
| Building a list of the 100-500 most common words would be a good start. This should be done on a statistical basis, perhaps by analyzing thousands of non-spam and spam e-mails from various types of sources (mailing lists, business e-mail both intra- and inter-office, and personal mail, to give a few different categories). As to the best algorithm for figuring out what's what - I have no idea. I'm sure there are dozens of different ways to categorize whether an e-mail is likely to belong to the spam or non-spam category. -- Mats |
| matsp is offline | |
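One way to build such a common-word list on a statistical basis is to count how many messages each word appears in and keep the top few. The sketch below assumes the messages are already tokenized; `most_common_words` and `remove_common_words` are hypothetical names, ranking by document frequency is an assumption, and the three-message corpus is a toy stand-in for the thousands of e-mails suggested above.

```python
from collections import Counter


def most_common_words(messages, top_n):
    """Return the top_n words ranked by the number of messages containing them."""
    doc_freq = Counter()
    for tokens in messages:
        doc_freq.update(set(tokens))  # count each word once per message
    # Highest frequency first; ties broken alphabetically for a stable result.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def remove_common_words(tokens, common_words):
    """Drop the common words from a token list before further processing."""
    common = set(common_words)
    return [tok for tok in tokens if tok not in common]


corpus = [
    ["they", "will", "buy", "cheap", "pills"],
    ["they", "would", "like", "the", "report"],
    ["will", "they", "join", "the", "meeting"],
]
common = most_common_words(corpus, top_n=3)  # 100-500 on a real corpus
print(common)
print(remove_common_words(corpus[0], common))
```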
| | #5 | ||
| Senior software engineer Join Date: Mar 2007 Location: Portland, OR
Posts: 5,753
| Quote:
Include all the header information; it is very useful. As for how exactly you DO the tokenization in C#, I have no idea. Quote:
The systems I have worked with have been Naive Bayes filters, and neural networks with clustered inputs, with the clustering determined by a class-conditional-distribution EM method. The neural network method was significantly better overall. See the following paper: Learning Spam: Simple Techniques For Freely-Available Software | ||
| brewbuck is online now | |
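For readers unfamiliar with the Naive Bayes approach mentioned in post #5, here is a minimal sketch of a multinomial Naive Bayes filter with add-one (Laplace) smoothing. The class name, the label strings, the smoothing choice, and the toy training data are all assumptions for illustration; this is not the system the poster worked with, and it does not cover the neural network method. It also assumes at least one training message per class.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy multinomial Naive Bayes over word tokens (hypothetical sketch)."""

    LABELS = ("spam", "non-spam")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, tokens, label):
        """Record one labelled message."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def log_score(self, tokens, label):
        """log P(label) + sum of log P(token | label), with add-one smoothing."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for tok in tokens:
            count = self.word_counts[label][tok]  # 0 for unseen words
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def classify(self, tokens):
        """Pick the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(tokens, label))


nb = NaiveBayesFilter()
nb.train(["cheap", "pills", "buy"], "spam")
nb.train(["buy", "cheap", "watches"], "spam")
nb.train(["meeting", "tomorrow", "agenda"], "non-spam")
nb.train(["lunch", "tomorrow"], "non-spam")
print(nb.classify(["cheap", "pills"]))
print(nb.classify(["lunch", "tomorrow"]))
```

Following the advice above about headers, the token lists fed to `train` and `classify` could include words taken from the message headers as well as the body.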