A parallel Decision Forest
Here is a link to all of the code I used for my email spam/ham classification algorithm: GitHub repo
Explanation of some files
- emaildata.py - contains the EmailData class, whose methods are used to extract data from each individual email.
- extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.
- Reads a large number of emails and finds how frequently each word appears in spam emails and in non-spam emails. The two frequencies are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (A large magnitude should mean the word is either particularly spammy or particularly not.) The features are written to a new file, message.fts.
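The feature selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the function names `word_frequencies` and `select_features` are hypothetical, and it assumes that "frequency" means the fraction of emails in a class that contain the word at least once. Writing the result out to the features file is left out because the file format is not described here.

```python
import re
from collections import Counter


def word_frequencies(emails):
    """Return, for each word, the fraction of emails that contain it.

    Assumption: a word counts once per email, however often it repeats.
    """
    counts = Counter()
    for text in emails:
        # Lowercase and split into simple alphabetic tokens.
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(emails) or 1  # avoid dividing by zero on an empty list
    return {word: n / total for word, n in counts.items()}


def select_features(spam_emails, ham_emails, n_features):
    """Pick the words whose spam/ham frequency difference is largest.

    Returns (word, difference) pairs. A positive difference means the
    word is more common in spam, a negative one more common in ham.
    """
    spam_freq = word_frequencies(spam_emails)
    ham_freq = word_frequencies(ham_emails)
    vocabulary = set(spam_freq) | set(ham_freq)
    diff = {
        word: spam_freq.get(word, 0.0) - ham_freq.get(word, 0.0)
        for word in vocabulary
    }
    # Sort by magnitude (largest first), breaking ties alphabetically.
    ranked = sorted(diff, key=lambda word: (-abs(diff[word]), word))
    return [(word, diff[word]) for word in ranked[:n_features]]
```

Words that show up equally often in both classes get a difference near zero and fall to the bottom of the ranking, which is why common words like "the" would not be chosen as features under this scheme.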