A parallel Decision Forest
Here is a link to all of the code I used for my email spam/ham classification algorithm - GitHub repo
Explanation of some files
- emaildata.py - contains the EmailData class, whose methods are used to extract data from each individual email.
- extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.
- Reads a large number of emails and finds how frequently each word appears in spam emails and in non-spam emails. The two frequencies are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (A large magnitude, in either direction, marks a word that is particularly spammy or particularly not.) The features are written to a new file, message.fts.
- id3.py - My implementation of the ID3 decision tree algorithm.
- This one is a bit cryptic and hard to read, but there are good explanations of the algorithm on the web. This video is part of a series on decision trees and has a nice walkthrough of the ID3 algorithm.
- parallelpredict.py - Uses id3.py and serialpredict.py to learn from and classify email. Splits the data into a separate subset for each core in the cluster and sends the subsets out. Each core learns a decision tree on its own subset. To make a prediction, every core receives the sample being classified and makes its own prediction. The label with the most votes wins. For example, if we have 25 trees where 15 think the sample is spam and 10 do not, then we classify the sample as spam.
- serialpredict.py - Used as a time comparison for parallelpredict.py. Learns a single decision tree and classifies samples one after another.
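
The feature generation described for extractwords.py can be sketched roughly as follows. This is not the code from the repo: the function names are made up for illustration, and I am assuming that "frequency" means the fraction of emails in a class that contain the word and that emails are split on whitespace.

```python
from collections import Counter


def document_frequency(emails):
    """Fraction of emails that contain each word (assumed definition of frequency)."""
    counts = Counter()
    for text in emails:
        # Count each word at most once per email.
        counts.update(set(text.lower().split()))
    total = len(emails) or 1
    return {word: n / total for word, n in counts.items()}


def select_features(spam_emails, ham_emails, n_features):
    """Return the n_features words whose spam and ham frequencies differ the most."""
    spam_freq = document_frequency(spam_emails)
    ham_freq = document_frequency(ham_emails)
    scores = {
        word: spam_freq.get(word, 0.0) - ham_freq.get(word, 0.0)
        for word in set(spam_freq) | set(ham_freq)
    }
    # Largest magnitude first; ties are broken alphabetically so the result is stable.
    ranked = sorted(scores, key=lambda w: (-abs(scores[w]), w))
    return ranked[:n_features]


spam = ["win money now", "free money offer"]
ham = ["meeting at noon", "lunch meeting tomorrow"]
print(select_features(spam, ham, 2))
```

Words with a large positive score lean towards spam and words with a large negative score lean towards non-spam, so both ends of the ranking are useful as features.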
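
Since id3.py is hard to read, here is a small sketch of the step at the heart of ID3: choosing the feature with the highest information gain. It is a simplified illustration rather than an excerpt from id3.py, and the data layout (a list of dicts plus a list of labels) and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the samples on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    total = len(labels)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """ID3 splits each node on the feature with the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


rows = [
    {"free": 1, "meeting": 0},
    {"free": 1, "meeting": 0},
    {"free": 0, "meeting": 1},
    {"free": 0, "meeting": 0},
]
labels = ["spam", "spam", "ham", "ham"]
print(best_feature(rows, labels, ["free", "meeting"]))
```

A full tree is built by applying this choice recursively to each subset until the labels in a subset are all the same or no features remain.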
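
The splitting and voting done by parallelpredict.py can be sketched like this. The sketch leaves out the actual communication between cores and only shows the logic. The helper names are hypothetical, and the round-robin split is an assumption, since the way the data is divided is not spelled out above.

```python
from collections import Counter


def split_data(samples, n_workers):
    """Deal the samples round-robin into one subset per worker (assumed scheme)."""
    return [samples[i::n_workers] for i in range(n_workers)]


def majority_vote(predictions):
    """Return the label that the most trees predicted."""
    # If two labels tie, Counter.most_common keeps the one seen first.
    return Counter(predictions).most_common(1)[0][0]


# 25 trees: 15 vote spam, 10 vote ham.
votes = ["spam"] * 15 + ["ham"] * 10
print(majority_vote(votes))  # spam
```

Using an odd number of trees avoids ties when there are only two classes.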