Difference between revisions of "Decision Forests"


Revision as of 03:00, 19 March 2016

A parallel Decision Forest

All of the code that I used for my email spam/ham classification algorithm is in this GitHub repo: https://github.com/brianjp93/email-classification/

Explanation of some files

  • emaildata.py - contains the EmailData class, whose methods are used to extract data from each individual email.
  • extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.
    • Reads a large number of emails and finds how frequently each word appears in spam and in non-spam emails. The two frequencies are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (A large magnitude should mean the word is particularly spammy, or particularly not.) The features are written to a new file, message.fts.
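
The feature selection described above can be sketched roughly as follows. This is not the actual ExtractWords code from the repo, just a minimal illustration of the idea. The function names (word_frequencies, select_features), the tokenizing regex, and the choice to measure frequency as the fraction of emails containing a word are all assumptions made for the example. Writing message.fts is left out because its format is not described here.

```python
import re
from collections import Counter


def word_frequencies(emails):
    """Return, for each word, the fraction of emails that contain it.

    Assumption: "frequency" means document frequency, so a word is
    counted at most once per email.
    """
    counts = Counter()
    for text in emails:
        # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(emails) or 1  # avoid dividing by zero on an empty list
    return {word: n / total for word, n in counts.items()}


def select_features(spam_emails, ham_emails, n_features=10):
    """Pick the words whose spam/ham frequency difference is largest in magnitude."""
    spam_freq = word_frequencies(spam_emails)
    ham_freq = word_frequencies(ham_emails)
    vocab = set(spam_freq) | set(ham_freq)
    # Positive difference: more common in spam. Negative: more common in ham.
    diff = {w: spam_freq.get(w, 0.0) - ham_freq.get(w, 0.0) for w in vocab}
    # Sort by magnitude (largest first), breaking ties alphabetically.
    ranked = sorted(vocab, key=lambda w: (-abs(diff[w]), w))
    return ranked[:n_features]


if __name__ == "__main__":
    spam = ["win free money now", "free money offer"]
    ham = ["meeting at noon", "lunch meeting now"]
    print(select_features(spam, ham, n_features=3))
```

Sorting by absolute value is what keeps both strongly spammy words and strongly non-spammy words as features. A word like "now", which shows up equally often in both sets, gets a difference of zero and lands at the bottom of the ranking.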