I decided to learn Python and was recently admitted into Northwestern's Predictive Analytics graduate program. So I figured, meh! Why not learn both at the same time? My first foray into predictive analytics involved a supervised classification model called a decision tree. What is a decision tree? In essence, it's a predictive algorithm that, when drawn out and visualized, just so happens to look like... a tree. I first encountered decision trees in the O'Reilly book Data Science for Business. I used most of Joe McCarthy's primer as the guide for my programming exercises and modified it a bit to better suit my own programming style. It was the first predictive model the book described, and one of the more interesting ones in my opinion because of its relative simplicity. The authors used a data set of mushroom samples, courtesy of UCI. The aim of the tree was to predict, based on its attributes, whether any additional sample was poisonous or edible. Which leads me to the question: how did they manage to find out whether the samples in the original data set were poisonous or edible in the first place? The sample data set can be found here: Mushrooms.
An accompanying set of attributes and their value definitions was provided with the data set. They are as follows:
So how's it work? The tree's efficacy is based on how much influence each attribute has in determining whether a mushroom is edible or poisonous. This is measured by calculating the entropy of the target attribute, then calculating the information gain of each non-target attribute (#2 through #23 in the list above). For this mushroom data set, the target attribute is #1 in the list above: edible or poisonous.
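Information gain, in case the term is unfamiliar, is just the entropy of the parent set minus the size-weighted entropy of the subsets you get by splitting on an attribute. Here's a minimal sketch, under the assumption that each instance is a list with the class label at index 0 (the helper names are mine, not the primer's):

```python
import math
from collections import Counter, defaultdict

def entropy(instances, class_index=0):
    """Shannon entropy of the class labels in a list of instances."""
    counts = Counter(inst[class_index] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(instances, attr_index, class_index=0):
    """Parent entropy minus the size-weighted entropy of each value subset."""
    subsets = defaultdict(list)
    for inst in instances:
        subsets[inst[attr_index]].append(inst)
    total = len(instances)
    remainder = sum(len(s) / total * entropy(s, class_index)
                    for s in subsets.values())
    return entropy(instances, class_index) - remainder
```

An attribute that splits the set into perfectly pure subsets has a gain equal to the parent entropy; an attribute that tells you nothing has a gain of 0.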
Entropy is defined as a measure of the general disorder or impurity in a set of data and is described by the following equation:
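The equation referred to is the standard Shannon entropy (reconstructed here from the definition, since the original figure isn't reproduced): for a set S whose classes i occur with proportions p_i,

```latex
H(S) = -\sum_{i} p_i \log_2 p_i
     = -\left( p_{\text{edible}} \log_2 p_{\text{edible}}
              + p_{\text{poisonous}} \log_2 p_{\text{poisonous}} \right)
```

where the second form is the two-class case that applies to this mushroom data set.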
Calculating entropy was as simple as writing a function to compute it over the data set's edible/poisonous labels. This was done by first munging the sample data set to exclude instances with errant or missing attribute values, then applying the formula illustrated above to the first column (the class label) of the cleaned data.
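A minimal sketch of what such an entropy function might look like, assuming the cleaned data is a list of lists with the edible/poisonous label at index 0 (this is not the primer's exact code):

```python
import math
from collections import Counter

def entropy(instances, class_index=0):
    """Shannon entropy of the class labels ('e' or 'p') at class_index."""
    counts = Counter(inst[class_index] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy example: 6 edible and 4 poisonous instances out of 10
sample = [['e']] * 6 + [['p']] * 4
print(round(entropy(sample), 3))  # -> 0.971
```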
Running this on the cleaned data set yields an entropy of 0.959, which means the set is quite disorderly: it contains a nearly equal ratio of edible to poisonous instances (an entropy of 1 would mean a perfect 50/50 split).
The Decision Tree is programmed with the following code:
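Since the full listing isn't reproduced here, below is a condensed ID3-style sketch of what that tree-building code does. The nested-dict tree representation and function names are my assumptions, not McCarthy's actual code:

```python
import math
from collections import Counter, defaultdict

def entropy(instances, class_index=0):
    counts = Counter(inst[class_index] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def majority_class(instances, class_index=0):
    return Counter(inst[class_index] for inst in instances).most_common(1)[0][0]

def build_tree(instances, candidate_indexes, class_index=0):
    """Recursively build a tree as nested dicts: {attr_index: {value: subtree_or_label}}."""
    labels = {inst[class_index] for inst in instances}
    if len(labels) == 1:          # pure node (entropy 0): return the label
        return labels.pop()
    if not candidate_indexes:     # no attributes left: fall back to majority vote
        return majority_class(instances, class_index)

    def gain(i):
        # information gain of splitting on attribute i
        subsets = defaultdict(list)
        for inst in instances:
            subsets[inst[i]].append(inst)
        total = len(instances)
        remainder = sum(len(s) / total * entropy(s, class_index)
                        for s in subsets.values())
        return entropy(instances, class_index) - remainder

    best = max(candidate_indexes, key=gain)      # most informative attribute
    remaining = [i for i in candidate_indexes if i != best]
    subsets = defaultdict(list)
    for inst in instances:
        subsets[inst[best]].append(inst)
    return {best: {value: build_tree(subset, remaining, class_index)
                   for value, subset in subsets.items()}}
```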
Yes, yes, I know. It's a lot of code to look through, so I'll save you the hassle and tell you what it essentially does. But first, let's take a look at the output, and at the same instance data used by the tree, visualized with some trace printing for easier tracking.
Below is an example of the sample instances (training set) used to build the tree (make the tree learn heheh).
The tree first targets the attribute with the most information gain. In the first iteration, index 5 (odor) was selected as the most informative. The tree then segments that attribute's values in one of two ways. If an attribute value has an entropy of 0, the tree bases its decision on that pure occurrence of edible or poisonous. If the entropy of the attribute value is greater than 0 (an impure mix of edible and poisonous), the tree attempts to find the next most informative attribute. You'll notice this occurred on the 'n' value of index 5 (odor): recursion built another node on index 20 (spore-print-color), and this pattern continued until all the available attribute values of the first tree node (index 5: odor) were exhausted.
Efficacy, you ask? Well, the tree output above was built using a sample data set of over 8,000 instances. I then took an additional 20 instances and ran them through the tree, using a function that classifies whether each sample instance is edible or poisonous with the following code:
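The original classify listing isn't reproduced here, but a minimal sketch of such a function, assuming the tree is the nested-dict structure {attribute_index: {value: subtree_or_label}}, might look like this:

```python
def classify(tree, instance, default=None):
    """Walk the nested-dict tree until a leaf label ('e' or 'p') is reached."""
    if not isinstance(tree, dict):     # leaf node: the class label itself
        return tree
    attr_index = next(iter(tree))      # the attribute this node tests
    branches = tree[attr_index]
    value = instance[attr_index]
    if value not in branches:          # unseen attribute value: no path exists
        return default
    return classify(branches[value], instance, default)

# Hypothetical toy tree splitting on index 5 (odor): 'n' (none) vs 'f' (foul)
toy_tree = {5: {'n': 'e', 'f': 'p'}}
print(classify(toy_tree, ['?', 'x', 's', 'n', 't', 'n']))  # -> e
```

The default argument covers the case where a test instance has an attribute value the tree never saw during training, so the traversal can fail gracefully instead of raising an error.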
I ran the classify function above on the preexisting tree in the following code, with the results shown below. The classify function traverses the decision tree by following the path of the tree's attributes and their possible values, using the values of the instance being tested, and returns a decision once it has found an appropriate path.
The results output are as follows:
So what's it all mean? Well, it means a couple of good things and a couple of bad things.
Attention to detail? Nah, attention to the whole picture.