
Decisions, Decisions, Decisions, Tree?

2/15/2015

I decided to learn Python and was recently admitted into Northwestern's Predictive Analytics graduate program. So I figured, meh! Why not learn both at the same time? My first foray into predictive analytics had to do with a supervised classification model called a decision tree. What is a decision tree? In essence, it's a predictive algorithm that just so happens to look like, when drawn out and visualized, well... a tree. I first encountered decision trees in the O'Reilly book Data Science for Business, and I used most of Joe McCarthy's primer as the guide for my programming exercises, modifying it a bit to better suit my own programming style. The decision tree was the first predictive model the book described and, in my opinion, one of the more interesting ones because of its relative simplicity. The authors used a data set of mushroom samples, courtesy of UCI. The aim of the tree is to predict, based on its attributes, whether any additional sample is poisonous or edible. Which leads me to the question: how did they manage to find out whether the samples in the original data set were poisonous or edible in the first place? The sample data set can be found here: Mushrooms. 

An accompanying set of attributes and their value definitions was provided with the data set. They are as follows:
  1. classes: edible=e, poisonous=p
  2. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
  3. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
  4. cap-color: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
  5. bruises: bruises=t, no=f
  6. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
  7. gill-attachment: attached=a, descending=d, free=f, notched=n
  8. gill-spacing: close=c, crowded=w, distant=d
  9. gill-size: broad=b, narrow=n
  10. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
  11. stalk-shape: enlarging=e, tapering=t
  12. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
  13. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
  14. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
  15. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  16. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  17. veil-type: partial=p, universal=u
  18. veil-color: brown=n, orange=o, white=w, yellow=y
  19. ring-number: none=n, one=o, two=t
  20. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
  21. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
  22. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
  23. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
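Before any of the calculations below, the raw data file has to be read into a list of instances. The file is plain CSV with one mushroom per row (see the sample rows further down), so a minimal loading-and-cleaning sketch might look like the following. Note that load_instances is my own name rather than the primer's, the file name is the one UCI uses for this data set, and the cleaning step simply drops rows containing the '?' missing-value marker mentioned under stalk-root:

import csv

def load_instances(filename):
    '''Sketch: read the UCI mushroom file into a list of attribute-value lists,
    skipping any row that contains the '?' missing-value marker.'''
    with open(filename) as data_file:
        reader = csv.reader(data_file)
        return [row for row in reader if '?' not in row]

instances = load_instances('agaricus-lepiota.data')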

 So how does it work? The tree's efficacy is based on how significant each attribute is in determining whether a mushroom is edible or poisonous. This is done by calculating the entropy of the target attribute and then calculating the information gain of each non-target attribute (#2 - 23 in the list above). For this data set of mushrooms, the target attribute is #1 in the list above: edible or poisonous. 

Entropy is defined as a measure of the general disorder or impurity in a set of data. For a set with two classes, like edible and poisonous here, it is described by the following equation:

entropy = -p(e) * log2(p(e)) - p(p) * log2(p(p))

where p(e) and p(p) are the proportions of edible and poisonous instances in the set.
Calculating entropy was as simple as writing a function to compute it from the counts of poisonous and edible instances in the data set. This was done by first munging the sample data set to exclude instances with errant or missing attribute values, then applying the formula illustrated above to the first column (the classes attribute) of the cleaned data.
import math

def entropy(instances, attribute_names):
    '''Returns the entropy of a data set of mushrooms.'''
    # attribute_value_counts is a helper (from the primer) that tallies the values of a named attribute
    class_value_count = attribute_value_counts(instances, 'classes', attribute_names)
    number_of_instances = len(instances)
    # calculate the edible term first
    e = float(class_value_count['e'])
    p = float(class_value_count['p'])
    side_a = (e / number_of_instances) * math.log((e / number_of_instances), 2)
    # then calculate the poisonous term
    side_b = (p / number_of_instances) * math.log((p / number_of_instances), 2)
    entropy_calculation = - side_a - side_b
    return entropy_calculation
The code above yields an entropy of 0.959, which means the data set is quite disordered and contains a fairly even mix of edible and poisonous instances. 
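The tree-building code below chooses each split by calling choose_best_attribute_index, which in the primer is built on top of information gain: the drop in entropy you get by splitting the instances on a candidate attribute. Since the primer's helpers aren't reproduced in this post, here is a minimal sketch of how they might look, assuming candidate attributes are referenced by integer column position (the real implementations may differ):

import collections
import math

def information_gain(instances, candidate_index, class_index=0):
    '''Sketch: entropy of the class column minus the weighted entropy of the
    partitions produced by splitting on the candidate attribute.'''
    def class_entropy(subset):
        counts = collections.Counter(instance[class_index] for instance in subset)
        total = float(len(subset))
        return -sum((n / total) * math.log(n / total, 2) for n in counts.values())

    partitions = collections.defaultdict(list)
    for instance in instances:
        partitions[instance[candidate_index]].append(instance)

    remainder = sum((len(subset) / float(len(instances))) * class_entropy(subset)
                    for subset in partitions.values())
    return class_entropy(instances) - remainder

def choose_best_attribute_index(instances, candidate_attribute_indexes):
    '''Sketch: return the candidate attribute index with the highest information gain.'''
    return max(candidate_attribute_indexes,
               key=lambda index: information_gain(instances, index))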

The Decision Tree is programmed with the following code:
def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0):
    '''Returns a new decision tree trained on a list of instances.
    The tree is constructed by recursively selecting and splitting instances based on
    the highest information_gain of the candidate_attribute_indexes.
    The target attribute is found in position class_index.
    The default_class is the majority value for the current node's parent in the tree.
    A positive (int) trace value will generate trace information with increasing levels of indentation.
    class_index is the target attribute to classify by, in this case poisonous or edible.
    candidate_attribute_indexes are the attributes the tree will be built against.
    Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python by Christopher Roach,
    http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3'''
    # assumes import collections at module level; load_attributes, majority_value,
    # choose_best_attribute_index and split_instances are helper functions from the primer
    attribute_list = load_attributes('agaricus-lepiota.attributes')
    # if no candidate_attribute_indexes are provided, assume we will use all but the target attribute
    if candidate_attribute_indexes is None:
        #candidate_attribute_indexes = range(len(instances[0]))
        candidate_attribute_indexes = load_attributes('agaricus-lepiota.attributes')
        # delete the classes attribute at index 0
        del candidate_attribute_indexes[0]
    # tally occurrences of edible or poisonous in a Counter
    attribute_value_and_counts = collections.Counter([instance[class_index] for instance in instances])
    #print 'length -', len(attribute_value_and_counts), ' - ', attribute_value_and_counts
    # if the data set is empty or the candidate attribute list is empty, return the default value
    if not instances or not candidate_attribute_indexes:
        if trace:
            print '{}Using default class {}'.format('< ' * trace, default_class)
        return default_class
    # (should only be hit during recursion iteration > 0) if all the instances have the same
    # target attribute (edible or poisonous), return that target attribute value to the caller
    elif len(attribute_value_and_counts) == 1:
        # return the attribute value
        target_attribute = attribute_value_and_counts.most_common(1)[0][0]
        if trace:
            print '{}All {} instances have label {}'.format('< ' * trace, len(instances), target_attribute)
        return target_attribute
    # the instances do NOT all have the same target attribute (edible or poisonous)
    else:
        default_class = majority_value(instances, class_index)
        # choose the next best attribute index to classify the instances by
        best_index = choose_best_attribute_index(instances, candidate_attribute_indexes)
        if trace:
            print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)
        # create a new decision tree node with the best attribute index and an empty dictionary (for now)
        tree = {best_index: {}}
        # create a new decision tree sub-node (branch) for each of the values in the best attribute field
        partitions = split_instances(instances, best_index)
        # remove that attribute from the set of candidates for further splits
        # (this debug print produces the bare attribute indexes in the trace output below)
        print best_index
        # find the attribute to remove from the master list of attributes
        attribute_to_remove = attribute_list[best_index]
        # find the index of the attribute to remove
        index_to_remove = candidate_attribute_indexes.index(attribute_to_remove)
        # finally, remove the attribute at that candidate index
        del candidate_attribute_indexes[index_to_remove]
        remaining_candidate_attribute_indexes = candidate_attribute_indexes
        for attribute_value in partitions:
            if trace:
                print '{}Creating subtree for value {} of attribute index {} ({}, {}, {}, {})'.format(
                    '> ' * trace,
                    attribute_value,
                    best_index,
                    len(partitions[attribute_value]),
                    len(remaining_candidate_attribute_indexes),
                    class_index,
                    default_class)
            # create a subtree for each value of the best attribute, using the instances
            # where the attribute value is the key
            subtree = create_decision_tree(partitions[attribute_value],
                                           remaining_candidate_attribute_indexes,
                                           class_index,
                                           default_class,
                                           trace + 1 if trace else 0)
            # add the new subtree to the (initially empty) dictionary in the node we just created;
            # this can hold either a whole new subtree or an edible/poisonous label for the attribute_value
            tree[best_index][attribute_value] = subtree
        return tree
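The function above also leans on a few helper functions from the primer (load_attributes, split_instances, majority_value) that aren't reproduced in this post. The following is my own minimal sketch of the two simplest ones, assuming each instance is a plain list of attribute values with the class label at index 0 (the primer's versions may differ):

import collections

def split_instances(instances, attribute_index):
    '''Sketch: partition instances into a dict keyed by their value at attribute_index.'''
    partitions = collections.defaultdict(list)
    for instance in instances:
        partitions[instance[attribute_index]].append(instance)
    return partitions

def majority_value(instances, class_index=0):
    '''Sketch: return the most common class value (edible or poisonous) among the instances.'''
    class_counts = collections.Counter(instance[class_index] for instance in instances)
    return class_counts.most_common(1)[0][0]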

Yes, yes, I know. It's a lot of code to look through, so I'll save you the hassle and tell you what it essentially does. But first, let's take a look at the output when the tree is built with trace printing turned on for easier tracking, along with a sample of the instance data the tree was built from.
> Creating tree node for attribute index 5
5
> Creating subtree for value a of attribute index 5 (400, 21, 0, e)
< < All 400 instances have label e
> Creating subtree for value c of attribute index 5 (192, 21, 0, e)
< < All 192 instances have label p
> Creating subtree for value f of attribute index 5 (1584, 21, 0, e)
< < All 1584 instances have label p
> Creating subtree for value m of attribute index 5 (28, 21, 0, e)
< < All 28 instances have label p
> Creating subtree for value l of attribute index 5 (400, 21, 0, e)
< < All 400 instances have label e
> Creating subtree for value n of attribute index 5 (2764, 21, 0, e)
> > Creating tree node for attribute index 20
20
> > Creating subtree for value k of attribute index 20 (1296, 20, 0, e)
< < < All 1296 instances have label e
> > Creating subtree for value r of attribute index 20 (72, 20, 0, e)
< < < All 72 instances have label p
> > Creating subtree for value w of attribute index 20 (100, 20, 0, e)
> > > Creating tree node for attribute index 21
21
> > > Creating subtree for value y of attribute index 21 (24, 19, 0, e)
< < < < All 24 instances have label e
> > > Creating subtree for value c of attribute index 21 (16, 19, 0, e)
< < < < All 16 instances have label p
> > > Creating subtree for value v of attribute index 21 (60, 19, 0, e)
< < < < All 60 instances have label e
> > Creating subtree for value n of attribute index 20 (1296, 19, 0, e)
< < < All 1296 instances have label e
> Creating subtree for value p of attribute index 5 (256, 19, 0, e)
< < All 256 instances have label p
{5: {'a': 'e',
'c': 'p',
'f': 'p',
'l': 'e',
'm': 'p',
'n': {20: {'k': 'e',
'n': 'e',
'r': 'p',
'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},
'p': 'p'}}
Below is an example of the sample instances (training set) used to build the tree (make the tree learn heheh).


p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m

The tree first targets the attribute with the highest information gain. In the first iteration of the tree, index 5 (odor) was selected as the most informative. The tree then segments that attribute's values in two ways. If an attribute value has an entropy of 0, the tree bases its decision on that pure occurrence of edible or poisonous. If the entropy of the attribute value is greater than 0 (an impure mix of edible and poisonous), the tree looks for the next most informative attribute. You'll notice this occurred on the 'n' value of index 5 (odor): recursion built another node on index 20 (spore-print-color) and continued this pattern until it exhausted all the available attribute values of the first tree node (index 5: odor). 
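You can trace the same logic by hand through the tree dictionary printed above. For an odorless ('n'), white-spore-printed ('w'), solitary ('y') mushroom, the lookup path would be:

# tree is the nested dict printed above:
# odor (index 5) = 'n' -> spore-print-color (index 20) = 'w' -> population (index 21) = 'y'
print tree[5]['n'][20]['w'][21]['y']   # prints 'e', i.e. edible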

Efficacy, you ask? Well, the tree output above was built using a sample data set that numbered over 8,000. I then took an additional 20 instances and ran them through the tree using a function that classifies whether each sample instance is edible or poisonous, shown in the following code:
def classify(tree, instance, default_class=None):
    '''Returns a classification label for instance by traversing the tree, given a decision tree.
    default_class is the default classification, in this case whether edible or poisonous.'''
    # if the tree is empty
    if not tree:
        return default_class
    # if the tree is NOT a dict, then the final value of the target attribute has been found,
    # i.e. poisonous / edible
    if not isinstance(tree, dict):
        return tree
    # the first key of the dict is the attribute index this node splits on
    attribute_index = tree.keys()[0]
    print 'attribute index = ', attribute_index
    # the first value of the dict is the set of branches for that attribute
    attribute_values = tree.values()[0]
    # the instance's value for the attribute this node splits on
    instance_attribute_value = instance[attribute_index]
    print 'instance attribute value = ', instance_attribute_value
    # if that attribute value was not found in the value set for this attribute
    if instance_attribute_value not in attribute_values:
        return default_class
    # if it's not a pure leaf, send the remaining subtree back into classify until a label is found
    return classify(attribute_values[instance_attribute_value], instance, default_class)

I ran the classify function above against the tree built earlier, using the following code; the results are shown below. The classify function traverses the decision tree by following the tree's attributes and their possible values, guided by the values of the instance being tested, and returns a decision once it reaches a label.
for instance in testing_instances:
    predicted_label = classify(tree, instance)
    actual_label = instance[0]
    print 'predicted: {}; actual: {} \n'.format(predicted_label, actual_label)
The results output are as follows:
attribute index =  5
instance attribute value = m
predicted: p; actual: p
attribute index = 5
instance attribute value = m
predicted: p; actual: p
attribute index = 5
instance attribute value = m
predicted: p; actual: p
attribute index = 5
instance attribute value = n
attribute index = 20
instance attribute value = w
attribute index = 21
instance attribute value = y
predicted: e; actual: e
attribute index = 5
instance attribute value = n
attribute index = 20
instance attribute value = w
attribute index = 21
instance attribute value = v
predicted: e; actual: e
attribute index = 5
instance attribute value = m
predicted: p; actual: p
attribute index = 5
instance attribute value = n
attribute index = 20
instance attribute value = w
attribute index = 21
instance attribute value = v
predicted: e; actual: e
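To boil a run like this down to a single number, you could tally how many predictions match the actual labels. A minimal sketch, assuming the same tree and testing_instances used above:

correct = sum(1 for instance in testing_instances
              if classify(tree, instance) == instance[0])
accuracy = correct / float(len(testing_instances))
print 'accuracy: {:.2%} ({} of {} correct)'.format(accuracy, correct, len(testing_instances))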
So what's it all mean? Well it means a couple of good things, and a couple of bad things.
  • GOOD - Easy to use and interpret - The tree's ability to utilize and automate information gain is extremely useful because it can accurately determine which attribute is the best indicator to predict whether a mushroom is edible or poisonous.
  • BAD - Limited scope - You may have noticed that the decision tree looks rather small even after being built from a sample of over 8,000. This is because the tree's size and predictive power are limited by the samples that were used to build it in the first place, which places heavy importance on the size and coverage of the sample set.
  • BAD - Overfitting - Yes, the humble decision tree is very susceptible to overfitting, due in part to the limited scope mentioned above. The tree is "trained" on a sample set that may not be representative of the true real-world population. The moment the tree is run against a sample that falls outside the parameters of the set it was trained on, it will fail to accurately predict the outcome of that sample. However, if the sample falls within the ranges of the training set, the tree will perform quite well.
Next up? We put the Decision Tree's entire family, also known as a Random Forest, up against a different data set. 