RION ANGELES
  • BLOG
  • ABOUT ME
  • CONTACT ME

"Titanic" Data Mining in Tableau

1/6/2016


 
Picture

We all know how the Titanic sank. As you can observe from the realistic representation to the left, it was obviously the doing of a massive sea monster. No? Well, regardless of how the vessel met its end, I recently stumbled upon a data set provided by Kaggle that shows some interesting patterns and trends, and I thought it would be fun to visualize them in Tableau.

Let's jump right into it. First off, we take a look at the attributes of the data set. Below is a description pulled from Kaggle.

Code Editor

    
Now that we have a good idea of the attributes available to us, let's take a quick glance at the raw data and look for any strange or missing values within these columns. 

A Quick Glance

Picture
Opening the CSV file in Tableau gives us some interesting insight into the data set. Namely, we've got a lot of Nulls. Just by looking at some of the attributes, I have a sneaking suspicion that some may yield less information gain than others. Later on, if/when I attempt to create a predictive model, I may exclude Cabin because of its high cardinality and large number of Null fields. The Embarked attribute could possibly be excluded too, because I doubt it has an influence on survivability considering everyone was on the same boat; where passengers boarded seems independent of survival. Age, on the other hand, can go either way. I can either exclude it or perform some form of imputation. However, imputation introduces an inherent bias into the data set. We'll weigh our options later. Let's put up some pretty charts and graphs and see what we can learn.
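For anyone who wants to double-check the Null counts and cardinality outside of Tableau, here is a minimal sketch in plain Python. The rows in `SAMPLE_CSV` are toy values I made up for illustration (they are not records from the Kaggle file), and `summarize_columns` is a hypothetical helper name, so treat this as a starting point rather than the exact check I ran.

```python
import csv
import io

# Toy rows for illustration only; these are NOT records from the Kaggle file.
SAMPLE_CSV = """PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,A1,C
3,,,Q
4,35,B2,S
5,,,S
"""


def summarize_columns(csv_text):
    """Return {column: (missing_count, distinct_non_missing_count)}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    distinct = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # An empty cell is what Tableau displays as Null.
            if value is None or value.strip() == "":
                missing[name] += 1
            else:
                distinct[name].add(value)
    return {name: (missing[name], len(distinct[name])) for name in missing}


for column, (n_missing, n_distinct) in summarize_columns(SAMPLE_CSV).items():
    print(f"{column}: {n_missing} missing, {n_distinct} distinct")
```

A column with many missing cells and nearly as many distinct values as it has filled-in rows (like Cabin in the toy sample) is the kind of attribute I would consider dropping.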

Age and Survival

Alright, let's talk about some of these graphs. First up, the Age graph. You'll notice some interesting differences in survival rate with respect to age. A large majority of those who perished fall in the age buckets from 15 to 30. In addition, we notice a huge number of individuals without an age (Null value for Age). It would be safe to say that performing an imputation on these records would drastically change the distribution of age. We'll get to our options for imputation later. For now, we should try to determine whether Age was a significant factor in survival. We can run an ad-hoc chi-squared A/B test to check for statistical significance. Given the number of age buckets we have, we can select just 2 of them for the test. We'll choose buckets that show a great deal of disparity. How about... age bucket 20 (20 - 25) and bucket 0 (0 - 5).
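To make the bucketing and the imputation concern concrete, here is a small sketch. It assumes 5-year buckets labeled by their lower bound (so bucket 20 covers 20 up to 25), uses hypothetical ages rather than the real data, and picks mean imputation as just one possible strategy; the function names are my own.

```python
from collections import Counter


def age_bucket(age, width=5):
    """Lower bound of the fixed-width bucket containing age; None stays None."""
    if age is None:
        return None
    return int(age // width) * width


def bucket_counts(ages, width=5):
    """Count how many ages fall into each bucket (None counts missing ages)."""
    return Counter(age_bucket(age, width) for age in ages)


def impute_mean(ages):
    """Replace every missing age with the mean of the known ages."""
    known = [age for age in ages if age is not None]
    mean = sum(known) / len(known)
    return [mean if age is None else age for age in ages]


# Hypothetical ages, with None standing in for a Null.
ages = [2, 4, 21, 23, 24, 38, 52, None, None, None, None]

before = bucket_counts(ages)
after = bucket_counts(impute_mean(ages))
print("before imputation:", dict(before))
print("after imputation: ", dict(after))
```

In this toy example every missing age lands in the same bucket as the mean, so that one bucket more than doubles while the others stay put. That is the distortion I am worried about.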
The A/B test shown to the right is hosted by Evan Miller, and it indicates that the difference in survival rate between these two age buckets is statistically significant, so Age did in fact have an influence on one's chance of survival.
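If you would rather run the numbers yourself, a chi-squared test on a 2x2 table is short enough to write by hand. The sketch below is a plain Pearson chi-squared test without a continuity correction; I am not claiming it is exactly what the hosted tool computes. The counts passed in at the bottom are hypothetical placeholders, not the actual bucket counts from the data set.

```python
import math


def chi_squared_2x2(survived_a, died_a, survived_b, died_b):
    """Pearson chi-squared test of independence for a 2x2 table.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    observed = [[survived_a, died_a], [survived_b, died_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [survived_a + survived_b, died_a + died_b]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: bucket 0 vs bucket 20 (survived, died).
statistic, p_value = chi_squared_2x2(30, 10, 40, 80)
print(f"chi-squared = {statistic:.3f}, p = {p_value:.2g}")
```

One caveat: picking the two buckets with the biggest gap after looking at the chart makes a significant result more likely, so a test across all buckets would be the more careful follow-up.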
Picture
What about our other attributes? Did those have a huge influence on survival rates? It turns out they did! To maximize your chance of survival, the data suggests that you should:
  • travel in 1st or 2nd class
  • be female
  • have a parent or child accompanying you
  • have a spouse or sibling accompanying you
  • pay up to a certain amount in ticket fares
  • depart from Cherbourg
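Each item in that list boils down to comparing survival rates across the values of one attribute. Here is a rough sketch of that calculation, assuming each passenger is a dict with a 0/1 `Survived` field; the records are toy values for illustration, and `survival_rate_by` is a hypothetical helper.

```python
from collections import defaultdict


def survival_rate_by(records, attribute):
    """Return {attribute value: fraction of passengers who survived}."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for record in records:
        key = record[attribute]
        totals[key] += 1
        survivors[key] += record["Survived"]  # 1 = survived, 0 = perished
    return {key: survivors[key] / totals[key] for key in totals}


# Toy records for illustration only; not taken from the Kaggle file.
records = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]

print(survival_rate_by(records, "Sex"))
print(survival_rate_by(records, "Pclass"))
```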

I wouldn't have expected the departure port to have had an influence over survival rates. I suspect this is because passenger class depends on the port of departure.
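One way to check that suspicion would be to cross-tabulate port against class and look at the share of each class per port. The sketch below does that with toy records (not the real data); `class_share_by_port` is a hypothetical helper, and the field names `Embarked` and `Pclass` are assumed to match the column names in the CSV.

```python
from collections import Counter, defaultdict


def class_share_by_port(records):
    """Return {port: {class: share of that port's passengers}}."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["Embarked"]][record["Pclass"]] += 1
    return {
        port: {
            pclass: n / sum(class_counts.values())
            for pclass, n in class_counts.items()
        }
        for port, class_counts in counts.items()
    }


# Toy records for illustration only; not taken from the Kaggle file.
records = [
    {"Embarked": "C", "Pclass": 1},
    {"Embarked": "C", "Pclass": 1},
    {"Embarked": "C", "Pclass": 3},
    {"Embarked": "S", "Pclass": 1},
    {"Embarked": "S", "Pclass": 3},
    {"Embarked": "S", "Pclass": 3},
    {"Embarked": "S", "Pclass": 3},
]

for port, shares in class_share_by_port(records).items():
    print(port, shares)
```

If one port shows a much larger 1st class share than the others in the real data, the port effect on survival is probably a class effect in disguise.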

Next Steps: Machine Learning!

All in all, Tableau has proven to be an excellent data mining tool. Its ability to slice, dice, and visualize data sets with ease is incredible. The learning curve wasn't too bad, and I was able to answer most of my questions regarding exploration of the data set fairly quickly. Where do we go from here? Well, this data set would be a prime candidate for a classification machine learning algorithm, perhaps an ensemble method like a Random Forest. It would be interesting to see just how influential the attributes are relative to each other in determining survival rates. This could be done by calculating the information gain of each attribute.
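As a preview, information gain is the drop in entropy of the Survived label once you split the passengers by an attribute. Here is a minimal sketch using the standard Shannon entropy definition; the four records are toy values chosen so the answer is easy to verify by hand, and the function names are my own.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(records, attribute, target="Survived"):
    """Entropy of the target minus its weighted entropy after splitting."""
    labels = [record[target] for record in records]
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record[target])
    weighted = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - weighted


# Toy records for illustration only; not taken from the Kaggle file.
records = [
    {"Sex": "female", "Embarked": "S", "Survived": 1},
    {"Sex": "female", "Embarked": "C", "Survived": 1},
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "male", "Embarked": "C", "Survived": 0},
]

for attribute in ("Sex", "Embarked"):
    print(attribute, information_gain(records, attribute))
```

In this toy set Sex separates survivors perfectly (a gain of 1 bit) while Embarked tells us nothing (a gain of 0), which is the kind of ranking I would like to see for the real attributes.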