Now that we have a good idea of the attributes available to us, let's take a quick glance at the raw data and look for any strange or missing values within these columns.
A Quick Glance
Opening the CSV file in Tableau gives us some interesting insight into the data set. Namely, we've got a lot of Nulls. Just by looking at some of the attributes, I have a sneaking suspicion that some will yield less information gain than others. Later on, if and when I attempt to create a predictive model, I may exclude Cabin because of its high cardinality and its large number of Null fields. The Embarked attribute could possibly be excluded too: I doubt it has an influence on survivability, considering the passengers were all on the same ship, so where they boarded seems independent of survival. Age, on the other hand, can go either way. I can either exclude it or perform some form of imputation. However, imputation introduces bias into the dataset, because the filled-in values are guesses rather than observations. We'll weigh our options later. Let's put up some pretty charts and graphs and see what we can learn.
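For anyone following along outside Tableau, here is a minimal Python sketch of the same two ideas: counting Nulls per column, and seeing how a simple imputation changes the spread of Age. The rows in `SAMPLE_CSV` are made up for illustration (they only mimic the shape of the real file), and median imputation is just one possible choice, not necessarily the one I'll end up using.

```python
import csv
import io
import statistics

# Hypothetical rows that mimic the shape of the Titanic CSV; not real records.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22,,S
2,1,1,female,38,C85,C
3,1,3,female,,,S
4,0,3,male,,,Q
5,0,2,male,54,,S
6,1,1,female,4,G6,
"""

def missing_counts(rows):
    """Count empty fields per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value.strip() == "")
    return counts

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
nulls = missing_counts(rows)

ages = [float(r["Age"]) if r["Age"] else None for r in rows]
observed_ages = [a for a in ages if a is not None]
imputed_ages = impute_median(ages)

print("Nulls per column:", nulls)
print("Std dev before imputation:", statistics.pstdev(observed_ages))
print("Std dev after imputation: ", statistics.pstdev(imputed_ages))
```

Every imputed record lands exactly on the median, so the spread of Age shrinks. That is the bias in miniature: the more Nulls we fill in, the more the distribution piles up around a single value.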
Age and Survival
Alright, let's talk about some of these graphs. First up, the Age graph. You'll notice some interesting differences in survival rate across ages. A large majority of those who perished fall in the age buckets between 15 and 30. In addition, we notice a huge number of individuals without an age (Null value for Age). It would be safe to say that performing an imputation on these records would drastically change the distribution of age. We'll get to our options for imputation later. For now, we should try to determine whether Age played a significant role in survival. We can run an ad-hoc chi-squared A/B test to check for statistical significance. Considering the number of age buckets we have, we can select just 2 of them for the test. We'll choose buckets that show a great deal of disparity. How about... age bucket 20 (20 - 25) and bucket 0 (0 - 5).
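The arithmetic behind that kind of test can be sketched in a few lines of Python. This is a plain Pearson chi-squared test on a 2x2 table (no continuity correction), and the counts in `observed` are placeholders I made up to show the mechanics, not the real numbers from the two age buckets.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if bucket and survival were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Placeholder counts, NOT the real Titanic figures.
observed = [
    [30, 15],  # age bucket 0:  [survived, perished]
    [40, 80],  # age bucket 20: [survived, perished]
]

statistic, p_value = chi_squared_2x2(observed)
print(f"chi-squared = {statistic:.3f}, p = {p_value:.5f}")
```

A small p-value (commonly below 0.05) would suggest the difference in survival between the two buckets is unlikely to be down to chance alone. Keep in mind that hand-picking the two most different buckets stacks the deck a bit, so treat the result as a sanity check rather than proof.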
What about our other attributes? Did those have a huge influence on survival rates? It turns out they did! To maximize your chance of survival, the data suggests that a passenger should be:
I wouldn't have expected the departure port to have an influence on survival rates. I suspect the effect is indirect: passenger class may depend on the port of departure, and class is what actually drives survival.
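One quick way to check that suspicion is to look at the class mix at each port. Here is a rough Python sketch; the `(Embarked, Pclass)` pairs are invented for illustration, and `class_mix_by_port` is just a helper name I picked.

```python
from collections import Counter

# Hypothetical (Embarked, Pclass) pairs, NOT real passenger records.
pairs = [
    ("S", 3), ("S", 3), ("S", 2), ("S", 1),
    ("C", 1), ("C", 1), ("C", 3),
    ("Q", 3), ("Q", 3),
]

def class_mix_by_port(pairs):
    """Return, for each port, the share of passengers in each class."""
    pair_counts = Counter(pairs)
    port_totals = Counter(port for port, _ in pairs)
    mix = {}
    for (port, pclass), count in pair_counts.items():
        mix.setdefault(port, {})[pclass] = count / port_totals[port]
    return mix

mix = class_mix_by_port(pairs)
for port in sorted(mix):
    shares = ", ".join(f"class {c}: {share:.0%}" for c, share in sorted(mix[port].items()))
    print(f"{port} -> {shares}")
```

If the shares look very different from port to port in the real data, then Embarked is probably acting as a stand-in for class rather than telling us anything new.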
Next Steps: Machine Learning!
All in all, Tableau has proven to be an excellent data mining tool. Its ability to slice, dice, and visualize data sets with ease is incredible. The learning curve wasn't too bad, and I was able to answer most of my questions about the data set fairly quickly. Where do we go from here? Well, this data set would be a prime candidate for a classification machine learning algorithm; perhaps an ensemble method like a Random Forest. It would be interesting to see just how influential the attributes are relative to each other in determining survival rates. One way to measure that is to calculate the information gain of each attribute.
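As a teaser, here is a minimal Python sketch of an entropy-based information gain calculation. The four rows are a toy example I made up so the numbers come out clean; real data will be much messier, and the function names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Toy rows, NOT real Titanic records.
rows = [
    {"Sex": "female", "Embarked": "S", "Survived": 1},
    {"Sex": "female", "Embarked": "C", "Survived": 1},
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "male", "Embarked": "C", "Survived": 0},
]

for attribute in ("Sex", "Embarked"):
    print(attribute, information_gain(rows, attribute, "Survived"))
```

In this toy set, Sex separates survivors from non-survivors perfectly while Embarked tells us nothing, so the two attributes sit at opposite ends of the scale. Ranking the real attributes this way should show which ones are worth feeding into a model.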