RION ANGELES

A Very OSEMN Data Science / Data Analytic Workflow

3/15/2015


 
While preparing for my matriculation at Northwestern, I decided to get my hands dirty with some more data analytics. Upon delving deeper into the myriad of techniques, I soon stumbled upon a Data Science/Analytics workflow that is not yet an industry standard.
[Picture: "Such workflow, much OSEMN, very structure."]
My first encounter with the OSEMN (pronounced "awesome") workflow was during a data science research stint. I was bothered by the lack of a standardized and widely accepted workflow pattern for organizing the process of solving problems with Data Science/Analytics. Dataists, in a post by Hilary Mason, describes the first iteration of what is now called the OSEMN workflow.

What is it? Well, it's a fairly straightforward set of steps that a Data Scientist/Analyst would supposedly perform to solve the ubiquitous problems in their lives.

Note: Typing out Scientist/Analyst is proving to be quite tedious. From here on out, I'm coining my own term: "Scanalyst." Ostensibly, this is because the duties of Data Scientists seem to largely overlap with those of Data Analysts, though in truth, it's out of laziness. If you wanna read more on the difference between Data Scientists and Data Analysts, check over here.

OBTAIN - Well, what's a scanalyst to do without data? The first step is to retrieve usable data. Typically, the data source will already be predetermined. Data can be pulled asynchronously, such as by a Python script that periodically pulls data from an online resource, or synchronously, as in the case of a simple SQL query run against a database.
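
For the synchronous case, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the `fetch_rows` helper are all made up for illustration; a real project would connect to an existing database instead of building one in memory.

```python
import sqlite3


def fetch_rows(conn, query, params=()):
    """Run a SQL query and return every matching row as a list of tuples."""
    cursor = conn.execute(query, params)
    return cursor.fetchall()


# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 42.0)],
)
conn.commit()

# The "obtain" step: a simple parameterized query against the database.
rows = fetch_rows(
    conn,
    "SELECT region, amount FROM sales WHERE region = ? ORDER BY amount DESC",
    ("north",),
)
print(rows)
```

Passing the region as a parameter rather than pasting it into the SQL string keeps the query safe from malformed input.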
SCRUB - What good is the data if we don't clean it first? Datasets can be nasty, inconsistent beasts, especially if there's direct human involvement (think free-form text). Scrubbing is almost mandatory for datasets that are not aggregated automatically. Data scrubbing duties can range from formatting values correctly and removing or replacing missing data to converting the data into an acceptable format and rounding numeric values. A useful tool for smaller datasets would be Excel. CSV datasets can be manipulated using Python's built-in csv module.
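
As a rough sketch of those scrubbing duties, the snippet below uses Python's csv module on a tiny, made-up CSV. The column names, the "Unknown" placeholder, and the `scrub_row` helper are my own assumptions, not a prescribed recipe.

```python
import csv
import io

# Made-up messy data: stray whitespace, inconsistent casing, missing values.
RAW_CSV = """name,city,price
 Alice ,chicago,19.999
Bob,,5.5
carol,EVANSTON,
"""


def scrub_row(row):
    """Return a cleaned copy of one CSV row."""
    name = row["name"].strip().title()
    # Replace a missing city with an explicit placeholder.
    city = row["city"].strip().title() or "Unknown"
    # Round prices to two decimals; keep missing prices as None.
    price_text = row["price"].strip()
    price = round(float(price_text), 2) if price_text else None
    return {"name": name, "city": city, "price": price}


reader = csv.DictReader(io.StringIO(RAW_CSV))
clean_rows = [scrub_row(row) for row in reader]

for row in clean_rows:
    print(row)
```

Whether a missing price should become None, zero, or an average is a judgment call that depends on the question being asked.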

EXPLORE - I don't think it'd be very useful to run tests on data if you don't have a clue as to what you're looking for. That being said, the Explore step of OSEMN is for building one's intuition and general familiarity with the dataset. Just look at the data and get a sense of which kinds of values appear and which don't. Does the data fit your mental models, suspicions, and initial predictions? Try plotting it as a histogram or scatter plot to see the general shape of the dataset.
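
A quick way to get that first feel for a column of numbers is a handful of summary statistics plus a crude text histogram. This is only a sketch with made-up values, using the standard library's statistics module and collections.Counter; a plotting library would do the same job more prettily.

```python
import statistics
from collections import Counter

# Made-up sample values standing in for one numeric column.
values = [2, 3, 3, 4, 4, 4, 5, 5, 9]

# Basic summary statistics for the column.
summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}
for key, value in summary.items():
    print(f"{key}: {value}")

# Text histogram: one '#' per occurrence of each value.
counts = Counter(values)
for value in sorted(counts):
    print(f"{value:>2} | {'#' * counts[value]}")
```

Even in this toy sample, the lone 9 stands out from the rest, which is exactly the kind of thing the Explore step is meant to surface.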

MODEL - Woot! Now the fun really starts. Pick any reasonable model that you think will perform best at answering the questions you came up with earlier. The aim here is to predict. Perhaps an old-school linear regression will get the job done if you're attempting to forecast or identify trends, or maybe a more complicated Support Vector Machine or Random Forest if you're attempting to classify something.
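
To make the old-school linear regression concrete, here is a minimal sketch that fits a straight line by ordinary least squares in plain Python. The `fit_line` helper and the monthly sales numbers are made up; in practice a library such as scikit-learn or statsmodels would likely do the fitting.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on y = 2x + 1.
months = [1, 2, 3, 4, 5]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(months, sales)

# Forecast the next month from the fitted trend.
forecast = slope * 6 + intercept
print(slope, intercept, forecast)
```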

INTERPRET - Ok, so can we all agree that this last step is not the most intuitive? I'm well aware that the "N" in OSEMN should really be an "I". But then where would the fun be in saying the mnemonic? The Interpret step is the place where all questions are answered (hopefully). The OSEMN model stresses that the predictive power of a model is determined by its ability to generalize. So with a freshly fitted model, can it answer the burning questions, or support the hypothesis that motivated the model in the first place? This would be a good time to brush up on those PowerPoint skills to ensure that the correct inferences and conclusions are delivered coherently to stakeholders.
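
One common way to check whether a model generalizes is to hold some data back, fit on the rest, and compare the errors. The sketch below does this for a least-squares line in plain Python. The data, the 6/2 split, and the helper names are assumptions for illustration, and holding out the last points is just one of several reasonable splitting choices.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between actual and predicted values."""
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Made-up, slightly noisy observations of a roughly linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Fit on the first six points and hold out the last two.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]

slope, intercept = fit_line(train_x, train_y)
train_error = mean_abs_error(train_x, train_y, slope, intercept)
test_error = mean_abs_error(test_x, test_y, slope, intercept)

print(f"train error: {train_error:.3f}")
print(f"held-out error: {test_error:.3f}")
```

If the held-out error is far worse than the training error, the model has probably memorized the data rather than learned the trend, and any conclusions drawn from it deserve extra skepticism.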