While preparing for my matriculation at Northwestern, I decided to get my hands dirty with some more data analytics. Delving deeper into the myriad of techniques, I soon stumbled upon a not-yet industry-standard Data Science/Analytics workflow: OSEMN.
SCRUB - What good is the data if we don't clean it first? Datasets can be nasty, inconsistent beasts, especially when there's direct human involvement (think free-form text). Scrubbing is almost mandatory for any dataset that isn't aggregated automatically, and the duties range from formatting values consistently, to removing or replacing missing data, to rounding numeric values. A useful tool for smaller datasets is Excel; CSV datasets can be manipulated with Python's csv module.
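A minimal sketch of those scrubbing duties with Python's csv module, using made-up data and column names (`name`, `age`, `score` are hypothetical): normalize whitespace and casing, keep missing values explicit, and round the numerics.

```python
import csv
import io

# Hypothetical raw CSV with stray whitespace, inconsistent casing,
# and missing values -- the kind of mess human-entered data produces.
raw = """name,age,score
 Alice ,34,91.267
Bob,,78.5
 carol ,29,
"""

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    cleaned.append({
        "name": row["name"].strip().title(),  # normalize whitespace and casing
        # keep missing entries explicit as None rather than silently guessing
        "age": int(row["age"]) if row["age"].strip() else None,
        "score": round(float(row["score"]), 1) if row["score"].strip() else None,
    })
```

In a real pipeline you would read from a file instead of `io.StringIO`, and the right treatment of missing values (drop, flag, or impute) depends on the question you're asking.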
EXPLORE - I don't think it'd be very useful to run tests on data if you don't have a clue what you're looking for. That being said, the Explore step of OSEMN serves to build one's intuition and general familiarity with the dataset. Just look at the data and get a sense for what kinds of values appear (and don't appear). Does the data fit your mental models, suspicions, and initial predictions? Try plotting it as a distribution or scatter plot to see the general shape of the dataset.
MODEL - Woot! Now the fun really starts. Pick a model, any reasonable model that you think will perform best and answer the questions you came up with earlier. The aim here is to predict. Perhaps an old-school linear regression will get the job done if you're attempting to forecast or identify trends; or maybe a more complicated Support Vector Machine or Random Forest if you're attempting to classify something.
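A minimal sketch of that old-school linear regression, fit by ordinary least squares on a hypothetical trend (the data points are made up for illustration):

```python
# Hypothetical observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x),
# and the fitted line passes through (mean_x, mean_y).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    """Forecast y for a new x using the fitted line."""
    return intercept + slope * x
```

In practice you'd reach for a library (e.g. scikit-learn or statsmodels) rather than hand-rolling the math, but the two-line closed form makes it clear what "fitting a trend" actually computes.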
INTERPRET - Ok, so can we all agree that this last step is not the most intuitive? I'm well aware that the "N" in OSEMN should really be an "I", but then where would the fun be in saying the mnemonic? The Interpret step is where all questions are (hopefully) answered. The OSEMN model stresses that the predictive power of a model is determined by its ability to generalize. So, with a freshly modeled dataset: can it answer the burning questions, or support the hypothesis that induced the model in the first place? This would also be a good time to brush up on those PowerPoint skills to ensure the correct inferences and conclusions are delivered coherently to stakeholders.
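A minimal sketch of checking that ability to generalize: hold out data the model never saw during fitting, and judge it by its error on the held-out points. The data and the simple through-origin model here are hypothetical stand-ins.

```python
# Hypothetical (x, y) observations that roughly follow y = 2x.
data = [(1, 2.1), (2, 4.0), (3, 6.2), (4, 7.9), (5, 10.1), (6, 12.2)]
train, test = data[:4], data[4:]  # hold out the last third

# "Fit" a stand-in model on the training rows only:
# a through-origin slope, slope = sum(y) / sum(x).
slope = sum(y for _, y in train) / sum(x for x, _ in train)

# Interpret via held-out error (mean absolute error), not training fit.
mae = sum(abs(slope * x - y) for x, y in test) / len(test)
print(f"held-out MAE: {mae:.2f}")
```

A model that scores well on the data it was fit to but poorly on the holdout hasn't learned the trend; it has memorized the sample, and its conclusions shouldn't make it into the stakeholder deck.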
Attention to detail? Nah, attention to the whole picture.