Adapted from Blitzstein and Pfister, the Data Science Process offers a defined approach to data, creating a clear and actionable path.
The process has five stages, which are:
1. Identify the question
2. Prepare the data
3. Analyse the data
4. Visualise or Model the insights
5. Present the insights
"Identify the question" may be relaxed to "frame the problem" where circumstances demand it. For example, a scientist may have an enormous set of clinical data for which she has a precise question. The data, however, may not provide any opportunity to test her hypothesis.
Stage 2 is typically the most time-consuming stage, and it is here that Stage 1 is most vulnerable to revision. When working with a new dataset, an analyst must allow time for reflection and investigation. It is important that as many assumptions as possible are tested at this stage.
Simple preparation must include data-verification tasks. For example, analysts must validate:
- that the primary key field has only unique entries
- that the primary key is, in fact, the identifier used throughout the data chain
- that all data sit within their expected range (e.g. scale data from 1-3 should not contain any value <1 or >3)
- that transformations (e.g. means) are valid
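The first and third checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a complete verification suite, and the record layout and field names (`student_id`, `score`) are invented for the example:

```python
from collections import Counter

# Hypothetical exam records; field names are illustrative only.
records = [
    {"student_id": 101, "score": 2},
    {"student_id": 102, "score": 3},
    {"student_id": 103, "score": 1},
]

# 1. The primary key field must contain only unique entries.
ids = [r["student_id"] for r in records]
duplicates = [k for k, n in Counter(ids).items() if n > 1]

# 2. Scale data from 1-3 must not contain any value <1 or >3.
out_of_range = [r for r in records if not 1 <= r["score"] <= 3]

print("duplicate ids:", duplicates)        # -> []
print("out-of-range rows:", out_of_range)  # -> []
```

In a real project the same checks would usually run against the full dataset in whatever tooling the team already uses (SQL constraints, pandas, a validation library); the point is that they run at all, and run early.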
This is simply a starting point. Real-world failures range from different patients with duplicate ID numbers; to exam results linked by student name (which is mutable) rather than by primary key; to means created from scale data; to data-entry errors. The data analyst must be sensitive to where chaos may be at work in their data system.
Knowing how clean your data is is vital. Once the points of error and chaos are identified, a call must be made as to whether they arise at an acceptable rate. In the case of student exam results failing to link, 100% success must be demanded. For mis-assigned scale values it may be possible simply to drop the rule-breakers from analyses, or even to calculate a likely error rate for the data set and apply it to later visualisations or statistics.
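As a sketch of the second option, a "likely error rate" for mis-assigned scale values might be estimated simply as the share of rule-breaking rows. The data and the 1-3 rule below are hypothetical:

```python
# Hypothetical scale responses; values outside 1-3 break the rule.
responses = [1, 2, 3, 2, 7, 1, 0, 3, 2, 2]

rule_breakers = [v for v in responses if not 1 <= v <= 3]
error_rate = len(rule_breakers) / len(responses)

# The alternative option: drop the rule-breakers from further analyses.
clean = [v for v in responses if 1 <= v <= 3]

print(f"error rate: {error_rate:.0%}")  # 2 of 10 -> 20%
```

Either figure, the error rate or the reduced sample, belongs in the analyst's report so that later readers know what the statistics rest on.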
Either way, it is good practice to create an analyst's report, with assumptions listed, records of cleaning undertaken, and calculations of error rates. This report is typically separate from that of Stage 5.
Stages 3 and 4
I would argue that sometimes a visual representation should precede analysis; either way, one can feed the other. Many statistical analyses are aided by a companion visual. Simple visualisations often precede statistics, such as a scattergram or histogram, or the data may be modelled or tested against a model, such as a bell curve or a general linear model.
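As a rough sketch of "a simple visual before the statistics", even an ASCII histogram reveals shape, skew, and outliers before any summary number is computed. The measurements below are invented for illustration:

```python
import statistics
from collections import Counter

# Invented measurement data for illustration.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]

# The quick picture first: the lone 9 stands out immediately.
counts = Counter(data)
for value in range(min(data), max(data) + 1):
    print(f"{value:2d} | {'#' * counts[value]}")

# Only then the summary statistics, read in light of the picture.
print("mean:", statistics.mean(data))
print("stdev:", round(statistics.stdev(data), 2))
```

A mean of 4.5 on its own hides the outlier that the histogram makes obvious, which is exactly why the visual is worth drawing first.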
Present the insights to the customer, or to their intended audience. It is rare that all the data relevant to the analyst's report would be the same as that provided to the client or their audience. The visuals would almost certainly be different, or modified further.
Be ready to be surprised
A data analyst must be willing to be surprised and to iterate through these stages more than once, if required.
When you take on a Technobunnies team you typically employ two or more freelancers, with no one person performing the full cycle described above. It is the requirement to communicate, to be able to tag someone in, that creates opportunities. When forced to communicate assumptions and actions taken, or to determine and communicate the analyses needed, such collaboration creates a climate for challenging preconceptions and highlighting mistakes and assumptions.
This also brings the minimum number of minds touching the data to three. It is this triangle (or more) of brains that provides stability to the insights obtained.
Don’t worry if you are not based in Cape Town or even South Africa. Technobunnies are experienced in joining and building global virtual teams.