Ah, the data
Data validation is a key safety check that should be carried out throughout the lifecycle of a data project.
Data sets are sensitive objects. That is, a client may not have asked you to sign an NDA, but they typically want you to behave as if you had. So when data analysts want to showcase their skills, they may turn to fake data. I don’t really like using fake data because it is simply an exercise; the fun in data analytics is turning data into information, and the world doesn’t need more fake information.
Verify the data is clean
Interestingly, the comments on Kaggle, where people have pulled this data to perform machine learning, go along the lines of: “the length of service field can be dropped because it can be calculated from termination date – employment date”. Indeed it is redundant information, but what if there is a data entry error? In the real world of real data there will be errors. Doubled-up data such as this is a wonderful way both to identify potential errors and to estimate the proportion of “cleanliness” of your data. In the real world of data entry any one, two, or all three of those values may be wrong (with decreasing likelihood as the number of simultaneous errors rises).
There are some immediate checks I would perform from a row perspective (i.e. comparing data within each row). There are a few ways to play with the data: it came as a .csv, so I pulled it into Excel. Equally, we may use PowerBI to query and visualise the data, and I will do both to demonstrate.
Data Validation 1: Is termination date always greater than employment date?
I ran this as a simple IF statement in Excel, along the lines of =IF(TERMINATIONDATE>ORIGINALHIREDATE,"","oops").
This statement is saying: if TERMINATIONDATE is greater than ORIGINALHIREDATE then do nothing; if not, enter the word “oops”.
Any errors would show up as “oops” in the ValidatedDates column.
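For those who prefer scripting, the same row-level check can be sketched in Python with pandas. This is not the tool used above, just a minimal equivalent; the column names and sample values here are assumptions, so adjust them to your actual headers.

```python
import pandas as pd

# Tiny made-up sample; the second row contains a deliberate date error.
df = pd.DataFrame({
    "EMPLOYEEID": [1001, 1002],
    "ORIGINALHIREDATE": pd.to_datetime(["1995-02-01", "2001-06-15"]),
    "TERMINATIONDATE": pd.to_datetime(["2010-12-30", "1999-01-01"]),
})

# Flag rows where termination precedes hire, mirroring the IF/"oops" column.
df["ValidatedDates"] = df.apply(
    lambda r: "" if r["TERMINATIONDATE"] > r["ORIGINALHIREDATE"] else "oops",
    axis=1,
)
print(df["ValidatedDates"].tolist())  # → ['', 'oops']
```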
So I took a quick browse through that column and our fake data-entry person is fantastic: so far, no errors. We can even begin to do our job as data analysts and report. The simplest technique is to start a dashboard or reporting page in the same Excel workbook. Using another simple formula, this time a COUNTIF, I can put a figure on how clean my data is.
My COUNTIF looks to see how many times the word “oops” appeared (0), that is, how many times there was a data error above. Such warning systems can be very useful for raising flags when collating live data.
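The same tally can be sketched in plain Python, assuming a validation column like the one above (the sample values are invented for illustration):

```python
# Equivalent of the COUNTIF: count how many validation flags were raised
# and turn that into a rough "cleanliness" percentage for the dashboard.
validated = ["", "", "oops", ""]  # hypothetical ValidatedDates column
error_count = validated.count("oops")
clean_pct = 100 * (1 - error_count / len(validated))
print(error_count, clean_pct)  # → 1 75.0
```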
It is typical in the workplace that at least some data verification happens at the user-entry stage. That is, our fake data-entry person would input the data through a form that would not allow the record to be saved if the termination date was earlier than the employment date. This would not prevent all errors from creeping into the data, but it catches some.
Data Validation 2: Does (termination date – employment date) = length of service?
As I was playing with the data I noticed something that I had not expected: the Employee Key was duplicated. I can see no logical reason for this. The only piece of data unique to each occasion was the RECORDDATE_KEY, i.e. this fake data had a fake entry on different dates, and LengthOfService was calculated at the time of RECORDDATE, meaning older records give out-of-date “lengths of service”. To progress further we need to identify the most useful entry. I chose the most recent entry to be my unique record.
I did this in Excel again, having checked that the data was sorted by (column A) EMPLOYEEID and by RECORDDATE_KEY, I ran another IF from a row context:
This gives me the most recent entry for that employee. (I actually made it a little more complicated to account for any instance where there were no duplicate entries for column A EmployeeID)
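The same “keep the most recent record per employee” step can be sketched in pandas. Again this is a sketch rather than the Excel formula used above, and the column names and values are assumed:

```python
import pandas as pd

# Invented sample: employee 1001 has two records on different record dates.
df = pd.DataFrame({
    "EMPLOYEEID": [1001, 1001, 1002],
    "RECORDDATE_KEY": pd.to_datetime(["2014-12-31", "2015-12-31", "2015-12-31"]),
    "length_of_service": [9, 10, 4],
})

# Sort by employee then record date, and keep the last (most recent) row
# per employee - the record with the up-to-date length of service.
latest = (
    df.sort_values(["EMPLOYEEID", "RECORDDATE_KEY"])
      .groupby("EMPLOYEEID")
      .tail(1)
)
print(latest["length_of_service"].tolist())  # → [10, 4]
```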
So, fun as it is to play in Excel, there are many ways to de-hair those felines. I went into PowerBI to run a query on the data. To answer the question: “Does calculated length of service = field given length of service?” I produced a simple IF statement in PowerBI. Note the syntax has changed slightly. One of the great features of PowerBI is that the editor identifies syntax errors as you write them.
if [length_of_service] = [LengthOfService_FromCalc] then "" else 1
This statement is only going to pick up those rows identified as the most recent records. It enters a 1 into the column in the instances where the two values for length of service are not the same.
One of the downsides to writing your IF statements in PowerBI is that they are typically implemented as you add calculated columns, and they become part of a long list that is your query. I created LengthOfService_FromCalc in a couple of steps:
- from [terminationdate_key]-[orighiredate_key]
- then I duplicated that column and set the data type to duration
- then I set the number type to whole number, giving me the number of years in service, rounded up or down
This is much quicker than similar fiddling would take in Excel, but your steps can easily be hidden in the Applied Steps of your query list under names such as “Add Custom”, making error checking, and re-waking your brain to previous bright ideas, so much harder. To be able to find your work again, name each of your applied steps in a PowerBI query with as much diligence as if you were coding and writing a function name.
Also, the job of reporting gets left until after I have loaded my data through the query I am setting up. I therefore need to keep a running list of the data verification reports I need to do:
|Mental note 1)- add VerifyDataLengthOfService to dashboard and verify count = 0|
Whereas if I were building my dash in the same workbook I would create a CountIF or set up a graph/pivot graph on my validation columns immediately (and see the result as I make it, which keeps my brain constantly aware of how clean this data is).
I will stay in PowerBI to continue reporting on this data, but I cannot come down on one side or the other. I like Excel for some jobs, and I like using the Data Model in Excel or PowerBI for others. I often combine the two.
Most data analysis starts here: doing quick on-the-fly checks like the two demonstrated above, either to verify the data or to quantify the error rates within your data set. Keep enjoying your tools, data lovers – they give us our power!