Data cleaning for analysis

When cleaning data for analysis, you want to focus on accuracy without ever forgetting practicality. Practicality matters, but we cannot do anything that allows our final results to become inaccurate.

A quick aside on predicting variables (option 2). Although this can be very useful, it comes with more risk. First, you may make an incorrect prediction, and that error then carries into everything built on the value. Second, you risk losing track of which values were “predicted,” which can reduce the accuracy of future predictions: once predicted values are treated as real observations, later models learn from earlier guesses, creating a self-reinforcing bias. It is easy to assume that “we will just track which values were predicted and which were real”; however, that often leads to operational difficulties and introduces a new potential for error.
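If you do choose to predict values, one way to keep them distinguishable is to store a flag next to each filled value. The sketch below is only an illustration: the function names, the "income" field, and the "_was_predicted" suffix are all hypothetical, and the median of the observed values stands in for whatever prediction method you actually use. The caution above still applies, because the flag only helps if it survives every export, join, and copy of the data.

```python
from statistics import median


def fill_missing_with_flag(rows, field):
    """Fill missing values of `field` and record which rows were filled.

    Assumption: the median of the observed values stands in for a real
    predictive model. The input rows are not modified.
    """
    flag = f"{field}_was_predicted"
    observed = [r[field] for r in rows if r[field] is not None]
    fallback = median(observed)
    filled = []
    for r in rows:
        missing = r[field] is None
        filled.append({**r, field: fallback if missing else r[field], flag: missing})
    return filled


def observed_values(rows, field):
    """Return only the values that were actually collected, not predicted."""
    return [r[field] for r in rows if not r[f"{field}_was_predicted"]]


# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000},
]
filled = fill_missing_with_flag(records, "income")
```

Any later prediction or summary statistic can then be computed from observed_values(filled, "income") rather than from the filled column, so earlier guesses are not fed back in as if they were real.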

The final step is to make the data actionable. Looking back at zip codes, we see there are over 40,000. Having that many options for a single variable could quickly overwhelm systems. Instead, simplify high-cardinality data (variables with many distinct values) in a new column. Never overwrite data to simplify it. There are many ways to do this; one is to group zip codes by state. If 50 is still too many, you could use the Department of the Interior's regional definitions. You may also want to replace exact values with relative ones. If an organization has physical locations, it may want to use a “distance to nearest store” value instead. This value should also be grouped, for example, <1 mile, 1-5 miles, >5 miles. However, even that system may run into issues. In a rural setting, people may be used to driving to the next town over to get things, so 5 miles may not be much at all, whereas in a city that same distance could be too far.
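As a minimal sketch of adding simplified columns without overwriting the originals, the example below derives a state and a distance group for each record. The field names ("zip", "miles_to_store"), the function names, and the three-entry prefix table are assumptions made for illustration; a real lookup would cover every zip code prefix. Zip codes are assumed to be stored as strings so that leading zeros are preserved.

```python
# Hypothetical lookup: first three zip code digits -> state.
# Only three prefixes are listed here; a real table would be complete.
ZIP3_TO_STATE = {"100": "NY", "606": "IL", "902": "CA"}


def state_from_zip(zip_code):
    """Return the state for a zip code string, or "UNKNOWN" if the prefix is not in the table."""
    return ZIP3_TO_STATE.get(zip_code.strip()[:3], "UNKNOWN")


def distance_bucket(miles):
    """Group a distance in miles into the three example buckets from the text."""
    if miles < 1:
        return "<1 mile"
    if miles <= 5:
        return "1-5 miles"
    return ">5 miles"


def add_derived_columns(rows):
    """Return new rows with the grouped columns added; original fields are kept as-is."""
    return [
        {
            **row,
            "state": state_from_zip(row["zip"]),
            "distance_group": distance_bucket(row["miles_to_store"]),
        }
        for row in rows
    ]


customers = [
    {"zip": "10001", "miles_to_store": 0.4},
    {"zip": "90210", "miles_to_store": 7.5},
]
enriched = add_derived_columns(customers)
```

Because the exact zip code and distance stay in the record, the groups can be redefined later, for example with different distance cutoffs for rural and urban customers, without losing information.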

These same ideas can be used to break down other variables with many possible values, such as age, income, and time-based variables. Breaking any variable down into simple groups is a step in the right direction. Defining those groups around the specific situations where they will be used is a far better solution.
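One way to picture use-case-specific grouping is a single binning helper that takes its cutoffs as input, so the same raw column can be grouped differently for each purpose. This is a sketch under assumptions: bin_value, the two groupings, and their cutoffs are hypothetical examples, not recommended definitions.

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Return the label of the bin that contains `value`.

    `edges` holds the ascending lower bounds of every bin after the first,
    so there must be exactly one more label than there are edges.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Hypothetical groupings for two different uses of the same age column.
MARKETING_AGE = ([18, 35, 55], ["<18", "18-34", "35-54", "55+"])
ELIGIBILITY_AGE = ([21], ["under 21", "21 or older"])

ages = [17, 21, 42, 70]
marketing_groups = [bin_value(a, *MARKETING_AGE) for a in ages]
eligibility_groups = [bin_value(a, *ELIGIBILITY_AGE) for a in ages]
```

Each grouping would be written to its own new column, leaving the raw age in place.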

In the end, without clean and actionable data, nothing you build will achieve its maximum potential, and the result may even be worse than having no data at all. The steps above are a few ways to ensure the accuracy of stored information, but making sure proper QA steps are taken by the entire organization is far more impactful than any “quick trick.”