Clean and Actionable Data
When working with data, you need it to be clean and reliable. As the saying goes, “Calculators don’t make mistakes; we just tell them to do the wrong thing.”
If you are collecting data that will be used to make business decisions or build predictive models, you must be able to trust that the data is accurate; if you cannot, the consequences can be disastrous.
NOTE: This article will focus on high-level data-cleaning ideas. Technical steps like removing whitespace, checking for misspellings, enforcing case uniformity, and fixing similar value-specific issues are not covered here.
One thing to remember: it is much easier to prove data invalid than to prove it valid. An age of 130 is clearly an error; knowing whether an age of 34 is correct can be much harder.
In that spirit, here are some simple ways to catch bad data (illustrated together in the sketch after this list).
Define data types - By defining data types in your systems, you can quickly tell whether data is invalid. If age must be an “int” (whole number), that single rule flags many errors, such as letters or decimals.
Create a value bound - Define ranges or accepted inputs. There are 100,000 possible five-digit codes (00000 through 99999), but only 41,704 five-digit ZIP codes exist in the US.
Flag common mistakes - Identify common ways the data may be incorrect and create flags to inform the person or system inputting the information that something is wrong.
Create unique IDs - Assign every record a unique ID number so duplicate entries can be identified and merged.
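To make the four checks concrete, here is a minimal sketch in Python. The field names (age, zip_code), the 0-120 age bound, and the tiny VALID_ZIPS sample are hypothetical stand-ins for your own schema and reference data, not anything prescribed above.

```python
import uuid

# In practice, load the full USPS list; this tiny set is a placeholder.
VALID_ZIPS = {"10001", "60601", "94105"}

def validate_record(record: dict) -> list[str]:
    """Return human-readable flags; an empty list means no issues were found."""
    flags = []

    # Define data types: age must be a whole number, not a string or a float.
    age = record.get("age")
    if not isinstance(age, int):
        flags.append(f"age must be an int, got {type(age).__name__}")
    # Create a value bound: even a well-typed age can be impossible.
    elif not 0 <= age <= 120:
        flags.append(f"age {age} is outside the accepted range 0-120")

    # Accepted inputs: any five digits is not enough; it must be a real ZIP.
    if record.get("zip_code") not in VALID_ZIPS:
        flags.append(f"zip_code {record.get('zip_code')!r} is not a known US ZIP code")

    return flags

def ingest(record: dict, seen_ids: set) -> dict:
    # Create unique IDs so duplicate entries can be identified later.
    record.setdefault("record_id", str(uuid.uuid4()))
    if record["record_id"] in seen_ids:
        record["flags"] = ["duplicate record_id"]
        return record
    seen_ids.add(record["record_id"])
    # Flag common mistakes so the person or system entering the data is told
    # something is wrong, rather than silently accepting or rejecting it.
    record["flags"] = validate_record(record)
    return record
```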
It can also be relatively easy to identify missing data; what to do with it is a little more complex. There are two main approaches: either (1) accept that data may sometimes be missing and ensure your systems can handle it, or (2) use algorithms to predict what the missing data would be. In general, most operations will want the first option.
When looking at option 1, accepting missing data, ensure each system can handle any missing values. For example, a system that requires explicit consent from a user would have to treat any user with that field blank as a negative (consent not given). On the other hand, a variable like ethnicity may not be necessary for the system to operate at all, as in the sketch below.
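A minimal sketch of that idea, assuming hypothetical consent and ethnicity fields on a record dict: missing consent is explicitly treated as negative, while missing ethnicity is simply recorded as unknown.

```python
def has_consent(record: dict) -> bool:
    # A blank or absent consent field must be treated as "consent not given";
    # only an explicit True counts.
    return record.get("consent") is True

def demographics(record: dict) -> dict:
    # Ethnicity is optional: the system keeps working when it is missing,
    # recording "unknown" rather than guessing.
    return {"ethnicity": record.get("ethnicity") or "unknown"}
```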
A quick aside on predicting variables (option 2). Although this can be very useful, it comes with more risk. First, you may make an incorrect prediction, which can cause a number of problems. Second, you risk losing track of which values were “predicted,” which can reduce the accuracy of future predictions (creating self-reinforcing bias). It is easy to assume that “we will just track which values were predicted and which were real”; however, that often leads to operational difficulties and introduces new potential for error.
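If you do take option 2, the minimum safeguard is exactly the tracking described above. A hedged sketch using pandas (an assumption; the article names no specific tooling), with a deliberately simple median fill standing in for a real prediction:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, None, 52, None]})

# Record which rows were missing BEFORE filling them in, so predicted
# values are never mistaken for real ones downstream.
df["age_imputed"] = df["age"].isna()

# A deliberately simple "prediction": fill with the median of observed ages.
df["age"] = df["age"].fillna(df["age"].median())

# Downstream jobs can now exclude imputed rows, e.g. when training a model,
# which helps avoid the self-reinforcing bias described above.
real_only = df[~df["age_imputed"]]
```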
The final step is to make the data actionable. Looking back at ZIP codes, we see there are over 40,000. Having that many options for a single variable can quickly overwhelm systems. Instead, simplify high-cardinality data into a new column; never overwrite data to simplify it. There are many ways to do this, for example, rolling ZIP codes up to the state level. If 50 values are still too many, you could use the Department of the Interior's regional definitions.

You may also want to replace exact values with relative ones. An organization with physical locations may want a “distance to nearest store” value instead. That value should also be grouped, for example, <1 mile, 1-5 miles, >5 miles. Even that system may run into issues, though. In a rural setting, people may be used to driving to the next town over to get things, so 5 miles may not be much at all, whereas in a city that same distance could be too far.
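A sketch of that grouping, again assuming pandas. The tiny ZIP_TO_STATE sample and the mile thresholds are illustrative only; note that both simplifications land in new columns while the originals stay untouched.

```python
import pandas as pd

ZIP_TO_STATE = {"10001": "NY", "60601": "IL", "94105": "CA"}  # tiny sample

df = pd.DataFrame({
    "zip_code": ["10001", "60601", "94105"],
    "miles_to_store": [0.4, 3.2, 11.0],
})

# Derive a coarser geography column; the raw zip_code column is untouched.
df["state"] = df["zip_code"].map(ZIP_TO_STATE)

# Group an exact distance into relative buckets.
df["store_distance"] = pd.cut(
    df["miles_to_store"],
    bins=[0, 1, 5, float("inf")],
    labels=["<1 mile", "1-5 miles", ">5 miles"],
)
```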
These same ideas can be used to break down other high-cardinality variables like age, income, time-based variables, and so on. Breaking a variable into simple, objective groups is one step in the right direction; basing those groups on what is most useful for the specific places where they are needed is a far better solution.
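For instance, the same hypothetical age column can be bucketed two different ways depending on who consumes it:

```python
import pandas as pd

ages = pd.Series([16, 24, 41, 67])

# For a marketing audience, broad generational-style buckets may be enough.
marketing_group = pd.cut(ages, bins=[0, 18, 35, 55, 120],
                         labels=["<18", "18-34", "35-54", "55+"])

# For an eligibility check, only one boundary matters.
is_adult = ages >= 18
```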
In the end, without clean and actionable data, nothing you build on top of it will reach its full potential, and it may even be worse than no data at all. The steps above are a few ways to help ensure the accuracy of stored information, but making sure proper QA steps are taken across the entire organization is far more impactful than any “quick trick.”