I know what you are thinking: when is data too clean? And who wants dirty data?
If you work with data on a regular basis, you know that all too often data is a mess. It may be missing, contain errors, or be too outdated to be useful. Generally, we strive to clean our data and maintain that cleanliness, but sometimes you need imperfect data for testing.
In this case, our consultants were evaluating open-source data profiling tools and needed sample data to run reports against. We located a sample data file that fit our use case, but upon inspection the data was too clean. To evaluate our selected profiling tools, we wanted to demonstrate how each would report on dirty data. In our use case, 'dirty data' meant unexpected null values in an assortment of fields and duplicate values for unique identifiers.
Rather than manually editing our data file to add nulls and duplicates, I decided to write a script to randomize the process using Python and pandas. For this use case, I am using Python 3.9, and our data file is a CSV.
Let's walk through the script!
Set Up:
First, a few variables need to be defined.
- FILE_NAME: (string) name of the file to read from
- FIELDS_TO_ADD_NULLS: (list) the field names we will add nulls to
- FIELDS_TO_CREATE_DUPS: (list) the field names we will create duplicate values in
- MAX_NUM_OF_NULL_ROWS: (int) the maximum number of rows we want to add null values to
- MAX_NUM_OF_DUP_ROWS: (int) the maximum number of rows we want to create duplicate values in
Note: The field names and maximum numbers of rows could also be randomized, but I chose to control these definitions.
import random
import pandas as pd
FILE_NAME = 'BankCustomers.csv'
FIELDS_TO_ADD_NULLS = ['credit_limit','customer_age','income_category']
FIELDS_TO_CREATE_DUPS = ['clientnum']
MAX_NUM_OF_NULL_ROWS = 20
MAX_NUM_OF_DUP_ROWS = 5
The provided data file is read into a pandas DataFrame, and one more variable is defined as a range() object representing all row indexes in the DataFrame.
df = pd.read_csv(FILE_NAME, dtype=object)
RANGE_OF_INDEXES = range(len(df))
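Optionally, a quick sanity check like the following can confirm the file loaded as expected (the row count and columns you see will depend on your own data file):
print(len(df))    # total number of rows read from the CSV
print(df.head())  # preview the first five rows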
Function for Randomization:
An integral piece of this script is the sample() function from Python's random module. random.sample() randomly selects elements from a provided sequence, without repeating items, and returns them in a new list. For this script, sample() will be used to create a list of random row indexes that will then be used to update the DataFrame.
random.sample() requires two arguments:
- a Python sequence (in our case, the range object created above)
- an integer that determines the length of the returned list
For both the creation of nulls and the creation of duplicates, the first argument passed to sample() will be the RANGE_OF_INDEXES variable created above. The second argument will be a random integer between 2 and the respective maximum-rows variable (MAX_NUM_OF_NULL_ROWS or MAX_NUM_OF_DUP_ROWS). This randomizes both which row indexes are selected and how many rows are edited.
random.sample(RANGE_OF_INDEXES, random.randint(2, MAX_NUM_OF_NULL_ROWS))
random.sample(RANGE_OF_INDEXES, random.randint(2, MAX_NUM_OF_DUP_ROWS))
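As a small illustration on a toy range (not our DataFrame), a call like this returns a new list of unique picks; the exact values will differ on every run:
# Illustrative only: pick between 2 and 5 distinct values from 0-9
example_indexes = random.sample(range(10), random.randint(2, 5))
print(example_indexes)  # e.g. [7, 0, 4]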
We will see this function in action in the next two sections.
Add NULL values:
Now it's time to add the nulls to our data! A dictionary is created by looping through each field name in our FIELDS_TO_ADD_NULLS list, with the key as the field name and the value as a list generated by calling the sample() function as described above.
dict_of_null_indexes = {}
for field_name in FIELDS_TO_ADD_NULLS:
    dict_of_null_indexes[field_name] = random.sample(RANGE_OF_INDEXES, random.randint(2, MAX_NUM_OF_NULL_ROWS))
The resulting dictionary looks like this:
{
    'credit_limit': [8977, 3334, 7094, 2138, 7133, 8824, 8108, 10047],
    'customer_age': [7836, 357],
    'income_category': [8942, 7832, 9131, 3913, 766, 3713]
}
For each key in the dictionary, the script loops through the list of row indexes and updates the associated cell in the DataFrame to None.
for key in dict_of_null_indexes.keys():
    for num in dict_of_null_indexes[key]:
        df.at[num, key] = None
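If you would like to verify the nulls landed, an optional check with pandas counts the missing values in each target field:
# Optional check: count of nulls now present in each target field
print(df[FIELDS_TO_ADD_NULLS].isna().sum())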
Add duplicate values:
Finally, the duplicates will be added. Again, a dictionary is created, but this time it is a nested dictionary with additional key:value pairs. Each field name in FIELDS_TO_CREATE_DUPS becomes a key, and its value is a dictionary with the keys row_list and value_for_duplicates.
The row_list value is a list generated by calling the sample() function. This is the list of row indexes that will be updated to all contain the same value, producing our duplicates.
The value_for_duplicates value is the data value to be duplicated across the row indexes in row_list. To guarantee a valid value, the value is taken from the first row index in row_list.
dict_of_dup_indexes = {}
for field_name in FIELDS_TO_CREATE_DUPS:
    dict_of_dup_indexes[field_name] = {}
    dict_of_dup_indexes[field_name]['row_list'] = random.sample(RANGE_OF_INDEXES, random.randint(2, MAX_NUM_OF_DUP_ROWS))
    dict_of_dup_indexes[field_name]['value_for_duplicates'] = df.at[dict_of_dup_indexes[field_name]['row_list'][0], field_name]
The resulting dictionary looks like this:
{
    'clientnum': {
        'row_list': [6371, 8765, 7525, 8855],
        'value_for_duplicates': 788440083
    }
}
For each key in the nested dictionary, the script loops through each row index in row_list (excluding the first), updating the DataFrame value at that location to the value in value_for_duplicates.
For example, using the dictionary above, clientnum will be updated to the value 788440083 at row indexes 8765, 7525, and 8855. The value at index 6371 is already correct since that is where value_for_duplicates was pulled from.
for key in dict_of_dup_indexes.keys():
    for num in dict_of_dup_indexes[key]['row_list'][1:]:
        df.at[num, key] = dict_of_dup_indexes[key]['value_for_duplicates']
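As with the nulls, an optional check can confirm the duplicates exist by counting repeated values in each target field:
# Optional check: number of rows sharing a duplicated value in each field
for field_name in FIELDS_TO_CREATE_DUPS:
    print(field_name, df[field_name].duplicated(keep=False).sum())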
Write new file:
All updates are now complete! All that is left is to write the updated DataFrame out to a new file.
df.to_csv(f'new_{FILE_NAME}', index=False)
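For one last optional confirmation, the new file can be read back and the same checks re-run against it (your counts will vary since the edits are random):
# Optional: read the new file back and confirm the dirty data made it to disk
new_df = pd.read_csv(f'new_{FILE_NAME}', dtype=object)
print(new_df[FIELDS_TO_ADD_NULLS].isna().sum())
for field_name in FIELDS_TO_CREATE_DUPS:
    print(field_name, new_df[field_name].duplicated(keep=False).sum())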
Now we have a new sample data file that includes assorted null values and duplicates in a column that should be a unique identifier.
As I stated before, no one wants dirty data, but it can be necessary for testing code functionality or a new tool. I hope you had fun reviewing a way to muck up your data!