How to clean up dirty data
Dirty data
Typically, dirty data is when you have extra spaces in cells, duplicates or names that have been spelled differently across data entries. Cleaning up your data helps make sure that everything is categorised correctly so you can make better sense of it.
Removing extra spaces
Extra spaces at the beginning or end of any cell can be particularly problematic when working with data: you can’t see them, but the computer can. And if it’s trying to match two pieces of data, for example customer names to sales invoice data, it will not match them if the region name has a space after it in one of the entries.
Tip. Use the TRIM function in Excel to remove any spaces at the beginning or end of a cell. For example, =TRIM(A2). You can copy this formula down an entire column to apply it to all the cells in that column.
Find and Replace
One of the best tricks to clean up data is the Find and Replace function in Excel. The feature allows you to find a specific word or character and replace it with a new one. This is particularly useful when different spelling variations appear for the same word such as “United Kingdom” and “UK” or “Limited” and “Ltd”.
Tip 1. You can find this function under Edit/Replace. Alternatively, you can use the keyboard shortcut Control+F and click “Replace”. Type the word you are looking to replace in the top entry field and the word you want to replace it with in the bottom entry field and click “Replace All”.
Tip 2. The Replace function is also an easy way to delete extra spaces in cells. Just add a space in the top entry field and nothing in the bottom entry field and click “Replace all”. All extra spaces in your selected range will be deleted.
Deleting duplicates
If you want a dataset with unique rows, you will need to look for duplicates.
Step 1. Sort your column alphabetically. Select a column, go to Data/Sort, then select the column you want to organise and select “A to Z”.
Step 2. Once everything is ordered alphabetically by the column you want to check, you can use the IF formula to highlight duplicates. The way you do it is by comparing one cell to the one above. The IF formula works like this:
Excel formula:
=IF(logical_text, [value_if_true],[value_if_false])
For example. You want to ensure that column C doesn’t contain duplicates. Start by comparing cell C3 to cell C2 to see if they are the same, this is the logical test. If they are the same we want the cell we are using to check for duplicates to write Duplicate (value_if_true) and if they are not the same (value_if_false), the cell should just be blank. So, your formula will look like this:
=IF(C3=C2,”Duplicate”,””)
Step 3. Use the paste special/values for the check column with the formula, then you sort the column alphabetically, so all the duplicates congregate before you highlight the rows and delete them.