SPREADSHEETS - 03.06.2020

How to clean up dirty data

One of the biggest hurdles to overcome when trying to combine data is having to clean up data that has, for example, extra spaces in cells or numerous duplicates, i.e. it’s “dirty”. How can Excel help you speed up the process?

Dirty data

Typically, dirty data is when you have extra spaces in cells, duplicates or names that have been spelled differently across data entries. Cleaning up your data helps make sure that everything is categorised correctly so you can make better sense of it.

Removing extra spaces

Extra spaces at the beginning or end of any cell can be particularly problematic when working with data: you can’t see them, but the computer can. And if it’s trying to match two pieces of data, for example customer names to sales invoice data, it will not match them if the region name has a space after it in one of the entries.

Tip. Use the TRIM function in Excel to remove any spaces at the beginning or end of a cell. For example, =TRIM(A2). You can copy this formula down an entire column to apply it to all the cells in that column.

Find and Replace

One of the best tricks to clean up data is the Find and Replace function in Excel. The feature allows you to find a specific word or character and replace it with a new one. This is particularly useful when different spelling variations appear for the same word such as “United Kingdom” and “UK” or “Limited” and “Ltd”.

Tip 1. You can find this function under Edit/Replace. Alternatively, you can use the keyboard shortcut Control+F and click “Replace”. Type the word you are looking to replace in the top entry field and the word you want to replace it with in the bottom entry field and click “Replace All”.

Tip 2. The Replace function is also an easy way to delete extra spaces in cells. Just add a space in the top entry field and nothing in the bottom entry field and click “Replace all”. All extra spaces in your selected range will be deleted.

Deleting duplicates

If you want a dataset with unique rows, you will need to look for duplicates.

Step 1. Sort your column alphabetically. Select a column, go to Data/Sort, then select the column you want to organise and select “A to Z”.

Step 2. Once everything is ordered alphabetically by the column you want to check, you can use the IF formula to highlight duplicates. The way you do it is by comparing one cell to the one above. The IF formula works like this:

Excel formula:

=IF(logical_text, [value_if_true],[value_if_false])

For example. You want to ensure that column C doesn’t contain duplicates. Start by comparing cell C3 to cell C2 to see if they are the same, this is the logical test. If they are the same we want the cell we are using to check for duplicates to write Duplicate (value_if_true) and if they are not the same (value_if_false), the cell should just be blank. So, your formula will look like this:

=IF(C3=C2,”Duplicate”,””)

Step 3. Use the paste special/values for the check column with the formula, then you sort the column alphabetically, so all the duplicates congregate before you highlight the rows and delete them.

Use Excel’s TRIM function to remove extra spaces at the beginning and end of cells. Excel’s Find and Replace function allows you to replace misspelled words and you can use the IF function to help find and delete duplicate entries.

© Indicator - FL Memo Ltd

Tel.: (01233) 653500 • Fax: (01233) 647100

subscriptions@indicator-flm.co.ukwww.indicator-flm.co.uk

Calgarth House, 39-41 Bank Street, Ashford, Kent TN23 1DQ

VAT GB 726 598 394 • Registered in England • Company Registration No. 3599719