When you acquire data for analysis, it is highly unlikely that it will be in a form ready to analyze. Experienced data analysts and statisticians know that data preparation or data cleaning is the most time-consuming part of data analysis but incredibly important, otherwise any conclusions drawn from your analysis will be suspect.

We’ll explore how to clean data by viewing videos from two sources.

First, almost everyone has access to Excel. In a quite useful instructional presentation, Sumit Bansal, Founder of TrumpExcel.com, an online platform to train people in Excel spreadsheets, discusses 10 ways to clean data in Excel using examples.

 As you know, analysis with a statistical package requires each variable to be in a separate column. Sumit shows how to do this using Excel. [See #7]

Here are Sumit’s 10 Ways for Cleaning Data:

1. Get rid of extra spaces

2. Select and treat all blank cells

3. Convert numbers stored as text into numbers

4. Remove duplicates

5. Highlight errors

6. Change text to Lower/Upper/Proper case

7. Use text to columns to parse data in Excel

8. Spell check

9. Delete all formatting

10. Use find and replace

You can see Sumit’s video here.

Now that our data has been cleaned in Excel, what about what’s needed before conducting statistical analysis?

Meredith Rocchi, Assistant Professor in the Department of Communication at the University of Ottawa, gives an online lecture on the basics of data cleaning focusing on more of the statistical preparations needed for data analysis. While she refers to SPSS, the methods used can be implemented in any good statistical software.

Data cleaning involves a series of specific steps. These data cleaning procedures are:

  • Check for impossible or out-of-range values
  • Check for response sets
    • A response set is the tendency to exhibit a particular pattern of response independent of the question being asked in a survey or on a questionnaire
  • Deal with missing data. [NOTE: An earlier ISSSP blog dealt with that topic. We encourage you to follow the suggestions there.]
  • Manage outliers
  • Check for assumptions re: the statistical analysis to be conducted
  • Check for reliability
    • Do items on a question measure the same thing?

Ms. Rocchi discusses how to implement each of the procedures. Again, while she shows how to use SPSS in implementing the approaches, any good statistical software program could be used.

You can view Meredith’s video here.