Sooner or later you will encounter missing data in your analyses. Should you ignore it? Should you include it? What is the best way to deal with missing data?
In a series of three excellent videos, Ritvik Kharkar, a former software engineer and data analyst and currently a master’s student in statistics at UCLA, discusses ways to handle missing data in analysis.
In his first video, Ritvik discusses the three major mechanisms that lead to missing data: how and why data goes missing. This is important because knowing the reason your data is missing can suggest how to deal with it in your analysis.
- Missing Completely at Random (MCAR): Each data point has the same chance of being missing, independent of anything else
- Missing at Random (MAR): The chance that a data point is missing depends on some other variable in the data
- Missing Not at Random (MNAR): The chance that a value is missing is related to the values of that same variable
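To make the three mechanisms concrete, here is a minimal simulation sketch. The variables `age` and `income`, the sample size, and the missing-data probabilities are all made up for illustration and do not come from Ritvik’s videos.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical variables: income loosely increases with age
age = rng.integers(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of a missing income depends on another variable (age)
mar_mask = rng.random(n) < np.where(age < 40, 0.35, 0.05)

# MNAR: the chance of a missing income depends on income itself
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.35, 0.05)

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)

print(f"true mean: {income.mean():.0f}")
for name, col in [("MCAR", income_mcar), ("MAR", income_mar), ("MNAR", income_mnar)]:
    rate = np.isnan(col).mean()
    print(f"{name}: missing rate {rate:.2f}, observed mean {np.nanmean(col):.0f}")
```

In this toy setup the observed mean stays close to the true mean under MCAR, but drifts away from it under MAR and MNAR, which is why the mechanism matters when choosing how to handle the gaps.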
One way to tell MCAR from MAR is to compare the rate of missing data across the other variables in your data set, for example by plotting the missing rate against each of them. If the rate of missing data is about the same regardless of the variable you plot against, the MCAR mechanism is likely at work. If it changes with the values of some other variable, the MAR mechanism is likely in play.
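The check described above can be sketched with pandas by computing the missing rate of one column within bins of another. The data frame, the column names, and the helper `missing_rate_by` are hypothetical, and the data is simulated so that `income` goes missing more often for younger respondents.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000

df = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "score": rng.normal(50, 10, size=n),  # unrelated to missingness
})
df["income"] = 20_000 + 800 * df["age"] + rng.normal(0, 10_000, size=n)

# Simulated MAR: income is more likely to be missing when age < 40
drop = rng.random(n) < np.where(df["age"] < 40, 0.35, 0.05)
df.loc[drop, "income"] = np.nan


def missing_rate_by(frame, target, by, bins=5):
    """Share of missing `target` values within equal-sized bins of `by`."""
    groups = pd.qcut(frame[by], q=bins, duplicates="drop")
    return frame[target].isna().groupby(groups, observed=True).mean()


rate_by_age = missing_rate_by(df, "income", "age")
rate_by_score = missing_rate_by(df, "income", "score")
print(rate_by_age)
print(rate_by_score)
```

Here the missing rate changes sharply across the age bins but stays roughly flat across the score bins, which points toward MAR driven by age. A bar plot of either result (for example `rate_by_age.plot.bar()`) gives the visual version of the same check.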
The MNAR case is much harder to determine. You can find various ways to assess this mechanism by searching for Missing Data Not at Random on the internet.
Now let’s use our knowledge of why data is missing to determine how to handle it in our analysis.
In his second video, Ritvik discusses three single imputation methods, where single imputation means that each missing data value is replaced with a single data value.
- Row Deletion: Delete every row that has a missing value
  - Pro: Simple to do
  - Con: Only use for the Missing Completely at Random (MCAR) mechanism. Otherwise this method will introduce bias into your dataset
- Mean/Median Imputation: Fill in missing data with the mean or median of all the data values you do have
  - Pro: Simple to do
  - Con: Artificially reduces the variability in your dataset if there are many missing values
- Hot Deck Methods: Any method where you fill in the missing values with values taken from cases that are similar to the case with the missing value. For example, use the average of the values you do have for cases that are similar
  - Pro: Takes more information into account when determining the value to impute
  - Con: More computationally expensive
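The three methods above can be sketched in a few lines of pandas. The toy data frame and its `group` and `value` columns are invented for illustration, and "similar cases" is simplified here to mean rows in the same group, which is only one of many ways to define similarity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups with one missing value each
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, np.nan, 10.0, np.nan, 14.0],
})

# Row deletion: drop every row with a missing value
deleted = df.dropna(subset=["value"])

# Mean imputation: fill every gap with the overall mean (6.75 here)
mean_filled = df["value"].fillna(df["value"].mean())

# Hot deck style: fill each gap with the mean of the observed values
# from similar cases, here rows in the same group (1.5 for a, 12.0 for b)
group_means = df.groupby("group")["value"].transform("mean")
group_filled = df["value"].fillna(group_means)

print(deleted)
print(mean_filled.tolist())
print(group_filled.tolist())
print("std before:", df["value"].std(), "after mean fill:", mean_filled.std())
```

The last line shows the drawback of mean imputation noted in the list: the filled column has a smaller standard deviation than the observed values alone.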
In his third video, Ritvik covers more powerful multiple imputation methods, where multiple refers to determining several candidate values for each missing data value and taking the average of those to replace it. There are many different ways to carry out the required analysis. Ritvik illustrates one. You can find others by searching for Multiple Imputation on the internet.
- Pro: Less biased
- Con: Complex
One approach is to use a single imputation method to generate several complete data sets, then aggregate the results and determine the spread.
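A simplified sketch of that approach follows. It is not the method from the video: the data is simulated, the helper `impute_once` is hypothetical, and the single imputation step is assumed to be a random draw from the observed values, so that each completed data set comes out different.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 values, roughly 25% removed completely at random
values = rng.normal(100, 15, size=200)
missing = rng.random(200) < 0.25
observed = values[~missing]


def impute_once(rng):
    """Return one completed copy: each gap gets a randomly drawn observed value."""
    completed = values.copy()
    completed[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return completed


# Generate several completed data sets and compute the statistic on each
m = 20
estimates = np.array([impute_once(rng).mean() for _ in range(m)])

# Aggregate the results and look at how much they vary across imputations
pooled = estimates.mean()
spread = estimates.std(ddof=1)
print(f"pooled mean: {pooled:.2f}, spread across imputations: {spread:.2f}")
```

The spread across the completed data sets gives a rough sense of how much uncertainty the missing values add. A full multiple imputation analysis would also combine this with the uncertainty within each data set, which this sketch leaves out.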
You can view Ritvik’s Missing Data videos here:
Ritvik’s YouTube channel includes over seven years of highly instructional videos in math, statistics, and data analytics. You can access Ritvik’s channel here.