Replacing Missing Data: How does it work?

Problems with incomplete answers or missing data

In a survey we usually get missing values for questions that respondents preferred not to answer, did not know how to answer, or skipped by mistake. For researchers who invested time and money in collecting the data, a missing value sometimes means that the whole questionnaire is useless. However, there are several advanced methods for dealing with missing data. They fall into two groups: model-based methods and data-based methods.

The first group allows statistical models to be run in the presence of missing values, using the Full-Information-Maximum-Likelihood technique. It is strongly supported by professionals because it does not require the missing data to be physically completed. Instead, the method takes the missing data into account when calculating the variance/covariance matrix. However, because the estimation algorithm is complex, models that use this technique sometimes fail to converge, so that coefficients are not estimated, especially when the sample size is small. The second group provides techniques that physically replace missing data and is explained further below.

A key to using either group is a prior understanding of the missing-data pattern. That is, we need to know whether missing values are distributed evenly across all observed items (variables) and records (observations). For example, survey respondents with very high or very low incomes tend not to report their exact income, so the percentage of missing values is higher among participants in the lowest and highest income percentiles than among the rest. Regardless of the method we use to deal with such missing data, the estimates are at risk of bias. In terms of missing-value patterns, these data are not missing completely at random, and not even missing at random. We can check whether the pattern is completely random with a suitable statistical test (e.g. Little's Missing Completely at Random test).

Treating missing data as if they were uniformly distributed across the dataset would probably yield understated variance and wrong estimates. Instead, returning to the income example, the two ends of the distribution should be separated from the rest, and the completion of missing data should be run separately for each part.

In other words, there are three patterns. Missing completely at random means that there is no systematic association between missingness and either the observed or the missing values, so that the probability of finding a missing value is uniform across all information items in the data. Missing at random, but not completely at random, means that missingness may be correlated with the observed structure of the data, yet, once that structure is taken into account, it is unrelated to the values that are missing. Finally, missing not at random is the case described above for income, where the probability that a value is missing depends on the value itself. Only when the missing data are at least missing at random, if not missing completely at random, can we replace missing data on the full dataset. Note that a test such as Little's addresses only the missing completely at random assumption; whether data are missing at random rather than not at random cannot be settled from the observed data alone and has to be argued from knowledge of the survey. All of this holds for both groups of methods.
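The three patterns can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the variables (age, income), the income formula, and the missing rates are all hypothetical and were chosen so that the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical survey: age is always observed, income may be missing.
age = rng.uniform(20, 70, n)
income = 1000 + 40 * age + rng.normal(0, 500, n)

# Flag the lowest and highest 10% of incomes (the "two ends").
low, high = np.quantile(income, [0.10, 0.90])
in_tails = (income < low) | (income > high)

# Probability that each respondent's income is missing, per pattern.
p_missing = {
    "MCAR": np.full(n, 0.20),                # same chance for everyone
    "MAR": np.where(age > 50, 0.40, 0.10),   # depends on observed age only
    "MNAR": np.where(in_tails, 0.60, 0.10),  # depends on income itself
}

true_mean, true_sd = income.mean(), income.std()
observed_mean, observed_sd = {}, {}
for name, p in p_missing.items():
    missing = rng.random(n) < p
    observed_mean[name] = income[~missing].mean()
    observed_sd[name] = income[~missing].std()
    print(f"{name}: mean {observed_mean[name]:.0f} (true {true_mean:.0f}), "
          f"sd {observed_sd[name]:.0f} (true {true_sd:.0f})")
```

In this setup one would expect the completely-at-random case to reproduce the true mean and standard deviation closely, the at-random case to shift the observed mean in a way that the observed age variable can explain, and the not-at-random case to understate the standard deviation because both ends of the income distribution are under-represented.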

Other ways to deal with missing data without physical completion

It is still common, though less so than before, to drop cases with missing values. This can be done by Listwise Deletion (drop every case that is missing at least one item) or by Pairwise Deletion (drop only the cases that are missing items relevant to the statistical model in use). Both methods assume that data are missing completely at random, and if this assumption is wrong, biased coefficients are estimated. The problem is less severe for Pairwise Deletion because it discards fewer cases, yet the sample size then differs from one model to the next. Because each mean, variance, and covariance may be computed from a different subset of cases, Pairwise Deletion results can deviate from those of the full sample, and it is difficult to calculate the correct standard error in such cases. On the other hand, these methods are simple and require less statistical skill.
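The difference between the two deletion rules can be sketched in a few lines. The data below are simulated and purely hypothetical (three correlated items with 15% of the values removed completely at random), and the variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical data: three correlated survey items (columns).
data = rng.multivariate_normal(
    mean=[0, 0, 0],
    cov=[[1.0, 0.5, 0.3],
         [0.5, 1.0, 0.4],
         [0.3, 0.4, 1.0]],
    size=n,
)

# Remove about 15% of the values in each column, completely at random.
data[rng.random(data.shape) < 0.15] = np.nan
present = ~np.isnan(data)

# Listwise deletion: keep only the rows with no missing item at all.
complete_rows = present.all(axis=1)
listwise_n = int(complete_rows.sum())
listwise_corr = np.corrcoef(data[complete_rows], rowvar=False)

# Pairwise deletion: for each pair of items, keep the rows where both exist.
k = data.shape[1]
pairwise_n = np.zeros((k, k), dtype=int)
pairwise_corr = np.eye(k)
for i in range(k):
    for j in range(k):
        both = present[:, i] & present[:, j]
        pairwise_n[i, j] = both.sum()
        pairwise_corr[i, j] = np.corrcoef(data[both, i], data[both, j])[0, 1]

print("listwise n:", listwise_n)
print("pairwise n per pair of items:")
print(pairwise_n)
```

Every correlation under listwise deletion rests on the same rows, while under pairwise deletion each entry of the matrix has its own sample size, which is why a single standard error is hard to define.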

Methods for physically replacing missing data

What, then, are the better methods for physically replacing missing data? There are two, and both are based on a regression model. The first replaces each missing value with the value predicted by the regression (the regression mean), while the second uses the same regression model but draws the replacement stochastically around the regression mean. The second method produces properly randomized values, while the first places every replaced value exactly on the regression line and so may still bias the data toward the mean. The second method's effect on the variance of the data is negligible. Studies show that coefficients estimated with the second technique show no bias compared with those estimated from the full data.

To physically complete missing data we use multiple imputation. In multiple imputation, missing data are replaced using simulations, so that a full dataset is built for statistical tests. The key rule is that the higher the rate of missing data, the more repeats are necessary to maintain the random sampling idea. For modeling use, a final dataset is built from those repeats by randomly selecting the replaced missing values out of all repeats.
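The two regression-based replacements can be compared on simulated data. This is a minimal sketch under assumptions: the variables x and y, the coefficients, the 40% missing rate, and the number of repeats m are all hypothetical. The repeat loop at the end is a simplified stand-in for multiple imputation; a full procedure would also redraw the regression coefficients for each repeat and would typically pool the estimates across repeats.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical complete data: y depends linearly on x plus noise.
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)

# Mark 40% of the y values as missing, completely at random.
missing = rng.random(n) < 0.40
observed = ~missing

# Fit the regression of y on x using the observed cases only.
slope, intercept = np.polyfit(x[observed], y[observed], 1)
residuals = y[observed] - (intercept + slope * x[observed])
residual_sd = residuals.std(ddof=2)

predicted = intercept + slope * x[missing]

# Method 1: replace with the regression mean (the predicted value).
y_mean_imp = y.copy()
y_mean_imp[missing] = predicted

# Method 2: replace with a random draw around the regression mean.
y_stoch_imp = y.copy()
y_stoch_imp[missing] = predicted + rng.normal(0, residual_sd, missing.sum())

print(f"variance of full data:          {y.var():.2f}")
print(f"variance, regression mean:      {y_mean_imp.var():.2f}")
print(f"variance, stochastic replacing: {y_stoch_imp.var():.2f}")

# Repeating the stochastic draw gives several completed datasets.
m = 20
repeat_means = []
for _ in range(m):
    y_rep = y.copy()
    y_rep[missing] = predicted + rng.normal(0, residual_sd, missing.sum())
    repeat_means.append(y_rep.mean())

print(f"mean of y across {m} repeats: {np.mean(repeat_means):.3f} "
      f"(sd between repeats {np.std(repeat_means, ddof=1):.4f})")
```

With these assumed settings, replacing by the regression mean should give a visibly smaller variance than the full data, while the stochastic replacement should stay close to it.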

To the list of methods that physically replace missing values we can add the "replace with the mean" option, offered, for example, in factor analysis models. That method simply shrinks the variance, and the distortion of the estimates grows dramatically with the share of missing values. It is considered the worst of all methods for dealing with missing values and has no support in academic research.
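The shrinking of the variance under "replace with the mean" is easy to demonstrate. The sketch below uses hypothetical simulated values and hypothetical missing rates; it is an illustration, not a recommendation to use the method.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical item scores with no missing values.
values = rng.normal(50, 10, 10_000)

variance_ratio = {}
for rate in (0.1, 0.3, 0.5):
    # Remove a share of the values completely at random.
    missing = rng.random(values.size) < rate

    # Replace every missing value with the mean of the observed values.
    filled = values.copy()
    filled[missing] = values[~missing].mean()

    # Compare the variance after replacement with the observed variance.
    variance_ratio[rate] = filled.var() / values[~missing].var()
    print(f"missing rate {rate:.0%}: variance ratio {variance_ratio[rate]:.2f}")
```

Because every replaced value sits exactly on the mean and adds nothing to the sum of squared deviations, the variance ratio equals the share of observed cases, so the variance drops roughly in proportion to the rate of missing values.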

Dr. Gabriel Liberman – Data-Graph Statistical Consulting