# Chapter 7

## Simulation on imputation

We performed a simulation study to explore the benefits and limitations of imputation for predictive regression models. We consider estimates of the regression coefficients in a simple linear regression model (Y~X1+X2), as well as estimates of predictive performance.

## Simulation design

The simple linear regression model Y~X1+X2 was discussed in Section 7.3 of the book. The impact of four missing data mechanisms (MCAR, MAR on x, MAR on y, MNAR) was illustrated in Fig 7.1.

We first generate X1 and X2 as uncorrelated predictors. In a second series of simulations, we impose a correlation of 0.707 between X1 and X2, so that the two predictors share 50% of their variance (0.707² ≈ 0.5). We consider four mechanisms to generate 50% missing values in X1: MCAR, MAR on x, MAR on y, and MNAR, as in Fig 7.1. Below we illustrate the correlation of X1 with X2, and X1 with values missing at random conditional on X2 (MAR on x), respectively. Original data are plotted with ‘-’; complete data are indicated with ‘o’.
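The data-generating steps described in this section can be sketched in code. The following is a minimal illustration, not the code used for the book: the function names, the unit regression coefficients, the unit residual standard deviation, and the logistic rule that turns a "driver" variable into a missingness probability are all assumptions made for this example. Only the sample size, the correlation, the four mechanisms, and the target of about 50% missing values in X1 come from the text.

```python
import math
import random
import statistics


def simulate_data(n=500, rho=0.0, seed=1):
    """Draw standard normal X1 and X2 with correlation rho, and Y = X1 + X2 + noise.

    The unit coefficients and unit residual SD are illustrative assumptions.
    """
    rng = random.Random(seed)
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    # X1 = rho*X2 + sqrt(1 - rho^2)*noise has unit variance and correlation rho with X2.
    x1 = [rho * v + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for v in x2]
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return x1, x2, y


def make_missing(x1, x2, y, mechanism, seed=2):
    """Return a copy of X1 with about 50% of the values replaced by None.

    Assumed rule: the probability of a missing X1 rises with a driver variable
    through a logistic curve centred at the driver's median.
    """
    rng = random.Random(seed)
    drivers = {
        "MCAR": [0.0] * len(x1),  # unrelated to any variable: probability is 0.5
        "MAR_x": x2,              # depends on the observed predictor X2
        "MAR_y": y,               # depends on the observed outcome Y
        "MNAR": x1,               # depends on the value of X1 itself
    }
    driver = drivers[mechanism]
    centre = statistics.median(driver)
    out = []
    for value, d in zip(x1, driver):
        p_missing = 1 / (1 + math.exp(-(d - centre)))
        out.append(None if rng.random() < p_missing else value)
    return out


if __name__ == "__main__":
    x1, x2, y = simulate_data(n=500, rho=0.707)
    for mechanism in ("MCAR", "MAR_x", "MAR_y", "MNAR"):
        x1_obs = make_missing(x1, x2, y, mechanism)
        frac = sum(v is None for v in x1_obs) / len(x1_obs)
        print(mechanism, "fraction missing:", round(frac, 3))
```

Centring the logistic curve at the median is one simple way to reach roughly 50% missing values under every mechanism; a steeper or shallower curve would change how strongly the missingness depends on the driver, but not the overall fraction.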

## Evaluation

We calculate the estimated regression coefficients and their estimated standard errors with different approaches to dealing with missing values. We use the mean squared error (MSE) as a summary measure for the quality of estimation of the regression coefficients. The MSE is calculated as $\text{mean}\big((b - \beta)^2\big)$, where $b$ is the estimated and $\beta$ the true regression coefficient. The MSE combines bias (systematic difference between the estimated and true value of the regression coefficient) and precision (random variability of the estimates). For comparison on the same scale as the coefficients, we take the square root (‘sqrt(MSE)’). Smaller values of sqrt(MSE) indicate better estimation of the regression coefficients. Furthermore, we calculate the adjusted $R^2$ statistic to indicate the estimated predictive performance of the model with X1 and X2. Each simulation scenario was repeated 1000 times, with data sets of 500 simulated subjects.
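The evaluation measures can be made concrete with a short sketch. The function names `mse_summary` and `adjusted_r2` are hypothetical, and the input values in the usage lines are made up for illustration; they are not simulation results. The sketch also shows why the MSE combines bias and precision: computed over a set of simulated estimates, it equals the squared bias plus the variance of those estimates.

```python
import math


def mse_summary(estimates, true_beta):
    """Summarise simulated coefficient estimates against the true value."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_beta
    # Variance across simulations (divisor n), so that mse == bias**2 + variance.
    variance = sum((b - mean_est) ** 2 for b in estimates) / n
    mse = sum((b - true_beta) ** 2 for b in estimates) / n
    return {"bias": bias, "variance": variance, "mse": mse, "rmse": math.sqrt(mse)}


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n subjects."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


if __name__ == "__main__":
    # Made-up estimates of a coefficient whose true value is 1.0.
    summary = mse_summary([1.1, 0.9, 1.2, 0.8], true_beta=1.0)
    print(summary)
    print(adjusted_r2(0.5, n=500, p=2))
```

With 500 subjects and only two predictors, the adjustment is small, so the adjusted and unadjusted statistics will be close in this design.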

## Missing values and imputation procedures

We first consider the hypothetical situation of complete data from the simulations (‘original data’). These results provide a gold standard reference. Next, we consider a complete case (CC) analysis, where only patients with complete values on X1 (and X2 and Y) are analyzed. As a first single imputation procedure, we consider conditional mean imputation, using X2 to impute X1 values. In addition, we consider stochastic regression imputation, where single imputations for X1 are based on random draws from an imputation model using X2 and Y. Finally, we apply multiple imputation (MI) with the mice procedure at its default settings to generate 5 imputed data sets. The mice algorithm assumes linearity in the associations between variables, which is the case in our simulation design. The imputed data sets are analyzed with standard least squares methods, and results are pooled using the standard formula for MI results (Rubin's rules).
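The difference between the two single imputation procedures, and the pooling step for MI, can be sketched as follows. This is a simplified illustration under stated assumptions, not the mice algorithm and not the code used for the book: the helper names (`ols`, `impute_x1`, `pool_rubin`) are hypothetical, missing values are coded as `None`, the imputation models are fitted on the complete cases, and the stochastic draw adds normal noise with the residual standard deviation while ignoring uncertainty in the fitted imputation model.

```python
import math
import random


def ols(rows, target):
    """Least-squares fit with intercept; returns (coefficients, residual SD)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, target))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for r in range(p):
            if r != i:
                A[r] = [vr - A[r][i] * vi for vr, vi in zip(A[r], A[i])]
    b = [A[i][p] for i in range(p)]
    resid = [t - sum(bi * xi for bi, xi in zip(b, x)) for x, t in zip(X, target)]
    sd = math.sqrt(sum(e * e for e in resid) / (len(X) - p))
    return b, sd


def impute_x1(x1, x2, y, stochastic, rng):
    """Fill missing X1 values (None) from a regression fitted on complete cases."""
    obs = [i for i, v in enumerate(x1) if v is not None]
    if stochastic:
        preds = [(x2[i], y[i]) for i in range(len(x1))]  # model uses X2 and Y
    else:
        preds = [(x2[i],) for i in range(len(x1))]       # model uses X2 only
    b, sd = ols([preds[i] for i in obs], [x1[i] for i in obs])
    out = []
    for i, v in enumerate(x1):
        if v is None:
            v = b[0] + sum(bj * pj for bj, pj in zip(b[1:], preds[i]))
            if stochastic:
                v += rng.gauss(0.0, sd)  # random draw around the prediction
        out.append(v)
    return out


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


if __name__ == "__main__":
    rng = random.Random(3)
    x2 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]
    x1 = [0.1, 1.9, 4.2, 5.8, None, None]
    print(impute_x1(x1, x2, y, stochastic=False, rng=rng))
    print(impute_x1(x1, x2, y, stochastic=True, rng=rng))
    print(pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5))
```

Conditional mean imputation places every imputed value exactly on the regression line, which understates the variability of X1; the stochastic version restores that variability by adding a random residual. Rubin's rules then add the between-imputation variance to the average within-imputation variance, so the pooled standard error reflects the uncertainty caused by the missing values.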