Missing Data Options

This page describes the missing data options available in Standard R advanced analyses. Not all options are available for all advanced analyses (e.g., some apply only to cluster analysis, some only to regression).

Error if missing data

An error is returned if any of the data used in the analysis contains missing values.

Exclude cases with missing data

The analysis is conducted using only cases with no missing data. For example, if there are three variables, x, y, and z, and the total sample size is 10, but 5 cases have no data for z, only the 5 cases with complete data are used in the analysis. This is also known as casewise deletion or the complete-case method. It is the default approach in Q.
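The x, y, z example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code that Q runs: the data values are made up, `None` is assumed to mark a missing value, and `complete_cases` is a hypothetical helper name.

```python
def complete_cases(rows, variables):
    """Keep only the rows that have a non-missing value for every variable."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


# Ten made-up cases; the last five have no data for z.
cases = [
    {"x": float(i), "y": float(2 * i), "z": float(i) if i < 5 else None}
    for i in range(10)
]

kept = complete_cases(cases, ["x", "y", "z"])
print(len(cases), len(kept))
```

Only the cases in `kept` would be passed on to the analysis.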

Assign partial data to clusters

The initial analysis is performed using Exclude cases with missing data, and then any cases with partially missing data (some, but not all, values missing) are assigned to the most similar clusters based on the data that is available.

Use partial data

The analysis is conducted using all the data for each case. For example, in Segments - K-Means Cluster Analysis, if there are nine variables in the analysis, and a case only has data for six, then the case is assigned to the most similar cluster based on the data for the six variables.
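As a rough sketch of the nine-variable example, the code below assigns a case to the nearest cluster centre using only the variables it has data for. The similarity measure is an assumption made for illustration (mean squared difference over the observed variables); this page does not specify how the analysis itself measures similarity. The function names and data are hypothetical, and `None` marks a missing value.

```python
def partial_distance(case, centroid):
    """Mean squared difference over the variables the case has data for."""
    pairs = [(c, m) for c, m in zip(case, centroid) if c is not None]
    if not pairs:
        raise ValueError("case has no observed values")
    return sum((c - m) ** 2 for c, m in pairs) / len(pairs)


def assign_cluster(case, centroids):
    """Return the index of the most similar (closest) cluster centre."""
    return min(range(len(centroids)),
               key=lambda k: partial_distance(case, centroids[k]))


# Two made-up cluster centres on nine variables.
centroids = [[0.0] * 9, [10.0] * 9]

# A case with data for only six of the nine variables.
case = [9.0, 11.0, 10.0, None, 8.0, None, 12.0, 9.0, None]

print(assign_cluster(case, centroids))
```

Averaging over the observed variables, rather than summing, keeps distances comparable between cases with different amounts of missing data.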

Use partial data (pairwise correlations)

The analysis is conducted using the correlations, rather than the raw data, and each correlation is computed from all the data available for that pair of variables. For example, if there are three variables, x, y, and z, and the total sample size is 10, but 5 cases have no data for z, then the correlation between x and y is computed using all 10 cases, and the correlations between x and z and between y and z are computed using the 5 cases that have data for z.
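The following sketch illustrates pairwise computation for the x, y, z example, assuming Pearson correlations and `None` as the missing-value marker. The data are made up and `pairwise_corr` is a hypothetical helper, not part of any package mentioned on this page.

```python
from math import sqrt


def pairwise_corr(a, b):
    """Pearson correlation using only the cases observed on both variables.

    Returns the correlation and the number of cases it was based on.
    """
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    var_u = sum((u - mean_u) ** 2 for u, _ in pairs)
    var_v = sum((v - mean_v) ** 2 for _, v in pairs)
    return cov / sqrt(var_u * var_v), n


# Ten made-up cases; z is missing for the last five.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0]
z = [1.0, 3.0, 2.0, 5.0, 4.0, None, None, None, None, None]

for name, (a, b) in {"x-y": (x, y), "x-z": (x, z), "y-z": (y, z)}.items():
    r, n = pairwise_corr(a, b)
    print(name, round(r, 3), "based on", n, "cases")
```

Because each correlation can rest on a different subset of cases, the entries of the resulting correlation matrix do not all share the same sample size.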

Where this approach is being used in regression, the correlation matrix is analyzed using the sweep operator.^[1]

Dummy variable adjustment

This method assumes that the missing data is structurally missing; that is, a predictor may be inapplicable for some cases and hence coded as missing (e.g., when a non-married person is asked to rate the quality of their marriage). The model accommodates this by removing the missing predictor for those cases and adjusting the intercept. This is implemented by adding a dummy variable for each predictor that has at least one missing value. Each dummy variable takes the value zero where the original predictor is non-missing and the value one where it is missing. The missing values are then recoded: missing values of numeric predictors are recoded to the mean of the predictor (computed excluding the missing data), and missing values of factors are recoded to the reference level of the factor.^[2]
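The recoding described above can be sketched as follows. This is a minimal illustration rather than the actual implementation: `None` is assumed to mark a missing value, the function names and data are hypothetical, and the reference level of a factor is passed in explicitly.

```python
def adjust_numeric(values):
    """Recode missing values to the mean of the observed values.

    Returns the recoded predictor and its dummy indicator (1 = was missing).
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    indicator = [1 if v is None else 0 for v in values]
    recoded = [mean if v is None else v for v in values]
    return recoded, indicator


def adjust_factor(values, reference_level):
    """Recode missing values of a factor to its reference level."""
    indicator = [1 if v is None else 0 for v in values]
    recoded = [reference_level if v is None else v for v in values]
    return recoded, indicator


# Made-up numeric predictor: marriage quality rating, missing for two cases.
rating, rating_missing = adjust_numeric([4.0, None, 6.0, 8.0, None])
print(rating, rating_missing)

# Made-up factor whose reference level is assumed to be "low".
level, level_missing = adjust_factor(["low", None, "high"], "low")
print(level, level_missing)
```

Both the recoded predictor and its indicator would then enter the regression as predictors, so the indicator's coefficient acts as the intercept adjustment for cases where the predictor was missing.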

Imputation (replace missing values with estimates)

By default, data is imputed using the default settings of the mice R package, which employs Multivariate Imputation by Chained Equations (predictive mean matching).^[3] Care should be taken to ensure that variables have the correct variable type, as this strongly affects the algorithm. If mice fails with a technical error, the imputation is instead performed using hot-decking, via the hot.deck package in R.^[4]

When applied with regression, cases with missing values in the outcome variable are excluded from the analysis after the imputation has been performed.^[5]

Note that although imputation can reduce the bias of parameter estimates, it can create misleading statistical inference (e.g., as the simulated sample size is assumed to be the actual sample size in calculations).

Multiple imputation

This is the same as imputation (described above), except that:

The imputation is repeated multiple times (by default, 10).
Parameter estimates are based on the average result across the different data sets.
Standard errors are computed using Rubin's (1987) method^[6] and the degrees of freedom using the 'small sample' approach.^[7]

Other than parameter estimates, standard errors, test statistics, and p-values, diagnostics are based on the results from only the first of the models (e.g., measures of influence, tests of normality, and residuals are all from the first model).
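The pooling steps listed above can be sketched for a single coefficient as follows. The estimate and standard error follow Rubin's rules. The degrees-of-freedom formula assumes that the 'small sample' approach refers to the Barnard and Rubin (1999) adjustment, which this page does not state explicitly. `pool_rubin`, its arguments, and the data are hypothetical; `df_complete` is the residual degrees of freedom the model would have with complete data.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors, df_complete):
    """Pool one coefficient across m imputed data sets."""
    m = len(estimates)
    pooled_estimate = mean(estimates)              # average across data sets
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance between data sets
    total = within + (1 + 1 / m) * between         # total variance

    # Assumed small-sample degrees of freedom (Barnard and Rubin, 1999).
    lam = (1 + 1 / m) * between / total
    df_observed = (df_complete + 1) / (df_complete + 3) * df_complete * (1 - lam)
    if lam == 0:
        df = df_observed  # the imputed data sets gave identical estimates
    else:
        df_old = (m - 1) / lam ** 2
        df = df_old * df_observed / (df_old + df_observed)
    return {"estimate": pooled_estimate, "std_error": sqrt(total), "df": df}


# Made-up estimates and standard errors from three imputed data sets.
pooled = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], df_complete=100)
print(pooled)
```

The pooled standard error is larger than any single model's standard error because the total variance includes the variation between the imputed data sets.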

