Multicollinearity has two distinct but related meanings when creating predictive models (e.g., Regression, Driver Analysis):
- That the independent variables (i.e., predictors or attributes) are highly correlated, making it difficult to disentangle their relative effects. Driver analysis methods such as Shapley and Kruskal are explicitly designed to attempt to address this problem.
- That one or more of the independent variables are essentially redundant (linearly dependent), containing no information that is not already contained in the other variables. Where this occurs, most predictive models will give an error message or produce strange results (for examples, see Regression Troubleshooting in Q and How to Troubleshoot Regression Problems in Displayr). This is sometimes referred to as perfect multicollinearity. Generally, any error messages that indicate the existence of multicollinearity are referring to perfect multicollinearity.
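To illustrate the distinction between the two meanings, the following is a minimal Python sketch using simulated data (the variable names and numbers are hypothetical, not from any real study): highly correlated predictors still allow a model to be estimated, whereas an exact linear combination makes the predictor matrix rank deficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Meaning 1: highly correlated predictors. x2 is a noisy version of x1;
# their effects are hard to disentangle, but a model can still be fitted.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
print(np.corrcoef(x1, x2)[0, 1])              # close to 1, but not 1

# Meaning 2: perfect multicollinearity. x3 is an exact linear
# combination of x1 and x2, so it adds no information and the
# predictor matrix has lower rank than its number of columns.
x3 = 2 * x1 - x2
X = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(X) < X.shape[1])  # True: rank deficient
```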
Causes of perfect multicollinearity
Perfect multicollinearity is generally caused by an error of some kind on the part of the researcher. For example:
- Including the same variable in an analysis twice. This can sometimes occur when a variable has been copied and renamed. The most straightforward way of detecting this is to compute correlations between the independent variables (see the sketch after this list).
- Including a derived variable in the analysis along with its inputs. For example, including age, gender, and family structure, and also a life-stage variable that is created from age, gender, and family structure.
- Using constant sum data as independent variables (e.g., a question that asks people to assign 100 points across different attributes). A solution in this case is to leave out one of the variables from the analysis. An alternative is to recode the data in some way (e.g., creating a Pick Any question, where only 'high' values are counted).
- Transforming the independent variables in some way that causes them to become linearly dependent (e.g., transforming them so that they have an average-per-case of 0 or some other number).
- Small sample sizes, in which the data ends up having one of the characteristics described above by chance (this is common with time series data).
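Two of the causes above lend themselves to simple programmatic checks. The sketch below is a rough Python illustration with made-up data (all names and values are hypothetical): a pairwise correlation of 1 flags a duplicated variable, and a rank check shows why constant-sum data is redundant once an intercept is included, and why leaving one variable out resolves it.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Duplicated variable ---------------------------------------------
# 'age_copy' is 'age' under another name; a correlation of 1 between
# two predictors flags the duplicate.
age = rng.normal(40, 10, size=100)
age_copy = age.copy()
print(np.corrcoef(age, age_copy)[0, 1])   # prints 1.0

# --- Constant-sum (100-point allocation) data ------------------------
# Each respondent allocates 100 points across three attributes, so the
# three columns always sum to 100. Together with an intercept they are
# linearly dependent and the design matrix is rank deficient.
a = rng.uniform(0, 100, size=100)
b = rng.uniform(0, 100 - a)
points = np.column_stack([a, b, 100 - a - b])
X = np.column_stack([np.ones(100), points])
print(np.linalg.matrix_rank(X))           # 3, not 4: rank deficient

# Leaving one of the constant-sum variables out of the analysis
# restores a full-rank design matrix.
X_reduced = np.column_stack([np.ones(100), points[:, :2]])
print(np.linalg.matrix_rank(X_reduced))   # 3: full rank
```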
Causes of perfect multicollinearity in experiments
In addition to the causes described above, when conducting experiments, such as conjoint analysis and MaxDiff, the following may lead to multicollinearity:
- An error in the creation of an experimental design. The most straightforward way to test this is usually to examine the original experimental design itself (e.g., to check that the correlations between the columns are as expected).
- An error in the administration of the experiment (e.g., some choice tasks were not shown to respondents).
- Some attribute levels in the experiment were so appealing (or unappealing) that alternatives containing them were always (or never) selected.
- A clerical error has been made when setting up the Experiment, where 'clerical' means that the data has not been set up as intended. The easiest way to check this is to select all the cells in the Coefficient column and see if anything is listed in the Invalid task report. Any invalid tasks can be checked by reading through a respondent's data as it appears in the Data tab in Q or the Data Editor in Displayr for the Experiment question. It can be useful to read through the data even when nothing is shown in the Invalid task report.
- The experimental design has prohibitions, but these prohibitions have not been taken into account when setting up the Experiment. For example, where a particular level of one attribute only ever occurs in conjunction with a particular level of another attribute, it is not possible to estimate Coefficients for both of these attributes (see the sketch after this list).
- Respondents were shown a None of these option, but this option has been set as Missing Data and no attribute levels have been included (see Experiments Specifications).
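As a rough illustration of the design checks above, the following Python sketch uses a small made-up design (the attributes, levels, and data are hypothetical): a cross-tabulation of two attributes reveals a prohibition as zero cells, and a rank check on the dummy-coded design shows the resulting linear dependence.

```python
import numpy as np
import pandas as pd

# Hypothetical conjoint design: one row per alternative shown, one
# column per attribute. The "Low" price only ever appears with brand
# "A", i.e. there is a prohibition on Low-priced B and C alternatives.
design = pd.DataFrame({
    "Brand": ["A", "A", "B", "B", "C", "C"],
    "Price": ["Low", "Low", "High", "High", "High", "High"],
})

# A cross-tabulation makes the prohibition visible as zero cells.
print(pd.crosstab(design["Brand"], design["Price"]))

# Dummy-code the design and add an intercept; the prohibition makes the
# matrix rank deficient, so separate coefficients for the confounded
# brand and price levels cannot both be estimated.
coded = pd.get_dummies(design, drop_first=True).astype(float)
X = np.column_stack([np.ones(len(coded)), coded.to_numpy()])
print(np.linalg.matrix_rank(X) < X.shape[1])  # True: rank deficient
```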
Further reading: Key Driver Analysis Software