**Multicollinearity (sometimes called Collinearity)**

This describes the condition of high correlations among two or more independent variables used in a multiple regression technique. Based on my experience, it is one of the most common threats to accurately ranking the effects of the independent variables used in a regression analysis. This condition affects any technique based on regression principles, including linear regression, binary and multinomial logistic regression, the Cox regression survival analysis technique, and an ARIMA time-series analysis that uses transfer functions to measure the effect of two or more external series.

The most direct test of multicollinearity is available in the Linear Regression procedure (Analyze > Regression > Linear) within the IBM SPSS Statistics software and in the Regression node in version 14 of the IBM SPSS Modeler software. Within IBM SPSS Statistics, clicking the Statistics button in the Linear Regression dialog box opens a subdialog box that includes a Collinearity diagnostics check box.

In version 14 of IBM SPSS Modeler, collinearity diagnostics are requested from a very similar dialog box, invoked from the Regression node.

The collinearity diagnostics option will produce two new columns in the Coefficients table and a separate Collinearity Diagnostics table. All of these diagnostics tell a similar story, and the most commonly used is the Tolerance statistic, which appears in the Coefficients table for each independent variable. Tolerance measures how much of the variance in each independent variable is NOT explained by the other independent variables. Tolerance values below .3 (30%) are likely to indicate a problem with multicollinearity, which means that the B and Beta coefficients produced for those variables may be incorrect.
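To make the Tolerance statistic concrete, here is a small Python sketch of the calculation (using NumPy, with made-up variable names): regress each independent variable on all of the others and report 1 minus the resulting R-squared. A variable that is nearly a copy of another gets a low tolerance; an unrelated variable gets a tolerance near 1.

```python
import numpy as np

# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def tolerance(X, j):
    """Tolerance of predictor j: 1 - R^2 from regressing it on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 - r2

for j in range(X.shape[1]):
    # x1 and x2 fall well below the .3 rule of thumb; x3 is near 1.
    print(f"x{j+1} tolerance = {tolerance(X, j):.3f}")
```

This mirrors what the Coefficients table reports, though SPSS of course computes it from your actual data rather than a simulation.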

There are other methods available in IBM SPSS Statistics and Modeler for detecting multicollinearity. Bivariate correlation tests can be run for all of the independent variables. However, as the name implies, these tests can only detect high correlations between two variables at a time, while multicollinearity refers to the correlations between each independent variable and all of the other independent variables. Nevertheless, if multicollinearity is occurring because of high correlations among a few variables, this method will be sufficient. Correlation coefficients above .8 or below -.8 (on a scale from -1 to 1) usually indicate multicollinearity at a level that will distort regression coefficients.
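The bivariate screening described above is easy to sketch in Python (NumPy assumed; the .8 cutoff and variable names are illustrative): compute the correlation matrix and flag any pair whose absolute correlation exceeds the threshold.

```python
import numpy as np

# Simulated predictors: x1 and x2 are strongly related, x3 is not.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

r = np.corrcoef(X, rowvar=False)  # pairwise Pearson correlations
names = ["x1", "x2", "x3"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        flag = "  <-- possible multicollinearity" if abs(r[i, j]) > 0.8 else ""
        print(f"r({names[i]}, {names[j]}) = {r[i, j]:+.2f}{flag}")
```

As noted above, this catches pairwise redundancy but can miss a variable that is well predicted only by a *combination* of the others, which is exactly what the Tolerance statistic detects.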

Factor analysis is a better test of multicollinearity because it can detect high correlations among any number of variables. Another advantage is that factor analysis can produce factor scores that can be used in lieu of the original independent variables. If orthogonal (uncorrelated) factor scores are created, this method will completely remove multicollinearity (with tolerance values of 1!). However, the coefficients associated with the factor scores used as independent variables in a regression can be difficult to interpret.
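To illustrate why orthogonal scores eliminate the problem, here is a small Python sketch (NumPy assumed, with principal-component scores standing in for orthogonally rotated factor scores): even when the original variables are highly correlated, the extracted scores are mutually uncorrelated, so each one's tolerance is 1.

```python
import numpy as np

# Highly correlated predictors, as before.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Center the data, then extract orthogonal component scores via SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s  # principal-component scores, one column per component

# The scores' correlation matrix is the identity: no multicollinearity left.
r = np.corrcoef(scores, rowvar=False)
print(np.abs(r - np.eye(3)).max())  # essentially zero
```

The trade-off is the one mentioned above: a regression coefficient on "component 1" is harder to explain to a client than a coefficient on an original variable.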

Linear Regression in IBM SPSS Statistics and the Regression node in Modeler 14 are the only statistical procedures that offer collinearity tests. However, any set of independent variables can be tested in the Linear Regression procedure, regardless of the regression-based procedure that will ultimately be used. Since the collinearity test only involves the independent variables, any variable can be designated as the dependent variable (even the subjects' ID numbers!). However, to simplify the output, you should deselect the Model fit and Regression coefficients options, which are selected by default.

The simplest way to fix a multicollinearity problem is to simply pick one of a set of variables that are highly correlated, especially if the high correlations suggest redundancy. Another simple solution is to use the mean of the highly correlated variables. Regardless of the solution used, tests for multicollinearity should be run before the analyst begins his or her interpretation of the regression coefficients.
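The "mean of the highly correlated variables" fix can be verified with the same tolerance check sketched earlier (Python with NumPy, illustrative data): replacing a near-duplicate pair with its mean lifts every tolerance value back toward 1.

```python
import numpy as np

# x2 is a near-duplicate of x1; x3 is unrelated.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)

def tolerance(X, j):
    """Tolerance of predictor j: 1 - R^2 from regressing it on the others."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1 - (1 - (y - A @ beta).var() / y.var())

before = np.column_stack([x1, x2, x3])            # redundant pair included
after = np.column_stack([(x1 + x2) / 2, x3])      # pair replaced by its mean

print(min(tolerance(before, j) for j in range(3)))  # well below .3
print(min(tolerance(after, j) for j in range(2)))   # close to 1
```

Dropping one of the pair instead of averaging gives a similar result; averaging simply keeps some information from both variables.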

**About Steve:** Steve has been an education consultant for SPSS since November of 1997. For most of that time he worked as an external consultant, and in March of 2010 he began working for IBM SPSS full-time. Before he began working full-time for IBM SPSS, he worked as a researcher for the Center for Mental Health Policy and Services Research at the University of Pennsylvania. Steve received a PhD in Social Policy, Planning, and Policy Analysis from Columbia University. He loves to travel and is an avid backpacker with his son.

There are regression-related techniques that can help with multicollinearity: ridge regression, the lasso, the elastic net, and partial least squares, to name a few. The first three are available in the Categories option of SPSS Statistics, and PLS is available in Statistics as an extension command (see the SPSS Community site at www.ibm.com/developerworks/spssdevcentral for information about PLS and extension commands).
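As a sketch of the first of those suggestions, ridge regression stabilizes coefficients on collinear predictors by adding a penalty to the least-squares problem. The closed-form version is short enough to show in Python (NumPy assumed; data and the penalty value are illustrative only):

```python
import numpy as np

# Nearly collinear predictors; the true model is y = x1 + x2 + noise.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = x1 + x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(ridge(X, y, 0.0))   # lam=0 is ordinary least squares: unstable here
print(ridge(X, y, 10.0))  # penalized: coefficients pulled toward each other
```

The penalty trades a little bias for a large reduction in variance, which is exactly what collinear data calls for; the commenter below is right, though, that no estimator can fully separate effects the data cannot distinguish.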

But ultimately, high multicollinearity means weak data: a limited ability to distinguish the effects of the correlated variables. In the end that means relying on a good theory or collecting better data. As somebody famous - I forget who - said, you can't make bricks without straw.

Having just gone through a construction project involving lots of adobe bricks, though, I admit that I didn't see any straw.