Multicollinearity explained briefly. Is it bad at all?

Multicollinearity is a common pitfall in multiple regression analysis. Analysts who conduct quantitative research must be aware of at least the seven core assumptions of linear regression. When checking those assumptions, many of us stop at number seven. Others (the wise ones) go the extra mile for extra credit and also check their multiple regression models for multicollinearity. Going the extra mile pays off in many situations in life, but when it comes to multicollinearity, it may be better to leave the data as they are. The reason is that the mere presence of multicollinearity does not make a model flawed, nor does multicollinearity apply to nonlinear models. Multicollinearity becomes an analytical barrier only when researchers aim to estimate the relative importance of the independent variables in explaining the dependent variable.

By Catherine De Las Salas.

What does multicollinearity mean?

Before going any further, let us address what multicollinearity means. Multicollinearity refers to correlation between two or more independent variables. When two or more independent variables correlate, they move together and, in a sense, reinforce one another before they effect any change in the dependent variable. I believe a good way to picture this influence is the relationship between a car's engine and a turbocharger. Because a car's method of propulsion is very simple, the system can be reduced to a model.


Let us say that Y represents the speed of a car, and that this speed depends solely on the engine’s horsepower (X₁). Now imagine that the car gets a new turbocharger. After installing the turbo, the driver notices that the average speed of the car went from 60 to 100 miles per hour in 30 seconds. The car is indeed faster when running with the turbocharger. Consequently, the speed of the car depends on one more factor, the turbocharger (X₂). In this case, the turbocharger reinforces the engine's horsepower and makes the car faster, even though the two components seem to work independently. Thus, the speed of the car is a function of the engine and the turbocharger at the same time. Since the turbocharger injects extra air into the cylinders, the engine and the turbocharger are highly correlated. The more fuel the engine burns, the more extra air the turbocharger pushes into the cylinders. And the more fuel the engine burns, the more speed the car achieves. That effect is very similar to multicollinearity.
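To make the idea concrete, here is a minimal Python sketch. It is not based on real car data: the horsepower readings, the turbo boost that rises with them, and the numbers 150, 20, 0.05 and 0.3 are all assumptions made up for illustration. It simply measures how strongly the two regressors move together.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data (assumed values, not measurements):
# X1 = engine horsepower, X2 = turbo boost that rises with X1 plus some noise.
horsepower = [rng.gauss(150, 20) for _ in range(500)]
boost = [0.05 * hp + rng.gauss(0, 0.3) for hp in horsepower]

r = pearson(horsepower, boost)
print(f"correlation between X1 and X2: {r:.2f}")
```

A correlation close to 1 between the two regressors is what this post calls multicollinearity.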

Multicollinearity does not obstruct the work of the researcher (kind of).

So, the engine example works not only for describing the reinforcing trend that multicollinearity creates among variables but also for explaining when multicollinearity becomes a problem. Even in the presence of multicollinearity, the car’s speed model works without a problem, given that it accomplishes what the model is supposed to do. The specification of the car’s speed model is good enough to give analysts a solid prediction of what would happen if the turbocharger were switched on or off. Roughly speaking, and following the example, the presence of a turbocharger in a car’s engine should increase the vehicle’s speed by about 67 percent (from 60 to 100 miles per hour). Therefore, multicollinearity does not obstruct the work of the researcher as long as the goal is to forecast the outcomes of variations in the engine. That is easy for us to know because a car’s engine parts can be removed and installed as we please. In social science research, however, researchers can neither remove nor install “turbochargers”. Multicollinearity is therefore something we have to learn to live with.
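The claim that prediction survives multicollinearity can be checked with a small simulation. The sketch below assumes standardized, made-up data in which the two regressors are correlated above 0.99 and the true coefficients are 2 and 3. These values, and the helper names `fit_ols_two` and `simulate`, are hypothetical and chosen only for illustration. It fits ordinary least squares on one sample and scores the predictions on a fresh one.

```python
import random


def fit_ols_two(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 (centered normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # shrinks toward 0 as x1 and x2 become collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


def simulate(n, rng):
    """Hypothetical data: x2 is x1 plus a little noise, so they are nearly collinear."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [a + rng.gauss(0, 0.1) for a in x1]
    y = [2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return x1, x2, y


rng = random.Random(1)
train = simulate(200, rng)
x1_new, x2_new, y_new = simulate(200, rng)

b0, b1, b2 = fit_ols_two(*train)

# Out-of-sample R-squared: how well the fitted model predicts unseen data.
pred = [b0 + b1 * a + b2 * b for a, b in zip(x1_new, x2_new)]
mean_y = sum(y_new) / len(y_new)
ss_res = sum((c - p) ** 2 for c, p in zip(y_new, pred))
ss_tot = sum((c - mean_y) ** 2 for c in y_new)
r2 = 1 - ss_res / ss_tot

print(f"out-of-sample R^2: {r2:.3f}")
print(f"b1 = {b1:.2f}, b2 = {b2:.2f}, b1 + b2 = {b1 + b2:.2f}")
```

Under these assumptions the predictions stay accurate and the combined effect `b1 + b2` lands near 5, even though the individual values of `b1` and `b2` may drift away from 2 and 3.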

Multicollinearity makes it difficult to disentangle effects of regressors.

What no researcher can estimate in the presence of severe multicollinearity, however, is the distinct contribution of either the engine or the turbocharger to the outcome, speed. In other words, multicollinearity amalgamates the two forces so thoroughly that it becomes difficult to disentangle them by regressing with the ordinary least squares (OLS) method. Thus, multicollinearity makes it hard to estimate the coefficients of the predictors with precision. Whoever would like to explore ways of mitigating multicollinearity to obtain better estimates of the betas may want to learn how game theory can help dilute it.
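The loss of precision can be illustrated by refitting the same model on many simulated samples and watching how much one coefficient jumps around. This is only a sketch under stated assumptions: the data are made up, the true coefficients (2 and 3), the sample size and the two correlation levels are arbitrary choices, and `estimate_b1` and `spread_of_b1` are hypothetical helper names.

```python
import math
import random
import statistics


def estimate_b1(x1, x2, y):
    """OLS estimate of the coefficient on x1 in y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


def spread_of_b1(rho, n=100, reps=300, seed=0):
    """Standard deviation of the b1 estimates across repeated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has unit variance and correlation rho with x1 by construction.
        x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
        y = [2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        estimates.append(estimate_b1(x1, x2, y))
    return statistics.stdev(estimates)


sd_low = spread_of_b1(rho=0.10)   # nearly uncorrelated regressors
sd_high = spread_of_b1(rho=0.99)  # severe multicollinearity

print(f"spread of b1 when rho = 0.10: {sd_low:.3f}")
print(f"spread of b1 when rho = 0.99: {sd_high:.3f}")
```

In this setup the estimate of the engine's coefficient varies several times more from sample to sample when the regressors are highly correlated, which is why the separate contributions are hard to pin down.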

In conclusion, multicollinearity represents a barrier only when researchers aim to estimate the relative importance of the independent variables in explaining the dependent variable. Multicollinearity per se does not violate any of the core assumptions of the linear regression model. It is just an issue of degree: most statistical models will exhibit at least some multicollinearity. Therefore, it is something we have to learn to live with.
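One common way to put a number on that degree is the variance inflation factor (VIF). The sketch below covers only the special case of exactly two regressors, where the VIF reduces to 1 / (1 - r²), with r the correlation between them. The function name `vif_two_regressors` and the sample values of r are illustrative assumptions.

```python
def vif_two_regressors(r):
    """Variance inflation factor for a model with exactly two regressors.

    With two regressors, the R-squared from regressing one on the other
    equals r**2, so VIF = 1 / (1 - r**2).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("perfect collinearity: the VIF is undefined")
    return 1.0 / (1.0 - r ** 2)


# Illustrative correlation levels, from none to severe.
for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  ->  VIF = {vif_two_regressors(r):6.2f}")
```

A VIF of 1 means no inflation of a coefficient's variance, and the value grows without bound as r approaches 1. Cutoffs such as 5 or 10 are often quoted, but they are rules of thumb rather than formal tests.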

Finally, if multicollinearity is a matter of degree, do researchers need better tools for examining and testing for multicollinearity?
