One of the most recurrent questions in quantitative research is how to assess and rank the relevance of the variables included in a multiple regression model. This type of uncertainty arises most often when researchers prioritize data mining over well-thought-out theories. Recently, a contact of mine on social media asked the following question: “Does anyone know of a way of demonstrating the importance of a variable in a regression model, apart from using the standard regression coefficient? Has anyone had any experience in using Johnson’s epsilon or alternatives to solve this issue? Any help would be greatly appreciated, thank you in advance for your help”. In this post, I would like to share the answer I offered him, stressing that what he really wanted was to justify the inclusion of a given variable beyond its weighted effect on the dependent variable. In the context of science and research, I pointed out the need to model appropriately rather than resort to mere “p-hacking” or the questionable practice of “regression fishing”.


I think his concern was really about modeling in general. If I am right, then the way researchers should tackle such a challenge is by establishing the relative relevance of a regressor beyond the absolute values of the coefficients, which requires a combination of intuition and just a bit of data mining. Thus, I advised my LinkedIn contact to gauge the appropriateness of the variables by comparing them against one another and by analyzing each on its own. The easiest way to proceed was to scrutinize the variables first independently and then jointly. Assessing the soundness of each variable on its own was therefore the first procedure I suggested he go through.

In other words, for each of the variables I recommended checking the following:

First, check data availability and the degree of measurement error;

Second, make sure every variable is consistent with your thinking, that is, with your theory;

Third, check the core assumptions;

Fourth, try to include rival models that explain the same phenomenon your model is explaining.

Now, for whatever reason, all the variables seemed appropriate to the researcher. He had checked the standards for including variables, and everything looked good. In addition, he believed that his model was sound and cogent with respect to the theory he had surveyed at the time. So, I suggested raising the bar and refining the model by analyzing the variables in the context of the model itself. This is where the second step begins: a so-called post-mortem analysis.

Post-mortem analysis meant that, after running as many regressions as he could, he would scrutinize the variables for specification errors, measurement errors, or both. Given that specification errors were present in the model, I suggested a test of nested hypotheses, which checks whether the model omitted a relevant variable (misspecification error) or included an irrelevant one (overfitting error). In this case the modeling error was the latter.
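To make the idea of a nested test concrete, here is a minimal sketch in plain Python. It is not the analysis my contact ran: the data are simulated, and every name in it (`ols_rss`, `nested_f_test`, `x1`, `x2`, `x3`) is made up for illustration. It compares a restricted model with an unrestricted one through the usual F statistic, F = [(RSS_r - RSS_u) / q] / [RSS_u / (n - k_u)], where q is the number of added regressors and k_u is the number of coefficients in the unrestricted model.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    n, k = len(X), len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def nested_f_test(X_restricted, X_unrestricted, y):
    """Return (F statistic, numerator df, denominator df) for nested models."""
    n = len(y)
    k_r, k_u = len(X_restricted[0]), len(X_unrestricted[0])
    rss_r = ols_rss(X_restricted, y)
    rss_u = ols_rss(X_unrestricted, y)
    q = k_u - k_r  # number of restrictions being tested
    f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k_u))
    return f_stat, q, n - k_u


# Simulated data (an assumption for illustration): y depends on x1 and x2,
# while x3 is pure noise and plays the role of an irrelevant regressor.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a + 1.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

base = [[1.0, a] for a in x1]
with_x2 = [[1.0, a, b] for a, b in zip(x1, x2)]
with_x2_x3 = [[1.0, a, b, c] for a, b, c in zip(x1, x2, x3)]

# Omitted relevant variable: does adding x2 to the base model help?
f_omitted, q_omitted, df_omitted = nested_f_test(base, with_x2, y)
# Irrelevant variable: does adding x3 on top of x1 and x2 help?
f_extra, q_extra, df_extra = nested_f_test(with_x2, with_x2_x3, y)

print(f"adding x2: F({q_omitted}, {df_omitted}) = {f_omitted:.2f}")
print(f"adding x3: F({q_extra}, {df_extra}) = {f_extra:.2f}")
```

Each statistic is then compared with the critical value of the F distribution for the reported degrees of freedom (roughly 3.9 at the 5% level for one restriction and about 200 observations). A large F suggests the restricted model omitted a relevant variable; a small F suggests the extra variable is not earning its place in the model.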

The bottom line, in this case, was that regardless of the test my contact decided to run, the critical issue is always to track and analyze the nuances in the error terms of the competing models.
