In a previous post, I covered how heteroscedasticity “happened” to me. The anecdote I mentioned mostly pertains to time series data. Given the purpose of the research that I was developing back then, change over time played a key factor in the variables I analyzed. The fact that the rate of change manifested over time made my post limited to heteroscedasticity in time series analysis. However, we all know heteroscedasticity is also present in cross-sectional data. So, I decided to write something about it. Not only because did not I include cross-sectional data, but also because I believe I finally understood what heteroscedasticity was about when I identified it in cross-sectional data. In this post, I will try to depict, literally, heteroscedasticity so that we can share some opinions about it here.
As I mentioned before, my research project at the moment was not very sophisticated. I had said that I aimed at identifying the effects of the Great Recession in the Massachusetts economy. So, one of the obvious comparisons was to match U.S. states regarding employment levels. I use employment levels as an example given that employment by itself creates many econometric troubles, being heteroscedasticity one of them.
The place to start looking for data was U.S. Labor Bureau of Statistics, which is a nice place to find high quality economic and employment data. I downloaded all the fifty states and their jobs level statistics. Here in this post, I am going to restrict the number of states to the first seventeen in alphabetical order in the data set below. At first glance, the reader should notice that variance in the alphabetical array looks close to random. Perhaps, if the researcher has no other information -as I often do- about the states listed in the data set, she may conclude that there could be an association between the alphabetical order of States and their level of employment.
I could take any other variable (check these data sources on U.S. housing market) and set it alongside employment level and regress on it for me to explain the effect of the Great Recession on employment levels or vice versa. I could find also any coefficients for the number of patents per employment level and states, or whatever I could imagine. However, my estimated coefficients will always be biased because of heteroscedasticity. Well, I am going to pick a given variable randomly. Today, I happen to think that there is a strong correlation between Household’s Pounds of meat eaten per month and level of employment. Please do not take wrong, I believe that just for today. I have to caution the reader; I may change my mind after I am done with the example. So, please allow me to assume such a relation does exist.
Thus, if you look the table below you will find interesting the fact that employment levels are strongly correlated to the number of Household’s pound of meat eaten per month.
Okay, it is clear that when we array the data set by alphabetical order the correlation between employment level and Household’s Pounds of meat eaten per month is not as clear as I would like it to be. Then, let me re-array the data set below by employment level from lowest to the highest value. When I sort out the data by employment level, the correlation becomes self-evident. The reader can see now that employment drives data on Household’s Pounds of meat eaten per month up. Thus, the higher the number of employment level, the greater the number of Household’s Pounds of meat consumed per month. For those of us who appreciate protein –with all due respect for vegans and vegetarians- it makes sense that when people have access to employment, they also have access to better food and protein, right?
In this case, given that I have a small data set I can re-array the columns and visually identify the correlation. If you look at the table above, you will see how both growth together. It is possible to see the trend clearly, even without a graph.
But, let us now be a bit more rigorous. When I regressed Employment levels on Household’s Pounds of meat eaten per month, I got the following results:
After running the regression (Ordinary Least Squares), I found that there is a small effect of employment on consumption of meat indeed; nonetheless, it is statistically significant. Indeed, the regression R-squared is very high (.99) to the extent that it becomes suspicious. And, to be honest, there are in fact reasons for the R-squared to be suspicious. All I have done was tricking the reader with a fake data on meat consumption. The real data behind meat consumption used in the regression is the corresponding state population. The actual effect in the variance of employment level stems from the fact that states do vary in population size. In other words, it is clear that the scale of the states affects the variance of the level of employment. So, if I do not remove size effect from the data, heteroscedasticity will taint every single regression I could make when comparing different states, cities, households, firms, companies, schools, universities, towns, regions, son on and so forth. All this example means that if the researcher does not test for heteroscedasticity as well as the other six core assumptions, the coefficients will always be biased.
For some smart people, this thing is self-explanatory. For others like me, it takes a bit of time before we can grasp the real concept of the variance of the error term. Heteroscedasticity-related mistakes occur most of the time because social scientists look directly onto the relation among variables. Regardless of the research topic, we tend to forget to factor in how population affects the subject of our analysis. So, we tend to believe that it is enough to find the coefficient of the relation between, for instance, milk intake in children and household income without considering size effect. A social scientist surveying such a relation would regress the number of litters of milk drunk by the household on income by family.
Sometimes it is good to learn about issues such as heteroscedasticity by empirically identifying them. Here is how I detected heteroscedasticity was present in time series analysis. I started working on a research project intended to measure how Massachusetts economy had recovery from the Great Recession of 2009. Neither was it a sophisticated research, nor the scope went further than to describe the way Mass’ economy had reallocated resources after the crisis. I knew at the time that descriptive statistics would suffice my research objectives. So, I picked a bunch of metrics that I thought would depict mostly downward slopes lines. I remember having chosen Gross Domestic Product by industry. So, I started plotting data in charts and graphs. Then I turned onto municipalities. I gathered some data on employment levels cross-sectional and time series. Once I was done with the exploratory phase of the research, I started to see strange patterns in the graphs. Everything went up and up, even after a recession. Apparently, it did not make sense at all, and I had to research the reason behind upward slopes in the time of economic distress.
It turned out heteroscedasticity was the phenomenon bumping up the lines. I said, it is nice to meet you Miss, but who the heck are you? Not knowing heteroscedasticity is almost the same thing as ignoring lurking or confounding variables in your regression model. However, the difference stems from the fact that heteroscedasticity does aggregate lurking variables and hides them within the model’s error term. In descriptive statistics of time series analysis, heteroscedasticity manifests as a portion of the area underneath the line, which makes time series lines to have a false rate of change. It looks like the lines had been inflated artificially. Obviously, this is clear when the measure tracks currency. We all know that currency grows over the time as its value depreciate. Therefore, we all adjust by inflation, right? Although adjusting for inflation was an easy task, the lines kept on showing upward trends. Something was going under definitely -I thought at the moment.
On the other side, measures like employment levels also were trending upwards. Even though employment is an economic measure, I am not idiot enough for confounding and associating it with inflation. Perhaps, there might be a theory in which employment could depreciate over time as currency; but, I know it performs differently to price inflation. So, after doing my research, I found that it was the growth of population that bolstered employment growth after the crisis. Does that count as real job growth? No, it does not. Then, how should I measure such a distorted effect? Once again, heteroscedasticity held the answer.
What is technically heteroscedasticity?
Heteroscedasticity is a data defect that thesis advisors use for to make you work harder. No, seriously. What is heteroscedasticity? Technically, heteroscedasticity is the correlation between the error term and one of the independent variables. In other words, it is an effect caused by the nature of the data most of the times. It is a phenomenon that data collected over time suffer from, and which means that the error term of the model has variance different than zero. In time series analysis, econometricians call such a thing Non-stationary Process, hence one of the main assumptions in linear regression analysis is to aim at analyzing data that is Stationary Stochastic Process.
What makes heteroscedasticity a problem?
Heteroscedasticity taints estimated coefficients in regression analysis. The collection technique can generate heteroscedasticity, outliers can trigger heteroscedasticity, incorrect data transformation can create heteroscedasticity, and skewness in the distribution of the data can produce heteroscedasticity.
Ever since the first test I use for heteroscedasticity in time series analysis is the graphical method. Yes, it is an informal method, but it gives researchers an idea of what transformation to do in the data. Finally, if you want to hear about how to estimate heteroscedasticity with a formal procedure here is my advice.
Although I use mostly either White test or Park test when testing for heteroscedasticity, if you must use Breusch-Pagan for whatever reason, here is what you need to do. The goal in Breusch-Pagan test is to estimate the ½ of Explained Sum of Squares (ESS), which follows approximately a Chi-Square distribution. You will have to build an additional regression model based on the model you suspect the heteroscedasticity is present in. The first thing is to obtain the residuals from your model through OLS. Then, estimate a rough statistic of its variance by adding up and squaring the residuals to ultimately dividing by the number of your observations. Once you have the approximate variance of the residuals, proceed to create a new variable by dividing each residual squared by the estimated variance above. Let us call such a new variable p. Now, regress p on the independent variables of your original model. Obtain the Explained Sum of Squares and divide it by 2. Then compare your 1/2ESS statistic with those in the Chi Square Table.
Let me know if you need help with getting rid of heteroscedasticity!