In a previous post, I covered how heteroscedasticity “happened” to me. The anecdote I mentioned mostly pertains to time series data. Given the purpose of the research I was developing back then, change over time played a key role in the variables I analyzed. Because the rate of change manifested over time, that post was limited to heteroscedasticity in time series analysis. However, we all know heteroscedasticity is also present in cross-sectional data, so I decided to write something about it. I am doing so not only because I did not include cross-sectional data, but also because I believe I finally understood what heteroscedasticity was about when I identified it in cross-sectional data. In this post, I will try to depict heteroscedasticity, literally, so that we can share some opinions about it here.
As I mentioned before, my research project at the time was not very sophisticated. I aimed at identifying the effects of the Great Recession on the Massachusetts economy. So, one of the obvious comparisons was to match U.S. states regarding employment levels. I use employment levels as an example because employment by itself creates many econometric troubles, heteroscedasticity being one of them.
The place to start looking for data was the U.S. Bureau of Labor Statistics, which is a good place to find high-quality economic and employment data. I downloaded the employment level statistics for all fifty states. In this post, I am going to restrict the number of states to the first seventeen in alphabetical order in the data set below. At first glance, the reader should notice that the variation in the alphabetical array looks close to random. Perhaps, if the researcher has no other information about the states listed in the data set (as is often my case), she may conclude that there could be an association between the alphabetical order of the states and their level of employment.
I could take any other variable (check these data sources on the U.S. housing market), set it alongside employment level, and run a regression to explain the effect of the Great Recession on employment levels, or vice versa. I could also estimate coefficients for the number of patents per employment level and state, or whatever I could imagine. However, my inference would always be compromised by heteroscedasticity: even if the estimated coefficients look reasonable, their standard errors, and therefore their significance tests, cannot be trusted. Well, I am going to pick a variable at random. Today, I happen to think that there is a strong correlation between Household’s Pounds of meat eaten per month and level of employment. Please do not get me wrong: I believe that just for today. I have to caution the reader that I may change my mind after I am done with the example. So, please allow me to assume such a relation does exist.
Thus, if you look at the table below, you will find it interesting that employment levels are strongly correlated with Household’s Pounds of meat eaten per month.
Okay, it is clear that when we array the data set in alphabetical order, the correlation between employment level and Household’s Pounds of meat eaten per month is not as clear as I would like it to be. So, let me re-array the data set below by employment level, from the lowest to the highest value. When I sort the data by employment level, the correlation becomes self-evident. The reader can now see that Household’s Pounds of meat eaten per month rises with employment. Thus, the higher the employment level, the greater the number of Household’s Pounds of meat consumed per month. For those of us who appreciate protein (with all due respect to vegans and vegetarians), it makes sense that when people have access to employment, they also have access to better food and protein, right?
In this case, given that I have a small data set, I can re-array the columns and visually identify the correlation. If you look at the table above, you will see how both grow together. It is possible to see the trend clearly, even without a graph.
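A small sketch can make the point about re-arraying concrete: sorting the rows changes how easy the pattern is to see, not the correlation itself. The numbers below are invented purely for illustration (they are not the BLS figures), and the variable names are my own.

```python
import numpy as np

# Hypothetical figures, invented for illustration only (not BLS data).
# Rows start in an arbitrary, "alphabetical-like" order.
employment = np.array([2100.0, 330.0, 2900.0, 1250.0, 17000.0, 2600.0, 1700.0, 450.0])
meat = np.array([4800.0, 740.0, 6900.0, 3000.0, 39000.0, 5500.0, 3600.0, 950.0])

# Correlation computed on the rows as they come.
r_unsorted = np.corrcoef(employment, meat)[0, 1]

# Re-array both columns by employment level, lowest to highest.
order = np.argsort(employment)
employment_sorted = employment[order]
meat_sorted = meat[order]

# Correlation computed on the re-arrayed rows.
r_sorted = np.corrcoef(employment_sorted, meat_sorted)[0, 1]

print(f"correlation, original order: {r_unsorted:.4f}")
print(f"correlation, sorted order:   {r_sorted:.4f}")
```

The two printed values agree up to floating-point rounding, because the correlation coefficient does not depend on row order. Sorting only helps the eye.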
But let us now be a bit more rigorous. When I regressed employment levels on Household’s Pounds of meat eaten per month, I got the following results:
After running the regression (Ordinary Least Squares), I found a small but statistically significant relation between employment and meat consumption. Indeed, the regression R-squared is very high (.99), to the point that it becomes suspicious. And, to be honest, there are in fact reasons to be suspicious. All I have done is trick the reader with fake data on meat consumption. The real data behind the meat consumption variable used in the regression is the corresponding state population. The variation in employment levels actually stems from the fact that states vary in population size. In other words, the scale of the states affects the variance of the level of employment. So, if I do not remove the size effect from the data, heteroscedasticity will taint every single regression I could run when comparing different states, cities, households, firms, companies, schools, universities, towns, regions, and so on and so forth. What this example means is that if the researcher does not test for heteroscedasticity, as well as the other six core assumptions, the standard errors and significance tests attached to the coefficients cannot be trusted.
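To see the size effect at work, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: 200 hypothetical regions, employment at roughly 45% of population, and an error whose standard deviation is proportional to population. The check used is Koenker's studentized form of the Breusch-Pagan statistic (sample size times the R-squared of a regression of squared residuals on the size variable), written with NumPy only. The helper names `ols` and `breusch_pagan_lm` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 hypothetical regions, population in millions.
n = 200
population = rng.uniform(0.5, 40.0, size=n)

# Assumption: employment is about 45% of population, and the error's
# standard deviation grows in proportion to population (the size effect).
employment = 0.45 * population + 0.02 * population * rng.standard_normal(n)


def ols(y, X):
    """Return OLS coefficients, residuals and R-squared for y = X b + e."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, resid, r2


def breusch_pagan_lm(resid, z):
    """Koenker's studentized Breusch-Pagan statistic: n * R-squared of
    the auxiliary regression of squared residuals on [1, z]."""
    Z = np.column_stack([np.ones_like(z), z])
    _, _, r2_aux = ols(resid ** 2, Z)
    return len(resid) * r2_aux


# Regression in levels: employment on a constant and population.
X_levels = np.column_stack([np.ones(n), population])
_, resid_levels, r2_levels = ols(employment, X_levels)
lm_levels = breusch_pagan_lm(resid_levels, population)

# Size-adjusted regression: divide every term by population.
X_scaled = np.column_stack([1.0 / population, np.ones(n)])
_, resid_scaled, _ = ols(employment / population, X_scaled)
lm_scaled = breusch_pagan_lm(resid_scaled, population)

# 5% critical value of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_5PCT = 3.841

print(f"R-squared in levels:          {r2_levels:.3f}")
print(f"LM statistic, levels:         {lm_levels:.2f}")
print(f"LM statistic, size-adjusted:  {lm_scaled:.2f}")
print(f"5% critical value (1 df):     {CHI2_1DF_5PCT}")
```

In this simulation the regression in levels combines a very high R-squared with an LM statistic well above the critical value, which is the same suspicious pattern as in the example above. Once every term is divided by population, the statistic drops sharply. With real data the right scaling variable is a judgment call, so this is a sketch of the idea rather than a recipe.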
For some smart people, this is self-explanatory. For others like me, it takes a bit of time before we can grasp the real concept of the variance of the error term. Heteroscedasticity-related mistakes occur most of the time because social scientists look directly at the relation among variables. Regardless of the research topic, we tend to forget to factor in how population affects the subject of our analysis. So, we tend to believe that it is enough to find the coefficient of the relation between, for instance, milk intake in children and household income without considering the size effect. A social scientist studying such a relation would regress the number of liters of milk drunk by the household on family income.
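In symbols, that naive specification and one possible size adjustment might be sketched as follows. The notation is my own: $N_i$ stands for the number of people in household $i$, and the assumption that the error's standard deviation is proportional to $N_i$ is only one plausible form the problem could take.

```latex
\begin{align}
\text{milk}_i &= \beta_0 + \beta_1\,\text{income}_i + \varepsilon_i,
  & \operatorname{Var}(\varepsilon_i \mid N_i) &= \sigma^2 N_i^2 \\
\frac{\text{milk}_i}{N_i} &= \beta_0\,\frac{1}{N_i} + \beta_1\,\frac{\text{income}_i}{N_i} + u_i,
  & \operatorname{Var}(u_i \mid N_i) &= \sigma^2
\end{align}
```

Here $u_i = \varepsilon_i / N_i$, so under the stated assumption, dividing every term by household size leaves an error with constant variance.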