Steps in the traditional statistical procedure for analyzing data:
1. Formulation of Hypothesis.
2. Description of Mathematical model.
3. Collecting and organizing data.
4. Estimation of the coefficients.
5. Hypothesis testing and confidence interval.
6. Forecasting and prediction.
7. Control and optimization.
1. Hypothesis: write down a statement that “in theory” you think happens in real life. For instance,
“Heavier labor regulation may be associated with lower labor force participation”.
2. Mathematical model: although it is not strictly necessary, it always helps to make clear whether the relationship you proposed, namely between “regulation” and “labor force participation”, is positive or negative. In other words, do you believe that “labor regulation” has a positive or a negative impact on “labor force participation”? One way to check your beliefs is to plot a chart and see whether the trend slopes upward or downward.
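If you prefer a number to a chart, the direction of the trend can also be read from the sign of the sample covariance between the two variables. The sketch below is only an illustration: the data are hypothetical (they are not the Botero et al. figures) and the function name `relationship_direction` is made up for this example.

```python
# Hypothetical data for illustration only (not the Botero et al. figures).
regulation = [0.2, 0.3, 0.4, 0.5, 0.6]          # X: labor regulation score
participation = [0.80, 0.76, 0.75, 0.70, 0.69]  # Y: labor force participation proxy


def relationship_direction(x, y):
    """Return 'positive', 'negative' or 'none' from the sign of the sample covariance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    if covariance > 0:
        return "positive"
    if covariance < 0:
        return "negative"
    return "none"


direction = relationship_direction(regulation, participation)
print(direction)
```

With these made-up numbers the result is "negative", which is what a downward sloping chart would show.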
3. Collecting and organizing the data: collecting data is expensive. In our case, “heavier labor regulation may be associated with lower labor force participation” can be analyzed with data already collected by the World Bank and organized by Juan Botero et al. (2004). If you do not have data, you will need to design a questionnaire and go out and put those questions to at least 100 randomly chosen individuals.
Say, however, that you want to study the claim “the more you learn, the more you earn”. What would you ask several random people? You would ask at least two questions: what is your annual or monthly income? And what level of education do you have: PhD, Masters, Undergraduate or High School? You would record every single answer, perhaps in a Microsoft Excel spreadsheet. Do not forget to label the columns and note what they mean. The two columns that result from your survey are your variables (e.g. X and Y).
Going back to our case, “heavier labor regulation may be associated with lower labor force participation”, the Excel spreadsheet looks like the picture below. In the spreadsheet you can see each of the observations, which are data drawn from countries. What you read in column AO as “index_labor7a” is nothing more than a score that researchers like you assigned to whatever they considered to be “labor regulation”. The adjacent column AP, which reads “rat_mal2024”, is simply an average of the unemployment rate among males aged 20 to 24. That is what the researchers in our example consider to be a proxy for “labor force participation”.
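For the survey example (“the more you learn, the more you earn”), the two labeled columns could be sketched as follows. Everything here is an assumption made for illustration: the answers are invented, and the ordinal coding of education levels (1 to 4) is just one possible choice, not a prescribed one.

```python
import csv
import io

# Hypothetical ordinal coding for the education question (an assumption).
EDUCATION_CODES = {"High School": 1, "Undergraduate": 2, "Masters": 3, "PhD": 4}

# Hypothetical survey answers: (education level, annual income).
answers = [
    ("High School", 28000),
    ("Undergraduate", 41000),
    ("Masters", 55000),
    ("PhD", 63000),
]


def to_rows(raw_answers):
    """Turn raw answers into labeled rows: X = education code, Y = annual income."""
    return [
        {"education_code": EDUCATION_CODES[level], "annual_income": income}
        for level, income in raw_answers
    ]


rows = to_rows(answers)

# Write a labeled table, the same way you would label spreadsheet columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["education_code", "annual_income"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The header row plays the role of the column labels: anyone opening the file can tell which column is X and which is Y.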
4. Estimation of the coefficients: once you have set up your software, run the regression by clicking the “Data Analysis” button, which can usually be found in the upper right corner of the “Data” tab as shown in the picture below, and then selecting “Regression”.
Then, you will have to define your Y’s and X’s. These are your variables, which come from the empirical observations (e.g. the survey). In our case, as we defined above, our Y is the AP column in the picture below, that is, “rat_mal2024”, our proxy for “male labor force participation”. Correspondingly, our X is “index_labor7a”, which is, as we stated, a score of labor regulation. Do not forget to tell Excel whether your columns have labels, and to specify the output range. It is up to you whether to have Excel plot the residuals and other relevant statistics. For now, just check the confidence level box.
Excel will generate the “Summary Output” table. This table contains the coefficients we are trying to estimate. From this point onwards you will have to be somewhat familiar with statistics in order to interpret the results.
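To make the word “coefficients” concrete, here is a minimal sketch of how the intercept and the slope of a one-variable regression can be computed with the textbook ordinary least squares formulas. It is not a description of how Excel works internally, the data are hypothetical, and `fit_line` is a made-up helper name.

```python
# Hypothetical data for illustration only (not the Botero et al. figures).
regulation = [0.2, 0.3, 0.4, 0.5, 0.6]          # X: labor regulation score
participation = [0.80, 0.76, 0.75, 0.70, 0.69]  # Y: labor force participation proxy


def fit_line(x, y):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(regulation, participation)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

A negative slope, as in this made-up sample, is what the hypothesis “heavier regulation, lower participation” would lead you to expect.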
5. Hypothesis testing and confidence interval: in this step you will have to rule out the argument that contradicts your initial thoughts on the relation between earnings and learnings. In other words, you will have to reject the possibility that such a relation does not exist (the null hypothesis).
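As a rough illustration of what “rejecting” means in practice, the sketch below tests whether the slope of a one-variable regression differs from zero and builds a 95% confidence interval around it. It relies on several assumptions: the data are hypothetical, `slope_test` is a made-up helper name, and the critical value 3.182 is the two-sided 95% value of Student's t for 3 degrees of freedom, read from a standard table (it must be changed for any other sample size).

```python
import math

# Hypothetical data for illustration only (not the Botero et al. figures).
regulation = [0.2, 0.3, 0.4, 0.5, 0.6]          # X: labor regulation score
participation = [0.80, 0.76, 0.75, 0.70, 0.69]  # Y: labor force participation proxy

# Assumption: two-sided 95% critical value of Student's t with n - 2 = 3
# degrees of freedom, taken from a standard t table.
T_CRITICAL = 3.182


def slope_test(x, y, t_critical):
    """Return the slope, its t statistic and its confidence interval."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 because two coefficients were estimated.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    std_error = math.sqrt(residual_variance / s_xx)
    t_stat = slope / std_error
    margin = t_critical * std_error
    return slope, t_stat, (slope - margin, slope + margin)


slope, t_stat, (ci_low, ci_high) = slope_test(regulation, participation, T_CRITICAL)
# Null hypothesis: the slope is zero, i.e. there is no relation.
reject_null = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), reject: {reject_null}")
```

If the interval does not contain zero, which is the same as the t statistic exceeding the critical value in absolute terms, you reject the possibility that no relation exists at that confidence level.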
6. Forecasting and prediction: this step is a bit slippery, but you can still say something about the next person to whom you would put the survey questions. In this step you will be able to “guess” the answer other people would give to your questionnaire with a certain level of confidence.
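One common way to attach a “level of confidence” to such a guess is a prediction interval for a new observation. The sketch below uses the textbook formula for a one-variable regression. It is only an illustration: the data are hypothetical, `predict_with_interval` is a made-up helper name, the new value 0.45 is arbitrary, and 3.182 is again the two-sided 95% critical value of Student's t for 3 degrees of freedom taken from a standard table.

```python
import math

# Hypothetical data for illustration only (not the Botero et al. figures).
regulation = [0.2, 0.3, 0.4, 0.5, 0.6]          # X: labor regulation score
participation = [0.80, 0.76, 0.75, 0.70, 0.69]  # Y: labor force participation proxy

# Assumption: two-sided 95% critical value of Student's t, 3 degrees of freedom.
T_CRITICAL = 3.182


def predict_with_interval(x, y, x_new, t_critical):
    """Predict y at x_new and return (prediction, lower bound, upper bound)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    prediction = intercept + slope * x_new
    # The leading "1 +" accounts for the noise of the new observation itself.
    std_error = math.sqrt(
        residual_variance * (1 + 1 / n + (x_new - mean_x) ** 2 / s_xx)
    )
    margin = t_critical * std_error
    return prediction, prediction - margin, prediction + margin


prediction, low, high = predict_with_interval(regulation, participation, 0.45, T_CRITICAL)
print(f"prediction = {prediction:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval widens as the new value of X moves away from the average of the observed X’s, which is one reason this step is slippery.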