Econometrics is one of the most important tools in an economist’s arsenal. Econometric techniques allow economists to test hypotheses from economic theory against real-world data and see whether the evidence supports them. Econometrics is the application of data analysis to economics: “Econ” refers to economics and “metrics” refers to quantitative measurement. So, econometrics is basically just the statistical examination of economic data.
The econometric model we will be looking at in this post is the general linear model. The general linear model is used to assess whether one or more variables have a statistically significant relationship with another variable. For example, if you wanted to see how employment levels affected GDP you could regress GDP against employment levels and see if there was a statistically significant relationship between the two variables. That’s easier said than done, so to show you exactly how this works we’re going to look at a sample regression from the world of music.
The Excel file being used for this post can be found here (Music Economics Music sales) and is free for all to use for educational purposes in economics classrooms throughout the world. It should also be noted that this data set does not relate to real-life musicians; it was constructed for the sole purpose of being a teaching aid. In this example we are going to try to see what affects the music sales of musicians. To do this we are going to run a regular linear regression and explain all the steps as we go along.
Constructing Our Model:
To run a regression you must first construct a model. Your model needs to contain at least two variables: a dependent variable and an independent variable. The dependent variable is the output which you are trying to explain (in our case music sales). An independent variable is something which you think could contribute to the dependent variable (in our case it could be whether or not the musician is currently touring).
If we were just going to look at two variables (a dependent and an independent variable) then our model would be called a univariate regression (also known as a simple regression) because it only looks at one factor which may influence the dependent variable. However, if we wanted to look at two or more variables which may affect our dependent variable, it would be called a multivariate regression (also known as a multiple regression) because it looks at multiple variables. For our example, we will be performing a multivariate regression. Our model will have a total of seven variables: one dependent variable and six independent variables.
The dependent variable is music sales and the independent variables will be the artist’s gender, their age, whether they make pop music, whether they are signed to a major record label, the number of albums they have, and whether they are currently touring or not. To show this model in a paper, economists generally write it out as an equation, which this post refers to as a production function.
A production function, in the loose sense used here, is an equation which shows the effect a series of inputs have on an output. It is generally written as: Y = b0 + b1X
Where Y is the output or dependent variable (music sales), X is the first of our independent variables (gender), and b1 is its coefficient, which measures how much Y is expected to change when X increases by one unit. b0 is the intercept or constant term: the expected value of our dependent variable when our independent variable is equal to zero. Before we write out our own production function there is one more term we need to include: the error term. The error term is included in our model as an acknowledgment that there are things in the real world which affect our dependent variable but which our model does not account for.
For example, in 2009 the single Killing in the Name by Rage Against the Machine went to the number one spot in the UK thanks to a campaign to make the song the Christmas number one, in protest against the dominance of X Factor winners taking the number one spot in Christmas week. This freak event did cause the sales of Rage Against the Machine’s music to increase, but it is very hard to predict, so our model can’t account for it happening. Events of this type are captured instead in the error term of the model. So, essentially the error term just measures the things that we didn’t include as independent variables in our model but which did affect the dependent variable in real life.
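With the error term added, the two-variable equation from earlier can be sketched as follows. The symbol ε for the error term is a notational assumption here; some textbooks use µ instead.

```latex
Y = b_0 + b_1 X + \varepsilon
```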
Expressing our Production Function as a Linear Equation:
Essentially a production function just tries to show what variables affect another variable. So a production function can be written like this:
Music Sales = Female + Age + Pop Music + Major Record Label + No. of Albums + Currently Touring + Error
But because econometrics is the analysis of economics using statistics and maths, we need to write this production function as a linear equation. This tends to make students run for the hills, but it’s really not that complicated. The linear equation for our example can be seen below:
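Written out from the variable definitions given in this post, the equation takes the following form. The exact layout and subscript placement are an assumption, since only the definitions of the terms are given in the text.

```latex
Y_i = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Age}_i + \beta_3 \text{Pop}_i + \beta_4 \text{MRL}_i + \beta_5 \text{Albums}_i + \beta_6 \text{Tour}_i + \varepsilon_i
```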
The use of mathematical symbols here tends to turn students off, but it is actually the same basic production function as up above. It’s just written as a linear equation. In linear equations Y is used to denote the dependent variable, the β’s denote the coefficients attached to the independent variables, and ε or µ is used to denote the error term.
Here, Y is music sales (our dependent variable), i identifies which musician the data refers to, and ε is the error term. β0 is the intercept or constant term; this is just the expected mean value of our dependent variable if all of our independent variables were set to 0. Each remaining β is the coefficient on one independent variable: in β1Female the variable records whether the musician is female or not, in β2Age it is the age of the musician, in β3Pop it is whether the musician makes pop music or not, in β4MRL it is whether the musician is signed to a major record label or not, in β5Albums it is the number of albums the musician has released to date, and in β6Tour it is whether the musician is currently touring or not.
Now that we have our linear equation written out we need data which measures the variables in our equation, and then we need to import this data into a data analysis software programme and run a regression. For this example we will be using the Stata software.
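Before turning to Stata, it may help to see roughly what "running a regression" calculates. The following is a minimal Python sketch (not Stata, and not the method used for this post's results) of ordinary least squares for the simplest case of one independent variable. The function name `simple_ols` and the numbers in `albums` and `sales` are hypothetical and made up purely for illustration.

```python
def simple_ols(x, y):
    """Estimate b0 and b1 in Y = b0 + b1*X by ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: how x and y move together, divided by the spread of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: makes the fitted line pass through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, invented for this sketch only.
albums = [1, 2, 3, 4, 5]
sales = [12.0, 15.0, 21.0, 24.0, 28.0]

b0, b1 = simple_ols(albums, sales)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

With several independent variables, as in our music sales model, statistical software applies the same least-squares idea using matrix algebra rather than these two formulas.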
In order to import the data into Stata, we must first format it in Excel in a manner that Stata will understand. Below is how I laid out the data for this example.
As can be seen above, the top row consists of only variable names. The rows which follow are the data which belong to their respective variables. For data which can be measured numerically (Age and Music Sales), the data is just coded as whatever the value of that variable is for that observation. For example, if musician 1 is 24 years old then the value of their age is just coded as 24.
But for variables which are categorical we need to assign them a numerical value, because a regression can only be calculated on numbers, not on text labels. For example, whether a musician is on tour or not does not have a numerical value, so instead we assign one to it. Usually for yes or no questions like this we code 1 as yes and 0 as no. So, when you see a 1 in the tour variable column it means that the musician is currently touring.
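The coding described above can be sketched in a few lines of Python. This is only an illustration: the three rows in `raw` are hypothetical values invented for the sketch (they are not taken from the post's data set), and the column names are assumptions.

```python
import csv
import io

# Hypothetical rows, made up for illustration only.
# The first row holds the variable names, as in the Excel layout.
raw = """musician,age,tour
1,24,yes
2,31,no
3,45,yes
"""


def encode_yes_no(value):
    """Code a yes/no answer as 1 (yes) or 0 (no)."""
    return 1 if value.strip().lower() == "yes" else 0


rows = list(csv.DictReader(io.StringIO(raw)))
coded = [
    {
        "musician": int(r["musician"]),
        "age": int(r["age"]),            # numeric data is kept as its value
        "tour": encode_yes_no(r["tour"]),  # categorical data becomes 1 or 0
    }
    for r in rows
]

for row in coded:
    print(row)
```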
Assigning numerical values to a categorical variable in this way is called creating a dummy variable. If you wanted to create dummy variables for a variable which had more than two categories then you use the (n-1) rule. The rule means that for a variable with n possible categories you create n-1 separate dummy variables, each coded 1 or 0, and leave one category out to act as the reference category. You use this rule to avoid falling into the “dummy variable trap”, where you include more dummy variables than are needed. If every category had its own dummy, the dummies would always add up to 1 for every observation, which makes them perfectly collinear with the intercept and means the coefficients cannot be estimated separately.
For example, if this sample of musicians were all from Ireland and you wanted to know which one of the four provinces they came from, you would take 1 away from 4, which would leave you with 3 dummy variables: one for Munster, one for Leinster, and one for Connaught. Each is coded 1 if the musician is from that province and 0 otherwise. A musician with a 0 in all three must be from Ulster, because it is the only choice left. This is why the (n-1) rule is used: if you know someone is from one of four places and they are not from the first three, logically they must be from the last remaining place.
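The province example can be sketched in Python as follows. The helper name `make_dummies` and the list of five musicians are hypothetical, chosen only to illustrate the (n-1) rule with Ulster as the left-out reference category.

```python
def make_dummies(values, reference):
    """Build one 0/1 column per category, leaving out the reference category."""
    categories = sorted(set(values) - {reference})
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical sample of five musicians, for illustration only.
provinces = ["Munster", "Ulster", "Leinster", "Connaught", "Munster"]

# Four provinces, so 4 - 1 = 3 dummy variables; Ulster is the reference.
dummies = make_dummies(provinces, reference="Ulster")

for name, column in dummies.items():
    print(name, column)
```

The second musician, who is from Ulster, ends up with a 0 in every column, which is how the reference category is identified.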
Importing Data into Stata and Running a Regression:
Once we save our data in an Excel document, we are ready to start Stata and run our regression. To do this we open a do file in Stata and start writing the code to run a regression. The code used for this example can be seen below.
The first line is used to set the working directory; this is effectively telling Stata where our data is on our computer. In my case “F:\Music Economics 2018\Econometrics” was the address for where my data file was located on my computer. This address will be different for everyone, depending on where you have saved the file, and it can be copied and pasted from the address bar in your PC’s file explorer.
The second line of code creates a log of your Stata session; this is basically a page of text with your Stata activity saved, which you can open at any time as a text file. This is easier than running the code again the next time you want to check your previous Stata activity. The third line of code tells Stata which file you want to import, that the first row of that file contains variable headings and not actual data, and that the data you wish to use is contained in sheet 1 of the Excel file.
The last line is used to run your regression. It is easy enough to perform: you give Stata the command “regress”, followed by the exact name of your dependent variable and then the list of independent variables you wish to regress it against. Once you run this you just have to interpret the results. Below are the results for this example’s regression.
When interpreting a regression there are two things we need to check: whether the model is significant and, if so, whether the independent variables are significant. Both are discussed below.
Is the Model Significant?
To see if the model is significant, we must check the Prob > F score, which for us is 0.0000. This is the p-value of an F-test of whether all of the independent variables taken together have no effect on the dependent variable. The closer this score is to 0 the better, and a score below 0.05 is conventionally treated as significant. Seeing as ours is reported as 0.0000, the model we have run is statistically significant. The second thing we check is the R-squared value. This measures the explanatory power of the model, that is, how much of the variation in the dependent variable it accounts for; it is a measure of fit rather than of significance. Our R-squared is 0.5498, which indicates that the model explains just under 55% of the variation in our dependent variable of music sales.
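To make the R-squared idea concrete, here is a minimal Python sketch of the standard calculation: one minus the ratio of unexplained variation to total variation. The `actual` and `fitted` numbers are hypothetical and are not taken from this post's regression.

```python
def r_squared(actual, fitted):
    """Share of the variation in `actual` that the fitted values explain."""
    mean = sum(actual) / len(actual)
    # Variation the model leaves unexplained (sum of squared residuals).
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    # Total variation of the dependent variable around its mean.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions, for illustration only.
actual = [12.0, 15.0, 21.0, 24.0, 28.0]
fitted = [11.8, 15.9, 20.0, 24.1, 28.2]

print(f"R-squared = {r_squared(actual, fitted):.4f}")
```

An R-squared of 1 would mean the fitted values match the data exactly, while a value near 0 would mean the model explains almost none of the variation.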
Are the Independent Variables Significant?
Finally, we check the P>|t| values, which are the p-values for each coefficient. These tell us the level of significance of each of our independent variables. Similarly to the Prob > F score, the closer to 0 the better. These scores are judged against a chosen cut-off. The cut-off we are working with is the 95% confidence level, which corresponds to a significance level of 0.05. This means that for values to be considered significant they cannot exceed 0.05. So, any score of 0.05 or lower is considered significant and any score above 0.05 is not.
We have two independent variables in our model which are significant at the 95% confidence level: whether the musician is currently touring and whether they are signed to a major record label. We know these variables are significant at the 95% confidence level because their P>|t| values are both reported as 0.000, which is lower than 0.05. All the other independent variables in our model have P>|t| values over 0.05, meaning they are not significant at the 95% confidence level.
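The decision rule above can be written as a tiny Python check. The values of 0.000 for touring and the major record label are the ones reported in this post; the entry named `hypothetical_other` and its value of 0.30 are invented purely to show a non-significant case.

```python
def is_significant(p_value, alpha=0.05):
    """Apply the rule: a p-value at or below alpha counts as significant."""
    return p_value <= alpha


p_values = {
    "tour": 0.000,                # reported in the post
    "major_record_label": 0.000,  # reported in the post
    "hypothetical_other": 0.30,   # made-up value, for illustration only
}

for name, p in p_values.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict}")
```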
This means we have found evidence that, out of all the variables we included in our model, only touring and being signed to a major record label have a statistically significant relationship with a musician’s music sales. Now that we have our results and have interpreted them, we can go about the final phase of our econometric example: hypothesis testing.
Hypothesis testing is kind of the whole point of doing econometric tests. A hypothesis is basically a proposed explanation or assumption about a particular topic. Prior to running econometric tests we can only hypothesize about the effect which one thing has on another, but after we run econometric tests we will be able to say whether the data supports our hypotheses or not. This is called hypothesis testing.
So let’s use our example of music sales. When testing a hypothesis for whether one variable affects another variable, we have two hypotheses. The first is called the null hypothesis and the other is called the alternative hypothesis. The null hypothesis is that the independent variable we are looking at has no effect on our dependent variable at all. This is because the word “null” literally means no value. The alternative hypothesis is that the variable we are looking at does have an effect on our dependent variable. For the purposes of exams and textbooks they are noted like this:
Null hypothesis: H0: β = 0
Alternative Hypothesis: H1: β ≠ 0
Before we run an econometric test, we assume that the null hypothesis is correct. If our econometric results then show that an independent variable has a statistically significant effect on our dependent variable, we can reject the null hypothesis for that variable. This means we have found that one or more of our independent variables have an effect on our dependent variable.
So, for our example of music sales, first we would assume the null hypothesis. Then we run our regression and find that touring and being signed to a major record label do have a significant effect on the sales of a musician’s music, so we can reject the null hypothesis for those two variables. For the other four variables we fail to reject the null hypothesis. This final hypothesis test concludes our example.
By Daragh O’Leary