The International Summer School Program on “Data, Monitoring and Evaluation” is a two-month immersive, hands-on online certificate training course organised by IMPRI Impact and Policy Research Institute, New Delhi. Day 3 of the program, on June 17th, started with a session on “Hands-on Data Learning: Interpretation of Model using EVIEWS” by Prof. Nilanjan Banik. Prof. Banik began with a recapitulation of the previous sessions, in which distribution and density functions and their purpose had been discussed.
He then discussed the need for policy intervention and the statistical significance of government policies, and explained the concept of multiple regression. Out of the many variables available, one is chosen as the dependent variable and modelled as a function of several independent variables. The aim is to find which factors influence the dependent variable.
Linear Regression and Data
The linear regression technique is explained in the context of economic analysis, using the example of quantity demanded and price. Here, quantity demanded is the dependent variable (Y), and the price of the good is the independent variable (X). Other factors such as taste, income, and lifestyle can also be added as further independent variables.
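The demand example can be sketched as a one-variable least-squares fit. This is only an illustration: the price and quantity figures below are invented, and the helper name `simple_ols` is hypothetical rather than anything shown in the session.

```python
# Hypothetical price and quantity-demanded data, invented for illustration.
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
quantities = [100.0, 92.0, 85.0, 75.0, 68.0]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = simple_ols(prices, quantities)
print(f"quantity = {intercept:.2f} + ({slope:.2f}) * price")
```

With data like this the fitted slope is negative, in line with the usual expectation that quantity demanded falls as price rises.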
Next, the assumptions behind a linear regression function are stated. First, the function must be linear in its parameters, the slope coefficients (β). Second, the independent variables (X) are the ones that influence the dependent variable. These independent variables can enter the model in either linear or nonlinear form, for example as a squared term. Prof. Banik used the example of a cost curve to demonstrate the use of linear and nonlinear independent variables.
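A cost curve of this kind can be sketched as a regression that includes both output and output squared as regressors, so the model stays linear in its coefficients even though the curve itself is not a straight line. The quadratic form, the data, and the helper names `solve_linear_system` and `ols` are assumptions made for this illustration, not details taken from the session.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical cost data generated from cost = 50 + 3*q + 0.5*q^2.
output = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cost = [50 + 3 * q + 0.5 * q ** 2 for q in output]

# Each row holds a constant, q, and q squared: nonlinear in q, linear in beta.
design = [[1.0, q, q ** 2] for q in output]
b0, b1, b2 = ols(design, cost)
print(f"cost = {b0:.2f} + {b1:.2f} * q + {b2:.2f} * q^2")
```

Because the data here are generated without noise, the fit recovers the coefficients used to build it; real cost data would only approximate them.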
Errors found in model
Next, he defines the error (U) as everything we have not controlled for in the model, i.e., the variables that aren’t explicitly included. The key assumption here is that the independent variable (X) and the error (U) are uncorrelated; when this holds, the model does not suffer from endogeneity. If the errors themselves are correlated with one another across a time series or cross-section, then we have a problem of “autocorrelation”.
The problem of heteroscedasticity is then introduced. The standard assumption, homoscedasticity, is that the variance of the error is uniform across the sample. If the variance of the error is instead related to X, then we have a problem of heteroscedasticity. Finally, if the independent variables, i.e., the variables in X, are related to one another, then we have a problem of multicollinearity.
The concept of Stratified Sampling is introduced with an example from an econometrics textbook, showing the different levels of income involved in sampling.
Collection of data for multiple linear regression
He then details the process of collecting data for multiple linear regression. He uses the example of two prominent localities of Delhi, Greater Kailash and Delhi, from which he may collect data from 5 people, each with an income level of 80 dollars/rupees, to understand how their consumption varies. Once the sample is created, multiple linear regression is carried out. He explains how this technique can be used to predict the price of a house, in contrast to other techniques that may consider only historical prices.
This technique instead considers all sorts of information, such as the opening of a new metro line or the age of the house. Next, he mentions the importance of using the F-test to measure the joint statistical significance of the variables and to choose which variables to include in the model. He used Microsoft Excel during the session to apply the concepts he was explaining.
Further, he explains the concept of the p-value. A p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. The lower the p-value, the stronger the evidence that the variable is statistically significant. As a rule of thumb, he states that if the p-value is below 0.10, the variable can be accepted as significant.
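The 0.10 rule of thumb can be sketched in a few lines. One loud assumption: the code below uses a large-sample normal approximation for the test statistic, whereas regression software typically uses the t distribution, so the numbers will differ somewhat in small samples. The function names are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(test_statistic):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(test_statistic)))


def is_significant(test_statistic, threshold=0.10):
    """Apply the session's rule of thumb: p-value below 0.10."""
    return two_sided_p_value(test_statistic) < threshold


# Hypothetical test statistics for two coefficients.
for stat in (2.5, 1.0):
    p = two_sided_p_value(stat)
    print(f"statistic={stat:.2f}  p-value={p:.3f}  significant={is_significant(stat)}")
```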
Next, he uses “EViews”, an application for statistical and econometric analysis, and shows live how to use it. He once again demonstrates the process of calculating p-values (statistical significance of variables), this time in EViews. Furthermore, this application is used to detect the problem of autocorrelation using the Durbin-Watson test. If the Durbin-Watson statistic is in the vicinity of 1.8 to 2.2, then there is no problem with autocorrelation. But if it is lower than 1.8 or higher than 2.5, then we have a problem with positive or negative autocorrelation, respectively.
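For readers without EViews, the Durbin-Watson statistic can be computed directly from the regression residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below applies the thresholds quoted in the session, with values below 1.8 read as positive autocorrelation and values above 2.5 as negative autocorrelation; the residual series and the `classify` helper are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def classify(dw):
    """Interpret the statistic using the session's rule-of-thumb bands."""
    if 1.8 <= dw <= 2.2:
        return "no autocorrelation problem"
    if dw < 1.8:
        return "possible positive autocorrelation"
    if dw > 2.5:
        return "possible negative autocorrelation"
    return "inconclusive under this rule of thumb"


# Hypothetical residuals: one trending series, one alternating in sign.
trending = [1.0, 2.0, 3.0, 4.0, 5.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

for name, series in (("trending", trending), ("alternating", alternating)):
    dw = durbin_watson(series)
    print(f"{name}: DW={dw:.3f} -> {classify(dw)}")
```

Residuals that drift in one direction give a statistic close to 0, while residuals that flip sign every period push it towards 4.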
The process of checking for multicollinearity is demonstrated through EViews. Prof. Banik mentions that if the variance inflation factor (VIF) is more than 10, then there is a problem of multicollinearity. If it is below 10, then there is no problem with multicollinearity.
Similarly, Prof. Banik demonstrates through EViews how to check for heteroscedasticity. Here the user must look at the residual diagnostic, which tells us whether the variance of the error is related to X. As a rule of thumb, if the value of the residual diagnostic is more than 1.0, then there is a problem of heteroscedasticity. If it is below 1.0, then the data is homoscedastic.
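The VIF rule can be illustrated for the simplest case of exactly two regressors, where the VIF of either one reduces to 1 / (1 - r²), with r the correlation between them. With more regressors the r² comes from regressing each variable on all the others, which this sketch does not cover. The data and function names are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return (sxy * sxy) / (sxx * syy)


def vif_two_regressors(x1, x2):
    """VIF when the model has exactly two regressors: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - squared_correlation(x1, x2))


# Hypothetical regressors: x2 moves almost in step with x1, x3 does not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 11.0]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0]

for name, other in (("x2", x2), ("x3", x3)):
    vif = vif_two_regressors(x1, other)
    verdict = "multicollinearity problem" if vif > 10 else "no problem"
    print(f"x1 with {name}: VIF={vif:.2f} -> {verdict}")
```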
In the dataset used during the session as an example, it is shown that there is only a problem of autocorrelation and not heteroscedasticity or multicollinearity.
In a new dataset, in which the dependent variable is the number of hours a woman worked in a particular year, he shows how the problem of multicollinearity can be observed among independent variables such as hourly wage, education, age, etc.
In another dataset, with the abortion rate in the USA as the dependent variable and independent variables such as education, state funding, income, religion and price, he shows us the problem of heteroscedasticity. We also observe how factors like price and income play a greater role than funding and religion in deciding the abortion rate.
In the end, he summarised all he covered in the detailed presentation and shared the data he used in the presentation. He opened the floor to questions from participants. Questions regarding the concepts of the session were asked and patiently answered by Prof. Banik.
The session concluded with a vote of thanks.
Acknowledgement: Garvit Gupta is a research intern at IMPRI.