
Linear Regression Analysis with Examples using R

In this post we are going to solve linear regression problems using R and analyze the solutions. We will be looking at multiple linear regression examples, in which the output variable is modeled as a function of more than one input variable, i.e. for the i-th example, the output y_i is assumed to be of the following form :

y_i={\beta}_0+{\beta}_1x^{(1)}_i+{\beta}_2x^{(2)}_i+...+{\beta}_dx^{(d)}_i+e_i

In the above equation, 'y' is the output variable which is assumed to be linearly dependent on each of the 'd' input variables [x^{(1)},x^{(2)},...,x^{(d)}], the {\beta}_j are the proportionality constants (termed the co-efficients), and e_i is the error in the prediction of the i-th example due to unwanted noise in the data. Thus our model can be written as :

\hat{y}_i={\beta}_0+{\beta}_1x^{(1)}_i+{\beta}_2x^{(2)}_i+...+{\beta}_dx^{(d)}_i

where e_i=y_i-\hat{y}_i

i.e. the errors are the difference between the true and the predicted responses from the model.

The regression is called linear because a unit change in any one of the input variables x^{(j)} (all other inputs remaining constant) changes \hat{y} by the constant amount {\beta}_j :

\frac{\partial{\hat{y}_i}}{\partial{x^{(j)}_i}}={\beta}_j

Assuming that there are N examples, we can represent the regression model by the following matrix notation :

\hat{Y}=XB^T

where \hat{Y} is an Nx1 matrix of the predicted values from the model for the N examples, B is a 1x(d+1) matrix of the coefficients [\beta_0, \beta_1, ..., \beta_d] and X is an Nx(d+1) matrix in which each row represents an example, the first column is all 1's and each column from column 2 onwards holds the values of one of the input variables [x^{(1)},x^{(2)},...,x^{(d)}].

The goal is to fit this model to the data we have. We need to find the value of the unknown parameters \beta_j such that the errors e_i are as close to zero as possible. Mathematically we want to minimize the sum of squares of the errors, i.e.

L=\sum_{i=1}^Ne^2_i=\sum_{i=1}^N(y_i-\hat{y}_i)^2

One obvious way to minimize the above loss function is to do a stochastic gradient descent on the unknown parameters as we have seen in an earlier blog post :

\beta^{t+1}_j=\beta^t_j-\eta*\frac{\partial{L}}{\partial{\beta_j}} for each \beta_j, where \eta is the learning rate.
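A minimal sketch of such an update loop in R is shown below (batch gradient updates rather than truly stochastic ones; the function name, learning rate and iteration count are illustrative and would need tuning, ideally with standardized inputs, to converge on a dataset like the one used later) :

gradient.descent <- function(X, y, eta = 1e-8, iters = 100000) {
  # X : N x (d+1) design matrix with a leading column of 1's, y : length-N response
  beta <- rep(0, ncol(X))                     # start with all coefficients at 0
  for (iter in 1:iters) {
    res <- y - X %*% beta                     # residuals e_i under the current beta
    beta <- beta - eta * (-2 * t(X) %*% res)  # step against the gradient of L
  }
  as.vector(beta)
}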

But the loss function is convex and hence has a global minimum, which we can find directly by computing the partial derivatives w.r.t. each \beta_j and setting them to 0.
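Writing the loss in matrix notation as L=(Y-XB^T)^T(Y-XB^T) and setting its derivative with respect to B^T to zero (a standard derivation, sketched here for completeness) gives the normal equations :

\frac{\partial{L}}{\partial{B^T}}=-2X^T(Y-XB^T)=0 \implies X^TXB^T=X^TY

Solving for B^T (assuming X^TX is invertible) gives :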

B^T=(X^TX)^{-1}X^TY

We will take a simple example using the "BostonHousing" dataset available in the "mlbench" package in R. Instead of computing the coefficients explicitly using the above formula or SGD (which we have already done in an earlier post), we will use the "lm" method in R to create the model. You can obtain the description of the dataset from the UCI repository.

We want to build a regression model for the median house prices given in the column "medv" of the data based on the remaining attributes. Below is the R code to do exactly that :

library(mlbench)
data(BostonHousing)
fit <- lm(medv ~ ., data=BostonHousing)
summary(fit)

Based on the above code, we obtain the following output :

Call:
lm(formula = medv ~ ., data = BostonHousing)
Residuals:
Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 
Coefficients:
Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas1        2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,	Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

The residuals are the error terms defined above, i.e. e_i=y_i-\hat{y}_i. The median value of the residuals from the 506 examples is -0.518. The residuals for each example can be obtained using the function :

residuals(fit)
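For instance, the quartiles shown in the "Residuals" section of the summary can be reproduced directly from these values :

median(residuals(fit))      # -0.518, matching the summary above
quantile(residuals(fit))    # five-number summary : Min, 1Q, Median, 3Q, Max of the residuals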

Next we come to the table of co-efficients. In the table, the first column i.e. "Estimate" gives us the estimated values for the co-efficients [\beta_0, \beta_1, ..., \beta_d] (\beta_0 refers to the "Intercept" term in the above table) and the second column gives the Std. Error associated with each co-efficient.

The model (and consequently the coefficients) we fit depends on the particular sample of data we are using. If we had considered different subsamples of the above data, we would have obtained different values for the coefficients; hence the Estimate and Std. Error columns. The estimates are given by the values of the vector :

(X^TX)^{-1}X^TY

# chas is stored as a factor in BostonHousing; convert it to numeric 0/1 first
BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
# design matrix X : a column of 1's for the intercept followed by the 13 predictors
mydata <- cbind(rep(1, nrow(BostonHousing)), as.matrix(BostonHousing[,1:13]))
labels <- as.matrix(BostonHousing$medv)
xtx.inv <- solve(t(mydata)%*%mydata)    # (X^T X)^{-1}
xty <- t(mydata)%*%labels               # X^T Y
coefs <- xtx.inv%*%xty                  # estimated co-efficients
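This manually computed vector should match (up to floating point precision) the "Estimate" column reported by lm, which can be checked with :

cbind(coefs, coef(fit))    # normal-equation solution alongside lm's estimates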

whereas the Std. errors are given by the following equation :

se(\beta_j)=\sqrt{(X^TX)^{-1}_{jj}}\sigma_e

where \sigma_e is the standard deviation of the residuals, i.e.

\sigma^2_e=\frac{\sum_{i=1}^Ne^2_i}{N-d-1}

where N-(d+1) is the number of degrees of freedom of the model. The Std. Errors indicate the stability of the co-efficients, and we will see below how they can be used to define confidence intervals. The below R code reproduces the Std. Error values shown in the output table from the summary of the model.

res <- residuals(fit)
# residual standard deviation with N-(d+1) degrees of freedom
std.dev <- sqrt(sum(res*res)/(nrow(mydata)-ncol(mydata)))
# standard errors of the co-efficients : sqrt of the diagonal of (X^T X)^{-1} times sigma_e
std.err <- sqrt(diag(solve(t(mydata)%*%mydata)))*std.dev
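As a quick sanity check, the same values are available from the summary object :

summary(fit)$coefficients[, "Std. Error"]    # should match std.err computed above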

The confidence intervals for the co-efficients are given by the formula :

[\beta_j-a*se(\beta_j), \beta_j+a*se(\beta_j)]

where 'a' is a multiplicative factor that depends on the level of confidence we want to achieve and on the number of degrees of freedom N-(d+1). For a 95% confidence interval and 506-14=492 degrees of freedom, 'a' is approximately 1.96. So for the feature 'crim' in our data, the 95% confidence interval is

[-1.080114e-01-1.96*0.032864994, -1.080114e-01+1.96*0.032864994]=[-0.1724, -0.0436]

i.e. if we repeatedly subsample from the given data and compute such an interval each time, then roughly 95 out of 100 of these intervals would contain the true value of the co-efficient of 'crim'. The confidence intervals for all the parameters can be obtained using the following method, where 'level' specifies the desired confidence level :

confint(fit, level = 0.95)
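The multiplier 'a' comes from the t-distribution with N-(d+1) degrees of freedom, so the interval for 'crim' can also be recomputed by hand (a small check using the Estimate and Std. Error from the table above) :

a <- qt(0.975, df = 492)             # ~1.9648, close to the 1.96 used above
-1.080e-01 + c(-1, 1)*a*3.286e-02    # approximately [-0.1726, -0.0434] for 'crim'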

The next two columns, i.e. "t value" and "Pr(>|t|)", are useful for testing the significance of each variable in the model, in the presence of the other variables. The "t value" is the ratio of the Estimate and the Std. Error for the co-efficient, i.e.

t=\frac{\text{Estimate}}{\text{Std.Error}}

Thus for the variable 'crim', the t-value is (-0.1080/0.03286)=-3.287, which is the value in the 3rd column. The higher the absolute value of the t-value, the greater the significance of the variable. "Pr(>|t|)" is better known as the p-value in statistics.

For the variable 'crim', the p-value gives the probability of observing a t-value at least as large as 3.287 in absolute value if, in reality, the co-efficient for the variable 'crim' were 0, i.e. had the variable 'crim' been insignificant for the model, what is the probability that a random subsampling of the above data gives such a t-value by chance ? The lower this probability, the less likely it is that the observed t-value is due to chance alone, and hence the more significant the variable is to the model. In practice, if the p-value for a variable is less than 0.05, then we consider that variable significant to the model. The number of stars at the end denotes the significance of the variable; a greater number of stars denotes greater significance.
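The reported p-value for 'crim' can be reproduced from the t-distribution with 492 degrees of freedom as a two-sided tail probability :

2*pt(abs(-3.287), df = 492, lower.tail = FALSE)    # ~0.0011, matching Pr(>|t|) for 'crim'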

In order to estimate the goodness of fit of the model to the data, we use the following metric :

R^2=1-\frac{\sum_{i=1}^N(Y_i-\hat{Y}_i)^2}{\sum_{i=1}^N(Y_i-\bar{Y})^2}

where Y_i are the true house prices, \hat{Y}_i are the house prices predicted by the model and \bar{Y} is the mean of the true prices. The R-Squared value gives the fraction of the variance in the data that can be explained by the model; the greater the value of R-Squared, the more of the variation in the data the model captures and hence the better the fit.

The value of R-Squared for our model is 0.7406, which is the value given by the "Multiple R-squared" in the summary of the model.

y.bar <- mean(labels)
preds <- fitted(fit)
1-sum(res*res)/sum((labels-y.bar)*(labels-y.bar))
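The same value is also stored in the summary object, so we can cross-check it directly :

summary(fit)$r.squared    # 0.7406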

Thus about 74% of the variance in the actual data is captured by the model (which is not bad).

As you add more variables to your model, the R-Squared value of the new, bigger model will always be at least as large as that of the smaller model. This is because all the variables of the smaller model are also present in the bigger one, so their contribution to explaining the dependent variable is retained, and whatever new variable we add can only add (even if not significantly) to the variation that was already explained. This is where the adjusted R-Squared value helps: it penalizes the R-Squared value for the number of terms (predictors) in the model.

\text{Adjusted-}R^2=1-\frac{\frac{\sum_{i=1}^N(Y_i-\hat{Y}_i)^2}{N-d-1}}{\frac{\sum_{i=1}^N(Y_i-\bar{Y})^2}{N-1}}

where 'd' is the number of variables used in the model (plus 1 for the intercept term). The Adj. R-Squared value comes out to 0.7338 or 73.38%, which is a bit less than the plain R-Squared.
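With N=506 and d=13, the adjusted value can be recomputed from the residuals and compared with the one stored in the summary object :

1 - (sum(res*res)/(506-13-1))/(sum((labels-y.bar)*(labels-y.bar))/(506-1))
summary(fit)$adj.r.squared    # 0.7338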

The F-Test allows us to test the significance of the overall regression model, i.e. to answer the question : is there any relationship between the labels Y and the variables X ? The F-Statistic is defined as :

\text{F-Statistic}=\frac{\frac{\sum_{i=1}^N(\hat{Y}_i-\bar{Y})^2}{d}}{\frac{\sum_{i=1}^N(Y_i-\hat{Y}_i)^2}{N-d-1}}

(sum((preds-y.bar)*(preds-y.bar))*(nrow(mydata)-ncol(mydata)))/(sum(res*res)*(ncol(mydata)-1))

gives the F-Statistic as 108.0767. To test whether this F-Statistic value is significant enough to reject the NULL hypothesis (i.e. that the outputs Y do not depend on the input variables X), we compute the p-value for the F-Statistic. The p-value we obtain is less than 2.2e-16, implying that if the outputs Y were truly independent of the variables X, then the probability of observing an F-Statistic at least this large would be less than 2.2e-16, which is negligible; thus the model is significant in explaining the output labels.
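The p-value of the F-Statistic can be computed from the F-distribution with 13 and 492 degrees of freedom :

pf(108.0767, df1 = 13, df2 = 492, lower.tail = FALSE)    # < 2.2e-16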

Apart from the R-Squared and the F-Statistic measures for the goodness of fit of the model to the actual data, there are better measures like the AIC and the BIC information criteria (we have encountered them earlier in the context of choosing the optimal number of clusters in clustering).

\text{AIC}=-2ln(L)+2k

\text{BIC}=-2ln(L)+kln(N)

where L is the maximum value of the likelihood function for the model and k is the number of model parameters. The lower these values, the better the model fit. To compute the maximum likelihood, we need the likelihood function for the model. The residuals are assumed to follow a normal distribution with mean \mu and variance \sigma^2, thus :

L_i=P(e_i)=\frac{1}{\sqrt{2{\pi}}\sigma}e^{-\frac{(e_i-\mu)^2}{2{\sigma}^2}}

The overall likelihood of the N observed errors is the product of the individual probabilities. Thus the log likelihood of the observed errors is :

log(L)=-N*log(\sqrt{2{\pi}}\sigma)-\frac{\sum_{i=1}^N(e_i-\mu)^2}{2{\sigma}^2}

Plot of the histogram for the residuals. Approximately a normal distribution.
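The histogram itself can be drawn from the residuals computed earlier, for example :

hist(res, breaks = 30, main = "Histogram of residuals", xlab = "Residual")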

Maximizing the above likelihood gives back the same sum-of-squared-residuals criterion that we want to minimize. Hence we can take the residuals computed from the fitted model, substitute \mu with their mean and \sigma with their standard deviation in the above formula, and obtain the maximum log-likelihood value :

-(506*log(sqrt(2*pi)*sd(res)) + sum((res-mean(res))*(res-mean(res)))/(2*var(res)))    # log-likelihood at the fitted residuals, approximately -1499

where N=506. Thus the maximum log-likelihood value is approximately -1499. This gives us the "ln(L)" term for computing AIC and BIC; for 'k', the number of model parameters, we have 14 co-efficients plus 1 parameter for the variance of the residuals, i.e. k=15 parameters in total. Thus :

\text{AIC}=-2*(-1499)+2*15=3028

\text{BIC}=-2*(-1499)+15*log(506)=3091

For our model with all the variables, the AIC and the BIC values obtained are 3028 and 3091 respectively, which are the same as returned by the in-built functions.

AIC(fit)
BIC(fit)
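The maximized log-likelihood is also available directly via logLik, which makes it easy to verify both criteria by hand :

logLik(fit)                                  # ~ -1498.8 on 15 degrees of freedom
-2*as.numeric(logLik(fit)) + 2*15            # AIC
-2*as.numeric(logLik(fit)) + 15*log(506)     # BIC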

In order to validate that the model with the set of variables used is in fact a good model, we need to test it on unseen data. We can do a K-fold cross validation on the training data itself to understand how the model with all the variables behaves on held-out testing data. In K-fold CV, we divide the available data into K equal parts, build the model on K-1 parts of the data and test it on the remaining held-out part. Thus for a 3-fold CV, we divide the data into training and testing in a 2:1 ratio and repeat the experiment 3 times.

Below is R code that does a K-fold cross validation and reports the AIC, BIC, Adjusted R-Squared and F-Statistic metrics for the trained model, along with the sum of squared errors on the testing data :

k.fold.cv <- function(mydata, k=3) {
  max.size <- ceiling(nrow(mydata)/k)
  # split the data into k (approximately) equal consecutive parts
  cv.data <- lapply(1:k, function(i) mydata[((i-1)*max.size+1):min(c(nrow(mydata), (i*max.size))),])

  do.call(rbind, lapply(1:k, function(i) {
    model.data <- do.call(rbind, cv.data[-i])   # train on all parts except the i-th
    test.data <- cv.data[[i]]                   # hold out the i-th part for testing

    fit <- lm(medv ~ ., data=model.data)

    preds.train <- fitted(fit)
    preds.test <- predict(object = fit, newdata = test.data)

    res.train <- model.data$medv-preds.train
    res.test <- test.data$medv-preds.test
    y.bar.train <- mean(model.data$medv)

    # adjusted R-Squared and F-Statistic on the training portion
    rsq.train <- 1-(sum(res.train*res.train)*(nrow(model.data)-1))/(sum((model.data$medv-y.bar.train)*(model.data$medv-y.bar.train))*(nrow(model.data)-ncol(model.data)))
    fstat.train <- (sum((preds.train-y.bar.train)*(preds.train-y.bar.train))*(nrow(model.data)-ncol(model.data)))/(sum(res.train*res.train)*(ncol(model.data)-1))

    # sum of squared prediction errors on the held-out portion, per degree of freedom
    sum.sq.err <- sum(res.test*res.test)/(nrow(test.data)-ncol(test.data))

    data.frame("AIC"=AIC(fit), "BIC"=BIC(fit), "RSQ"=rsq.train, "FSTAT"=fstat.train, "SUM-SQUARED-ERROR"=sum.sq.err)
  }))
}

On running the BostonHousing data with a 5-fold CV, we get the following output :

   AIC  BIC   RSQ FSTAT SUM.SQUARED.ERROR
1 2470 2530 0.738  88.4               636
2 2428 2488 0.715  78.6               689
3 2413 2473 0.689  69.6              1048
4 2183 2243 0.846 171.5               470
5 2462 2522 0.729  85.2               481

As you can see, when we leave out the 4th portion of the data for testing, the model built with the remaining data (portions 1, 2, 3 and 5) is the best among all the sub-models, and it also fits the testing data well. In order to compare multiple regression models with different subsets of the variables, we can do a K-fold CV on each of the models and compare their metrics.

For example, in one model we include all the variables and in another model we exclude the variables "age" and "indus" :

BostonHousingWithoutAge <- BostonHousing[-which(colnames(BostonHousing) %in% c("age", "indus"))]
k.fold.cv(BostonHousingWithoutAge, 5)

The output is as follows :

   AIC  BIC   RSQ FSTAT SUM.SQUARED.ERROR
1 2467 2519 0.739 104.7               625
2 2426 2478 0.714  92.6               691
3 2411 2463 0.689  82.2              1029
4 2199 2251 0.839 192.4               461
5 2459 2511 0.730 100.9               467

Without the "age" and "indus" variables, the sum squared error is better in 4 out of 5 folds, the AIC is also better in 4 out of 5 folds, RSQ is better in 3 out of 5 folds, FSTAT is better in all 5 folds. Hence it seems that the model without the "age" and "indus" variables is better compared to including all the variables in the final model for the purpose of generalization to unseen datasets.

When we were starting to build the model, we made the assumption that the residuals are just noise in the actual data and have no correlation or pattern with the predicted house prices or with the explanatory variables like "crim", "zn", etc. If the residuals do exhibit patterns or correlations with the predicted house prices and/or the explanatory variables, then our linear model assumption is probably not correct and a non-linear model is a better fit for the data.

In order to verify our assumption that the residuals are just random noise (with a normal distribution), we plotted the residuals against the predicted house prices and some of the explanatory variables :

Residual Plot of the fitted house prices vs. the residuals.

From the above residual plot of the predicted house prices vs. the residuals, it seems that the residuals are scattered randomly around their mean, and the correlation co-efficient between the fitted values and the residuals is close to 0, implying that there is no visible relation between them. One important assumption of multiple linear regression is "homoscedasticity", i.e. constant variance of the residuals : in order for the linear model assumption to hold true, the variance of the residuals around each fitted value should be constant.

The correlation coefficients of the residuals with the explanatory variables are also nearly 0. The following is the plot of the variable "crim" against the residuals :

Residual plot of the crime rate vs. the residuals.
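A sketch of how such diagnostic plots and correlations can be produced in base R (the exact plotting options used for the figures above are an assumption) :

# residuals vs. fitted values
plot(fitted(fit), residuals(fit), xlab = "Fitted medv", ylab = "Residuals")
abline(h = 0, lty = 2)
cor(fitted(fit), residuals(fit))            # close to 0
# residuals vs. an individual explanatory variable, e.g. crim
plot(BostonHousing$crim, residuals(fit), xlab = "crim", ylab = "Residuals")
cor(BostonHousing$crim, residuals(fit))     # also close to 0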

For a time series regression, there could be sequential patterns in the residuals, i.e. the residuals at times t, t+k, t+2k, t+3k and so on might be correlated for some k > 0. These correlations can be detected using the autocorrelation function. There are techniques to remove these autocorrelations from the residuals, which we will cover in a later post.
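For completeness, the autocorrelation of the residuals can be inspected with the built-in acf function (not very meaningful for this cross-sectional dataset, but this is how one would check a time series fit) :

acf(residuals(fit))    # plot of the autocorrelation of the residuals at increasing lags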

