Stokastik

Machine Learning, AI and Programming

Logistic Regression Analysis with Examples using R

In the last post we saw how to perform linear regression on a dataset with R. We also saw how to interpret the outcome of the linear regression model and analyze the solution using the R-Squared test for goodness of fit of the model, the t-test for significance of each variable in the model, the F-statistic for significance of the overall model, confidence intervals for the stability of the variables, and so on. Lastly, we analyzed the residuals for any correlation or pattern with the fitted values and the independent variables. In this post we are going to look at another widely used variant called logistic regression.

In linear regression, the true outcomes Y were continuous and were assumed to follow a normal distribution, whereas in logistic regression the true outcomes are generally binary (or categorical with multiple classes) and are assumed to follow a binomial/multinomial distribution. There is no R-Squared measure to assess the goodness of fit of the model, and the constant variance of the residuals ("homoscedasticity") assumption does not hold in the case of logistic regression. We will instead use the Chi-Square test along with its p-value to assess the goodness of fit of the model.

In linear regression, given the input variables X, we try to directly predict the output \widehat{Y} using the linear regression formula, whereas in logistic regression we model the odds of the outcome. Assuming that the output is binary {0,1}, the model tries to predict the odds of the output being 1, i.e.

\frac{P(\widehat{Y}=1|X)}{P(\widehat{Y}=0|X)}=\frac{P(\widehat{Y}=1|X)}{1-P(\widehat{Y}=1|X)}

since the outcome is binary, P(\widehat{Y}=0|X)=1-P(\widehat{Y}=1|X)

The value of the odds is always positive (it ranges from 0 to +Infinity), but the linear combination of the input variables, i.e.

\beta_0+{\beta_1}x^{(1)}+{\beta_2}x^{(2)}+...+{\beta_d}x^{(d)}

can be either positive or negative, hence we transform the odds using the logarithm so that the log odds lie in (-Infinity, +Infinity), i.e. we model the following :

\text{log}(\frac{P(\widehat{y}=1|x)}{1-P(\widehat{y}=1|x)})=\beta_0+{\beta_1}x^{(1)}+{\beta_2}x^{(2)}+...+{\beta_d}x^{(d)}

Thus, we have the expression for P(\widehat{y}=1|x) :

P(\widehat{y}=1|x)=\frac{e^{\beta_0+{\beta_1}x^{(1)}+{\beta_2}x^{(2)}+...+{\beta_d}x^{(d)}}}{1+e^{\beta_0+{\beta_1}x^{(1)}+{\beta_2}x^{(2)}+...+{\beta_d}x^{(d)}}}=\frac{1}{1+e^{-(\beta_0+{\beta_1}x^{(1)}+{\beta_2}x^{(2)}+...+{\beta_d}x^{(d)})}}

Denoting h(\beta, x)=\beta_0+{\beta_1}x^{(1)}+{\beta_2}x^{(2)}+...+{\beta_d}x^{(d)} for the remainder of the post, we have :

P(\widehat{y}=1|x)=\frac{1}{1+e^{-h(\beta, x)}}

For a unit increase in x^{(j)}, the log odds increase by \beta_j, or equivalently the odds get multiplied by e^{\beta_j}. This means that, all other variables held fixed, between an input in which x^{(j)} is 0 and one in which x^{(j)} is 1, the odds of the outcome \widehat{y}=1 are higher by a factor of e^{\beta_j} in the second case.
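For example, with a hypothetical co-efficient \beta_j=0.5, a unit increase in x^{(j)} adds 0.5 to the log odds, i.e. it multiplies the odds of \widehat{y}=1 by e^{0.5}\approx 1.65.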

The probability of observing an instance is given as :

P(\widehat{y}=1|x)^{y}(1-P(\widehat{y}=1|x))^{1-y}

Note that when y=1, we have the probability P(\widehat{y}=1|x) else when y=0, we have the probability 1-P(\widehat{y}=1|x). The overall probability of observing the N examples is given by :

L=\prod_{i=1}^NP(\widehat{y_i}=1|x_i)^{y_i}(1-P(\widehat{y_i}=1|x_i))^{1-y_i}

or the log likelihood :

\text{log} L=\sum_{i=1}^N[y_i*\text{log}(P(\widehat{y_i}=1|x_i))+(1-y_i)*\text{log}(1-P(\widehat{y_i}=1|x_i))]

where P(\widehat{y_i}=1|x_i)=\frac{1}{1+e^{-h(\beta, x_i)}}

We need to maximize the likelihood in order to compute the optimum values for the unknown co-efficients \beta_j. Taking the derivatives of the log likelihood w.r.t. \beta_j, we get :

\frac{{\partial}\text{log} L}{{\partial}\beta_j}=\sum_{i=1}^N(y_i-P(\widehat{y_i}=1|x_i))*x^{(j)}_i

We can solve for the unknowns iteratively using gradient ascent (the update below uses the full sum over all N examples; the stochastic variant would instead update with one example at a time), where :

\beta^{t+1}_j=\beta^t_j+\eta*[\frac{{\partial}\text{log} L}{{\partial}\beta_j}]=\beta^t_j+\eta*\sum_{i=1}^N(y_i-P(\widehat{y_i}=1|x_i))*x^{(j)}_i
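As an illustration, below is a minimal batch gradient ascent sketch in R; the function name logistic.fit, the learning rate eta and the number of iterations are arbitrary choices for this sketch, not part of any standard package :

# gradient ascent for logistic regression (illustrative sketch)
# X : N x (d+1) design matrix with a leading column of 1s, y : vector of 0/1 outcomes
logistic.fit <- function(X, y, eta = 0.001, n.iter = 10000) {
  beta <- rep(0, ncol(X))
  for (t in 1:n.iter) {
    p <- 1/(1 + exp(-X %*% beta))             # P(y=1|x) under the current beta
    beta <- beta + eta * t(X) %*% (y - p)     # ascent step along the gradient of log L
  }
  as.vector(beta)
}

Since the log likelihood is concave, for a suitably small eta and enough iterations the estimates approach the maximum likelihood co-efficients that glm computes below via Fisher scoring.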

 

Let's move on to solve a binary classification problem with logistic regression. The dataset considered is the "Titanic" dataset from the Kaggle website.

# read the csv file and drop the rows for which the label ("survived") is missing
titanic.data <- read.csv("titanic.csv")
titanic.data <- titanic.data[!is.na(titanic.data$survived),]

# impute missing "age" and "fare" values with the mean of the non-missing values
titanic.data$age[is.na(titanic.data$age)] <- mean(titanic.data$age[!is.na(titanic.data$age)])
titanic.data$fare[is.na(titanic.data$fare)] <- mean(titanic.data$fare[!is.na(titanic.data$fare)])

# fit the logistic regression model
model <- glm(survived ~ pclass+sex+age+sibsp+parch+fare, family=binomial(link='logit'), data = titanic.data)
summary(model)

First we read the data file (which is in csv format) and then drop all the rows in the dataset for which the label is missing. Some variables (such as "age" and "fare") have missing values; for these variables we substitute the missing values with the mean of the remaining values. Finally we fit the logistic regression model using the "glm" function in R.

Note that we are not using all the variables in the data to model the survival of a passenger. We have only considered "pclass" (upper, middle or lower class of the passenger), "sex" (male or female), "age", "sibsp" (number of siblings or spouses on board), "parch" (number of parents or children on board for the concerned passenger) and "fare", because we feel only these variables are likely to be important for modelling the survival outcome.

The output of the model fit is as follows :

Call:
glm(formula = survived ~ pclass + sex + age + sibsp + parch + 
    fare, family = binomial(link = "logit"), data = titanic.data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6191  -0.6551  -0.4526   0.6564   2.5102  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.476009   0.423929  10.558  < 2e-16 ***
pclass      -0.989913   0.112509  -8.799  < 2e-16 ***
sexmale     -2.590107   0.156273 -16.574  < 2e-16 ***
age         -0.036392   0.006263  -5.810 6.23e-09 ***
sibsp       -0.328919   0.090202  -3.646 0.000266 ***
parch       -0.036982   0.089354  -0.414 0.678965    
fare         0.002551   0.001850   1.379 0.167862    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1741.0  on 1308  degrees of freedom
Residual deviance: 1209.6  on 1302  degrees of freedom
AIC: 1223.6

Number of Fisher Scoring iterations: 5

On the top are the deviance residuals. The deviance residuals are given by the formula :

d_i=\sqrt{-2*\text{log}L_i}\text{ if }y_i=1\text{ else }d_i=-\sqrt{-2*\text{log}L_i}

where \text{log}L_i is the log likelihood of each data point given as :

\text{log}L_i=y_i*\text{log}(P(\widehat{y_i}=1|x_i))+(1-y_i)*\text{log}(1-P(\widehat{y_i}=1|x_i))

The values of P(\widehat{y_i}=1|x_i) are obtained from the fitted model, i.e.

preds <- fitted(model)

Thus, we have the deviance residuals as :

log.likelihoods <- titanic.data$survived*log(preds)+(1-titanic.data$survived)*log(1-preds)
deviance.res <- ifelse(titanic.data$survived == 1, 1, -1)*sqrt(-2*log.likelihoods)

If you compute the quantiles of "deviance.res" you will obtain the same output as given in the summary above.
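For instance, the quantile() call below reproduces the five-number summary of the deviance residuals shown in the model output :

quantile(deviance.res)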

Next is the table of co-efficients for the model variables. The estimates represent the change in the log odds per unit increase in the value of the variable. Thus for a middle class passenger (pclass=2), the log odds of survival decrease by 0.9899 when compared to a first class passenger (pclass=1). Similarly, for a male passenger (sexmale=1) the log odds of survival decrease by 2.59 compared to a female passenger (sexmale=0), and so on. The "age" factor is less prominent, but the negative value indicates that children are more likely to survive than older passengers.
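Since the co-efficients are on the log odds scale, exponentiating them gives the corresponding odds ratios; a quick way to view them for the fitted model :

# odds ratios : multiplicative change in the odds of survival per unit increase in each variable
exp(coef(model))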

The standard errors are computed by taking the square root of the diagonal entries of the following covariance matrix, i.e.

(X^TVX)^{-1}

where X is the design matrix of the variables under consideration (with a leading column of 1s for the intercept), and V is a diagonal matrix whose entries correspond to :

P(\widehat{y_i}=1|x_i)*(1-P(\widehat{y_i}=1|x_i)) for each i

# design matrix (an intercept column plus the model variables); model.matrix()
# applies the same coding for the factor variable "sex" that glm used
mydata <- model.matrix(model)

# V is a diagonal matrix with entries P(y_i=1|x_i)*(1-P(y_i=1|x_i))
v <- diag(preds*(1-preds), length(preds), length(preds))

std.err <- sqrt(diag(solve(t(mydata)%*%v%*%mydata)))

The z values along with the p-values in the last two columns of the co-efficients table represent the significance of each variable in the model. For the first 5 rows, the p-value is less than 0.05, indicating that these terms are significant to the model. The last two variables have high p-values, indicating that they are not significant. Thus "parch" and "fare" are not significant for the model.
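As a rough sketch of how these two columns arise (re-using the "std.err" vector computed above), the z value is the estimate divided by its standard error and the two-sided p-value comes from the standard normal distribution :

# Wald z-statistics and two-sided p-values for each co-efficient
z.values <- coef(model)/std.err
p.values <- 2*pnorm(-abs(z.values))

cbind(z.values, p.values)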

In order to assess the goodness of fit of the model : in linear regression we had the R-Squared metric, which computed the fraction of the variance in the data that could be explained by the model. In logistic regression we do not have an R-Squared metric; instead we use the Chi-Square statistic to assess the goodness of fit.

{\chi}^2=(-2*\text{log}L_0)-(-2*\text{log}L_{model})

where \text{log}L_{model} is the log likelihood of the above model (so that -2*\text{log}L_{model} is the residual deviance) and \text{log}L_0 is the log likelihood of the null model, in which only the intercept term is non-zero and all the other co-efficients are 0 (so that -2*\text{log}L_0 is the null deviance). Our null hypothesis is that the independent variables are not significant to the model, i.e. \beta_j=0 for j=1 to d.

In order to generate the null model, we use the glm function but without any of the variables to model the survival chances.

null.model <- glm(survived ~ 1, family=binomial(link='logit'), data = titanic.data)
summary(null.model)

The output is only the intercept term :

Call:
glm(formula = survived ~ 1, family = binomial(link = "logit"), 
    data = titanic.data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.981  -0.981  -0.981   1.387   1.387  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.48119    0.05689  -8.459   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1741  on 1308  degrees of freedom
Residual deviance: 1741  on 1308  degrees of freedom
AIC: 1743

Number of Fisher Scoring iterations: 4

Note that for the null model, the null deviance and the residual deviance are the same. To compute the null deviance, we have used :

null.preds <- fitted(null.model)
null.log.likelihood <- titanic.data$survived*log(null.preds)+(1-titanic.data$survived)*log(1-null.preds)
null.deviance <- sum(-2*null.log.likelihood)

and the residual deviance :

preds <- fitted(model)
log.likelihood <- titanic.data$survived*log(preds)+(1-titanic.data$survived)*log(1-preds)
residual.deviance <- sum(-2*log.likelihood)

Thus the Chi-Squared value comes out to be 1741.024-1209.604=531.42. The p-value of this Chi-Squared value with 1308-1302=6 degrees of freedom (the difference in degrees of freedom between the null model and the fitted model) comes out to be almost 0, hence the null hypothesis is rejected in favor of the model. The greater the Chi-Squared value, the better the model fits the data.
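Using the quantities computed above, this test can be reproduced directly in R (the degrees of freedom equal the number of independent variables, i.e. 1308-1302=6) :

# likelihood ratio (Chi-Square) test of the fitted model against the null model
chi.sq <- null.deviance - residual.deviance
p.value <- pchisq(chi.sq, df = 1308 - 1302, lower.tail = FALSE)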

As discussed above, logistic regression is primarily a binary classifier, hence the performance of the model should also be evaluated from the classification accuracy and discriminative power perspective. The model outputs the predicted probabilities P(\widehat{y_i}=1|x_i) for each fitted data point, but we need class labels instead, hence we commonly use the transformation rule that if P(\widehat{y_i}=1|x_i) > 0.5 then the predicted class label is 1, else it is 0.

pred.labels <- ifelse(preds > 0.5, 1, 0)
table("actual"=titanic.data$survived, "predicted"=pred.labels)

The output of the confusion matrix is :

      predicted
actual   0   1
     0 694 115
     1 161 339

i.e. the precision in predicting the passengers who survived (i.e. predicted class label 1) is 339/(339+115)=0.7467 (the fraction of the predicted survivors who are actual survivors) and the recall for the survivors (the fraction of the actual survivors who were predicted to survive) is 339/(339+161)=0.678.
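The same numbers can be computed directly from the confusion matrix; a small sketch (the variable names here are arbitrary) :

conf.mat <- table("actual"=titanic.data$survived, "predicted"=pred.labels)

# precision = TP/(TP+FP) and recall = TP/(TP+FN) for the "survived" class
precision <- conf.mat["1", "1"]/sum(conf.mat[, "1"])
recall <- conf.mat["1", "1"]/sum(conf.mat["1", ])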

Apart from precision and recall, as a measure of the discriminative power of the logistic regression classifier data scientists generally use another important metric known as the Area Under Curve (AUC). The curve in question is the ROC curve, the plot of the True Positive Rate vs. the False Positive Rate as the classification threshold is varied.

AUC is the preferred metric in the case of unbalanced class problems. If you randomly pick one example from class 0 and another from class 1, then AUC is the probability that the model assigns the higher predicted probability of being in class 1 to the example that actually belongs to class 1.

library(ROCR)

# ROC curve : True Positive Rate vs. False Positive Rate over all classification thresholds
pr <- prediction(preds, titanic.data$survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

# area under the ROC curve
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

ROC Curve for the Titanic dataset.

Above is the plot of the ROC curve for the Titanic dataset. The AUC value obtained is 0.842. Thus, if you randomly pick one example from each of the two classes, 84.2% of the time the model assigns the higher predicted probability to the actual survivor.
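As a sanity check of this pairwise interpretation, the same number can be estimated directly from the fitted probabilities (counting ties as half); a small sketch :

# fraction of (survivor, non-survivor) pairs where the survivor gets the higher predicted probability
p1 <- preds[titanic.data$survived == 1]
p0 <- preds[titanic.data$survived == 0]
mean(outer(p1, p0, ">") + 0.5*outer(p1, p0, "=="))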


Categories: MACHINE LEARNING, PROBLEM SOLVING
