# Machine Learning Interview Questions and Answers (Part III)

• #### Which loss function is better for neural network training, logistic loss or the squared error loss and why ?

The loss function depends mostly on the type of problem we are solving and the activation function. In case of regression where the values from the output units are normally distributed, the squared error is the preferred loss function whereas in a classification problem the output units follows the multinomial distribution, the cross entropy loss function is the preferred choice.

Assuming that we are working on a multi-class classification problem with softmax activation function, lets see what are the bad properties of the squared error loss that makes it unsuitable.

Assuming that we have 3 output nodes, with the actual outputs from each node being [0, 1, 0], i.e. the instance belongs to class 2 and the predicted outputs being [0.1, 0.6, 0.3] (softmax normalizes the output probabilities), then the squared error loss is given to be :

$(0-0.1)^2+(1-0.6)^2+(0-0.3)^2=0.26$

Now for another instance belonging to the same class, the predicted outputs are [0.2, 0.6, 0.2], and thus the squared error is :

$(0-0.2)^2+(1-0.6)^2+(0-0.2)^2=0.24$

Although the posterior probability of the actual class is same in both the instances, but the errors are different, because the posterior probabilities of the other classes are different. Ideally this should not be the case because we are predicting the actual class with the highest probability which is same in both instances, thus error should only reflect "how far" we are from predicting the actual class correctly.

Whereas with cross entropy, the loss from both the instances is -log2(0.6)=0.74. Since for only one output node the actual output is 1 rest all are 0. Thus cross entropy captures what we really want for the loss function to reflect.

It can be shown that during backpropagation of errors, if the gradient of the cross entropy loss w.r.t. a particular weight $w_{ij}$ is G, then the gradient of the squared error w.r.t. the same weight is :

$G*\text{out}_j*(1-\text{out}_j)$

where $\text{out}_j$ is the posterior probability of the node 'j' (the weight $w_{ij}$ is from node 'i' of one layer to node 'j' of the next layer). Now if $\text{out}_j$ is close to 0 or close to 1, then the term $\text{out}_j*(1-\text{out}_j)$ would be very small, and thus there would be no significant updates to these weights (vanishing gradient problem) and the squared loss would be stuck inside a local minima.

• #### When would you prefer L1 regularization over L2 regularization and why ?

The purpose of regularization is to prevent overfitting of a model to the training data and reduce generalization error i.e. error on unlabelled test data. As we know already that two common approaches of regularizing a model is to add either the absolute sum of the parameters of the model (L1 regularization) or the sum of squares of the parameters (L2 regularization) to the objective function L to minimize.

L1 regularization : $L(y, \widehat{y}|\theta_1,\theta_2, ...,\theta_m)+{\lambda}\sum_{i=1}^m||\theta_i||$

L2 regularization : $L(y, \widehat{y}|\theta_1,\theta_2, ...,\theta_m)+{\lambda}\sum_{i=1}^m||\theta_i||^2$

where ${\lambda}>0$.

For example, in linear regression L would be the mean squared error loss function :

$L(y, \widehat{y}|\theta_1,\theta_2, ...,\theta_m)=\frac{1}{N}\sum_{k=1}^N(y_k-(\theta_1x^k_1+\theta_2x^k_2+...+\theta_mx^k_m))^2$

The idea of adding those terms at the end of the objective function (to minimize) is to make the parameters $\theta_i$ small or zero. Due to outliers in the training data, the objective function without the regularization term would try to model these outliers and as a result the parameters associated with features of outliers, would have very high values compared to other parameters and as a result the model would lose generalization capability. Overfitting. Straight line represents model with regularization whereas the curved one represents the model without regularization.

According to this paper by Andrew Ng, L1 regularization makes the input feature space sparse, i.e. makes most of the parameters set to 0. L1 regularization has the inherent capability of doing feature selection i.e. only for important features, $\theta_i{\ne}0$. Also the number of training examples needed increases logarithmically with the number of irrelevant features, i.e. even if the number of outlier features increases, the number of additional training instances required to correctly model the remaining relevant features only increases "slightly".

Whereas in L2 regularization, the parameters are reduced in magnitude but not set to 0 and thus explicit feature selection algorithms are required to select only the relevant features for further modeling. The number of training examples needed increases linearly with the number of irrelevant features, i.e. compared to L1 regularization, algorithms using L2 regularization would require more number of training instances if there are many irrelevant features but very few relevant features.

Thus if a classification problem has too many features but most of them are irrelevant, then L1 regularization is a better choice compared to L2 regularization. But if there are many important features, then L1 regularization would make the weights for these features 0 but L2 would only reduce their weights and not make them 0, so L2 regularization would perform better in this case.

• #### What are some of the steps would you take to improve the performance of a classification model based on prediction results (class probabilities, precision, recall, ROC curve etc.) on a test data set ?

It is rarely that the first cut of your machine learning model performs as expected. From my own experience, even after doing several feature engineering, carefully tuning the feature selection algorithms, selecting the best parameters for the model through K-fold cross validation, using combination of classifiers and so on, the classification error rate on unseen test data were 5-10% more than estimated.

The problem I was working on had more than 500 classes. So its not really a surprise.

Some key steps that one can take to analyze why the performance was low on unseen data are as follows :

• Different class distribution in training and testing data. It could be possible that, the classes which were modeled incorrectly in the training data, have higher frequencies in the testing data. Since we do not have control over the distribution of the testing data as they are coming in a stream, only solution is probably to improve the classifier on poorly performing classes during training.
• Lookout for class imbalance. If it is the case in point no. 1, then it is most probably due to class imbalance, i.e. there are some classes with too many samples in training data and there are some classes with too few samples. One solution is to oversample the minority classes instances (duplicate each of them by say 5 fold or so) and re-model, or generate synthetic samples for the minority classes instances or use some other metric to report performance of the classifier that takes into account the class imbalance scenario.
• Instead of reporting overall classification accuracy use stratified sampling strategy. Randomly sample (with replacement) equal number of instances from each class (sample size less than equal to the minimum class size), compute the classification accuracy. Now repeat this experiment several times and then take the average of the classification accuracy. This can be performed both for cross validation as well as for test data. Report numbers with the confidence intervals (for some desired level of confidence 95% or 99%). For example, if the mean accuracy is 0.85 (85%) and standard deviation is 0.05 with 1000 different experiments, then the 95% confidence interval is :

$[0.85-1.96\frac{0.05}{\sqrt{1000}},0.85+1.96\frac{0.05}{\sqrt{1000}}]=[0.847, 0.853]$,

i.e. the accuracy on an unseen data would lie between this interval with 95% confidence.

• Correlation between class confidence probabilities and correctness. If it is the case that there are many incorrect predictions with high confidence scores then probably there is some error in building the model or the true class labels assigned are not correct. The labelings for the data must be reviewed.
• Confusion matrix of classes. With so many number of classes, it is highly likely that two or more classes would be very similar to one another based on their input features and many times one class is predicted into another and vice-versa. One way to identify such classes is to look at the class confusion matrix of the predictions, i.e. the entry $C_{ij}$ of the confusion matrix denotes that how many instances of class 'i' is being predicted as an instance of class 'j'. If classes 'i' and 'j' are very similar then the normalized probabilities for $C_{ij}$ and $C_{ji}$ should be high.
• Confusion matrix. Class A and Class C seems to be similar to one another.

• One way to handle this is to use multi-level classifiers. The approach is to combine the instances of the similar classes (A & C) and model them as a single class by the top level classifier and use a highly non-linear classifier (deep neural networks etc.) to model the similar classes at the granular level. In this way if the top-level classifier predicts the combined class as the output, then we can apply the non-linear classifier at the next level to decide which base class to pick.
• Model 0 combines class A and class C instances. Model 1 is a non-linear classifier that classifies between class A and class C

• Different sets of features in training and testing data. Your model might be outdated as the data used to model may not hold true any more as features have been added or deleted. Feature importance might have changed. To check whether features have changed or feature importance changed, one could use K-Means clustering. Create the initial clusters with only training data where each cluster contains only instances from the same class. Then for each test instance, add the instance to the corresponding cluster and analyze the squared distances of each point from the cluster centroids. If the training and testing instances for the same class differ in the feature space, then the average squared distances for that cluster instances would increase, as well as the standard deviation of the squared distances.
• #### If you have to create an ensemble of different classifiers (e.g. logistic regression & naive bayes), how would you combine the results from different classifiers into a single prediction on the same data set ?

It appears to be quite intuitive that if we combine the power of different classifiers on the same dataset, then the performance of the combined models "should" be better as compared to using these classifiers alone. But this method works only if the predictions from the individual classifiers are uncorrelated (independent) and the performance of the individual classifiers is at-least better than random guess (for a binary classifier, accuracy > 50%, for N classes, 1/N).

If the predictions are correlated, i.e. if one classifier correctly predicts some instance then the majority of other classifiers are also predicting the same instance correctly and similarly if one classifier incorrectly predicts some instance then the majority of other classifiers are also predicting the same instance incorrectly, then we see that combining these classifiers would not give us any benefit because an instance which is predicted incorrectly by one classifier, is most likely to have been predicted incorrectly by the remaining classifiers too and so the combination will not correct the misclassification.

There are multiple techniques to combine the powers of classifiers, one class of technique uses voting such as majority voting, weighted voting, averaging or ranked averaging that directly computes the combined output using some voting function without any learning. In the second class, we use a meta-learner, i.e. another classifier that learns from the predictions of individual base classifiers. The meta-learner is usually a simple classifier such as a decision tree. The final output is predicted by the meta-learner.

Assuming that we are working with a binary classification problem :

• In majority voting, if we have 3 classifiers, then we take the correct output to be the one that is predicted by at-least 2 out of 3 classifiers, i.e. if the predictions are [0,1,0] then the final output is 0.

Assuming that each classifier predicts an instance correctly with a probability of 'p'. With 3 classifiers, the probability of correct prediction is = probability that two classifiers predicts correctly + probability that all 3 classifiers predicts correctly.

$g(p)=3*p^2(1-p)+p^3$, since 2 out of 3 classifiers can predict correctly in 3 different ways.

For $g(p)>p$, we must have $2p^2-3p+1<0$ or $p>0.5$

• In majority voting, we give equal weightage to all classifiers, but some classifiers might be stronger than the other. Based on how accurate each classifier is on the training data, we can assign weights to individual classifiers and then take their weighted voting on test data. For example, if the classifier predictions are [0,1,0] and their respective weights are [0.2, 0.4, 0.4], then weighted prediction would be 0.2*0+0.4*1+0.4*0=0.4. If the final score is less than 0.5 then we consider the output to be 0 else 1. In this case it is 0. Averaging smoothes out overfitting in base classifiers

• In averaging, instead of working with predicted classes, we work with predicted probabilities. We simply take the average of the predicted class probabilities and then let AUC metric decide the final performance. The averaging can be a weighted averaging.
• Some algorithms might produce probabilities on the lower side even for correct predictions whereas some algorithms like NB produce probabilities on the higher side even for incorrect predictions although the relative ordering of the scores among multiple test instances might be the same for different algorithms, i.e. for three test instances if one classifier predicts the probabilities [0.91, 0.99, 0.95], then the second classifier might predict the probabilities [0.05, 0.15, 0.12], but the ordering of the probabilities are same. Simple averaging cannot be used in this scenario. Instead of using the raw probabilities, the test instances are first assigned their ranks as probabilities, i.e. in the above example the scores for the three instances would be [0,2,1] for both classifiers. These are then normalized to [0, 0.67 0.33] and then averaging is used. Stacked Meta-Learner

• In the meta-learner case, we call these methods stacking or blending. In stacking, if the classifiers predict the classes [0,1,0] and the actual output is 1, then a logistic regression classifier is trained with the input features [0, 1, 0] and the output 1. During training, we divide the training set into 2 parts, the base classifiers are trained on the 1st part and tested on the second part. Based on the predictions from the 2nd part, the meta-learner is trained. To add more data points to the meta-learner, the same process is repeated but the roles of the 1st part and the 2nd part of the data are reversed. During testing, first the base classifiers are used to generate the initial predictions, then with these predictions logistic regression classifier predicts the final output. This approach is similar to weighted voting, only that the weights are learnt by optimizing the base classifiers output. Instead of using only the predictions of the base classifiers as the features, one can also use the original features used to train the base classifiers in addition to the predictions.

Resources on combining classifiers

In a later post we will look at different meta-learner approaches and performances through example codes.

Categories: MACHINE LEARNING