Machine Learning, AI and Programming

Common and not so common Machine Learning Questions and Answers (Part I)

  • What is the role of the activation function in Neural Networks?

The role of the activation function in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weighted inputs. Put the other way around, a logistic regression classifier is essentially a neural network without hidden layers. The non-linearity in a neural network is added by the hidden layers, using a sigmoid or a similar activation function.

Sigmoid Activation Curve

The role of the hidden layers is to apply an activation function to the weighted inputs of the neural network, transforming the inputs from a linear to a non-linear space.

z=\sum_{i=1}^nw_ix_i\;+\;b (linear input)

\text{out}=\frac{1}{1\;+\;e^{-z}} (non-linear output)

where x_i are the inputs and w_i are the weights.

In logistic regression, since there are no hidden layers, the inputs are not transformed; a sigmoid function is still applied to the output layer, but only to give the output a probabilistic interpretation, which does not make it a non-linear model. See this answer by Sebastian Raschka for an intuitive understanding.
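As a minimal numpy sketch (the numbers are illustrative), logistic regression applies the sigmoid directly to the weighted input, with no hidden layer in between:

```python
import numpy as np

def sigmoid(z):
    # squashes any real-valued input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])   # one input example (illustrative values)
w = np.array([0.4, 0.1, -0.3])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b             # linear input: z = sum_i w_i x_i + b
out = sigmoid(z)                 # output in (0, 1), but the boundary z = 0 is still linear
```

The decision boundary out = 0.5 corresponds to z = 0, a hyperplane in the inputs, which is why the sigmoid alone does not make the model non-linear.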

  • Why do we need the bias term in the weights of a Neural Network?

The bias value allows you to shift the activation function to the left or right, which may be critical for successful learning. Without the bias term, the mid point (0.5) of the sigmoid curve occurs only when the weighted input is 0, whatever the values of the weights.

z=\sum_{i=1}^nw_ix_i\;+\;b=0 if all x_i=0 and b=0

\text{out}=\frac{1}{1\;+\;e^{-z}}=0.5 if z=0

Sigmoid Curve with different values of bias term

The weights control only the slope of the sigmoid curve. Different training data might require the curve to pass through the 0.5 mark at non-zero values of the weighted input, and the bias term makes this possible. Check out this answer on Stack Overflow for a better understanding.
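The shift is easy to verify numerically; in this sketch (numpy, illustrative values), a bias of 2 moves the 0.5 crossing of the sigmoid from x = 0 to x = -2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 1.0
x = np.linspace(-6, 6, 1001)

out_no_bias = sigmoid(w * x)          # crosses 0.5 at x = 0
out_bias = sigmoid(w * x + 2.0)       # bias b = 2 shifts the crossing left

# x values where each curve is closest to the 0.5 mid point
crossing_no_bias = x[np.argmin(np.abs(out_no_bias - 0.5))]
crossing_bias = x[np.argmin(np.abs(out_bias - 0.5))]
```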

  • What is the purpose of having hidden layers in a Neural Network?

The hidden layers in a neural network serve multiple purposes. In the first answer, we saw how hidden layers make a neural network a non-linear classifier, unlike a logistic regression classifier. Usually the number of nodes in a hidden layer is much smaller than the number of input layer nodes.

Multiple Hidden Layers Neural Network

In text classification, where the input data is sparse and very high dimensional, the hidden layer transforms the high dimensional feature matrix into a low dimensional dense matrix, thus performing dimensionality reduction. In deep neural networks such as CNNs applied to image data, the hidden layers serve to extract different features from the images, such as object boundaries, shapes and sizes, which further helps with image classification.

CNN with fully connected hidden layer for images

In earlier posts on word vectors, we saw how the hidden layer representations along with the weights can be used to represent word vectors or document vectors as and when required. See this post for an elaborate answer.
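As a sketch of the dimensionality-reduction view (numpy; the sizes and the tanh activation are illustrative choices, not from the post), a sparse bag-of-words vector can be projected to a small dense hidden representation:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 10000, 50          # sparse input -> dense hidden layer

x = np.zeros(n_features)                  # sparse document vector
active = rng.choice(n_features, size=20, replace=False)
x[active] = 1.0                           # only 20 terms present in the document

W1 = rng.normal(0.0, 0.01, size=(n_hidden, n_features))  # input-to-hidden weights
b1 = np.zeros(n_hidden)

hidden = np.tanh(W1 @ x + b1)             # dense 50-dimensional representation
```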

  • How to choose the number of hidden layers and the number of nodes per hidden layer in a Neural Network?

Usually for most classification tasks, one hidden layer suffices. With more than one hidden layer, we need to prune/drop out hidden units or add regularization terms to the objective. Increasing the number of hidden layers or the number of nodes tends to over-fit the neural network, and performance generally degrades once the number of nodes in a hidden layer passes an optimum. A rule of thumb is to use a number of nodes between 1 and the number of nodes in the input layer; another suggestion is to use the mean of the number of input and output units. In practice, though, the number of hidden units is mostly selected using a cross validation approach. Refer to this post for a more elaborate answer.
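The rules of thumb above are easy to write down; in this sketch the layer sizes are illustrative assumptions, and in practice the final count would still be chosen by cross validation:

```python
# illustrative layer sizes (assumptions, not taken from the post)
n_input, n_output = 100, 10

# rule of thumb: between 1 and the number of input-layer nodes
candidates = [n for n in (5, 25, 50, 100) if 1 <= n <= n_input]

# alternative suggestion: the mean of the input and output units
mean_rule = (n_input + n_output) // 2
```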

  • How to do weight initialization in Neural Networks?

Incorrect weight initialization for the node connections could prevent the neural network from ever converging to a solution. Poorly chosen initial weights could be all-zero weights or equal weights for all connections. If all the weights are 0 or have the same value, then the outputs from all the output/hidden nodes are the same, and hence during back-propagation all weights are again updated to the same value; thus the weights never converge. One possible solution is to initialize with small random numbers sampled from either a uniform or a Gaussian distribution with zero mean.

w_i=\text{Gaussian}(0, \sigma)

Note that weights that are too small or too large both slow down convergence: tiny weights shrink the signal (and hence the gradient) passing through the layers, while large weights saturate the sigmoid, again producing small gradient updates.
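A quick numpy check of the symmetry problem described above (shapes are illustrative): with all-equal initial weights every hidden unit produces the same output, while small random weights break the tie:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=4)                    # one input example

# all-equal initialization: every hidden unit gets the same weight row,
# so every unit computes the same output and receives the same update
W_equal = np.full((3, 4), 0.5)
h_equal = sigmoid(W_equal @ x)

# small zero-mean Gaussian initialization breaks the symmetry
W_rand = rng.normal(0.0, 0.1, size=(3, 4))
h_rand = sigmoid(W_rand @ x)
```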

With randomly assigned weights, the variance of the weighted input to a hidden or an output node increases in proportion to the number of input nodes. We can normalize this variance to 1 by scaling each node's weight vector by the square root of the number of input nodes:

w_i=\frac{\text{Gaussian}(0, \sigma)}{\sqrt{N}}

where N is the number of input nodes w.r.t. the current layer

or as suggested in this paper, we can use the following weight initialization scheme.

w_i=\frac{\text{Gaussian}(0, \sigma)}{\sqrt{\frac{N_{in}+N_{out}}{2}}}

where N_{in} and N_{out} are the number of nodes in the layers before and after the current layer's weights (the fan-in and fan-out). This is popularly known as the Xavier Initialization.
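The variance argument behind these scalings can be checked numerically; a sketch in numpy with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000                                  # fan-in of the current layer
samples = 20000

x = rng.normal(size=(samples, N))         # unit-variance inputs

w_naive = rng.normal(0.0, 1.0, size=N)    # unscaled Gaussian weights
w_scaled = w_naive / np.sqrt(N)           # divide by sqrt(fan-in)

var_naive = (x @ w_naive).var()           # grows roughly like N
var_scaled = (x @ w_scaled).var()         # stays near 1
```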

One other popular technique, known as Batch Normalization, is commonly used to reduce the network's sensitivity to weight initialization. This method explicitly forces the activations throughout the network to take on a unit Gaussian distribution at the beginning of training. Since the normalization is a differentiable function of the inputs, the normalization step becomes part of the network itself, and its parameters are updated using mini-batch SGD just like the weights. Please refer to this nice tutorial.
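A minimal training-time batch-normalization forward pass (a sketch under the usual formulation; gamma and beta are the learnable scale and shift, eps a numerical-stability constant, and the shapes are illustrative):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    # normalize each feature across the mini-batch to zero mean, unit variance
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    # gamma and beta are learned with SGD, like the weights, and can undo
    # the normalization if that turns out to help
    return gamma * z_hat + beta

rng = np.random.default_rng(0)
z = rng.normal(3.0, 5.0, size=(64, 10))   # pre-activations for a batch of 64
out = batchnorm_forward(z, gamma=np.ones(10), beta=np.zeros(10))
```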

  • What are some methods of weight regularization in Neural Networks?

The most commonly used regularization techniques in machine learning are L1 and L2 regularization, which prevent overfitting. To stop the unknown parameters (weights) from modeling the outliers or noise, and from taking on large values as a result, we add a penalty term on the weights (the L1 norm or the L2 norm, respectively) to the loss function, thus discouraging large weight values in the presence of outliers.

L(y, \widehat{y})+\lambda\sum_{i=1}^n|\Theta_i| (L1 regularization)

L(y, \widehat{y})+\lambda\sum_{i=1}^n\Theta_i^2 (L2 regularization)

where L(y, \widehat{y}) is the loss function for the neural network (y-actual outputs, \widehat{y}-predicted outputs, \Theta_i-weights).
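The two penalties can be written as one small helper; a sketch in numpy (the function name and values are illustrative):

```python
import numpy as np

def regularized_loss(base_loss, weights, lam, kind="l2"):
    # add a penalty on the weight magnitudes to the data loss
    if kind == "l1":
        penalty = lam * np.sum(np.abs(weights))   # L1: lambda * sum |theta_i|
    else:
        penalty = lam * np.sum(weights ** 2)      # L2: lambda * sum theta_i^2
    return base_loss + penalty

theta = np.array([0.5, -2.0, 1.0])
l1 = regularized_loss(1.0, theta, lam=0.1, kind="l1")   # 1.0 + 0.1 * 3.5
l2 = regularized_loss(1.0, theta, lam=0.1, kind="l2")   # 1.0 + 0.1 * 5.25
```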

Similar to pruning in decision trees, where certain branches of the tree are pruned off in order to prevent overfitting on the training data, neural networks apply something similar to fully connected layers. This method, known as Dropout, probabilistically samples a sub-network from the entire network in each iteration and applies the weight updates only to that sub-network. But unlike pruning in decision trees, dropout is applied only during training: at test time the full network is used, whereas a pruned decision tree is used as-is on testing data.

Dropout applied to fully connected hidden layer
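A sketch of inverted dropout on a hidden layer's activations (p_keep and the shapes are illustrative): a random sub-network is sampled during each training step, while at test time the full network is used unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5, training=True):
    if not training:
        return h                          # test time: keep the full network
    mask = rng.random(h.shape) < p_keep   # sample a sub-network
    # inverted dropout: rescale so the expected activation is unchanged
    return h * mask / p_keep

h = np.ones(10)
h_train = dropout(h, p_keep=0.5, training=True)   # entries are 0.0 or 2.0
h_test = dropout(h, training=False)               # unchanged
```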

  • What are the properties of the activation functions in Neural Networks? Or in other words, what kind of function makes a good choice for an activation function?

Activation is modeled after the neurons of the human brain, where a neuron fires only when the electric signals coming from other neurons cross a certain threshold. Similarly, the purpose of activation in a hidden node is to fire only when the input to that node is above a certain threshold. Since the weighted input to a hidden node,

z=\sum_i{w_i}{x_i}+b,

is a continuous value depending on which inputs x_i are active at that instant, it is difficult to choose a binary 0 or 1 (trigger or do-not-trigger) state based on a threshold, and thus activation functions are chosen to output real values indicating the strength of the output from the node. Some useful properties are thus:

  1. The activation function should be a non-decreasing function of the input.
  2. The activation function should be a non-linear function of the input. Non-linearity helps generate non-linear decision boundaries in classification problems. If the activation function is linear in the inputs, then however deep the network architecture be, the same can be modeled with a single input and output layer.

    Sigmoid and Tanh activation function

  3. The activation function should be continuously differentiable. This is because we need to be able to compute the derivative of the activation at the gradient descent optimization step.
  4. The sigmoid activation function saturates at either tail, near 0 or 1, so the gradient near these points is very small. During backpropagation the gradients are multiplied from one layer to the next, so the gradient is almost 0 by the time the update reaches the input layer, and consequently those weights receive no further updates. Thus a large input to any node during the initial phase of learning is dangerous, and initializing the weights to very large values creates the same problem. To overcome this vanishing gradient effect, ReLU activation functions, which do not saturate for positive inputs, are used instead of sigmoid.

    ReLU activation function

  5. The sigmoid function is not zero-centered, since \sigma(x) > 0 for all x. Given the loss function L(y, \widehat{y}), the gradient w.r.t. w_i is computed as:

\frac{\partial{L(y, \widehat{y})}}{\partial{w_i}}=\frac{\partial{L(y, \widehat{y})}}{\partial{\widehat{y}}}\cdot\frac{\partial{\widehat{y}}}{\partial{z}}\cdot\frac{\partial{z}}{\partial{w_i}}

Given that \widehat{y}=\sigma(z)=\frac{1}{1\;+\;e^{-z}} where z=\sum_i{w_i}{x_i}+b,


\frac{\partial{z}}{\partial{w_i}}=x_i, and

\frac{\partial{\widehat{y}}}{\partial{z}} = \widehat{y}*(1-\widehat{y}) > 0

if x_i > 0, then

\frac{\partial{L(y, \widehat{y})}}{\partial{w_i}} > 0 if \frac{\partial{L(y, \widehat{y})}}{\partial{\widehat{y}}} > 0 else \frac{\partial{L(y, \widehat{y})}}{\partial{w_i}} < 0

Since \frac{\partial{L(y, \widehat{y})}}{\partial{\widehat{y}}} is the same for all weights in the network (it depends only on the predicted and actual outputs), the gradient is either all positive or all negative for every weight whenever all the inputs are positive. This can lead to zig-zagging of the weight updates from one iteration to the next, which could be avoided if \widehat{y} could take negative as well as positive values. Find more detailed activation function related discussions here.
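Both issues in points 4 and 5 can be verified numerically; a sketch in numpy (the inputs and the upstream gradient value are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# point 4: the sigmoid gradient y*(1 - y) vanishes at the tails
grad_mid = sigmoid(0.0) * (1 - sigmoid(0.0))      # 0.25, the maximum
grad_tail = sigmoid(10.0) * (1 - sigmoid(10.0))   # nearly zero: saturation

# point 5: with all-positive inputs x_i, every dL/dw_i shares the sign
# of dL/dy, since dL/dw_i = dL/dy * y*(1 - y) * x_i
x = np.array([0.2, 1.5, 0.7])     # all-positive inputs
y = sigmoid(0.3)                  # predicted output for some z
dL_dy = -1.8                      # illustrative upstream gradient
dL_dw = dL_dy * y * (1 - y) * x   # every component is negative here
```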

