Common and not so common Machine Learning Questions and Answers (Part I)

  • What is the role of the activation function in Neural Networks?

The role of the activation function in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weighted inputs. A neural network without any hidden layers is essentially a logistic regression classifier. The non-linearity is added by the hidden layers, using a sigmoid or a similar activation function.

Sigmoid Activation Curve

The role of the hidden layers is to apply an activation function to the weighted inputs of the neural network, transforming the inputs from a linear space into a non-linear one.

z=\sum_{i=1}^n w_i x_i + b (linear input)

\text{out}=\frac{1}{1+e^{-z}} (non-linear output)

where x_i are the inputs and w_i are the weights.

In logistic regression there are no hidden layers, so the inputs are not transformed; a sigmoid is still applied at the output layer, but only to give the output a probabilistic interpretation, which does not make it a non-linear model. See this answer by Sebastian Raschka for an intuitive understanding.
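
As a minimal sketch of the above (the weights and inputs below are arbitrary toy values, not from any trained model), the following numpy snippet contrasts a network with no hidden layer, which is just logistic regression, with a one-hidden-layer network whose sigmoid units make the output a non-linear function of the input:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])                 # toy input vector

    # no hidden layer: a weighted sum followed by a sigmoid -- logistic regression
    w_out, b_out = np.array([0.4, 0.1, -0.3]), 0.2
    print(sigmoid(np.dot(w_out, x) + b_out))

    # one hidden layer of 4 sigmoid units: the output is now a non-linear function of x
    W_hidden, b_hidden = np.random.randn(4, 3), np.zeros(4)
    w_out2, b_out2 = np.random.randn(4), 0.0
    h = sigmoid(np.dot(W_hidden, x) + b_hidden)    # non-linear transformation of the inputs
    print(sigmoid(np.dot(w_out2, h) + b_out2))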

  • Why do we need the bias term in the weights of a Neural Network?

The bias value allows you to shift the activation function to the left or right, which may be critical for successful learning. Without the bias term, the mid point (0.5) of the sigmoid curve occurs only when the weighted input is 0, whatever the values of the weights are.

z=\sum_{i=1}^n w_i x_i + b = 0 if all x_i = 0 and b = 0

\text{out}=\frac{1}{1+e^{-z}}=0.5 if z = 0

Sigmoid Curve with different values of bias term

The weights control only the slope of the sigmoid curve. Different training data might require the curve to pass through the 0.5 mark at non-zero values of the weighted input, and the bias term is what makes that possible. Check out this answer on Stack Overflow for a better understanding.
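
A quick numpy sketch of this shifting effect (the weight and bias values are arbitrary, chosen only for illustration): without a bias the sigmoid crosses 0.5 at a weighted input of 0, while a non-zero bias moves that crossing point.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.linspace(-6, 6, 13)
    w = 1.0

    # without a bias, the output equals 0.5 exactly at x = 0
    print(sigmoid(w * x))

    # with a bias of b = 2, the curve shifts left: the 0.5 point now occurs at x = -2
    b = 2.0
    print(sigmoid(w * x + b))

    # a larger weight only changes the steepness around the crossing point
    print(sigmoid(3.0 * x))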

  • What is the purpose of having hidden layers in a Neural Network?

The hidden layers in a neural network serve multiple purposes. In the first answer, we saw how hidden layers make a neural network a non-linear classifier, unlike a logistic regression classifier. Usually the number of nodes in a hidden layer is smaller than the number of input layer nodes, but depending on the complexity of the problem and the amount of training data available, the number of hidden units can also be greater than the number of input units.

Multiple Hidden Layers Neural Network

In text classification, where the input data is sparse and very high dimensional, the hidden layer transforms the high dimensional feature matrix into a low dimensional dense matrix, thus enabling dimensionality reduction. In deep neural networks such as CNNs applied to image data, the hidden layers serve to extract different features from images such as object boundaries, shapes, sizes etc., which further helps with image classification.

Thus hidden layers serve to extract important features, and this can even be done without class labels, in a purely unsupervised manner. Refer to Autoencoders and Restricted Boltzmann Machines.

CNN with fully connected hidden layer for images

In earlier posts on word vectors, we saw how the hidden layer representations, along with the weights, can be used to represent word vectors or document vectors as and when required. See this post for a more elaborate answer.
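
As a rough sketch of using a hidden layer for unsupervised dimensionality reduction, here is a minimal autoencoder in Keras; the layer sizes, activations and optimizer below are arbitrary illustrative choices, not a prescription:

    from tensorflow.keras.layers import Input, Dense
    from tensorflow.keras.models import Model

    input_dim, hidden_dim = 10000, 128   # e.g. sparse bag-of-words features -> dense codes

    inputs = Input(shape=(input_dim,))
    encoded = Dense(hidden_dim, activation='sigmoid')(inputs)   # hidden layer = low dimensional code
    decoded = Dense(input_dim, activation='sigmoid')(encoded)   # reconstruction of the input

    autoencoder = Model(inputs, decoded)
    encoder = Model(inputs, encoded)                            # used to extract the dense features
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

    # autoencoder.fit(X, X, epochs=10, batch_size=64)   # X: the high dimensional feature matrix
    # dense_features = encoder.predict(X)               # no class labels needed -- purely unsupervised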

  • How to choose the number of hidden layers and the number of nodes in hidden layers in a Neural Network?

The number of hidden units or hidden layers depends on the complexity of the problem. Usually for most text classification tasks one hidden layer is enough, whereas for image or speech recognition we need more than one hidden layer.

A higher number of hidden layers or hidden units can cause over-fitting. To overcome over-fitting with many hidden layers or hidden units, a dropout scheme is usually used. In dropout, hidden units are randomly dropped during each training cycle to prevent co-adaptation.
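
A minimal Keras sketch of dropout applied to the hidden layers (the layer sizes and the dropout rate of 0.5 are arbitrary illustrative choices):

    from tensorflow.keras.layers import Dense, Dropout
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dropout(0.5),                      # randomly drop 50% of these hidden units each training step
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])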

A rule of thumb is to use a number of nodes between 1 and the number of nodes in the input layer. Other suggestions include using the mean of the number of input units and output units. But mostly the number of hidden units or nodes is selected using a cross validation approach. Refer to this post for a more elaborate answer.
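
As a sketch of the cross validation approach using scikit-learn (the dataset and the candidate hidden layer sizes below are arbitrary examples):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)

    # candidate architectures: single hidden layers of varying width, plus one two-layer option
    param_grid = {'hidden_layer_sizes': [(16,), (32,), (64,), (64, 32)]}

    search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                          param_grid, cv=5, scoring='accuracy')
    search.fit(X, y)

    print(search.best_params_)      # architecture with the best cross-validated accuracy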

  • How to do weight initialization in Neural Networks?

Incorrect weight initialization for the node connections could make the neural network never converge to a solution. One poor choice is to initialize all the weights to 0. If all the weights are 0, then all hidden nodes produce the same output, and during back-propagation they receive identical gradient updates, so the hidden units never learn different features (the symmetry is never broken). One possible solution is to initialize with small random numbers sampled from either a uniform or a Gaussian distribution with zero mean.

w_i=\text{Gaussian}(0, \sigma)

Note that with a sigmoid activation function, too small or too large weights lead to small gradient updates, which slows down convergence.
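
A tiny numpy check of the saturation effect: the sigmoid derivative \sigma(z)(1 - \sigma(z)) peaks at z = 0 and shrinks rapidly as |z| grows, so large weighted inputs produce near-zero gradient updates.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [0.0, 2.0, 5.0, 10.0]:
        s = sigmoid(z)
        print(z, s * (1 - s))   # 0.25 at z = 0, then roughly 0.105, 0.0066, 0.000045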

With randomly assigned weights, the variance of the weighted input to a hidden or an output node increases in proportion to the number of input nodes. We can normalize the variance to 1 by scaling the weight vector by the square root of the number of input nodes:

w_i=\frac{\text{Gaussian}(0, \sigma)}{\sqrt{N}}

where N is the number of input nodes w.r.t. the current layer

or as suggested in this paper, we can use the following weight initialization scheme.

w_i=\frac{\text{Gaussian}(0, \sigma)}{\sqrt{\frac{N_{in}+N_{out}}{2}}}

where N_{in} and N_{out} are the number of incoming and outgoing nodes (fan-in and fan-out) w.r.t. the current layer. This is popularly known as the Xavier Initialization.
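
A small numpy sketch of this initialization scheme (the layer sizes below are arbitrary):

    import numpy as np

    def xavier_init(n_in, n_out, sigma=1.0):
        # Gaussian weights scaled by sqrt((n_in + n_out) / 2), as in the formula above,
        # so that the variance of the weighted inputs stays roughly constant across layers
        return np.random.randn(n_out, n_in) * sigma / np.sqrt((n_in + n_out) / 2.0)

    W1 = xavier_init(784, 256)        # input layer -> first hidden layer
    W2 = xavier_init(256, 10)         # first hidden layer -> output layer
    print(W1.std(), W2.std())         # larger fan-in/fan-out => smaller spread of the weights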

  • What are the properties of the activation functions in Neural Networks? Or in other words, what kind of functions would make a good choice for an activation function?

Activation is modeled after the neurons in the human brain, where a neuron is triggered only when the electric signal coming from other neurons crosses a certain threshold. Similarly, the purpose of activation in the hidden nodes is to fire only when the input to the node is above a certain threshold. Since the weighted input to a hidden node

\sum_i{w_i}{x_i}+b

is a continuous value depending on which inputs x_i are active at that instant, it is difficult to choose a binary 0 or 1 (trigger or do-not-trigger) state based on a threshold, and thus activation functions are chosen to output real values indicating the strength of the output from the node. Some useful properties are thus:

  1. The activation function should be a non-linear function of the input. Non-linearity helps generate non-linear decision boundaries in classification problems. If the activation function is linear in the inputs, then however deep the network architecture is, it can be reduced to a single input and output layer, since a composition of linear functions is itself linear.
    Sigmoid and tanh activation functions

  2. The activation function should be continuously differentiable. This is because we need to be able to compute the derivative of the activation during backpropagation.
  3. The sigmoid activation function saturates at either tail (near 0 or 1), so the gradient near these points is very small. Since during backpropagation the gradients are multiplied from one layer to the next, the gradient will be almost 0 by the time the update reaches the input layer, and consequently there will be no further updates to those weights. Thus a large input value to any node during the initial phase of learning is dangerous, and initializing the weights to very large values creates the same problem. To overcome this vanishing gradient effect, ReLU activation functions are used instead of the sigmoid, since they do not saturate for positive inputs.

    ReLU activation function

  4. The sigmoid function is not zero-centered, since \sigma(x) > 0 for all x. Given the loss function L(y, \widehat{y}), the gradient w.r.t. w_i is computed as:

\frac{\partial{L(y, \widehat{y})}}{\partial{w_i}}=\frac{\partial{L(y, \widehat{y})}}{\partial{\widehat{y}}}\cdot\frac{\partial{\widehat{y}}}{\partial{w_i}}

Given that \widehat{y}=\sigma(z)=\frac{1}{1+e^{-z}}

\frac{\partial{\widehat{y}}}{\partial{w_i}}=\frac{\partial{\widehat{y}}}{\partial{z}}\cdot\frac{\partial{z}}{\partial{w_i}}

where z=\sum_i{w_i}{x_i}+b

Thus,

\frac{\partial{z}}{\partial{w_i}}=x_i, and

\frac{\partial{\widehat{y}}}{\partial{z}} = \widehat{y}(1-\widehat{y}) > 0

if x_i > 0, then

\frac{\partial{L(y, \widehat{y})}}{\partial{w_i}} > 0 if \frac{\partial{L(y, \widehat{y})}}{\partial{\widehat{y}}} > 0, else \frac{\partial{L(y, \widehat{y})}}{\partial{w_i}} < 0

Since \frac{\partial{L(y, \widehat{y})}}{\partial{\widehat{y}}} is the same for all the weights feeding into the node (it depends only on the predicted and actual outputs), the gradients of those weights are either all positive or all negative when all the inputs are positive. This can lead to zig-zagging of the weight updates from one iteration to the next, which could have been avoided if \widehat{y} could take negative as well as positive values (as with tanh). Find more detailed discussions of activation functions here.
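
A tiny numpy illustration of the sign argument above, assuming a squared loss 0.5(\widehat{y} - y)^2 purely for concreteness (the inputs, weights and target below are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    np.random.seed(0)
    x = np.abs(np.random.randn(5))          # all-positive inputs, e.g. sigmoid outputs of a previous layer
    w, b = np.random.randn(5), 0.1          # arbitrary weights and bias
    y = 1.0                                 # target label

    y_hat = sigmoid(np.dot(w, x) + b)
    dL_dyhat = y_hat - y                    # gradient of 0.5*(y_hat - y)**2 w.r.t. y_hat
    grad_w = dL_dyhat * y_hat * (1 - y_hat) * x   # chain rule, as in the derivation above

    print(np.sign(grad_w))                  # every component shares the sign of dL_dyhat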

Categories: MACHINE LEARNING
