
Optimization Methods for Deep Learning

In this post I am going to give a brief overview of a few of the common optimization techniques used in training a neural network, from simple classification problems to deep learning. As we know, the critical part of a classification algorithm is optimizing the loss (objective) function in order to learn the correct parameters of the model. The type of the objective function (convex, non-convex, constrained, unconstrained etc.) along with the method used to optimize it decides how accurately and efficiently we compute the parameters. In the context of neural networks, the loss function could be a squared error loss or a logistic loss (cross entropy).

Following is the cross entropy loss function for a classification problem with m output nodes ('m' class problem):

L(y, \widehat{y};w)=-\sum_Q\sum_{j=1}^m{y_j}*log(\widehat{y}_j), the outer sum is over all training instances Q.

where y_j are the actual outputs (0 or 1; out of the 'm' output nodes only one node will have an actual output of 1, the rest will have 0), \widehat{y}_j are the corresponding predicted probabilities from the 'm' output nodes, and w_{ij} are the weights of the neural network, connecting the i-th node in the previous layer to the j-th node in the next layer. We need to minimize the above loss function in order to compute the optimal weights w^*_{ij}.

Deep Neural Network

The predicted probabilities are the probabilities for predicting each of the 'm' class labels, given by the softmax (in multi-class problems) i.e.

\widehat{y}_j=\frac{e^{h_j}}{{\sum_{k=1}^m}e^{h_k}}, where h_j={\sum_i}w_{ij}s_i

where s_i are the inputs to the output layer.
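As a quick illustration, here is a minimal NumPy sketch of the softmax and cross-entropy loss defined above for a single training instance (the function and variable names are my own, and the small constant added inside the log is just for numerical safety):

```python
import numpy as np

def softmax(h):
    # h: pre-activations h_j for the 'm' output nodes
    h = h - np.max(h)                  # shift for numerical stability
    e = np.exp(h)
    return e / np.sum(e)

def cross_entropy(y, y_hat):
    # y: one-hot vector of actual outputs, y_hat: predicted probabilities
    return -np.sum(y * np.log(y_hat + 1e-12))

h = np.array([2.0, 1.0, 0.1])          # example pre-activations for m = 3 classes
y = np.array([1.0, 0.0, 0.0])          # the actual class is the first one
loss = cross_entropy(y, softmax(h))
```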

In order to solve the optimization problem, the most common class of techniques deployed is the line search method. In a line search method, in order to minimize a function f(x), we iteratively approach the actual solution x^*: starting with some random solution x_0, we find a direction p_0 and a step size \alpha_0 to take along the direction p_0, that will take us closer to the actual solution, i.e.

x_1=x_0+{{\alpha}_0}p_0,

The step size \alpha_0 is chosen in such a way that f(x_0+{\alpha}_0p_0) is minimized.

We continue to repeat the above procedure, choosing a new direction p_k to search along in each iteration and then computing the step size \alpha_k for the chosen direction. The step size \alpha_k can be fixed for all iterations or can be computed dynamically in each iteration.

x_{k+1}=x_k+{\alpha}_kp_k

Repeat the above procedure until convergence is reached, i.e.

|f(x_k)-f(x_{k-1})|\;{\le}\;\epsilon for some error bound {\epsilon}\;{\ge}\;0.
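Putting the pieces together, a minimal sketch of this generic line search loop might look like the following (f and grad_f are assumed callables; the steepest-descent direction and the fixed step size are placeholder choices that the algorithms below replace):

```python
import numpy as np

def line_search_minimize(f, grad_f, x0, alpha=0.01, eps=1e-6, max_iters=10000):
    x, f_prev = x0, f(x0)
    for _ in range(max_iters):
        p = -grad_f(x)                  # choose a search direction (here: steepest descent)
        x = x + alpha * p               # take a step of size alpha along that direction
        f_new = f(x)
        if abs(f_new - f_prev) <= eps:  # stop when |f(x_k) - f(x_{k-1})| <= eps
            break
        f_prev = f_new
    return x
```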

The algorithms we are going to discuss in this post are all variants of the line search algorithm; the variants differ only in how the search direction and the step size are chosen. The algorithms we are going to discuss are Gradient Descent and its variants (SGD, Mini-Batch SGD, Momentum, AdaGrad, Adam, etc.), the Conjugate Gradient algorithm and the L-BFGS optimization algorithm.

Gradient Descent

Given a function f(x), its gradient i.e. f'(x) gives the direction of greatest increase of the function, thus, the negative of the gradient of the function gives the direction in which the function f(x) decreases the most. The gradient descent algorithm uses this information to update the weight parameters of the neural network. The step size or the learning rate in gradient descent is usually kept at a constant value \eta.

w^{(t)}=w^{(t-1)}-{\eta}L'(w^{(t-1)}), where w is the vector of weights

L'(w^{(t-1)})=\frac{{\partial}L}{{\partial}w} is the vector of gradients of the loss w.r.t. each weight

where w^{(t)} is the updated value of the weight vector in the current iteration and w^{(t-1)} is its value from the previous iteration. The direction of the greatest decrease in the loss function L w.r.t. the weights w is given by -L'(w^{(t-1)}).

Note that if the shape of the loss L near the current solution w^{(t-1)} is a convex parabola, as shown by the point E in the diagram below, then as the weight increases (moving right along the weights axis), the loss decreases until we hit the minimum. Thus at the current solution w^{(t-1)}, the gradient of the loss L'(w^{(t-1)}) is negative, and hence the weight increases in the next iteration, w^{(t)}\;>\;w^{(t-1)}, in order to lower the loss.

Gradient of Loss

For the weights in the last layer, the gradient works out to be (as derived in these notes) :

L'(w^{(t-1)}_{ij})=\sum_Q(\widehat{y}_j-y_j)*s_i

where w^{(t-1)}_{ij} is the weight connecting the i-th node in last hidden layer to j-th node in output layer and s_i is the output from the i-th node in hidden layer.

Thus the update equation for the weights w becomes :

w^{(t)}_{ij}=w^{(t-1)}_{ij}-{\eta}*\sum_Q(\widehat{y}_j-y_j)*s_i

where \sum_Q denotes the summation over all training instances.
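In NumPy form, this batch update for the output-layer weights could be sketched as below (the matrix names S, Y, Y_hat and their shapes are my own assumptions, with one row per training instance):

```python
import numpy as np

def batch_gd_update(W, S, Y, Y_hat, eta=0.1):
    # W: (n_hidden, m) weights from the last hidden layer to the output layer
    # S: (n_samples, n_hidden) hidden-layer outputs s_i for every training instance
    # Y, Y_hat: (n_samples, m) actual one-hot outputs and predicted probabilities
    grad = S.T.dot(Y_hat - Y)          # sums (y_hat_j - y_j) * s_i over all instances in Q
    return W - eta * grad
```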

The weights in the earlier layers are updated using the backpropagation algorithm (which we will not go into in detail in this post). Note that in order to update each weight, we first need to compute the gradient over all the examples in Q; only then does the weight get updated. This can be slow.

In the Stochastic Gradient Descent (SGD) variant, instead of computing the gradients over all the examples, we randomly sample a single example and do the gradient update using it alone.

w^{(t)}_{ij}=w^{(t-1)}_{ij}-{\eta}*(\widehat{y}_j-y_j)*s_i

I have earlier written a post dedicated solely to the Stochastic Gradient Descent algorithm; SGD is probably the most commonly used optimization technique due to its simplicity and computational speed. SGD can easily be run in an online manner based only on single examples and is thus really helpful for real-time machine learning.

With only a single example, there can be a lot of fluctuation in the weight updates; some weights might converge to local minima and some may not converge at all. Batch gradient descent is very good for convex functions or functions with very few local minima, whereas for a function with lots of local minima, searching in a zig-zag manner as in SGD might actually help us reach the global minimum faster.

The first contour plot below represents the learning with normal gradient descent whereas the one below that, represents the learning with stochastic gradient descent.

Contour plot for Batch Gradient Descent

Contour plot for SGD

In yet another variant, instead of considering only a single example as in SGD, we consider a mini-batch of examples randomly sampled from the instances :

w^{(t)}_{ij}=w^{(t-1)}_{ij}-{\eta}*\sum_{q\in B}[(\widehat{y}_j-y_j)*s_i]

where \sum_{q\in B} indicates that the summation is over a mini-batch B of examples randomly sampled from the set Q. With mini-batch stochastic gradient descent, the convergence is much more stable and faster. Moreover, mini-batch SGD can utilize efficient matrix and vector libraries for faster computation compared to plain SGD. There is a nice comparison of mini-batch SGD vs. SGD.
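A sketch of the mini-batch sampling and update is shown below (the names S, Y, predict_fn and batch_size are assumptions; setting batch_size=1 recovers plain SGD):

```python
import numpy as np

def minibatch_sgd_update(W, S, Y, predict_fn, eta=0.1, batch_size=32):
    # Sample a random mini-batch B of row indices from the full training set Q
    idx = np.random.choice(S.shape[0], size=batch_size, replace=False)
    S_b, Y_b = S[idx], Y[idx]
    Y_hat_b = predict_fn(W, S_b)       # forward pass (softmax) on the mini-batch only
    grad = S_b.T.dot(Y_hat_b - Y_b)    # gradient summed over the mini-batch
    return W - eta * grad
```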

A major drawback of the above SGD approaches is that the learning rate (or step size) \eta is fixed. With too high a learning rate the algorithm may not converge at all, whereas with too small a value it will take a long time to converge, if it converges at all.

Too high learning rate

SGD with Momentum

We have explained the momentum technique for speeding up SGD in a separate post. To recap, the idea of utilizing momentum to speed up SGD is similar to a ball rolling down a hill: as the ball rolls down, it gains momentum and keeps moving in the same direction at a greater speed.

v_t={\alpha}v_{t-1}-{\eta}L'(w^{(t-1)})

w^{(t)}=w^{(t-1)}+v_t

Momentum increases convergence rate.

Compared to the SGD update, we add another term, the update from the previous iteration, scaled by the momentum parameter 0\;<\;\alpha\;<\;1. Usually the momentum parameter \alpha is kept somewhere around 0.9.

If the weight increased in the previous iteration, i.e. v_{t-1}\;>\;0, which means we are going 'downhill' (lowering the loss), and the gradient in the current iteration is negative, i.e. \frac{{\partial}L}{{\partial}w}\;<\;0, then the weight increases by more than it would without momentum (the effective learning rate has increased), i.e. we are moving 'downhill' at a greater speed.
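In code, momentum only adds a velocity term v that is carried over between iterations; a minimal sketch (with assumed names) of one update step:

```python
def momentum_update(w, v, grad, eta=0.1, alpha=0.9):
    # grad: L'(w^{(t-1)}), v: velocity from the previous iteration
    v_new = alpha * v - eta * grad     # v_t = alpha * v_{t-1} - eta * L'(w^{(t-1)})
    w_new = w + v_new                  # w^{(t)} = w^{(t-1)} + v_t
    return w_new, v_new
```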

The momentum based approach can be improved further using the Nesterov Accelerated Gradient (NAG) Descent technique. This technique is similar to the momentum but differs in how the update step is computed :

v_t={\alpha}v_{t-1}-{\eta}L'(w^{(t-1)}+{\alpha}v_{t-1})

w^{(t)}=w^{(t-1)}+v_t

Nesterov Update

Intuitively, what NAG does is similar to momentum (blue arrows), in that it utilizes the previous update, but it makes the current update slower or faster by also looking at the gradient at the point we would reach from the current point if we took the same step as in the previous iteration. If the gradient at this possible new point is still negative, then we take a larger stride in the same direction (implying we are good to move in that direction), depending on "how far we are good to go" based on the magnitude of the gradient L'(w^{(t-1)}+{\alpha}v_{t-1}). If it is positive, we either slow down the learning rate or take a step back from the current position and "re-consider" our steps (implying possible "danger").

The brown arrow represents the initial step taken by NAG, the red arrow represents the correction to that step, i.e. L'(w^{(t-1)}+{\alpha}v_{t-1}), and the green arrows represent the "good to go" steps.
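The only change from the momentum sketch above is where the gradient is evaluated: at the "look-ahead" point w + alpha*v instead of at w (grad_fn is an assumed callable returning L'(.)):

```python
def nag_update(w, v, grad_fn, eta=0.1, alpha=0.9):
    lookahead = w + alpha * v                      # where the previous step would take us
    v_new = alpha * v - eta * grad_fn(lookahead)   # gradient evaluated at the look-ahead point
    w_new = w + v_new
    return w_new, v_new
```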

Adaptive Gradient (AdaGrad)

Consider the input layer of the neural network, where there is a weight for each input node of the network. Given a classification problem where the feature matrix is sparse (common in text classification), i.e. not all instances contain all of the input features, some features might be frequent across instances whereas others might be quite infrequent.

During the weight updates with backpropagation, the weights for the frequent features will be updated frequently whereas the weights for the infrequent features will be updated much less often. Due to this, some important but infrequent features may not converge towards their correct weights. To overcome this, the learning rate is dynamically increased for the infrequently updated weights, and decreased for the frequently updated ones.

In the equation for the gradient descent :

w^{(t)}_{ij}=w^{(t-1)}_{ij}-\frac{\eta}{\sqrt{G^{(t-1)}_{ij}+\epsilon}}L'(w^{(t-1)}_{ij})

where G^{(t-1)}_{ij} is the sum of squares of the gradients L'(w^{(u)}_{ij}) for u, from 1 to (t-1), i.e.

G^{(t-1)}_{ij}=\sum_{u=1}^{t-1}(L'(w^{(u)}_{ij}))^2

and \epsilon is a small positive smoothing constant added to avoid division by zero.

The original paper on AdaGrad.

G^{(t-1)}_{ij} will be lower for infrequently updated weights and thus the effective learning rate will be higher for these weights, and vice versa. This somewhat eliminates the drawback of having a constant learning rate. The problem with AdaGrad is that, for frequent but important weights, G^{(t-1)}_{ij} keeps growing, so the steps taken in the current iteration become very small; with each iteration the step size shrinks further, and these weights may not converge at all.
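A minimal AdaGrad sketch, keeping a per-weight accumulator G of squared gradients (the names and the value of eps are assumptions):

```python
import numpy as np

def adagrad_update(w, G, grad, eta=0.01, eps=1e-8):
    G_new = G + grad ** 2                            # accumulate squared gradients per weight
    w_new = w - eta / np.sqrt(G_new + eps) * grad    # per-weight effective learning rate
    return w_new, G_new
```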

To overcome this issue, we use a variant called AdaDelta. Instead of accumulating the sum of squares of all past gradients, AdaDelta computes a decayed root mean square of the previous gradients. Also, instead of having to start with some constant learning rate \eta, AdaDelta replaces it with the root mean square of the previous step sizes. The AdaDelta update looks like :

v_t=-\frac{\text{RMS}[v_{t-1}]}{\text{RMS}[u_{t-1}]}L'(w^{(t-1)}_{ij})

w^{(t)}_{ij}=w^{(t-1)}_{ij}+v_t

where

\text{RMS}[u_{t-1}]=\sqrt{E^{(t-1)}_{ij}+\epsilon},  where E^{(t-1)}_{ij}={\gamma}E^{(t-2)}_{ij}+(1-\gamma)(L'(w^{(t-1)}_{ij}))^2

\text{RMS}[v_{t-1}]=\sqrt{F^{(t-1)}_{ij}+\epsilon},  where F^{(t-1)}_{ij}={\gamma}F^{(t-2)}_{ij}+(1-\gamma)v^2_{t-1}
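Putting the two running averages together, one AdaDelta step could be sketched as follows (E accumulates squared gradients, F accumulates squared updates; the hyper-parameter values are assumptions):

```python
import numpy as np

def adadelta_update(w, E, F, grad, gamma=0.95, eps=1e-6):
    E_new = gamma * E + (1 - gamma) * grad ** 2            # decayed mean of squared gradients
    v = -np.sqrt(F + eps) / np.sqrt(E_new + eps) * grad    # step = -RMS[updates]/RMS[gradients] * grad
    F_new = gamma * F + (1 - gamma) * v ** 2               # decayed mean of squared updates
    return w + v, E_new, F_new
```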

There are a few other variants of the gradient descent approach that solve the problem of SGD with adaptive learning rates (especially Adam). There is a nice overview of all these methods in the following paper : "An overview of gradient descent optimization algorithms"

Conjugate Gradient Algorithm

If, instead of keeping a constant learning rate in the line search gradient descent algorithm, we optimize the learning rate in each iteration to minimize the loss, we can show that each direction taken by the gradient descent algorithm is orthogonal to the previous one.

w^{(t)}=w^{(t-1)}-{\eta_{t-1}}L'(w^{(t-1)})

\frac{{\partial}L}{{\partial}{\eta_{t-1}}}=0, then using the chain rule, implies  \frac{{\partial}L}{{\partial}w^{(t)}}\frac{{\partial}w^{(t)}}{{\partial}{\eta_{t-1}}}=0, or L'(w^{(t)})^TL'(w^{(t-1)})=0

i.e. the gradients for two consecutive updates are orthogonal. This means that the path followed towards the minimum would have lots of zig-zags and would not be optimal, as we would be traversing the same directions multiple times.

Gradient Descent (Left) vs. Conjugate Gradient (Right)

In the conjugate gradient algorithm, each direction taken in the line search is A-conjugate to all the previous directions, which prevents searching along any one direction multiple times. Thus it converges a lot faster compared to the gradient descent algorithm. Two vectors u and v are said to be A-conjugate w.r.t. each other if, for an SPD (Symmetric Positive Definite) matrix A, we have

{u^T}Av=0

In conjugate gradient, the neural network weight updates are given as :

w^{(t)}=w^{(t-1)}-{\alpha_{t-1}}p_{t-1}

where the line search directions are given by p_t and the step sizes are given by \alpha_t (compare this with the line search direction and step size in gradient descent). Additionally, each direction p_t is conjugate to p_{t-1} w.r.t. a symmetric positive definite matrix.

Wikipedia article about Non-linear conjugate gradient.

We start with some initial guess for the weights w^{(0)}. Then we compute the gradient r_0=L'(w^{(0)}). The initial direction is set to the gradient itself, i.e. p_0=r_0.

The optimum value of the step size is computed by approximately minimizing the loss  L(w^{(0)}-{\alpha_0}p_0) w.r.t. \alpha_0, i.e.

\frac{{\partial}L(w^{(0)}-{\alpha_0}p_0)}{{\partial}{\alpha_0}}=0

The updated weight is then computed as : w^{(1)}=w^{(0)}-{\alpha_0}p_0

The direction for the next update is chosen by setting :

p_t=r_t+{\beta_t}p_{t-1}, where r_t=L'(w^{(t)}).

Intuitively this means that, instead of choosing the new direction p_t to be the gradient r_t as in gradient descent, in conjugate gradient we choose a direction somewhere between the gradient and the previous search direction.

The value of {\beta_t} is computed using the below formula :

\beta_t=\frac{r^T_t(r_t-r_{t-1})}{r^T_{t-1}r_{t-1}}
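This is the Polak-Ribière formula. A rough sketch of the whole nonlinear conjugate gradient iteration is given below (grad_fn is an assumed callable, and the fixed alpha is a simplification; in practice \alpha_t is found by a line search that approximately minimizes the loss along p_t):

```python
import numpy as np

def nonlinear_cg(grad_fn, w0, alpha=0.01, max_iters=100):
    w = np.asarray(w0, dtype=float)
    r = grad_fn(w)                                 # r_0 = L'(w^{(0)})
    p = r.copy()                                   # initial direction is the gradient itself
    for _ in range(max_iters):
        w = w - alpha * p                          # ideally alpha minimizes L(w - alpha * p)
        r_new = grad_fn(w)
        beta = r_new.dot(r_new - r) / r.dot(r)     # Polak-Ribiere beta_t
        p = r_new + beta * p                       # mix gradient with previous direction
        r = r_new
    return w
```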

Limited Memory BFGS

Both gradient descent and conjugate gradient are first order methods, i.e. they compute the new search directions based upon the first order gradient of the loss function w.r.t. the weights. The Newton's method class of optimization algorithms computes the new search directions using the second order derivatives (the Hessian) of the loss function. For example, if a function f is twice differentiable, then using the Taylor series expansion :

For multi-dimensional input vector X,

f(X+{\delta}X)\approx f(X)+({\delta}X)^Tf'(X)+\frac{1}{2}({\delta}X)^Tf''(X)({\delta}X)

i.e. we are approximating the objective function around a point X, using just the first order and second order derivatives of the function f.

f'(X) is the gradient vector g=[\frac{{\partial}f}{{\partial}x_1},\frac{{\partial}f}{{\partial}x_2},...,\frac{{\partial}f}{{\partial}x_n}]

f''(X) is the Hessian matrix H, where

H_{ij}=\frac{{\partial}^2f}{{\partial}x_i{\partial}x_j}

In order to minimize the objective function we need to compute the derivative of the objective w.r.t. the change {\delta}X and set it to 0, i.e.

\frac{{\partial}f}{{\partial}({\delta}X)}=0 or f'(X)+f''(X)({\delta}X)=0

Solving for the optimal step needed to minimize the objective f, we get :

{\delta}X=-(f''(X))^{-1}f'(X)=-H^{-1}g

In the case of a neural network, the weight update translates into :

w^{(t)}=w^{(t-1)}-{\alpha_{t-1}}H^{-1}_{t-1}g_{t-1}

where {\alpha_{t-1}} is the step size for line search as earlier and H^{-1}_{t-1}g_{t-1} is the new direction for the line search. H^{-1}_{t-1} is the inverse of the hessian matrix from the previous iteration and g_{t-1} is the gradient vector from the previous iteration. The step sizes {\alpha_{t-1}} can be computed by backtracking line search or using the Wolfe Conditions.
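As a sketch, one such step can be computed by solving the linear system H * delta = -g rather than explicitly inverting the Hessian (grad_fn, hessian_fn and the fixed alpha are assumptions):

```python
import numpy as np

def newton_step(w, grad_fn, hessian_fn, alpha=1.0):
    g = grad_fn(w)                       # gradient vector g_{t-1}
    H = hessian_fn(w)                    # Hessian matrix H_{t-1}
    delta = np.linalg.solve(H, -g)       # solve H * delta = -g instead of forming H^{-1}
    return w + alpha * delta             # w^{(t)} = w^{(t-1)} - alpha * H^{-1} g
```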

Computing the Hessian matrix in each iteration can be a very expensive operation, given that there could potentially be billions of weights to update in a very deep neural network architecture and the Hessian matrix would be of size O(N^2). In the quasi-Newton approach, instead of computing the exact Hessian in each iteration, we approximate it from the Hessian of the previous iteration.

Without getting into the mathematical details, to compute the inverse hessian H^{-1}_t from H^{-1}_{t-1}, we need to solve the below constrained optimization problem :

\text{min}_{H^{-1}}||H^{-1}-H^{-1}_{t-1}||, subject to H^{-1}y_{t-1}=s_{t-1} and H^{-1} is symmetric.

where y_t=g_t-g_{t-1} and s_t=w^{(t)}-w^{(t-1)}

Solving for the optimum value of H^{-1}, we obtain the solution for H^{-1}_t,

H^{-1}_t=(I-{\rho}_{t-1}s_{t-1}y^T_{t-1})H^{-1}_{t-1}(I-{\rho}_{t-1}y_{t-1}s^T_{t-1})+{\rho}_{t-1}s_{t-1}s^T_{t-1}

where {\rho}_t=({y^T_t}s_t)^{-1}

The above method is known as the BFGS algorithm. We can choose the initial inverse Hessian H^{-1}_0 to be the identity matrix I. Given the initial inverse Hessian H^{-1}_0, we can easily compute the inverse Hessian at the t-th time step using the above recursive relation, but we need to store all the values of y_t and s_t from time steps 1 to t in order to reconstruct H^{-1}_t from H^{-1}_0.
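The recursive update above translates almost directly into code; a sketch of one such update (s and y are the weight and gradient differences defined earlier):

```python
import numpy as np

def bfgs_inverse_hessian_update(H_inv, s, y):
    # H_inv: previous approximation H^{-1}_{t-1}; s = w^{(t)} - w^{(t-1)}; y = g_t - g_{t-1}
    rho = 1.0 / y.dot(s)
    I = np.eye(len(s))
    A = I - rho * np.outer(s, y)         # (I - rho * s * y^T)
    B = I - rho * np.outer(y, s)         # (I - rho * y * s^T)
    return A.dot(H_inv).dot(B) + rho * np.outer(s, s)
```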

Storing the vectors y_t and s_t from time step 1 to t can take up a considerable amount of memory. To overcome this we store only the last 'm' updates of y_t and s_t, i.e. those for time steps 't-m+1' to 't'. This modification is known as L-BFGS or Limited Memory BFGS.

A very nice comparison between SGD, CG and L-BFGS can be found in the following paper, which discusses which method to use under what conditions.

Categories: MACHINE LEARNING
