# Kernels and Support Vector Machines

Given a supervised classification problem with a set of N training examples $X_i$ and their class labels $y_i$, i.e. $\left\{(X_1, y_1), (X_2, y_2), ... (X_N, y_N)\right\}$, we need to build a model to predict the class label for an unseen example. Algorithms for this task, some of which we have already encountered and some of which we will cover in later posts, include Logistic Regression, Naive Bayes, AdaBoost, Gradient Boosting, KNN, Support Vector Machines, and Neural Networks. In this post we discuss the Support Vector Machine (SVM) methodology.

Given a binary classification problem, $y_i\;\in\;\left\{-1,1\right\}$, intuitively think of the SVM method as a geometric model that constructs a hyperplane in the M-dimensional input space (M being the dimension of each training example $X_i$) separating the examples of one class from the other, assuming that the training examples can be separated by such a hyperplane. In scenarios where the training examples cannot be separated in the original M-dimensional space, they are transformed into a higher dimensional space through a kernel transformation, where they become separable.

*Figure: Separating hyperplane*

Above is a diagrammatic representation of the SVM. For simplicity, we show a 2-D space. The solid dots represent one class of examples whereas the hollow dots represent the other class. The solid line represents the separating hyperplane. The confidence of classification for an example is proportional to its distance from the separating hyperplane, i.e. examples far away from the hyperplane are classified more confidently than examples near it.

The goal is to find the optimal hyperplane given the training examples, such that the margin from the closest training examples on either side to the hyperplane is maximized. In the above figure, the dashed lines on either side of the separating hyperplane represent the margins. The training examples closest to the hyperplane on either side are known as the support vectors. We will see later that only the support vectors matter for prediction, unlike in K-Nearest Neighbors, which requires all the training examples for predicting unseen examples.

The equation of the hyperplane :

$f(X)=w^TX+b=0$

where $w$ is an M-dimensional non-zero vector perpendicular to the hyperplane and $b$ is the bias term. Thus the model for a binary classification problem is defined to be :

$y_i=h(X_i;w,b)=sgn(w^TX_i+b)$

where $sgn(x)$ denotes the sign of $x$, i.e.

$y_i=+1$ if $w^TX_i+b>0$ and $y_i=-1$ if $w^TX_i+b<0$
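A minimal numerical sketch of this decision rule (the 2-D weights and points below are made up purely for illustration):

```python
import numpy as np

# Hypothetical hyperplane parameters in 2-D, chosen for illustration.
w = np.array([1.0, -2.0])
b = 0.5

def predict(x):
    return np.sign(w @ x + b)          # sgn(w^T x + b)

def functional_margin(x, y):
    return y * (w @ x + b)             # positive exactly when the prediction is correct

x1, y1 = np.array([3.0, 0.0]), 1.0     # w @ x1 + b = 3.5  -> predicted +1, correct
x2, y2 = np.array([0.0, 1.0]), 1.0     # w @ x2 + b = -1.5 -> predicted -1, wrong
```

Here `functional_margin(x1, y1)` is 3.5 (a confident, correct classification) while `functional_margin(x2, y2)` is -1.5 (a misclassification).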

Given this definition, the classification confidence, or functional margin, of an example is defined to be :

$y_i(w^TX_i+b)$

This quantity is greater than 0 exactly when the prediction is correct, and its magnitude gives an estimate of the confidence in the prediction. Choose the hyperplane in a way such that the equations of the margins on either side become :

$w^TX+b={\pm}1$ or $|w^TX+b|=1$

Using the formula for the distance from a point to a plane, the distance from the hyperplane of any training example $X_i$ lying on one of the margins (i.e. a support vector) is :

$\frac{|w^TX_i+b|}{||w||}=\frac{1}{||w||}$

The goal is to maximize this margin. Since the margins are given by the equation $w^TX+b={\pm}1$, the classification confidence of every training example is now lower bounded by 1, i.e.

$y_i(w^TX_i+b) \;\ge\; 1$.

Although the objective function $\frac{1}{||w||}$ is non-convex, maximizing $\frac{1}{||w||}$ amounts to minimizing $\frac{1}{2}||w||^2$ ($w$ is a non-zero vector, so both quantities are well defined). Thus we now have a convex optimization problem (Quadratic Programming) :

minimize $\frac{1}{2}||w||^2$

subject to $1-y_i(w^TX_i+b) \;\le\; 0$ for i=1 to N
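The QP above is small enough to hand to a general-purpose solver. Below is a minimal sketch using scipy's SLSQP on a made-up, linearly separable 2-D data set; in practice a dedicated QP or SMO solver would be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical): two points per class in 2-D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as v = [w1, w2, b].
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)          # (1/2) ||w||^2

# One inequality constraint per example: y_i (w^T x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
margin = 1.0 / np.linalg.norm(w)       # distance from the hyperplane to a margin
```

For this symmetric data the optimal hyperplane is $x_1 + x_2 = 0$, with the support vectors $(2, 2)$ and $(-2, -2)$ sitting on the margins at distance $2\sqrt{2}$.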

The above convex optimization problem has a solution whenever the N training examples are linearly separable, and since the problem is convex, any local optimum is global. The Lagrangian of the QP is defined to be :

$L(w, b, \alpha)=\frac{1}{2}||w||^2-\sum_{i=1}^N\alpha_i[y_i(w^TX_i+b)-1]$ for $\alpha_i\;\ge\;0$

To minimize the Lagrangian w.r.t. w and b, we need to take the derivative of the Lagrangian for each variable and set them to zero, i.e. $\frac{{\partial}L}{{\partial}w}=0$ and $\frac{{\partial}L}{{\partial}b}=0$.

Setting $\frac{{\partial}L}{{\partial}w}=0$, we get :

$w-\sum_{i=1}^N\alpha_iy_iX_i=0$

which implies that, $w=\sum_{i=1}^N\alpha_iy_iX_i$.

Similarly setting $\frac{{\partial}L}{{\partial}b}=0$, we get :

$\sum_{i=1}^N\alpha_iy_i=0$

Substituting back the values into the Lagrangian, we get :

$L(\alpha)=\sum_{i=1}^N\alpha_i - \frac{1}{2}\sum_{i,j=1}^Ny_iy_j\alpha_i\alpha_jX_i^TX_j$
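To see where this comes from, expand the Lagrangian term by term using $w=\sum_{i=1}^N\alpha_iy_iX_i$ and $\sum_{i=1}^N\alpha_iy_i=0$ :

```latex
\begin{aligned}
L &= \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i y_i w^T X_i - b \sum_{i=1}^N \alpha_i y_i + \sum_{i=1}^N \alpha_i \\
  &= \frac{1}{2} \sum_{i,j=1}^N y_i y_j \alpha_i \alpha_j X_i^T X_j
     - \sum_{i,j=1}^N y_i y_j \alpha_i \alpha_j X_i^T X_j - 0 + \sum_{i=1}^N \alpha_i \\
  &= \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N y_i y_j \alpha_i \alpha_j X_i^T X_j
\end{aligned}
```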

W.r.t. the Lagrange parameters, we can formulate the dual optimization problem as :

maximize $\sum_{i=1}^N\alpha_i - \frac{1}{2}\sum_{i,j=1}^Ny_iy_j\alpha_i\alpha_j\langle X_i, X_j\rangle$

subject to : $\alpha_i\;\ge\;0$ and $\sum_{i=1}^N\alpha_iy_i=0$

where the term $\langle X_i, X_j\rangle$ in the dual objective function represents the inner product $X_i^TX_j$ of the vectors $X_i$ and $X_j$.

To satisfy the KKT conditions, we also need that:

$\alpha^*_i[y_i(w^{*T}X_i+b)-1]=0$

for the optimal solutions $\alpha^*_i$ and $w^*$, for all i=1 to N. If $\alpha^*_i > 0$ for a training example, then we must have $y_i(w^{*T}X_i+b)-1=0$ for that example. This is only possible if the example lies on the margin, i.e. it is a support vector. In other words, $\alpha^*_i > 0$ only for the support vectors. Every other example satisfies $y_i(w^{*T}X_i+b)-1 > 0$, and hence for these examples $\alpha^*_i=0$.

In order to predict the class label for any unseen example X, we compute $w^{*T}X+b$. From the above, we have seen that to satisfy the KKT conditions, $w^*=\sum_{i=1}^N\alpha^*_iy_iX_i$ for the training examples $\left\{(X_i, y_i)\right\}$. Thus

$w^{*T}X + b=\sum_{i=1}^N\alpha^*_iy_iX_i^TX + b=\sum_{i=1}^N\alpha^*_iy_i\langle X_i, X\rangle + b$

Since $\alpha^*_i > 0$ only when $X_i$ is a support vector and $\alpha^*_i = 0$ otherwise, we compute the inner product of the test example with only the support vectors and then take the sign of the sum to determine the class label. This is unlike KNN, where we compute the distance of the test example to all the training examples, so SVM prediction is much faster.
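This support-vector-only prediction can be verified with scikit-learn, whose `dual_coef_` attribute stores the products $\alpha^*_iy_i$ for the support vectors (the toy data below is made up; a large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs as a toy data set.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2.0, rng.randn(20, 2) - 2.0])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

x_new = np.array([1.5, 1.0])
# Decision value from the support vectors alone:
#   sum_i alpha_i^* y_i <X_i, x> + b
manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
library = clf.decision_function([x_new])[0]   # agrees with the manual sum
```

The predicted label is then `np.sign(manual)`, using only the handful of support vectors rather than all 40 training points.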

We have not yet solved the dual optimization problem. A technique called Sequential Minimal Optimization (SMO) is used to solve the dual optimization problem and find the optimal values for $\alpha_i$ and $b$. We will discuss the algorithm in a later post. In the meantime we will explore SVM for the cases where the training examples are not linearly separable in the original M-dimensional input space. The following two methods are popularly used :

1. Kernel transformation to higher dimensional feature space.
2. $L_1$ Regularization.

In a kernel transformation, the training examples are transformed from the original M-dimensional input space $X_i$ to a higher dimensional feature space $\phi(X_i)$. The assumption is that examples which are not linearly separable in the original input space become linearly separable when transformed to a higher dimension. In the transformed feature space, the SVM model is defined to be :

$h(X)=sgn(\sum_{i=1}^N\alpha^*_iy_i<\phi(X_i), \phi(X)> + b)=sgn(\sum_{i=1}^N\alpha^*_iy_iK(X_i, X) + b)$

where $K(X_i, X_j)=\phi(X_i)^T\phi(X_j)$ is the Kernel function.

Notice that we are working with the inner product of the transformed features. It is much easier to compute the kernel functions $K(X_i, X_j)$ compared to computing the individual transformations $\phi(X_i)$ and $\phi(X_j)$ and then computing their inner product.

For example consider the Kernel function :

$K(X, Z)=(X^TZ)^2$

Assuming that $X$ and $Z$ are M-dim vectors, finding $K(X, Z)$ requires O(M) operations. Notice that if the transformation function is defined as :

$\phi(X)=[x_1x_1\;x_1x_2\;...\;x_2x_3\;...\;x_Mx_M]^T$

Then the inner product :

$\phi(X)^T\phi(Z)=\sum_{i=1}^M\sum_{j=1}^Mx_iz_ix_jz_j=(X^TZ)^2$

which is the same as computing the kernel $K(X, Z)$ directly. Computing $\phi(X)^T\phi(Z)$ explicitly, however, requires $O(M^2)$ operations. The commonly used kernel functions in SVM are the polynomial kernel and the radial basis kernel. Whereas polynomial kernels take the form :

$K(X, Z)=(X^TZ + c)^d$ where 'd' is the degree of the polynomial,

radial basis kernels are represented as :

$K(X, Z)=\exp(-\gamma||X-Z||^2)$

Among the two, the radial basis kernel is the preferred choice for most difficult classification problems, whereas for applications with hard bounds on running time the polynomial kernel is generally chosen, with speed given priority over accuracy. Although a low-degree polynomial kernel has lower time complexity, the radial basis function corresponds to a polynomial of infinite degree (via the Taylor expansion of $e^x$), so classes that are difficult to separate in the input space are often well separated by a hyperplane in the feature space induced by the radial basis kernel.
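As a quick check of the kernel-trick arithmetic above, the quadratic kernel $(X^TZ)^2$ evaluated directly in $O(M)$ matches the inner product of the explicit $O(M^2)$ feature map of all pairwise products $x_ix_j$ (the vectors and the RBF width $\gamma = 0.5$ below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.RandomState(1)
M = 4
X, Z = rng.randn(M), rng.randn(M)

# Direct kernel evaluation: O(M) operations.
k_direct = (X @ Z) ** 2

# Explicit feature map phi(V) = [v_i * v_j for all i, j]: O(M^2) features.
def phi(v):
    return np.outer(v, v).ravel()

k_explicit = phi(X) @ phi(Z)          # equals k_direct

# Radial basis kernel for the same pair (gamma = 0.5 assumed for illustration).
k_rbf = np.exp(-0.5 * np.linalg.norm(X - Z) ** 2)
```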

*Figure: Kernel transformation*

But how do we know whether a chosen function is a valid kernel function? If $K(X, Z)$ is a valid kernel function, then for any two examples $X_i$ and $X_j$ we have

$K(X_i, X_j)=\phi(X_i)^T\phi(X_j)=\phi(X_j)^T\phi(X_i)=K(X_j, X_i)$

i.e. if we create the kernel matrix $K$, where $K_{ij}=K(X_i, X_j)$, then it must be symmetric, $K=K^T$. Also, for any vector $z$, we must have :

$z^TKz\;\ge\;0$

i.e. the symmetric matrix K must also be positive semidefinite. Thus the necessary and sufficient condition for the function $K(X, Z)$ to be a valid kernel function is that, for any finite set of examples, the matrix with entries $K_{ij}=K(X_i, X_j)$ is symmetric positive semidefinite. This is known as Mercer's theorem. We can therefore experiment with different kernel functions as long as they satisfy these conditions.
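These conditions can be checked empirically on a sample of points. The sketch below builds an RBF kernel matrix (with an assumed width $\gamma = 0.5$) and verifies symmetry and positive semidefiniteness for that sample only; it is a numerical sanity check, not a proof of validity:

```python
import numpy as np

# Sample of points on which to test the candidate kernel.
rng = np.random.RandomState(0)
X = rng.randn(10, 3)

gamma = 0.5  # assumed RBF width for illustration
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)          # K_ij = K(X_i, X_j)

symmetric = np.allclose(K, K.T)                 # K = K^T
psd = np.linalg.eigvalsh(K).min() > -1e-10      # eigenvalues >= 0 up to rounding
```

A symmetric matrix is positive semidefinite exactly when all its eigenvalues are non-negative, which is why `eigvalsh` suffices here.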

Although transforming the examples from the original input space to a higher dimensional feature space can help us find a separating hyperplane, in forcing a separation of examples that are not separable in lower dimensions we often end up modeling noise with high dimensional hyperplanes. To handle this situation effectively, we use the $L_1$ regularization technique.

The regularized Quadratic programming problem is formulated as :

minimize $\frac{1}{2}||w||^2 + C\sum_{i=1}^N\zeta_i$ for some constant C > 0

subject to : $y_i(w^TX_i+b)\;\ge\;1-\zeta_i$ for i=1 to N

and $\zeta_i\;\ge\;0$

*Figure: Soft margins hyperplane*

The margins are now soft margins, as they allow examples (possibly noise and outliers) to have margins less than 1. For each example whose margin is less than 1, i.e. $\zeta_i > 0$, the objective function is penalized by an additional term $C\zeta_i$.
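The effect of C can be seen on overlapping classes, where no separating hyperplane exists and soft margins are required (the blobs below are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping Gaussian blobs: hard margins are infeasible here.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 1.0, rng.randn(50, 2) - 1.0])
y = np.array([1] * 50 + [-1] * 50)

n_sv = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv[C] = len(clf.support_)   # support vectors include margin violators
```

Smaller C tolerates more slack (more nonzero $\zeta_i$), typically yielding more support vectors, while larger C penalizes violations heavily and approaches the hard-margin behavior.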
