# Convex Optimization in Machine Learning

In an earlier post, we introduced one of the most widely used optimization technique, the gradient descent and its scalable variant, the Stochastic Gradient Descent. Although the SGD is an efficient and scalable technique to optimize a function, but the drawbacks with both gradient descent and SGD is that they are susceptible to find local optimum. The gradient descent technique is not suited to find local or global optimum with constraints in place.

In general, an objective function to be optimized along with all the constraints, need not have a global optimum and for such problems, approximation algorithms like Hill Climbing, Simulated Annealing, Genetic algorithms etc. works well in finding a local optimum. But finding a global optimum with these techniques is very time consuming. Here we will discuss a special class of constrained optimization problems called convex optimization for which the local optimum is a global optimum. In fact many of the widely used machine learning techniques fall into the class of convex optimization problems or can be transformed into one. The beauty of this class of problems is that they could be very efficiently solved for a global optimum using techniques such as Linear Programming, Quadratic Programming, Quadratically Constrained Quadratic Programming etc.

Mathematically a convex optimization problem is formulated as :

minimize f(x)

subject to : $g_i(x)\;{\le}\;0$ for i=1, 2, ..., m

and $h_j(x)=0$ for j=1, 2, ..., p

where f(x) is a convex function, $g_i(x)$ are convex functions and $h_j(x)$ are affine functions and 'x' is the optimization variable.

A function $f(x) : R^n{\rightarrow}R$ is convex if its domain denoted by D(f) is a convex set and if for all x, y $\epsilon$ D(f) and $\theta\;{\epsilon}\;R$, with $0\;{\le}\;\theta\;{\le}\;1$, we have :

$f({\theta}x + (1-\theta)y) \;{\le}\; {\theta}f(x) + (1-\theta)f(y)$

Intuitively, if we pick any two points on the graph of the convex function f(x) and draw a straight line between them, then all points lying between these two points will lie below this straight line. The below diagram illustrates this criteria.

Convex Function

A set C is convex, if for any x, y $\epsilon$ C and $\theta\;{\epsilon}\;R$, with $0\;{\le}\;\theta\;{\le}\;1$, we have :

${\theta}x + (1-\theta)y \;\epsilon\; C$

i.e. if we connect any two elements x, y within a convex set C by a straight line, then all the points of the straight line lies within the convex set C. Below are two diagrams, one of them is a convex set while the other one is not.

Apart from the general definition of a convex function, we have the first order condition of convexity which states that, if the function f(x) is differentiable (i.e. ${\nabla}_xf(x)=\frac{\partial}{{\partial}x}f(x)$ exists at all points x in the domain of f), then f(x) is convex if and only if D(f) is a convex set and for all x, y $\epsilon$ D(f),

$f(y)\;{\ge}\;f(x)+{\nabla}_xf(x)^T(y-x)$

Intuitively, if we approximate a function f by it's first order approximation at the point 'x' (Taylor series expansion) i.e. the tangent line at the point 'x', then every point on this tangent line lies below the corresponding point on the graph.

First Order Approximation

Similar to the first order approximation, the second order condition for convexity states that if the function f(x) is twice differentiable (i.e. the Hessian ${\nabla}^2_xf(x)=\frac{\partial^2}{{\partial}x}f(x)$ exists at all points x in the domain of f), then f(x) is convex if and only if D(f) is a convex set and for all x $\epsilon$ D(f), the Hessian ${\nabla}^2_xf(x)$ is positive semi-definite.

A matrix M is positive semi-definite, if for any $x\;{\epsilon}\;R^n$, $x^TMx\;{\ge}\;0$. In one dimensional case, this is equivalent to saying that $f''(x)\;\ge\;0$.

An affine function is $f(x) : R^n{\rightarrow}R$, such that

$f(x)=b^Tx + c$ for some $b\;{\epsilon}\;R^n$ and $c\;{\epsilon}\;R$

The Hessian ${\nabla}^2_xf(x)=0$, implies that an affine function f(x) is both convex and concave. Let's prove that in a convex optimization problem there are no local minimas but only global minima(s).

A point 'x' is locally optimal if it feasible (i.e. satisfies the constraints of the optimization problem) and there exists some R > 0, such that for all feasible points 'z', with $||x-z||_2\;\le\;R$, we have $f(x)\;\le\;f(z)$ or in other words we can draw a n-dimensional 'sphere' of radius R around the point 'x' such that all parts of the function f lying within this sphere lies above f(x). Whereas in the case of global optima the above is true for all R > 0.

Given a convex optimization problem on the domain D(f), assume that there exists a point 'x' which is a local optimum but it is not a global optimum, i.e. there exists some $y\;{\epsilon}\;D(f)$ such that f(x) > f(y). Since the point 'x' is a local optimum, we can choose a small enough R > 0, such that for all points 'z' within this radius R, i.e. $||x-z||_2\;\le\;R$, it is not possible to have f(z) < f(x). But now suppose that we choose 'z' as :

$z={\theta}y + (1-\theta)x$, where $\theta=\frac{R}{2||x-y||_2}$

Then,

$||x-z||_2=||x-\frac{R}{2||x-y||_2}y + (1-\frac{R}{2||x-y||_2})x||_2=||\frac{R}{2||x-y||_2}(x-y)||_2=\frac{R}{2}\;\le\;R$

By the convexity of f, we have :

$f(z)=f({\theta}y + (1-\theta)x)\;\le\;{\theta}f(y)+(1-\theta)f(x)\;\le\;f(x)$

Since x, y $\epsilon$ D(f) (a convex set), hence $z={\theta}y + (1-\theta)x$ $\epsilon$ D(f). Therefore for any small enough R > 0, there exists some 'z' such that $||x-z||_2\;\le\;R$ and f(z) < f(x), hence our assumption that 'x' is a local minima is false.

There are a few special cases of convex problems : Linear Programming (LP), Quadratic Programming (QP), Quadratic Constrained Quadratic Programming (QCQP). We shall explore LP and QP with a few examples with R.

A convex optimization problem is a Linear Program, if both the objective function f(x) and the inequality constraint function $g_i(x)$ are affine functions. A Linear Program has the following formulation :

minimize $c^Tx+d$

subject to : $Gx\;\le\;h$, element-wise inequality

and $Ax=b$

where $x\;\epsilon\;R^n$ is the optimization variable, $c\;\epsilon\;R^n$, $d\;\epsilon\;R$, $G\;\epsilon\;R^{mxn}$, $h\;\epsilon\;R^m$, $A\;\epsilon\;R^{pxn}$ and $b\;\epsilon\;R^p$.

Let's try to solve an LP example :

You are going to design a biased die, such that the expected value of a throw is maximized. Let's say that you designed the die in such a way that the probability distribution of the six faces are $\left\{P_1, P_2, P_3, P_4, P_5, P_6\right\}$. But there are certain constraints that you must also follow :

1. Sum of the probabilities should equal 1, i.e. $\sum_{i=1}^6P_i=1$
2. Sum of any two probabilities $(P_i, P_j)$ should be less than equal to 0.5, i.e. $P_i+P_j\;\le\;0.5$
3. Probability of each face of the die should at-least equal to 0.1, i.e. $P_i\;\ge\;0.1$

The expected value for a throw that needs to be maximized :

$P_1 + 2*P_2 + 3*P_3 + 4*P_4 + 5*P_5 + 6*P_6$

We have the following optimization problem :

maximize $V^Tx$

subject to : $G^Tx\;\le\;h$

and $Ax=b$

where x = $\begin{bmatrix} P_1\\P_2\\P_3\\P_4\\P_5\\P_6\end{bmatrix}$, V = $\begin{bmatrix} 1\\2\\3\\4\\5\\6\end{bmatrix}$, A = $\begin{bmatrix} 1\\1\\1\\1\\1\\1\end{bmatrix}$, b = $\begin{bmatrix} 1\\1\\1\\1\\1\\1\end{bmatrix}$,

h = ${\begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1\end{bmatrix}}^T$

G = $\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0\\1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0\\0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0\\0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0\\0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & -1 & 0\\0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & -1\end{bmatrix}$

library(lpSolve)

f.obj <- c(1, 2, 3, 4, 5, 6)
f.con <- matrix(c(1, 1, 1, 1, 1, 1), nrow=1)

for (i in 1:5) {
for (j in (i+1):6) {
vec <- rep(0, 6)

vec[i] = 1
vec[j] = 1

f.con <- rbind(f.con, vec)
}
}

f.con <- rbind(f.con, diag(6))

f.dir <- c("=", rep("<=", 15), rep(">", 6))
f.rhs <- c(1, rep(0.5, 15), rep(0.1, 6))

out <- lp("max", f.obj, f.con, f.dir, f.rhs)

We use the "lpSolve" package in R to solve the above LP optimization problem. On solving we get that the maximum expected value in a throw as 4.2 and the optimum values are obtained for:

x = $\begin{bmatrix} P_1\\P_2\\P_3\\P_4\\P_5\\P_6\end{bmatrix}=\begin{bmatrix} 0.10\\0.10\\0.15\\0.15\\0.15\\0.35\end{bmatrix}$

Let's look at another LP problem, but this one is bit interesting.

You need to buy some filing cabinets. You know that Cabinet X costs $10 per unit, requires six square feet of floor space, and holds eight cubic feet of files. Cabinet Y costs$20 per unit, requires eight square feet of floor space, and holds twelve cubic feet of files. You have been given $165 for this purchase, though you don't have to spend that much. The office has room for no more than 72 square feet of cabinets. How many of which model should you buy, in order to maximize storage volume? Suppose you buy $N_x$ quantity of cabinet X and $N_y$ quantity of cabinet y, then : Cost = $10*N_x+20*N_y\;\le\;165$ Area = $6*N_x+8*N_y\;\le\;72$ and Volume = $8*N_x+12*N_y$ Then the LP optimization problem becomes : maximize $v^Tx$ subject to : $g^Tx\;\le\;h$ where x = $\begin{bmatrix} N_x\\N_y\end{bmatrix}$, v = $\begin{bmatrix} 8\\12\end{bmatrix}$, g = $\begin{bmatrix} 10 & 6\\20 & 8\end{bmatrix}$ and h = $\begin{bmatrix} 165\\72\end{bmatrix}$ Using similar code as the earlier problem : library(lpSolve) f.obj <- c(8,12) f.con <- matrix(c(10, 20, 6, 8), nrow = 2, byrow = T) f.dir <- c("<=", "<=") f.rhs <- c(165, 72) out <- lp("max", f.obj, f.con, f.dir, f.rhs) On solving we get that the maximum value of the volume as 105 and the optimum values are obtained for: x = $\begin{bmatrix} N_x\\N_y\end{bmatrix}=\begin{bmatrix} 3.00\\6.75\end{bmatrix}$ But wait, how can the number of models for cabinet Y, be fractional. We have missed a very important constraint for the optimization variable, i.e. they must be lying only in the Integer space $x\;\epsilon\;Z^n$. When the domain of the optimization variable lies in Integers, we have a special case of optimization problem called Integer Programming. We will explore more of Integer Programming algorithms in later posts. Meanwhile, to solve the above Integer Programming problem, "lpSolve" allows to pass a vector to the "lp" function containing the indices of the optimization components which must be integers. In our case both $N_x$ and $N_y$ must be integers. out <- lp("max", f.obj, f.con, f.dir, f.rhs, int.vec = c(1,2)) On solving the above we get the maximum value of the volume as 104 (less optimal than the generic LP problem) and the solution is : x = $\begin{bmatrix} N_x\\N_y\end{bmatrix}=\begin{bmatrix} 4\\6\end{bmatrix}$ A convex optimization problem is a Quadratic Program, if the inequality constraint function $g_i(x)$ are still affine functions but the objective function f(x) is a convex quadratic function. Mathematical formulation of a QP is : minimize $\frac{1}{2}x^TPx+c^Tx+d$ subject to : $Gx\;\le\;h$, element-wise inequality and $Ax=b$ where $x\;\epsilon\;R^n$ is the optimization variable, $c\;\epsilon\;R^n$, $d\;\epsilon\;R$, $G\;\epsilon\;R^{mxn}$, $h\;\epsilon\;R^m$, $A\;\epsilon\;R^{pxn}$, $b\;\epsilon\;R^p$ and $P\;\epsilon\;S_+^n$ is a symmetric positive semidefinite matrix. Portfolio Optimization problem is a classic example of a QP. We will use the 225-Asset dataset from the OR Library. Suppose there are N assets $\left\{A_i\right\}$. The rate of return for asset $\left\{A_i\right\}$ is a random variable with expected value of $R_i$. The problem is to find what fraction $x_i$ to invest in asset i, so as to minimize the overall risk. The overall risk to be minimized is computed as : $\frac{1}{2}x^TCx$ where C is the covariance matrix of the asset returns and x is the vector of the fractions to invest in each asset. This is the mean-variance model of portfolio optimization. There is also the constraint that the expected rate of return over all the assets should be at-least equals to 'r', i.e. $\sum_{i=1}^Nx_iR_i\;\ge\;r$ Apart from the constraints, that the sum of the fractions $x_i$ should equal to 1, i.e. $\sum_{i=1}^Nx_i=1$ and each $x_i$ should be between 0 to 1, i.e. $0\;\le\;x_i\;\le\;1$. From the given data, we have the mean and the standard deviation of return for each asset and also the correlations between each asset. To obtain the covariance matrix from the correlation matrix, we will use : $Cov(X, Y)=Corr(X, Y)*SD(X)*SD(Y)$ where SD(X) is the standard deviation of the variable X. To implement QP, we use the "quadprog" package in R. First we read the input files and then construct the covariance matrix : library(quadprog) f1 <- read.csv("returns.csv") f2 <- read.csv("Corr.csv") expected.returns <- f1$Mean.Return
std.dev.returns <- f1$Std..Dev..Return cov.assets <- matrix(0, nrow=length(unique(f2$Asset.i)), ncol=length(unique(f2$Asset.i))) for (i in 1:nrow(f2)) { x <- f2$Asset.i[i]
y <- f2$Asset.j[i] corr <- f2$Correlation[i]

sdev1 <- std.dev.returns[x]
sdev2 <- std.dev.returns[y]

q <- corr*sdev1*sdev2

cov.assets[x, y] <- q
cov.assets[y, x] <- q
}

Then we define the function that will create the parameters for using the solve.QP function from the "quadprog" package.

portfolio.opti <- function(cov.assets, expected.returns, desired.rate.of.return=0.002) {
Dmat <- cov.assets
dvec <- rep(0, nrow(Dmat))

bounds <- diag(nrow(Dmat))

Amat <- cbind(rep(1, nrow(Dmat)), bounds, -bounds, expected.returns)

bvec <- c(1, rep(0, nrow(Dmat)), rep(-1, nrow(Dmat)), desired.rate.of.return)

r <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
sol <- ifelse(r$solution < 1e-5, 0, r$solution)

list("Risk"=r\$value, "Solution"=sol)
}

The above function returns the minimum risk for a minimum desired rate of return from the assets. Calling the above function with the covariance matrix of the assets, the expected rate of return for each asset and the minimum overall desired rate of return from the assets (=0.002), we get the minimum risk equals to 0.0001949121. The minimum risk increases with a higher desired rate of return.

In later posts, we will discuss about some generic algorithms for solving Convex Optimization problems and also look at specific algorithms for LP and QP. Meanwhile you can go through some of the below references to understand the concepts in much more detail.

Categories: MACHINE LEARNING, PROBLEM SOLVING