# Maximum Likelihood Estimation

Observations from a probability distribution, depends on the parameters of that model. For example, given an unbiased coin with equal probability of landing heads as well as tails, what is probability of observing the sequence "HHTH". Our knowledge from probability theory says that since the toss of a coin follows the binomial distribution, the probability of the observation should be 0.54 = 0.0625, but what if the coin was biased and the probability of landing heads was 0.7 (and 0.3 for tails) ? The answer would be 0.103.

Thus the probability of the observation is a function of the probability of heads (or tails). Lets denote that probability as P(O; p) where 'p' is the probability of heads,

P(O=HHTH; p=0.5) = 0.0625 and P(O=HHTH; p=0.7) = 0.103.

But in more common scenarios, given the observation, we need to estimate the probabilities. The aim of maximum likelihood estimation is to find the parameter value(s) that makes the observed data most likely, or in other words given the observation 'O', what should be the value of 'p' so that  P(O; p) is maximum.

Continuing with our coin tossing example, let's suppose we do not know the value of 'p' and we have to estimate it given the observation 'HHTH'. Since the probability of heads is 'p', hence the probability of tails is (1-p). From binomial distribution, the likelihood of the observed sequence as a function of 'p' is :

$L(p; O=HHTH) = p^3 * (1-p)$

Our goal is to maximize this quantity.

Probability of the observation as a function of the unknown parameter 'p'

There are many possible ways the above function can be maximized. If the function is too complex then we normally use numerical algorithms to estimate the values of the parameters. But for simple functions like the one above, we can find a solution by setting the first derivative of the likelihood w.r.t the unknown parameters. Sometimes the function may be having local minima and maxima and hence the solution obtained may not be the globally correct answer.

Complex curve demonstrating local maxima and minima.

Since we are dealing with probabilities (real numbers between 0 and 1), raising them to exponents and taking products will result in integer underflow. We can overcome that by taking Log Likelihood of the function, i.e.

$Log L(p; O=HHTH) = 3 * log (p) + log (1-p)$

Taking the derivative of the above quantity w.r.t. 'p' and setting it to zero, gives us : $3(1-p) = p$,  solving for p gives us, p=3/4=0.75. The answer seems pretty intuitive and one might wonder why did we go through the pain of computing derivatives and roots. True. But the point was to illustrate that the approach is generic.

Lets look at a more complex example with Normal distribution.

Given a sequence of identical and  independent observations X1, X2, ..., Xn from a normal distribution with unknown parameters $\mu$ and $\sigma$, estimate the parameters so as to fit the distribution to the observation.

The Gaussian distribution is defined as

$N(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Hence the probability of the sequence of observations is the product of the probabilities of each observation (since the observations are I.I.D.), thus :

$P(O | \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}$

Taking the log likelihood of the above, we get :

$Log L(\mu, \sigma | O) = \sum_{i=1}^{n} log (\frac{1}{\sqrt{2\pi}}) - log (\sigma)-\frac{(X_i-\mu)^2}{2\sigma^2}$

To maximize the above function, we will use the same technique as earlier i.e. take the derivatives w.r.t. the unknown parameters and find their roots. On taking the derivative w.r.t. $\mu$ we get :

$\sum_{i=1}^{n} (X_i-\mu) = 0$,

solving for $\mu$ we get,

$\mu = \frac{\sum_{i=1}^{n} X_i}{n}$,

i.e. the mean of the observations. Similarly on taking derivative w.r.t. $\sigma$, we get the following :

$\sum_{i=1}^{n} -\frac{1}{\sigma} +\frac{(X_i-\mu)^2}{\sigma^3} = 0$,

solving for $\sigma$, we get

$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i-\mu)^2}{n}}$,

which is the standard deviation of the observations.

Categories: MACHINE LEARNING