In the last post, we introduced a technique called the Maximum Likelihood Estimation (MLE) to estimate unknown parameters of a probability distribution given a set of observations. Although it is a very useful technique, but it assumes that all information about the observation is available to us. Consider the example of a two coin toss :
"Given two coins A and B, with probability of heads being 'p' and 'q' respectively. Each time one of the coin is chosen at random and tossed. The following sequence is observed : HA HA TB TA TB HB HB TA HB HA TA TA HA HB HA TB, estimate the values of 'p' and 'q' from the given data"
With direct application of MLE for this problem, one can conclude that
Well that was easy because all the information required to compute the unknown parameters was available. But what if the labels on the coins were hidden and you did not know which coin was used for which toss. Given that the coins are equally likely to be chosen, how will you estimate the unknown parameters ‘p’ and ‘q’ ?
We will try to solve the problem iteratively. In each iteration, we have 2 steps, the ‘E’ step and the ‘M’ step.
“E” Step (Expectation) :
- Start with some initial guess for the values p and q.
- Instead of saying that a coin toss has come from a particular coin, we will say that it can come from coin A with probability ‘x’ and from coin B with probability ‘1-x’.
- Compute the expected number of heads and tails from each coin.
“M” Step (Maximization) :
- Compute the log likelihood of the expected number of heads and tails from each coin in step 3 from “E” step, similar to our MLE calculations above.
- Maximize the log likelihood w.r.t. the unknown parameters and re-estimate the new values of p and q at which the log likelihood gets maximized for each coin.
- Repeat the ‘E’ step with the new p and q values until it converges.
Let us take an example, where 5 experiments were done and in each experiment 10 tosses. (using two coins).
1 H T T T H H T H T H 2 H H H H T H H H H H 3 H T H H H H H T H H 4 H T H T T T H H T T 5 T H H H T H H H T H
We start with some initial guess for the unknown parameters p=0.6 and q=0.5. Lets' take the first experiment. Call this observation 'S', we want to estimate what is the likelihood that the observation 'S' is coming from coin A, i.e. P(A|S). Recall from Bayes' Theorem that :
P(A) is the probability of choosing coin A, it is 0.5 (so is P(B)) as we know that each coin has equal probability of being picked. P(S|A) is the probability of the observation given that it has come from coin A, i.e. using binomial distribution, we infer that it is :
, and similarly
P(S) is the probability of the observation. Since the observation can come from either coin A or coin B or both, hence :
Thus we now get
Substituting the values of our initial guesses for p=0.6 and q=0.5 , we get P(A|S) = 0.45, hence P(B|S) = 1-P(A|S) = 0.55. Thus given the observation 1, the probability of it coming from coin A is 0.45 and from coin B 0.55. Thus the expected number of heads coming from coin A = 5*0.45 and tails = 5*0.45, similarly the expected number of heads coming from coin B = 5*0.55 and tails = 0.5*0.55. Repeating the same Expectation (E) steps for the other four experiments, we get the total number of expected heads from coin A = 21.3 and tails = 8.6, similarly for coin B, the total number of expected heads = 11.7 and tails = 8.4
The new estimates for the unknown parameters p and q are thus
The previous step was the 'M' step or the Maximization step. We repeat the above EM steps until the values of 'p' and 'q' converges. In this example, the values of 'p' and 'q' converges to the final values p=0.8 and q=0.52 in about 10 steps.
The above is a very simple example of an application of the EM Algorithm. It serves to show that given a parameter estimation problem with missing data, EM Algorithm can iteratively solve the problem by generating likely guesses for the missing data and then maximizing the likelihood of the observations by using these guesses. Apart from the simple coin toss example, EM has been successfully used in training HMM for hidden states, EM is also used for Clustering applications and in Semi-Supervised Learning, which we will explore in later posts.
One final thought that why does the EM converges or why should it ? The following paper nicely derives the EM for a much more generic scenario and also provides an explanation for why does the method converges.
Theory has been inspired from the following paper :
Categories: MACHINE LEARNING