Machine Learning, AI and Programming

Monte Carlo Sampling Techniques

In the last post, we saw how to sample random values from a target probability distribution (both discrete and continuous) using techniques like the inverse CDF method, the transformation method and so on. All of the earlier discussed methods fall under the category of Monte Carlo techniques. In this post, we will discuss some of the more advanced Monte Carlo techniques and their importance in the area of Machine Learning. We will be covering the following techniques : Rejection Sampling, Importance Sampling and Markov Chain Monte Carlo (the Metropolis-Hastings algorithm and the Gibbs sampling algorithm).

Many machine learning algorithms require computing the expectation of a function over an exponential number of different configurations of the input variables, for example :

\int_X\;f(X)P(X)\;dX or  \sum_Xf(X)P(X) for discrete distributions

where P(X) is the probability distribution of the input variable X. The expectation is over the entire domain of X, which can be prohibitively expensive to compute for problems involving input pixels in an image, bag of words in text documents, etc. Can we sample N values from the distribution P(X) such that the expectation computed with the sampled values of X closely matches the true expectation ?

Suppose we sample X_i from P(X) for i = 1 to N. Then we can approximate the expectation as follows :

\int_X\;f(X)P(X)\;dX\;{\approx}\;\frac{1}{N}\sum_{i=1}^Nf(X_i) where X_i{\sim}P(X)
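As a quick sanity check (an illustrative example, not from the original post), the estimator can be verified in R on a case with a known answer, here f(x) = x^2 under X ~ N(0, 1), whose true expectation is 1 :

```r
# Monte Carlo estimate of E[f(X)] = integral of f(x) P(x) dx.
# Minimal sketch: f(x) = x^2 and X ~ N(0, 1), so the true value is 1.
set.seed(42)
N <- 100000
x <- rnorm(N)            # draw X_i ~ P(X)
estimate <- mean(x^2)    # (1/N) * sum_i f(X_i)
```

With N = 100K samples, the estimate lands within a few hundredths of the exact value.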
Similarly, in Bayesian Inference, if we need to compute the posterior probabilities as :

P(X|Y)=\frac{P(Y|X)P(X)}{\int_XP(Y|X)P(X)\;dX} or P(X|Y)=\frac{P(Y|X)P(X)}{\sum_XP(Y|X)P(X)}

which requires computing the quantity \int_XP(Y|X)P(X)\;dX or \sum_XP(Y|X)P(X). Again, we can sample N values of X, i.e. X_i{\sim}P(X), and use these values to approximate the above quantities as :

\int_XP(Y|X)P(X)\;dX\;{\approx}\;\frac{1}{N}\sum_{i=1}^NP(Y|X_i)
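For a conjugate model the evidence integral has a closed form, so the approximation can be checked directly (a constructed example, not from the original post) : with prior X ~ N(0, 1) and likelihood Y | X ~ N(X, 1), the marginal at an observed y = 1 is exactly N(1; 0, sqrt(2)) :

```r
# Monte Carlo estimate of the evidence P(Y) = int P(Y|X) P(X) dX.
# Assumed conjugate setup: X ~ N(0,1) prior, Y|X ~ N(X,1) likelihood, y = 1.
set.seed(1)
N <- 100000
x <- rnorm(N)                      # X_i ~ P(X)
evidence <- mean(dnorm(1, x, 1))   # (1/N) * sum_i P(y | X_i)
# exact answer for comparison: dnorm(1, 0, sqrt(2))
```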
Similarly, in problems like marginalization, optimization using maximum likelihood estimates, etc., we can use sampled values from the input distribution to approximate quantities that are hard (expensive) to compute exactly.

Note that if we simply generate values using a built-in uniform sampling library, the sampled values lose the information about the original distribution, and hence the approximate values of the computed quantities are way off. With Monte Carlo techniques the samples follow the target distribution, so no sacrifice in the model or in the solution is made.

Rejection Sampling

In rejection sampling, we sample from a target probability density function p(x), given that we can sample from a probability density function q(x) using some inbuilt library. We can evaluate the target density p(x) but cannot sample from it directly. The idea is that if M*q(x) forms an envelope over p(x) for some M > 1, i.e.

\frac{p(x)}{q(x)} < M for all x

then we sample some x_i from q(x), and if the point y_i=u*M*q(x_i) lies below p(x_i) for some u ~ U(0,1), we accept x_i, else we reject it, i.e.

If U(0,1)*M*q(x_i) < p(x_i) accept x_i else reject

Rejection Sampling

Normally, choosing q(x) to be the uniform distribution or a Gaussian distribution should work. The probability that a sample x_i is accepted is proportional to \frac{1}{M}, hence choosing M to be very high will reduce the probability of x_i getting accepted.

Proposal and target distributions

In the above diagram, the distribution shown in green is our target distribution, i.e. p(x), from which we need to sample, and the red plot shows the envelope built from the distribution available to sample from, i.e. M*q(x).

p(x)=N(x; 1, 2)+N(x; 10, 3) and q(x)=N(x; 5, 5)

In the above diagram, the envelope (shown in red) is 5*q(x), i.e. M=5, in our algorithm.

Below is the R code that does the sampling given the number of samples to generate.

# target density (an unnormalized mixture): N(1, 2) + N(10, 3)
p <- function(x) dnorm(x, 1, 2) + dnorm(x, 10, 3)

rejection.sample <- function(n) {
  i <- 0
  out <- c()
  while (i < n) {
    xi <- rnorm(1, 5, 5)                      # sample from q(x) = N(5, 5)
    u <- runif(1)
    if (u < (p(xi)/(5*dnorm(xi, 5, 5)))) {    # accept if u*M*q(xi) < p(xi), with M = 5
      out <- c(out, xi)
      i <- i + 1
    }
  }
  out
}

We generated 100K samples with the above rejection sampling module and plotted the histogram of the generated samples :

Histogram of generated samples.

As you can see, the shape of the histogram of the sampled data resembles the probability density p(x) (shown in green above), with peaks at around 1 and 10 in both.

Since it can be difficult to choose the density q(x) and the constant M such that M*q(x) forms an envelope over the target density p(x), while also keeping M small enough that the probability of rejection stays low, there is an adaptive version of rejection sampling.

In adaptive rejection sampling, we update the envelope M*q(x) depending on whether we accept or reject x_i in each iteration. We start with some normal distribution q(x) as above and choose an arbitrarily large value of M.

Adaptive Rejection Sampling

Then we randomly sample a few x_i's from q(x). For each x_i, we check whether it is accepted or rejected as above. If it is rejected, we construct a tangent to log\;p(x) at x_i. The new envelope function is then the pointwise minimum of all the tangents to log\;p(x). From here on, whenever an x_i is rejected, we update the envelope by recomputing the tangent intersections all over again, and so on. As you can see, this method has drawbacks : it works only if log\;p(x) is a concave function, and the envelope function must be recomputed every time an x_i is rejected.
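The tangent construction can be illustrated for a log-concave density. Below is a toy sketch (an assumed setup, not the full adaptive algorithm) : for the standard normal, log p(x) = -x^2/2 + const is concave, so the pointwise minimum of tangents drawn at a few abscissae upper-bounds log p(x) everywhere :

```r
# Upper envelope of log p(x) built from tangents, valid when log p is concave.
# Toy example with p = standard normal: log p(x) = -x^2/2 - 0.5*log(2*pi).
log.p <- function(x) -x^2/2 - 0.5*log(2*pi)
grad  <- function(x) -x                       # derivative of log p(x)
abscissae <- c(-2, 0, 2)                      # points where tangents are constructed
envelope <- function(x) {
  # tangent at s: log p(s) + (x - s) * grad(s); envelope = pointwise minimum
  min(sapply(abscissae, function(s) log.p(s) + (x - s) * grad(s)))
}
grid <- seq(-4, 4, by = 0.01)
# by concavity, the envelope never falls below log p(x)
ok <- all(sapply(grid, envelope) >= log.p(grid) - 1e-9)
```

Rejected points would be added to `abscissae`, tightening the envelope where it is loose.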

Importance Sampling

Importance Sampling is a Monte Carlo sampling technique used primarily for computing the expectation of a function f(x), i.e.

E_p[f(x)]={\int}f(x)p(x)\;dx
Normally, using a sampling technique as earlier, we would sample x_i from the target distribution p(x) and use them to compute the approximate expected value as :

E_p[f(x)]\;{\approx}\;\frac{1}{N}\sum_{i=1}^Nf(x_i) where x_i{\sim}p(x)
But if the density p(x) is close to zero in the region where f(x) is significant (the tail of a distribution, etc.), then we may not obtain any sample x_i at all from the region of our interest, and thus the approximate expectation is far from the actual expectation. In such scenarios, we sample x_i from some other known density q(x) which is significant compared to p(x) in the region where f(x) is significant. But since we are then sampling from a region where our target density p(x) is low, we assign low weights to these sampled values during expectation calculations. Conversely, in regions where q(x) is lower than p(x), the weights for samples x_i{\sim}q(x) should be high.

After we sample x_i{\sim}q(x), we assign the following weights to the samples :

w_i=\frac{p(x_i)}{q(x_i)}
which obeys our above requirements. And the new expectation calculation is done as follows :

E_p[f(x)]\;{\approx}\;\frac{1}{N}\sum_{i=1}^Nw_if(x_i) where x_i{\sim}q(x)
The reason this works is because :

{\int}f(x)p(x)dx={\int}f(x)\frac{p(x)}{q(x)}q(x)dx\;{\approx}\;\frac{1}{N}\sum_{i=1}^Nw_if(x_i) where x_i{\sim}q(x)
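A standard illustration of the tail problem (an assumed example, not from the post) : estimating P(X > 4) for X ~ N(0, 1), i.e. f(x) = 1{x > 4}. Plain sampling from p(x) almost never lands in the tail, while sampling from a proposal q = N(4, 1) centred on the region of interest and reweighting recovers the answer :

```r
# Importance sampling estimate of E_p[f(x)] with f(x) = 1{x > 4},
# target p = N(0,1), proposal q = N(4,1).
# Exact value for comparison: pnorm(4, lower.tail = FALSE).
set.seed(3)
N <- 100000
x <- rnorm(N, 4, 1)              # x_i ~ q(x)
w <- dnorm(x) / dnorm(x, 4, 1)   # w_i = p(x_i) / q(x_i)
est <- mean(w * (x > 4))         # (1/N) * sum_i w_i f(x_i)
```

With the same number of draws from p(x) directly, we would expect only around 3 tail hits in 100K samples, giving a far noisier estimate.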

Although the method is primarily used for computing the expectation of f(x), we can also use it to obtain samples from p(x). In order to obtain samples, we use the Sampling Importance Resampling (SIR) algorithm. In SIR, the weights are normalized first :

w_i=\frac{p(x_i)/q(x_i)}{\sum_{j=1}^Np(x_j)/q(x_j)}
Then each x_i obtained above is resampled with replacement with probability w_i. Below is the R code to do SIR. We have used the same function p(x) as in rejection sampling and used q(x)=N(10, 5).

sir <- function(n) {
  x <- rnorm(n, 10, 5)                                       # sample from q(x) = N(10, 5)
  weights <- sapply(x, function(xi) p(xi)/dnorm(xi, 10, 5))  # w_i = p(x_i)/q(x_i)
  probs <- weights/sum(weights)                              # normalize the weights
  sample.custom(n, probs, x)                                 # resample with replacement
}

where "sample.custom" is the sampling function for discrete distributions defined in the earlier blog post (base R's sample(x, n, replace = TRUE, prob = probs) would work as well). Following is the histogram of the obtained samples :

SIR with q(x)=N(10, 5)

which looks similar to the original distribution p(x), as expected. It is important to choose the known density q(x) carefully. If instead of choosing q(x) to be N(10, 5) we choose q(x) to be N(-5, 5), which places almost no mass near the mode at 10, then we obtain the below histogram.

SIR with q(x)=N(-5, 5)

Markov Chain Monte Carlo (MCMC)

Importance sampling as seen above might work well in lower dimensions for computing the expectation of f(x), but in higher dimensions the regions of interest for f(x) might have even smaller densities, and finding a proposal density q(x) without knowing the values of p(x) in the regions of interest is very difficult. This is where MCMC techniques come into play. In standard sampling techniques the samples x_i are independent of each other, whereas MCMC generates dependent samples.

Due to the Markov property, each sample x_i is only dependent on the previous sample x_{i-1}. Thus in general, in order to sample from p(x), we sample each x_i from a proposal distribution conditioned on the previous sample :

x_i\;{\sim}\;q(x|x_{i-1})
The idea here is to construct a Markov chain of samples x_i as shown above. The assumption here is that after "enough" samples have been generated, the Markov chain reaches a stationary state, and all samples generated beyond this point reflect the target distribution p(x). The time taken to reach this state is called the "burn-in" period.

You can think of the Markov chain as a directed graph. In a Markov process, a random variable X can take 's' different states [x_1, x_2, ..., x_s]. For example, if X is the variable describing the financial market condition, then the possible states could be "Bull", "Bear", "Stagnant" etc. At any particular instance of time, each state can either remain in the same state or transition to a different state. The state of the variable X at time t, is denoted as X^{(t)}.

Markov Process Financial Market

The probability of transitioning to the state X^{(t)} at time t depends only on the previous state at (t-1), i.e. X^{(t-1)} :

K(X^{(t)}|X^{(t-1)},X^{(t-2)}, ...,X^{(1)})=K(X^{(t)}|X^{(t-1)})

The probabilities K(X^{(t)}|X^{(t-1)}) are known as the transition probabilities. The chain is homogeneous if the transition probabilities are independent of time, i.e.

K(X^{(t)}=x_j|X^{(t-1)}=x_i)=K(x_j|x_i) for all t
The probabilities can then be represented as a transition matrix K,

K_{i,j}=K(x_j|x_i), and \sum_jK_{i,j}=1

Let p(X^{(1)}) denote the starting probability distribution of all the states (it's a row vector of probabilities). Then at time t, the probability distribution of the states is given as :

p(X^{(t)})=p(X^{(t-1)})K

which is the matrix multiplication of the vector p(X^{(t-1)}) and the matrix K.

Given that the graph of the Markov chain is connected (irreducible), the chain is not trapped in cycles (aperiodic) and every state can be reached from any other state in a finite number of steps (ergodic), the state probability distributions p(X^{(t)}) converge to an equilibrium, i.e. at equilibrium, p(X^{(t)})=p(X^{(t+1)}) for all t.

For example, with a 3-state Markov process, let the starting probability distribution over the states be p(X^{(1)})=(0.5, 0.2, 0.3) and the transition probabilities :

K=\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0.1 & 0.9 \\ 0.6 & 0.4 & 0 \end{pmatrix}

Markov Chain with 3 states

Then, p(X^{(2)})=p(X^{(1)}).K=(0.18, 0.64, 0.18)
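These numbers can be checked numerically by iterating the update p(X^{(t)})=p(X^{(t-1)})K. Note that the first row of K must be (0, 1, 0) for every row to sum to one; with that, both p(X^{(2)}) and the equilibrium distribution quoted in this example come out as stated :

```r
# Evolve the state distribution of the 3-state chain: p(X^{(t)}) = p(X^{(t-1)}) %*% K.
p1 <- c(0.5, 0.2, 0.3)
K <- matrix(c(0.0, 1.0, 0.0,
              0.0, 0.1, 0.9,
              0.6, 0.4, 0.0), nrow = 3, byrow = TRUE)
p2 <- p1 %*% K                   # one step: (0.18, 0.64, 0.18)
p <- p1
for (t in 1:200) p <- p %*% K    # repeated steps approach the equilibrium
```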

At equilibrium, the probabilities converge to p(x)=(0.22, 0.41, 0.37), which is the stationary distribution we are interested in. A sufficient, but not necessary, condition to ensure that a particular p(x) is the desired invariant distribution is the following reversibility (detailed balance) condition :

p(X^{(t)})K(X^{(t-1)}|X^{(t)})=p(X^{(t-1)})K(X^{(t)}|X^{(t-1)})
or taking the sum over all possible states at time (t-1), we get :

\sum_{X^{(t-1)}}p(X^{(t-1)})K(X^{(t)}|X^{(t-1)})=p(X^{(t)})\sum_{X^{(t-1)}}K(X^{(t-1)}|X^{(t)})=p(X^{(t)})
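The implication can be verified on a small reversible chain (a constructed example, not from the post) : build K from a symmetric weight matrix W via K_{i,j} = W_{i,j} / \sum_j W_{i,j}; then p(x_i) \propto \sum_j W_{i,j} satisfies detailed balance, and stationarity pK = p follows :

```r
# Reversible chain from symmetric weights: detailed balance implies stationarity.
W <- matrix(c(0, 2, 1,
              2, 0, 3,
              1, 3, 0), nrow = 3, byrow = TRUE)  # symmetric weights W = t(W)
K <- W / rowSums(W)                              # transition matrix, rows sum to 1
p <- rowSums(W) / sum(W)                         # candidate stationary distribution
# detailed balance: p_i * K[i,j] is symmetric in (i, j)
balance <- p * K
# hence p is invariant: p %*% K equals p
```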
We need to design our Markov chain such that it satisfies the above balance condition and also converges to the stationary distribution p(X) quickly. We will look into two algorithms for MCMC sampling : first the Metropolis-Hastings algorithm and then the Gibbs sampler.

Metropolis-Hastings Algorithm :

  1. Initialize x_0 randomly
  2. For i from 1 to N, repeat steps 3 to 6
  3. Sample u\;{\sim}\;U(0,1)
  4. Sample x^*\;{\sim}\;q(x|x_{i-1})
  5. Assign acceptance probability A=min(1, \frac{p(x^*).q(x_{i-1}|x^*)}{p(x_{i-1}).q(x^*|x_{i-1})})
  6. If u < A, then x_i=x^* else x_i=x_{i-1}

In our previous example with the target distribution p(x)=N(x; 1, 2)+N(x; 10, 3), we initialize x_0=0 and consider the following proposal distribution :

q(x|x_{i-1})=N(x; x_{i-1}, 5)

where N(x; x_{i-1}, 5) is the normal distribution with mean at the previously sampled value x_{i-1} and standard deviation 5.

Below is the R code, for sampling from the above p(x) and q(x).

metropolis.hastings <- function(n, p) {
  old.xi <- 0                      # initialize x_0 = 0
  out <- c()
  i <- 1
  while (i <= n) {
    u <- runif(1)
    new.xi <- rnorm(1, old.xi, 5)  # propose x* ~ q(x|x_{i-1}) = N(x_{i-1}, 5)
    accept <- min(c(1, (p(new.xi)*dnorm(old.xi, new.xi, 5))/(p(old.xi)*dnorm(new.xi, old.xi, 5))))
    if (u < accept) {
      out <- c(out, new.xi)
      old.xi <- new.xi
    } else out <- c(out, old.xi)
    i <- i+1
  }
  out
}

On sampling using the M-H algorithm, we obtained the following histogram of the samples :

Histogram of samples generated using Metropolis-Hastings algo

with peaks at 1 and 10 as expected.

In the above example we used a normal distribution as the proposal distribution q(x). The normal distribution is symmetric, so q(x_{i-1}|x^*)=q(x^*|x_{i-1}), and the acceptance probability reduces to :

A=min(1, \frac{p(x^*)}{p(x_{i-1})})

This version is known as the Metropolis algorithm.

Choosing a proper proposal distribution is critical. If the standard deviation of the chosen normal distribution for q(x) is too low, it might model only one of the modes of p(x) (a sum of two normal distributions) well, and if the deviation is too high then there could be a high number of rejections, i.e. A would be very low for most samples.

Sampling with proposal density q(x) with standard deviation of 0.5

Above is the histogram of the sampled values, but with the proposal density q(x|x_{i-1})=N(x; x_{i-1}, 0.5). As you can see, only one mode (N(x; 1, 2)) is modeled properly.

Gibbs Sampler :

Gibbs sampling is a special case of the Metropolis-Hastings algorithm. It is specifically designed for sampling from probability distributions in very high dimensions, e.g. image pixels, bag of words in texts, etc. In our earlier post on feature reduction using Restricted Boltzmann Machines, the values at the visible and hidden nodes in the contrastive divergence step are sampled using the Gibbs sampling technique. Below is the algorithm for Gibbs sampling :

  1. Sample X^{(0)}\;{\sim}\;q(x_1, x_2, ..., x_s)
  2. For i from 1 to N, repeat the below steps until convergence
  3. x^{(i)}_1\;{\sim}\;p(x_1|x^{(i-1)}_2, x^{(i-1)}_3, ..., x^{(i-1)}_s)
  4. x^{(i)}_2\;{\sim}\;p(x_2|x^{(i)}_1, x^{(i-1)}_3, ..., x^{(i-1)}_s)
  5. x^{(i)}_3\;{\sim}\;p(x_3|x^{(i)}_1, x^{(i)}_2, x^{(i-1)}_4, ..., x^{(i-1)}_s)
  6. ...
  7. x^{(i)}_s\;{\sim}\;p(x_s|x^{(i)}_1, x^{(i)}_2, ..., x^{(i)}_{s-1})

where X=[x_1, x_2, ..., x_s] is an 's'-dimensional random variable. Note that inside each iteration, the update for x^{(i)}_2 conditions on the latest update of x^{(i)}_1, the update for x^{(i)}_3 conditions on the latest updates of both x^{(i)}_1 and x^{(i)}_2, and so on.
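A minimal two-dimensional sketch (an illustrative example, not tied to the RBM post) : for a bivariate normal with zero means, unit variances and correlation rho, the full conditionals are x_1 | x_2 ~ N(rho*x_2, 1 - rho^2) and x_2 | x_1 ~ N(rho*x_1, 1 - rho^2), so each Gibbs sweep is just two univariate normal draws :

```r
# Gibbs sampler for a bivariate normal with correlation rho.
set.seed(7)
rho <- 0.8
n <- 50000
x1 <- 0; x2 <- 0                  # arbitrary initialization X^{(0)}
samples <- matrix(0, n, 2)
for (i in 1:n) {
  x1 <- rnorm(1, rho * x2, sqrt(1 - rho^2))  # draw x1 | x2
  x2 <- rnorm(1, rho * x1, sqrt(1 - rho^2))  # draw x2 | x1
  samples[i, ] <- c(x1, x2)
}
# the empirical correlation of the samples recovers rho
```

Note that consecutive samples are dependent, as expected of MCMC; in practice the first few thousand draws would be discarded as burn-in.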

