
Sampling from Probability Distributions

Often we are required to sample random values from a specified probability distribution, which could be either discrete or continuous. With discrete sampling, the generated samples can take only discrete states: for example, a coin toss can come up either heads or tails, and a die can only show a result from the set {1, 2, 3, 4, 5, 6}, and so on. With continuous sampling, the generated samples can take any real value within some range; for example, a sample from a normal distribution can take any value in (-\infty, +\infty).

In both the discrete and the continuous case, the sampled results should closely match the probability distribution from which they were drawn. For example, for a coin with P(Heads)=0.7, if we sample 100 outcomes, the number of heads should come close to 70. As you might have guessed, the more samples we generate, the more closely the empirical distribution of the samples tends to follow the distribution from which they are sampled.

Let us start with discrete sampling. You are given a biased coin with P(Heads)=0.7, i.e. P(Tails)=0.3, and asked to sample 100 outcomes from this distribution. How can we sample outcomes so that they follow the given distribution closely (if not exactly)? Note that if we generate a random number uniformly between 0 and 1, the probability of that number being less than or equal to 0.7 is exactly 0.7. Thus if the uniform random number is less than 0.7 we record "Heads", else we record "Tails". Since most programming languages come with an inbuilt function for generating uniform random numbers, this is an easy task.
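For instance, using R's built-in runif (the function name coin.sample and its default bias argument are just illustrative choices for this example, not part of any library), a minimal sketch of this coin toss sampling could look like :

coin.sample <- function(n, p.heads = 0.7) {
  u <- runif(n)                           # n uniform random numbers in (0, 1)
  ifelse(u < p.heads, "Heads", "Tails")   # u < p.heads happens with probability p.heads
}

tosses <- coin.sample(100)
table(tosses)    # roughly 70 heads and 30 tails out of 100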

But given a probability distribution with many more states,

\left\{X_1, X_2, ..., X_M\right\}

with the corresponding distribution :

P(X_1), P(X_2), ..., P(X_M)

How do we generate samples from this distribution? Recall that the probability density function of a uniform distribution looks like the figure below.

Uniform Distribution between a and b

If we divide the region between a and b into M rectangles, the sum of the areas of the rectangles equals 1. The area under the density from a up to any point between a and b defines the cumulative distribution function of the uniform distribution. If a=0 and b=1, then we can divide the region between 0 and 1 into M rectangles whose widths are the probabilities P(X_1), P(X_2), ..., P(X_M) respectively, so that the right end coordinate of the i-th rectangle is the cumulative sum P(X_1)+...+P(X_i), and the right end of the last rectangle is exactly b=1, since

P(X_1)+P(X_2)+...+P(X_M)=1.

Thus a random number from U(0, 1) falls inside the interval [0, P(X_1)) with probability P(X_1) and is therefore assigned to the state X_1; similarly, a random number falling in [P(X_1), P(X_1)+P(X_2)) occurs with probability P(X_2) and is assigned to the state X_2, and so on.

More generally, to generate N random samples from the distribution P(X_1), P(X_2), ..., P(X_M), first generate N random numbers from the uniform distribution U(0, 1). Then, for each of the N random numbers, check in which interval the number lies: if it lies between P(X_1)+P(X_2)+...+P(X_j) and P(X_1)+P(X_2)+...+P(X_{j+1}), then sample the state X_{j+1}, and so on. Below is R code to generate random samples given a probability distribution.

sample.custom <- function(n, p, v) {
  # n : number of samples to generate
  # p : vector of probabilities P(X_1), ..., P(X_M)
  # v : vector of corresponding states X_1, ..., X_M
  
  u <- runif(n)          # n uniform random numbers in (0, 1)
  x <- rep(0, n)
  
  sums <- cumsum(p)      # cumulative probabilities P(X_1)+...+P(X_i)
  
  for (i in 1:length(p)) {
    # indices of the uniform numbers falling in the i-th interval
    if (i == 1) y <- which(u <= sums[i])
    else y <- which(u > sums[i-1] & u <= sums[i])
    
    x[y] <- v[i]         # assign the corresponding state
  }
  x
}

where "p" is the vector of probability distribution [P(X_1), P(X_2), ..., P(X_M)] and "v" is the vector of sample states [X_1, X_2, ..., X_M] and "n" is the sample size to generate.

x <- sample.custom(100000, c(0.45, 0.25, 0.10, 0.15, 0.04, 0.01), c(1, 2, 3, 4, 5, 6))
p <- table(x)
p/sum(p)

With the distribution [0.45, 0.25, 0.10, 0.15, 0.04, 0.01] and the states [1, 2, 3, 4, 5, 6], we generate 100000 samples and compare the distribution of the generated samples with the desired distribution.

x
      1       2       3       4       5       6 
0.45095 0.24889 0.09935 0.15062 0.04056 0.00963
x
      1       2       3       4       5       6 
0.44795 0.24954 0.10187 0.14998 0.04091 0.00975
x
      1       2       3       4       5       6 
0.45016 0.25187 0.09980 0.14919 0.03924 0.00974

We ran the experiment 3 times above, and looking at the results, we see that the empirical distributions are quite close to the desired distribution.

If the cumulative distribution function of the desired probability distribution P for the random variable X is denoted by F(x), i.e. F(x)=P(X\;{\le}\;x), then a random sample from the distribution P can be generated as F^{-1}(u), where u is a random real number drawn from the uniform distribution U(0, 1) and F^{-1} is the inverse of the cumulative distribution function. This is known as inverse transform sampling.

For example, generate samples from the sigmoid cumulative distribution function, i.e.

F(x)=\frac{1}{1+e^{-x}}

Let y=\frac{1}{1+e^{-x}}, then x=-log(\frac{1}{y}-1)

Thus, F^{-1}(u)=-log(\frac{1}{u}-1)

Generate a random uniform real number u between 0 and 1, and compute F^{-1}(u) as given by the above equation.

The above sigmoid cumulative distribution function corresponds to the following PDF (\frac{d}{dx}F(x)):

\frac{e^x}{(1+e^x)^2}

which takes the following shape :

Original probability density function

On generating the random samples from the following function :

F^{-1}(u)=-log(\frac{1}{u}-1)

u <- runif(100000)     # 100000 uniform random numbers in (0, 1)
q <- -log((1/u) - 1)   # apply the inverse CDF F^{-1}(u) to each

we obtain the following histogram of the points :

Probability histogram of generated samples

which closely matches the original probability density function.
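As a side note, a minimal sketch of how such a comparison plot could be produced with base R (assuming the samples are stored in q as above; the number of breaks is an arbitrary choice) :

hist(q, breaks = 100, freq = FALSE, main = "Samples vs. true density", xlab = "x")   # probability histogram of the samples
curve(exp(x) / (1 + exp(x))^2, add = TRUE, col = "red")                              # overlay the true PDF e^x / (1 + e^x)^2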

Case 1 : Even if we are able to find the cumulative distribution function F(x) analytically, it may not always be possible to find its inverse F^{-1} in closed form. In such cases, we can use Newton's root-finding method to solve for "x" :

f(x)=F(x)-u=0, where u ~ U(0,1)

x_{t+1}=x_t-\frac{f(x_t)}{f'(x_t)}, where f'(x_t) is the derivative of f(x) at x=x_t, which is just the PDF evaluated at x_t

Iterate the above update until the value of x_t converges.
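A minimal sketch of this approach in R, using the sigmoid CDF from above so that the output can be checked against the exact inverse (the function name sample.newton, the starting point x0 and the tolerance are arbitrary choices) :

sample.newton <- function(n, cdf, pdf, x0 = 0, tol = 1e-8, max.iter = 100) {
  sapply(runif(n), function(u) {
    x <- x0
    for (i in 1:max.iter) {
      step <- (cdf(x) - u) / pdf(x)   # Newton update for f(x) = F(x) - u, with f'(x) = PDF
      x <- x - step
      if (abs(step) < tol) break      # stop once the update becomes negligible
    }
    x
  })
}

q <- sample.newton(10000, function(x) 1/(1 + exp(-x)), function(x) exp(x)/(1 + exp(x))^2)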

Case 2 : The CDF F(x) is hard to obtain analytically; for example, for the normal distribution the CDF is expressed in terms of the error function. In such cases we can approximate the CDF by numerically integrating the PDF, using the trapezoidal rule (the simplest choice) or other numerical integration methods, and then invert the approximate CDF as before.
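A minimal sketch of this approach for the standard normal PDF (the integration range [-8, 8], the grid size and the function name sample.numeric.cdf are arbitrary choices; approx is base R linear interpolation, used here to invert the tabulated CDF) :

sample.numeric.cdf <- function(n, pdf, lower = -8, upper = 8, grid.size = 10000) {
  x <- seq(lower, upper, length.out = grid.size)
  f <- pdf(x)
  
  # trapezoidal rule: cumulative area under the PDF up to each grid point
  cdf <- c(0, cumsum((f[-1] + f[-grid.size]) / 2 * diff(x)))
  cdf <- cdf / cdf[grid.size]   # normalize so the last value is exactly 1
  
  # invert the tabulated CDF by interpolating x as a function of the CDF values
  approx(cdf, x, xout = runif(n), ties = "ordered")$y
}

q <- sample.numeric.cdf(10000, function(x) exp(-x^2/2) / sqrt(2*pi))   # approximate standard normal samples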

Case 3 : Transformation Method : Given random samples generated from a distribution Q (from which we already know how to sample), we can generate random samples from our target distribution P by applying a transformation G such that if X ~ Q, then G(X) ~ P. For example, if Q is the skew logistic distribution,

CDF(Q)=(\frac{1}{1+e^{-x}})^{\lambda}

then G(X)=log(1+e^{-X}) follows the exponential distribution.

P(G(X)\;\le\;x)=P(log(1+e^{-X})\;\le\;x)=P(X\;\ge\;-log(e^x-1))=1-P(X\;\le\;-log(e^x-1))

since we know that X has the skew logistic CDF, i.e.

P(X\;\le\;-log(e^x-1))=(\frac{1}{1+e^{log(e^x-1)}})^{\lambda}=e^{-{\lambda}x}

Thus :

P(G(X)\;\le\;x)=1-e^{-{\lambda}x}

G(X) follows the exponential(\lambda) distribution with the above CDF.
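A minimal sketch of this example in R: the skew logistic CDF happens to be directly invertible, so we sample X by the inverse transform, apply G, and compare the empirical quantiles of G(X) with R's built-in exponential quantiles (the value lambda = 2 and the object names are arbitrary choices) :

lambda <- 2
u <- runif(100000)

# inverse of the skew logistic CDF: u = (1/(1+e^(-x)))^lambda  =>  x = -log(u^(-1/lambda) - 1)
x <- -log(u^(-1/lambda) - 1)

y <- log(1 + exp(-x))   # the transformation G(X)

# empirical quantiles of G(X) vs. exponential(lambda) quantiles
quantile(y, c(0.25, 0.5, 0.75, 0.9))
qexp(c(0.25, 0.5, 0.75, 0.9), rate = lambda)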

Generalizing the above transformation method of sampling :

If the random variable X has probability density Q(X) and Y=G(X), then the density of Y is :

P(Y)=Q(G^{-1}(Y))|\frac{dG^{-1}(Y)}{dY}|

In the above example, differentiating the CDF of Q gives the density function :

Q(X)=\frac{{\lambda}e^{{\lambda}x}}{(e^x+1)^{\lambda+1}}

G(X)=log(1+e^{-X}), hence G^{-1}(Y)=-log(e^Y-1)

Q(G^{-1}(Y))=\frac{{\lambda}(e^Y-1)}{e^{(\lambda+1)Y}}

\frac{dG^{-1}(Y)}{dY}=-\frac{e^Y}{e^Y-1}

P(Y)=Q(G^{-1}(Y))|\frac{dG^{-1}(Y)}{dY}|={\lambda}e^{-{\lambda}Y}

which is the density of the exponential(\lambda) distribution, consistent with the CDF 1-e^{-{\lambda}Y} obtained above.

The following tutorial is the primary source of this article. It also contains an explanation of rejection sampling, which we will cover later when we discuss Monte Carlo sampling techniques in detail.

Categories: MACHINE LEARNING, PROBLEM SOLVING
