Stokastik

Machine Learning, AI and Programming

Heard in the Cafeteria - ML Interview Questions - Part 1

Probability Distributions

  1. Given a discrete probability distribution, how would you sample M points from that distribution? Write code in Python.
    • Given the probability distribution of the six faces of a die, simulate random throws.
    • What would be the time complexity of the approach when the die is fair?
    • What would be the time complexity of the approach when the die is loaded?
  2. How would you sample M points from a continuous distribution instead of a discrete one?
  3. How would you obtain samples from a univariate normal distribution with mean \mu and standard deviation \sigma, given a uniform distribution on [0, 1]? Write code in Python.
    • How would you sample from a normal distribution using a standard Bernoulli generator?
  4. How would you obtain samples from a multivariate normal distribution with mean \mu and covariance matrix \Sigma, given a uniform distribution on [0, 1]? Write code in Python.
  5. How would you measure whether samples drawn from two probability distributions come from the same distribution or not?
  6. Given a function that produces uniform random numbers between 0 and 7, write a function that generates uniform random numbers between 1 and 5.
  7. Explain the binomial distribution with an example. Draw a plot to illustrate the binomial distribution.
  8. Explain the Central Limit Theorem with an example. Why is the Central Limit Theorem useful?
  9. Why is the normal distribution such an important distribution?
  10. Which machine learning algorithms assume that the data follows a normal distribution?
  11. Let X be a random variable with mean \mu=10 and standard deviation \sigma=4. A sample of size 100 is taken from the population. Find the probability that the sum of these observations is less than 900.
  12. From past experience, it is known that the number of tickets purchased by a student standing in line at the ticket window for the football match of UCLA against USC follows a distribution with mean \mu=2.4 and standard deviation \sigma=2.0. Suppose that a few hours before the start of one of these matches there are 100 eager students standing in line to purchase tickets. If only 250 tickets remain, what is the probability that all 100 students will be able to purchase the tickets they desire?
  13. Suppose that you have a sample of 100 values from a population with mean \mu=500 and standard deviation \sigma=80.0.
    • What is the probability that the sample mean will be in the interval (490, 510)?
    • Give an interval that covers the middle 95% of the distribution of the sample mean.
  14. Suppose you toss a fair coin 400 times. What is the probability that you get at least 220 heads? (Use the CLT.)
  15. Given that your ML model achieved 92% accuracy on 1000 validation examples, how would you estimate a 95% confidence interval for its accuracy?
  16. Tossing a coin ten times resulted in 8 heads and 2 tails. How would you analyze whether the coin is fair? What is the p-value?
    • You have 10 coins. You toss each coin 10 times (100 tosses in total) and observe the results. Would you modify the way you test the fairness of the coins?
  17. Explain the beta distribution with an example.
  18. Given samples from two normal distributions A ~ N(5, 2) and B ~ N(7, 3), what is the probability that A > B?
  19. A product on Amazon is being sold by 2 sellers. The ratings and the number of reviews for the 2 sellers are as follows:
    • Seller 1: 97% positive with 50 reviews, Seller 2: 98% positive with 25 reviews.
    • Which seller would you buy from, and why?
    • Seller 1: 97% positive with 500 reviews, Seller 2: 98% positive with 250 reviews.
    • Which seller would you buy from, and why?
    • What if there are more than 2 sellers?
    • Similar question: a model running in production, developed by an external vendor, has 94.6% accuracy tested on 100 test examples, and an in-house model you developed achieves 94.8% accuracy on a different set of 80 test examples. Should you replace the current model with the new one?
  20. What is the role of conjugate priors? Why are they useful?


Answers and Hints:

Q.1

import random

def bsearch_slot(cdf, u):
    # Binary search for the first CDF entry >= u.
    left, right = 0, len(cdf)-1
    
    while left < right:
        mid = (left + right) // 2
        
        if cdf[mid] < u:
            left = mid+1
        else:
            right = mid
            
    return left

def simulate(n, probs):
    sampled = [0]*n
    
    # Build the cumulative distribution from the outcome probabilities.
    cdf = [0]*len(probs)
    for i in range(len(probs)):
        cdf[i] = probs[i] + cdf[i-1] if i > 0 else probs[i]
        
    # Draw a uniform number in [0, 1) and map it to an outcome via the CDF.
    for i in range(n):
        u = random.random()
        sampled[i] = bsearch_slot(cdf, u)
        
    return sampled

The above algorithm handles the general case, where we sample 'n' values from the distribution given by 'probs'. When all probs[i] are equal (a fair die), we just need to compute int(u*m), where u is a uniform random number in [0, 1) and m is the number of faces.

def simulate_fair(n, m):
    sampled = [0]*n
        
    for i in range(n):
        # u is in [0, 1), so int(u*m) is always in {0, ..., m-1}.
        u = random.random()
        sampled[i] = int(u*m)
        
    return sampled

Time complexity for the general case is O(n*log(m)+m), where m is the number of possible outcomes (e.g. 6 for a die): O(m) to build the CDF and O(log m) per sample. For the fair die it is O(n). (The alias method can bring the loaded-die case down to O(n+m) as well, at the cost of a more involved preprocessing step.)


Q.2

In the case of continuous distributions, we can use the Inverse Transform Sampling technique to obtain samples from a given continuous distribution. Below is the code for sampling from an exponential distribution.

import numpy as np
import random
from matplotlib.pyplot import hist

def inverse_cdf(u, k):
    # Inverse CDF of the exponential distribution with scale k (mean k).
    return -k*np.log(1-u)

def simulate(n):
    sampled = [0]*n
        
    for i in range(n):
        u = random.uniform(0, 1)
        sampled[i] = inverse_cdf(u, 0.5)
        
    return sampled

a = simulate(100000)
hist(a, bins=100, density=True)

Samples from an exponential distribution

The steps for sampling are:

  • Generate a random uniform number u between 0 and 1.
  • Let the CDF of the distribution be F(x). Then the sampled value is F^{-1}(u).

This strategy does not directly work for the normal distribution, because its CDF has no closed-form expression in elementary functions. However, the CDF of the standard normal can be written in terms of the error function, whose inverse is available numerically:

F(x) = CDF(x~N(0,1)) = \frac{1}{2}*(1+\text{erf}(\frac{x}{\sqrt{2}}))
F^{-1}(u) = \sqrt{2}*\text{erf}^{-1}(2*u-1)

import numpy as np
import random
from scipy.special import erfinv
from matplotlib.pyplot import hist

def inverse_cdf(u):
    # Inverse CDF of the standard normal, via the inverse error function.
    return erfinv(2*u-1)*np.sqrt(2)

def simulate(n, mu, sigma):
    sampled = [0]*n
        
    for i in range(n):
        u = random.uniform(0, 1)
        # Shift and scale the standard normal sample to N(mu, sigma).
        sampled[i] = inverse_cdf(u)*sigma+mu
        
    return sampled

a = simulate(100000, 5.0, 2.0)
hist(a, bins=100, density=True)

Samples from a normal distribution


Q.3

import numpy as np
from matplotlib.pyplot import hist

def simulate(n, mu=0.0, sigma=1.0):
    k = 100
    
    # Mean of k uniforms has mean 0.5 and std 1/sqrt(12*k), by the CLT.
    rand_uniforms = np.random.uniform(0, 1, (n, k))
    mean_uniforms = np.mean(rand_uniforms, axis=1)
    
    # Standardize, then shift and scale to N(mu, sigma).
    sampled = (mean_uniforms-0.5)*np.sqrt(12*k)*sigma + mu
        
    return sampled

a = simulate(100000, 5.0, 2.0)
hist(a, bins=100, density=True)

The above algorithm uses the Central Limit Theorem to generate samples from a normal distribution with mean 5.0 and standard deviation 2.0. For each sample:

  • Sample 100 uniform random numbers between 0 and 1.
  • Compute the mean of the 100 uniform random numbers.
  • A uniform random number on [0,1] has mean 0.5 and standard deviation sqrt(1/12)=0.289.
  • By the CLT, the mean of the 100 uniform random numbers approximately follows a normal distribution with mean = 0.5 and sigma = sqrt(1/(12*100)).
  • Standardize this mean, then shift and scale it into a sample from a normal distribution with mean = 5.0 and sigma = 2.0.

Sampling normally distributed data from Bernoulli generator:

import numpy as np
from matplotlib.pyplot import hist

def simulate(n, mu=0.0, sigma=1.0):
    sampled = [0]*n
        
    for i in range(n):
        u_list = [0]*100
        for j in range(100):
            u_list[j] = np.random.binomial(1, 0.5)
        # Mean of 100 Bernoulli(0.5) draws: mean 0.5, std 0.5/sqrt(100).
        x = sum(u_list)/100.0
        # Standardize by dividing by the std, then shift and scale to N(mu, sigma).
        sampled[i] = sigma*(x-0.5)*np.sqrt(100)/0.5 + mu
        
    return sampled

a = simulate(100000, 5.0, 2.0)
hist(a, bins=100, density=True)

Using Box-Muller Transform:

import numpy as np
import random
from matplotlib.pyplot import hist

def simulate_box_muller(n, mu=0.0, sigma=1.0):
    sampled = [0]*n
        
    for i in range(n):
        # u1 is kept away from 0 to avoid log(0).
        u1 = random.uniform(1e-5, 1)
        u2 = random.uniform(1e-5, 1)
        z0 = np.sqrt(-2*np.log(u1))*np.cos(2*np.pi*u2)
        # z1 is a second, independent standard normal sample that could also be used.
        z1 = np.sqrt(-2*np.log(u1))*np.sin(2*np.pi*u2)
        
        sampled[i] = sigma*z0+mu
        
    return sampled

a = simulate_box_muller(100000, 5.0, 2.0)
hist(a, bins=100, density=True)


Q.4

Generate a multivariate Gaussian from a uniform distribution.

Python code to generate 2-D Gaussian samples:

import numpy as np
from sklearn.datasets import make_spd_matrix
import matplotlib.pyplot as plt

def simulate(n, m, mu, cov_mat):
    sampled = np.zeros((m, n))
    # Factor the covariance matrix: cov_mat = V * D * V^T.
    eig_vals, eig_vecs = np.linalg.eig(cov_mat)
    k = 100
    
    # Generate m independent rows of standard normal samples using the CLT.
    for i in range(m):
        rand_uniforms = np.random.uniform(0, 1, (n, k))
        mean_uniforms = np.mean(rand_uniforms, axis=1)
        sampled[i] = (mean_uniforms-0.5)*np.sqrt(k)*np.sqrt(12)
    
    # Transform with V * sqrt(D) to obtain the desired covariance, then add the mean.
    sampled = np.dot(np.dot(eig_vecs, np.diag(np.sqrt(eig_vals))), sampled)
    return (sampled.T+mu).T

m = 2
cov_mat = make_spd_matrix(m)
mu = np.random.uniform(0, 10, m)
a = simulate(100000, m, mu, cov_mat)
print(mu)
plt.scatter(a[0], a[1])
plt.show()

Scatter plot of the 2-D Gaussian samples

Q.5

Z-test - compare the sampled distribution's mean with the true distribution's mean. The following code compares samples generated by the Box-Muller transform with the normal distribution N(5.0, 2.0):

import numpy as np
from scipy import stats

n, mu, sigma = 100000, 5.0, 2.0
a = simulate_box_muller(n, mu, sigma)
x = sum(a)/float(n)
z = (x-mu)*np.sqrt(n)/sigma
# Two-sided p-value from the standard normal survival function.
p_value = 2*stats.norm.sf(abs(z))
print(z, p_value)

A high absolute value of z, or a very low p-value, indicates that the sampled distribution's mean differs from the true distribution mean, i.e. 5.0. The same test can also be used to compare two different samples from the Box-Muller transform, to check whether the means of the two samples are equal:

import numpy as np
from scipy import stats

n1, mu1, sigma1 = 100000, 5.0, 2.0
n2, mu2, sigma2 = 50000, 5.5, 2.5

a = simulate_box_muller(n1, mu1, sigma1)
b = simulate_box_muller(n2, mu2, sigma2)

x1 = sum(a)/float(n1)
x2 = sum(b)/float(n2)

# Two-sample z statistic for the difference of means.
z = (x1-x2)/np.sqrt(sigma1*sigma1/n1 + sigma2*sigma2/n2)

p_value = 2*stats.norm.sf(abs(z))
print(z, p_value)

But the Z-test assumes that the samples and the population are normally distributed, and it only compares means: two normal distributions with the same mean but different standard deviations will yield a p-value suggesting that they are the same.

Kolmogorov-Smirnov test - a non-parametric test; it makes no assumption about the underlying distribution.

Compute the empirical CDFs of the two samples (or of one sample and one true distribution) and take the maximum absolute difference between the two CDFs at any point.

import numpy as np

def ecdf(x, bins):
    # Empirical CDF of x evaluated at each bin edge.
    n, m = len(x), len(bins)
    y = sorted(x)
    cdfs = [0.0]*m
    i, j = 0, 0
    while j < n and i < m:
        if y[j] <= bins[i]:
            cdfs[i] += 1.0
            j += 1
        else:
            # Carry the cumulative count forward to the next bin.
            i += 1
            if i < m:
                cdfs[i] += cdfs[i-1]
    
    # Fill any remaining bins with the final cumulative count.
    while i < m-1:
        i += 1
        cdfs[i] += cdfs[i-1]
        
    return np.array(cdfs)/n

def get_bins(a, b):
    # Evenly spaced evaluation points over the overlap of the two sample ranges.
    m = min(len(a), len(b))
    x = max(min(a), min(b))
    y = min(max(a), max(b))
    u = (y-x)/float(m)
    bins = [0]*m
    for i in range(m):
        bins[i] = x+i*u
    
    return bins

n1, mu1, sigma1 = 100000, 5.0, 2.0
n2, mu2, sigma2 = 100000, 5.5, 2.0

a = simulate_box_muller(n1, mu1, sigma1)
b = simulate_box_muller(n2, mu2, sigma2)

bins = get_bins(a, b)

a1 = ecdf(a, bins)
b1 = ecdf(b, bins)

# KS statistic: maximum absolute difference between the two empirical CDFs.
d = np.max(abs(a1-b1))

Using the built-in function from the stats package:

from scipy import stats

n1, mu1, sigma1 = 100000, 5.0, 2.0
n2, mu2, sigma2 = 100000, 5.5, 2.0

a = simulate_box_muller(n1, mu1, sigma1)
b = simulate_box_muller(n2, mu2, sigma2)

print(stats.ks_2samp(a, b))


Q.6

Let U(0,7) be the function that produces uniform random numbers between 0 and 7. Then a + U(0,7)*(b-a)/7 generates uniform random numbers between a and b; with a=1 and b=5 this gives 1 + U(0,7)*4/7. (This scaling works for a continuous uniform source; if the source instead produced discrete integers 0 to 7, rejection sampling would be needed.)
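
A minimal sketch, where rand07() is a hypothetical stand-in for the given generator:

import random

def rand07():
    # Stand-in for the given source: uniform random number in [0, 7].
    return random.uniform(0, 7)

def rand15():
    # Scale [0, 7] down to a width-4 interval, then shift to [1, 5].
    return 1 + rand07()*(5-1)/7.0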

Q.7

Let p be the probability of getting heads on a coin toss. The probability of getting x heads in N tosses is

P(x) = {{N}\choose{x}}*p^x*(1-p)^{N-x}
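
For the plot, a small sketch using scipy's binomial PMF (the parameters N=20 and p=0.5 are arbitrary choices for illustration):

import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

N, p = 20, 0.5
x = np.arange(0, N+1)
# PMF of the number of heads in N tosses of a coin with P(heads) = p.
plt.bar(x, binom.pmf(x, N, p))
plt.xlabel('number of heads')
plt.ylabel('probability')
plt.show()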

Q.8

Central Limit Theorem - given any distribution with mean \mu and standard deviation \sigma, if we repeatedly draw samples of size n and take their mean, then for large n the sample means approximately follow a normal distribution N(\mu, \frac{\sigma}{\sqrt{n}}), regardless of the shape of the original distribution. This is useful because it lets us reason about sample means and sums with normal-distribution tools (confidence intervals, z-tests) even when the population distribution is unknown. For example, the average of 100 rolls of a fair die is approximately normal even though a single roll is uniform.
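
A quick sketch illustrating the theorem: sample means drawn from a heavily skewed exponential distribution still look approximately normal (the sample size of 50 and the number of repetitions are arbitrary choices):

import numpy as np
from matplotlib.pyplot import hist

# 100000 repetitions: each row is a sample of size 50 from an exponential distribution.
means = np.mean(np.random.exponential(scale=2.0, size=(100000, 50)), axis=1)
hist(means, bins=100, density=True)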

Q.9

The normal distribution is important for several reasons:

  • By the Central Limit Theorem, sums and averages of many independent random variables are approximately normal, so it arises whenever a quantity aggregates many small independent effects.
  • It is fully specified by just two parameters, the mean and the variance.
  • It is closed under addition of independent normals and under linear transformations, which keeps derivations tractable.
  • It is the maximum-entropy distribution for a fixed mean and variance.
  • Many statistical tests and machine learning methods assume (approximate) normality of the data or of the errors.

Q.10

Examples include Gaussian Mixture Models, Gaussian Naive Bayes, and Linear/Quadratic Discriminant Analysis, all of which model the data (or each class) as normally distributed. Linear regression likewise assumes normally distributed errors when fitted by maximum likelihood.

Q.11

According to the Central Limit Theorem, the sum of n samples from a distribution with mean \mu and std. dev. \sigma is approximately normally distributed as N(n\mu,\sqrt{n}\sigma). Let T be the sum of the n=100 random samples, so T ~ N(n\mu,\sqrt{n}\sigma), and the standardized variable is

z=\frac{T-n\mu}{\sqrt{n}\sigma}

Thus P(T < 900) = P(z < \frac{900-100*10}{\sqrt{100}*4.0}) = P(z < -2.5) = CDF(-2.5) = 0.0062

import numpy as np
from scipy.special import erf

def cdf_std_normal(x):
    # CDF of the standard normal, expressed via the error function.
    return 0.5*(1 + erf(x/np.sqrt(2)))
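
For example, evaluating the helper above reproduces the answer:

print(cdf_std_normal(-2.5))   # ~0.0062

The same helper answers Q.12-Q.14 as well, since each reduces to evaluating the standard normal CDF at a z value.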

Q.12

P(T < 250) = P(z < \frac{250-100*2.4}{\sqrt{100}*2.0}) = P(z < 0.5) = CDF(0.5) = 0.69

Q.13

Using CLT:

P(490 < X < 510) = P(\frac{(490-500)*10}{80} < z < \frac{(510-500)*10}{80}) = P(-1.25 < z < 1.25) = P(z < 1.25) - P(z < -1.25) = 0.79

For a 95% confidence interval we need -1.96 < z < 1.96, thus \frac{(x_{lo}-500)*10}{80} = -1.96 and \frac{(x_{hi}-500)*10}{80} = 1.96, which gives x_{lo} = 484.3 and x_{hi} = 515.7.

Q.14

The population is a Bernoulli distribution with \mu=0.5 and \sigma=0.5.

We use the summation formula for CLT.

Thus P(T >= 220) = P(z >= \frac{220-400*0.5}{\sqrt{400}*0.5}) = P(z >= 2.0) = 1-P(z < 2.0) = 0.023.

Q.15

Assume a Bernoulli distribution where 1 = prediction is correct and 0 = prediction is incorrect; the probability of a correct prediction is p=0.92 and of an incorrect one is q=0.08.

Thus the population mean is \mu=p=0.92 and \sigma=\sqrt{pq}=0.27.

For 95% confidence interval, we need -1.96 < z < 1.96, thus:

\frac{(x_{lo}-0.92)*\sqrt{1000}}{0.27} = -1.96

\frac{(x_{hi}-0.92)*\sqrt{1000}}{0.27} = 1.96

Solving gives x_{lo} = 90.3% and x_{hi} = 93.7%.
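
A quick sketch of this computation:

import numpy as np

p, n = 0.92, 1000
sigma = np.sqrt(p*(1-p))
# 95% interval: mean +/- 1.96 standard errors.
lo = p - 1.96*sigma/np.sqrt(n)
hi = p + 1.96*sigma/np.sqrt(n)
print(lo, hi)   # ~0.903, ~0.937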

Q.16

P(Number of heads >= 8) = P(Number of tails >= 8) = \frac{{{10}\choose{8}} + {{10}\choose{9}} + {{10}\choose{10}}}{2^{10}}=\frac{56}{1024}=0.055.

The p-value is the probability that a fair coin produces a result at least as extreme as 8 heads or 8 tails out of 10, which is 0.055*2 = 0.11.

Can we also use the Z-test to find the p-value?

z = \frac{T-n\mu}{\sqrt{n}\sigma}=\frac{8-5}{0.5*\sqrt{10}}=1.897

The p-value can be computed using the following code:

2*stats.norm.sf(abs(z))

This gives a p-value of about 0.058, noticeably different from the exact 0.11 above: the sample size is too small for the normal approximation behind the Z-test to be reliable, so the exact binomial computation should be preferred here.
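
The exact two-sided p-value can also be computed directly with the standard library (Python 3.8+ for math.comb):

from math import comb

# P(#heads >= 8) under a fair coin; doubled for the two-sided p-value.
p_tail = sum(comb(10, k) for k in range(8, 11))/2.0**10
print(2*p_tail)   # ~0.109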

Q.17

The beta distribution is a probability distribution over probabilities. Take a random experiment such as tossing a coin: although we know the prior probability of an unbiased coin coming up heads is 0.5, experiments with very few trials (fewer than 10 or 100) will show that the fraction of heads is not exactly 0.5 but somewhat below or above it. As the number of trials increases, the deviation from 0.5 shrinks.

The beta distribution describes the distribution over the possible values of the probability of heads given such experiments.

For example, in the case of a coin toss, Beta(1, 1) represents the prior distribution before doing any experiments. This is the uniform distribution: without doing any experiments or knowing anything about the coin, its probability of heads could be anywhere in U(0, 1) (think of a bag containing several biased and unbiased coins, from which you pick one at random).

With N tosses, if the number of heads is n and the number of tails is N-n, then the distribution of the probability of heads is Beta(1+n, 1+N-n), whose mean \frac{1+n}{2+N} is approximately \frac{n}{N}.

As N increases, the distribution starts to look more like a normal distribution and the variance about the mean decreases. This is intuitive: if your experiment is tossing a coin whose probability of heads is around 0.7 a total of 100 times, then across many repetitions of the experiment the number of heads will land somewhere between, say, 60 and 80.

But if instead the experiment is tossing the coin 10000 times, then across many repetitions the fraction of heads will concentrate much more tightly, between, say, 0.68 and 0.72.

Experiments: 10 tosses, 100 tosses, 1000 tosses and 10000 tosses, each showing 70% heads. Each posterior is sampled a million times:

import numpy as np
import matplotlib.pyplot as plt

# Posteriors after observing 70% heads in 10, 100, 1000 and 10000 tosses.
out1 = np.random.beta(1+7, 1+3, 1000000)
out2 = np.random.beta(1+70, 1+30, 1000000)
out3 = np.random.beta(1+700, 1+300, 1000000)
out4 = np.random.beta(1+7000, 1+3000, 1000000)
plt.hist(out1, bins=500, density=True, histtype='step', range=(0.5, 0.9))
plt.hist(out2, bins=500, density=True, histtype='step', range=(0.5, 0.9))
plt.hist(out3, bins=500, density=True, histtype='step', range=(0.5, 0.9))
plt.hist(out4, bins=500, density=True, histtype='step', range=(0.5, 0.9))
plt.show()

Beta distribution with coin tosses.


Q.18

With A ~ N(5, 2) and B ~ N(7, 3), writing N(\mu, \sigma^2) so that the second parameter is the variance, the difference A-B is distributed as N(5-7, 2+3) = N(-2, 5), i.e. mean -2 and variance 5.

Now given X ~ N(-2, 5), we need to find P(X > 0). Note that scipy's norm takes the standard deviation \sqrt{5} as its scale parameter:

import numpy as np
import scipy.stats
1.0 - scipy.stats.norm(-2, np.sqrt(5)).cdf(0)

which is around 0.186.

Without using the scipy library, but using sampling:

import numpy as np
a = np.random.normal(-2, np.sqrt(5), 1000000)
np.sum(a > 0)/float(len(a))

which is also around 0.186.

Q.19

Seller ratings can be modeled with the beta distribution, because our uncertainty about a seller's true positive-rating probability updates as the number of ratings grows. As in the coin toss example, view a positive rating as heads and a negative rating as tails. Then, given n positive ratings out of N total ratings for a seller, the distribution of the seller's positive-rating probability is:

Positive Ratings ~ Beta(1+n, 1+N-n)

In the 1st case: Seller 1: 97% positive with 50 reviews; Seller 2: 98% positive with 25 reviews.

Seller1 ~ Beta(1 + 0.97*50, 1 + 0.03*50) and Seller2 ~ Beta(1 + 0.98*25, 1 + 0.02*25)

Now, to decide which seller to choose, compute the probability P(Seller1 > Seller2). If this probability is greater than 0.5, choose Seller 1; otherwise choose Seller 2.

Using sampling with Python:

import numpy as np

n = 1000000
a = np.random.beta(1+0.97*50, 1+0.03*50, n)
b = np.random.beta(1+0.98*25, 1+0.02*25, n)

# Fraction of draws in which Seller 1's sampled rating exceeds Seller 2's.
np.sum(a > b)/float(n)

P(Seller1 > Seller2) = 0.52. Thus choose Seller 1 in this case.

In the 2nd case: Seller 1: 97% positive with 500 reviews; Seller 2: 98% positive with 250 reviews.

Seller1 ~ Beta(1 + 0.97*500, 1 + 0.03*500) and Seller2 ~ Beta(1 + 0.98*250, 1 + 0.02*250)

n = 1000000
a = np.random.beta(1+0.97*500, 1+0.03*500, n)
b = np.random.beta(1+0.98*250, 1+0.02*250, n)

np.sum(a > b)/float(n)

P(Seller1 > Seller2) = 0.24. Thus choose Seller 2 in this case.

For more than 2 sellers (say L of them), sample from each seller's Beta posterior and compute, for each seller, the fraction of draws in which that seller has the highest sampled rating; choose the seller with the highest fraction (see the sketch below).
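
A minimal sketch for the multi-seller case, with hypothetical ratings and review counts for three sellers:

import numpy as np

n = 1000000
# (positive fraction, number of reviews) per seller -- illustrative values only.
sellers = [(0.97, 50), (0.98, 25), (0.95, 200)]
samples = np.array([np.random.beta(1+p*N, 1+(1-p)*N, n) for p, N in sellers])

# For each draw, find the seller with the highest sampled rating.
winners = np.argmax(samples, axis=0)
for i in range(len(sellers)):
    print(i, np.mean(winners == i))

Choose the seller that wins the largest fraction of draws.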


Q.20

Suppose we want to sample from a probability distribution X ~ H({\theta}), but we do not know what \theta is. In such a case we treat \theta itself as a random variable with a prior probability distribution {\theta} ~ G({\mu}), where \mu is a fixed hyperparameter of the prior G.

How do we sample from X?

  1. Sample {\theta} ~ G({\mu}).
  2. Draw samples from X ~ H({\theta}) using the \theta from step 1. Call these samples X'.
  3. Using X', can we update \theta?
  4. Bayes' rule: P(\theta|X')=\frac{P(X'|\theta)*P(\theta)}{P(X')}
  5. Draw new samples {\theta} ~ P(\theta|X') and repeat from step 2.

But computing

P(\theta|X')=\frac{P(X'|\theta)*P(\theta)}{P(X')}

is a challenge, because although we can compute the following quantities:

P(X'|\theta)=H(x';{\theta}) and P(\theta)=G({\mu})

But to compute:

P(X')={\int}P(X'|\theta)*P(\theta)d{\theta}

we need to integrate over all possible values of \theta. One approximation is to sample many values of \theta and do the summation:

P(X') \approx {\sum}_kP(X'|{\theta}_k)*P({\theta}_k)

But the approximation can be way off.

But if the prior G is chosen so that the posterior P(\theta|X') belongs to the same family of distributions as G, then we never need to compute P(X'|\theta) and P(X') explicitly: the posterior is obtained from the prior simply by updating its parameters, e.g.

P(\theta|X') = G(\mu')

where \mu' is the prior hyperparameter \mu updated with simple statistics of X' (such as adding observed counts). Priors G for which the posterior stays in the same family as the prior, given the likelihood H(x;{\theta}), are called conjugate priors.
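
A concrete sketch with the Beta-Bernoulli pair, where the Beta prior is conjugate to the Bernoulli likelihood and the posterior update just adds the observed counts:

import numpy as np

# Prior: Beta(2, 2) over the coin's probability of heads.
alpha, beta = 2.0, 2.0

# Observed Bernoulli data: 8 heads, 2 tails.
x = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# Conjugate update: posterior is Beta(alpha + #heads, beta + #tails).
alpha_post = alpha + np.sum(x)
beta_post = beta + len(x) - np.sum(x)

print(alpha_post, beta_post)                 # Beta(10, 4)
print(alpha_post/(alpha_post + beta_post))   # posterior mean ~0.71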


Categories: MACHINE LEARNING, PROBLEM SOLVING
