Machine Learning, AI and Programming

Common and not so common Machine Learning Questions and Answers (Part II)

In word2vec training, the objective is to have semantically and syntactically similar words close to each other in terms of the cosine distance between their word vectors. In the skip-gram architecture, the probability of a word 'c' being predicted as a context word at the output node, given the target word 'w' and the input and output weight matrices U and V, is given by the softmax function:

P(c|w, U, V)=\frac{e^{U^T_wV_c}}{{\sum_q}e^{U^T_wV_q}}

In the denominator, the summation is over all possible words. Since the number of words in the corpus can be in the billions, it would be computationally infeasible to compute the denominator on each forward pass of the algorithm. The objective, however, is only to obtain a higher probability for a true context word than for a word not in the context of 'w'. Thus, instead of using all the words in the corpus to compute the probability, we randomly sample a subset of words which are assumed to be "negative" (non-context) words w.r.t. the word 'w', and compute the summation only over these sampled words along with the true context word.

To make this clearer, let K be the set of negative samples for the word 'w'. Randomly sample 'k' such negative samples from K and call this set K'. Then for a true context word 'a', the updated probability is given by:

P(a|w, U, V)=\frac{H_a}{H_a + G}, where H_a=e^{U^T_wV_a} and G=\sum_{q{\in}K'}e^{U^T_wV_q}

Similarly, for a non-context word 'b' sampled into K', the updated probability is given by:

P(b|w, U, V)=\frac{H_b}{H_a + G}, where H_b=e^{U^T_wV_b} is already included inside G.

Since the backpropagation algorithm ensures that V_a is closer to U_w than V_b is, we have

U^T_wV_a\;>\;U^T_wV_b, which implies  H_a > H_b

P(a|w, U, V)-P(b|w, U, V)=\frac{H_a-H_b}{H_a + G} > 0, since H_a > H_b

Thus we are still able to maintain the condition that the probability of predicting a true context word is higher than the probability of predicting a false context word. Although the magnitudes of the probabilities and the value of the above difference might change on including all the words in the denominator, the relative ordering between a context and a non-context word would remain the same. Found a nice discussion on this on StackOverflow.

Derivation of negative sampling strategy.
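The ordering argument above can be checked numerically. Below is a minimal NumPy sketch; the toy vocabulary size, embedding dimension and sampled word indices are illustrative assumptions, not values from the original:

```python
import numpy as np

# Toy setup (vocabulary size, dimension and word indices are illustrative).
rng = np.random.default_rng(0)
vocab, dim = 8, 4
U = rng.normal(size=(vocab, dim))   # input (target) word vectors, rows are U_w
V = rng.normal(size=(vocab, dim))   # output (context) word vectors, rows are V_c

w, a, b = 0, 1, 2                   # target word, true context word, negative word
K_prime = [2, 5, 7]                 # sampled negative words K' (includes b)

G = sum(np.exp(U[w] @ V[q]) for q in K_prime)
H_a = np.exp(U[w] @ V[a])
H_b = np.exp(U[w] @ V[b])

P_a = H_a / (H_a + G)               # probability of the true context word
P_b = H_b / (H_a + G)               # probability of a negative word; H_b is part of G
```

Since the denominator is shared, whenever H_a > H_b the difference P_a - P_b is positive, exactly as in the derivation above.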

  • Why do we not apply a non-linear activation function to the hidden layer during training of word vectors?

In the training of word vectors, we want the cosine similarity between similar words to be higher than between dissimilar words. The "similarity" of words is considered in the context of semantic and syntactic similarity. Words such as "Car" and "Bus" (vehicles) or "France" and "Germany" (country names) are semantically similar words, whereas words such as "Box" and "Boxes" (plural) or "Man" and "Woman" (gender) are syntactically similar words.

In the word2vec training algorithm, the input word weights are represented by the matrix U, i.e. U_{car} represents the input word vector for the word "car", and the output word weights are represented by the matrix V, i.e. V_{bus} represents the output word vector for the word "bus". The objective is to maximize the value of the following dot product:

U^T_{car}V_{bus} or U^T_{bus}V_{car}

as compared to that with any other random word, such as U^T_{car}V_{book}, because the word "bus" is much more likely to occur in the context of the word "car" than the word "book". Since the dot products are not probabilities, we convert them into probabilities using the softmax function (as shown in the previous answer). The objective function to maximize:

P(bus|car, U, V)=\frac{e^{U^T_{car}V_{bus}}}{{\sum_q}e^{U^T_{car}V_q}}

directly corresponds to maximizing the dot product above. Thus the objective we are optimizing in the neural network is in alignment with our desired objective. Note that the target word and the context words are interchangeable, since the dot product defines a symmetric distance, i.e. the cosine distance from the word "car" to "bus" is equal to the cosine distance from the word "bus" to "car".

Now if we had used a non-linear activation such as the sigmoid, the output from the hidden layer would be the vector:

\sigma(U_{car})

where \sigma(U_{car}) is the vector formed by applying the sigmoid transformation to each element of the original input vector U_{car}. Thus the new dot product will be given as:

\sigma(U_{car})^TV_{bus}
Note that with the sigmoid applied, we would no longer be optimizing the cosine similarity between the input word vectors and the output word vectors. Thus we do not use a non-linearity in the hidden layer when training word vectors. Also note that in the skip-gram architecture, given a word as input, the neural network predicts the context words. Based on our assumption that, given a target word, its true context words are closer to the target word (and vice-versa), proper training should ideally produce a linear separating boundary between the true context words and the non-context words.

Linear Boundary separating context words (blue) from non-context words (red) for target word (green).

But a context word (blue dot) being closer to a non-context word (red dot) does not necessarily mean that the separating boundary is not linear. Thus the data and the objective function ensure that linear hidden units work well for this problem, but the reverse is not true, i.e. linear hidden units alone do not ensure that the objective function converges.
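The contrast between the linear and sigmoid hidden layers can be made concrete with a small NumPy sketch; the vectors here are random stand-ins for U_{car} and V_{bus}, not trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5
U_car = rng.normal(size=dim)   # stand-in input vector for "car"
V_bus = rng.normal(size=dim)   # stand-in output vector for "bus"

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# With a linear hidden layer, the score the network optimizes is the raw dot product.
linear_score = U_car @ V_bus

# With a sigmoid hidden layer, each component of U_car is first squashed into (0, 1),
# so the optimized quantity is no longer the dot product of the raw word vectors.
nonlinear_score = sigmoid(U_car) @ V_bus
```

The sigmoid maps every component into (0, 1), so negative coordinates of U_{car} can no longer cancel against V_{bus}, which is why the optimized score stops corresponding to the desired similarity.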

  • When and why are discriminative classifiers better than generative classifiers, and vice-versa?

According to this paper, discriminative classifiers perform better than generative classifiers when the amount of training data is huge. Both classifiers have an asymptotic error bound, i.e. the error does not reduce further even on increasing the training data beyond a certain point. But the asymptotic error for the discriminative classifier is lower than that of the generative classifier, which means that when the data is huge, the discriminative classifier almost always gives better performance.

The paper says that the generative classifier reaches this critical point much faster than a discriminative classifier, i.e. when neither classifier has reached its critical point, the generative classifier usually performs better than the discriminative variant, but once the generative classifier has reached its critical point, the discriminative classifier performs better beyond that. A more concise explanation.

Error with generative (solid line) vs. discriminative classifier (dashed line)

Note that in all 3 plots, after a certain point the dashed line (representing discriminative classification error) goes below the solid line (representing generative classification error).
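This crossover can be reproduced empirically, assuming scikit-learn is available; the synthetic dataset, training-set sizes and model choices below are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data; the last 500 rows are held out for testing.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_test, y_test = X[1500:], y[1500:]

for n in (20, 100, 1500):                         # increasing training-set sizes
    gen = GaussianNB().fit(X[:n], y[:n])          # generative classifier
    dis = LogisticRegression(max_iter=1000).fit(X[:n], y[:n])  # discriminative
    print(n, gen.score(X_test, y_test), dis.score(X_test, y_test))
```

On small training sets the naive Bayes model often edges ahead, while with the full training set the logistic regression typically catches up or overtakes it, mirroring the dashed-vs-solid crossover in the plots.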

  • Why is normalization of inputs required when training neural networks?

According to this tutorial, covariate shift between training and testing data occurs when the conditional probability distribution of the class labels 'y' given the input data X remains the same between the training and the testing data, but the probability distribution of the input data itself differs:

P_{\text{train}}(y|X)=P_{\text{test}}(y|X), but

P_{\text{train}}(X){\neq}P_{\text{test}}(X)
One might wonder why we care about different input distributions in training and testing if the desired outputs, i.e. the class conditional probabilities, are the same. This is because the discriminative models we learn are generally parametric models: for example, we learn the inter-layer weights as parameters in a neural network, the coefficients in logistic regression, etc. These parameters are learnt by maximizing the likelihood of the input data, and if the distribution of the input data changes, the learnt parameters of the model will no longer hold true.

During neural network training, in one forward pass of the algorithm, the inputs to each hidden layer and the output layer follow a particular probability distribution. During backpropagation, the weights get updated, and as a result, in the next forward pass the probability distribution of the inputs to each layer changes compared to the previous pass. This is known as "internal covariate shift". Due to the change in input distribution, some inputs can become too small whereas others can become very large. As we know, in the sigmoid activation function \sigma(x), if the value of x is too large (positive) or too small (negative), the slope of the curve approaches 0, and thus the gradient becomes very small. During SGD training, if the gradients become too small, the weights for these nodes will not get updated fast enough and it will take a very long time for the network to converge.
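The saturation effect described above is easy to verify numerically. A small sketch of the sigmoid and its derivative \sigma(x)(1-\sigma(x)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # derivative of the sigmoid

print(sigmoid_grad(0.0))    # 0.25, the maximum possible slope
print(sigmoid_grad(10.0))   # ≈ 4.54e-05: the gradient has all but vanished
```

At |x| = 10 the gradient is already more than three orders of magnitude smaller than at x = 0, which is why saturated units learn so slowly under SGD.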

To overcome this, a method known as Batch Normalization is used. In batch normalization, the linear outputs from each hidden layer are grouped into batches, and each batch is normalized to zero mean and unit variance before applying the non-linear activation function and passing the result as input to the next layer. If the inputs to a hidden layer are denoted as x, then after combining them linearly with the weights W at the hidden layer, we obtain the linear output:

\widehat{x}=W^Tx
Now \widehat{x} is grouped into n batches [\widehat{x}^{(1)}, \widehat{x}^{(2)}, ..., \widehat{x}^{(n)}], each of size m

For each batch, compute its mean and variance:

{\mu}^{(i)}=\frac{\sum_{j=1}^m\widehat{x}^{(i)}_j}{m},   {{\sigma}^2}^{(i)}=\frac{\sum_{j=1}^m(\widehat{x}^{(i)}_j-{\mu}^{(i)})^2}{m}

Then normalize the values as:

\widehat{x}^{(i)}_j\;{\leftarrow}\;\frac{\widehat{x}^{(i)}_j-{\mu}^{(i)}}{\sqrt{{{\sigma}^2}^{(i)}+\epsilon}}

where \epsilon is a small constant added for numerical stability.
These values are then transformed:

\widehat{y}^{(i)}_j={\gamma}^{(i)}\widehat{x}^{(i)}_j+{\beta}^{(i)}
where \gamma^{(i)} and \beta^{(i)} are learnable parameters for each batch and are updated using batch SGD during backpropagation.

Now the non-linear activation is applied to \widehat{y}^{(i)}_j and the result is passed as input to the next layer.
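The normalize-then-transform steps above can be sketched in NumPy. This is a simplified forward pass only: here \gamma and \beta are fixed rather than learnt, and \epsilon is the small constant for numerical stability:

```python
import numpy as np

def batch_norm_forward(x_hat, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of linear outputs per feature, then scale and shift."""
    mu = x_hat.mean(axis=0)                      # batch mean
    var = x_hat.var(axis=0)                      # batch variance
    x_norm = (x_hat - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_norm + beta                 # learnable scale and shift

rng = np.random.default_rng(0)
x_hat = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # one batch of linear outputs
y = batch_norm_forward(x_hat, gamma=np.ones(4), beta=np.zeros(4))
```

With \gamma = 1 and \beta = 0 the output has zero mean and unit variance per feature; during training these two parameters let the network recover any other scale if that helps the objective.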

Batch normalization can help accelerate the learning process by allowing higher learning rates to converge. There are some pretty good resources to learn about batch normalization.

[Resource 1, Resource 2, Resource 3]

