# Understanding Word Vectors and Word2Vec

Quite recently I have been exploring the Word2Vec tool, for representing words in text documents as vectors. I got the initial ideas about word2vec utility from Google's code archive webpage. The idea behind coming up with this kind of utility caught my interest and later I went on to read the following papers by Mikolov et. al. to better understand the algorithm and its implementation.

As I understand that, representing words in text documents as continuous vectors has been an active area of research for a decade now and there has been many papers published with many different approaches proposed. The core idea behind representing words as vectors is that, to a computer algorithm, individual words in a text document does not convey anything meaningful. For a human, when he sees a group of words in a sentence, he can infer a definite meaning out of the words belonging to that sentence, or when he sees two different words in similar contexts he knows that the words have similar meanings and so on.

To make a computer program understand the "meaning" or "context" of a word, we need to quantify the words using some encoding scheme such that we can directly use this encoding schemes to compute similarity between words. Here we are assuming that the encoding scheme is a continuous vector. If two words $w_1$ and $w_2$ are represented as vectors $v_1$ and $v_2$, then we can simply compute the cosine similarity between these two vectors to understand how similar the words $w_1$ and $w_2$ are.

Traditional methods used to compute the vectors for the words were Singular Value Decomposition, Latent Semantic Analysis, LDA Topic Modeling and so on. Recently researchers have used Neural Network based architectures to compute word vectors. The Neural Network architecture can be naturally adapted to computing word vectors.

Artificial Neural Network with Single Hidden Layer

Assuming you have NN architecture with an input layer, a hidden layer and an output layer, then the hidden layer represents a projection of the inputs. With M input nodes (each input representing a word) and N hidden nodes, there would be weights $w_{ij}$ connecting each input node 'i' to each hidden node 'j' or in other words, each input instance is projected on to a N dimensional vector space using the connection weights. When the NN state reaches an equilibrium, these hidden units acts as the vector representations for the input words.

The methods of generating vectors using SVD or LSA or LDA does not always maintain linear regularities and also is expensive space and time complexity wise. By linear regularities, I mean, the word vectors should have the following additive property : The vector representing [V("King")-V("Man")+V("Queen)] should be very close to the vector V("Woman"), where V("King") represents the vector for the word "King" and so on.

1. "King":"Man" :: "Queen":?
2. "France":"Paris" :: "Belgium":?
3. "Avengers":"Marvel" :: "Justice_League":?

These are some of the analogy questions that you must have encountered in various aptitude tests. The generated word vectors can help us find an answer to these questions by simply computing the vectors closest to the following resultant vectors :

1. V("King")-V("Man")+V("Woman")
2. V("France")-V("Paris")+V("Belgium")
3. V("Avengers")-V("Marvel")+V("Justice_League")

In the rest of the post, we will understand some of the modules of a python implementation of word2vec using the "gensim" library and look into some of the applications where word2vec would be useful.

If you look at the above NN architecture, we already have the inputs (i.e. the individual words and phrases across documents) but we do not yet know the value of the output nodes in the output layer. Note that the word2vec construction is an unsupervised problem, i.e. the word vectors are constructed without any class labels for the documents from where these words are extracted. In that way one need not depend on the true class labels which can be expensive to obtain. Better we only use words in a sentence to train the model, i.e. for a sequence of words from a sentence in the text, how well can we predict a target word given its surrounding words or how well can we predict the surrounding words given an input word.

For example, for the sentence "watch movies rather than read books", the extracted words would be

[("watch"), ("movies"), ("rather"), ("than"), ("read"), ("books")]

In the word2vec implementation, two different architectures has been proposed. One is the "Continuous Bag Of Words (CBOW)" model and the other the "Skip Gram" model. In the CBOW, given the surrounding words (context) as input, goal is to predict the target word, whereas in the Skip Gram, given a input word, goal is to predict the surrounding words. We specify a context size, i.e. how many (maximum) surrounding words to consider for each input/target word. For e.g. for a context size of 2, we consider the 2 words on the left and 2 words on the right as the surrounding words, i.e. in the above case, the context words for "rather" with context size of 2 would be :

("watch", "movies", "than", "read")

Similarly the context words for "read" would be :

("rather", "than", "books")

Thus for CBOW architecture, if the input is [("watch", "movies", "than", "read")], then desired output would be [("rather")] for this particular instance, and in the Skip Gram architecture, if the input is [("rather")] then the desired output would be [("watch", "movies", "than", "read")].

Thus the training instances for CBOW would look like :

• Input : [("movies", "rather")], Output : [("watch")]
• Input : [("watch", "rather", "than)], Output : [("movies")]
• Input : [("watch", "movies", "than", "read")], Output : [("rather")]

and so on.

Similarly for the training instances for Skip Gram, the inputs and the outputs are just reversed. Instead of using the words as strings, better would be to use the index of the word from a lookup table (HashMap) of all the words.

Coming back to defining the architecture for each. Assuming that we have M training instances and we want to represent each word as a N dimensional vector, then the input layer is represented as a one-hot vector, i.e. only one of the input node is 1 rest all are 0. Unlike traditional NN, where the outputs from the hidden layer is transformed using the sigmoid function, the outputs in word vector architectures are passed as simple weighted combination of the inputs.

Normal NN :

Output from hidden node 'j' = $h_j = \sigma(u_j)=\frac{1}{1 + \text{exp}(-u_j)}$

Word Vector NN :

Output from hidden node 'j' = $h_j = u_j$

where $u_j=\sum_{i=1}^Vw_{ij}x_i$

is the input to the j-th hidden node, $w_{ij}$ is the weight of the connection from the i-th input node to the j-th hidden node and $x_i$ is the value of the i-th input node.

In the case of word vector, only one of $x_i=1$ remaining all are 0.

The output layer has V output nodes, one for each unique word. The final output from each node is the softmax.

Value at output node 'k' = $O_k = \frac{\text{exp}(u'_k)}{\sum_{q=1}^{V}\text{exp}(u'_q)}$

where $u'_k=\sum_{j=1}^Nw'_{jk}h_j$

is the input to the k-th output node, $w'_{jk}$ is the weight of the connection from j-th hidden node to the k-th output node and $h_j$ is the output of the j-th hidden node.

The log-loss at each output node 'k' is given by the cross-entropy function, i.e.

$-y_k*log(O_k)-(1-y_k)*log(1-O_k)$

where $y_k=1$ if the actual output is the word at the k-th index else $y_k=0$. The total loss for all output nodes (for a single training instance) is given as :

$E=\sum_{k=1}^V[-y_k*log(O_k)-(1-y_k)*log(1-O_k)]$

The weights $w'_{jk}$ of the connections between hidden layer and the output layer are updated using stochastic gradient descent technique, which is given as :

$w'^{(t+1)}_{jk}=w'^{(t)}_{jk}-\eta*[\frac{\partial{E}}{\partial{w'_{jk}}}]$

where $\eta$ is the learning rate. On solving for the partial derivative in the above equation, we obtain the final form of the SGD update equation for the output layer weights as :

$w'^{(t+1)}_{jk}=w'^{(t)}_{jk}-\eta*(O_k-y_k)*h_j$

Note that the term $(O_k-y_k)$ denotes the error in prediction at the k-th node. This is also known as Back-propagation of errors. Similarly using backpropagation, one can obtain the update equations for the input weights $w_{ij}$. Since only one of the inputs is active at one instant, thus in each training iteration, only for one value of 'i' we need to update the input weights $w_{ij}$, which is given as :

$w^{(t+1)}_{ij}=w^{(t)}_{ij}-\eta*\sum_{k=1}^V(O_k-y_k)*w'^{(t+1)}_{jk}$

The above update equation is applied only for the node 'i', for which the input $x_i=1$.

Note that the update equations for $w'_{jk}$ is applied for all the words in the vocabulary (k=1 to V), whereas the update equations for $w_{ij}$ is applied for only one value of 'i'. Since the number of words V in the vocabulary can be in billions, the first set of update equations is very expensive given that we need to run the updates over multiple iterations of SGD until convergence. In order to overcome this inefficiency, different techniques has been proposed which we will see later.

Continuous Bag Of Words Architecture.

In the CBOW architecture, where we have multiple context words as input, we assume that we have multiple input layers but all the input layers share the same set of weights $w_{ij}$ connecting the input nodes to the hidden nodes. The output from the hidden node 'j' is the average of the weighted inputs for all the context words. Assuming we have C number of context words, the output from hidden layer 'j' is given as :

$h_j=\frac{1}{C}[\sum_{q=1}^Cw_{i_qj}x_{i_q}]$

where the inputs $x_{i_q}$ is the value of $x_i$ in the q-th input layer. Note that in each input layer only one of the inputs has a value of 1 rest all are 0. The update equation for the output weights $w'_{jk}$ remains same as above. The update equations for the input weights $w_{ij}$ is modified to be :

$w^{(t+1)}_{i_qj}=w^{(t)}_{i_qj}-\eta*\frac{1}{C}*\sum_{k=1}^V(O_k-y_k)*w'^{(t+1)}_{jk}$

The above updates are made only for which $x_{i_q}=1$ as defined above.

Skip Gram Architecture

For the Skip Gram architecture, there is only one input word for a training instance, but there are multiple output context words for each input. In this case we assume that we have multiple output layers and all output layers share the same set of output weights $w'_{jk}$. The SGD update equations for the output weights are modified as follows :

$w'^{(t+1)}_{jk}=w'^{(t)}_{jk}-\eta*h_j*\sum_{k_q}(O_{k_q}-y_{k_q})$

where $k_q$ denotes the k-th output node in the q-th output layer. Since there could be maximum of C output layers, the number of different values for $(O_{k_q}-y_{k_q})$ would be maximum of C. Basically the term $\sum_{k_q}(O_{k_q}-y_{k_q})$ represents the sum of all the errors from C different context words prediction. We are not taking the average error because we want to penalize the system for each context word mis-prediction. The update equation for the input weights remains unchanged.

$w^{(t+1)}_{ij}=w^{(t)}_{ij}-\eta*\sum_{k=1}^V\sum_{k_q}(O_{k_q}-y_{k_q})*w'^{(t+1)}_{jk}$

The above equation is valid only for one input where $x_i=1$.

With different experimentations done by the authors of the above two papers, it was found that the Skip Gram architecture performs better than the CBOW architecture on standard test sets by evaluating the word vectors on analogical question and answers demonstrated earlier. One reason for this could be due to the averaging of the input context vectors in CBOW that dissimilar words in a context gets averaged out.

The intuition behind using word and context pairs as training instances for computing word vectors, is that, words which co-occur together many times is more likely to predict each other at the output layer when one of them is given as input at the input layer. For example, given "computer" as an input word to the Skip Gram model, it is more likely to predict context words such as "desktop", "hardware", "programming" compared to words such as "dancing" or "football" and so on, because many sentences (or documents) would have the words "computer" and "programming" occur together and so the number of training instances having "computer" as the input and "programming" as one of the context words would be higher compared to "dancing" as one of the context words.

Coming to the efficiency of computation of the word vectors, as mentioned earlier that we need to update the output weights $w'_{jk}$ at the output layer for all words V in the vocabulary, i.e. k runs from 1 to V, where V can be in billions. This becomes the most time consuming computation. In the implementations done by the authors of word2vec, they experimented with two different techniques. The first one is the Hierarchical Softmax and the second one is Negative Sampling.

In the word2vec computation, the critical part becomes normalizing the output probabilities for each output node using the softmax function.

$O_k = \frac{\text{exp}(u'_k)}{\sum_{q=1}^{V}\text{exp}(u'_q)}$

Note that in the denominator, we need to compute the value of $\text{exp}(u'_q)$ for all words V in the vocabulary, the complexity of which comes out to be O(V). In Hierarchical Softmax, this complexity comes down to O(log(V)). In HS, the output layer is arranged in the form of a binary tree, with the leaves being the output nodes, thus there are V leaf nodes in the tree. For a binary tree with V leaf nodes, there would be V-1 internal nodes.

Hierarchical Softmax

There are no output weights with HS, but instead we need to update weights for each internal node. Each internal node has a weight associated with it. The probability of a particular output node (leaf node in this case) is given by the product of the internal node probabilities on the path from the root to this leaf node. There is no need to normalize this probabilities as the sum of all the leaf node probabilities sum to 1.

Let n(k, j) for an internal node denote that this internal node is the j-th node from the root to the word at index k in the vocabulary, along the path in the tree, then the output probabilities are given as :

$O_k=\prod_{j=1}^{L(k)-1}F(k, j)$, where

$F(k, j)=\sigma(v(k, j))$ if n(k, j+1) is the left child of n(k, j)

else $F(k, j)=\sigma(-v(k, j))$

where $v(k, j)=\sum_{i=1}^Nw'_{i, n(k, j)}h_i$ and $L(k)$ denotes the length of the path from the root of the tree to the word at index k, $w'_{i, n(k, j)}$ denotes the weight of the connection from the hidden node 'i' to the internal node n(k, j) in the tree. It can be shown that :

$\sum_{k=1}^VO_k=1$

hence these probabilities are already normalized. In the above diagram, the path from the root to the word at index 2 is displayed, note the values of n(k, j) at each node in the path. The binary tree can be constructed in any particular way. For the above tree, the output probability of the word at index 2 is given to be :

$O_2=\sigma(v(2, 1))*\sigma(v(2, 2))*\sigma(-v(2, 3))$

Instead of using a balanced binary tree for Hierarchical Softmax, the authors have used Huffman Trees. These trees leverages the frequency of each output word from the training instances to construct the tree in a way such that words which are more frequent are placed near the root whereas words which are less frequent are placed farther from the root. In this way it helps to update the weights of the internal nodes faster.

In the Negative Sampling approach, which is a more intuitive approach, instead of using all the words in the vocabulary to update the output weights, few words are sampled from the vocabulary and these words are used instead. The words that are sampled (apart from the actual output word) are generally chosen in a way such that these words are less likely to be predicted given the input word, so that the word vector of the input word is most affected by only the actual output words and least affected by the output vectors of the remaining sampled words.

In our word2vec example use cases, we are primarily concerned with maximizing the similarity between the input word and the output context words (similarity of their vectors), thus the other words which are not our focus does not contribute much but only increase time complexity.

The negative samples are selected from the words in the documents using their unigram distribution. Words which are very frequent in the text documents are more likely to be negative samples. In the approach suggested, compute the frequencies of the words, raise them to their 3/4th power, normalize them to probabilities and then sample the words from this distribution.

So far we understood the intuition behind word vectors and looked at the implementation details of some of the modules for word2vec algorithm. Now lets get into practicality. We will be using gensim, a python implementation of word2vec. To run the experiments, we will be using the 20 newsgroups dataset. Although the dataset contains labelled documents, but in our case we would be ignoring the class labels for training word2vec.

For a full list of word2vec parameters, kindly refer this blog post on Kaggle. In our experiments, we would be using the following configuration :

• Architecture : Skip Gram (default)
• Training algorithm : Negative Sampling (default)
• Word vector dimensionality (N) = 300
• Context/Window size = 10
• Number of threads = 4
• Minimum word count = 10 (means that words with frequency less than 10 are ignored).
from glob import glob
import re
import nltk
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
class ExtractWords(object):
def __init__(self, dirname):
self.dirname = dirname
def __iter__(self):
dirs = glob(self.dirname + "/*")
for dir in dirs:
files = glob(dir + "/*")
for file in files:
with open(file) as f:
content = [re.sub("\\s+", " ",
re.sub("\\b[a-zA-Z]{1,2}\\b", " ",
re.sub("[^a-zA-Z]", " ", str))).strip() for str in content]
for txt in content:
yield nltk.word_tokenize(txt)
words = ExtractWords("20news-bydate")
model = gensim.models.Word2Vec(words, workers=4, size=300, min_count=40, sample=1e-3, window=10)
model.save('mymodel')

We are pre-processing the 20 newsgroup data by removing all non-alphabetic strings, remove all words with less than 3 characters in length and remove extra spaces. The words are extracted per line from the text file, which means that the maximum context size for a word in a training instance is limited by the number of words in the line from where this word is taken. We are using the nltk.word_tokenize() function to split the lines into words. In the default negative sampling approach, the maximum number of negative samples selected per training instance is set to 5.

The training process took me around 533 seconds on my Mac OS. I also saved the model for future reference. One can play around with the generated word vectors.

print model.most_similar('computer')

The above code will print the 10 most similar words to "computer".

[('keyboard', 0.8442419767379761), ('system', 0.785137414932251), ('workstation', 0.78306645154953), ('serial', 0.7806442975997925), ('digital', 0.7782835960388184), ('internal', 0.7772847414016724), ('application', 0.7747548818588257), ('controller', 0.7701689004898071), ('remote', 0.7686190009117126), ('BIOS', 0.7676270604133606)]

The word "keyboard" is deemed to be most similar to "computer". Looking at the output one can easily conclude that the most similar words are indeed related to "computer". Let's look at another example, this time my input word is "baseball", for which we get the following most similar words.

[('history', 0.7818132638931274), ('starting', 0.7711160778999329), ('bunch', 0.7419493794441223), ('hockey', 0.7399239540100098), ('minor', 0.7383657693862915), ('hearing', 0.733702540397644), ('looked', 0.7205613255500793), ('seeing', 0.719449520111084), ('riding', 0.7175126671791077), ('bikes', 0.7145760655403137)]

Apart from "hockey" none of the other words come close to the context of "baseball".

print model.doesnt_match("computer program religion hardware".split())

In the above, we are trying to find the most dissimilar word among the 4 words "computer", "hardware", "religion" and "program". As expected the output from the above is "religion".

print model.most_similar(positive=['father','mother'],negative=['woman'])

The above code is for answering the following analogical question. "mother":"woman" :: "father" : ?

I had expected "man" or "men" to come in the top 10 most similar words (although family members such as "brother", "daughter" and "son" come in top 3 positions).

[('brother', 0.825019896030426), ('daughter', 0.7538723945617676), ('son', 0.7393611669540405), ('followers', 0.7366220951080322), ('desk', 0.7152739763259888), ('prostitute', 0.7123449444770813), ('wife', 0.7114832401275635), ('Savior', 0.7107428312301636), ('Satan', 0.7011507153511047), ('friend', 0.6970185041427612)]

Note that we have selected the examples very carefully, since the newsgroup documents cover only 20 or so different topics, so obviously it would be foolish to assume that generic queries like "France":"Paris" :: "Belgium":? would work well since these words would probably not occur even once in the training instances. Also the data used to train word2vec is not sufficient with 20 newsgroup data hence some of the answers obtained above are not quite as per our expectation. To obtain better answers to the above questions, clearly we need to train word2vec with more data.

Coming to the most important question, where and in what kind of applications can we put word vectors into good use. The most intuitive application that can leverage word vectors is in "Search Engines". In order for a search engine to return a result, it computes a similarity between the query and the documents to search for. The similarity is often computed based on the cosine distance between the vectors of query and a document. Similar to LSA, word2vec can help us with obtaining search results which are "semantically similar" to the query.

For example, if the query is "python programming tutorial", then for a document with the title "Learn to code in python", the document may not have  all the words in the query, but if we add the word vectors for V("python")+V("programming")+V("tutorial") and then compute the cosine similarity with the resultant vector for V("learn"+V("code")+V("python") (stop-words excluded), we might obtain high similarity between these two concepts, thus the document becomes an important result for the query.

In order to demo this with a code :

from scipy import spatial
def addWordVectors(sentence, model) :
words = nltk.word_tokenize(sentence)
word_vectors = [model[word] for word in words]
resultant_vector = [sum(x) for x in zip(*word_vectors)]
return resultant_vector
v1 = addWordVectors("computer hardware", model)
v2 = addWordVectors("keyboard", model)
print 1 - spatial.distance.cosine(v1, v2)

I am defining a function to extract the words out of a sentence and return the sum of their word vectors. This function is called both for a query and a document. Then compute the cosine similarity of the resultant vectors. Here I am computing the similarity between the query "computer hardware" and "keyboard". The similarity comes to around 0.86 which indicates that "keyboard" is relevant to the query "computer hardware".

Note the adding the vectors for the words is just one way to obtain a single vector from multiple word vectors. One can also experiment with averaging of the word vectors or taking the min/max value of each component of the vector and so on.

Word2vec can also be used for text classification. In text classification, a document is represented by a vector of TF-IDF values for words in the vocabulary. One approach for converting the word vector representations into the Document-Term matrix is to take the sum (average, min/max etc.) of the word vectors of all the words present in the document and use this as the vector representation for the document. I found this nice post that does comparison of the word2vec averaging method with SVM and Naive Bayes classifier. In fact the authors of word2vec have developed a version called Doc2Vec (also known as paragraph vectors) for directly training sentences and paragraphs with the Skip Gram architecture instead of averaging the word vectors in the text. We will explore this path in a different post.

Some other interesting applications that can leverage the elegance of word vectors are :

In fact, I am looking forward to reading and understanding more about the above applications of word vectors and implement some of them with different datasets. Stay Tuned !!!

Categories: AI, MACHINE LEARNING, PROBLEM SOLVING