Machine Learning, AI and Programming

Using Word Vectors in Multi-Class Text Classification

Earlier we saw how, instead of representing the words in a text document as isolated features (or as N-grams), we can encode them as multidimensional vectors in which each dimension captures some kind of semantic or relational similarity with other words in the corpus. Machine learning problems such as classification or clustering require documents to be represented as a document-feature matrix (with TF or TF-IDF weighting), so we need some way to convert the word vectors into fixed-length document vectors that can be fed to a classification or clustering algorithm.

Traditional approaches that use isolated words or N-grams as features suffer from :

  • Sparsity. Most of the words in the vocabulary are absent from any single document. As a result, the TF-IDF matrix can take up a lot of unnecessary space (if dense matrices are used to represent the data), and documents cannot be compared with one another along dimensions where the features are not present.
  • Loss of word order. The document representation (TF-IDF matrix) does not take word order into account: it treats each occurrence of a word as independent of the other words, which is definitely not true.
  • High dimensionality. Large text documents can have thousands of words (features) per document, and adding 2-grams or 3-grams increases the number of features further. This can lead to memory issues if we hold the matrix object in memory. Not all features are important (some could be just noise), and modeling with so many features might lead to overfitting (low bias, high variance).
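To make the sparsity point concrete, here is a minimal sketch that builds a dense count matrix for a tiny made-up corpus and measures how many cells are zero. The corpus and variable names are invented for illustration; on a real dataset such as 20 Newsgroups the fraction of zeros would be far higher.

```python
# Toy corpus (made up for illustration, not taken from 20 Newsgroups).
docs = [
    "machine learning needs data",
    "deep learning needs gpus",
    "cats chase mice",
]

# Vocabulary = every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Dense document-feature matrix of raw term counts (one row per document).
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

total_cells = len(docs) * len(vocab)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

print(len(vocab), "features;", round(100 * sparsity, 1), "% of cells are zero")
```

Note that the first and third documents share no non-zero feature at all, so a bag-of-words representation gives no basis for comparing them.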

One can come up with different approaches to representing a document as a continuous vector of smaller dimension: for example, taking the weighted average of the vectors of all words in the document (weighted by the TF-IDF score of each word), or concatenating the weighted word vectors for the words in the vocabulary, and so on. Concatenating the word vectors of all words might lead to very high feature dimensions, whereas taking the TF-IDF-weighted average of the word vectors loses the word order information. So the authors of word2vec came up with an approach that learns the document vectors themselves, in much the same way as word vectors.

In the same neural network architecture with one hidden layer and one output layer (similar to our Skip-Gram and CBOW models for learning word vectors), let W and W' denote the input and output weight matrices for the words, and D and D' the input and output weight matrices for the documents, all initialized randomly. The columns of W (or W') represent the word vectors, whereas the columns of D (or D') represent the document vectors. After training, the final document vectors can be used directly as input to a classification or clustering algorithm.

Each training instance is constructed by sliding a window of context words over a document, concatenating the word vectors of the context words together with the document vector, and using the result to predict the next word in the sequence (concatenation preserves the ordering among the words). This is known as the Distributed Memory Model.

For example, given a document with the text :

"Artificial Intelligence and Machine Learning are most sought after skills this year"

Removing the stop-words and lower-casing, we would get :

"artificial intelligence machine learning sought skills year"

with a context window size of 3, we would get the following training instances and corresponding prediction outputs :

  • Train : [("artificial", "intelligence", "machine")]    Output : [("learning")]
  • Train : [("intelligence", "machine", "learning")]    Output : [("sought")]
  • Train : [("machine", "learning", "sought")]    Output : [("skills")]
  • Train : [("learning", "sought", "skills")]    Output : [("year")]
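The windowing above can be sketched in a few lines of Python. The function name `training_instances` is made up for this illustration; it simply reproduces the four (context, next word) pairs listed above and is not how gensim builds its training examples internally.

```python
def training_instances(tokens, window=3):
    """Yield (context, target) pairs: `window` consecutive words predict the next one."""
    for start in range(len(tokens) - window):
        context = tuple(tokens[start:start + window])
        target = tokens[start + window]
        yield context, target

tokens = "artificial intelligence machine learning sought skills year".split()
instances = list(training_instances(tokens, window=3))

for context, target in instances:
    print("Train :", context, "   Output :", target)
```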

If, for a word 'w', we denote its weight vector by v(w), and the document vector for this document by D, then the first training instance is represented as :

D \,\|\, v(\text{artificial}) \,\|\, v(\text{intelligence}) \,\|\, v(\text{machine})

where || represents the concatenation operator.
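As a rough illustration of the concatenation, the sketch below joins a document vector with three context word vectors into one input vector. The 2-dimensional values are invented, the helper name `concat_input` is hypothetical, and placing the document vector first is an assumption; what matters is that the order is the same for every training instance.

```python
# Toy 2-dimensional vectors; the values are invented for illustration only.
v = {
    "artificial":   [0.1, 0.2],
    "intelligence": [0.3, 0.4],
    "machine":      [0.5, 0.6],
}
D = [0.9, 0.8]  # vector of the document, shared by all of its training instances

def concat_input(doc_vector, context, word_vectors):
    """Return doc_vector || v(w1) || ... || v(wk) as one flat list."""
    x = list(doc_vector)
    for word in context:
        x.extend(word_vectors[word])
    return x

x = concat_input(D, ("artificial", "intelligence", "machine"), v)
print(len(x), x)
```

With a window of k words and vectors of dimension d, the input has (k + 1) * d components, and swapping two context words produces a different input, which is how the word order is preserved.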

Note that all training instances from a single document share the same document vector. This is how document vectors are generated for the documents seen in training. For test documents, one reuses the weight matrices W and W' already computed in the training phase and learns only the document vectors D and D' for the test documents. In the testing phase, the word vectors learnt from the training data are kept fixed and only the test document vectors are updated. This is the inference step.

Figure: Distributed Memory Model for doc2vec for the text "the cat sat on"

In this post we look at how to train document vectors using the Python gensim package, and use them with an SVM to train and test on the 20 Newsgroups data. We compare the results against two baselines: an SVM trained on the full TF-IDF feature matrix, and an SVM trained on document vectors constructed by taking the weighted average of the corresponding word vectors.

Let's create a utilities file for reading and pre-processing the 20 Newsgroups data. We put the following functions in the "" script :

import nltk, logging
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

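def getContents(type='train'):
    # Download (or load from cache) the requested subset: 'train' or 'test'.
    mydata = fetch_20newsgroups(subset=type, shuffle=True, random_state=42)

    # Flatten each post onto a single line; labels are the integer newsgroup ids.
    contents = [" ".join(data.split("\n")) for data in mydata.data]
    labels = mydata.target

    return {'Contents':contents, 'Labels':labels}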

def myTokenizer(text):
    return nltk.regexp_tokenize(text, "\\b[a-zA-Z]{3,}\\b")

def tokenizeContents(contents):
    return [myTokenizer(content) for content in contents]

def getVectorizer():
    vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', tokenizer=myTokenizer)

    return vectorizer

We use the scikit-learn library to fetch the 20 Newsgroups dataset. The "fetch_20newsgroups" function downloads the data once and caches it for future use. We use the regular-expression tokenizer from the nltk library to split the text into words, keeping only alphabetic words of at least 3 letters. To construct the document-feature matrix, we use the TF-IDF weighting scheme with unigrams only. We also remove common English stop-words, as they are too frequent to convey much additional information to a classification or clustering model.
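To see what the tokenizer pattern keeps and drops, here is a small stand-in that uses the standard-library `re` module instead of nltk. The function name `simple_tokenize` is made up, and the assumption is that `re.findall` behaves the same as `nltk.regexp_tokenize` for this particular pattern.

```python
import re

# Same pattern that myTokenizer passes to nltk.regexp_tokenize.
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")

def simple_tokenize(text):
    """Return the runs of 3 or more ASCII letters that form whole words."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("The cat sat on 2 mats in NYC")
print(tokens)
```

Short words such as "on" and "in" and the digit "2" are dropped, while case is left untouched, so any lower-casing has to happen elsewhere.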

Next we write a Python script that contains all the functions required to train both word vectors and document vectors using the gensim package.

import Utilities, os
import gensim, logging
import numpy as np
from sklearn import svm

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

min_count = 10
context_window = 10
vector_size = 300
downsample = 1e-5
negative_sampling = 5
num_threads = 4
num_epochs = 10

def getTrainTokens():
    trainContents = Utilities.getContents('train')
    trainTokens = Utilities.tokenizeContents(trainContents['Contents'])

    return {'Contents':trainContents['Contents'], 'Tokens':trainTokens, 'Labels':trainContents['Labels']}

def getTestTokens():
    testContents = Utilities.getContents('test')
    testTokens = Utilities.tokenizeContents(testContents['Contents'])

    return {'Contents':testContents['Contents'], 'Tokens':testTokens, 'Labels':testContents['Labels']}

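def trainDoc2Vec(tokens, savePath):
    # Tag each tokenized document with a unique id so its vector can be looked up later.
    docs = [gensim.models.doc2vec.TaggedDocument(words=token, tags=['DOC_' + str(idx)])
            for idx, token in enumerate(tokens)]

    if (os.path.exists(savePath)):
        # Reuse a previously trained model if one was saved at this path.
        model = gensim.models.Doc2Vec.load(savePath)
    else:
        # 'size' and 'iter' are the older gensim names; gensim 4.x renamed them
        # to 'vector_size' and 'epochs'.
        model = gensim.models.Doc2Vec(docs, min_count=min_count, window=context_window, size=vector_size,
                                      sample=downsample, negative=negative_sampling, workers=num_threads,
                                      iter=num_epochs)
        # Assumed step: persist the model so that the load branch above can find it.
        model.save(savePath)

    return model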

The first two functions tokenize the file contents into words, to be used for training by the "trainDoc2Vec" and "trainWord2Vec" functions. For training the Doc2Vec model, we use the following settings :

  • min_count = 10, a word must occur at least 10 times in the entire dataset to be kept in the vocabulary.
  • window = 10, 10 context words to the left and 10 context words to the right of a target word.
  • size = 300 (vector_size in the script), the dimension of each word vector and document vector.
  • sample = 1e-5, randomly downsample frequent words, i.e. words whose relative frequency is greater than 1e-5 in this case.
  • negative = 5, draw 5 words per training instance for negative sampling.
  • workers = 4, use 4 threads in parallel (uses Cython) to build the model.
  • iter = 10, make 10 passes (epochs) over the training data, so that each training instance goes through the stochastic gradient update 10 times. In each epoch, the training instances are randomly sampled.
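As an illustration of what the min_count setting does to the vocabulary, the following sketch drops rare words from a tiny tokenized corpus. The helper `filter_by_min_count` and the corpus are invented for this example; it mimics the effect described in the list above and is not gensim's actual implementation.

```python
from collections import Counter

def filter_by_min_count(tokenized_docs, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    kept = {word for word, n in counts.items() if n >= min_count}
    return [[word for word in doc if word in kept] for doc in tokenized_docs]

# Toy tokenized corpus, invented for illustration.
docs = [
    ["space", "shuttle", "launch"],
    ["space", "station", "orbit"],
    ["space", "launch", "delayed"],
]

filtered = filter_by_min_count(docs, min_count=2)
print(filtered)
```

A higher threshold shrinks the vocabulary and removes words for which there are too few examples to learn a reliable vector, at the cost of leaving some documents with very few tokens.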

Lastly, we present the code for learning the word and document vectors and using those vectors with an SVM from scikit-learn to build a classification model, so that we can compare its performance against using the full set of features. The following functions are included in the script.

import Doc2Vec, Utilities
import numpy as np
from sklearn import svm

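def trainTest(trainData, trainLabels, testData, testLabels):
    # RBF-kernel SVM: fit on the training vectors, then report accuracy on the test set.
    clf = svm.SVC(decision_function_shape='ovo', C=100, gamma=0.9, kernel='rbf')
    clf.fit(trainData, trainLabels)

    return clf.score(testData, testLabels)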

def constructDocArrayFromWords(tokens, vocab, vectorizer, vectorModel, docFeatureMat):
    docArrays = np.zeros((len(tokens), Doc2Vec.vector_size))

    for i in range(len(tokens)):
        fileTokens = tokens[i]
        temp = np.zeros((len(fileTokens), Doc2Vec.vector_size))
        weights = np.zeros(len(fileTokens))

        for j in range(len(fileTokens)):
            token = fileTokens[j]

            if token in vocab:
                word_vector = vectorModel[token]
                feature_index = vectorizer.vocabulary_.get(token)
                tfidf = docFeatureMat[i, feature_index]
            else:
                # Words missing from either vocabulary contribute nothing.
                word_vector = np.zeros(Doc2Vec.vector_size)
                tfidf = 0

            temp[j] = np.array(word_vector)
            weights[j] = tfidf

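        weightSum = np.sum(weights)

        if (weightSum > 0):
            # Normalize the TF-IDF weights so that they sum to 1.
            weights = weights / weightSum

        # TF-IDF-weighted average of the word vectors for this document.
        docArrays[i] = weights.dot(temp)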

    return docArrays

def trainTestSVM(train, test):

    vectorizer = Utilities.getVectorizer()

    X_train = vectorizer.fit_transform(train['Contents'])
    X_test = vectorizer.transform(test['Contents'])

    return trainTest(X_train, train['Labels'], X_test, test['Labels'])

def trainTestSVM_Doc2Vec(train, test, useFullData=1):

    if (useFullData == 1):
        tokens = train['Tokens'] + test['Tokens']
    else:
        tokens = train['Tokens']

    vectorModel = Doc2Vec.trainDoc2Vec(tokens, 'doc2vec__'+str(useFullData))

    trainTokens = train['Tokens']

    trainLabels = train['Labels']

    trainArrays = np.zeros((len(trainTokens), Doc2Vec.vector_size))

    for i in range(len(trainTokens)):
        trainArrays[i] = vectorModel.docvecs['DOC_' + str(i)]

    testTokens = test['Tokens']

    testLabels = test['Labels']

    testArrays = np.zeros((len(testTokens), Doc2Vec.vector_size))

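    for i in range(len(testTokens)):
        if (useFullData == 1):
            # Test documents were tagged after the training documents.
            testArrays[i] = vectorModel.docvecs['DOC_' + str(i + len(trainTokens))]
        else:
            # Infer a vector for an unseen document with the word weights held fixed.
            testArrays[i] = vectorModel.infer_vector(testTokens[i], steps=10)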

    return trainTest(trainArrays, trainLabels, testArrays, testLabels)

def trainTestSVM_Word2Vec(train, test):

    vectorizer = Utilities.getVectorizer()

    X_train = vectorizer.fit_transform(train['Contents'])
    X_test = vectorizer.transform(test['Contents'])

    tokens = train['Tokens'] + test['Tokens']

    vectorModel = Doc2Vec.trainDoc2Vec(tokens, 'doc2vec__1')

    vocab = set.intersection(set(vectorModel.wv.vocab), set(vectorizer.vocabulary_.keys()))

    trainTokens = train['Tokens']

    trainLabels = train['Labels']

    trainArrays = constructDocArrayFromWords(trainTokens, vocab, vectorizer, vectorModel, X_train)

    testTokens = test['Tokens']

    testLabels = test['Labels']

    testArrays = constructDocArrayFromWords(testTokens, vocab, vectorizer, vectorModel, X_test)

    return trainTest(trainArrays, trainLabels, testArrays, testLabels)

  • The 'constructDocArrayFromWords' function constructs document vectors from word vectors explicitly, without any implicit learning of document vectors. The document vector for a document is the weighted average of the word vectors of the words present in the document (that are also part of the doc2vec vocabulary), weighted by the TF-IDF scores of those words. That is, if the word vector for a word 'w' is v(w) and the TF-IDF score of word 'w' in document D is s(w, D), then the document vector is computed as :

\frac{\sum_{w \in D} s(w, D) \, v(w)}{\sum_{w \in D} s(w, D)}

  • In the 'trainTestSVM_Doc2Vec' function, we use the parameter 'useFullData' to distinguish between two cases :
    • If useFullData=1, we train the doc2vec model on both the training and the testing data from 20 Newsgroups.
    • If useFullData=0, we train the doc2vec model on only the training data from 20 Newsgroups. The document vectors for the testing data are inferred by calling the 'infer_vector' function of the Doc2Vec model. We will see later that the document vectors, and hence the classification performance, are better with useFullData=1, which suggests that more data improves the Doc2Vec results.
  • The last function, 'trainTestSVM_Word2Vec', trains the doc2vec model in the same way as 'trainTestSVM_Doc2Vec' (with useFullData=1), but uses the word vectors rather than the document vectors. It calls the 'constructDocArrayFromWords' function to create the document vectors from the word vectors using the weighted averaging method.
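The weighted-average formula from the first point can be checked by hand on a toy example. The sketch below uses plain Python lists so that it runs without any dependencies; the function name `weighted_average`, the word vectors and the scores are all invented for illustration.

```python
def weighted_average(word_vectors, scores):
    """Return sum(s * v) / sum(s) over the words, or a zero vector if all scores are 0."""
    dim = len(next(iter(word_vectors.values())))
    total = sum(scores.values())
    doc_vector = [0.0] * dim
    if total == 0:
        return doc_vector
    for word, score in scores.items():
        for k, component in enumerate(word_vectors[word]):
            doc_vector[k] += score * component / total
    return doc_vector

# Toy 2-dimensional word vectors v(w) and TF-IDF scores s(w, D); values are invented.
v = {"space": [1.0, 0.0], "launch": [0.0, 1.0]}
s = {"space": 3.0, "launch": 1.0}

doc_vector = weighted_average(v, s)
print(doc_vector)
```

Because "space" carries three times the weight of "launch", the resulting document vector sits three quarters of the way towards the vector for "space".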

In essence, we compare the numbers returned by the following 4 function calls :

train = Doc2Vec.getTrainTokens()
test = Doc2Vec.getTestTokens()

print(trainTestSVM(train, test))
print(trainTestSVM_Doc2Vec(train, test, useFullData=1))
print(trainTestSVM_Doc2Vec(train, test, useFullData=0))
print(trainTestSVM_Word2Vec(train, test))

The numbers with the configured parameter values are as follows :

  • Accuracy with full set of features (80,791 features) = 84%
  • Accuracy with 300 dimensional document vectors (doc vectors trained on entire train + test data) = 70%
  • Accuracy with 300 dimensional document vectors (doc vectors trained on only train data, test doc vectors are inferred) = 44%
  • Accuracy with 300 dimensional document vectors (doc vectors are not learned but constructed from word vectors trained on full data) = 61%

What we can infer from the above results is that, although we achieve 70% accuracy with only 300-dimensional document representations (doc vectors trained on 20 Newsgroups train + test data), we are still 14 percentage points short of an SVM model that uses the full set of features (with TF-IDF weighting).

Moreover, when we use only the training data to create the document vectors and infer the vectors for the test docs, the accuracy is much lower (44%). This suggests that the amount of data available for training the doc2vec model plays a significant role, and that even the total data available (train + test) is probably not sufficient to generate good enough vector representations for documents.

(I have tried varying the vector size from 100 to 1000, but the accuracy remains almost the same, with minor variations of 1-2%.)

One can also use pre-trained word vectors (trained not on the 20 Newsgroups dataset but on the 2014 English Wikipedia dump) of dimension 100, covering 400K words, to infer the doc vectors for the 20 Newsgroups train and test data, and then build the classification model with an SVM classifier. These vectors are trained using the GloVe algorithm and not the Doc2Vec algorithm described above.
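A minimal sketch of that idea is shown below, assuming the pre-trained vectors come in the usual GloVe text format (one word per line, followed by its space-separated components). The helper names `load_glove` and `average_vector` are made up, the in-memory sample stands in for a real vector file, and a plain unweighted mean is used for brevity where TF-IDF weights could be applied as in 'constructDocArrayFromWords'.

```python
import io

def load_glove(lines):
    """Parse GloVe text-format lines ("word v1 v2 ... vd") into a dict of vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def average_vector(tokens, vectors, dim):
    """Unweighted mean of the vectors of the known tokens (zero vector if none)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Stand-in for a real GloVe file; these 3-dimensional values are invented.
sample_file = io.StringIO(
    "space 1.0 0.0 2.0\n"
    "launch 3.0 2.0 0.0\n"
)
glove = load_glove(sample_file)
doc_vector = average_vector(["space", "launch", "unknownword"], glove, dim=3)
print(doc_vector)
```

Tokens that are missing from the pre-trained vocabulary are simply skipped, so the resulting document vectors can be passed to the same 'trainTest' function as before.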

Here is a nice tutorial on using the GloVe word vectors to train and test a Convolutional Neural Network on the 20 Newsgroups data.

Get the full code on my Github profile.

