
Designing a Contextual Graphical Model for Words

I have been reading about Word Embedding methods that encode the words found in text data into multi-dimensional vectors. The purpose of encoding words as vectors is to give "meaning" to words or phrases in a context. Traditional methods of classification treat each word in isolation or at most use an N-gram approach, i.e. in vector space the words are represented as one-hot vectors, which are sparse and do not convey any meaning. Learning the vector representation of a word in a given context, using Neural Network methods such as Word2Vec or GloVe, instead generates dense vectors that encode different dimensions of relations with other words.

Word Vectors Displaying Linear Relationships

For example, the word vectors for "King" and "Queen" encode gender, ruling status etc. in different dimensions of the vectors, so one can compute and check that the difference vector ("King")-("Queen") is approximately equal to the difference ("Man")-("Woman").

Vector representations of words are arguably a universal encoding scheme that can work well with almost any kind of Natural Language Processing task. For example, in Sentiment Analysis, Document Classification, POS Tagging, Named Entity Recognition, Language Translation, Speech to Text, Search Engines etc., word vectors computed using either Word2Vec or GloVe have been shown to improve performance over traditional methods that use words in isolation.

Word2Vec and GloVe are also an improvement over earlier techniques for generating word vectors such as SVD or LSA, which suffer from time and space complexity issues. Different approaches use different word statistics: SVD and LSA use the word co-occurrence matrix (with co-occurrences counted over entire documents), Word2Vec uses a center word to predict its surrounding words, i.e. it uses local information about words, whereas GloVe uses both local information (surrounding words) and global information (co-occurrence statistics over the whole corpus).

I tried my hand at designing a simple and intuitive graphical-model-based approach for generating similar words given a single word or multiple words in a context. The problem I was most concerned with was writing a spelling corrector module for OCR outputs (generated using Tesseract OCR on document images). As explored in an earlier post, I had fiddled with an algorithm which uses word sequence information to suggest possible corrections, based on

P(\text{current word}|\text{previous word}) and

P(\text{current word}|\text{next word})

Basically it was a Markov Model, where the possible spelling suggestions were based on the previous and the next word in the sequence as well as the Levenshtein distances between the error word and the possible corrections. The problem with this approach was that multiple words in a sequence could be misspelled, and thus the possible spelling suggestions could be totally different depending on whether the previous and the next words in the sequence are themselves correct or incorrect.

For example, consider the text

"artificial intelligence and madline leaming is going to disrupt the Software Industries"

The above text was obtained as OCR output from one of the OCR engines. The words "machine" and "learning" have not been detected correctly. If "machine" had been detected correctly, then using our Markov Model we could have easily detected that the next word is not "leaming" but "learning", with a Levenshtein distance of 2 between the two words.

To overcome this, I came up with a more generic solution, which utilizes not only the previous and the next words in the sequence but also the other surrounding words in the context. For example, given the text

"natural language procesing is a field of machine learning and data mining"

The error in the word "procesing" can be corrected by considering the words in its context, such as ["natural", "language", "machine", "learning", "data", "mining"]. Even if the words "language" or "machine" are misspelled as "lanquage" and "madline", the other correct words in the context would most likely suggest possible corrections not only for "procesing" but also for the incorrect versions of "language" or "machine".

The idea of using the context words is borrowed from some of the techniques mentioned earlier. The method for building and testing the graphical model is quite simple: it uses only graph theory basics and does not require any explicit training or optimization of an objective, unlike Word2Vec or GloVe. The algorithm is as follows :

  • Build a graph with nodes and edges, where each node represents a word and an edge exists between two nodes if the corresponding words appear at least once within the same context window. The weight of an edge represents the conditional probability of a word given another word in its context. Depending on the context window size C, generate instances from each document, where each instance comprises a target word along with the C words in sequence to the left of the target word and the C words to the right.

For example, given the text :

"natural language processing is a field of machine learning and data mining"

with C=2, the training instances would look like the following (ignoring common stop-words like "is", "a", "of", "and" etc.) :

  • ["natural", ("language", "processing")]
  • ["language", ("natural", "processing", "field")]
  • ["processing", ("natural", "language", "field", "machine")]
  • ["field", ("language", "processing", "machine", "learning")]

and so on, where in each training instance the first word is the target word and the following set of words are the context words.

Below is a C++ function to generate the above instances :

struct TrainInstance {
  int givenWordIndex;
  std::set<int> contextWordsIndices;
};
std::vector<TrainInstance> generateSkipGramTrainingInstances(const std::vector<std::vector<std::string>> &documentWords, 
                                                             std::unordered_map<std::string, int> &wordIndex, const int &contextSize,
                                                             const std::vector<std::string> &vocabulary, const bool &onlyNonVocab) {
  
  std::vector<TrainInstance> output;
  std::set<std::string> vocab(vocabulary.begin(), vocabulary.end());
  
  for (int i = 0; i < documentWords.size(); i++) {
    if (documentWords[i].size() > 1) {
      std::deque<int> leftIndices, rightIndices;
      
      for (int k = 0; k < documentWords[i].size(); k++) {
        
        std::string trainWord = documentWords[i][k];
        int trainWordIndex = wordIndex[trainWord];
        
        // Maintain the left context: indices of up to C words preceding position k.
        if (k > contextSize) {
          leftIndices.pop_front();
          leftIndices.push_back(wordIndex[documentWords[i][k-1]]);
        }
        else if (k > 0) leftIndices.push_back(wordIndex[documentWords[i][k-1]]);
        
        int maxEl = std::min(contextSize, (int)documentWords[i].size()-1);
        
        // Maintain the right context: indices of up to C words following position k.
        if (k == 0) for (int m = 1; m <= maxEl; m++) rightIndices.push_back(wordIndex[documentWords[i][m]]);
        
        else if (k + contextSize < (int)documentWords[i].size()) {
          rightIndices.pop_front();
          rightIndices.push_back(wordIndex[documentWords[i][k + contextSize]]);
        }
        else rightIndices.pop_front();
        
        std::set<int> contextIndices;
        
        for (auto p = leftIndices.begin(); p != leftIndices.end(); ++p) contextIndices.insert(*p);
        for (auto p = rightIndices.begin(); p != rightIndices.end(); ++p) contextIndices.insert(*p);
        
        // Keep this target only if all words are wanted, or if it lies outside the
        // dictionary (the spell-correction mode used later in the post).
        if ((onlyNonVocab && vocab.find(trainWord) == vocab.end()) || !onlyNonVocab) {
          TrainInstance inst;
          inst.givenWordIndex = trainWordIndex;
          inst.contextWordsIndices = contextIndices;
          
          output.push_back(inst);
        }
      }
    }
  }
  
  return output;
}

The above function takes as input the tokenized words per document in the corpus. There are two levels of "std::vector" in "documentWords" : the outer level runs over documents and the inner level runs over the words of a document.

The function also takes as input the word-to-index lookup table, the context size C, and two other parameters corresponding to a dictionary of English words. The dictionary is used for spelling correction, which we will come to later in the post.

The function returns a vector of the "TrainInstance" struct defined above, which contains a target word index and a set of context word indices per instance, similar to the examples shown above.
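As a quick illustration, below is a minimal usage sketch for the function (assuming the headers <vector>, <string>, <set>, <unordered_map> and <deque> are included, and that stop-words have already been removed during tokenization) :

// Tokenized corpus with a single document (stop-words already removed).
std::vector<std::vector<std::string>> documentWords = {
  {"natural", "language", "processing", "field", "machine", "learning", "data", "mining"}
};

// Build the word-to-index lookup over the corpus.
std::unordered_map<std::string, int> wordIndex;
for (const auto &doc : documentWords) {
  for (const auto &w : doc) {
    if (wordIndex.find(w) == wordIndex.end()) {
      int idx = (int)wordIndex.size();
      wordIndex[w] = idx;
    }
  }
}

// With an empty dictionary and onlyNonVocab=false, every word becomes a target.
std::vector<TrainInstance> instances =
  generateSkipGramTrainingInstances(documentWords, wordIndex, 2, {}, false);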

  • For each context word in an instance, update the edge weight between the context word and the target word, i.e. for every encounter of ("natural", "language") as a context-target pair throughout the corpus, the conditional probability of the word "language" given the context word "natural" increases. Instead of using the words themselves (as strings) in the nodes, we use the word indices from the word-to-index lookup table.

Snippet of a Context-Target Directed Graph

Below is the C++ function for updating the graph nodes and edge weights. The graph is represented as a two-level map in C++ : since we are not storing any additional node properties, we do not need a separate Node class or struct to represent the graph. The two map keys are the node indices, and the value is the edge weight from the first index to the second index (a directed graph).

std::unordered_map<int, std::unordered_map<int, double>> train(const std::vector<TrainInstance> &trainInstances) {
  std::unordered_map<int, std::unordered_map<int, double>> nodes;
  
  for (size_t i = 0; i < trainInstances.size(); i++) {
    TrainInstance instance = trainInstances[i];
    
    int currWordIndex = instance.givenWordIndex;
    std::set<int> contextWordsIndices = instance.contextWordsIndices;
    
    // Every (context -> target) encounter increments the directed edge weight.
    for (auto p = contextWordsIndices.begin(); p != contextWordsIndices.end(); ++p) nodes[*p][currWordIndex]++;
  }

  // Normalize each node's outgoing edge weights into conditional probabilities.
  for (auto p = nodes.begin(); p != nodes.end(); ++p) {
    std::unordered_map<int, double> x = p->second;
    
    double sum = 0;
    for (auto q = x.begin(); q != x.end(); ++q) sum += q->second;
    for (auto q = x.begin(); q != x.end(); ++q) nodes[p->first][q->first] /= sum;
  }
  
  return nodes;
}
  • Given a vector of context words, compute the most probable target words within the same context. For each context word, find its possible targets and their edge weights from the graph. Group by the unique target words. For each occurrence of a target word, add (1 + edge weight) to the final score of that target. We add 1 to the edge weights because, apart from the edge weights, we also want to rank the target words by how many context words predict the same target. Thus a target word which co-occurs (in the same context) with all of the given context words will rank higher than any target word which does not co-occur with all of the context words, irrespective of their edge weights.

Instead of keeping a hard bound of returning only the immediate neighbors of a node, we can be more flexible by specifying a maximum count of possible target words together with a maximum depth (path length) of the nodes from the given context. Although, when the vocabulary is large, going to a depth of 2 or more can be very slow.

Note that we are not computing conditional probabilities of the form :

P(w_t|w_{t-C}, ..., w_{t-1}, w_{t+1}, ..., w_{t+C})

The above formula implies that we are trying to compute the probability of a target word w_t given that all of the words [w_{t-C}, ..., w_{t-1}, w_{t+1}, ..., w_{t+C}] occur in the context of w_t; instead, we are only trying to maximize the chances of recovering the correct word w_t for a spelling error.

As mentioned earlier, the context words may not always be the same for w_t across documents, and there could also be spelling errors in the context words themselves, so we would like to choose a large enough context around the incorrect word at location 't' to maximize our chances of finding the true correct word, based on the Levenshtein distance between the incorrect word and the possible corrections.
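What the consolidation step described above computes is, roughly, a voting score of the form :

\text{score}(w_t) = \sum_{c \in \text{context}, \; w_t \in \text{top-}K(c)} (1 + P(w_t|c))

where top-K(c) denotes the set of most probable target words reachable from the context word c in the graph.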

The following C++ function returns the most probable target words from a graph given a context word :

struct NodeQueue {
  int nodeIdx, nodeDepth;
  double nodeScore;
};
std::unordered_map<int, double> mostContextual(std::unordered_map<int, std::unordered_map<int, double>> &nodes, 
                                               const int &index, const int &count, const int &maxDepth) {
  
  int currIdx = index;
  std::set<int> visitedNodes;
  
  std::priority_queue<std::pair<int, double>, std::vector<std::pair<int, double>>, comparator> minHeap;
  
  std::queue<NodeQueue> nodeQueue;

  // BFS from the context word; a node's score is the product of edge weights along the path.
  nodeQueue.push({currIdx, 0, 1.0});
  
  while(!nodeQueue.empty()) {
    
    NodeQueue frontNode = nodeQueue.front();
    
    if (frontNode.nodeDepth >= maxDepth) break;
    
    visitedNodes.insert(frontNode.nodeIdx);
    
    std::unordered_map<int, double> neighbors = nodes[frontNode.nodeIdx];
    
    nodeQueue.pop();
    
    for (auto p = neighbors.begin(); p != neighbors.end(); ++p) {
      double score = frontNode.nodeScore*p->second;
      
      if (visitedNodes.find(p->first) == visitedNodes.end()) {
        
        // Keep only the top `count` scores seen so far: evict the current minimum if beaten.
        if ((int)minHeap.size() >= count && minHeap.top().second < score) minHeap.pop();
        
        if (minHeap.empty() || (int)minHeap.size() < count) {
          nodeQueue.push({p->first, frontNode.nodeDepth+1, score});
          minHeap.push(std::make_pair(p->first, score));
        }
      }
    }
  }
  
  std::unordered_map<int, double> output;
  
  while(!minHeap.empty()) {
    std::pair<int, double> tops = minHeap.top();
    output[tops.first] = tops.second;
    minHeap.pop();
  }
  
  return output;
}

The function accepts the graph (modeled as a two-level map), the index of the context word, the maximum number of possible target words, and the maximum depth to traverse in the graph (for the case where there is an insufficient number of target words at the first level, and so on).

The function performs a Breadth First Search starting at the context word index. We use a Queue data structure for the BFS, since recursion can lead to stack overflow issues. In order to keep track of the 'K' top-most target words (based on their edge weights from the context word), we use a priority-queue based implementation of a MinHeap data structure.
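The "comparator" used with the priority queue is not shown in the snippets; a minimal definition consistent with the usage above (the pair with the smallest score sits at the top of the heap) would be :

// Orders pairs so that the smallest score is at the top of the priority queue,
// effectively turning std::priority_queue into a MinHeap keyed on the score.
struct comparator {
  bool operator()(const std::pair<int, double> &a, const std::pair<int, double> &b) const {
    return a.second > b.second;
  }
};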

The function below takes the same inputs as the one above, but accepts multiple context word indices instead of a single one. It internally calls the above function in a loop and consolidates the results over all the context word indices.

std::unordered_map<int, double> mostContextualMultiple(std::unordered_map<int, std::unordered_map<int, double>> &nodes, 
                                                       const std::set<int> &indices, const int &count, const int &maxDepth) {
  
  std::unordered_map<int, double> combined, output;
  std::vector<std::pair<int, double>> collect;
  
  for (auto p = indices.begin(); p != indices.end(); ++p) {
    std::unordered_map<int, double> contextual = mostContextual(nodes, *p, 1000, maxDepth);
    
    for (auto q = contextual.begin(); q != contextual.end(); ++q) {
      // Add (1 + edge weight) so that the number of context words voting for a target
      // dominates the ranking and the edge weights mostly act as tie-breakers.
      if (combined.find(q->first) != combined.end()) combined[q->first] += (1+q->second);
      else combined[q->first] = (1+q->second);
    }
  }
  
  std::priority_queue<std::pair<int, double>, std::vector<std::pair<int, double>>, comparator> minHeap;
  
  for (auto p = combined.begin(); p != combined.end(); ++p) {
    if ((int)minHeap.size() >= count && minHeap.top().second < p->second) minHeap.pop();
    if (minHeap.empty() || (int)minHeap.size() < count) minHeap.push(std::make_pair(p->first, p->second));
  }
  
  while(!minHeap.empty()) {
    std::pair<int, double> tops = minHeap.top();
    output[tops.first] = tops.second;
    minHeap.pop();
  }
  
  return output;
}

Note that we are using two different maximum counts for the target words. When calling the function that returns the probable target words for a single context word index, we obtain the top 1000 target words, but at the time of consolidating the target words over all the context word indices, we keep a much smaller number (around 50) of consolidated target words.

Since I generally work with OCR outputs of scanned document images from financial documents (mostly loan origination and servicing documents), I would like to evaluate what kind of target words are returned given a few context words related to financial documents. (Due to compliance issues, I cannot display full file contents here.)

Train the graphical model with a context size of 5 and remove common English stop-words.

model <- cpp__generateGraph(documentWords, contextSize = 5)
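Under the hood, cpp__generateGraph (available in the linked repository) essentially wires together the C++ functions shown earlier. Conceptually, it does something like the following (a sketch of the wiring, not the actual Rcpp wrapper, which additionally packs the word/index lookup tables into the returned model List) :

// Conceptual wiring behind cpp__generateGraph (sketch only). Assumes documentWords
// and wordIndex are built as in the earlier usage sketch, with stop-words removed.
std::vector<TrainInstance> instances =
  generateSkipGramTrainingInstances(documentWords, wordIndex, 5, {}, false);

std::unordered_map<int, std::unordered_map<int, double>> graph = train(instances);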

Given a snippet of text from a document :

"BENCHMARK Origination Compliance Verification  Processor and Underwriting Submission Checklist   Property    Builders certification ALL   Builder approval  printout from  website  ALL   Builders warranty ALL ?  Condo approval  printout from website for FHA  ALL ?  Engineers certification ALL   Health inspection ALL ?  Pest inspection ALL   Roof certification   applicable  ALL   Not inspected acknowledgement ALL ?  Septic inspection ALL   Subterranean Termite Soil Inspection ALL    Notice  Value  NOV  ALL   Well inspection ALL   Flood Certification  ? Required Flood certificate ALL   Sales Contract   Required Sales contract  all addenda   applicable  ALL   LDP   ALL   GSA   ALL   Seller credit letter  explanation ALL    Required  everyone who  part  the transaction  title company  escrow officer  listing agent  real estate office for listing agent  buyer  agent  real estate office for buyer  agent  processor  loan officer  borrower   borrower  all seller      Identity Documents"

I want to set the context words to ["benchmark", "compliance", "submission", "checklist"] and verify which target words are predicted :

sort(cpp__getMostSimilarMultiple(model, c("benchmark", "compliance", "submission", "checklist"), similarCounts = 50, maxDepth = 1), decreasing = T)

We compute the top 50 most probable target words for the above context words :

processor     4.141843
verification  4.140408
required      4.110349
origination   4.097028
underwriting  4.090625
loan          4.076904
mortgage      4.069590
last          4.064455
page          4.053817
date          4.032603
form          4.030643
closing       4.025670
lender        4.025247
specific      4.023724
borrower      4.023479
applicable    4.022784
number        4.021188
home          4.020338
property      4.019561
prior         4.019289
state         4.019212
credit        4.018207
disclosure    4.017166

The first column contains the target words and the second column the scores, sorted in decreasing order. The integer part of the scores indicates that all of these words were found in the same context as all of the given (4) context words; for example, the score of 4.141843 for "processor" is made up of a contribution of 1 from each of the four context words plus a total edge weight of about 0.14. As is evident from the scores, the algorithm has recovered the words missing from the context words in the first line of the given text ("processor", "verification", "origination" and "underwriting").

The full code is available on my Github profile.

Doing spell correction is easy, considering that we have already written a function to get the tuples of target and context words (specified by the context size). We will re-use the same function, but only include target words which are not part of an English dictionary.

// [[Rcpp::export]]
List cpp__spellCorrect(List model, std::vector<std::vector<std::string>> documentWords, 
                       std::vector<std::string> vocabulary, 
                       int contextSize=5, int similarCounts=50, int maxDepth=2) {
  
  ModelData modelData = convertModelDataFrame(model);
  
  std::vector<TrainInstance> nonVocab = generateSkipGramTrainingInstances(documentWords, modelData.wordIndex, 
                                                                          contextSize, vocabulary, true);
  
  // Cache of corrections already computed for a given incorrect word.
  std::unordered_map<std::string, std::string> cache;
  
  std::vector<std::string> incorrect, correct;
  std::set<std::string> vocab(vocabulary.begin(), vocabulary.end());
  
  for (size_t i = 0; i < nonVocab.size(); i++) {
    TrainInstance instance = nonVocab[i];
    
    std::string incorrectWord = modelData.invertedWordIndex[instance.givenWordIndex];
    std::string correctWord = incorrectWord;
    
    if (cache.find(incorrectWord) != cache.end()) correctWord = cache[incorrectWord];
    else {
      std::set<int> contextIds = instance.contextWordsIndices;
      
      if (contextIds.size() > 0) {
        std::unordered_map<int, double> out = mostContextualMultiple(modelData.nodes, contextIds, similarCounts, maxDepth);
        
        std::vector<std::pair<int, double>> temp;
        for (auto p = out.begin(); p != out.end(); ++p) temp.push_back(std::make_pair(p->first, p->second));
        
        std::sort(temp.begin(), temp.end(), [](const std::pair<int, double> &a, const std::pair<int, double> &b){return a.second > b.second;});
        
        // Only accept corrections at a Levenshtein distance of 1 or 2.
        int minDistance = 3;
        
        for (size_t j = 0; j < temp.size(); j++) {
          std::string possibleWord = modelData.invertedWordIndex[temp[j].first];
          
          int d = (int)levenshteinDistance(incorrectWord, possibleWord);
          
          if (d > 0 && d < minDistance && (double)d/(double)incorrectWord.size() <= 0.33 
                && (vocab.find(possibleWord) != vocab.end() 
                      || modelData.nodes[modelData.wordIndex[possibleWord]].size() > modelData.nodes[modelData.wordIndex[incorrectWord]].size())) {
                      
            minDistance = d;
            correctWord = possibleWord;
          }
        }
        
        cache[incorrectWord] = correctWord;
        
        if (correctWord != incorrectWord) {
          incorrect.push_back(incorrectWord);
          correct.push_back(correctWord);
        }
      }
    }
  }
  
  // Apply the cached corrections back to the document contents.
  for (size_t i = 0; i < documentWords.size(); i++) {
    std::vector<std::string> words = documentWords[i];
    
    for (size_t j = 0; j < words.size(); j++) {
      std::string word = words[j];
      
      if (cache.find(word) != cache.end()) words[j] = cache[word];
    }
    documentWords[i] = words;
  }
  
  return List::create(_["Corrections"]=DataFrame::create(_["Incorrect"]=incorrect, _["Correct"]=correct), _["UpdatedContents"]=documentWords);
}

The above function takes as input the graph model (as an Rcpp List), the tokenized text contents of each document (contents are split per line), the English dictionary vocabulary used to flag words that need correction, and the other parameters defined the same as above.

First we convert the Rcpp List model into the two-level map C++ data structure for faster execution. Then we obtain the tuples of incorrect target words and the context words in their surroundings. For each tuple, we obtain the possible target words given the context words by calling the "mostContextualMultiple" function.

Based on the Levenshtein distance between the incorrect target word and the possible correct target words, we choose the target word with the minimum distance, but at most a distance of 2 from the incorrect word. In addition, the ratio of the distance to the length of the incorrect word should be at most 0.33, and either the possible target word should be part of the English dictionary or it should have more neighbors in the graph than the incorrect word. The last condition ensures that we do not restrict ourselves to dictionary words only, since proper nouns etc. might not be part of the dictionary, e.g. "wikipedia".
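The levenshteinDistance helper used above is not shown in these snippets; a standard dynamic-programming implementation (an assumed sketch, which may differ in detail from the one in the repository; requires <string>, <vector> and <algorithm>) is :

// Computes the edit (Levenshtein) distance between two strings using two rolling rows.
size_t levenshteinDistance(const std::string &a, const std::string &b) {
  std::vector<size_t> prev(b.size() + 1), curr(b.size() + 1);
  
  for (size_t j = 0; j <= b.size(); j++) prev[j] = j;
  
  for (size_t i = 1; i <= a.size(); i++) {
    curr[0] = i;
    for (size_t j = 1; j <= b.size(); j++) {
      size_t substitutionCost = (a[i-1] == b[j-1]) ? 0 : 1;
      curr[j] = std::min({prev[j] + 1,                      // deletion
                          curr[j-1] + 1,                    // insertion
                          prev[j-1] + substitutionCost});   // substitution
    }
    std::swap(prev, curr);
  }
  return prev[b.size()];
}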

Below listed are some of the possible spelling corrections done by the above algorithm :

cpp__getCorrectedWords(model, documentWords, vocabulary, contextSize = 5, similarCounts = 50, maxDepth = 1)
                 Incorrect                  Correct
1                 properiy                 property
2                    vaiue                    value
3                     ashx                     aspx
4                    ffsec                    ffiec
5                 mdrtgage                 mortgage
6                    yotar                   notary
7                 ariionaf                  arizona
8                 ffecords                  records
9                    orcle                   circle
10                    maii                     mail
11                 citfren                  citizen
12                   mamed                     name
13                 caivrsm                   caivrs
14                 gsaisam                   gsasam
15                 comoanv                  company
16                  amdunt                   amount
17                     paj                      pay
18              ciickforms               clickforms
19                   etiie                    ellie
20                  rriles                    miles
21  comdvwebuserprintprevi comdvwebuserprintpreview
22                   loafi                     loan
23                 rvicing                servicing
24                 borowor                 borrower
25                iroperty                 property
26                  amounr                   amount
27                 avfrnue                   avenue
28            couetruction             construction
29                   tilla                     will
30                seairity                 security
31                  nuoler                   number
32                   dtfve                    drive
33                  mailkg                  mailing
34                 adihtss                 adtirtss
35               atpresent                  present
36                   rento                     rent
37                     yrt                      yrs
38                  addnss                  address
39                     yre                      yrs
40                   dbane                     date
41                     rtv                      rev
42                   multj                     must
43                   npjls                     nmls
44                quaitied                qualified
45                buikiing                 building
46                    wals                    walls
47                     ihc                      inc
48                  hedule                 schedule
49                 ravenue                  revenue
50                  prefit                   profit
51                     los                     loss
52                     auq                      aug
53            nonfinanciai             nonfinancial
54               craditors                creditors
55         consumerflnance          consumerfinance
56                 alamqde                  alamode
57                   xjyes                     gyes
58                     omo                      ono
59                    oygs                     oyes
60                     onq                      ono
61                   matev                     date
62                    fter                      fte
63                moitgage                 mortgage
64                munidpal                municipal
65                oocmagic                 docmagic
66                    seto                      set
67             commonweauh             commonwealth
68                   pubuc                   public
69                     msp                      mip
70               precedinp                preceding
71                  exdude                  exclude
72                bcrrower                 borrower
73                    hcme                     home
74               sigpaturi                signature
75                envefope                 envelope
76                  seiler                   seller
77                   ombnc                      omb
78                     une                      one
79                     oss                     loss
80               commision               commission
81                  dantle                     date
82                btaxable                  taxable
83                ctaxable                 btaxable
84                 proqram                  program
85                     gst                      gse
86            imorovements             improvements
87                 reoairs                  repairs
88                   foans                    loans
89               orecedina                preceding
90                     fxl                      ixl
91                   emaii                    email
92        managemenibudgst         managementbudget
93                    gity                     city
94             dalinquency              delinquency
95                     pra                      pre
96                purchass                 purchase
97              homefauyer                homebuyer
98                  darice                    price
99                   marci                    maria
100                rachael                  michael
101                deutsch                 deutsche
102                  amany                      may
103                robertt                   robert
104                  toney                      one
105                 perksd                   period
106             disoiosure               disclosure
107                  runds                    funds
108                    ipc                     sipc
109                 oldman                  goldman
110                   srpc                     sipc
111                 waitto                     wait
112                  jaime                     name
113               bordowtz                bgrdowitz
114                 encjzo                   encizo
115             santamarta               santamaria
116                stephan                  stephen
117                   fiie                     file
118                   idsc                      ids
119                  julie                     june
120                 fosuua                   joshua
121                  raots                     rate
122                    uey                      uew
123                 datefs                     date
124               presense                 presence
125             dfscfosure               disclosure
126                wojtere                  wolters
127                  kwwer                   kluwer
128           titleprofiie             titleprofile
129                  shite                    state
130                  relrn                    retro
131                    rrg                      reg
132                 resull                   result
133                  fforn                     form
134           atstonecrest               stonecrest
135                  wners                   owners
136                    jan                      can
137                shingie                  shingle
138              manorwgod                manorwood
139                    dbr                      ddr
140                  rfsfk                     rsfr
141                 cordes                     code
142                 gordes                  ccordes
143                  huduw                      hud
144                  thnft                    theft
145                 ahowed                  allowed
146                indings                  hidings

Some of the drawbacks of the graph based approach described above, compared to word vectors, are :

  • The model is not universal. Unlike SVD, LSA, Word2Vec or GloVe, this method does not generate word vectors, and as such the model cannot be re-used in some of the common NLP applications mentioned earlier in the post.
  • The model cannot be used for evaluating syntactic and semantic relationships, unlike Word2Vec or GloVe, which maintain linear relationships as demonstrated earlier with the ("King")-("Man") = ("Queen")-("Woman") example. This kind of evaluation is not possible with the graph based approach, since we are not encoding or projecting the words into an N-dimensional space where we could evaluate vector sums and differences.
  • Since the model is a directed graph (of conditional probabilities), the edge weights are asymmetric and cannot qualify as a distance/similarity metric; they can only be used to determine the conditional probability of a word given contextual words in its surroundings. In other words, we cannot say how similar one word is to another (as with word vectors), but we can say, given one word, what the chances are of another word occurring in the same context. At a high level these may not seem very different, but they are in fact different, given that the same word can have different meanings in different contexts.

Plot of Number of Context Words vs. Frequency

  • The vector dimension of each word in Word2Vec or GloVe is very small compared to the size of the vocabulary. Thus if there are V words in the vocabulary and the dimension of the vectors is N, the space needed to hold the model in memory is O(VN), whereas in the graph based approach, if we assume that on average at least half of the words co-occur with each word in the same context, the worst case space complexity comes out to O(V^2). This is not always the case, as only a "few" words tend to co-occur in the same context in standard English text. As seen from the above plot, the number of context words for a target word follows a sort of power-law distribution, with a mean of 39 different possible context words per target word. The power law can be confirmed with a log-log plot, and a rough comparison of the two memory footprints is given after the plot below.

Log-Log plot with a straight line confirming a power law.
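To put the space comparison in perspective with some assumed numbers, take a vocabulary of V = 100,000 words and vectors of dimension N = 300 :

VN = 10^5 \times 300 = 3 \times 10^7 values for the word-vector model, versus V^2 = 10^{10} edge weights for the graph in the worst case.

In practice though, with the observed average of about 39 context words per target word, the number of edges stays around 39V \approx 3.9 \times 10^6, far below the worst case.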

Categories: MACHINE LEARNING, PROGRAMMING
