Stokastik

Machine Learning, AI and Programming

Spelling correction on OCR outputs

There are two aspects to OCR (Optical Character Recognition) correction. First, if the OCR errors are consistent, i.e. the OCR makes the same mistakes uniformly across multiple documents, and we assume that the documents seen at run time will be very similar to the training documents, then there is probably no need for OCR correction: the OCR will almost certainly make the same mistakes on the testing data, and the mistakes themselves become features.

On the other hand, if the assumption that the testing documents will be of similar quality to the training documents does not hold, then we need to perform OCR correction before training the classification model.

Going by the second hypothesis, we have experimented with a few probabilistic approaches, described below, to correct some of the OCR noise.

Approach 1 :

  • Start with a static dictionary of English words.
  • Compute the frequency of each word in the overall document set.
    • Divide the set of words into three separate bands based on the frequency of words.
    • Let's denote \mu_H as the upper level threshold and \mu_L as the lower level threshold.
    • Words which lie above the upper level threshold \mu_H can be considered as a "correct" word, whereas words lying below the lower level threshold \mu_L can be considered as "incorrect".
  • Words with frequency counts above the upper threshold \mu_H are added to the dictionary. The dictionary now contains words from the English vocabulary as well as words which occur frequently in the documents.
  • For each word lying below the lower threshold \mu_L that is also not part of the English vocabulary, compute the Levenshtein distance of this word from each word in the dictionary.
    • Each output is a 3-tuple (w_i, d_i, f_i) where w_i is the "correct" word from dictionary, d_i is the distance of w_i from the "incorrect" word and f_i is the frequency of w_i.
  • The most probable correction for each possible "incorrect" word is the word from the dictionary with the least distance and the highest frequency.
    • Sort the candidates first on distance, from lowest to highest, and then, for equal distances, on frequency from highest to lowest; map the "incorrect" word to the highest ranked word after both sorts.
    • Generally we only consider "correct" words that are at most edit distance 2 away.
  • For example, if an incorrect word is "mortqage" and the possible corrections (with max edit distance 2) are {("mortgages", 2, 50), ("mortgage", 1, 45), ("mortgagee", 2, 47)}, then sorting first on edit distance from lowest to highest, and then on frequency, returns "mortgage" as the most probable "correct" word (a code sketch of this lookup follows the list).
  • Create a HashMap from each "incorrect" word to the "correct" word as found from the above steps.
    • At the time of prediction, use the HashMap to correct the "incorrect" words, before applying the classification model.
    • For example, if we encounter the word "mortqage" during testing then we could directly use the HashMap to map it to "mortgage" without any computation.
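
The following is a minimal C++ sketch of the lookup step described above, assuming the dictionary is already available as a word-to-frequency map; the names levenshtein, correctWord and dictFreq are illustrative, not part of any library:

#include <algorithm>
#include <string>
#include <tuple>
#include <unordered_map>
#include <vector>

// Standard dynamic-programming Levenshtein (edit) distance between two words.
int levenshtein(const std::string &a, const std::string &b) {
  std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1, 0));
  for (int i = 0; i <= (int)a.size(); i++) d[i][0] = i;
  for (int j = 0; j <= (int)b.size(); j++) d[0][j] = j;
  for (int i = 1; i <= (int)a.size(); i++) {
    for (int j = 1; j <= (int)b.size(); j++) {
      int cost = (a[i-1] == b[j-1]) ? 0 : 1;
      d[i][j] = std::min({d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost});
    }
  }
  return d[a.size()][b.size()];
}

// Map an "incorrect" word to its most probable correction: consider dictionary words
// within edit distance 2, sort on distance (lowest first) and then frequency (highest first).
std::string correctWord(const std::string &incorrect,
                        const std::unordered_map<std::string, int> &dictFreq) {
  // Store (distance, -frequency, word) so a plain lexicographic sort gives the desired order.
  std::vector<std::tuple<int, int, std::string>> candidates;
  for (const auto &entry : dictFreq) {
    int dist = levenshtein(incorrect, entry.first);
    if (dist <= 2) candidates.push_back(std::make_tuple(dist, -entry.second, entry.first));
  }
  if (candidates.empty()) return incorrect;  // nothing close enough, leave the word untouched
  std::sort(candidates.begin(), candidates.end());
  return std::get<2>(candidates.front());
}

With the frequencies from the example above, correctWord("mortqage", dictFreq) would return "mortgage", since it has the smallest edit distance.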

Although the above approach works well for many incorrect words, there are several problems encountered during its implementation.

  • Not knowing the values of \mu_H and \mu_L beforehand.
    • Since the above approach is unsupervised, we have to experiment with the above two unknown parameters to obtain the best results.
  • Time complexity.
    • Since we need to compute the Levenshtein distance of each "incorrect" word against each "correct" word in the dictionary, if there are 'm' incorrect words, 'n' words in the dictionary and each word has length 'd', the worst case time complexity is O(d^2*m*n).
    • Since incorrect words may be repeated across documents, and each incorrect word will "most likely" map to the same correct word everywhere, it is a good idea to cache the mapping from incorrect word to correct word as we encounter them, so that every time we see the same incorrect word again, we use the cache to map it directly to the "corrected" word from an earlier encounter (a sketch of such a cache follows this list).
  • High number of false positives.
    • Although we are able to correct many incorrect word instances, several correct words are also wrongly "corrected" because their frequency of occurrence is lower than \mu_L.
    • For example "lien" is a completely valid financial term, but since the term was not present in the vocabulary and the frequency of "lien" was quite low, it was corrected to "line".
  • Need for more data and a better dictionary.
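
A minimal sketch of the incorrect-to-correct word cache mentioned above, assuming the correctWord function from the earlier sketch; cachedCorrection and correctionCache are illustrative names:

#include <string>
#include <unordered_map>

// correctWord as defined in the earlier sketch.
std::string correctWord(const std::string &incorrect,
                        const std::unordered_map<std::string, int> &dictFreq);

// Return the cached correction for an incorrect word, computing and memoizing it on first use,
// so that a repeated incorrect word costs a single hash lookup instead of a full dictionary scan.
std::string cachedCorrection(const std::string &incorrect,
                             const std::unordered_map<std::string, int> &dictFreq,
                             std::unordered_map<std::string, std::string> &correctionCache) {
  auto it = correctionCache.find(incorrect);
  if (it != correctionCache.end()) return it->second;
  std::string corrected = correctWord(incorrect, dictFreq);
  correctionCache[incorrect] = corrected;
  return corrected;
}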

Approach 2 :

The approach below is also unsupervised, with pre-defined values of \mu_L and \mu_H, but here we compute word frequencies based on neighboring-word information; we found it to be much more reliable than the previous approach.

  • Start with a static dictionary of English words.
  • Compute word transition probabilities.
    • Let us denote P_F(Y|X) as the conditional probability of the word Y given that the previous word is X.
    • Similarly we denote P_B(X|Y) as the conditional probability of the word X, given that the next word is Y.
    • We call them the forward and backward probabilities respectively.
    • If a word is located at the start or at the end, we compute its start and end probabilities respectively, i.e. P_F(Y|START) and P_B(X|END).
  • For each word, compute the forward probability from previous word (or START) and the backward probability from the next word (or END).
    • If both these probabilities lie below \mu_L and also the word is not present in the vocabulary, then we consider this word as "incorrect".
    • If this word is part of the vocabulary then do not correct it.
  • Given a sequence of words "X Y Z", if Y is an incorrect word as defined by the previous step, then compute the set of all possible words {X'} following X and the set of all possible words {Z'} before Z.
    • Consider the common words between {X'} and {Z'}, i.e. the intersection X'{\cap}Z'. Let us call this intersection the set {Y'}.
    • For each possible word in the set return a 3-tuple (y'_i, d_i, f_i), where y'_i is a word from the set {Y'}, d_i is the Levenshtein distance of Y from y'_i and f_i is the average probability i.e.

f_i=\frac{P_F(y'_i|X)+P_B(y'_i|Z)}{2}

  • Sort the outputs obtained above, first on the distances d_i from lowest to highest, and then on the average probability values f_i from highest to lowest.
    • Correct Y to the word with the minimum distance (preferably 1) and the highest average forward and backward probability.
  • For example, if in the phrase "Appraisal Notice" the word "Notice" is misspelled as "Notjce", then from the forward transition probabilities of "Appraisal" we would know that "Notice" (at distance 1 from "Notjce") is the most probable correction, since its probability is much higher than that of "Notjce" (see the sketch after this list).
  • We store the forward as well as the backward transition probabilities as HashMaps.
    • At run time, when we do spelling correction on testing data, we re-use the computed transition probabilities to compute the possible corrections.
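
A minimal C++ sketch of the context-based lookup in Approach 2, assuming the forward and backward transition probabilities are stored as nested HashMaps with fwdProb[X][Y'] = P_F(Y'|X) and bwdProb[Z][Y'] = P_B(Y'|Z); correctInContext is an illustrative name and levenshtein is the function from the earlier sketch:

#include <string>
#include <unordered_map>

int levenshtein(const std::string &a, const std::string &b);  // as in the Approach 1 sketch

using TransMap = std::unordered_map<std::string, std::unordered_map<std::string, double>>;

// Given the sequence "X Y Z", where Y is suspected to be incorrect, return the best replacement:
// a word that both follows X and precedes Z, at edit distance at most 2 from Y, ranked first on
// distance (lowest first) and then on the average of its forward and backward probabilities.
std::string correctInContext(const std::string &X, const std::string &Y, const std::string &Z,
                             const TransMap &fwdProb, const TransMap &bwdProb) {
  auto fwdIt = fwdProb.find(X);
  auto bwdIt = bwdProb.find(Z);
  if (fwdIt == fwdProb.end() || bwdIt == bwdProb.end()) return Y;

  int bestDist = 3;
  double bestScore = -1.0;
  std::string best = Y;   // fall back to the original word if no candidate is found

  // The candidate set {Y'} is the intersection of words following X and words preceding Z.
  for (const auto &cand : fwdIt->second) {
    auto back = bwdIt->second.find(cand.first);
    if (back == bwdIt->second.end()) continue;
    int dist = levenshtein(Y, cand.first);
    if (dist == 0) return Y;   // Y itself fits the context, keep it
    if (dist > 2) continue;    // only consider nearby candidates
    double score = 0.5 * (cand.second + back->second);   // f_i, the average probability
    if (dist < bestDist || (dist == bestDist && score > bestScore)) {
      bestDist = dist;
      bestScore = score;
      best = cand.first;
    }
  }
  return best;
}

Candidate generation is restricted to words that have actually been observed after X and before Z, which is what keeps the search space small compared to Approach 1.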

In both the above approaches, we lower-case the contents of both the documents and the vocabulary so that we are doing case-insensitive comparison and search.

The 2nd approach gives much more convincing results than the 1st one, since we take some context of the word into account while doing the spelling correction.

Some of the problems we faced while implementing approach 2 were as follows :

  • The problem of unknown parameters \mu_H and \mu_L remains, since this approach is also unsupervised, and we thus need to experiment with these two parameters to obtain optimum results.
  • Running time complexity.
    • In the 2nd approach, we do not need to compute the Levenshtein distance of the incorrect word against all possible correct words; we only need the distances between the incorrect word and the words that can occur between the previous and the next word. The search space is thus drastically reduced.
    • Unlike in the 1st approach, where we mapped an incorrect word to a correct word on the assumption that the same incorrect word always maps to the same correct word, that assumption may not hold in the 2nd approach, since the correction also depends on the previous and the next words in the sequence.
    • Instead of mapping an incorrect word to a correct word, we use a distance cache that stores the distance for each incorrect-correct pair of words. This saves us from recomputing Levenshtein distances (a sketch of such a cache follows this list).
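
A minimal sketch of the distance cache, keyed on the incorrect-candidate pair; the '|' separator and the name cachedDistance are illustrative assumptions:

#include <string>
#include <unordered_map>

int levenshtein(const std::string &a, const std::string &b);  // as in the earlier sketch

// Memoize Levenshtein distances per (incorrect, candidate) pair so that the same pair of
// words never has its distance recomputed across documents.
int cachedDistance(const std::string &incorrect, const std::string &candidate,
                   std::unordered_map<std::string, int> &distCache) {
  std::string key = incorrect + "|" + candidate;   // assumes '|' never appears inside a token
  auto it = distCache.find(key);
  if (it != distCache.end()) return it->second;
  int dist = levenshtein(incorrect, candidate);
  distCache[key] = dist;
  return dist;
}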

Some possible improvements that could further raise the quality of the spelling corrections are :

  • Use stemming.
    • Stem the words in the text documents.
    • While doing the initial lookup to verify if the word is part of the vocabulary or not, use both the stemmed and un-stemmed version of the word.
    • For example, the word "notices" may not be present in the vocabulary while "notice" is; if the word is not stemmed, we would wrongly conclude that "notices" is a spelling error (see the sketch after this list).
  • Remove common stop-words.
  • Remove non-alphabetic characters from the text documents.
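
A minimal sketch of the stemmed-plus-unstemmed vocabulary lookup; the suffix stripper below is deliberately naive and only stands in for a real stemmer (e.g. a Porter stemmer):

#include <set>
#include <string>

// Very naive suffix stripping, used here only for illustration: drop a single trailing 's'
// (so "notices" -> "notice"); a real implementation would use a proper stemming algorithm.
std::string naiveStem(const std::string &w) {
  if (w.size() > 2 && w.back() == 's') return w.substr(0, w.size() - 1);
  return w;
}

// A word is treated as known if either the word itself or its stem is in the vocabulary,
// so that e.g. "notices" is not flagged as a spelling error when only "notice" is listed.
bool isKnownWord(const std::string &w, const std::set<std::string> &vocab) {
  return vocab.count(w) > 0 || vocab.count(naiveStem(w)) > 0;
}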

Identifying missed spaces.

A missing blank space between words is also a serious source of OCR noise. One way to handle missing spaces is to break each word at all possible indices and check whether each component is part of the vocabulary; if so, we insert spaces between the components. A dynamic programming approach can be used to identify missed spaces.

For example, let H(i, j) be 1 if the substring from i to j of the word can be broken down into valid components, and 0 otherwise. One can conclude recursively that :

H(i, j)=||_k\;H(i, k)*H(k+1, j)

where ||_k is OR-ing over the indices i <= k < j.

What the above recursive relation implies is that the substring from i to j can be broken down into valid components, if we can find any point k in between i and j, such that the substring i to k and substring k+1 to j can be both broken down into valid components themselves.

For example, given the string "moviesiwanttowatch", the possible values of k (1-based) for which it can be broken down into valid components are 6 (after "movies"), 7 (after "i"), 11 (after "want") and 13 (after "to").

If we denote by G(i, j), a vector to store the component words for the substring i to j, then G(i, j) is the concatenation of the vectors G(i, k) and G(k+1, j) if H(i, j)=1.

The following C++ code illustrates the missed-space correction algorithm :

#include <limits>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

std::string spaceCorrectedWord(std::string incorrectWord, const std::set<std::string> &vocab, 
                               std::unordered_map<std::string, std::unordered_map<std::string, double>> &wordTransFreqFwd) {

  // indicator[i][j] is H(i, j): true if the substring from i to j can be broken into valid components.
  std::map<int, std::map<int, bool>> indicator;
  // components[i][j] is G(i, j): the component words of the substring from i to j.
  std::map<int, std::map<int, std::vector<std::string>>> components;
  
  // Solve sub-problems in increasing order of substring length.
  for (int length = 1; length <= (int)incorrectWord.size(); length++) {
    for (int i = 0; i <= (int)incorrectWord.size()-length; i++) {
      int start = i;
      int end = start+length-1;
      
      indicator[start][end]=false;
      
      std::string subIncorrectWord=incorrectWord.substr(start, length);
      
      // Case 1: the substring itself is a valid vocabulary word.
      if (vocab.find(subIncorrectWord) != vocab.end()) {
        indicator[start][end]=true;
        components[start][end].push_back(subIncorrectWord);
      }
      // Case 2: try every split point and keep the one with the highest forward transition frequency.
      else if (length > 1) {
        double maxFreq = std::numeric_limits<double>::lowest();
        int split = -1;
        
        for (int j = start; j < end; j++) {
          bool a = indicator[start][j];
          bool b = indicator[j+1][end];
          
          if (a && b) {
            std::string x = components[start][j].back();
            std::string y = components[j+1][end].front();
            
            // The split is valid only if the forward transition x -> y has been observed.
            bool c = wordTransFreqFwd.find(x) != wordTransFreqFwd.end() && wordTransFreqFwd[x].find(y) != wordTransFreqFwd[x].end();
            
            if (c) {
              indicator[start][end]=true;
              
              if (wordTransFreqFwd[x][y] >= maxFreq) {
                maxFreq = wordTransFreqFwd[x][y];
                split = j;
              }
            }  
          }
        }
        
        // Record the component words of the best split found.
        if (indicator[start][end]) {
          components[start][end].insert(components[start][end].end(), components[start][split].begin(), components[start][split].end());
          components[start][end].insert(components[start][end].end(), components[split+1][end].begin(), components[split+1][end].end());
        }
      }
    }
  }
  
  // Components of the full word (indices 0 to size-1), if a valid split was found.
  std::vector<std::string> allComponents = components[0][(int)incorrectWord.size()-1];
  
  std::string out = "";
  
  if ((int)allComponents.size() > 0) {
    for (int i = 0; i < (int)allComponents.size()-1; i++) out += allComponents[i] + " ";
    out += allComponents[(int)allComponents.size()-1];
  }
  else out = incorrectWord;
  
  return out;
}

To verify whether a given split is valid, apart from checking whether each component word is part of the vocabulary, the transition probability from the last word of the left half to the first word of the right half must be non-zero for the split point to be valid.

For example, for the word "machinelearningalgorithmbook", if we look at the split point at index 15, the left half has the components ["machine", "learning"] and the right half has the components ["algorithm", "book"]; the split is valid only if the transition probability from "learning" to "algorithm" is non-zero.

Some corrected spelling errors are shown below; the 2nd column is the incorrect spelling and the 3rd column is the corrected one :

54348               sourcefs                   source
2124               logicease               logic ease
6158            changeamount            change amount
48802 qualityconditionmarket quality condition market
27504         chargescredits          charges credits
23359              floodcert               flood cert
36153             transunion              trans union
20268             adiustmert               adjustment
28698            transadions             transactions
11790          homeownership           homeowner ship
38558                derrfem                  derrien
8953         consumerfinance         consumer finance
49619          paymentmonths           payment months
16817              corelogic               core logic
23009              realquest               real quest
53824                 typefs                     type
19922           lenderbroker            lender broker
40696                 resuto                  resufts
18466                  lalse                    false
47407                 kicome                   income
2551              denartment               department
22032            entryunable             entry unable
48910             transunion              trans union
20669               apraisal                appraisal
8977         consumerfinance         consumer finance
20487             dataverify              data verify
9572             fosterchild             foster child
4803         consumerfinance         consumer finance
9154                   lalso                    false
8726              prelimhary              preliminary
11854               saietime                sale time
34399                 seiler                   seller
16886              corelogic               core logic
11780       managementbudget        management budget
51070              scheduleb                 schedule
12847             dataverify              data verify
6502            sellercredit            seller credit
36792               conlracl                 contract
45760                tombail                  tomball
41051                 indude                  include
31468             transunion              trans union
22586        foreclosurebank         foreclosure bank
47288               ethnidty                ethnicity
25584           leadasbestos            lead asbestos
44415       managementbudget        management budget
34266                  tltfe                    title
43187                chlcaao                  chicago
35762        approveeligible         approve eligible
40005                engiish                  english
3511            sponsoragent            sponsor agent
31865              nonflllng                nonfiling
23876              greenpath               green path
20511             flproperty                 property
33226             transunion              trans union
28357              delbquent               delinquent
27313            transmittad              transmittal
5614        motivationbudget        motivation budget
39768               mortoage                 mortgage
16379               loansafe                loan safe
23788            agentbrcker             agent broker
28452                  vvest                     west
25298            perfotmance              performance
15634          probateestate           probate estate
31688               completo                 complete
18452            grantthomas             grant thomas
36923                privato                  private
41265      churchillmortgage       churchill mortgage
20673             dataverify              data verify
33267         lenderservicer          lender servicer
48129                 equire                  require
1201              sourcetype              source type
8844                ividends                dividends
28941                  shail                     shal
16573               loansafe                loan safe
20510             dataverify              data verify
45074                  ublic                   public
8738         consumerfinance         consumer finance
14555           documentprep            document prep
20273          garagecarport           garage carport
47703         buildingmobile          building mobile
42522               liabiity                liability
40370        consumerfinance         consumer finance
26606               gainloss                gain loss
51219                  impli                  implied
28427               coborrow               coborrower
36709          bonanzaavenue           bonanza avenue
13013              daystotal               days total
12308             dataverify              data verify
33475               gilliand                 gilliard
762       mortgagecalculator      mortgage calculator
42497                  tttle                    title
40318                escroxv                   escrow
23719               ffiliate                affiliate
51148                micheue                 michelle
37124          paymentmonths           payment months
4307                  vestor                 investor
25406                 pruoha                   prucha
52985               typerate                type rate
43071               industny                 industry
31659                spedfic                 specific


Categories: MACHINE LEARNING
