Stokastik

Machine Learning, AI and Programming

Tag: Spelling Correction

Generative vs. Discriminative Spell Corrector

We have earlier seen two approaches to spelling correction in text documents. Most spelling errors are encountered either in user-generated content or in OCR outputs of document images. The presence of spelling errors introduces noise into the data, and as a result the impact of important features gets diluted. Although the two methods differ in how they are implemented, theoretically both work on the same principle. […]

Continue Reading →
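
Since the excerpt is cut off before the principle is spelled out, here is only a rough sketch of what a generative spell corrector commonly looks like: a noisy-channel style corrector that picks, among edit-distance-1 variants of a misspelled word, the candidate with the highest corpus frequency. The toy corpus and function names below are my own assumptions, not code from the post.

```python
# Minimal noisy-channel style spell corrector sketch (assumed formulation,
# not necessarily the exact approach described in the post).
from collections import Counter
import re

# Toy corpus; in practice this would be a large text collection.
CORPUS = "the quick brown fox jumps over the lazy dog the fox"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known candidate: the word itself, else an edit-1 variant."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print(correct("foz"))  # -> "fox" with this toy corpus
```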

Designing a Contextual Graphical Model for Words

I have been reading about Word Embedding methods that encode words found in text documents into multi-dimensional vectors. The purpose of encoding words into vectors is to give "meaning" to words or phrases in a context. Traditional methods of document classification treat each word in isolation or at most use an N-gram approach, i.e. in vector space the words are represented as one-hot vectors, which are sparse and do not convey any meaning, whereas […]

Continue Reading →
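
To make the contrast in the excerpt concrete, here is an illustrative sketch (not the post's own model) of why one-hot vectors carry no notion of similarity while dense embeddings can. The vocabulary and the random embedding matrix below are placeholders; in practice the dense vectors are learned from word co-occurrence in context.

```python
# One-hot vectors vs. dense embeddings (illustrative sketch only).
import numpy as np

vocab = ["king", "queen", "apple", "orange"]
index = {w: i for i, w in enumerate(vocab)}

# One-hot: every word is orthogonal to every other, so no "meaning" is encoded.
one_hot = np.eye(len(vocab))

# Dense embeddings: low-dimensional real-valued vectors (random stand-ins here;
# real ones would be trained, e.g. by word2vec or a contextual graphical model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 3))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot similarity is 0 for any two distinct words, regardless of meaning.
print(cosine(one_hot[index["king"]], one_hot[index["queen"]]))   # 0.0
# Dense vectors can express graded similarity once trained on real data.
print(cosine(embeddings[index["king"]], embeddings[index["queen"]]))
```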

Spelling correction on OCR outputs

There are two aspects to OCR (Optical Character Recognition) correction. The first is that if the OCR errors are consistent, i.e. the engine makes the same mistakes uniformly across multiple documents, then, assuming the training documents are similar to what we expect at run time, there is probably no need for OCR correction, as the OCR will almost certainly make the same mistakes in the […]

Continue Reading →
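
As a toy illustration of the "consistent error" argument in the excerpt (my own sketch, not code from the post): if a hypothetical OCR engine always corrupts text in the same way, the corrupted tokens in training and run-time documents still match each other, so bag-of-words features remain aligned without any correction.

```python
# Consistent OCR errors preserve feature overlap (toy sketch, assumed example).
from collections import Counter

def ocr(text):
    """Hypothetical OCR engine that consistently confuses 'e' with 'c'."""
    return text.replace("e", "c")

train_doc = "the patient was seen by the doctor"
runtime_doc = "the doctor has seen the patient"

train_features = Counter(ocr(train_doc).split())
runtime_features = Counter(ocr(runtime_doc).split())

# Tokens like "thc" and "sccn" are wrong, but they are wrong identically in
# both documents, so the feature overlap a classifier relies on is preserved.
print(set(train_features) & set(runtime_features))
```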