Machine Learning, AI and Programming

Dynamic Programming in Natural Language Processing - Skip Grams

In this series of posts, I will be exploring and showing how dynamic programming technique is used in machine learning and natural language processing. Dynamic programming is very popular in programming interviews. DP technique is used mainly in problems which has optimal substructure and can be defined recursively, which means that a problem of size N can be solved by solving smaller sub-problems of size m < N. Such problems can be defined recursively.

For example, take a simple example of graph traversal. We have a NxN grid G filled with some values, starting at (0,0), we can go either right or down and while passing a point, we collect that value. What is the maximum value we can collect while traveling from (0,0) to (N-1, N-1).

The brute force approach, would be to simulate every possible path from (0,0) to (N-1, N-1). Since at each point in a path we have two choices, either right or down and the maximum path length is 2N-1, the total number of paths would be O(2^N), i.e. exponential. Instead we observe that if we already have calculated the maximum value from any point (i, j) to (N-1, N-1), D[i][j], then irrespective of any path we take to reach (i, j) from (0,0), the maximum value from (i, j) to (N-1, N-1) would always be D[i][j] and this value can be re-used.

Thus, we can define recursively :

D[p][q] = max(D[p+1][q], D[p][q+1]) + G[p][q], where G[p][q] is the value at point (p,q)

A Python implementation for the following :

def max_val_recursion(grid, start, cached):

    if start == (len(grid) - 1, len(grid) - 1):
        cached[start[0]][start[1]] = grid[start[0]][start[1]]

        p, q = (start[0] + 1, start[1]), (start[0], start[1] + 1)

        if p[0] < len(grid):
            if cached[p[0]][p[1]] != -1:
                a = grid[start[0]][start[1]] + cached[p[0]][p[1]]
                a = grid[start[0]][start[1]] + max_val_recursion(grid, p, cached)
            a = 0

        if q[1] < len(grid):
            if cached[q[0]][q[1]] != -1:
                b = grid[start[0]][start[1]] + cached[q[0]][q[1]]
                b = grid[start[0]][start[1]] + max_val_recursion(grid, q, cached)
            b = 0

        cached[start[0]][start[1]] = max(a, b)

    return cached[start[0]][start[1]]

def maximum_value(grid):
    cached = [[-1] * len(grid)] * len(grid)
    return max_val_recursion(grid, (0,0), cached)

grid = [[1,7,3,5],

The run-time complexity is O(N^2). The matrix 'cached' is the matrix D defined in our equation above.

Many tasks in Natural Language Processing requires solution that could be easily implemented using the concept of Dynamic Programming. There are many problems that require traversal and finding best combination of values, similar to the one in our above example. Problems such as the Viterbi Algorithm, the Forward Backward Algorithm, the Inside Outside algorithm and so on. But we start with some simpler problems in this series.

Although there could be many problems that I am not aware of, but I will be working with problems, that I have personally encountered as a machine learning engineer. The first one is generating skip gram features from a sequence of words. Skip Grams are similar to n-Grams but unlike n-Grams, where the n words are contiguous words from the original sequence, in skip Gram the n words need not be contiguous. For example, given the sentence "dynamic programming is a cool technique", the 3-grams would be :

[(dynamic programming is), (programming is a), (is a cool), (a cool technique)]

whereas all the skip grams of length 3 would be :

[(dynamic programming is), (dynamic programming a), (dynamic programming cool), (dynamic programming technique), (dynamic is a), (dynamic is cool), ... and so on]

It is quite obvious that while the number of n-Grams is O(N), where N is the number of words, the number of skip grams of length n is O(N^n) (N chooses n). To reduce the number of features, we generally use an extra "maximum skip" parameter, i.e. the maximum number of words we are allowed to skip each time. So a skip of 0, would just return the normal n-Grams, whereas a maximum skip of 1 would return :

[(dynamic programming is), (dynamic programming a), (dynamic is a), (dynamic is cool), (programming is a), (programming is cool), (programming a cool), ... and so on]

While generating the n-Grams is easy, and requires very minimal python programming to do so.

def ngrams(words, n):
    return [" ".join(words[pos : pos + n]) for pos in range(len(words) - n + 1)]

words = "dynamic programming is a cool technique".split()
print ngrams(words, 3)

Generating the skip-grams of length n and maximum skip of m, requires defining a recursive relation. Let us define H[k][w] to be all skip-grams of length w (with maximum skip of m) starting at the position k in the input sequence of words. Then to generate all skip grams of length w+1, ending at position k, we can define :

H[k][w+1] = [H[k+1][w] || H[k+2][w] || ...H[k+m+1][w]] + S[k], where || is the concatenation of vectors operation.

Basically, what it says is that, in order to generate skip grams of length w+1 starting at position k, first take all the skip grams of length w starting at positions k+1, k+2, ..., k+m+1 and then concatenate them and then add the word at k-th position at the beginning of each of those w-skip grams. For example, to find all skip grams of length 3 starting at position 0 (at word "dynamic") in the below string, the skip grams of length 2 starting at position 1 and 2 (for m=1) :

[(programming is), (programming a), (is a), (is cool)]

add the word "dynamic" at the beginning of each one of the above skip grams.

The final list of all skip grams of length n, is obtained by concatenating : H[0][n] || H[1][n] || ... H[N-n][n]

Below is the python code to generate the same.

import collections

def skip_grams_recursion(words, pos, n, m, cached):
    if n == 1:
        cached[(pos, n)] = [words[pos]]

        last_sg = []

        for idx in range(1, m + 2):
            if pos + idx < len(words) and n > 1:
                if (pos + idx, n - 1) in cached:
                    last_sg += cached[(pos + idx, n - 1)]
                    last_sg += skip_grams_recursion(words, pos + idx, n - 1, m, cached)

        out = [words[pos] + " " + x for x in last_sg]
        cached[(pos, n)] = out

    return cached[(pos, n)]

def skip_grams(sentence, n, m):
    words = sentence.split()
    cached = collections.defaultdict(list)

    out = []
    for idx in range(len(words) - n + 1):
        out += skip_grams_recursion(words, idx, n, m, cached)

    return out

print skip_grams("dynamic programming is a cool technique", 3, 1)

Output :
['dynamic programming is', 'dynamic programming a', 'dynamic is a', 'dynamic is cool', 'programming is a', 'programming is cool', 'programming a cool', 'programming a technique', 'is a cool', 'is a technique', 'is cool technique', 'a cool technique']

The run-time complexity is the same as the number of skip grams of length n (of maximum skip m). The variable 'cached' serves the purpose of memoization in DP, which is same as H[k][w] defined above. Instead of doing redundant computation, this variable serves to cache those outputs one time.

After we have a function, that accepts a sentence and returns the skip grams for that sentence, we can use this function as a callback for TF-IDF feature matrix in various NLP tasks.

from sklearn.feature_extraction.text import TfidfVectorizer

def get_tfidf_matrix(sentences):
    vectorizer = TfidfVectorizer(tokenizer=skip_grams, ngram_range=(1,1), token_pattern='\b\w[\w ]+\b')
    matrix = vectorizer.fit_transform(sentences)
    return matrix, vectorizer

Note that the token_pattern parameter in TFIDF Vectorizer is modified to exclude splitting on spaces again because our skip grams are space separated.

Also note that the number of skip grams can be huge sometimes depending on the values of n and m. So with larger sentences, often it might happen that TFIDF vectorizer throws "MemoryError" if insufficient CPU is available.

In the next part of this series we will look at Longest Common Subsequence (LCS) problem, and how we can modify and use LCS for string similarity in various NLP tasks.


Tags: , , , ,

Leave a Reply