Stokastik

Machine Learning, AI and Programming

Building a classification pipeline with C++11, Cython and Scikit-Learn

We have seen earlier how Cython can speed up Python code by 50-60x, mostly thanks to static typing in place of pure Python's dynamic typing. We have also seen how one can wrap pure C++ classes and functions with Cython and export them as Python packages with improved speed. The code we have dealt with so far in Cython covered mostly generic modules, such as generating primes with the sieve approach and a modified priority queue data structure supporting O(logN) deletion and updates. In this post we look at a more specific machine learning problem that needs to be deployed in a production environment serving real-time predictions to customers.

We will be building an end-to-end text classification pipeline using scikit-learn, cython and c++. The problem is to predict the value of certain attributes of e-commerce products, such as 'color', from the product title and description. The important steps we will follow are:

  • Create a 'CustomFeatureTransformer' class. This class is responsible for all feature-related functionality such as feature selection, pre-processing, tokenization etc., and returns a TF-IDF weighted document-term matrix.
  • Create a 'CustomClassifier' class. This class is responsible for all model-related functionality such as fitting a base estimator to the training document-term matrix and the labels, computing class probabilities, predicting on test data and returning class labels.
  • A main python script 'main.py' - reads the data from a CSV file, splits it into training and validation sets, calls 'CustomFeatureTransformer' to compute the document-term matrix, calls 'CustomClassifier' to fit the model and so on (a minimal sketch of this script follows below).
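
Before diving into the components, here is a minimal sketch of what 'main.py' could look like. The CSV path, column names and split ratio are placeholder assumptions, not the actual dataset schema, and the Python 3 byte-string caveat discussed later is ignored for brevity:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from CustomFeatureTransformer import CustomFeatureTransformer
from CustomClassifier import CustomClassifier

# Hypothetical schema: 'text' holds title + description, 'color' is the target attribute
df = pd.read_csv('products.csv')
train_x, valid_x, train_y, valid_y = train_test_split(df['text'], df['color'], test_size=0.2)

# Fit the transformer on the training text, then the classifier on the document-term matrix
transformer = CustomFeatureTransformer(num_features=100, min_ngram=1, max_ngram=3)
transformer.fit(list(train_x), list(train_y))

clf = CustomClassifier(LogisticRegression())
clf.fit(transformer.transform(list(train_x)), list(train_y))
print(clf.score(transformer.transform(list(valid_x)), list(valid_y)))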

We will be writing the logic for the 'CustomFeatureTransformer' class in pure C++ and then export the C++ class, headers and functions to Cython just as we saw in our earlier post on the modified priority queue data structure.

We write the 'CustomClassifier' class in Cython itself, since the core of the classifier logic lies in the base estimator, such as Logistic Regression or a Gradient Boosted Tree. We will not re-write these algorithms from scratch in C/C++, but simply add static typing to the class, its methods and its variables.

Following are the steps we will follow while creating the 'CustomFeatureTransformer' class:

  • There should be only two methods 'fit' and 'transform' that will be exposed to external APIs.
  • 'fit' method will accept the input text as an array of strings and the class labels also as an array of strings.
  • 'transform' method will accept an array of input strings and return a sparse document term matrix.
  • The 'fit' method:
    • Breaks each input sentence into an array of words (after removing stop-words).
    • Creates 1-grams to 3-grams as tokens.
    • Selects the top N tokens based on the mutual information score of the tokens with the class labels and lastly
    • Computes the IDF values of the final tokens.

Here is the 'FeatureTransformer.h' header that we are going to use:

#ifndef FEATURETRANSFORMER_H
#define FEATURETRANSFORMER_H

#include <iostream> 
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <queue>
#include <boost/algorithm/string.hpp>
#include <regex>
#include <string>
#include <algorithm>
#include <cmath>

struct SparseMat {
    std::vector<int> row;
    std::vector<int> col;
    std::vector<double> data;
};

typedef std::vector<std::string> str_vec;
typedef std::unordered_map<std::string, int> str_int_map;
typedef std::unordered_map<std::string, double> str_dbl_map;
typedef std::vector<std::vector<std::string>> str_2_vec;

class FeatureTransformer {
    private:
        int min_ngram;
        int max_ngram;
        int vocab_size;
        str_vec stop_words;
    
        str_2_vec tokenize (str_vec &sentences);
        str_vec ngrams (str_vec &tokens);
        str_2_vec ngram_tokenize (str_2_vec &tokenized_sents);
        str_vec mutual_info_select (str_2_vec &tokenized_sents, str_vec &labels);
        str_dbl_map compute_idf (str_2_vec &tokenized_sents);
        SparseMat compute_tf_idf (str_2_vec &tokenized_sents);
        
    
    public:
        str_vec vocab;
        str_int_map vocab_inv_index;
        str_dbl_map idf_scores;
    
        FeatureTransformer();
        FeatureTransformer(str_vec x_stop_words, int x_vocab_size, int x_min_ngram, int x_max_ngram);
        ~FeatureTransformer();
        void fit(str_vec sentences, str_vec labels);
        SparseMat transform(str_vec sentences);
};

#endif

We will be using C++11 throughout this demonstration, as it provides useful features such as lambda functions. We will also rely quite heavily on the C++ Boost library for certain string utilities. Later we will show how to set up C++11 and Boost for building the cython package; especially if you are developing on macOS, this can sometimes cause issues.

Note the 'SparseMat' struct, which represents a sparse matrix in triplet format. This is our output data structure for the feature matrix. The constructor accepts the stop-words, the desired vocabulary size and the range of n-gram sizes to be used. Except for the 'fit' and 'transform' methods, all other methods are made 'private'.
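
As a quick illustration with toy numbers (using scipy's equivalent COO layout, which is exactly what we convert from later in the Cython wrapper), the 2x3 matrix [[0, 5, 0], [3, 0, 0]] would be stored in triplet form as:

import scipy.sparse as sp

# Triplet (COO) representation of [[0, 5, 0], [3, 0, 0]]
row, col, data = [0, 1], [1, 0], [5.0, 3.0]
mat = sp.coo_matrix((data, (row, col)), shape=(2, 3))
print(mat.toarray())   # [[0. 5. 0.], [3. 0. 0.]]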

Next we define each of these methods in the 'FeatureTransformer.cpp' file:

The constructor and the destructor:

FeatureTransformer::FeatureTransformer () {}

FeatureTransformer::FeatureTransformer (str_vec x_stop_words, int x_vocab_size, int x_min_ngram, int x_max_ngram) {
    this->stop_words = x_stop_words;
    this->vocab_size = x_vocab_size;
    this->min_ngram = std::max(1, x_min_ngram);
    this->max_ngram = std::max(x_min_ngram, x_max_ngram);
}

FeatureTransformer::~FeatureTransformer () {}

The method 'tokenize' to split each sentence into words:

str_2_vec FeatureTransformer::tokenize (str_vec &sentences) {
    std::transform(sentences.begin(), sentences.end(), sentences.begin(), [](std::string x) {return std::regex_replace(x, std::regex("<[^<]+?>|[^\\w\\-]+"), " ");});
    std::transform(sentences.begin(), sentences.end(), sentences.begin(), [](std::string x) {boost::algorithm::to_lower(x); return x;});
    
    std::unordered_set<std::string> stops(this->stop_words.begin(), this->stop_words.end());
    
    str_2_vec tokenized_sents;
    std::string delimiters(" ");
    
    for (size_t i = 0; i < sentences.size(); i++) {
        str_vec words;
        boost::split(words, sentences[i], boost::is_any_of(delimiters));
        std::transform(words.begin(), words.end(), words.begin(), [](std::string x) {boost::algorithm::trim(x); return x;});
        words.erase(std::remove_if(words.begin(), words.end(), [&stops](std::string x){return x.empty() || stops.find(x) != stops.end();}), words.end());
        tokenized_sents.push_back(words);
    }
    
    return tokenized_sents;
}

First we remove unwanted characters that come with our input data (such as leftover HTML tags) and convert the text to lowercase. Then we split each sentence into words (delimited by space), trim any surrounding whitespace, and keep only the non-empty, non-stop-word tokens. These are standard pre-processing steps.
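
To make this concrete, here is a small Python mirror of the same steps (illustration only; the code we ship is the C++ above, and the sample sentence and stop-word set are made up):

import re

sentence = "<b>Red</b> Cotton T-Shirt, XL!"
stops = {'a', 'the', 'of'}   # assumed stop-word set for the example

cleaned = re.sub(r'<[^<]+?>|[^\w\-]+', ' ', sentence).lower()
words = [w.strip() for w in cleaned.split(' ')]
words = [w for w in words if w and w not in stops]
print(words)   # ['red', 'cotton', 't-shirt', 'xl']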

Next we define the methods for generating the N-grams:

str_vec FeatureTransformer::ngrams (str_vec &tokens) {
    str_vec ngram_out;
    
    for (int i = this->min_ngram; i <= this->max_ngram; i++) {
        for (int j = 0; j <= static_cast<int> (tokens.size())-i; j++) {
            str_vec temp(tokens.begin() + j, tokens.begin() + j + i);
            std::string x = boost::algorithm::join(temp, " ");
            boost::algorithm::trim(x);
            ngram_out.push_back(x);
        }
    }
    return ngram_out;
}

str_2_vec FeatureTransformer::ngram_tokenize (str_2_vec &tokenized_sents) {
    str_2_vec ngrams_out;
    
    for (size_t i = 0; i < tokenized_sents.size(); i++) {
        str_vec tokens = tokenized_sents[i];
        str_vec ngram_out = this->ngrams(tokens);
        ngrams_out.push_back(ngram_out);
    }
    return ngrams_out;
}

The logic is pretty simple and easy to implement, but it is also one of the bottlenecks in terms of speed. For a sentence of M words, generating the n-grams for a single value of n takes O(M) time, so computing 1-to-3 grams costs O(3M) per sentence. We usually limit ourselves to a maximum of 3-grams; as seen in most experiments, going beyond 3-grams rarely brings any additional improvement.
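
For example, a Python re-statement of the 'ngrams' method (illustration only) on a three-word title gives:

def ngrams(tokens, min_n=1, max_n=3):
    out = []
    for n in range(min_n, max_n + 1):
        for j in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[j:j + n]))
    return out

print(ngrams(['red', 'cotton', 'shirt']))
# ['red', 'cotton', 'shirt', 'red cotton', 'cotton shirt', 'red cotton shirt']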

The next function we are going to define selects the top N features based on the class labels. We are going to use the mutual information criterion to select these features.

struct comparator {
  bool operator()(const std::pair<std::string, double> &a, const std::pair<std::string, double> &b) {
    return a.second > b.second;
  }
};

str_vec FeatureTransformer::mutual_info_select (str_2_vec &tokenized_sents, str_vec &labels) {
    std::unordered_map<std::string, double> token_cnts, label_cnts, mi_values;
    std::unordered_map<std::string, std::unordered_map<std::string, double>> label_token_cnts;
    
    std::priority_queue<std::pair<std::string, double>, std::vector<std::pair<std::string, double>>, comparator> mi_max_heap;
        
    int total = static_cast<int> (tokenized_sents.size());
    
    for (size_t i = 0; i < tokenized_sents.size(); i++) {
        str_vec tokens = tokenized_sents[i];
        std::unordered_set<std::string> unique_tokens(tokens.begin(), tokens.end());
        
        std::string label = labels[i];
        
        std::for_each(unique_tokens.begin(), unique_tokens.end(), [&token_cnts](std::string token){token_cnts[token]++;});
        label_cnts[label]++;
        std::for_each(unique_tokens.begin(), unique_tokens.end(), [&label_token_cnts, &label](std::string token){label_token_cnts[label][token]++;});
    }
    
    for (auto it = label_token_cnts.begin(); it != label_token_cnts.end(); ++it) {
        std::string label = it->first;
        std::unordered_map<std::string, double> t_counts = it->second;
        
        for (auto it2 = t_counts.begin(); it2 != t_counts.end(); ++it2) {
            std::string token = it2->first;
            double val = it2->second;

            double x11 = val;
            double x10 = label_cnts[label] - val;
            double x01 = token_cnts[token] - val;
            double x00 = total - (x11 + x10 + x01);

            double x1 = label_cnts[label];
            double x0 = total - x1;

            double y1 = token_cnts[token];
            double y0 = total - y1;

            double p = x11/total;
            double q = x10/total;
            double r = x01/total;
            double s = x00/total;

            double u = x1/total;
            double v = x0/total;
            double w = y1/total;
            double z = y0/total;

            double a1 = (p != 0)? p*log2(p/(u*w)) : 0;
            double a2 = (q != 0)? q*log2(q/(u*z)) : 0;
            double a3 = (r != 0)? r*log2(r/(v*w)) : 0;
            double a4 = (s != 0)? s*log2(s/(v*z)) : 0;

            double mi = a1 + a2 + a3 + a4;
            mi_values[token] = std::max(mi_values[token], mi);
        }
    }
    
    for (auto it = mi_values.begin(); it != mi_values.end(); ++it) {
        std::string token = it->first;
        double mi_value = it->second;
        
        if (static_cast<int> (mi_max_heap.size()) < this->vocab_size) {
            mi_max_heap.push(std::make_pair(token, mi_value));
        }
        else {
            if (mi_value > mi_max_heap.top().second) {
                mi_max_heap.pop();
                mi_max_heap.push(std::make_pair(token, mi_value));
            }
        }
    }
    
    str_vec imp_features;
    while (!mi_max_heap.empty()) {
        std::pair<std::string, double> x = mi_max_heap.top();
        imp_features.push_back(x.first);
        mi_max_heap.pop();
    }
    
    return imp_features;
}

The idea here is pretty simple. Compute the following probabilities:

  • P(F) and P(F') - probability of a document containing feature F and not containing feature F respectively.
  • P(C) and P(C') - probability of a document belonging to class C and not belonging to class C respectively.
  • P(F,C), P(F,C'), P(F',C), and P(F',C') - joint probabilities of a document containing/not-containing feature F and belonging/not-belonging to class C.

and then apply the following mutual information formula to get a score.

P(F,C)\log\frac{P(F,C)}{P(F)P(C)} + P(F',C)\log\frac{P(F',C)}{P(F')P(C)} + P(F,C')\log\frac{P(F,C')}{P(F)P(C')} + P(F',C')\log\frac{P(F',C')}{P(F')P(C')}

Based on the scores for each feature F and class C pair, we take for each feature the maximum score over all classes, and then select the N top-scoring features irrespective of which class they come from. Instead of sorting all the scores, we use a fixed-size priority queue that keeps only the top N scored features: the custom comparator (defined above) keeps the lowest retained score at the top of the queue, so it can be evicted whenever a better feature arrives. The queue stores a pair of feature and score.
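
As a sanity check, here is the same computation on a made-up contingency table: 10 documents in total, feature F present in 4 of them, class C covering 5, and 3 documents having both:

from math import log2

total = 10.0
x11, x10, x01 = 3, 5 - 3, 4 - 3        # (F,C), (F',C), (F,C') counts
x00 = total - (x11 + x10 + x01)        # (F',C') count

p, q, r, s = x11/total, x10/total, x01/total, x00/total
u, v = 5/total, 1 - 5/total            # P(C), P(C')
w, z = 4/total, 1 - 4/total            # P(F), P(F')

mi = sum(a * log2(a / b) for a, b in
         [(p, u*w), (q, u*z), (r, v*w), (s, v*z)] if a != 0)
print(round(mi, 4))                    # 0.1245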

The next two methods are for computing the IDF scores for the features and the TF-IDF scores respectively:

str_dbl_map FeatureTransformer::compute_idf (str_2_vec &tokenized_sents) {
    std::unordered_map<std::string, double> token_idfs;
    int n = static_cast<int> (tokenized_sents.size());
    std::unordered_set<std::string> c_vocab(this->vocab.begin(), this->vocab.end());
    
    for (size_t i = 0; i < tokenized_sents.size(); i++) {
        str_vec tokens = tokenized_sents[i];
        tokens.erase(std::remove_if(tokens.begin(), tokens.end(), [&c_vocab](std::string x){return c_vocab.find(x) == c_vocab.end();}), tokens.end());
        
        std::unordered_set<std::string> unique_tokens(tokens.begin(), tokens.end());
        std::for_each(unique_tokens.begin(), unique_tokens.end(), [&token_idfs](std::string token){token_idfs[token]++;});
    }
    
    std::for_each(token_idfs.begin(), token_idfs.end(), [&n](std::pair<const std::string, double> &x) {x.second = log(static_cast<double> (n)/x.second);});
    
    return token_idfs;
}

SparseMat FeatureTransformer::compute_tf_idf (str_2_vec &tokenized_sents) {
    std::unordered_map<int, std::unordered_map<std::string, double>> token_counts;
    std::unordered_set<std::string> c_vocab(this->vocab.begin(), this->vocab.end());
    
    for (size_t i = 0; i < tokenized_sents.size(); i++) {
        for (size_t j = 0; j < tokenized_sents[i].size(); j++) {
            std::string token = tokenized_sents[i][j];
            if (c_vocab.find(token) != c_vocab.end()) {
                token_counts[i][token]++;
            }
        }
    }
    
    SparseMat out;
    
    for (auto it = token_counts.begin(); it != token_counts.end(); ++it) {
        int doc_index = it->first;
        std::unordered_map<std::string, double> t_cnt = it->second;
        
        for (auto it2 = t_cnt.begin(); it2 != t_cnt.end(); ++it2) {
            std::string token = it2->first;
            double tf = it2->second;
            
            out.row.push_back(doc_index);
            out.col.push_back(this->vocab_inv_index[token]);
            out.data.push_back(this->idf_scores[token]*tf);
        }
    }
    return out;
}

We separate the two methods because the IDF value is computed per feature and remains fixed across training and testing documents, whereas the term frequency of a feature is defined per document and thus varies across documents for the same feature. The TF-IDF score is therefore computed per feature and per document.

Once we are done computing the TF-IDF scores, we populate the sparse matrix as triplets <i, j, v>, where <i> is the vector of document indices, <j> is the corresponding vector of feature indices (the position of each feature after the selected features are sorted alphabetically), and <v> is the vector of TF-IDF scores for feature j in document i (see the worked example below).
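
A tiny worked example with made-up numbers: with n = 4 training documents and a feature occurring in 2 of them, the IDF is log(4/2) ≈ 0.693; if that feature then appears 3 times in some document, its triplet entry carries the value 3 * 0.693 ≈ 2.079:

from math import log

n_docs, doc_freq = 4, 2            # corpus size, documents containing the feature
idf = log(n_docs / doc_freq)       # fixed once fit() has run, ~0.6931

tf = 3                             # occurrences within one particular document
print(tf * idf)                    # ~2.0794, the value stored in SparseMat.data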

And lastly we define our 'fit' and 'transform' methods:

void FeatureTransformer::fit (str_vec sentences, str_vec labels) {
    str_2_vec tokenized_sents = this->tokenize(sentences);
    str_2_vec ngrams = this->ngram_tokenize(tokenized_sents);
    
    this->vocab = this->mutual_info_select(ngrams, labels);
    std::sort(this->vocab.begin(), this->vocab.end());
    
    for (size_t i = 0; i < this->vocab.size(); i++) {
        this->vocab_inv_index[this->vocab[i]] = static_cast<int> (i);
    }
    
    this->idf_scores = this->compute_idf(ngrams);
}

SparseMat FeatureTransformer::transform (str_vec sentences) {
    str_2_vec tokenized_sents = this->tokenize(sentences);
    str_2_vec ngrams = this->ngram_tokenize(tokenized_sents);
    
    return this->compute_tf_idf(ngrams);
}

These methods do essentially the same job as the 'fit' and 'transform' methods of the TfidfVectorizer class from scikit-learn.
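
For reference, the closest off-the-shelf equivalent would be something along these lines ('train_sentences' and 'valid_sentences' are assumed lists of strings). Note that TfidfVectorizer picks its 'max_features' by corpus term frequency rather than by mutual information with the labels, so the two vocabularies will generally differ:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(1, 3), max_features=100, stop_words='english')
dtm_train = vec.fit_transform(train_sentences)   # sparse document-term matrix
dtm_valid = vec.transform(valid_sentences)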

Next we need to define our Cython wrapper class, which will interface with the C++ 'FeatureTransformer' class. This is very similar to the C++ class definitions except that we use Cython syntax instead of C++ syntax.

Name the file as 'CustomFeatureTransformer.pyx':

from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map
from libcpp.string cimport string
from nltk.corpus import stopwords
import numpy as np
import scipy.sparse as sp

cdef extern from "FeatureTransformer.h":
    cdef struct SparseMat:
        vector[int] row, col
        vector[double] data
    
    cdef cppclass FeatureTransformer:
        vector[string] vocab
        unordered_map[string, int] vocab_inv_index
        unordered_map[string, double] idf_scores
        
        FeatureTransformer() except +
        FeatureTransformer(vector[string] stop_words, int vocab_size, int min_ngram, int max_ngram) except +
        void fit(vector[string] sentences, vector[string] labels)
        SparseMat transform(vector[string] sentences)
        
cdef class CustomFeatureTransformer(object):
    cdef FeatureTransformer ft
    
    def __cinit__(self, int num_features=100, int min_ngram=1, int max_ngram=3):
        self.ft = FeatureTransformer(stopwords.words('english'), num_features, min_ngram, max_ngram)
        
    def fit(self, sentences, labels):
        self.ft.fit(sentences, labels)
        
    def transform(self, sentences):
        cdef SparseMat out = self.ft.transform(sentences)
        cdef int n, m
        
        n = len(sentences)
        m = len(self.ft.vocab)
        
        return sp.csc_matrix((out.data, (out.row, out.col)), shape=(n, m))

We define our Cython wrapper class 'CustomFeatureTransformer' that declares the same 'fit' and 'transform' methods. Unlike the C++ class, the cython class does not define any other methods, because the cython 'fit' and 'transform' methods simply call the corresponding C++ methods, which in turn call the private C++ methods to compute the results.

Also, in the cython 'transform' method we convert the C++ sparse matrix (triplet format) into the 'CSC' scipy.sparse format. The structure of the cython wrapper will be familiar if you have gone through my earlier post on "Interfacing C++ with Cython".
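
One caveat worth flagging: Cython converts Python byte strings to std::string automatically, but on Python 3 plain 'str' objects need either explicit encoding or the 'c_string_type'/'c_string_encoding' compiler directives; this applies to everything crossing into vector[string], including the NLTK stop-word list. A minimal workaround at the call site (assuming UTF-8 input) looks like:

# Encode unicode inputs to bytes before they cross into vector[string]
sentences_b = [s.encode('utf-8') for s in sentences]
labels_b = [l.encode('utf-8') for l in labels]

transformer = CustomFeatureTransformer(num_features=100)
transformer.fit(sentences_b, labels_b)
dtm = transformer.transform(sentences_b)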

Thus now we are done with the 'CustomFeatureTransformer' implementation. Next we move on to defining the cython class for 'CustomClassifier'. As mentioned earlier, we will not be writing C++ class and methods for this, but instead 'cythonize' the Python class.

Create a file named 'CustomClassifier.pyx' and add the following code:

import numpy as np
from sklearn.metrics import classification_report
from sklearn.base import clone
from libcpp.vector cimport vector
from libcpp.string cimport string

cdef class CustomClassifier(object):
    cdef public vector[string] class_labels
    cdef double predict_threshold
    cdef object base_estimator, model
    
    def __cinit__(self, object base_estimator, double predict_threshold=0.5):
        self.base_estimator = base_estimator
        self.predict_threshold = predict_threshold

    def fit(self, X, y):
        cdef vector[string] labels = y
        
        self.model = clone(self.base_estimator)  # clone so the estimator passed in stays untouched
        self.model.fit(X, labels)
        
        self.class_labels = self.model.classes_
        
    def predict_proba(self, X):
        cdef double[:,:] preds_probs = self.model.predict_proba(X)
        return preds_probs

    def predict(self, X):
        cdef int x
        cdef double[:,:] preds_probs = self.predict_proba(X)
        cdef int[:] indices = np.argmax(preds_probs, axis=1).astype(np.int32)
        
        cdef vector[string] out
        for x in indices:
            out.push_back(self.class_labels[x])
        
        return out

    def score(self, X, y):
        cdef vector[string] pred_labels = self.predict(X)
        return classification_report(y, pred_labels)

The custom classifier class has the familiar 'fit', 'predict' and 'predict_proba' methods that are available on any scikit-learn estimator. Notice how we have assigned data types to the variables: for some of them it was straightforward to use cython types such as 'int', 'double' or the 'double[:,:]' memoryview. With the 'libcpp.string' module, strings in cython can be given the 'string' type instead of 'char*'.

Variables like 'base_estimator' and 'model' are scikit-learn estimator objects with no direct translation to any standard cython or C++ data type, so we give them the generic 'cdef object' type. The cython version may not be the most optimized, but the purpose is to show how a 'classifier' class written in python translates into cython with only minor changes.
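
Putting the two pieces together, usage might look like the following sketch ('dtm_train'/'dtm_valid' are the matrices returned by CustomFeatureTransformer, and on Python 3 the labels would again need to be byte strings):

from sklearn.linear_model import LogisticRegression
from CustomClassifier import CustomClassifier

clf = CustomClassifier(LogisticRegression(), predict_threshold=0.5)
clf.fit(dtm_train, train_labels)
print(clf.predict(dtm_valid)[:5])         # first few predicted class labels
print(clf.score(dtm_valid, valid_labels)) # full classification report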

Lastly we write our setup.py file as follows:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules=[
    Extension("CustomFeatureTransformer", sources=["CustomFeatureTransformer.pyx", "FeatureTransformer.cpp"], language='c++', extra_compile_args=["-std=c++11", "-stdlib=libc++", "-mmacosx-version-min=10.9"], extra_link_args=["-stdlib=libc++", "-std=c++11", "-mmacosx-version-min=10.9"]),

    Extension("CustomClassifier", sources=["CustomClassifier.pyx"], language='c++', extra_compile_args=["-std=c++11", "-stdlib=libc++", "-mmacosx-version-min=10.9"], extra_link_args=["-stdlib=libc++", "-std=c++11", "-mmacosx-version-min=10.9"])]

setup(
  name = "CustomCythonClassifier",
  cmdclass = {"build_ext": build_ext},
  ext_modules = ext_modules)

Notice how we are compiling both the 'CustomFeatureTransformer' and 'CustomClassifier' using the same setup.py file.

We need to set language to 'c++' so that cython compiles the extensions as C++ instead of C (the default). To enable C++11, we pass "-std=c++11" in both 'extra_compile_args' and 'extra_link_args'. On macOS specifically, we also need the "-stdlib=libc++" and "-mmacosx-version-min=10.9" arguments, otherwise the build fails. I am using macOS High Sierra v10.13.5.
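
With the flags in place, building both extensions in-place is a one-liner (if Boost lives in a non-standard location, you may additionally need to point the Extension's 'include_dirs' at it):

python setup.py build_ext --inplace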

I ran some performance benchmarks comparing the C++ and cython based classification pipeline against a pure python implementation. Training is roughly 4-5x faster than the pure python version (and the gap likely widens as the amount of input data grows), and per-file prediction time drops to well under half.

In production, where we generally expect around 300 requests per second and about 1000 concurrent users, cutting p99 latency by more than half will surely clear up the waiting queue much faster.

The code for both the C++/Cython version and the Python version is hosted on my github repo:

Categories: DESIGN, MACHINE LEARNING, PROBLEM SOLVING
