Machine Learning, AI and Programming

Designing an Automated Question-Answering System - Part II

In this post we will look at the offline implementation architecture.

Assume there are currently about 100 manual agents, each serving somewhere around 60-80 customers (non-unique) a day, i.e. a total of about 8K customer sessions each day for our agents. Each customer session has an average of 5 question-answer rounds, including statements, greetings, and contextual and personal questions. Thus on average we generate 40K client-agent response pairs each day, which comes to roughly 12 million client queries a year (assuming around 300 working days). We have around 2 years' worth of historical data, i.e. 24 million questions.

Roughly 40% of this data consists of actual questions (i.e. not statements, greetings etc.; we have a question-statement detector whose details I will not go into in this post), leaving around 10 million questions to be clustered on a 4-core Linux machine with 32 GB RAM and 500 GB of disk space. Assuming each question is on average 50 characters long, i.e. 50 bytes, 10 million questions take up approximately 0.5 GB of RAM, which easily fits on our machine even if we read all the questions into memory.
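As a quick sanity check, the arithmetic above can be reproduced in a few lines (using the upper-bound figures from the text):

```python
# Back-of-envelope sizing for the question corpus (numbers from the text).
sessions_per_day = 100 * 80            # ~100 agents, ~60-80 sessions each (upper bound)
qa_rounds_per_day = sessions_per_day * 5
print(qa_rounds_per_day)               # 40000 client-agent pairs per day

total_questions = 10_000_000           # ~40% of 2 years of pairs are real questions
avg_question_bytes = 50                # ~50 characters ~= 50 bytes each
raw_text_gb = total_questions * avg_question_bytes / 1e9
print(raw_text_gb)                     # 0.5 GB to hold all raw question text
```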

Now, let's see the possible ways the questions can be represented. One of them is the plain string format, but this is not very useful when working with standard clustering algorithms such as K-Means. We need to find vector representations for the questions. A few common options for representing a question as a vector are:

The TF-IDF representation is highly sparse. Assuming a vocabulary size of 50K, representing 10 million questions as a dense matrix would require 4 TB of RAM (assuming each TF-IDF value is 8 bytes long). This is huge and would never fit on our machines. Luckily the sklearn implementation of TF-IDF returns a "csr_matrix" object, which is a sparse matrix, i.e. only non-zero TF-IDF values are stored.

In most cases the sparsity is around 99.9%, i.e. 99.9% of the TF-IDF matrix is filled with 0's. Thus only 0.1% of the data is useful for further computation, or in other words 4 GB of RAM is sufficient to store the sparse representation.
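The sparsity claim is easy to verify on a toy corpus with sklearn (a real 10-million-question corpus is far sparser than this four-sentence example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "how do i reset my password",
    "where is my order",
    "my order has not arrived yet",
    "i want to reset my password",
]

# fit_transform returns a sparse CSR matrix: only non-zero values are stored
matrix = TfidfVectorizer().fit_transform(sentences)
density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(type(matrix).__name__, matrix.shape)
print("density:", round(density, 2))   # a large corpus would be ~0.001 or lower
```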

Representations such as PCA, SVD, Word2Vec, Doc2Vec etc. are dense representations, i.e. very few cells are zero. We use a 128-length vector representation to encode a question. A 128-dim. representation for 10 million questions would take up about 10 GB of RAM.

Multiple representations for each question can be concatenated with weights. The advantage of using concatenation over summation or averaging is that we do not constrain the individual representations to be of a specific length. Different engineers can work on different representations independently, without any dependency on each other.

One issue that might come up during concatenation is that one representation might be sparse and another dense, as with skip-gram features and word vectors respectively. In the case of mixed representations, it is more space-economical to convert both into a sparse representation than both into dense.

To work with multiple question representations, we need an abstract vector representation class that can be extended by different algorithms to generate their own representations, using their own vectorization models, and to define their own similarity metrics.

Create an abstract class "SentenceVector" with the following class variables: the list of representations to be concatenated, the final representation and the vectorizer models used to create the representations. Each individual representation is defined in its own class, e.g. "PCAFeatures", "SkipGramFeatures", "WordVectorFeatures" etc.

Abstraction of Vectors.

Below is the definition for the "SentenceVector" class :

import numpy as np
from scipy import sparse
from vectors.pca_features import PCAFeatures
from vectors.skip_gram_features import SkipGramFeatures
from vectors.word_embeddings_features import WordEmbeddingFeatures
from vectors.variational_autoencoder_features import VariationalAutoencoderFeatures
from vectors.lda_features import LDAFeatures
from sklearn.preprocessing import normalize

# Function to combine dense vector representations.
def combine_vector_representations_dense(reps):
    return np.hstack(reps)

# Function to combine mixed (sparse + dense) vector representations.
def combine_vector_representations_mixed(reps):
    return sparse.hstack(reps)

# Function to produce a sparse or dense combined representation depending on use case.
def transform_vector_representation(representations, algorithms, output_format):
    if 'skip_gram' in algorithms:
        # at least one representation (skip-gram) is sparse, so combine sparsely
        representations = sparse.csr_matrix(combine_vector_representations_mixed(representations))

        if output_format == 'dense':
            representations = representations.todense()
    else:
        # all representations are dense
        representations = combine_vector_representations_dense(representations)

        if output_format != 'dense':
            representations = sparse.csr_matrix(representations)
    return representations

class SentenceVector(object):
    def __init__(self, sentences, algorithm='pca', output_format='dense'):
        obj_list = []
        self.sentences = sentences
        algorithms = algorithm.split('|')
        for algo in algorithms:
            if algo == 'skip_gram':
                obj = SkipGramFeatures(sentences)
            elif algo == 'word_vector':
                obj = WordEmbeddingFeatures(sentences)
            elif algo == 'vae':
                obj = VariationalAutoencoderFeatures(sentences)
            elif algo == 'lda':
                obj = LDAFeatures(sentences)
            else:
                obj = PCAFeatures(sentences)
            obj_list.append(obj)
        self.output_format = output_format
        self.algorithms = algorithms
        self.algo_objs = obj_list
        self.representations = None
        self.model = None

    # Function to get vector representations.
    def get_representations(self):
        reps, models = [], []
        for algo in self.algo_objs:
            rep, model = algo.get_representations()
            reps.append(rep)
            models.append(model)
        self.model = models
        self.representations = transform_vector_representation(reps, self.algorithms, self.output_format)
        return self.representations, self.model

Following is how we can generate concatenated representations from skip-gram vectors and word vectors, by calling the above class.

import vectors.sentence_vector_features as svf

skip_gram_word2vec_concat_rep = svf.SentenceVector(questions, algorithm='skip_gram|word_vector', output_format='sparse')
representations, models = skip_gram_word2vec_concat_rep.get_representations()

Let's define one of the vectorization algorithms.

This is how the PCAFeatures class (for creating PCA representations for questions) looks like :

import numpy as np, math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import config as cfg

# Computes the TF-IDF matrix for a list of sentences.
def get_tfidf_matrix(sentences):
    print("Doing TF-IDF...")
    # get_tokens is our custom tokenizer, defined elsewhere in the project
    vectorizer = TfidfVectorizer(tokenizer=get_tokens, ngram_range=(1,1))
    matrix = vectorizer.fit_transform(sentences)
    return matrix, vectorizer

# Function to do PCA-style dimensionality reduction on an input matrix.
# Uses Truncated SVD, since it works directly on sparse matrices.
def pca_reduction(vectors):
    print("Doing PCA...")
    pca = TruncatedSVD(n_components=128)
    return pca.fit_transform(vectors), pca

class PCAFeatures(object):
    def __init__(self, sentences):
        self.sentences = sentences
        self.representations = None
        self.model = None

    # Function to get PCA representations.
    def get_representations(self):
        tfidf_matrix, tf_idf_model = get_tfidf_matrix(self.sentences)
        self.representations, pca_model = pca_reduction(tfidf_matrix)
        self.model = {'tf_idf_model' : tf_idf_model, 'pca_model' : pca_model}

        return self.representations, self.model

and similarly for the other representations.
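As an illustration only (this is not the actual implementation from the project), a word-embedding-based feature class following the same interface could average per-token vectors; the `embeddings` lookup table here is a stand-in for a real Word2Vec/GloVe model:

```python
import numpy as np

class AveragedEmbeddingFeatures(object):
    """Hypothetical sketch: represent a sentence as the mean of its word vectors."""
    def __init__(self, sentences, embeddings=None, dim=128):
        self.sentences = sentences
        # `embeddings` maps token -> np.ndarray of length `dim`; a real system
        # would load trained Word2Vec/GloVe vectors here.
        self.embeddings = embeddings or {}
        self.dim = dim
        self.representations = None
        self.model = None

    # Same interface as PCAFeatures.get_representations above.
    def get_representations(self):
        reps = []
        for sent in self.sentences:
            vecs = [self.embeddings[t] for t in sent.split() if t in self.embeddings]
            if vecs:
                reps.append(np.mean(vecs, axis=0))
            else:
                reps.append(np.zeros(self.dim))   # no known tokens: zero vector
        self.representations = np.vstack(reps)
        self.model = {'embeddings': self.embeddings}
        return self.representations, self.model
```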

Coming back to our question clusters. To create the most accurate clusters, we need to compute pairwise similarity between each pair of questions. Then using hierarchical agglomerative clustering, along with some threshold, we can form the intent clusters. That's all, we are done !!!

But the problem is that we have 10 million questions: a pairwise computation would require about 10^14 computations and a matrix of roughly 730 TB to store the similarity values. Both the run time and the memory requirement are infeasible for us, and we need something faster and less compute-intensive. The only way out is to first create some initial logical groupings of "similar looking" questions, and then compute pairwise similarities only within each logical grouping.

Multi-Level Clustering in a Pipeline

Assuming we have N questions, if we create O(N^0.5) clusters, then each cluster is expected to have O(N^0.5) questions in it, so the pairwise computation per cluster is O(N) instead of O(N^2). Thus for 10 million questions, each logical group would contain around 3162 questions on average; the total number of expected computations is O(N^1.5) and the expected memory requirement per group is about 80 MB. We process one sub-group at a time (in concurrent threads, since the sub-groups are independent), so that we do not need to store all the similarity matrices at once.
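The √N arithmetic can be checked directly:

```python
import math

N = 10_000_000
num_groups = int(math.sqrt(N))          # O(sqrt(N)) first-level clusters
group_size = N // num_groups            # ~sqrt(N) questions per group
print(group_size)                       # ~3162 questions per group

# pairwise work per group is group_size^2; summed over all groups this is ~N^1.5
total_pairs = num_groups * group_size ** 2
print(f"{total_pairs:.1e}")             # ~3e10 instead of ~1e14

# memory for one group's similarity matrix at 8 bytes per entry
group_matrix_mb = group_size ** 2 * 8 / 2 ** 20
print(round(group_matrix_mb))           # roughly the post's ~80 MB estimate
```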

Note that one can also use K-Means instead of agglomerative clustering. K-Means does not need to compute pairwise similarities; it computes similarities between questions and their cluster centroids, which is much more efficient, O(N). But the problem lies in the centroid: averaging dilutes the feature importance, so the quality of the clusters would not be good enough.

What we need is layer-wise clustering (as shown in the above figure), i.e. create a first-level logical grouping of questions (using either K-Means or some domain knowledge); then, on each cluster, one can either do a further logical grouping (using K-Means) or directly use agglomerative clustering. This calls for designing the clustering as a pipeline.

import clustering.kmeans_clustering as km
import clustering.agglomerative_clustering as agg
import clustering.prune_clustering as pr
import vectors.sentence_vector_features as svf
import config as cfg
import time

# Class for running abstract clustering algorithms in a pipeline.
#
# Each clustering algorithm works with the clusters created by the previous clustering
# algorithm in the pipeline.
#
# Every clustering algorithm in the pipeline takes as input a dataframe, the cluster label
# for each sentence, the cluster centers, the cluster head for each sentence and the
# confidence of each sentence with its cluster center, all coming from the previous
# clustering algorithm.
class ClusterPipeline(object):
    def __init__(self, actions, data):
        self.actions = actions = data
        self.labels = []
        self.centers = []
        self.cluster_heads = []
        self.cluster_confidences = []
        self.rep_obj = None
    # Function to run the clustering pipeline.
    def run(self):
        questions = list([cfg.QUESTION_COLUMN_NAME])
        labels, centers, heads, confidences = [0] * len(questions), [], [], []
        for action in self.actions:
            if action == 'kmeans':
                start = time.time()
                print("Getting KMeans sentence vector representation...")
                self.rep_obj = svf.SentenceVector(questions, algorithm=cfg.KMEANS_ALGORITHM, output_format=cfg.KMEANS_FORMAT)
                print("KMeans Clustering Vectors...")
                kmeans = km.KMeans(, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences =
                duration = time.time() - start
                print("Time taken to do KMeans Clustering = ", duration)
            elif action == 'agglomerative':
                start = time.time()
                print("Getting Agglomerative sentence vector representation...")
                self.rep_obj = svf.SentenceVector(questions, algorithm=cfg.AGG_ALGORITHM, output_format=cfg.AGG_FORMAT)
                print("Agglomerating clusters...")
                agglomer = agg.Agglomerative(, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences =
                duration = time.time() - start
                print("Time taken to do Agglomerative Clustering = ", duration)
                start = time.time()
                print("Pruning clusters...")
                prune = pr.Prune(, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences =
                duration = time.time() - start
                print("Time taken to do Pruning = ", duration)
        self.labels, self.centers, self.cluster_heads, self.cluster_confidences = labels, centers, heads, confidences
        return self.labels, self.centers, self.cluster_heads, self.cluster_confidences, self.rep_obj

Create a class called "ClusterPipeline" that has the following variables: the questions (+ metadata) to be clustered, the list of clustering algorithms to be applied in sequence, the final cluster labels, the cluster centers, and some other results required for analysis and reporting.

Abstraction of Clustering.

The clustering pipeline can be run by calling the following code :

import clustering.cluster_pipeline as cp
import pandas as pd

df = pd.read_csv('questions_file.csv', encoding = 'utf8')

pipeline = cp.ClusterPipeline(['kmeans', 'agglomerative'], df)
labels, centers, heads, confidences, rep_obj =

Each clustering algorithm is defined in its own class and returns the cluster label and confidence for each question, along with the cluster centers (the intents). Both K-Means and Agglomerative Clustering become part of this pipeline.

Note that, apart from K-Means and Hierarchical Agglomerative Clustering, one can also use Pruning to prune clusters based on certain criteria.

The idea of Pruning is to "trim" the boundaries of the clusters and then re-cluster the trimmed data points. This ensures that points which might have been mistakenly assigned to a cluster during the previous step are now free to join points which are more similar to them.
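The post does not show the `Prune` class internals; a minimal sketch of the trimming idea (assuming the same labels/confidences conventions as the other classes, with a hypothetical threshold) could look like this:

```python
# Hypothetical sketch of pruning: points whose confidence with their cluster
# centroid falls below a threshold are "trimmed" and handed back for re-clustering.
def trim_boundaries(labels, confidences, threshold=0.5):
    kept, trimmed = [], []
    for idx, conf in enumerate(confidences):
        if conf >= threshold:
            kept.append(idx)        # stays in cluster labels[idx]
        else:
            trimmed.append(idx)     # boundary point: re-cluster in the next step
    return kept, trimmed

labels = [0, 0, 1, 1, 2]
confidences = [0.9, 0.4, 0.8, 0.3, 0.95]
kept, trimmed = trim_boundaries(labels, confidences)
print(kept, trimmed)                # [0, 2, 4] [1, 3]
```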

Following is the class for K-Means clustering :

import numpy as np, math
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity
import config as cfg

# Function to run the Mini Batch K-Means algorithm on vector representations of sentences.
def cluster_representations(vectors, num_clusters):
    return MiniBatchKMeans(n_clusters=num_clusters, batch_size=min(len(vectors), cfg.MINI_BATCH_KMEANS_BATCH_SIZE), max_iter=cfg.KMEANS_NUM_ITER).fit(vectors)

# Function to get the confidence (similarity) of each sentence with its cluster centroid.
def get_cluster_confidences(centers, vectors, batch_size=10000):
    confidences = []
    num_data = len(vectors)
    num_batches = int(math.ceil(float(num_data) / batch_size))

    for batch in range(num_batches):
        start = batch * batch_size
        end = min((batch + 1) * batch_size, num_data)

        sims = cosine_similarity(np.array(vectors[start:end]), centers)
        # normalize each row so that the similarities to all centroids sum to 1
        sims = (sims.T / np.sum(sims, axis=1)).T

        confidences += list(np.max(sims, axis=1))
    # replace NaN confidences (from all-zero similarity rows) with 0
    for idx in range(len(confidences)):
        confidences[idx] = 0 if math.isnan(confidences[idx]) else confidences[idx]
    return confidences

# Function to get the cluster head for each sentence: the sentence with the highest
# similarity to its own cluster centroid.
def get_cluster_heads(labels, confidences):
    n = len(set(labels))
    max_conf, arg_max_conf = [-float("Inf")] * n, [0] * n
    heads = [0] * len(labels)

    for idx in range(len(labels)):
        if confidences[idx] > max_conf[labels[idx]]:
            max_conf[labels[idx]] = confidences[idx]
            arg_max_conf[labels[idx]] = idx

        heads[idx] = arg_max_conf[labels[idx]]
    return heads

# Class for running the K-Means clustering algorithm.
#
# Takes as input a dataframe, the cluster label for each sentence, the cluster centers,
# the cluster head for each sentence and the confidence of each sentence with its
# cluster center, from the previous clustering algorithm in a pipeline (or standalone).
class KMeans(object):
    def __init__(self, data, labels, centers, heads, confidences, rep_obj):
        self.confidences = confidences = data
        self.labels = labels
        self.centers = centers
        self.heads = heads
        self.rep_obj = rep_obj
    # Function to run K-Means clustering.
    def run(self):
        # group question indexes by their current cluster label
        cluster_grps = defaultdict(list)
        for idx in range(len(self.labels)):
            cluster_grps[self.labels[idx]].append(idx)
        n = len(self.labels)
        labels, heads, confidences = [0] * n, [0] * n, [0] * n
        centers = []
        max_cluster_id = 0
        for cl_id, points in cluster_grps.items():
            vectors = [self.rep_obj.representations[pt] for pt in points]
            num_clusters = int(math.sqrt(len(points)))
            clusters = cluster_representations(vectors, num_clusters)
            if len(centers) == 0:
                centers = clusters.cluster_centers_
            else:
                centers = np.vstack((centers, clusters.cluster_centers_))
            confs = get_cluster_confidences(clusters.cluster_centers_, vectors)
            for idx in range(len(points)):
                pt = points[idx]
                labels[pt] = max_cluster_id + clusters.labels_[idx]
                confidences[pt] = confs[idx]
            max_cluster_id += len(set(clusters.labels_))
        heads = get_cluster_heads(labels, confidences)
        return labels, centers, heads, confidences

Some notable features of the above class :

  • We use the Mini Batch K-Means algorithm instead of full K-Means. Mini Batch creates approximate clusters but is much faster and consumes a lot less memory than full K-Means. Anyway, we are not much concerned about cluster quality at this stage.
  • We take the clusters returned by the previous algorithm in the pipeline and apply Mini Batch K-Means on each cluster.
  • We compute new cluster labels by keeping a running counter of the total number of clusters created so far, and new centers by stacking the cluster centers vertically.
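The running-counter bookkeeping from the last bullet can be seen in isolation: each group's local labels are shifted by the number of clusters already emitted, keeping labels globally unique:

```python
# Toy illustration of offsetting per-group cluster labels into a global label space.
group_labels = [[0, 1, 0], [0, 0, 1, 2]]   # local labels from two independent groups

global_labels, max_cluster_id = [], 0
for labels in group_labels:
    global_labels.append([max_cluster_id + l for l in labels])
    max_cluster_id += len(set(labels))      # advance by the clusters created in this group

print(global_labels)                        # [[0, 1, 0], [2, 2, 3, 4]]
```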

And then we have the agglomerative clustering class, which uses the pairwise similarity matrix to create the final intent clusters.

import config as cfg
from collections import defaultdict

# Class for running the agglomerative clustering algorithm.
#
# Takes as input a dataframe, the cluster label for each sentence, the cluster centers,
# the cluster head for each sentence and the confidence of each sentence with its
# cluster center, from the previous clustering algorithm in a pipeline (or standalone).
class Agglomerative(object):
    def __init__(self, data, labels, centers, heads, confidences, rep_obj):
        self.confidences = confidences = data
        self.labels = labels
        self.centers = centers
        self.heads = heads
        self.rep_obj = rep_obj
    # Function to run agglomerative clustering.
    def run(self):
        # output_clusters maps question index -> (cluster id, confidence, cluster head index)
        output_clusters, cluster_grps = dict(), defaultdict(list)
        for idx in range(len(self.labels)):
            cluster_grps[self.labels[idx]].append(idx)

        max_cluster_id = 0

        for cl_id, points in cluster_grps.items():
            # compute pairwise similarities
            similarities = self.rep_obj.get_pairwise_similarities(points)

            # sort each row based on number of columns with similarity at-least defined by cfg.COSINE_SIMILARITY_THRESHOLD
            pt_indexes = self.rep_obj.get_sorted_by_similarity(similarities, cfg.COSINE_SIMILARITY_THRESHOLD)

            for idx in pt_indexes:
                q_idx, cluster_id = points[idx], max_cluster_id

                if q_idx not in output_clusters:
                    # get list of tuples of (cosine similarity, index) for questions from same cluster
                    most_similar = self.rep_obj.get_representation_index(similarities, idx)

                    output_clusters[q_idx] = (cluster_id, 1.0, q_idx)

                    if len(most_similar) > 0:
                        # filter the list of tuples for only similar questions with cosine similarity at-least defined by cfg.COSINE_SIMILARITY_THRESHOLD
                        most_similar = [(x, points[y]) for x, y in most_similar if points[y] != q_idx and x >= cfg.COSINE_SIMILARITY_THRESHOLD]

                        for sim, index in most_similar:
                            # if question belongs to two or more intents, then assign it to the intent with highest cosine similarity
                            if index not in output_clusters or output_clusters[index][1] < sim:
                                output_clusters[index] = (cluster_id, sim, q_idx)

                    max_cluster_id += 1
        n = len(self.labels)
        labels, heads, confidences = [0] * n, [0] * n, [0] * n
        centers = [0] * max_cluster_id
        for pt, cluster in output_clusters.items():
            labels[pt] = cluster[0]
            confidences[pt] = cluster[1]
            heads[pt] = cluster[2]
            centers[cluster[0]] = self.rep_obj.representations[cluster[2]]
        return labels, centers, heads, confidences

Following are some notable features of the above class :

  • The above algorithm is a bit different from the standard agglomerative clustering algorithm (which hierarchically merges the closest clusters).
  • Inside each cluster (from the previous algorithm in the pipeline), compute the pairwise similarity between the questions in that cluster.
  • Sort the questions in the cluster based on how many questions in the same cluster are similar to it (defined by the cosine similarity threshold).
  • Choose the question with the highest number of similar questions in the same cluster. Then create a sub-cluster out of all these questions, with the selected question as the cluster head ("intent" in our case).
  • Repeat the above step for all questions in each cluster until no more intents can be formed, i.e. all questions in a cluster have been assigned to some intent cluster.
  • If a question is similar to two or more intents, then it is assigned to the intent with which it is most similar.

The sub-clusters obtained from the above agglomerative class are the final intent clusters, each containing a set of similar questions.

In the next post of the series, we will look at how different vectorization algorithms perform against a standard tagged data-set of similar questions and analyze the results.

