Stokastik

Machine Learning, AI and Programming

Designing a Question-Question Similarity Framework - Part II

In this post we will look at the offline implementation.

In our company, there are currently about 100 manual agents, each serving somewhere around 60-80 customers (non-unique) a day, i.e. a total of about 8K customer sessions each day across our agents. Each customer session has an average of 3 question-answer rounds, including statements, greetings, and contextual and personal questions. Thus on average we generate 24K QA pairs each day, which comes to roughly 9 million questions in a year. We have around 3 years' worth of historical data, i.e. 27 million questions.

Roughly 40% of this data consists of actual questions (i.e. not statements, greetings etc.; we have a question-statement detector whose details I will not go into in this post), which leaves around 11 million questions to be clustered on a 4-core Linux machine with 32 GB RAM and 500 GB of disk space. Assuming that each question is on average 50 characters long, i.e. 50 bytes, 11 million questions would take up approximately 0.5 GB of RAM, which easily fits on our machine even if we read all the questions into memory.
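
As a quick back-of-the-envelope check of these numbers, the arithmetic can be written out directly (a rough sketch; the per-day volumes and the 50-byte average are the figures quoted above):

num_agents = 100
sessions_per_agent_per_day = 80          # ~60-80 customers per agent per day
qa_rounds_per_session = 3

qa_pairs_per_day = num_agents * sessions_per_agent_per_day * qa_rounds_per_session  # ~24K per day
questions_per_year = qa_pairs_per_day * 365                                         # ~9 million per year
questions_3_years = 3 * questions_per_year                                          # ~26-27 million
actual_questions = 0.4 * questions_3_years                                          # ~11 million questions
raw_text_gb = actual_questions * 50 / 2**30                                         # ~0.5 GB of raw text
print(qa_pairs_per_day, questions_per_year, questions_3_years, actual_questions, raw_text_gb)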

Now, let's see the possible ways in which the questions can be represented. One of them is the plain string format, but this is not very useful when working with standard clustering algorithms such as K-Means. We need vector representations for the questions. A few common ways of representing a question as a vector are the following:

The TF-IDF representation is highly sparse. Assuming a TF-IDF vocabulary of 10,000 features, representing 11 million questions densely would require around 820 GB of RAM if each TF-IDF value is 8 bytes long. This is clearly huge and would never fit on our machine. Luckily the sklearn implementation of TF-IDF returns a "csr_matrix" object, which is a sparse matrix, i.e. only the non-zero TF-IDF values are stored.

In most cases the sparsity is around 99.9%, i.e. 99.9% of the TF-IDF matrix is filled with 0's. Thus only 0.1% of the data is useful for further computation, or, in other words, 0.82 GB of RAM is sufficient to store the sparse representation.

Representations such as PCA, SVD, Word2Vec, Doc2Vec etc. are dense representations, i.e. very few cells are zero. Generally a 64 or 128 length vector is more than sufficient to encode a question. A 128 length representation for 11 million questions would take up about 10.5 GB of RAM. PCA, SVD etc. are generally computed on the TF-IDF matrix itself (TruncatedSVD on the sparse TF-IDF matrix), and thus the memory consumption is no more than the sum of 0.82 and 10.5 GB.
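
These memory estimates are easy to reproduce with a little arithmetic. Below is a rough sketch; the 10,000-feature vocabulary, the 99.9% sparsity and the 128-dimensional dense vectors are the assumptions stated above, and the last two lines show how the actual footprint of a scipy "csr_matrix" can be read off its underlying arrays:

import numpy as np
from scipy import sparse

n_questions, vocab_size, bytes_per_value = 11 * 10**6, 10000, 8

dense_tfidf_gb = n_questions * vocab_size * bytes_per_value / 2**30   # ~820 GB, infeasible
sparse_tfidf_gb = 0.001 * dense_tfidf_gb                              # ~0.82 GB at 99.9% sparsity
dense_embedding_gb = n_questions * 128 * bytes_per_value / 2**30      # ~10.5 GB for 128-dim vectors
print(dense_tfidf_gb, sparse_tfidf_gb, dense_embedding_gb)

# The memory actually used by a csr_matrix is the size of its data, indices and indptr arrays.
m = sparse.random(1000, vocab_size, density=0.001, format='csr')
print(m.data.nbytes + m.indices.nbytes + m.indptr.nbytes)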

Multiple representations for each question can be concatenated. The advantage of concatenation is that we do not constrain an individual representation to be of a specific length, and different engineers can work on different representations independently, without any dependency on each other. One issue that might come up during concatenation is that one representation might be sparse and another dense, as with skip-gram features and word vectors respectively.

In the case of mixed representations, it is more beneficial to convert both into a sparse representation than to convert both into dense. Combining two representations takes up memory roughly equal to the sum of their individual sizes. Thus skip-gram + IDF-weighted word vectors would take up around 12 GB of RAM. This assumes that we concatenate the vectors at the question level instead of stacking the matrices vertically, which would also incur the overhead of keeping the individual representations in memory.

import numpy as np
from scipy import sparse
from vectors.pca_features import PCAFeatures
from vectors.skip_gram_features import SkipGramFeatures
from vectors.word_embeddings_features import WordEmbeddingFeatures
from vectors.autoClusterFeatures import AutoClusterProductTypeFeature
from sklearn.preprocessing import normalize

def combine_vector_representations_dense(reps):
    """
    Function to combine dense vector representations.
    """
    return np.hstack(reps)

def combine_vector_representations_mixed(reps):
    """
    Function to combine mixed (sparse + dense) vector representations.
    """
    return sparse.hstack(reps)

def normalize_representations_dense(representations):
    """
    Function to normalize sentence vectors using the L2 norm.
    """
    # Small constant offset so that all-zero vectors do not lead to division by zero.
    representations = representations + 0.05
    norms = np.sqrt(np.sum(representations**2, axis=1))
    representations = representations / norms[:, None]
    return representations

def normalize_representations_mixed(representations):
    """
    Function to normalize sentence vectors using the L2 norm.
    """
    representations = normalize(representations, norm='l2', axis=1)
    return representations

def transform_vector_representation(representations, algorithms, output_format):
    """
    Function to use sparse or dense vector representations depending on the use case.
    """
    if 'skip_gram' in algorithms:
        representations = sparse.csr_matrix(combine_vector_representations_mixed(representations))
        if output_format == 'dense':
            # toarray() (rather than todense()) returns a plain ndarray that the dense
            # normalization below can work with.
            representations = representations.toarray()
    else:
        representations = combine_vector_representations_dense(representations)
        if output_format != 'dense':
            representations = sparse.csr_matrix(representations)
    if output_format == 'dense':
        representations = normalize_representations_dense(representations)
    else:
        representations = normalize_representations_mixed(representations)
    return representations

class SentenceVector(object):
    def __init__(self, sentences, algorithm='pca', output_format='dense'):
        obj_list = []
        algorithms = algorithm.split('|')
        for algo in algorithms:
            if algo == 'skip_gram':
                obj = SkipGramFeatures(sentences)
            elif algo == 'word_vector':
                obj = WordEmbeddingFeatures(sentences)
            elif algo == 'auto_cluster':
                obj = AutoClusterProductTypeFeature(sentences)
            else:
                obj = PCAFeatures(sentences)
            obj_list.append(obj)
        self.output_format = output_format
        self.algorithms = algorithms
        self.algo_objs = obj_list
        self.representations = None
        self.model = None

    def get_representations(self):
        """
        Function to get vector representations.
        """
        reps, models = [], []
        for algo in self.algo_objs:
            rep, model = algo.get_representations()
            reps.append(rep)
            models.append(model)
        self.model = models
        self.representations = transform_vector_representation(reps, self.algorithms, self.output_format)
        return self.representations, self.model

We create an abstract class "SentenceVector" (shown above) which has the following class variables: the list of representation algorithms to be concatenated, the final representations, and the vectorizer models used to create the representations. Each individual representation is defined in its own class, e.g. "PCAFeatures", "SkipGramFeatures", "WordEmbeddingFeatures" and "AutoClusterProductTypeFeature".

Abstraction of Vectors.

Following is how we can generate concatenated representations from skip-gram vectors and word vectors by calling the above class:

import vectors.sentence_vector_features as svf
skip_gram_word2vec_concat_rep = svf.SentenceVector(questions, algorithm='skip_gram|word_vector', output_format='sparse')
skip_gram_word2vec_concat_rep.get_representations()

This is what the PCAFeatures class (for creating PCA representations of questions) looks like:

import numpy as np
import message_cleaning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import config as cfg

def get_tfidf_matrix(sentences):
    """
    Computes the TF-IDF matrix for a list of sentences.
    """
    print("Doing TF-IDF...")
    vectorizer = TfidfVectorizer(tokenizer=message_cleaning.get_tokens, ngram_range=(1, 1))
    matrix = vectorizer.fit_transform(sentences)
    return matrix, vectorizer

def pca_reduction(vectors):
    """
    Function to do PCA dimensionality reduction on an input matrix.
    Uses Truncated SVD since the input TF-IDF matrix is sparse.
    """
    print("Doing PCA...")
    pca = TruncatedSVD(n_components=cfg.PCA_VECTOR_SIZE)
    # fit_transform both fits the model and projects the vectors in one step.
    return pca.fit_transform(vectors), pca

def normalize_representations(representations):
    """
    Function to normalize the vector representations using the L2 norm.
    """
    norms = np.sqrt(np.sum(representations**2, axis=1))
    representations = representations / norms[:, None]
    return representations

class PCAFeatures(object):
    def __init__(self, sentences):
        self.sentences = sentences
        self.representations = None
        self.model = None

    def get_representations(self):
        """
        Function to get PCA representations.
        """
        tfidf_matrix, tf_idf_model = get_tfidf_matrix(self.sentences)
        self.representations, pca_model = pca_reduction(tfidf_matrix)
        self.model = {'tf_idf_model': tf_idf_model, 'pca_model': pca_model}
        self.representations = normalize_representations(self.representations)
        return self.representations, self.model

and similarly for the other representations.

Coming back to our question clusters. To create the most effective clusters, we need to compute the pairwise similarity between each pair of questions. Then, using agglomerative clustering with some similarity threshold, we can club similar questions together. That's all, we are done !!!

But the problem is that we have 11 million questions, and a pairwise computation requires 121 x 10^12 similarity computations, plus 8 times that many bytes (close to a petabyte) to store the similarity values. Both the run time and the memory requirement are infeasible for us, and we need something faster and less compute intensive. The only way we can do that is by first creating some logical groupings of "similar looking" questions and then computing the pairwise similarity only within each logical grouping.

Multi-Level Clustering in a Pipeline

Assuming we have N questions, if we create O(N^0.5) clusters, then each cluster is expected to have O(N^0.5) questions in it. Thus the pairwise computation required within each group is O(N) instead of the O(N^2) required over the full dataset. For 11 million questions, each logical group would have around 3,317 questions on average, so each group needs about 11 million similarity computations and 88 million bytes = 84 MB to store them. We process one sub-group at a time (in concurrent threads, since the sub-groups are independent), so that we never need to store all of the similarity matrices together.
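
The following rough sketch reproduces the arithmetic above (N = 11 million, 8 bytes per similarity value and the square-root grouping are the stated assumptions):

import math

n_questions = 11 * 10**6
num_groups = int(math.sqrt(n_questions))            # ~3316 logical groups
questions_per_group = n_questions / num_groups      # ~3317 questions per group on average
pairs_per_group = questions_per_group ** 2          # ~11 million similarity values per group
memory_per_group_mb = pairs_per_group * 8 / 2**20   # ~84 MB at 8 bytes per value
print(num_groups, questions_per_group, pairs_per_group, memory_per_group_mb)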

The pairwise similarity matrix can either be calculated from the vector representations or supplied as a kernel (a pre-computed matrix). When computing the similarity matrix from the vectors, each representation might want to implement its own similarity logic (cosine, euclidean etc.). Thus the pairwise similarity computation is kept as part of the individual feature vector classes, giving implementation flexibility to their implementers.

After adding the functions for computing the pairwise similarity using the vector representations, the "SentenceVector" class looks as follows:

class SentenceVector(object):
    def __init__(self, sentences, algorithm='pca', output_format='dense'):
        obj_list = []
        algorithms = algorithm.split('|')
        for algo in algorithms:
            if algo == 'skip_gram':
                obj = SkipGramFeatures(sentences)
            elif algo == 'word_vector':
                obj = WordEmbeddingFeatures(sentences)
            elif algo == 'auto_cluster':
                obj = AutoClusterProductTypeFeature(sentences)
            else:
                obj = PCAFeatures(sentences)
            obj_list.append(obj)
        self.output_format = output_format
        self.algorithms = algorithms
        self.algo_objs = obj_list
        self.representations = None
        self.model = None

    def get_representations(self):
        """
        Function to get vector representations.
        """
        reps, models = [], []
        for algo in self.algo_objs:
            rep, model = algo.get_representations()
            reps.append(rep)
            models.append(model)
        self.model = models
        self.representations = transform_vector_representation(reps, self.algorithms, self.output_format)
        return self.representations, self.model

    def get_pairwise_similarities(self, indexes=None):
        """
        Function to get the pairwise similarity matrix for a subset of points.
        """
        if indexes is not None:
            representations = self.representations[indexes, :]
        else:
            representations = self.representations
        # .dot works for both dense ndarrays and sparse csr matrices.
        return representations.dot(representations.T)

    def get_representation_index(self, similarities, index):
        """
        Function to get a list of (similarity, point_index) tuples from one row
        of the similarity matrix.
        """
        rep = similarities[index]
        if self.output_format == 'dense':
            if np.sum(rep) > 0:
                return list(zip(rep, range(len(rep))))
            else:
                return []
        else:
            data = rep.data
            if len(data) > 0:
                rows, cols = rep.nonzero()
                return list(zip(data, cols))
            else:
                return []

    def get_test_sentences_representation(self, sentences):
        """
        Function to get representations for test sentences.
        """
        reps = []
        for algo in self.algo_objs:
            rep = algo.get_test_sentences_representation(sentences)
            reps.append(rep)
        return transform_vector_representation(reps, self.algorithms, self.output_format)

    def get_test_sentence_similarities(self, test_rep, cluster_head_indexes):
        """
        Function to get similarities of a test question with the cluster heads.
        """
        if self.output_format == 'dense':
            cluster_head_reps = [self.representations[cluster_head_index] for cluster_head_index in cluster_head_indexes]
            most_similar = np.dot(test_rep, np.array(cluster_head_reps).T)[0]
            cols = [cluster_head_indexes[col] for col in range(len(most_similar))]
            return list(zip(most_similar, cols))
        else:
            cluster_head_reps = self.representations[cluster_head_indexes, :]
            most_similar = test_rep.dot(cluster_head_reps.T)[0]
            data = most_similar.data
            rows, cols = most_similar.nonzero()
            cols = [cluster_head_indexes[col] for col in cols]
            return list(zip(data, cols))

    def get_sorted_by_similarity(self, similarities, sim_thres):
        """
        Function to get a sorted list of indexes from the pairwise similarity matrix.
        Indexes are sorted in descending order of the number of 'similar' questions,
        with a similarity threshold of at least 'sim_thres'.
        """
        thres_match = similarities >= sim_thres
        if self.output_format == 'dense':
            return np.sum(thres_match.astype(int), axis=1).argsort()[::-1]
        else:
            return np.array(thres_match.astype(int).sum(axis=1).T)[0].argsort()[::-1]

It is not always possible to compute a vector representation. Let's say that I want to use Longest Common Subsequence (LCS) similarity as the metric for pairwise similarity between questions. The LCS between two strings at a time is relatively easy to compute, whereas if we want to vectorize it, we have to compute TF-IDF values over all possible skip-grams of a question.

If there are m words in a question, then there are 2^m possible skip-grams. Assuming 10 words in a question, there would be 1024 features to compute, and this is just for one question; for 11 million questions this number becomes huge. A sparse matrix (assuming 10 words per question) would take up around 10 GB of memory, which can still fit in memory, but the concern is that 10 words is only an average, and in certain cases the number of words may go up to 50-60. If the number of features were linear in the number of words this would not be much of a concern, but unfortunately it is exponential: 2^50 ≈ 1.1 x 10^15.
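
To illustrate why the direct pairwise route is attractive here, below is a minimal LCS similarity sketch: it runs directly on a pair of questions with the standard dynamic program and needs no feature extraction at all. The word-level tokenization and the normalization by the longer question are illustrative choices, not part of the framework above:

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_similarity(q1, q2):
    """LCS similarity between two questions, normalized by the longer question."""
    a, b = q1.split(), q2.split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / float(max(len(a), len(b)))

print(lcs_similarity("how do i return my order", "how can i return an order"))  # ~0.67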

Note that one can also use K-Means instead of agglomerative clustering. K-Means does not need to compute pairwise similarities; it only computes similarities between questions and their cluster centroids, which is much more efficient, O(N). But the problem lies in the centroid itself: a cluster centroid dilutes the feature importance through averaging, so the quality of the clusters would not be good enough.

What we need is layer-wise clustering (as shown in the figure above), i.e. create a first-level logical grouping of the questions (using either K-Means or some domain knowledge); then, on each cluster, one can either do a further logical grouping (using K-Means) or directly use agglomerative clustering (if the number of questions in the cluster is not too large). This calls for designing the clustering as a pipeline.

import clustering.kmeans_clustering as km
import clustering.agglomerative_clustering as agg
import clustering.prune_clustering as pr
import vectors.sentence_vector_features as svf
import config as cfg
import time

'''
Class for running abstract clustering algorithms in a pipeline.
Each clustering algorithm works with the clusters created by the previous clustering algorithm
in the pipeline.
All clustering algorithms in a pipeline take as input a dataframe, the cluster label for each sentence,
the cluster centers, the cluster head for each sentence and the cluster confidence with the cluster
centers for each sentence, from the previous clustering algorithm.
'''
class ClusterPipeline(object):
    def __init__(self, actions, data):
        self.actions = actions
        self.data = data
        self.labels = []
        self.centers = []
        self.cluster_heads = []
        self.cluster_confidences = []
        self.rep_obj = None

    def run(self):
        '''
        Function to run the clustering pipeline.
        '''
        questions = list(self.data[cfg.QUESTION_COLUMN_NAME])
        labels, centers, heads, confidences = [0] * len(questions), [], [], []
        for action in self.actions:
            if action == 'kmeans':
                start = time.time()
                print("Getting KMeans sentence vector representation...")
                self.rep_obj = svf.SentenceVector(questions, algorithm=cfg.KMEANS_ALGORITHM, output_format=cfg.KMEANS_FORMAT)
                self.rep_obj.get_representations()
                print("KMeans Clustering Vectors...")
                kmeans = km.KMeans(self.data, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences = kmeans.run()
                duration = time.time() - start
                print("Time taken to do KMeans Clustering = ", duration)
            elif action == 'agglomerative':
                start = time.time()
                print("Getting Agglomerative sentence vector representation...")
                self.rep_obj = svf.SentenceVector(questions, algorithm=cfg.AGG_ALGORITHM, output_format=cfg.AGG_FORMAT)
                self.rep_obj.get_representations()
                print("Agglomerating clusters...")
                agglomer = agg.Agglomerative(self.data, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences = agglomer.run()
                duration = time.time() - start
                print("Time taken to do Agglomerative Clustering = ", duration)
            elif action == 'prune':
                start = time.time()
                print("Pruning clusters...")
                prune = pr.Prune(self.data, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences = prune.run()
                duration = time.time() - start
                print("Time taken to do Pruning = ", duration)
            else:
                start = time.time()
                print("Getting KMeans sentence vector representation...")
                self.rep_obj = svf.SentenceVector(questions, algorithm=cfg.KMEANS_ALGORITHM, output_format=cfg.KMEANS_FORMAT)
                self.rep_obj.get_representations()
                print("KMeans Clustering Vectors...")
                kmeans = km.KMeans(self.data, labels, centers, heads, confidences, self.rep_obj)
                labels, centers, heads, confidences = kmeans.run()
                duration = time.time() - start
                print("Time taken to do KMeans Clustering = ", duration)
        self.labels, self.centers, self.cluster_heads, self.cluster_confidences = labels, centers, heads, confidences
        return self.labels, self.centers, self.cluster_heads, self.cluster_confidences, self.rep_obj

We create a class called "ClusterPipeline" that has the following variables: the questions (+ metadata) to be clustered, the list of clustering algorithms to be applied in sequence, the final cluster labels, the cluster centers, and some other results required for analysis and reporting.

Abstraction of Clustering.

The clustering pipeline can be run with the following code :

import clustering.cluster_pipeline as cp
import pandas as pd
df = pd.read_csv('questions_file.csv', encoding = 'utf8')
pipeline = cp.ClusterPipeline(['kmeans', 'prune', 'agglomerative'], df)
pipeline.run()

Each clustering algorithm is defined in its own class and returns the cluster label and confidence for each question and the center of each cluster. Both K-Means and agglomerative clustering become part of this pipeline. Apart from these two, we need another class for pruning. It is not a separate clustering algorithm; its purpose is to prune large clusters, and it can be called whenever we want to further reduce the size of the clusters.

Following is the class for K-Means clustering :

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
import numpy as np
import math
import config as cfg

def cluster_representations(vectors, num_clusters):
    '''
    Function to run the Mini Batch K-Means algorithm on vector representations of sentences.
    '''
    return MiniBatchKMeans(n_clusters=num_clusters,
                           batch_size=min(len(vectors), cfg.MINI_BATCH_KMEANS_BATCH_SIZE),
                           max_iter=cfg.KMEANS_NUM_ITER).fit(vectors)

def get_cluster_confidences(centers, vectors, batch_size=10000):
    '''
    Function to get the confidence (similarity) of each sentence with its cluster centroid.
    '''
    confidences = []
    num_data = len(vectors)
    num_batches = int(math.ceil(float(num_data) / batch_size))
    for batch in range(num_batches):
        start = batch * batch_size
        end = min((batch + 1) * batch_size, num_data)
        sims = cosine_similarity(np.array(vectors[start:end]), centers)
        # Normalize each row so that the similarities to all centers sum to 1.
        sims = (sims.T / np.sum(sims, axis=1)).T
        confidences += list(np.max(sims, axis=1))
    for idx in range(len(confidences)):
        confidences[idx] = 0 if math.isnan(confidences[idx]) else confidences[idx]
    return confidences

def get_cluster_heads(labels, confidences):
    '''
    Function to get the cluster head for each sentence, i.e. the sentence with the highest
    confidence (similarity) with its cluster centroid.
    '''
    max_conf, arg_max_conf = defaultdict(lambda: -float("Inf")), dict()
    heads = [0] * len(labels)
    # First pass: find, for each cluster, the point with the maximum confidence.
    for idx in range(len(labels)):
        if confidences[idx] > max_conf[labels[idx]]:
            max_conf[labels[idx]] = confidences[idx]
            arg_max_conf[labels[idx]] = idx
    # Second pass: assign that point as the head of every sentence in its cluster.
    for idx in range(len(labels)):
        heads[idx] = arg_max_conf[labels[idx]]
    return heads

'''
Class for running the K-Means clustering algorithm.
Takes as input a dataframe, the cluster label for each sentence, the cluster centers,
the cluster head for each sentence and the cluster confidence with the cluster centers
for each sentence, from the previous clustering algorithm in a pipeline, or standalone.
'''
class KMeans(object):
    def __init__(self, data, labels, centers, heads, confidences, rep_obj):
        self.confidences = confidences
        self.data = data
        self.labels = labels
        self.centers = centers
        self.heads = heads
        self.rep_obj = rep_obj

    def run(self):
        '''
        Function to run K-Means clustering within each existing cluster.
        '''
        cluster_grps = defaultdict(list)
        for idx in range(len(self.labels)):
            cluster_grps[self.labels[idx]].append(idx)
        n = len(self.labels)
        labels, heads, confidences = [0] * n, [0] * n, [0] * n
        centers = []
        max_cluster_id = 0
        for cl_id, points in cluster_grps.items():
            vectors = [self.rep_obj.representations[pt] for pt in points]
            num_clusters = int(math.sqrt(len(points)))
            clusters = cluster_representations(vectors, num_clusters)
            if len(centers) == 0:
                centers = clusters.cluster_centers_
            else:
                centers = np.vstack((centers, clusters.cluster_centers_))
            confs = get_cluster_confidences(clusters.cluster_centers_, vectors)
            for idx in range(len(points)):
                pt = points[idx]
                labels[pt] = max_cluster_id + clusters.labels_[idx]
                confidences[pt] = confs[idx]
            # Advance the running counter by the number of centers stacked, so that the
            # new labels stay aligned with the rows of 'centers'.
            max_cluster_id += clusters.cluster_centers_.shape[0]
        heads = get_cluster_heads(labels, confidences)
        return labels, centers, heads, confidences

Following are some notable features of the above class :

  • We are using the Mini Batch K-Means algorithm instead of full K-Means. Mini Batch creates approximate clusters but is much faster and consumes a lot less memory than full K-Means. Anyway, we are not much concerned about cluster quality at this stage.
  • We take the clusters returned by the previous algorithm in the pipeline and apply Mini Batch K-Means on each cluster.
  • We compute the new cluster labels by keeping a running counter of the total number of clusters created so far, and the new centers by stacking the cluster centers vertically.

The following class defines the pruning logic. Although it is not a separate clustering algorithm, its I/O is the same as that of the clustering classes, which means it can easily fit into the pipeline.

from sklearn.metrics.pairwise import euclidean_distances
from collections import defaultdict
import numpy as np
import math
import config as cfg
import clustering.kmeans_clustering as km

'''
Class for pruning large clusters.
Takes as input a dataframe, the cluster label for each sentence, the cluster centers,
the cluster head for each sentence and the cluster confidence with the cluster centers
for each sentence, from the previous clustering algorithm in a pipeline, or standalone.
'''
class Prune(object):
    def __init__(self, data, labels, centers, heads, confidences, rep_obj):
        self.confidences = confidences
        self.data = data
        self.labels = labels
        self.centers = centers
        self.heads = heads
        self.rep_obj = rep_obj

    def run(self):
        """
        Function to prune the clusters.
        Separate out all the points of a cluster (having at least 10 points) that lie at a
        distance of more than 0.85 times the cluster radius from the cluster center.
        The cluster radius is the maximum distance of any point in the cluster from its cluster center.
        Create additional clusters out of all the separated points.
        """
        cluster_grps = defaultdict(list)
        for idx in range(len(self.labels)):
            cluster_grps[self.labels[idx]].append(idx)
        to_prune = []
        for cl_id, points in cluster_grps.items():
            if len(points) > cfg.PRUNE_MIN_SIZE:
                vecs = [self.rep_obj.representations[idx] for idx in points]
                out = euclidean_distances(np.array(vecs), np.array([self.centers[cl_id]]))
                out = out.T[0]
                max_radius = np.max(out)
                to_prune += [points[idx] for idx in range(len(out)) if out[idx] > cfg.PRUNE_RADIUS * max_radius]
        pruned_vectors = [self.rep_obj.representations[idx] for idx in to_prune]
        pruned_clusters = km.cluster_representations(pruned_vectors, int(math.sqrt(len(pruned_vectors))))
        pruned_labels = pruned_clusters.labels_
        pruned_labels = [len(self.centers) + x for x in pruned_labels]
        pruned_cluster_centers = pruned_clusters.cluster_centers_
        labels, heads = self.labels, self.heads
        centers = np.array(list(self.centers) + list(pruned_cluster_centers))
        confidences = km.get_cluster_confidences(centers, self.rep_obj.representations)
        for idx in range(len(pruned_labels)):
            labels[to_prune[idx]] = pruned_labels[idx]
        heads = km.get_cluster_heads(labels, confidences)
        return labels, centers, heads, confidences

Following are some notable features of the above class :

  • Only clusters that have at least some minimum number of elements are pruned.
  • For clusters satisfying the minimum size criterion, we first remove all points lying at a distance of more than 0.85 times the radius of the cluster from its center, and then re-cluster these excluded points by re-running Mini Batch K-Means on them.
  • Cluster labels and centers are re-computed to take into account the new set of clusters formed.

And lastly we have the agglomerative clustering class, which uses the pairwise similarity matrix to create the final intent clusters.

import config as cfg
from collections import defaultdict

'''
Class for running the agglomerative clustering algorithm.
Takes as input a dataframe, the cluster label for each sentence, the cluster centers,
the cluster head for each sentence and the cluster confidence with the cluster centers
for each sentence, from the previous clustering algorithm in a pipeline, or standalone.
'''
class Agglomerative(object):
    def __init__(self, data, labels, centers, heads, confidences, rep_obj):
        self.confidences = confidences
        self.data = data
        self.labels = labels
        self.centers = centers
        self.heads = heads
        self.rep_obj = rep_obj

    def run(self):
        '''
        Function to run agglomerative clustering.
        '''
        output_clusters, cluster_grps = dict(), defaultdict(list)
        for idx in range(len(self.labels)):
            cluster_grps[self.labels[idx]].append(idx)
        max_cluster_id = 0
        for cl_id, points in cluster_grps.items():
            similarities = self.rep_obj.get_pairwise_similarities(points)
            pt_indexes = self.rep_obj.get_sorted_by_similarity(similarities, cfg.COSINE_SIMILARITY_THRESHOLD)
            for idx in pt_indexes:
                q_idx, cluster_id = points[idx], max_cluster_id
                if q_idx not in output_clusters:
                    most_similar = self.rep_obj.get_representation_index(similarities, idx)
                    output_clusters[q_idx] = (cluster_id, 1.0, q_idx)
                    if len(most_similar) > 0:
                        most_similar = [(x, points[y]) for x, y in most_similar if points[y] != q_idx and x >= cfg.COSINE_SIMILARITY_THRESHOLD]
                        for sim, index in most_similar:
                            if index not in output_clusters or output_clusters[index][1] < sim:
                                output_clusters[index] = (cluster_id, sim, q_idx)
                    max_cluster_id += 1
        n = len(self.labels)
        labels, heads, confidences = [0] * n, [0] * n, [0] * n
        centers = [0] * max_cluster_id
        for pt, cluster in output_clusters.items():
            labels[pt] = cluster[0]
            confidences[pt] = cluster[1]
            heads[pt] = cluster[2]
            centers[cluster[0]] = self.rep_obj.representations[cluster[2]]
        return labels, centers, heads, confidences

Following are some notable features of the above class :

  • The above algorithm is a bit different from the standard agglomerative clustering algorithm (which hierarchically merges clusters).
  • Inside each cluster (from the previous algorithm in the pipeline), we compute the pairwise similarity between the questions in that cluster.
  • We sort the questions in the cluster by how many other questions in the same cluster are similar to them (as defined by the similarity threshold).
  • We choose the question with the highest number of similar questions in the cluster and create a sub-cluster out of all these similar questions, with the chosen question as the sub-cluster head.
  • We repeat the above step for the remaining questions in each cluster until no more cluster heads can be created, i.e. all questions in the cluster have been assigned to some sub-cluster.
  • If a question is similar to two or more sub-cluster heads, it is assigned to the head with which it is most similar.

The sub-clusters obtained from the above agglomerative class are the final intent clusters that contain similar questions.

Categories: AI, DESIGN, MACHINE LEARNING
