In the second post of this series we had listed down different vectorization algorithms used in our experiments for representing questions. Representations form the core of our intent clusters, because the assumption is that if a representation algorithm can capture syntactic as well as semantic meaning of the questions well, then if two questions which actually speak of the same intent, will have representations that are very close to each other in the Euclidean space.
The role of clustering is only to bring the questions of the same intent together on the basis of their representations.
The choice of different representation algorithms was not simple and definitely not arbitrary. We wanted different representations to capture both syntactic as well as semantic similarity between two questions. Since our original dataset was un-labelled, we chose to evaluate different representations on Quora'a duplicate questions data-set which had around 400K pairs of tagged question pairs, out of which around 537K were unique questions.
Following are the different category of algorithms chosen :
- Discriminative vs. Generative models.
- Deep Learning vs. Classical Machine Learning
The unsupervised vectorisation algorithms that were evaluated were as follows :
- PCA dimensionality reduction on TF-IDF representations.
- Skip-Gram (1, 3, 2) : meaning that minimum length of skip grams was 1 word and maximum length was 3 words and the maximum skips allowed was 2.
- IDF weighted word embeddings.
- Sequence-To-Sequence Variational Auto-encoders trained with word embeddings as inputs.
- PCA reduction on Feature Graph Vectors (will introduce in this post).
In this post, we will use each of the above algorithms for representing questions and then once we obtain the clusters and the cluster labels, we can evaluate the accuracy of the algorithm using the following strategy :
- Since there are 400K pair of questions, we will concatenate them into a single list of 800K questions.
- If memory is not a constraint, then removing duplicate questions is not necessary for the method I am going to apply. The basic assumption is that duplicate questions will anyways come into the same cluster.
- For each pair of tagged question pair, create a map with key as the pair of questions and the value as 1 if they are tagged duplicate else 0.
- Now for each pair of tagged question, check the cluster label for both the question in the pair. If the tagged label is 1 and if they have same cluster label, then this pair is correctly classified. Similarly, If the tagged label is 0 and if they have different cluster labels, then this pair is again correctly classified.
- Compute the percentage of 400K pairs that are correctly classified.
The drawback with this method is that, ideally we should have evaluated the clustering based on every pair of questions. But since we have tagged labels for only 400K of these pairs, we cannot assume anything about the remaining pairs of questions.
Before getting into the numbers, I would like to explain one of the representation, that was not introduced in any of the earlier posts. These are the the feature graph based representations.
The idea behind feature graphs is pretty simple and intuitive.
"Two words are considered similar to each another if they occur in similar context"
The graph generation algorithm is as follows :
- Create nodes corresponding to each word found in the vocabulary of the questions. To limit the size of vocabulary, only consider words whose frequency of occurrence is at-least 5 in the entire corpus.
- The context of a word are all words found around the word in the questions. If the context size is set to 5, then the context words are 5 words to the left and 5 words to the right of the current word.
- Create edges between nodes if the words corresponding to these nodes occur at-least once in the same context. The edges are directed.
- The weight of an edge from node A to B is equal to the frequency of the word B occurring in the same context as word A. The edge weights are then normalized to sum to 1.
- Additionally each node is also assigned a score, that is equal to the inverse of the number of nodes connected to it through a direct edge. This has implications that, important words occur in lesser number of contexts as compared to common and un-important words such as "can", "what", "and", "but", "how" etc., thus important words will have higher scores.
- Finally we create a N-by-N feature-feature matrix (feature adjacency matrix) A.
- The value for the entry Aij = Wij * Si, where Wij is the edge weight from node i to node j and Si is the score of the node i.
- Similarity between two words P and Q is computed as the cosine similarity between the two rows 'p' and 'q' of the matrix corresponding to the these two words :
- To generate a representation for a sentence, simply compute the sum of the representations for the words in the sentenced. For example, given the question "How to open a bank account", the representation in terms of the adjacency matrix rows is as follows : Ahow + Ato + Aopen + Aa + Abank + Aaccount
Note that, this approach is not scalable as the size of the feature adjacency matrix is of the order of O(N2) for N features. Also the question representation matrix for M questions, would be of the order of O(M*N). One solution is to reduce the dimensionality of the feature-adjacency matrix using PCA.
Although we will lose information regarding which are the important features in the context for a particular word but that is of lower concern for us till we are able to find similarity between two words or two sentences with lesser memory and higher speed. We have used 128 dimensional PCA reduced vectors for feature-adjacency matrix.
Following is the function for computing the feature-adjacency matrix :
""" Function to create feature graph. For each word, words occurring within its context, i.e. within a window size, has an edge with this word, with weight equals to the frequency of occurrence of the pair of words in sentences. """ def create_feature_graph(sentences): print "Creating Feature Graph..." token_frequency = collections.defaultdict(int) for sent in sentences: tokens = get_feature_graph_tokens(sent) for token in tokens: token_frequency[token] += 1 feature_graph = collections.defaultdict(dict) for sent in sentences: tokens = get_feature_graph_tokens(sent) for idx in range(len(tokens)): token = tokens[idx] if token_frequency[token] >= 5: start, end = max(0, idx - 5, min(len(tokens) - 1, idx + 5) for idx2 in range(start, end + 1): token2 = tokens[idx2] if idx2 != idx and token_frequency[token2] >= 5: if token2 not in feature_graph[token]: feature_graph[token][token2] = 0 feature_graph[token][token2] += 1 for key in feature_graph: sums = 0.0 for key2, val2 in feature_graph[key].items(): sums += val2 for key2 in feature_graph[key]: feature_graph[key][key2] /= float(sums) feature_vec, feature_names = ,  for key in feature_graph: feature_vec.append(feature_graph[key]) feature_names.append(key) vectorizer = DictVectorizer(sparse=False) feature_data = vectorizer.fit_transform(feature_vec) print "Computing Feature Scores..." feature_scores = 1. / numpy.count_nonzero(feature_data, axis=1) feature_scores /= numpy.sum(feature_scores) feature_data = (feature_data.T * feature_scores).T print "Doing PCA dimensionality reduction..." feature_data = TruncatedSVD(n_components=128).fit_transform(feature_data) return feature_data, feature_names
Following are functions to compute word similarity and finding most similar words given a word from the vocabulary.
""" Function to get word similarity from feature graph. """ def get_word_similarity(word1, word2, feature_data, feature_names): x = feature_data[feature_names.index(word1)] y = feature_data[feature_names.index(word2)] return compute_feature_graph_similarity(x, y) """ Function to get most similar words for a given word from feature graph. """ def get_most_similar_words(word, feature_data, feature_names, num_similar=5): pairwise_sims = get_pairwise_similarities(feature_data) vector = pairwise_sims[feature_names.index(word)] vector = zip(vector, range(len(vector))) vector = sorted(vector, key=lambda k:-k)[:min(num_similar, len(vector))] return [feature_names[y] for x, y in vector]
For e.g. the feature graph trained with the duplicate questions dataset, returns the 10 most similar words to the words as follows:
- ['india', 'australia', 'nigeria', 'germany', 'europe', 'mumbai', 'kenya', 'singapore', 'delhi', 'kolkata']
- names of places, makes sense.
- ['youtube', 'facebook', 'linkedin', 'twitter', 'tumblr', 'skype', 'videos', 'ads', 'whatsapp', 'windows']
- names of popular apps.
- ['entrepreneur', 'banker', 'astrophysicist', 'investor', 'astronaut', 'actuary', 'adult', 'immigrant', 'architect', 'officer']
- names of professions.
Function to compute representation for a sentence :
def get_sentence_representation(sentences, feature_data, feature_names): representations = numpy.zeros((len(sentences), feature_data.shape)) feature_names_index = defaultdict(int) for idx in range(len(feature_names)): feature_names_index[feature_names[idx]] = idx for idx in range(len(sentences)): sent = sentences[idx] tokens = get_tokens(sent) indexes = [feature_names_index[token] for token in tokens if token in feature_names_index] representations[idx] = numpy.sum(feature_data[indexes,:], axis=0) return representations
Similar to words, we can find out the most similar questions given a question :
- How do I make scrambled eggs?
- What is the best way to cook scrambled eggs?
- How do you make scrambled eggs without milk?
- Why are scrambled eggs good for you and what benefits do they provide?
- Do your taste buds disappear with age?
- Why do taste buds disappear with age?
- Is it true that taste buds disappear with old age?
- What is the best weight loss supplements for womens?
- Does eating Kellogg womens K in breakfast help you lose weight?
- What are the best weight loss supplements for men and womens?
- Is it me or is it harder to maintain friendships?
- Why is it difficult to maintain friendships?
- How do you keep lifetime friendships?
Next we present the accuracy numbers of our intent clustering algorithm on the duplicate questions dataset :
- PCA dimensionality reduction on TF-IDF representations : 64%
- Skip-Gram (1, 3, 2) : 63%
- IDF weighted word embeddings : 63%
- Sequence-To-Sequence Variational Auto-encoders trained with word embeddings as inputs : 63 %
- PCA reduction on Feature Graph Vectors (will introduce in this post) : 65%
- Dense layer representations obtained from supervised LSTM model (cosine proximity output layer) : 66%
There is not much significant difference between these accuracy numbers, but we observe a few surprising things directly from these numbers as well as from analysis of the clustering outputs.
- Algorithms which were expected to give good numbers such as word embeddings and variational autoencoders actually performed worse by 2-3%.
- Generative models such as PCA and Feature Graphs actually performed better than Discriminative models like word embeddings.
- Variational Autoencoders although is a generative model but it was trained with word embeddings as inputs.
- Deep Learning model like Sequence-To-Sequence Variational Auto-encoders could not be trained well because the data was in-sufficient to capture semantic similarities.
- From Andrew Ng's paper, he concludes that generative model's error decreases sub-linearly with the amount of data whereas discriminative model's error decreases linearly with data.
- Implies that generative model requires lesser data to reach the same level of error as compared to discriminative model.
- Although generative model's error is lower with less data but its error stops decreasing after a certain point. The asymptotic error with discriminative models is lower than the asymptotic error with generative models.
- Given enough data, word embeddings can outperform PCA and feature graphs.
- Although the AUC score obtained with the supervised parallel LSTM model was around 0.93, but surprisingly the accuracy obtained in clustering using the same hidden layer representations is 0.66.
- To understand why this is so, accuracy numbers is not enough, we need to look at the False Positive and False Negative numbers also.
- False Positive Rate = 0.08, False Negative Rate = 0.79
- Vey high FNR compared to FPR, indicates that clustering makes errors mostly is detecting similar questions, if questions are not similar then clustering does good in keeping them into separate clusters.
- This can be reasoned from the perspective of using centroid based clustering such as K-Means in the first level.
- K-Means clustering computes clusters using euclidean distances between representations and not cosine similarity.
- K-Means clustering puts some of the similar questions (cosine similarity) into different clusters, as a result, the 2nd level clustering with hierarchical agglomerative algorithm cannot recover the similar questions lost from each cluster.
To resolve the problem with a very low accuracy obtained with the supervised Q-Q similarity LSTM model, I decided to try out two things :
- Instead of using cosine similarity for merging the two dense layer outputs in the LSTM network, I had used euclidean similarity for merging the 2 layers.
- This means that, I also needed to change to euclidean similarity from cosine similarity for hierarchical agglomerative clustering (pairwise similarities).
- Normally we use euclidean distances, not similarity. But to obtain a similarity measure from distance, I had used the following transformation :
sim(x, y) = 1 - tanh(euclidean_distance(x, y))
- The good thing is that, the above similarity measure gives a number between 0 and 1.
- Both the LSTM merge layer and the pairwise similarity matrix calculations in clustering, uses the above measure.
- Following is the modified function for the LSTM network :
""" Function to compute euclidean similarity between outputs from two layers of the network. """ def euclidean_similarity(vecs): x, y = vecs dists = K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True)) return 1.0 - K.tanh(dists) """ Function to get output shape of the euclidean similarity layer. """ def euclidean_similarity_output_shape(shapes): shape1, shape2 = shapes return (shape1, 1) """ Function to train the neural network model for question similarity. Two parallel LSTM layer's (for each pair of questions) output is concatenated and trained. """ def train_feedback_model(q_data_1, q_data_2, labels): print("Defining architecture...") q_input_1 = Input(shape=(q_data_1.shape, q_data_1.shape, )) q_input_2 = Input(shape=(q_data_2.shape, q_data_2.shape, )) q_lstm_1 = LSTM(128)(q_input_1) q_lstm_2 = LSTM(128)(q_input_2) dense_1 = Dense(128, activation='tanh')(q_lstm_1) dense_2 = Dense(128, activation='tanh')(q_lstm_2) predictions = Lambda(euclidean_similarity, output_shape=euclidean_similarity_output_shape)([dense_1, dense_2]) model = Model(inputs=[q_input_1, q_input_2], outputs=predictions) model.compile(optimizer='adam', loss="mse", metrics=['mse']) print("Training model...") model.fit([q_data_1, q_data_2], labels, epochs=cfg.LSTM_NUM_ITERS, batch_size=cfg.LSTM_BATCH_SIZE, verbose=2) print("Scoring...") predicted = model.predict([q_data_1, q_data_2]) score = roc_auc_score(labels, predicted, average="weighted") print("Score = ", score) return model
- Keras does not come with an in-built implementation for an euclidean similarity layer as in Dot() for cosine_proximity. I had to define my own euclidean similarity layer using the Lambda layer function of Keras.
- For the 2 dense layers, 'dense_1' and 'dense_2', I had used 'tanh' activation with 'mean-squared-error' loss function. Using 'relu' activation can cause NaN's with the 'mean-squared-error' loss function.
- Surprisingly, the question representations obtained from averaging the dense layer outputs, gives the same accuracy i.e. 65% on clustering these representations using KMeans + Agglomerative pipeline.
- The only possible explanation that remains for the lower than expected accuracy from the clustering outputs is K-Means (centroid based clustering). So in the next experiment, I decided to do away with K-Means and use only hierarchical agglomerative algorithm instead.
- Doing away with K-Means as the 1st level of clustering will lead us to memory issues with large number of questions, which is exactly for what K-Means was used as the 1st level of clustering.
- Thus instead of 800K questions, I decided to explore the performance with only a subset i.e. 20K questions (10K pairs).
- The numbers on 20K questions (10K pairs) with K-Means and without K-Means actually identify the problem. The accuracy with K-Means as the 1st level of clustering is 65%, whereas the accuracy without K-Means is 81%.