In continuation of my earlier posts on designing an automated question-answering system, part three of the series looks at how to incorporate feedback into our system. Since labelled data is expensive to obtain in terms of company resources, the amount of feedback from human agents is very low (about 2-3% of the total number of questions). With so little labelled data we cannot train a supervised model from scratch, but we can probably improve incrementally on the existing un-supervised model.
Given that we get feedback on how accurate our intent clusters are, there are two ways in which we incorporate that knowledge into our system. The first and most obvious one is to add the feedback directly into our intent clusters. Since we now know which questions should be in the same cluster and which should not, we must change our clusters to reflect that. It is not as easy as it sounds. Let me explain why.
Suppose we are given a pair of questions and a tag. If the tag is 1, i.e. the questions are similar, and they are already in the same cluster, then we don't do anything. But if they are in different clusters, they should be put into the same cluster. But which cluster?
Let's say that the questions are denoted A and B. If B is put into A's cluster, or A into B's cluster, then we should be done, right? But a new problem arises. If B is moved to A's cluster, what if there was a question C in B's cluster for which the tag for B and C is 1? Or what if there is a question D in A's cluster for which the tag for B and D is 0? We may not have tags for the pairs A and C or A and D. Thus the clusters would be in an inconsistent state.
The same problem would arise if the tag for A and B was 0 and they were in the same cluster. There is another problem that we have not accounted for: the tags also cover questions from outside sources, so some tagged questions may not be present in our own clusters.
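The inconsistency with questions A, B, C and D can be checked mechanically. The following is only an illustrative sketch: `find_conflicts`, the `membership` map and the toy feedback list are hypothetical names and data, not part of the system described in this series.

```python
def find_conflicts(membership, feedbacks):
    """Return the tagged pairs that the current cluster assignment violates."""
    conflicts = []
    for q1, q2, tag in feedbacks:
        if q1 not in membership or q2 not in membership:
            continue  # question from an outside source, not clustered yet
        same_cluster = membership[q1] == membership[q2]
        # tag 1 requires the same cluster, tag 0 requires different clusters
        if (tag == 1) != same_cluster:
            conflicts.append((q1, q2, tag))
    return conflicts


# Hypothetical state: A and D share cluster 0, B and C share cluster 1.
membership = {"A": 0, "D": 0, "B": 1, "C": 1}
feedbacks = [("A", "B", 1), ("B", "C", 1), ("B", "D", 0)]

before = find_conflicts(membership, feedbacks)

# "Fix" the A-B pair by moving only B into A's cluster.
membership["B"] = membership["A"]
after = find_conflicts(membership, feedbacks)

print(before)  # the A-B pair is violated
print(after)   # A-B is fixed, but B-C and B-D are now violated
```

Moving a single question resolves one tagged pair and breaks two others, which is why the clusters cannot simply be patched one pair at a time.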
One easy solution to the above problem is to treat the tagged question pairs as completely separate from our intent clusters.
We cluster the feedback question pairs (tag = 1) using a hierarchical approach, and then, for questions which are already part of the existing intent clusters, update their cluster labels.
The drawback is that questions which were earlier in the same cluster as one of the questions in a tagged pair will now be in different clusters. This increases the retrieval time for similar questions, since the number of clusters grows, but it maintains consistency with the labelled data.
The following piece of code creates the new set of clusters from the tagged question pairs (for which tag = 1).
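from collections import defaultdict

# feedbacks: iterable of (question_1, question_2, tag) tuples, tag is 'Y'/'y' for similar
new_clusters = []                          # list of lists: questions in each new cluster
new_cluster_membership = defaultdict(int)  # question -> index into new_clusters

while True:
    flag = False  # set to True whenever a merge happens in this pass
    for question_1, question_2, tag in feedbacks:
        w = 1 if tag == 'Y' or tag == 'y' else 0

        # Unseen questions start out in their own singleton cluster.
        if question_1 not in new_cluster_membership:
            new_clusters.append([question_1])
            new_cluster_membership[question_1] = len(new_clusters) - 1

        if question_2 not in new_cluster_membership:
            new_clusters.append([question_2])
            new_cluster_membership[question_2] = len(new_clusters) - 1

        if w == 1:
            cl_1, cl_2 = new_cluster_membership[question_1], new_cluster_membership[question_2]
            if cl_1 != cl_2:
                # Merge cluster cl_2 into cl_1 and leave cl_2 empty.
                flag = True
                new_clusters[cl_1] += new_clusters[cl_2]
                for g in new_clusters[cl_2]:
                    new_cluster_membership[g] = cl_1
                new_clusters[cl_2] = []

    if flag is False:
        break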
We maintain a map from each question to the cluster index it has been assigned to, and a list of lists holding the questions that belong to each new cluster. For each tagged pair, we check the cluster membership of both questions. If a question has not been assigned a cluster, we create a new cluster and map the question to this new cluster id. Then, if the tag is 1 and the two questions have different cluster ids, we move all questions from one cluster into the other (the cluster merging step).
Repeat this process until it is not possible to change the cluster membership of any question.
Next we search for each tagged question in our existing intent clusters; if it is present we update its cluster id, otherwise we simply assign the new question to its new cluster id.
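The update step can be sketched as follows. This is a minimal illustration under assumptions: `existing_membership` (a question to cluster id map for the current intent clusters) and `apply_feedback_clusters` are hypothetical names, and giving the feedback clusters fresh ids above the largest existing id is my own choice, since the actual id scheme is not shown here.

```python
def apply_feedback_clusters(existing_membership, new_clusters):
    """Overlay the feedback clusters on the existing question -> cluster id map."""
    updated = dict(existing_membership)
    # Assumption: feedback clusters receive fresh ids above all existing ids.
    next_id = max(existing_membership.values(), default=-1) + 1
    for cluster in new_clusters:
        if not cluster:
            continue  # cluster emptied by a merge
        for question in cluster:
            # Overwrites the id of a known question, adds an unknown one.
            updated[question] = next_id
        next_id += 1
    return updated


existing = {"q1": 0, "q2": 0, "q3": 1}
feedback_clusters = [["q1", "q3", "q9"], [], ["q7"]]
updated = apply_feedback_clusters(existing, feedback_clusters)
print(updated)
```

Here `q1` and `q3` leave their old clusters and join `q9` (a question from an outside source) in a new one, while `q2` stays behind in cluster 0, which is exactly the drawback mentioned earlier.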
This was not the only way to incorporate feedback. Another interesting way in which we do it is as follows.
Remember that earlier we talked about sentence vector representations, i.e. transforming each question into a semantic or syntactic vector representation and then using these vectors for cluster computation. A few of the representations we looked at were PCA, Skip Gram vectors and IDF weighted word embeddings. All of them are un-supervised vector representations.
The problem with un-supervised representations is that, although they capture individual word semantics, they fail to assign appropriate weights to words when we are comparing two questions. For example, take the two questions:
"What is the procedure to change my address ?"
"What is the procedure to change my laptop ?"
All our earlier methods of representing questions as vectors would give very high similarity to the above pair of questions, but in fact they have different answers. Un-supervised learning will only get you so far.
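To see why this happens, here is a toy calculation with averaged word vectors. The 3-dimensional embeddings are made up purely for illustration (real word2vec vectors would differ), and "address" and "laptop" are deliberately given opposite vectors to make the point as strongly as possible.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_vector(tokens, embeddings):
    """Sentence vector = plain average of the word vectors."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        for i, x in enumerate(embeddings[token]):
            total[i] += x
    return [x / len(tokens) for x in total]


# Made-up embeddings, not taken from any trained model.
toy = {
    "what": (1.0, 0.0, 0.0), "is": (1.0, 0.0, 0.0), "the": (1.0, 0.0, 0.0),
    "to": (1.0, 0.0, 0.0), "my": (1.0, 0.0, 0.0),
    "procedure": (0.0, 1.0, 0.0), "change": (0.0, 1.0, 0.0),
    "address": (0.0, 0.0, 1.0), "laptop": (0.0, 0.0, -1.0),
}

q1 = "what is the procedure to change my address".split()
q2 = "what is the procedure to change my laptop".split()

word_sim = cosine(toy["address"], toy["laptop"])
sentence_sim = cosine(average_vector(q1, toy), average_vector(q2, toy))
print(word_sim, sentence_sim)
```

Even though the only differing words point in opposite directions, the seven shared words dominate the average and the sentence similarity stays above 0.9.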
Our goal in this post is to use the word embeddings as input to an RNN/LSTM model and learn better sentence representations captured through the hidden layers of the network.
The idea is to train an LSTM model on tagged pairs of questions and then use the weights learnt by the hidden layers of the network to generate vector representations for questions. Similar to the other representations, the learnt LSTM representations can be used independently or in combination with other un-supervised representations.
We will use the Keras library in Python to train the LSTM model.
To train the LSTM model, we need the input to be a 3D tensor of shape (x, y, z), where x is the number of training examples, y is the number of words or tokens in each example (called time-steps) and z is the vector size of the representation of each word or token.
So for a question like "What is the procedure to withdraw money from ATM ?", after removing non-alphabetic characters, we are left with 9 words and if we train a word2vec model with a vector size of 128, then our tensor for this example alone will be of size (1, 9, 128).
But there could be many examples, and each example will have a different number of words. We could either take the maximum number of words over all questions or some fixed number. From our analysis, more than 98% of the questions have 30 words or fewer, so we decided to keep our input tensor at size (N, 30, 128).
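The kind of analysis behind a cutoff like 30 can be sketched in a few lines. This is a hedged illustration: `coverage` is a hypothetical helper, whitespace splitting stands in for the tokenizer used in this series, and the sample questions are made up.

```python
def coverage(questions, max_words):
    """Fraction of questions that fit within max_words tokens without truncation."""
    lengths = [len(question.split()) for question in questions]
    return sum(1 for n in lengths if n <= max_words) / len(lengths)


# Made-up sample; in practice this would be the full question corpus.
sample = [
    "how do I reset my password",
    "what is the procedure to change my address",
    "where is my order",
]

for cutoff in (4, 6, 8):
    print(cutoff, coverage(sample, cutoff))
```

Running this over the real corpus for a range of cutoffs shows where the coverage curve flattens, which is how a fixed length can be picked without padding every question to the length of the longest one.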
Questions with more than 30 words will be truncated, whereas questions with fewer than 30 words will be padded with zero vectors.
import numpy as np

def transform_sentence(sentence, word_vectors):
    # get_tokens is the tokenizer helper from the earlier posts
    tokens = get_tokens(sentence)
    max_words, w2v_dim = 30, word_vectors.vector_size

    lst = []
    for token in tokens:
        if token not in word_vectors:
            # out-of-vocabulary tokens are represented by a zero vector
            lst.append(np.zeros(w2v_dim, dtype=np.float32))
        else:
            lst.append(word_vectors[token])

    if len(lst) > max_words:
        # truncate long questions
        lst = lst[:max_words]
    else:
        # pad short questions with zero vectors up to max_words
        lst += [np.zeros(w2v_dim, dtype=np.float32)] * (max_words - len(lst))

    return np.array(lst)  # shape (max_words, w2v_dim)

def get_data_pairs(question_pair_1, question_pair_2, word_vectors):
    print("Generating training data...")
    # each result has shape (N, max_words, w2v_dim)
    q_data_1 = np.array([transform_sentence(question, word_vectors) for question in question_pair_1])
    q_data_2 = np.array([transform_sentence(question, word_vectors) for question in question_pair_2])
    return q_data_1, q_data_2
Next up, we specify the LSTM architecture that we will be using for training.
Following is the Python code for training the above parallel LSTM model in Keras.
import keras
from keras.layers import Input, LSTM, Dense
from keras.models import Model
from sklearn.metrics import roc_auc_score

# cfg is the project's config module (not shown here)

def train_feedback_model(q_data_1, q_data_2, labels):
    print("Defining architecture...")
    # Each input has shape (time-steps, embedding size), e.g. (30, 128).
    q_input_1 = Input(shape=(q_data_1.shape[1], q_data_1.shape[2]))
    q_input_2 = Input(shape=(q_data_2.shape[1], q_data_2.shape[2]))

    # One LSTM per question, run in parallel.
    q_lstm_1 = LSTM(128)(q_input_1)
    q_lstm_2 = LSTM(128)(q_input_2)

    # Concatenate both question encodings and classify the pair.
    merged_vector = keras.layers.concatenate([q_lstm_1, q_lstm_2], axis=-1)
    dense_layer = Dense(128, activation='relu')(merged_vector)
    predictions = Dense(1, activation='sigmoid')(dense_layer)

    model = Model(inputs=[q_input_1, q_input_2], outputs=predictions)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=['accuracy'])

    print("Training model...")
    model.fit([q_data_1, q_data_2], labels, epochs=cfg.LSTM_NUM_ITERS, batch_size=cfg.LSTM_BATCH_SIZE)

    print("Scoring...")
    # Note: the score is computed on the training pairs themselves.
    predicted = model.predict([q_data_1, q_data_2])
    score = roc_auc_score(labels, predicted, average="weighted")
    print("Score = ", score)

    return model
'q_data_1' and 'q_data_2' correspond to our 3D tensors for the first and second question of each pair. We use the Functional API to define the architecture, as it is more convenient than a Sequential model for parallel inputs. We then evaluate the performance of the model on the tagged pairs of questions (the same pairs used for training) using the AUC score. On about 3800 pairs of questions, with 4 epochs and a batch size of 32, we obtain an AUC score of 0.89.
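For readers less familiar with the metric: for binary labels, the AUC is the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair, with ties counted as half. The sketch below computes that quantity directly. `pairwise_auc` is a hypothetical helper written for illustration only (it is quadratic in the number of examples), and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up predictions for four question pairs.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = pairwise_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.89 therefore means that roughly nine times out of ten the model scores a similar pair above a dissimilar one, without saying anything about where the decision threshold should sit.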
Quite obviously, an LSTM model trained on only 4K examples cannot be expected to keep the same accuracy or AUC score on unseen test examples.
To get around that, we augment our training data with the similar questions dataset available from Quora. The dataset has about 400K pairs of tagged questions, which is much better than having only 4K pairs. We trained the model for 50 epochs with a batch size of 32 and obtained an AUC score of 0.96 on the training data.
import gensim
import numpy as np
import pandas as pd

def train_feedbacks():
    labels = []

    print("Processing feedback file...")
    df_feedback = pd.read_csv(cfg.FEEDBACK_FILE_PATH)
    question_pair_1 = list(df_feedback['question1'])
    question_pair_2 = list(df_feedback['question2'])
    labels += list(df_feedback['is_duplicate'])

    print("Processing Quora similar questions dataset...")
    df_quora = pd.read_csv(cfg.QUORA_SIMILAR_QS_DATASET, sep='\t')
    question_pair_1 += list(df_quora['question1'])
    question_pair_2 += list(df_quora['question2'])
    labels += list(df_quora['is_duplicate'])

    # Missing values are read as NaN floats, so force everything to str.
    question_pair_1 = [str(x) for x in question_pair_1]
    question_pair_2 = [str(x) for x in question_pair_2]

    w2v_model = gensim.models.Word2Vec.load(cfg.W2V_MODEL_FILE)
    q_data_1, q_data_2 = get_data_pairs(question_pair_1, question_pair_2, w2v_model)

    feedback_model = train_feedback_model(q_data_1, q_data_2, np.array(labels))

    print("Saving lstm model...")
    # Architecture goes to JSON, weights go to a separate file.
    feedback_model_json = feedback_model.to_json()
    with open(cfg.LSTM_FEEDBACK_MODEL_JSON_PATH, "w") as json_file:
        json_file.write(feedback_model_json)
    feedback_model.save_weights(cfg.LSTM_FEEDBACK_MODEL_WEIGHTS_PATH)
Once the model is built, we save its architecture as JSON and its weights in a separate file.
Now we have a model which accepts a pair of questions and predicts a score close to 1 if the questions are similar and close to 0 if they are not. But remember that our goal is not to predict a similarity score right now, but to only get representations for questions, which can be used in clustering.
How do we get the model representations? There are multiple layers in the above architecture. One could potentially obtain hidden layer representations from 4 different layers: the two LSTM layers, the concatenation layer and the dense layer.
Notice that the model takes two questions as input at a time, but we need representations for questions individually. One trick is to use the same question for both input layers and then take the average of the outputs from the two LSTM layers.
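import numpy as np
import keras
from keras import backend as K

def get_activations(model, questions, word_vectors):
    print("Getting Hidden Layer Activations...")
    q_data = np.array([transform_sentence(question, word_vectors) for question in questions])

    # Layers 0 and 1 are the two inputs, layers 2 and 3 the two LSTMs,
    # assuming Keras lists the layers in the order they were created.
    activation_fn = K.function(
        [model.layers[0].input, model.layers[1].input],
        [keras.layers.Average()([model.layers[2].output, model.layers[3].output])])

    # K.function returns a list with one array per requested output.
    activations = activation_fn([q_data, q_data])

    return activations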
The above function will return the average of the two LSTM layer outputs as the question representations. Refer to the Keras FAQ page on how to obtain intermediate layer outputs.
Then we can use the LSTM feedback representations in the same way as we had used some of the un-supervised representations in the previous post. We have seen that it is best to concatenate the feedback representations along with the syntactic vectors such as Skip Grams to obtain better similarity between question pairs.
I realized later that question representations learnt this way do not actually make sense to use directly for clustering, because during hierarchical agglomerative clustering we evaluate the cosine similarity between pairs of questions to determine whether they should be part of the same intent or not.
But our LSTM model does not learn to maximize the cosine similarity for same intents; instead it learns a similarity function that says whether two questions belong to the same intent or not.
Note: this network is still a very good model when we only want to evaluate whether two questions belong to the same intent or not.
The following is probably a better network for the purpose of learning question representations that maximize the cosine similarity between same intents.
import keras
from keras.layers import Input, LSTM, Dense
from keras.models import Model
from sklearn.metrics import roc_auc_score

def train_feedback_model(q_data_1, q_data_2, labels):
    print("Defining architecture...")
    # Each input has shape (time-steps, embedding size), e.g. (30, 128).
    q_input_1 = Input(shape=(q_data_1.shape[1], q_data_1.shape[2]))
    q_input_2 = Input(shape=(q_data_2.shape[1], q_data_2.shape[2]))

    q_lstm_1 = LSTM(128)(q_input_1)
    q_lstm_2 = LSTM(128)(q_input_2)

    # Separate dense layer on top of each LSTM output.
    dense_1 = Dense(128, activation='relu')(q_lstm_1)
    dense_2 = Dense(128, activation='relu')(q_lstm_2)

    # normalize=True turns the dot product into a cosine similarity.
    predictions = keras.layers.dot([dense_1, dense_2], axes=-1, normalize=True)

    model = Model(inputs=[q_input_1, q_input_2], outputs=predictions)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=['mse'])

    print("Training model...")
    model.fit([q_data_1, q_data_2], labels, epochs=cfg.LSTM_NUM_ITERS, batch_size=cfg.LSTM_BATCH_SIZE)

    print("Scoring...")
    # Note: the score is computed on the training pairs themselves.
    predicted = model.predict([q_data_1, q_data_2])
    score = roc_auc_score(labels, predicted, average="weighted")
    print("Score = ", score)

    return model
The individual representations from the two LSTM layers are passed through two dense layers. After that, instead of concatenating the dense layer outputs, we take the cosine proximity of the two dense layer outputs along the last axis (i.e. for each example) and use it as the prediction layer.
In this way, the network will learn weights such that questions of the same intent have cosine similarities close to 1, while questions of different intents have cosine similarities close to 0.
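One detail worth spelling out, as my reading of the code rather than something stated in it: both dense layers use ReLU, so their outputs have no negative components, and the cosine between them is confined to the range 0 to 1 instead of -1 to 1. That is what makes it reasonable to compare the cosine against 0/1 labels. The small sketch below illustrates this with made-up pre-activation values; `relu` and `cosine` are plain-Python stand-ins, not Keras calls.

```python
import math


def relu(vector):
    return [max(0.0, x) for x in vector]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # guard against an all-zero vector


# Made-up dense layer pre-activations for two different questions.
pre_1 = [0.8, -1.2, 0.3]
pre_2 = [-0.5, 0.9, 0.4]

raw_similarity = cosine(pre_1, pre_2)                    # can be negative
rectified_similarity = cosine(relu(pre_1), relu(pre_2))  # never negative
print(raw_similarity, rectified_similarity)
```

Without the ReLU the same pair would have a negative cosine, which a 0/1 target with binary cross-entropy could not represent.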
The question representation is taken to be the average of the two dense layer outputs, given the same question as input to both LSTM layers.
We will see in the next post that the latter network actually performs better on clustering.