Machine Learning, AI and Programming

Designing a Question-Question Similarity Framework - Part III

In continuation of my earlier posts on designing a question-question similarity framework, part three of the series looks at how to incorporate a limited amount of supervised feedback into our system. Since getting labelled data is expensive in terms of company resources, the amount of supervised feedback from human agents is very low (~2-3% of the total number of questions). With so little labelled data we cannot train a supervised model on its own, but we can probably improve incrementally on the existing un-supervised model.

Given that we get feedback on how our question similarity system performs, there are two ways in which we incorporate that knowledge into our system. The first and most obvious one is to add the feedback into our question clusters. Since we now know which questions should be in the same cluster and which should not, we must change our clusters to reflect that. It is not as easy as it sounds. Let me explain why.

Suppose we are given a pair of questions and a tag. If the tag is 1, i.e. the questions are similar, and they are already in the same cluster, then we don't do anything. But if they are in different clusters, they should be put into the same cluster. But which cluster?

Let's say the questions are denoted A and B. If B is put into A's cluster, or A into B's cluster, then we should be done, right? But a new problem arises. If B is moved to A's cluster, what if there was a question C in B's cluster for which the tag for B and C is 1? Or what if there is a question D in A's cluster for which the tag for B and D is 0? We may not have tags for the pairs A and C or A and D. Thus the clusters would be in an inconsistent state.

The same problem would arise if the tag for A and B was 0 and they are in the same cluster. There is another problem that we have not accounted for. The tags are on questions from different sources. So it is possible that some tagged question may not be present in our own clusters.
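
To make the inconsistency concrete, here is a minimal sketch that lists the tagged pairs a given cluster assignment contradicts. The function name `find_violations`, the `membership` dict and the 1/0 tag encoding are assumptions made for illustration; they are not part of our actual system.

```python
def find_violations(membership, feedbacks):
    """Return the tagged pairs that contradict the current cluster assignment.

    membership: dict mapping question -> cluster id (hypothetical structure)
    feedbacks:  iterable of (question_1, question_2, tag); tag 1 = similar, 0 = not
    """
    violations = []
    for q1, q2, tag in feedbacks:
        # Tagged questions may come from other sources and be missing from our clusters.
        if q1 not in membership or q2 not in membership:
            continue
        same_cluster = membership[q1] == membership[q2]
        if (tag == 1) != same_cluster:
            violations.append((q1, q2, tag))
    return violations


# A and D share one cluster, B and C share another.
membership = {"A": 0, "D": 0, "B": 1, "C": 1}
feedbacks = [("A", "B", 1), ("B", "C", 1), ("B", "D", 0)]

before = find_violations(membership, feedbacks)

# Naively moving B into A's cluster repairs (A, B) but breaks (B, C) and (B, D).
membership["B"] = 0
after = find_violations(membership, feedbacks)
```

In this toy case one violated pair turns into two after the move, which is exactly the situation described above.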

One easy solution to the above problem is to treat the question pairs as completely separate from our clusters.

Cluster the feedback question pairs (tag = 1) using a hierarchical approach and then for questions which are already part of the clusters (use a KD-Tree here), update their cluster labels.

The drawback is that questions which were earlier in the same cluster as one of the questions in a tagged pair will now be in different clusters. This increases the retrieval time for similar questions as the number of clusters grows, but it maintains consistency with the labelled data.

The following piece of code creates the new set of clusters from the tagged question pairs (for which tag = 1).

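from collections import defaultdict

# feedbacks: list of (question_1, question_2, tag) tuples; tag is 'Y'/'y' for similar pairs
new_clusters = []                            # list of lists: questions in each new cluster
new_cluster_membership = defaultdict(int)    # question -> index into new_clusters
while True:
    flag = False                             # becomes True if any merge happens in this pass
    for question_1, question_2, tag in feedbacks:
        w = 1 if tag == 'Y' or tag == 'y' else 0
        # A question seen for the first time starts in its own new cluster.
        if question_1 not in new_cluster_membership:
            new_clusters.append([question_1])
            new_cluster_membership[question_1] = len(new_clusters) - 1
        if question_2 not in new_cluster_membership:
            new_clusters.append([question_2])
            new_cluster_membership[question_2] = len(new_clusters) - 1
        if w == 1:
            cl_1, cl_2 = new_cluster_membership[question_1], new_cluster_membership[question_2]
            if cl_1 != cl_2:
                flag = True
                # Merge cluster cl_2 into cl_1 and leave cl_2 empty.
                new_clusters[cl_1] += new_clusters[cl_2]
                for g in new_clusters[cl_2]:
                    new_cluster_membership[g] = cl_1
                new_clusters[cl_2] = []
    if flag is False:
        break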

We maintain a map from each question to the index of the cluster it has been assigned to, and a list of lists holding the questions that belong to each new cluster. For each tagged pair, we check the cluster membership of both questions. If a question has not been assigned a cluster, we create a new cluster and map the question to this new cluster id. Then, if the tag is 1 and the two questions have different cluster ids, we move all questions from one cluster to the other (the cluster merging step).

Repeat this process until it is not possible to change the cluster membership of any question.
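
As an aside, the same grouping can be reached with a disjoint-set (union-find) structure, which needs only a single pass over the feedback. This is a swapped-in technique shown as a sketch, not the code we run; the function name `cluster_similar_pairs` and the toy feedback list are made up for illustration.

```python
def cluster_similar_pairs(feedbacks):
    """Group questions linked by 'similar' tags using a disjoint-set structure."""
    parent = {}

    def find(question):
        # Register unseen questions as their own root.
        parent.setdefault(question, question)
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[question] != question:
            parent[question] = parent[parent[question]]
            question = parent[question]
        return question

    for question_1, question_2, tag in feedbacks:
        root_1, root_2 = find(question_1), find(question_2)
        # Only pairs tagged as similar cause a merge; other pairs are just registered.
        if str(tag).lower() == "y" and root_1 != root_2:
            parent[root_2] = root_1

    clusters = {}
    for question in parent:
        clusters.setdefault(find(question), []).append(question)
    return list(clusters.values())


toy_feedbacks = [("q1", "q2", "Y"), ("q2", "q3", "y"), ("q4", "q5", "N")]
toy_clusters = cluster_similar_pairs(toy_feedbacks)
```

Here q1, q2 and q3 end up together because similarity is chained through q2, while q4 and q5 each stay alone.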

Next we search for each tagged question in our old set of questions. If it is present we update its cluster id; otherwise we simply assign the new question to its new cluster id.
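
A minimal sketch of this update step is given below. It assumes the old clusters are stored as a plain dict from question text to cluster id and uses exact string matching for the lookup, which is a simplification of the KD-Tree lookup mentioned earlier. The name `apply_feedback_clusters` and the way fresh ids are allocated are illustrative choices.

```python
def apply_feedback_clusters(old_membership, new_clusters):
    """Overlay feedback clusters on an existing question -> cluster id map.

    Fresh ids start above the largest existing id, so they cannot collide
    with clusters that were not touched by feedback.
    """
    updated = dict(old_membership)
    next_id = max(old_membership.values(), default=-1) + 1
    for cluster in new_clusters:
        if not cluster:
            # Clusters emptied by the merging step carry no questions.
            continue
        for question in cluster:
            # Known question: relabel it. Unknown question: add it.
            updated[question] = next_id
        next_id += 1
    return updated


old_membership = {"a": 0, "b": 0, "c": 1}
feedback_clusters = [["a", "x"], [], ["c"]]
updated_membership = apply_feedback_clusters(old_membership, feedback_clusters)
```

Note how "b" keeps its old cluster while "a" moves out of it, which is the drawback discussed above.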

This was not the only way to incorporate feedback. Another interesting way in which we do it is as follows.

Remember that earlier we talked about sentence vector representations, i.e. transforming each question into a semantic or syntactic vector representation and then using these vectors for cluster computation. A few of the representations we looked at were TF-IDF + PCA vectors, Skip Gram vectors and IDF weighted word embeddings. All of them are un-supervised vector representations.

The problem with un-supervised representations is that, although they capture individual word semantics, they fail to assign appropriate weights to words when we compare two questions. For example, take the two questions:

"What is the procedure to withdraw money from ATM ?"

"What is the procedure to withdraw money from Bank ?"

All our earlier methods of representing questions as vectors would give very high similarity to the above pair of questions, but in fact they have different answers. Un-supervised learning will only get you so far.
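
To see why, here is a small sketch using an unweighted bag-of-words cosine similarity. This is a deliberately simple stand-in for the representations listed above (it is not TF-IDF + PCA or any embedding method), and the function name `bow_cosine` is made up, but it shows how a single differing word barely moves the score.

```python
import math
from collections import Counter


def bow_cosine(sentence_1, sentence_2):
    """Cosine similarity between unweighted word-count vectors."""
    counts_1 = Counter(sentence_1.lower().split())
    counts_2 = Counter(sentence_2.lower().split())
    dot = sum(counts_1[word] * counts_2[word] for word in counts_1)
    norm_1 = math.sqrt(sum(v * v for v in counts_1.values()))
    norm_2 = math.sqrt(sum(v * v for v in counts_2.values()))
    return dot / (norm_1 * norm_2)


q1 = "What is the procedure to withdraw money from ATM"
q2 = "What is the procedure to withdraw money from Bank"
similarity = bow_cosine(q1, q2)  # 8 shared words out of 9 in each question
```

Eight of the nine words match, so the similarity is 8/9, even though the one word that differs is the one that decides the answer.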

Our goal in this post is to use the word vectors (word2vec) as initial input to a Neural Network model and learn better sentence representations captured through the hidden layers of the network.

The obvious choice when working with word vectors as inputs is an LSTM. The idea is to train an LSTM model on tagged pairs of questions and then use the weights learnt by the hidden layers of the network to compute vector representations for our cluster questions. Similar to the other representations, the learnt LSTM representations can be used independently or in combination with other un-supervised representations.

We will use the Keras library in Python to train the LSTM model.

To train an LSTM model, the input needs to be a 3D tensor (x, y, z), where x is the number of training examples, y is the number of words or tokens in each example (called time-steps) and z is the vector size of the representation of each word or token.

So for a question like "What is the procedure to withdraw money from ATM ?", after removing non-alphabetic characters, we are left with 9 words and if we train a word2vec model with a vector size of 128, then our tensor for this example alone will be of size (1, 9, 128).
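
The word count can be checked with a few lines of Python. The regex tokenizer below is only a stand-in for the tokenizer we actually use, and `simple_tokens` is a hypothetical name.

```python
import re


def simple_tokens(sentence):
    """Drop non-alphabetic characters and split on whitespace (illustrative tokenizer)."""
    return re.sub(r"[^A-Za-z\s]", " ", sentence).split()


tokens = simple_tokens("What is the procedure to withdraw money from ATM ?")
w2v_dim = 128  # vector size assumed for the word2vec model
tensor_shape = (1, len(tokens), w2v_dim)
```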

But there are many examples, and each has a different number of words. We could either pad to the maximum number of words over all questions or to some fixed number. From our analysis, more than 98% of the questions have 30 words or fewer, so we decided on an input tensor of size (N, 30, 128).

Questions with more than 30 words are truncated, whereas questions with fewer than 30 words are padded with zero vectors.

import numpy as np

# mc is our own tokenizer helper module (not shown in this post).
def transform_sentence(sentence, word_vectors):
    tokens = mc.get_tokens(sentence)
    max_words, w2v_dim = 30, word_vectors.vector_size
    lst = []
    for token in tokens:
        if token not in word_vectors:
            # Out-of-vocabulary tokens are represented by a zero vector.
            lst.append(np.zeros(w2v_dim, dtype=np.float32))
        else:
            lst.append(word_vectors[token])
    # Truncate long questions to max_words tokens.
    if len(lst) > max_words:
        lst = lst[:max_words]
    # Pad short questions with zero vectors up to max_words tokens.
    lst += list(np.zeros((max_words - len(lst), w2v_dim), dtype=np.float32))
    return np.array(lst)

def get_data_pairs(question_pair_1, question_pair_2, word_vectors):
    print("Generating training data...")
    # Each array has shape (N, 30, w2v_dim).
    q_data_1 = np.array([transform_sentence(question, word_vectors) for question in question_pair_1])
    q_data_2 = np.array([transform_sentence(question, word_vectors) for question in question_pair_2])
    return q_data_1, q_data_2

Next up, we specify the neural network architecture that we will use for training.

LSTM architecture for feedback modeling.

Following is the Python code for training the above parallel LSTM model in Keras.

import keras
from keras.layers import Input, LSTM, Dense
from keras.models import Model
from sklearn.metrics import roc_auc_score

# cfg is our own configuration module holding paths and hyper-parameters.
def train_feedback_model(q_data_1, q_data_2, labels):
    print("Defining architecture...")
    # One input per question in the pair, each of shape (time-steps, vector size).
    q_input_1 = Input(shape=(q_data_1.shape[1], q_data_1.shape[2], ))
    q_input_2 = Input(shape=(q_data_2.shape[1], q_data_2.shape[2], ))
    q_lstm_1 = LSTM(128)(q_input_1)
    q_lstm_2 = LSTM(128)(q_input_2)
    merged_vector = keras.layers.concatenate([q_lstm_1, q_lstm_2], axis=-1)
    dense_layer = Dense(128, activation='relu')(merged_vector)
    predictions = Dense(1, activation='sigmoid')(dense_layer)
    model = Model(inputs=[q_input_1, q_input_2], outputs=predictions)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=['accuracy'])
    print("Training model...")
    model.fit([q_data_1, q_data_2], labels, epochs=cfg.LSTM_NUM_ITERS, batch_size=cfg.LSTM_BATCH_SIZE)
    # Note: this AUC is computed on the same pairs the model was trained on.
    predicted = model.predict([q_data_1, q_data_2])
    score = roc_auc_score(labels, predicted, average="weighted")
    print("Score = ", score)
    return model

'q_data_1' and 'q_data_2' correspond to the 3D tensors for the first and second question of each pair. We use the Functional API to define the architecture, as it is more convenient than a Sequential model for parallel inputs. We then evaluate the performance of the model on the tagged pairs of questions using the AUC score. On about 3800 pairs of questions, with 4 epochs and a batch size of 32, we obtain an AUC score of 0.89 (measured on the same pairs used for training).
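
For readers unfamiliar with AUC: it is the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair. The sketch below computes it by brute force over all such pairs. It is meant only to illustrate the definition (it is quadratic in the number of examples), and the name `pairwise_auc` is made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ranked correctly
```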

Quite obviously, an LSTM model trained on only 4K examples cannot be expected to reach the same accuracy or AUC score on unseen test examples.

To get around that, we augment our training data with the similar questions dataset available from Quora. The dataset has about 400K pairs of tagged questions, which is much better than having only 4K pairs. We trained the model for 25 epochs with a batch size of 64 and obtained an AUC score of 0.87 on the training data.
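
Since the scores quoted here are measured on training data, one way to estimate performance on unseen pairs is to hold some pairs out before training. The sketch below shows such a split; it is an illustration rather than something done in this post, and the function name, the 20% fraction and the seed are arbitrary assumptions.

```python
import random


def split_pairs(pairs, holdout_fraction=0.2, seed=42):
    """Shuffle tagged pairs and split them into (train, holdout) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Toy (question_1, question_2, is_duplicate) triples.
toy_pairs = [("q%d" % i, "p%d" % i, i % 2) for i in range(10)]
train_pairs, holdout_pairs = split_pairs(toy_pairs)
```

The model would then be fitted on `train_pairs` only and the AUC reported on `holdout_pairs`.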

import gensim
import pandas as pd

def train_feedbacks():
    labels = []
    print("Processing feedback file...")
    df_feedback = pd.read_csv(cfg.FEEDBACK_FILE_PATH)
    question_pair_1 = list(df_feedback['question1'])
    question_pair_2 = list(df_feedback['question2'])
    labels += list(df_feedback['is_duplicate'])
    print("Processing Quora similar questions dataset...")
    df_quora = pd.read_csv(cfg.QUORA_SIMILAR_QS_DATASET, sep='\t')
    question_pair_1 += list(df_quora['question1'])
    question_pair_2 += list(df_quora['question2'])
    labels += list(df_quora['is_duplicate'])
    # Missing questions are read as NaN, so force everything to str.
    question_pair_1 = [str(x) for x in question_pair_1]
    question_pair_2 = [str(x) for x in question_pair_2]
    w2v_model = gensim.models.Word2Vec.load(cfg.W2V_MODEL_FILE)
    q_data_1, q_data_2 = get_data_pairs(question_pair_1, question_pair_2, w2v_model)
    feedback_model = train_feedback_model(q_data_1, q_data_2, np.array(labels))
    print("Saving lstm model...")
    # to_json() serializes the architecture only; weights have to be saved separately.
    feedback_model_json = feedback_model.to_json()
    with open(cfg.LSTM_FEEDBACK_MODEL_JSON_PATH, "w") as json_file:
        json_file.write(feedback_model_json)

Save the LSTM model after the model is built.

Now we have a model which accepts a pair of questions and predicts a score close to 1 if the questions are similar and close to 0 if they are not. But remember that our goal is not to predict a similarity score right now, but to only get representations for questions, which can be used in clustering.

Note: Although it is possible to use the predicted scores from the model instead of cosine similarity in the final agglomerative clustering algorithm, the amount of "relevant" feedback data is small, so it is safer to use the model representations along with other un-supervised representations to compute cosine similarity.

How do we get the model representations? There are multiple layers in the above architecture. One could potentially obtain hidden layer representations from 4 different layers: the two LSTM layers, the concatenation layer and the 128-unit dense layer. We will go with the representations from that last hidden dense layer (of size 128).

Notice that the model takes two questions as input at a time, but we need representations for questions individually. One trick is to use the same question for both the input layers and then take the dense layer output. Since both LSTM branches then see the same question, the dense layer output can be thought of as a kind of averaging of the two LSTM layer outputs.

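from keras import backend as K

def get_activations(model, questions, word_vectors):
    print("Getting Hidden Layer Activations...")
    q_data = np.array([transform_sentence(question, word_vectors) for question in questions])
    # For the architecture above, layers[0] and layers[1] are the two inputs
    # and layers[5] is the 128-unit dense layer.
    activation_fn = K.function([model.layers[0].input, model.layers[1].input], [model.layers[5].output])
    # Feed the same question to both inputs.
    activations = activation_fn([q_data, q_data])[0]
    return activations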

The above function returns the hidden layer outputs (from the last hidden dense layer) as question representations. Refer to the Keras FAQ page on how to obtain intermediate layer outputs.

We can then use the LSTM feedback representations in the same way as we used the un-supervised representations in the previous post. We have found that it works best to concatenate the feedback representations with the syntactic vectors such as Skip Grams to obtain better similarity between question pairs.
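
A minimal sketch of the concatenation step follows. Normalizing each block to unit length before joining is an assumption made here so that neither representation dominates the cosine similarity; the function names and the tiny 2-dimensional toy vectors are illustrative only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def combine(feedback_vector, unsupervised_vector):
    """Concatenate the two representations after normalizing each one."""
    return l2_normalize(feedback_vector) + l2_normalize(unsupervised_vector)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: the feedback parts are orthogonal, the un-supervised parts point the same way.
question_a = combine([1.0, 0.0], [0.0, 2.0])
question_b = combine([0.0, 3.0], [0.0, 1.0])
combined_similarity = cosine(question_a, question_b)
```

In this toy case the un-supervised parts agree fully and the feedback parts not at all, so the combined similarity lands halfway at 0.5.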

