Machine Learning, AI and Programming

Designing a Question-Question Similarity Framework - Part I

Natural language question answering systems such as chatbots and AI conversational agents are required to answer customer queries in an intelligent fashion. Many companies employ manual resources to answer customer queries and complaints. Apart from the high cost of employing people, many customer queries are repetitive in nature, and most of the time the same intents are asked in different tones.

The goal of this post is to identify and group similar intents, so that we can answer any closely matching question with the historical answers provided by manual agents. Any question not encountered earlier can be redirected to the manual agents. This way companies can save costs by employing fewer people while still answering customer queries with similar precision and at higher volume.

To begin with, there are two different models for answering questions with an AI system. The first is the generative model, where the answer to a question is generated in real time by a generative model learnt from the training data. The second is the retrieval model, where the answer is retrieved from a database of questions and answers, based on similar matching questions.

The generative model requires lots of training data to produce grammatically and semantically correct sentences. In many domains, such as finance or healthcare, the responses are critical, and a grammatical error can cause havoc. Thus, to date, most AI bots are retrieval models. In this post, we will focus on developing a framework for grouping similar questions. Later we will look at how to retrieve the best possible answer.

Some of the considerations that we will take into account while designing the system are:


  • Retrieve the best matching questions in less than a second (99th percentile latency) (online).
    • Have one of the questions in a cluster act as the cluster center.
    • Create a hash table with each word in the vocabulary of cluster centers as a key, and as its value the list of cluster indexes whose cluster-center question contains that word.
    • For any incoming question, extract all its words, look them up in the hash map, and take the union of the resulting cluster indexes.
    • The assumption is that the best matching questions will have at least one word in common.
    • For a vocabulary of 5,000 words, and assuming a uniform distribution of words across clusters (which is not true, since a few words are very common and many words are rare), a question with 10 words should require only 10/5000 = 0.2% of the clusters to be searched.
    • Actual distribution of words across clusters (top 1,000 words, excluding stop-words; figure omitted).

    • Out of 90K clusters, the highest ranked word (non stop-word) occurs in around 27K clusters.
    • The 95% confidence interval for the number of clusters to be searched is around 20K, i.e. we need to search about 22% of the clusters (and not 0.2% as under the uniform-distribution assumption). Still, this is about one-fifth of the time taken to search all clusters.
    • We cannot use tries, as they require either a prefix or a suffix to match.
  • Cluster about a million questions in less than an hour, so that multiple experiments with different hyper-parameters can be run within a designated time-frame (offline).
    • Will talk about this in the next post of this series.
  • Effective use of multi-core CPUs through concurrency and multi-threading.
    • Computation of question vectors, K-Means, and word2vec can be parallelized.
    • Sometimes parallel processes bring additional overhead due to duplication of data into each separate process, so we have to be careful.
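As a concrete sketch of the word-to-cluster hash table described above (the cluster-center questions and the whitespace tokenizer here are purely illustrative):

```python
from collections import defaultdict

def build_word_index(cluster_centers):
    # Key: word in a cluster-center question; value: set of cluster
    # indexes whose center contains that word.
    index = defaultdict(set)
    for cluster_id, center in enumerate(cluster_centers):
        for word in set(center.lower().split()):
            index[word].add(cluster_id)
    return index

def candidate_clusters(question, index):
    # Union of cluster indexes that share at least one word with the
    # incoming question; only these clusters need to be searched.
    candidates = set()
    for word in set(question.lower().split()):
        candidates |= index.get(word, set())
    return candidates

centers = ["what is the cost of iphonex",
           "where can i buy iphonex",
           "how do i reset my password"]
index = build_word_index(centers)
print(candidate_clusters("tell me the price of iphonex", index))  # {0, 1}
```

Note that the lookup falls back to searching nothing when no word matches, which is exactly the case that must be redirected to a manual agent.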


  • Serve 200 concurrent requests per second with a 99th percentile latency of one second (online).
    • Use a message queue.
    • A single message queue can get blocked by one request that eats up significant resources; all requests behind it will have to wait for it to finish.
    • Use multiple message queues, or better still, create a new queue whenever the waiting time for a question in a queue goes beyond 1 second. (Latency is different from waiting time.)
  • Work with 5-100 million questions within 32 GB of memory (offline). We will assume that disk space is not a constraint in our case.
    • The number of questions will increase 3-4 fold once bots replace some humans, because bots respond faster than humans.
  • At least 80% coverage.
    • Compute the minimum number of questions that covers at least 80% of the intents.
    • Sort the intent clusters in decreasing order of size and calculate the cumulative sum of the cluster sizes. Return the cumulative sum for the first 80% of the intent clusters.
    • "Can you tell me the price of iPhoneX ?" and "What is the cost of iPhoneX ?" are different questions but the same intent.
    • On a set of 100K queries, 80% of the intents cover up to 85.7% of the queries.
  • LRU (Least Recently Used) or LFU (Least Frequently Used) caching.
    • Discard questions and intents with the least probability of being asked again in the future.
    • Prevent the training data from growing indefinitely.
    • When the data is small we can afford to keep all intents, but in the worst case, when the data grows indefinitely, target the 80% coverage.
    • Compute the ratio of the number of intents to the number of questions. Initially the ratio will be low (< 10%) but will increase gradually until at some point it saturates (~90%), because there will always be some questions (~10%) that have not been seen before. Beyond this saturation point, we can start to discard questions (as determined by LFU).
    • LFU cache illustration (source: Medium; figure omitted).

  • High value intents should be retrieved from memory while low value intents can be retrieved from disk.
    • Retrieve most common intents fast by compromising on the speed of the least common intents.
    • Reduce the number of comparisons during prediction time.
    • Use the LFU cache to determine which questions should be kept in memory and which on disk.
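The coverage computation described above can be sketched as follows; the cluster sizes below are made-up toy numbers, not figures from the actual dataset:

```python
def coverage(cluster_sizes, intent_fraction=0.8):
    # Fraction of all queries covered by the largest `intent_fraction`
    # of the intent clusters (clusters sorted by decreasing size).
    sizes = sorted(cluster_sizes, reverse=True)
    k = int(len(sizes) * intent_fraction)
    return sum(sizes[:k]) / sum(sizes)

# Toy example: 5 intent clusters over 100 queries; the largest 80% of
# the clusters (4 of 5) cover 92% of the queries.
print(coverage([15, 40, 8, 25, 12]))  # 0.92
```

Because cluster sizes are heavily skewed, the covered fraction of queries is typically well above the fraction of clusters kept, which is what the 85.7% figure above reflects.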
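A minimal sketch of the LFU-style eviction described above, assuming we evict the least-frequently-asked question once capacity is exceeded (tie-breaking and persistence to disk are left out):

```python
from collections import Counter

class LFUStore:
    # Keeps at most `capacity` question -> answer pairs; when full,
    # discards the question with the lowest ask frequency.
    def __init__(self, capacity):
        self.capacity = capacity
        self.answers = {}
        self.freq = Counter()

    def get(self, question):
        if question in self.answers:
            self.freq[question] += 1
            return self.answers[question]
        return None  # unseen question: redirect to a manual agent

    def put(self, question, answer):
        if question not in self.answers and len(self.answers) >= self.capacity:
            victim, _ = min(self.freq.items(), key=lambda kv: kv[1])
            del self.answers[victim]
            del self.freq[victim]
        self.answers[question] = answer
        self.freq[question] += 1
```

In the memory-versus-disk split above, evicted entries would be written to disk rather than dropped, so low-value intents remain retrievable, only more slowly.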


  • We will mostly be working with unsupervised data, i.e. without any prior tagging of which pairs of questions are similar. Labeling data is expensive.
  • But at the same time, the system should be able to incorporate small incremental feedback from manual agents.
  • Question clusters are to be evaluated in the answer space, i.e. use the answers of the questions in a cluster to compute the mean intra-answer similarity. A high intra-answer similarity score indicates a good cluster and vice versa. A BIC score over the answers might fit well, as it also penalizes too many clusters.
  • Our goal is not only to incorporate the feedback into the existing clusters but also to learn which features are important from whatever small amount of tagging is available.
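The answer-space evaluation above can be sketched as a mean pairwise cosine similarity over the answer vectors of one cluster. How the answers are vectorized is left open here; plain lists of floats are assumed:

```python
from itertools import combinations
import math

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def intra_answer_similarity(answer_vectors):
    # Mean pairwise cosine similarity of the answers in one cluster;
    # higher means the questions were grouped around a coherent intent.
    pairs = list(combinations(answer_vectors, 2))
    if not pairs:
        return 1.0  # a single-answer cluster is trivially coherent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Averaging this score over all clusters gives one scalar to compare clustering runs with different hyper-parameters.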

Implementation (discussed in detail in next post):

  • Handle trade-off between improving the cluster quality vs. generating the clusters with a low run time.
  • Abstract away the question representation, i.e. support any kind of question vector representation. Common representations for questions include TF-IDF, PCA, skip-grams, word2vec, doc2vec, LDA topic models, etc.
  • The question representation should capture both syntactic as well as semantic meaning.
  • Each question representation should be able to work with its own similarity or distance metric. For example, TF-IDF may work well with cosine similarity, whereas PCA might work well with Euclidean distance.
  • Combine different representations with weights.
  • Apart from vectors, the clustering should be able to work with kernels, i.e. in settings where the vectors are very sparse and high dimensional, e.g. tree kernels or longest common subsequence.
  • Abstract away the clustering algorithm itself. Clustering is just grouping of similar questions. One can choose to use K-Means or Agglomerative or Spectral or simply group questions based on certain domain knowledge based keywords.
  • Run multiple layers of clustering as and when needed, with each layer of clustering working on individual clusters from the previous layer. Basically a clustering pipeline.
  • Prune clusters based on internal quality metrics and when intermediate cluster sizes are too high.
  • Handle statements, i.e. inputs that are not queries, such as "Ok I understand" or "I'm good". But also be able to handle statements that do not appear to be questions (in the strict grammatical sense) but require an answer, such as "give me the phone number".
  • Apart from FAQ type queries, handle greetings ("Hello", "How are you ?" etc.), contextual (context from same chat session), personal (requiring user profile data) or sensitive (confidential) questions appropriately.
  • Handle multiple questions in a single customer query.
    • For example, "What is the cost of iPhoneX ? Where can I buy it in Bangalore ?".
    • Determine the matching agent answer for each question in a multi-question query. Agent answers may not be in the same sequence as the questions asked by the customer.
    • For example, an answer could be "You can get iPhoneX at MG Road for Rs. 95K".
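One way to sketch the abstraction of pluggable representations, each with its own similarity metric, combined with weights. The `BagOfWords` class below is an illustrative stand-in for TF-IDF, word2vec, and the other representations mentioned above:

```python
import math

class Representation:
    # A question representation paired with its own similarity metric.
    def vectorize(self, question):
        raise NotImplementedError
    def similarity(self, u, v):
        raise NotImplementedError

class BagOfWords(Representation):
    # Binary bag-of-words with cosine similarity; other representations
    # (TF-IDF, PCA with Euclidean distance, etc.) plug in the same way.
    def __init__(self, vocabulary):
        self.vocab = {w: i for i, w in enumerate(vocabulary)}
    def vectorize(self, question):
        v = [0.0] * len(self.vocab)
        for w in question.lower().split():
            if w in self.vocab:
                v[self.vocab[w]] = 1.0
        return v
    def similarity(self, u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

def combined_similarity(q1, q2, reps_with_weights):
    # Weighted combination of several representations' similarities.
    total = sum(w for _, w in reps_with_weights)
    return sum(w * r.similarity(r.vectorize(q1), r.vectorize(q2))
               for r, w in reps_with_weights) / total
```

The clustering code then only depends on `combined_similarity`, so representations and weights can be swapped without touching the clustering algorithm itself.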

There could be a few other design constraints that come up in the near future, but for now these are good enough to build a scalable and robust QQ similarity model. Due to IP considerations, I will not be able to give a detailed approach for each of the above design features.

Note: The dataset used in designing the system and obtaining the numbers is from the finance and investment domain. Due to the confidentiality of the data, I have used generic example questions from the e-commerce domain in this post.

