Natural language question answering systems, such as chatbots and AI conversational agents, are required to answer customer queries intelligently. Many companies employ manual resources to answer customer queries and complaints. Apart from the high cost of employing people, many customer queries are repetitive in nature, and the same intents are often asked in different tones.
To begin with, there are two different models for answering questions with an AI system. The first is the generative model, where the answer to a question is generated in real time by a model learnt from the training data. The second is the retrieval model, where the answer is retrieved from a database of questions and answers, based on similar matching questions.
A generative model requires a lot of training data to produce grammatically and semantically correct sentences. In many domains, such as finance or healthcare, the responses are critical, and a grammatical error can cause havoc. Thus, to this day, most AI bots are retrieval models.
Let's begin with a design overview. Our question-answering system comprises an offline and an online component. The offline component is essentially the retrieval-based model.
In the online system, given an expression or question from a user, an API returns the most similar matching questions and answers, along with their confidence scores, from the offline model.
We will assume the following about our offline model:
- The offline model is an unsupervised model, where the questions in our database are clustered around intents.
- The clustering algorithm produces intent clusters, where each cluster is represented by one of its questions; this question is the intent for that cluster and is located at the centre of the cluster in Euclidean space.
- The questions themselves are represented as 128-dimensional vectors in the Euclidean space.
- For each question, we can find the corresponding answer given by a manual agent. This is not trivial, as we will see later.
The corresponding online system:
- Given a question from a user, compute the vector representation of the incoming question.
- Compute the distances of the incoming question vector to all the intents. Intents are the questions located at the centre of each cluster in the offline model.
- Find all intents that are most similar to the incoming question (defined by a distance threshold or just the K nearest intents), along with their similarity scores.
- Once we obtain the most similar intent clusters, find the most similar questions by computing the distance between the incoming question and all questions inside those intent clusters.
- Once we obtain the most similar questions along with their similarity scores, find the corresponding answers.
- Return the question-answer pairs through an API, sorted by similarity score in decreasing order.
- The answer to the top-ranked question is displayed to the user, with alternative suggestions drawn from the next three most similar questions.
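The online flow above can be sketched in a few lines. This is a minimal illustration, not the production implementation: the function and data-structure names are assumptions, distances are plain Euclidean, and everything is held in in-memory numpy arrays.

```python
import numpy as np

def match_question(q_vec, intent_vecs, cluster_questions, cluster_answers, k=3):
    """Return (question, answer) pairs sorted by increasing distance
    (i.e. decreasing similarity) to the incoming question vector."""
    # distances from the incoming question to every intent (cluster head)
    d_intents = np.linalg.norm(intent_vecs - q_vec, axis=1)
    nearest_clusters = np.argsort(d_intents)[:k]     # K nearest intents

    candidates = []
    for c in nearest_clusters:
        for q_id, vec in cluster_questions[c]:       # (question id, vector)
            dist = np.linalg.norm(vec - q_vec)
            candidates.append((dist, q_id, cluster_answers[c][q_id]))

    # sort by distance and drop it, keeping (question, answer) pairs
    candidates.sort(key=lambda t: t[0])
    return [(q_id, answer) for _, q_id, answer in candidates]
```

In practice the inner per-cluster scan would be replaced by the indexing structures discussed later in this post.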
What looks like an easy implementation has several design and implementation challenges, which we discuss below:
Challenges with generating the offline intent clustering model:
- There is no good method to evaluate clustering quality in a completely unsupervised environment.
- Since the quality of the question vectors directly impacts the clustering quality, we decided to evaluate the vector-generation algorithms themselves, on a separate tagged dataset.
- We used Quora's tagged duplicate-questions dataset to generate unsupervised question representations, and used the same algorithm to cluster the questions.
- There are around 400K pairs of questions and 537K unique questions.
- We computed the fraction of these 400K pairs whose tags agree with our assigned cluster labels.
- This gave a baseline accuracy of around 65%.
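The agreement metric can be computed in a few lines; the function name and data layout here are illustrative, not the actual evaluation code:

```python
def pair_agreement(pairs, cluster_of):
    """Fraction of tagged pairs whose duplicate tag agrees with the cluster
    labels. `pairs` holds (q1, q2, is_duplicate) triples and `cluster_of`
    maps each question to its assigned cluster id."""
    agree = sum(
        1 for q1, q2, is_dup in pairs
        if (cluster_of[q1] == cluster_of[q2]) == bool(is_dup)
    )
    return agree / len(pairs)
```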
- Evaluating intent cluster quality based on answers.
- The assumption is that although two questions might be worded differently, they express the same intent if the answers given by the manual agent for the two questions are the same or very similar.
- For example, "Can I withdraw from my IRA account ?" and "Am I able to take out money from my IRA account ?"
- But answers for questions cannot be directly inferred from chat logs.
- For example, in response to a customer question, the agent may follow up with a clarifying question instead of giving an answer; the actual answer then comes much later in the chat.
- Infer answers based on clustering results:
- For each question in a cluster, extract all agent responses after the time the question was asked in the session.
- Compute agent response representations similar to question representations.
- Cluster these responses.
- The response cluster with the highest intra-cluster similarity is most likely to contain the answers for these questions.
- Thus it becomes a chicken-and-egg problem: to evaluate clustering we need answers, but to find answers we need clustering.
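The answer-inference step above can be sketched as follows, assuming response vectors are already computed; K-Means with a fixed cluster count and cosine similarity are stand-ins for whatever the production pipeline uses:

```python
import numpy as np
from sklearn.cluster import KMeans

def likely_answer_cluster(response_vecs, n_clusters=5, seed=0):
    """Cluster agent-response vectors and return the indices of the cluster
    with the highest average intra-cluster cosine similarity, on the
    assumption that this tight cluster holds the actual answers."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(response_vecs)
    best_label, best_sim = None, -np.inf
    for c in range(n_clusters):
        members = response_vecs[labels == c]
        if len(members) < 2:
            continue                          # skip singleton clusters
        normed = members / np.linalg.norm(members, axis=1, keepdims=True)
        sims = normed @ normed.T
        # average of the off-diagonal pairwise cosine similarities
        mean_sim = (sims.sum() - len(members)) / (len(members) * (len(members) - 1))
        if mean_sim > best_sim:
            best_label, best_sim = c, mean_sim
    return np.where(labels == best_label)[0]
```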
- Handling trade-offs between cluster quality, memory, and the time to generate the clusters.
- Centroid-based clustering (such as K-Means) is linear in time and fast, but not as accurate as hierarchical clustering.
- Hierarchical clustering evaluates every possible pair of questions. Thus for 10 million questions, it would require us to create a matrix with 10^14 entries, which would be of the order of 730 TB in memory.
- The run-time complexity of hierarchical clustering is O(N^2), where N is the number of questions to cluster.
- A scalable solution is to use multi-level clustering. Create a first level of approximately good intent clusters using the K-Means algorithm. Once every K-Means cluster contains at most M questions, such that a matrix of size M^2 fits in the available RAM, we run agglomerative hierarchical clustering on each K-Means intent cluster; otherwise we continue to create sub-clusters.
- For N questions, if we create N^0.5 K-Means clusters, then each cluster will contain an expected N^0.5 questions. Thus for each K-Means cluster, we need to create a matrix with an expected N entries. For N = 10 million, the maximum memory requirement would therefore be of the order of 80 MB.
- The average-case run-time complexity would be O(N^1.5).
- The above analysis assumes a uniform distribution of questions into clusters, but in reality 85-90% of the clusters contain a single element, while some clusters might contain as many as O(N) elements.
- If M is the maximum cluster size our system can handle, then to be on the safe side one should choose N-M+1 clusters. But with that many clusters, similar intents might fall into different clusters, and we can never recover them in the final level of clustering.
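A sketch of the multi-level scheme, under the assumptions above (roughly sqrt(N) first-level K-Means clusters, agglomerative clustering inside each small-enough partition); the thresholds and the leaf-cluster heuristic are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def multilevel_cluster(X, max_leaf=1000, seed=0):
    """Recursively split with K-Means until every partition holds at most
    `max_leaf` questions, then run agglomerative clustering inside each
    partition. Labels are tuples encoding the path through the levels."""
    if len(X) == 1:
        return np.zeros(1, dtype=object)
    if len(X) <= max_leaf:
        n = max(1, len(X) // 10)              # assumed leaf-cluster heuristic
        return AgglomerativeClustering(n_clusters=n).fit_predict(X)
    k = int(np.sqrt(len(X)))                  # ~sqrt(N) first-level clusters
    coarse = KMeans(n_clusters=k, n_init=3, random_state=seed).fit_predict(X)
    labels = np.empty(len(X), dtype=object)
    for c in range(k):
        idx = np.where(coarse == c)[0]
        if len(idx) == 0:
            continue
        sub = multilevel_cluster(X[idx], max_leaf=max_leaf, seed=seed)
        for i, s in zip(idx, sub):
            labels[i] = (c, s)                # (coarse cluster, sub-label)
    return labels
```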
- High Dimensionality of questions.
- Representing the questions as TF-IDF vectors creates a vocabulary of size 50K for 10 million questions, after removing stop-words and words occurring fewer than 5 times.
- A matrix of size 10 million by 50K would take up around 4 TB of RAM.
- We performed dimensionality reduction on this matrix using PCA, projecting into 128-dimensional vectors. The 128-dim PCA matrix requires around 10 GB of RAM.
- The TF-IDF matrix can be stored as a sparse matrix, because fewer than 0.1% of its entries are non-zero. Thus we can considerably reduce the memory overhead (4 GB compared to 4 TB).
- Other question representations, such as IDF-weighted word embeddings or hidden-layer outputs from autoencoders, do not require explicitly working with large, sparse TF-IDF matrices.
- K-Means clustering with full data.
- Assuming that we create O(N^0.5) cluster centroids for N questions, the internal scikit-learn routines compute the pairwise distances between questions and cluster centroids using numpy matrix multiplication, which is fast.
- But this would require creating a matrix of the order of O(N^1.5) entries. For N equal to 10 million, this comes to around 75 GB.
- For scalability reasons, we used the MiniBatch-KMeans algorithm, which works with batches of data instead of the full dataset.
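A minimal sketch of this choice with scikit-learn; the data here is random and the batch size is an assumption:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# MiniBatch-KMeans processes small random batches, so the full N x K
# question-to-centroid distance matrix is never materialised.
X = np.random.RandomState(0).rand(10000, 128).astype(np.float32)
mbk = MiniBatchKMeans(n_clusters=100, batch_size=1000, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
```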
- Computing PCA on full data.
- Similar to the K-Means case, PCA requires computing the product matrix A^T A, where A is the dense TF-IDF matrix.
- For 50K features, this would create a matrix of size 19 GB.
- For scalability reasons, we preferred the Truncated SVD algorithm, which works directly with sparse TF-IDF matrices and has a much smaller memory footprint.
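A sketch of the sparse pipeline: `TfidfVectorizer` emits a scipy sparse matrix, and `TruncatedSVD` consumes it directly, so the dense TF-IDF matrix is never built. The toy corpus and component count are illustrative:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

questions = ["can i withdraw from my ira account",
             "am i able to take out money from my ira account",
             "how do i reset my password",
             "password reset is not working for me"]

# sparse TF-IDF matrix -> low-dimensional dense question vectors
tfidf = TfidfVectorizer().fit_transform(questions)
vecs = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
```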
- Handling greetings, personal questions, advisory questions, and context dependent questions.
- Greeting questions are generally of the form "Hello", "How are you?", etc. We chose not to include such questions in clustering, as a user would not interact with a bot the way they would with a human being.
- Personal questions are questions that require access to personal data or PII data. For example, "can you tell me the balance in my account ?"
- Bots can answer personal questions provided they have been given access to the user database and PII data, which in our case they were not. So we refrain from answering personal questions.
- Advisory questions ask for specific advice, such as "Should I sell my shares today?". These are too risky to handle at this moment.
- Context-dependent (CD) questions carry context from somewhere earlier in the same chat session. For example, in "Can it be transferred to another account?", the word "it" resolves to something the user mentioned earlier in the session.
- The context can be resolved using techniques like coreference resolution.
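A rule-based triage of these question types might look like the sketch below; the keyword lists and regexes are invented for illustration and are not the production rules:

```python
import re

# Illustrative rules only: real systems would use far richer patterns,
# classifiers, and a proper coreference resolver for the CD case.
GREETINGS = {"hello", "hi", "hey", "how are you"}
PERSONAL_MARKERS = re.compile(r"\b(my account|my balance|my order)\b")
ADVISORY_MARKERS = re.compile(r"^\s*should i\b")
PRONOUNS = re.compile(r"\b(it|that|this|they)\b")

def question_type(text):
    """Classify a question as greeting / personal / advisory /
    context-dependent / answerable, using toy keyword rules."""
    t = text.strip().lower().rstrip("?!. ")
    if t in GREETINGS:
        return "greeting"
    if PERSONAL_MARKERS.search(t):
        return "personal"
    if ADVISORY_MARKERS.search(t):
        return "advisory"
    if PRONOUNS.search(t):
        return "context_dependent"   # needs coreference resolution upstream
    return "answerable"
```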
- Detecting questions.
- The user may not always ask a question; sometimes they make generic statements.
- For example, "I know when this would be closed" is not asking for any reply from the agent or a bot.
- We created a classifier for detecting questions using crowdsourced tagged data and some rule-based training, which achieves a good-enough precision of around 95%.
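A toy version of such a question detector, as a TF-IDF + logistic-regression pipeline over a handful of hand-labelled examples (the real model was trained on crowdsourced tags plus rule-based labels, not this tiny sample):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: 1 = question, 0 = statement
texts = ["can i withdraw from my ira", "how do i reset my password",
         "i know when this would be closed", "thanks that worked",
         "what is the transfer limit", "my order arrived yesterday"]
labels = [1, 1, 0, 0, 1, 0]

# word and bigram features feed a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
```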
- Scaling with incoming questions, increasing dataset size. (Not yet implemented)
- Keeping everything on disk is not an issue, as disks are cheap, but loading everything into memory becomes a problem as the dataset grows.
- Singleton clusters comprise 85-90% of the intent clusters and cover around 60% of the total number of questions.
- One strategy is to decide which singleton clusters to load into memory using either LRU (Least Recently Used) or LFU (Least Frequently Used) caching. Singleton intents that are accessed least often, or were last accessed 3-4 months ago (during offline clustering as well as online matching), can be kept on disk.
- With increasing dataset size, we will ultimately need to scale horizontally, computing the intent clusters by running the clustering on Spark or Hadoop clusters.
- Computation of question vector representations can be easily parallelised as each representation is independent.
- Clustering, however, requires all the data to be present on a single machine, and parallelising it requires specialised techniques.
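The LRU caching strategy for singleton intents mentioned above can be sketched with an `OrderedDict`; the class and the disk loader are illustrative stand-ins:

```python
from collections import OrderedDict

class IntentCache:
    """Tiny LRU cache: keep only the most recently used singleton intents
    in memory and fall back to disk for the rest."""
    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity
        self.load_from_disk = load_from_disk   # e.g. reads a vector from disk
        self.cache = OrderedDict()

    def get(self, intent_id):
        if intent_id in self.cache:
            self.cache.move_to_end(intent_id)  # mark as recently used
            return self.cache[intent_id]
        value = self.load_from_disk(intent_id)
        self.cache[intent_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return value
```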
Challenges with the online system (the intent classifier), mostly at run time:
- Retrieve the best matching questions and answers in less than a second (99th percentile latency).
- Store the offline intents (cluster heads) into a KD-Tree.
- Compute the vector representation of the incoming question, then look up the KD-Tree to find the most similar matching intents, based on either a distance threshold or just the K nearest intents.
- For about 1M intents, a KD-Tree lookup takes about 0.75 seconds on average per incoming question on a single machine.
- This can be easily parallelised: 100 machines store 10K intents each, with a KD-Tree built only on those 10K. A Map-Reduce job can map to the most similar intents on each machine, and the reduce step can then merge-sort and filter the final results.
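On a single machine, the intent lookup can be sketched with SciPy's `cKDTree`; the vectors here are random stand-ins for real intent representations:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
intents = rng.rand(10000, 128)        # stand-in intent (cluster-head) vectors
tree = cKDTree(intents)               # built once, offline

query = rng.rand(128)                 # incoming question vector
dists, idx = tree.query(query, k=5)   # 5 nearest intents and their distances
```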
- Another way is to create a hash table where each key is a word from the vocabulary of intents, and the value is the list of cluster indices whose intent contains that word.
- For any incoming question, extract all its words, look them up in the hash table, and take the union of the cluster indices.
- The assumption is that the best matching questions will have at least one word in common.
- For a vocabulary of 5,000 words, assuming a uniform distribution of words across clusters (which is not true, since a few words are very common and many are rare), a question with 10 words should require searching 10/5000 = 0.2% of the clusters.
- On a sample set of 90K clusters, the highest-ranked word (excluding stop-words) occurs in around 27K clusters.
- The 95% confidence interval of the number of clusters to be searched is around 20K, i.e. we need to search about 22% of the clusters (not 0.2%, as the uniform-distribution assumption suggests). Still, that is one-fifth of the time taken to search all clusters.
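The word-to-cluster hash table amounts to a simple inverted index; the whitespace tokenisation below is a simplification:

```python
from collections import defaultdict

def build_inverted_index(intent_texts):
    """Map each vocabulary word to the set of cluster indices whose
    intent text contains that word."""
    index = defaultdict(set)
    for cluster_id, text in enumerate(intent_texts):
        for word in text.lower().split():
            index[word].add(cluster_id)
    return index

def candidate_clusters(question, index):
    """Union of cluster indices over the words of the incoming question."""
    clusters = set()
    for word in question.lower().split():
        clusters |= index.get(word, set())
    return clusters
```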
- We cannot use tries, as they need either a prefix or a suffix to match, although a trie could be useful for finding an exact match.
- Keep a Bloom filter data structure over the questions. This helps resolve exact question matches very quickly and with far less memory than a KD-Tree or a trie.
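A minimal Bloom filter for exact-question matching might look like this; the bit-array size and hash count are illustrative, not tuned values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests may yield false positives
    but never false negatives."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive num_hashes independent bit positions from salted SHA-256
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```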
In the next posts of this series, we will look into some of the details of the implementations, such as how and which vectorization algorithms were chosen for representing questions, how a clustering pipeline was built, how unsupervised deep learning was used in the form of Variational Autoencoders, and so on.