ML System Design - Visual Search Engine

In this post we are going to discuss how to design a visual search engine similar to Bing and Google image search.

  • Problem Definition
    • Given a query image, retrieve a list of most relevant images (along with metadata) related or similar to the query image.
    • The search engine should be able to recognize objects within the query image and return results relevant to them. For example, given an image of a girl wearing a Nike hat, the search results should include images and URLs relevant to Nike hats (preferably from e-commerce sites) in addition to exact matches.
    • Bing visual search is a good example of these additional capabilities.
    • A good visual search engine should have the following properties:
      • Most relevant images should be shown before less relevant images.
      • The probability of a user clicking a relevant image should be higher than that of clicking a less relevant one.
      • If less relevant images receive more clicks than relevant ones, we need to tune our model to capture this feedback.
      • Images shown in results should load fast.
  • Core ML Model Definitions
    • Model Inputs
      • Query image Q.
    • Model Outputs
      • Given a query image predict tag probability distribution P(t_j=1|Q)
      • Given a query image and the tag distribution, predict score distribution for search result images S(i_k|Q, t_1, t_2, ...t_j)
    • Input Features
      • Image pixels array (0-255 pixel values) of size WxHx3, RGB image (3 channels).
      • Grayscale will not be sufficient here because color is an important feature, e.g. finding images similar to a green colored t-shirt.
      • Convert image array to fixed size 128x128x3 for scalability and ease of processing.
      • If the model requires image embeddings, convert the image array into embeddings either with autoencoders (un-supervised) or with pre-trained networks such as VGG16 or ResNet50, taking the output of the second-to-last (fully connected, 4096-dimensional) layer of the network.
      • Assumptions and challenges - The input image can be from any domain. When using pre-trained networks, the assumption is that the network has been trained on images from similar domains.
    • Supervised or Unsupervised.
      • Un-supervised approaches for predicting tags for query image:
      • Supervised approaches for extracting tags for query image:
        • Train an image classifier over tags that takes an image as input and returns a probability distribution over possible tags P(t_j=1|Q).
      • Un-supervised approaches for image similarity: e.g. nearest-neighbour search over un-supervised embeddings (such as autoencoder representations), which is also what we fall back on for the cold-start case later.
      • Supervised approaches for image similarity:
        • Train on pairs of images labelled 1 if they are similar (positive pairs) and 0 if they are dissimilar (negative pairs).
        • Train the pairs using ranking, contrastive or triplet loss functions.
        • There are several other approaches for Learning To Rank which we will explore later.
        • Challenge - A query image can be most similar to several different sets of images depending on the intention of the user.
        • For example, if the query image contains a kid wearing a red sweater playing with a dog, the user could be:
          • looking for the breed of the dog,
          • looking for similar pictures of the same breed of dog,
          • looking for where they can buy a similar red sweater for their kid,
          • or just looking for similar scenes.
[Figure: Triplet Loss]

    • Loss Function to optimize
      • Tag prediction network - cross-entropy loss over the predicted tag probabilities.
      • Log loss = -{\sum}_{i=1}^N{\sum}_{t=1}^C y_{it}*\text{log}(\bar{y}_{it})
      • (For the multi-label, sigmoid-per-tag architecture described later, each tag additionally contributes a (1-y_{it})*\text{log}(1-\bar{y}_{it}) term, i.e. binary cross-entropy per tag rather than a softmax categorical cross-entropy.)
      • Image similarity network - a ranking loss such as contrastive loss or triplet loss.
      • Contrastive loss: for a pair of images A and B
      • L=y*D(E_A, E_B)^2+(1-y)*\text{max}(0, m-D(E_A, E_B))^2
      • where E_A and E_B are the embeddings corresponding to images A and B.
      • D(E_A, E_B)=||E_A-E_B||
      • is the Euclidean distance between the embeddings.
      • y = 1 for positive pairs and 0 for negative pairs. 'm' is a margin constant (generally around 1.0).
      • Triplet loss: For a triplet of images X, A and B
      • L=\text{max}(0, m+D(E_A, E_X)-D(E_B, E_X))
      • where X is an anchor image and (A,X) is a positive pair i.e. y=1 and (B,X) is a negative pair i.e. y=0
      • The idea is that the loss penalizes negative pairs (y=0) that are close together in the embedding space and positive pairs (y=1) that are far apart.
      • The penalty is 0 when negative pairs are far apart in the embedding space and positive pairs are close together. (A minimal numpy sketch of both losses follows this list.)
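
As a quick sanity check, here is a minimal numpy sketch of the two losses above; the margin, embedding size and batch size are arbitrary illustrative choices.

    import numpy as np

    def euclidean(a, b):
        # D(E_A, E_B) = ||E_A - E_B||, the Euclidean distance between embeddings.
        return np.linalg.norm(a - b, axis=-1)

    def contrastive_loss(e_a, e_b, y, margin=1.0):
        # y = 1 for a positive (similar) pair, y = 0 for a negative pair.
        d = euclidean(e_a, e_b)
        return np.mean(y * d ** 2 + (1 - y) * np.maximum(0.0, margin - d) ** 2)

    def triplet_loss(e_x, e_a, e_b, margin=1.0):
        # X = anchor, A = positive, B = negative.
        return np.mean(np.maximum(0.0, margin + euclidean(e_a, e_x) - euclidean(e_b, e_x)))

    rng = np.random.default_rng(0)
    e_x = rng.normal(size=(4, 128))                # anchor embeddings
    e_a = e_x + 0.1 * rng.normal(size=(4, 128))    # positives: close to the anchors
    e_b = rng.normal(size=(4, 128))                # negatives: unrelated embeddings
    print(contrastive_loss(e_x, e_a, np.ones(4)))  # small, since the positives are close
    print(triplet_loss(e_x, e_a, e_b))             # small when negatives are farther than the margin
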
    • Offline Metrics - Precision, Recall, Accuracy etc.
      • NDCG is useful for ranking problems in search.
      • The discounted cumulative gain (DCG) of the top j results is:
      • s_j=\sum_{i=1}^j\frac{2^{y_i}-1}{\text{log}_2(i+1)}
      • where y_i=1 if the result at rank i is actually relevant, else y_i=0.
      • To normalize the scores over ranking lists of different lengths, we divide by the ideal ranking score for a list of the same length.
      • Ideal ranking score (all top-j results relevant):
      • p_j=\sum_{i=1}^j\frac{1}{\text{log}_2(i+1)} and thus NDCG is q_j=\frac{s_j}{p_j}
      • Ranking metrics
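
A minimal sketch of NDCG@k with binary relevance, following the formulas above (the relevance list and k are illustrative):

    import numpy as np

    def dcg_at_k(relevance, k):
        # DCG@k with binary 0/1 relevance: sum of (2^y_i - 1) / log2(i + 1).
        rel = np.asarray(relevance, dtype=float)[:k]
        ranks = np.arange(1, len(rel) + 1)
        return float(np.sum((2 ** rel - 1) / np.log2(ranks + 1)))

    def ndcg_at_k(relevance, k):
        # Normalize by the ideal DCG, i.e. all relevant results ranked first.
        ideal = dcg_at_k(sorted(relevance, reverse=True), k)
        return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

    # Relevant results at ranks 1, 3 and 5 out of 5 shown.
    print(ndcg_at_k([1, 0, 1, 0, 1], k=5))
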
  • Data Pipeline
    • Data Sources
      • Freely available image datasets
        • COCO and Open Images datasets for object detection.
        • ImageNet and CIFAR datasets come with pre-defined image categories.
        • Challenge - The categories in these datasets are very limited.
      • Leverage Google Search
        • Based on what kind of results you are going to show, use Google apis to retrieve related search queries.
        • For example, if one of my requirements is to show restaurant images around London, I would search Google for "restaurants in London" and then retrieve the top related search queries.
        • Question - How will you decide which queries are important to you without any historical data ?
        • Using the top queries, go 1-2 levels deeper and collect more related search queries.
        • Use the search queries to do Google image search.
        • Question - What if the search queries at depth 2 differ in intent from the original search queries? For example, the original query is "best restaurants in London" while a related query at depth 2 is "budget restaurants in London".
        • In Google image search, the assumption is that the first N results are relevant to the query and thus similar to each other.
        • We can create positive examples from pairs within the first N search results, and negative examples by pairing one image from the first N with another from beyond the first M search results.
        • Use heuristics to choose N and M, e.g. N=50 and M=500.
        • A better method: use multiple values of N ranging from 5 to 500, collect the data, and manually label a random sample of the collected images as relevant or non-relevant.
        • For each N, compute the mean average precision, MAP@N. Choose the N for which MAP@N >= 95% and MAP@(N+1) < 95%.
        • Challenge - The value of N will differ significantly by query. For generic queries such as "cats" or "dogs" N will be quite high, but for very specific queries such as "dogs with brown colored fur" N will be low. How do we handle this?
        • Use the search queries and the related search queries as tags for the images.
        • Example tags : "coffee shops in Berlin", "places to stay in Belgium", "metro stations in London", "restaurants in London", "domino's pizza" etc.
        • Question - How to handle semantically similar tags such as "places to stay in Belgium" and "hotels in Belgium" ?
        • Question - How to handle so many tags ?
        • To reduce the number of possible tags and coalesce similar tags, we can run K-Means clustering on tag embeddings (e.g. averaged GloVe or word2vec vectors) and select one representative tag from each cluster (a sketch follows this list).
        • Question - How to choose value (or range of values) of K in the above clustering approach ?
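
A minimal sketch of the tag-coalescing step. It assumes a pre-loaded word-vector lookup (e.g. GloVe) in a dict called word_vectors; the cluster count k and the 300-dimensional vectors are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def tag_embedding(tag, word_vectors, dim=300):
        # Average the word vectors of the tag's tokens; zero vector if all words are unknown.
        vecs = [word_vectors[w] for w in tag.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def coalesce_tags(tags, word_vectors, k=1000, dim=300):
        # k must be <= number of distinct tags; returns a mapping tag -> representative tag.
        X = np.stack([tag_embedding(t, word_vectors, dim) for t in tags])
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        best = {}  # cluster label -> (distance to centroid, representative tag)
        for tag, vec, label in zip(tags, X, km.labels_):
            dist = np.linalg.norm(vec - km.cluster_centers_[label])
            if label not in best or dist < best[label][0]:
                best[label] = (dist, tag)
        return {tag: best[label][1] for tag, label in zip(tags, km.labels_)}
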
      • Crawl webpages
        • Assuming that this is a meta search engine, we can crawl partner websites and download images and metadata.
        • For example, if Amazon is a partner website, then based on the links in Amazon's site index, crawl the pages and download images along with metadata (such as product URL, product category, title, description, etc.).
        • Create tags using n-grams extracted from the product category, title, description etc.
        • Question - How to handle cases where the description of an item contains n-grams not relevant to that particular item, e.g. "This t-shirt is also available in red and green color"?
        • For more generic cases, we can crawl Wikipedia pages based on set of search queries (can use the related search queries from Google search).
        • Question - How to decide which Wikipedia pages to crawl ?
        • Download images and text from Wikipedia pages.
        • Based on the tags, create clusters of pages. Use manual tagging to review each cluster and associated images inside each cluster to create positive labels.
        • Negative labels could be images from the same cluster that are not relevant to the cluster.
        • Given a set of images from a cluster, the reviewer has to say whether each image belongs to the cluster or not. The assumption is that the majority of images in a cluster are similar or relevant to each other.
        • Question - Which clustering algorithm would you use here and why ?
      • User query and clicks data
        • Log the queries by users with session ids and timestamps.
        • Log the displayed search results and the clicked search results.
    • Store raw data in databases
      • Let's say that we have images corresponding to 1 million queries (spanning multiple domains) initially.
      • For each query let's assume that we crawl and download 50 images from Google search results. Thus number of images from Google search results is around 50 million.
      • Assuming that we are able to crawl websites such as Amazon, Wikipedia, Facebook, Instagram, Airbnb, Yelp, Zomato, Tripadvisor etc. From each of these sites we can download images along with metadata.
      • Assuming that our crawler is able to download 25 images per second in parallel threads, running the crawler for 30 days will download approximately 30*86400*25 = 64.8 million images.
      • Thus to start with we will have around 120 million images. Assuming each image is approximately 2 MB in size (we can compress them before saving), the original images alone come to about 120 million * 2 MB ≈ 240 TB.
      • Question - How to handle duplicate images in the database ?
      • Now for each image we will also keep 4 scaled-down versions of sizes 32x32, 64x64, 128x128 and 256x256.
      • Assuming their sizes are respectively 2/32, 2/16, 2/8 and 2/4 MB, the 4 variants add about 0.94 MB per image, i.e. roughly another 113 TB, bringing the total image data to around 350 TB.
      • Coming to the tags for each image, assume that each image has on average 5 different tags, i.e. 600 million tags. With an average tag length of 30 characters, the total size is around 600*30 million bytes ≈ 18 GB.
      • So the image data runs into hundreds of TB while the tag data is only tens of GB.
      • Images at this scale clearly cannot be stored on a single server.
      • Moreover, this is just the first month of data; if the data volume grows by at least 50% every month, by the end of the first year it will be roughly two orders of magnitude larger.
      • We have also not counted the memory required for caching frequently served images.
      • Better store raw images on a CDN server, which handles storage and caching.
      • Store associated metadata such as image id, tags, source url, CDN image urls on MongoDB server. Total size of metadata would come to around 1TB. The size of the index on the image id field would be around 60GB.
      • Assuming we are working with 16GB machines, then we would need around 4 machines to begin with.
      • {'image_id' : <image_id>, 
         'source_url' : <source_url>, 
         'cdn_image_url_0' : <cdn_image_url_0>, 
         'cdn_image_url_32' : <cdn_image_url_32>,  
         'cdn_image_url_64' : <cdn_image_url_64>, 
         'cdn_image_url_128' : <cdn_image_url_128>, 
         'cdn_image_url_256' : <cdn_image_url_256>, 
         'tags' : <list_of_tags>}
      • Let's assume that there are 100 requests per second and thus logging each request with {user_id, session_id, timestamp, query, clicked_image_id} of 160 bytes (assuming 32 bytes for each field), total log size would be around 1.3 GB per day or 39 GB in a month.
      • Store the logs in Cassandra on separate log servers; Cassandra has good write performance.
    • Store derived data
      • Image and text embeddings - in MongoDB
      • {'image_id': <image_id>, 
         'embedding': <embedding>}
      • If we use LSH to index the embeddings, we also need to store the {image_id -> bucket_id} and {bucket_id -> image_ids} mappings, serialized into flat files that are de-serialized into memory at serving time.
      • {'image_id': <image_id>, 
         'bucket_id': <bucket_id>}
        {'bucket_id': <bucket_id>, 
         'image_ids': <list_of_image_ids_in bucket>}
    • Data preprocessing
      • Convert non-JPG formats such as PNG and GIF into JPG for consistency; JPG files are also typically smaller than PNG files.
      • Convert each image into a numpy image array of size 128x128x3.
      • Text metadata crawled from URLs might contain HTML tags which need to be cleaned.
      • Remove special characters and non-UTF-8 characters from the text.
      • Stemming, lemmatization and stop-word removal are some of the other text pre-processing steps (a minimal sketch of the image and text cleaning steps follows this list).
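
A minimal sketch of the image and text cleaning described above, using Pillow and plain regular expressions; the 128x128 target size comes from the pipeline above, while the exact cleaning rules are illustrative assumptions:

    import re
    import numpy as np
    from PIL import Image

    def preprocess_image(path, size=(128, 128)):
        # Load any supported format (PNG, GIF, JPG), force RGB, resize and scale to [0, 1].
        img = Image.open(path).convert("RGB").resize(size)
        return np.asarray(img, dtype=np.float32) / 255.0   # shape (128, 128, 3)

    def clean_text(raw):
        # Strip HTML tags, undecodable bytes and special characters; collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", raw)
        text = text.encode("utf-8", "ignore").decode("utf-8")
        text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip().lower()
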
    • Feature selection
      • Convert numpy image arrays of size 128x128x3 into un-supervised embeddings by passing through autoencoders. CNN based autoencoders might be useful in such cases.
      • Use 4096 dimensional autoencoder representations for each image.
      • Use pre-trained VGG16 to compute embeddings from the fully connected layer output of dimensions 4096.
      • Final image embeddings is concatenation of autoencoder and VGG16 representation.
      • For text data, we can use feature representations such as TF-IDF (optionally reduced with PCA), GloVe or word2vec.
      • Since most tags are short, the local context around each word is limited, so instead of word2vec we can use pre-trained GloVe representations, which are built from global co-occurrence statistics. (A VGG16 embedding-extraction sketch follows the figure below.)
[Figure: CNN Autoencoder]

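
A minimal sketch of extracting the 4096-dimensional VGG16 embeddings mentioned above with tf.keras. The pre-trained fully connected head expects 224x224 inputs, so the 128x128 arrays are resized first; that resizing step is an assumption on top of the pipeline described here.

    import tensorflow as tf
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

    base = VGG16(weights="imagenet", include_top=True)
    # 'fc2' is the 4096-dimensional fully connected layer just before the softmax head.
    embedder = tf.keras.Model(inputs=base.input, outputs=base.get_layer("fc2").output)

    def vgg16_embeddings(images_0_255):
        # images_0_255: numpy array of shape (n, 128, 128, 3) with 0-255 pixel values.
        x = tf.image.resize(images_0_255, (224, 224)).numpy()
        x = preprocess_input(x)                 # VGG16-specific channel preprocessing
        return embedder.predict(x, verbose=0)   # shape (n, 4096)
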
    • Dimensionality Reduction
      • Pass the 8192-dimensional concatenated image embedding through a fully connected dense layer of dimension 4096.
      • Final image embedding dimensions after dimensionality reduction is 4096.
      • Similarly for text data, we can use Glove with 300 dimensions.
      • Question - How to choose embedding dimensions for image and text ?
    • Input Feature transformation (whitening/standardisation)
      • Image arrays of size 128x128x3 contains values in the range 0-255.
      • It is useful to bring these values into the range 0-1 by dividing the array by 255. Inputs on a small, consistent scale help the network learn faster and more stably, since very large input values can saturate activations and lead to vanishing or unstable gradients.
      • Question - Why not subtract the mean and divide by the standard deviation for image normalization? Would it be more efficient during training?
  • Offline and Online Model training
    • Training data preparation (image-tag classifier)
      • The classifier would be a multi-label, multi-class classifier because each image can have multiple valid tags.
      • Input labels would be binary encoded (0, 1) values for each possible tag (multi-label binarization; see the sketch after this list).
      • Assuming a batch size of 128 for input images of size 128x128x3, the total input size of a batch would be around 48 MB (assuming 64-bit floating point values). But the main memory requirement comes from the VGG16 weights and, especially, the per-batch activations, which with a batch size of 128 would be around 14 GB.
      • Assuming we are training on a single GPU with 16 GB GPU memory, 128 batch size would just fit.
      • We will cache the pairs of image array and binary encoded labels for each instance so that we do not need to compute the image array as well as the encoded labels in each epoch.
      • Shuffle the entire training data after every epoch.
      • Question - What if the pre-trained network does not fit in the GPU memory ?
      • Question - How large a batch size should we use for training ?
      • Question - What if the size of the input data does not fit in memory where will you store it ?
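
A minimal sketch of the multi-label binarization step using scikit-learn; the example tags are illustrative:

    from sklearn.preprocessing import MultiLabelBinarizer

    # Each training image comes with a list of tags; the binarizer turns these lists
    # into fixed-length 0/1 target vectors with one column per tag in the vocabulary.
    image_tags = [
        ["nike hat", "e-commerce"],
        ["restaurants in london"],
        ["nike hat", "red sweater"],
    ]
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(image_tags)   # shape (3, number of unique tags)
    print(mlb.classes_)                 # column order of the tags
    print(Y)
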
    • Training data preparation (image-similarity network)
      • For each image X, find an image for positive pair A and another image for negative pair B.
      • As before we can train this data on a single GPU machine with 16 GB memory.
      • We will cache the triplet of image arrays and encoded labels for each instance so that we do not need to compute the image array as well as the encoded labels in each epoch.
      • To make the model robust and prevent it from only learning obvious facts, such as that an image of a cat and an image of furniture are different, we need to choose triplets in which the pairs are hard to classify as positive or negative.
      • Given an image X, compute the distances (using embeddings) of the positive images from X. Choose the positive image A that is farthest from X; this gives a hard positive pair.
      • Given an image X, compute the distances (using embeddings) of the negative images from X. Choose the negative image B that is closest to X; this gives a hard negative pair. (A mining sketch follows this list.)
      • Question - How to decide the threshold for distance/similarity in the above two cases ?
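
A minimal numpy sketch of the hard positive / hard negative mining described above. It assumes precomputed embeddings and a single integer label per image as a stand-in for the positive/negative pairing rule (in the real pipeline, positives and negatives come from search-result co-occurrence rather than a single label):

    import numpy as np

    def mine_triplets(embeddings, labels):
        # embeddings: (n, d) float array, labels: (n,) int array.
        triplets = []
        for i in range(len(labels)):
            dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
            same = (labels == labels[i]) & (np.arange(len(labels)) != i)
            diff = labels != labels[i]
            if not same.any() or not diff.any():
                continue
            hard_pos = np.where(same)[0][np.argmax(dists[same])]   # farthest positive
            hard_neg = np.where(diff)[0][np.argmin(dists[diff])]   # closest negative
            triplets.append((i, int(hard_pos), int(hard_neg)))
        return triplets
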
    • Model architecture (image-tag classifier)
      • The CNN will have as many nodes in the output layer as there are possible tags, which could be anywhere in the range 100-10000.
      • Each output node will be trained using logistic loss on its sigmoid activation output.
      • Thus each node will emit a value between 0 and 1. Note that this is not a softmax: with softmax the probabilities across all nodes sum to 1, but here they need not.
      • To train the classifier we can use a pre-trained VGG16 network, freeze all layers except the last fully connected layer, and add our output layer on top of it.
      • For the optimizer we will use Adam with a constant learning rate of 0.001. (A sketch of this architecture follows the figure below.)
[Figure: Tag Classification Network]

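
A minimal tf.keras sketch of the tag classification network described above. It uses VGG16 without its ImageNet head and adds a new fully connected layer plus a sigmoid output per tag; the tag count, dense layer size and dropout rate are illustrative choices, and freezing only the convolutional base is a slight simplification of the freezing scheme described here.

    import tensorflow as tf
    from tensorflow.keras import layers
    from tensorflow.keras.applications import VGG16

    NUM_TAGS = 1000   # illustrative; the real tag vocabulary could be 100-10000

    base = VGG16(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
    base.trainable = False                      # freeze the convolutional layers

    inputs = layers.Input(shape=(128, 128, 3))
    x = base(inputs, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(4096, activation="relu")(x)    # trainable fully connected layer
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(NUM_TAGS, activation="sigmoid")(x)   # one sigmoid per tag

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy",       # logistic loss per tag, not softmax
                  metrics=["binary_accuracy"])
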
    • Model architecture (image-similarity network)
      • The network would be a Siamese CNN network where we will have 3 inputs but all inputs will share the same CNN network weights.
      • We can again use the pre-trained VGG16 network here as the shared network but this time we will freeze the first 13 layers only and re-train the last 3 layers and the fully connected layer.
      • We re-train the last few layers because the initial CNN layers learn fundamental image features that stay the same across images, whereas VGG16 was trained for classification and here the Siamese network must learn whether two images are similar or different. For example, a green t-shirt and a red t-shirt would get the same class label in the original VGG16 network, but if we are training pairs for color similarity these two images are different, and we want the fine-tuned weights to learn this difference.
      • Since we train with the triplet loss (as defined above), the network's output is the embedding itself and the loss is computed directly on embedding distances; a two-node softmax output layer would instead correspond to a pair-classification formulation.
      • For the optimizer we will use Adam with a constant learning rate of 0.001. (A sketch of this network follows the figure below.)
[Figure: Similarity Network with Triplet Loss]

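
A minimal tf.keras sketch of the Siamese triplet network: three inputs share one VGG16-based embedding network, and the triplet loss from earlier is applied to the concatenated embeddings. The embedding size, margin and the exact set of unfrozen layers are illustrative assumptions approximating the freezing scheme described above.

    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import VGG16

    EMB_DIM, MARGIN = 256, 1.0    # illustrative choices

    def build_embedding_net():
        base = VGG16(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
        for layer in base.layers[:-4]:
            layer.trainable = False            # re-train only the last few layers
        x = layers.GlobalAveragePooling2D()(base.output)
        x = layers.Dense(EMB_DIM)(x)
        x = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
        return Model(base.input, x, name="shared_embedding_net")

    emb_net = build_embedding_net()
    anchor = layers.Input((128, 128, 3), name="anchor")
    positive = layers.Input((128, 128, 3), name="positive")
    negative = layers.Input((128, 128, 3), name="negative")
    stacked = layers.Concatenate(axis=1)(
        [emb_net(anchor), emb_net(positive), emb_net(negative)])
    siamese = Model([anchor, positive, negative], stacked)

    def triplet_loss(_, y_pred):
        # Split the concatenated embeddings back into anchor, positive, negative.
        e_x, e_a, e_b = (y_pred[:, :EMB_DIM],
                         y_pred[:, EMB_DIM:2 * EMB_DIM],
                         y_pred[:, 2 * EMB_DIM:])
        d_pos = tf.norm(e_x - e_a, axis=1)
        d_neg = tf.norm(e_x - e_b, axis=1)
        return tf.reduce_mean(tf.maximum(0.0, MARGIN + d_pos - d_neg))

    siamese.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=triplet_loss)
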
    • Cold-start problem
      • Use un-supervised image embeddings (from autoencoders) to find and show similar images initially.
      • Collect click logs of users to identify which results are relevant and which are not.
      • For example, if a user is shown 100 search results and clicks on the 10th result, we can treat the first 9 results as irrelevant and the 10th as relevant.
    • Training steps
      • Split the images data into training and testing by 80:20 split.
      • On the 80% training data create training batches of sizes 128 for the image-tag classifier and 64 for the Siamese network.
      • Also split the 80% training data into further 80% for training and 20% for validation. Thus the actual training data is 64% of the full data.
      • Since the number of training examples 0.8*120 million = 96 million is quite high we can train the models for 40-50 epochs and checkpoint the best model based on the loss on the 20% validation data.
      • To prevent overfitting of the model we will use the following strategies:
        • Adding dropout rate of 0.1 after the fully connected layers.
        • L2-regularization on the fully connected layer weights.
        • Data augmentation. Transform images in a batch by randomly adding rotation, stretch, brightness etc.
      • Hyper-parameter tuning - The parameters that can be tuned in the above networks are
        • learning rate of Adam optimizer (0.0001 to 0.01)
        • dropout rate (0.1 to 0.5)
        • batch sizes (32, 64, 128, 256)
        • number of epochs (30 to 100)
        • L2 regularization (0.00005 to 0.01)
        • image embedding size (64, 128, 256, ..., 4096)
      • The best parameters for the network can be chosen by running K-fold cross-validation (with K=5) and randomized search over the set of hyper-parameters above (a sketch follows this list).
      • Instead of splitting the data into 80-20 one time, we split the data into 5 equal parts. Then based on a selected set of hyper-parameters we run the network 5 times where each time we select 4 parts for training and the remaining part for validation. Then average the precision-recall score for these 5 runs.
      • The best set of hyper-parameters is the one which maximizes precision-recall.
      • We can also select to minimize the average validation loss instead of precision-recall. It depends on the business use case.
      • After selecting the best hyper-parameters, use the same hyper-parameters to train model on the entire training data and use the hold out testing data to report precision-recall numbers.
      • Save the best network architecture and the weights for inferencing.
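
A minimal sketch of randomized hyper-parameter search with 5-fold cross-validation. The search space mirrors the ranges listed above; train_fn is a hypothetical callback that trains a model with the given parameters and returns a validation score (e.g. mean precision-recall):

    import random
    import numpy as np
    from sklearn.model_selection import KFold

    SEARCH_SPACE = {
        "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
        "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
        "batch_size": [32, 64, 128, 256],
        "l2": [5e-5, 1e-4, 1e-3, 1e-2],
        "embedding_dim": [64, 128, 256, 512, 1024, 2048, 4096],
    }

    def random_search(train_fn, X, y, n_trials=20, k=5, seed=0):
        rng = random.Random(seed)
        best_params, best_score = None, -np.inf
        for _ in range(n_trials):
            params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            scores = []
            for tr_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
                scores.append(train_fn(params, X[tr_idx], y[tr_idx], X[val_idx], y[val_idx]))
            if np.mean(scores) > best_score:
                best_params, best_score = params, float(np.mean(scores))
        return best_params, best_score
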
    • Distributed or out-of-core training:
      • Once the amount of data grows, say beyond a billion images, it becomes difficult to train on a single GPU; even if a batch of images fits on a single GPU, training will take a very long time.
      • For a single run with a fixed set of hyper-parameters and a batch size of 128, it takes about 7.8 million batches to cover 1 billion images.
      • Assuming each batch takes 0.5 milliseconds to train, each epoch will take around 1 hour, so with 50 epochs a single run takes about 50 hours.
      • This is just one run; for hyper-parameter tuning there could be 100-1000 runs, so finding the best network would take on average 500*50 = 25000 hours, which is about 3 years.
      • One solution is to sample training data for each run assuming that each sample well represents the entire data. We can use stratified sampling where we sample instances per class based on the true class distributions.
      • Thus for each run we can sample 1 million data points for training which will lead to training the best network in 27 hours only.
      • But we cannot always ensure that the samples well represent the entire data distribution.
      • Another approach is to use distributed training, i.e. use multiple GPUs.
      • With multiple GPUs we can also increase the batch size to 128*K, where K is the number of GPUs. Since a larger effective batch size (with an appropriately tuned learning rate) can speed up convergence, there is a two-fold advantage.
      • First the total time is reduced by a factor K. Thus when K=50 GPUs, it will take 21 days to train on the entire billion images instead of 3 years.
      • Also the number of epochs will reduce because the network will converge faster. Thus instead of 50 epochs, say 20 epochs is enough to train the data. Then the time taken will further reduce to around 8 days.
      • Distributed training on multiple GPUs is also an alternative when the pre-trained network is huge and does not fit in a single GPU.
    • Class Imbalance
      • The number of positive image pairs is much less than the number of negative image pairs.
      • One possible solution is to generate more positive pairs by data augmentation.
      • Not all negative pairs should be used for training because the network should be able to identify negative pairs which are hard to learn. For this reason we only sample negative pairs in which the distance between the pair of images in a negative pair is below a certain threshold.
  • LSH - Locality Sensitive Hashing
    • Assuming we have N images in our database, comparing a query image with every image to find the most similar ones has time complexity of O(N).
    • Assuming that each comparison takes 1 ms, total time taken for comparing with N=120 million images is 33 hours.
    • Although storing 120 million image embeddings of 4096 dimensions is not an issue, loading them in memory for comparison can cause out-of-memory errors unless it is done in batches or distributed across machines for parallel comparison.
    • In LSH, we choose K=64 random vectors of dimensions 4096.
    • Take dot product between each random vector with each image embedding.
    • For each dot product, if the value is positive, replace it with 1, else replace it with 0.
    • Thus for each image embedding we would have a 64 dimensional binary hash.
    • Bucket the images with the same hash value. Idea is that images which are similar will have the same 64 dimensional binary hash.
    • Let there be B unique hashes (of length 64), where B << 120 million.
    • For a given query image, compute its 64 dimensional binary hash using the same random vectors.
    • Compare the query image hash with the B hashes (using Jaccard Similarity). Select the bucket with the highest Jaccard Similarity.
    • Now do a linear scan over the 4096 dimensional image embeddings with only the images inside the selected bucket.
    • Question : How to choose K, the hash length ?
    • Question : Approximately how many comparisons do we need to make in the average and in the worst case ?
    • Question : How to handle false positives and false negatives ?
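
A minimal numpy sketch of the random-hyperplane LSH scheme above: K=64 random vectors produce a 64-bit sign hash per embedding, embeddings are bucketed by hash, and a query is routed to the bucket with the most similar hash (Hamming agreement here; the steps above suggest Jaccard similarity, which behaves similarly for binary hashes):

    import numpy as np
    from collections import defaultdict

    class RandomHyperplaneLSH:
        def __init__(self, dim=4096, k=64, seed=0):
            # k random hyperplanes; the sign pattern of the projections is the hash.
            self.planes = np.random.default_rng(seed).normal(size=(k, dim))
            self.buckets = defaultdict(list)    # hash tuple -> list of image ids

        def _hash(self, embedding):
            return tuple((self.planes @ embedding > 0).astype(np.uint8))

        def index(self, image_id, embedding):
            self.buckets[self._hash(embedding)].append(image_id)

        def query(self, embedding):
            # Return candidate image ids from the bucket whose hash agrees most with
            # the query hash; a linear scan over only these candidates follows.
            q = np.array(self._hash(embedding))
            best = max(self.buckets, key=lambda h: int(np.sum(np.array(h) == q)))
            return self.buckets[best]
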
  •  Inferencing
    • One possible way to do inferencing is that when a query image Q is uploaded, first it is converted into an image array.
    • Next in the triplet network the image Q is paired with every image A from the training dataset. But for triplet we also need another image B.
    • Since we are only concerned about the similarity between Q and A, we can choose B to be any image randomly.
    • With this, we compute D(E_Q, E_A) using only the shared embedding branch of the network (the other branches can be ignored).
    • E_Q and E_A are the embeddings of Q and A respectively from the network.
    • D(E_Q, E_A)=||E_Q-E_A|| is the absolute difference between the embeddings.
    • Smaller the difference more similar is A to Q.
    • Thus we sort the images A based on the scores D(E_Q, E_A).
    • Before displaying the images, we also predict the tags from the query image Q using our image-tag classifier.
    • We keep the tags predicted with at least 80-90% confidence, and filter the candidate images A down to those that share at least one of these tags with the query image. Let these images be called V.
    • Based on the heuristic that a user is unlikely to be interested in more than 1000 results, we pick the first 1000 images from V.
    • Next for each image_id corresponding to V, we fetch the metadata such as the image_urls from CDN, source url of the page from where the image was downloaded, any description of the image etc.
    • The output is sent as a JSON over HTTP to the web application.
    • But the problem is that we have 120 million images now and it is expected to reach a billion by next year. Thus linearly comparing Q across a billion images is not scalable in real time.
    • One scalable solution is as follows.
    • Pre-compute the embeddings of the training images.
    • Distribute the billion training images across 1000 servers and in each server compute the 4096 dimensional embeddings.
    • Each server will contain 1 million embeddings.
    • Create LSH tables for 1 million embeddings with hash length K=64.
    • This is done before any inferencing.
    • Now when Q is uploaded compute the embedding for Q on any server (decided by load balancer).
    • Pass the embedding of Q to all the 1000 servers and query the LSH hash tables.
    • On each server compute the nearest 1000 images and also their similarity scores and return pairs of (image_id, score).
    • Aggregate the (image_id, score) from all servers for all images and then sort them by their scores.
    • The assumption here is that the 1000 servers x 1000 images = 10^6 images returned will contain the top 1000 images we would have obtained if all the data were on one server. (A merge sketch follows this list.)
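
A minimal sketch of merging the per-server (image_id, score) lists into a global top-k, assuming higher scores mean more similar; the de-duplication step covers images that happen to be indexed on more than one server:

    import heapq

    def aggregate_topk(per_server_results, k=1000):
        # per_server_results: iterable of lists of (image_id, score) pairs, one list per server.
        streams = [sorted(r, key=lambda p: -p[1]) for r in per_server_results]
        merged = heapq.merge(*streams, key=lambda p: -p[1])   # globally descending by score
        top, seen = [], set()
        for image_id, score in merged:
            if image_id in seen:
                continue
            seen.add(image_id)
            top.append((image_id, score))
            if len(top) == k:
                break
        return top
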
  • Deployment in Production
    • Use Nginx as the web server for deployment.
    • For the application server, use Flask behind Gunicorn. Flask's built-in development server is not meant for production since, by default, it handles one request at a time.
    • The number of web-servers and the number of application servers depends on the current load which in turn depends on the number of concurrent requests.
    • Assuming that there are 1 million concurrent requests and we want each web-server to handle at-most 1000 concurrent requests then we would need 1000 web-servers behind a load balancer.
    • When a HTTP request is made by the user, the load balancer forwards the request to one of the 1000 web-servers depending on which server has the least number of pending requests.
    • For the application servers, we create a docker image of our application with the model file, the codes, the python dependencies and the system libraries.
    • Docker supports containerization (OS-level virtualization), i.e. we can run multiple versions of our model on the same server with different sets of dependencies, libraries, model files, etc. Each version is encapsulated in its own environment without interfering with the others or with the host OS.
    • This will enable doing A/B experimentation when we need to validate whether a new version of a model is performing better or worse than the current version.
    • Again depending on the load we can have multiple instances of the docker image running on different application servers.
    • To better manage the docker images, we will create a Kubernetes cluster which automatically does load balancing.
    • When an image arrives at an application server, it is converted to an image array and then passed through the triplet network model to compute embeddings.
    • We will have separate image indexing servers which hosts LSH hash tables built on top of 1 million image embeddings.
    • We will query the image indexing servers with a scatter-gather (map-reduce style) strategy to aggregate the (image_id, score) pairs and sort them by score.
    • Using the image-tag classifier, we will predict the most likely tags for the query image.
    • We will host MongoDB instances in different database servers sharded based on image_id field.
    • Apart from storing image metadata information along with the image_id, we will also store the following {'tag': <image_tag>, 'image_ids': <list_of_image_ids_containing_tag>}
    • For a particular image tag we will also store the list of image ids that contains that tag.
    • This will also be stored in the MongoDB servers.
    • We will query the MongoDB servers with the predicted image tags first using map-reduce operation and retrieve the list of image_ids containing the tags.
    • We will again query the MongoDB servers with the filtered image_ids to fetch the metadata information such as the image CDN urls, source web-page url etc.
    • We will then create a JSON response and return it through the Nginx web server to the client browser.
    • Instead of querying the MongoDB servers with the tags every time, we will cache the {tag : image_ids} data in a distributed Redis cache, which is an in-memory data store.
    • Thus before querying the MongoDB servers we first check Redis for the tag key; on a hit we read from Redis, otherwise we read from MongoDB and save the result in Redis with the tag as the key. (A sketch of this caching pattern follows the deployment figure below.)
    • This will save the time taken to read from disk in the Mongo server.
    • We will have separate log servers running Cassandra in the backend.
    • For each query we will send request to the Cassandra log servers to log the query, query_id, timestamp, IP address of user, session id etc.
    • For each click on an image we will again send an AJAX request that calls the Cassandra log servers to log the query id, clicked image id, clicked image rank, clicked image score, click timestamp, session id, etc.
    • Although logging to Cassandra is non-blocking (asynchronous), we will still run multiple Cassandra instances behind a load balancer so that no single instance is overwhelmed with writes; a single instance would also have to hold a very large log.
    • We do not need a master-slave configuration because in production we mostly read from MongoDB and write to Cassandra, while writes to MongoDB and reads from Cassandra happen offline. Thus we do not need separate read and write instances for either MongoDB or Cassandra.
    • Replication strategy to handle failures and downtimes - For each DB instance, we will keep 3 replicas just in case one or more DB instances fails.
[Figure: Architecture of Deployment]

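
A minimal Flask sketch of the tag -> image_ids lookup with a Redis cache in front of MongoDB, as described above. The host names, the database name "visual_search" and the collection name "tag_index" are illustrative placeholders:

    import json
    from flask import Flask, jsonify, request
    import redis
    from pymongo import MongoClient

    app = Flask(__name__)
    cache = redis.Redis(host="redis-host", port=6379, db=0)
    tag_index = MongoClient("mongodb://mongo-host:27017")["visual_search"]["tag_index"]

    def image_ids_for_tag(tag):
        cached = cache.get(tag)
        if cached is not None:                  # cache hit: skip the MongoDB read
            return json.loads(cached)
        doc = tag_index.find_one({"tag": tag}) or {"image_ids": []}
        cache.set(tag, json.dumps(doc["image_ids"]), ex=3600)   # cache for one hour
        return doc["image_ids"]

    @app.route("/search_by_tags", methods=["POST"])
    def search_by_tags():
        tags = request.get_json().get("tags", [])
        ids = sorted({i for t in tags for i in image_ids_for_tag(t)})
        return jsonify({"image_ids": ids})
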
  • Analysis of model performance in production
    • Parse logs
      • Query the Cassandra data for the click logs.
      • For each session compute either MAP or the NDCG score of the predictions.
      • Compute the average MAP or NDCG over all sessions as the performance of the model.
      • We can also check Click Through Rate (CTR) which is the ratio of the number of clicks to the total number of results seen.
      • We can average the CTR across all sessions.
      • Question - Which business metric is closest to the actual training loss ?
    • Training with feedbacks
      • We can use the clicks data to re-train the similarity model.
      • For example, if a user clicks the 10th, 15th and 50th search results, we can create 3 positive pairs (Q, 10), (Q, 15) and (Q, 50), where Q is the query image.
      • For negative pairs we will use the search results from 1 to 49 that were not clicked, e.g. (Q, 1), (Q, 2), ..., (Q, 9), (Q, 11), ...
      • We will treat the feedback data separately from the original training data because we will train the network on the feedback data with a very low learning rate (10^{-5}) and a high L2 regularization factor (0.1).
      • Question - How will you handle scenarios where the same image is queried in two different sessions but the clicks were on different results, i.e. conflicting feedback?
    • Debug issues in model
      • Check server error logs for any issues in the model and the APIs during inferencing.
  • Experimentation
    • A/B Testing
      • How can we verify whether our supervised model is performing better than the un-supervised approach ?
      • How can we verify whether the updated model with feedbacks performs better than the older models ?
      • We will use CTR@k to compute relevant metrics.
      • For example, with the un-supervised model if the user was shown 10 search results and he clicked 4th, 6th and 7th results, then CTR@k for k=1 to 10 is as follows:
      • CTR@k = [0, 0, 0, 1/4, 1/5, 2/6, 3/7, 3/8, 3/9, 3/10], then take average of the values which is (1/4+1/5+2/6+3/7+3/8+3/9+3/10)/10 = 0.222
      • Compute the mean CTR@k for the un-supervised approach across all sessions. Let us call the mean CTR@k as {\mu}_1
      • Next with the supervised model if the user clicked the 2nd, 4th and 5th results, then:
      • CTR@k = [0, 1/2, 1/3, 2/4, 3/5, 3/6, 3/7, 3/8, 3/9, 3/10], then average value is = 0.387
      • Note that although the user again clicked 3 search results, the clicks were at better ranks (closer to the top), so CTR@k is higher than before.
      • Compute the mean CTR@k for the supervised approach across all sessions. Let us call the mean CTR@k as {\mu}_2
      • Let's say that we have observed N_1 sessions with the un-supervised approach and N_2 sessions with the supervised approach
      • We need the probability of observing a mean CTR at least as large as {\mu}_2 under the base model, i.e. the un-supervised approach (the null hypothesis). This probability is the p-value.
      • To compute the p-value we also need to know the standard deviations of CTR@k for both the un-supervised model and the supervised model.
      • Let {\sigma}_1 be the std.dev. of the CTR@k for un-supervised model and {\sigma}_2 for the supervised model.
      • Compute the Z-score Z=\frac{{\mu}_2-{\mu}_1}{\sqrt{\frac{{\sigma}_1^2}{N_1}+\frac{{\sigma}_2^2}{N_2}}}
      • Compute the p-value from the Z-score. If the p-value is less than 0.05, we conclude that the supervised model is significantly better than the un-supervised approach. (A small CTR@k and Z-test sketch follows the figure below.)
      • Similarly if we deploy a model with feedbacks incorporated then we will perform similar analysis to determine whether the new model is better than the old one.
      • Question - How long will you perform the experimentation ?
      • Question - What should be an ideal value of N_1 and N_2 before you stop collecting more data ?
[Figure: A/B Testing]

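
A minimal sketch of the CTR@k computation and the two-sample Z-test above. The first two print calls reproduce the 0.222 and 0.387 worked examples; the per-session score arrays for the control (un-supervised) and treatment (supervised) groups are simulated toy data:

    import numpy as np
    from scipy.stats import norm

    def mean_ctr_at_k(clicked_ranks, n_results):
        # CTR@k = (#clicks among the top k results) / k, averaged over k = 1..n_results.
        clicked = np.zeros(n_results)
        clicked[[r - 1 for r in clicked_ranks]] = 1
        ctr_at_k = np.cumsum(clicked) / np.arange(1, n_results + 1)
        return float(ctr_at_k.mean())

    print(mean_ctr_at_k([4, 6, 7], 10))   # ~0.222 (un-supervised example)
    print(mean_ctr_at_k([2, 4, 5], 10))   # ~0.387 (supervised example)

    rng = np.random.default_rng(0)
    control = np.clip(0.22 + 0.05 * rng.normal(size=5000), 0, 1)    # per-session mean CTR@k
    treatment = np.clip(0.26 + 0.05 * rng.normal(size=4800), 0, 1)  # per-session mean CTR@k

    mu1, mu2 = control.mean(), treatment.mean()
    var1, var2 = control.var(ddof=1), treatment.var(ddof=1)
    z = (mu2 - mu1) / np.sqrt(var1 / len(control) + var2 / len(treatment))
    p_value = norm.sf(z)    # one-sided: chance of a mean this large under the null
    print(z, p_value)
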
    • Control vs. experiment groups
      • For each query we will randomly select one model over the other with equal probability.
      • Both the models will be located on the same server as different Docker images.
      • To track each model's performance (CTR@k, mean CTR and standard deviation), we will add another field 'model_id' to the click logs in Cassandra.

Categories: AI, DESIGN, MACHINE LEARNING
