We often trust sellers to upload correct product images on our e-commerce platform. Because our checks require every product to be accompanied by an image, third-party (3P) sellers upload a stock or placeholder image when no real image of the product is available. We do not want to show products without images on our website, but since the uploaded image itself is never validated, we end up showing placeholder images against products, which sometimes leads to a decrease in sales. In this post we discuss a few approaches we took to resolve this. We divide our approach into two parts:
- Detect placeholder images that have already been seen in our product catalog.
- Detect placeholder images which have not been seen previously.
The set of possible placeholder images can be huge; a Google search alone can fetch 100K of them. Although 3P sellers try to cheat the system by uploading placeholders, they prefer either to re-use the same image for multiple products or to pick images from a very small subset of similar-looking placeholders. Since it is very hard to define exactly what a placeholder is, we will tag any image as a placeholder if it is "considerably" different from other valid images of the same category.
For the first part, the problem boils down to searching for similar images: given a set of N images seen and tagged as placeholders, classify a new image as a placeholder if it is "similar" to at least one of them. We decided to use image hashes to compare two images. The simplest image hashes we are aware of are Average Hashing (aHash), Difference Hashing (dHash) and Perceptual Hashing (pHash).
I will not go into the details of each algorithm, but just give an overview.
- All of the above algorithms generate a 64-bit hash. To compare two images, compute the Hamming distance between their two 64-bit hashes, i.e. the number of mismatched bit positions.
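- All of them first scale the image down and convert it to grayscale: to 8x8 for aHash and 9x8 for dHash. pHash implementations typically start from a larger image (for example 32x32) and keep only the 8x8 block of lowest-frequency DCT coefficients.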
- aHash computes the mean of the 64 pixel values and assigns 1 to every pixel whose value is greater than the mean and 0 otherwise.
- dHash computes the difference between consecutive pixels, separately for each row: assign 1 if P[x][y] < P[x][y+1], else 0, where x is the row number and y is the column number in the 9x8 (9 columns, 8 rows) pixel matrix. That gives 8 bits per row and 64 bits in total.
- pHash computes the discrete cosine transform (DCT) of the image, then assigns 1 to every coefficient greater than the mean DCT value and 0 otherwise (some implementations, such as the imagehash library, compare against the median instead).
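The aHash and dHash rules above are small enough to sketch in plain Python. This is only an illustration, not the code we ran: it assumes the image has already been scaled down and converted to grayscale, and the function names and the toy gradient input are made up for the example.

```python
def ahash_bits(gray):
    """Average hash of an 8x8 grayscale matrix (8 rows of 8 values)."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    # 1 where the pixel is brighter than the mean, 0 otherwise
    return [1 if p > mean else 0 for p in pixels]


def dhash_bits(gray):
    """Difference hash of a grayscale matrix with 8 rows and 9 columns."""
    bits = []
    for row in gray:
        # compare each pixel with its right-hand neighbour: 8 bits per row
        bits.extend(1 if row[y] < row[y + 1] else 0 for y in range(len(row) - 1))
    return bits


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical toy input: a horizontal gradient, and a copy with one pixel changed.
gradient = [[10 * x for x in range(9)] for _ in range(8)]
altered = [row[:] for row in gradient]
altered[0][4] = 0  # darken one pixel in the first row

h1 = dhash_bits(gradient)
h2 = dhash_bits(altered)
print(len(h1), hamming(h1, h2))  # 64 1
```

Changing a single pixel flips only one of the 64 bits here, which is why a small Hamming distance is a reasonable signal that two images are near-duplicates.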
We went with dHash and pHash because they are more robust to small changes in the images, and finally found pHash to be slightly better than dHash. The matching procedure is as follows:
- Pre-compute the pHash for all of the N tagged placeholder images.
- Compute the pHash for the new image when it comes.
- Compute the Hamming distance between the new image and all of the N placeholder images.
- If the minimum distance to the N placeholders is less than some threshold (15 in our case), tag the new image as a placeholder.
- Compute the precision and recall over a test set of such images. We obtained around 96% precision and recall.
- The test set was generated with the Keras ImageDataGenerator class by adding random variations (such as shifts in height and width, zoom in/out, rescaling etc.).
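The matching and scoring steps above can be sketched without any image library if the hashes are stored as 64-bit integers. This is a simplified illustration under that assumption; the hash values, `is_placeholder` and `precision_recall` are hypothetical and made up for the example.

```python
def hamming64(a, b):
    """Hamming distance between two hashes stored as 64-bit integers."""
    return bin(a ^ b).count("1")


def is_placeholder(new_hash, known_hashes, threshold=15):
    """True if new_hash is within `threshold` bits of any known placeholder hash."""
    if not known_hashes:
        return False
    return min(hamming64(new_hash, h) for h in known_hashes) < threshold


def precision_recall(labels, preds, positive=1):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for l, p in pairs if l == positive and p == positive)
    fp = sum(1 for l, p in pairs if l != positive and p == positive)
    fn = sum(1 for l, p in pairs if l == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical hashes, made up for illustration only.
known = [0x0F0F0F0F0F0F0F0F, 0xFFFF0000FFFF0000]
near = known[0] ^ 0b111      # 3 bits away from a known placeholder
far = 0xAAAAAAAAAAAAAAAA     # 32 bits away from both known hashes

labels = [1, 0]
preds = [int(is_placeholder(h, known)) for h in (near, far)]
print(preds, precision_recall(labels, preds))  # [1, 0] (1.0, 1.0)
```

The linear scan over all known hashes is fine for a few dozen placeholders; it is the part that would need an index once N grows large.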
The Python class for the hashing-based similarity approach is given below. Put the code inside a file named hashing_predictor.py:
import numpy as np
import imagehash
from keras.preprocessing.image import load_img
from sklearn.metrics import classification_report
import data_generator as dg  # our own helper module (not shown in this post)


class Hashing(object):
    def __init__(self, threshold=15):
        self.hashes = []
        self.threshold = threshold

    def fit(self, image_files):
        # pre-compute the pHash of every tagged placeholder image
        for img_file in image_files:
            img = load_img(img_file)
            self.hashes.append(imagehash.phash(img))

    def predict(self, image_files):
        predictions = []
        for img_file in image_files:
            img = load_img(img_file)
            im_hash = imagehash.phash(img)
            # subtracting two ImageHash objects gives their Hamming distance
            min_dist = float("Inf")
            for x in self.hashes:
                min_dist = min(min_dist, im_hash - x)
            # 1 = placeholder, 0 = non-placeholder
            predictions.append(1 if min_dist < self.threshold else 0)
        return predictions

    def score(self, pos_image_files, neg_image_files):
        labels = [1]*len(pos_image_files) + [0]*len(neg_image_files)
        preds = self.predict(pos_image_files) + self.predict(neg_image_files)
        print(classification_report(labels, preds))

    def save(self):
        dg.save_data_npy(np.array(self.hashes), "trained_phashes.npy")

    def load(self):
        self.hashes = list(dg.load_data_npy("trained_phashes.npy"))
The 'fit' method creates the hashes for all the tagged placeholder images. The 'predict' method predicts the labels: 1 for a placeholder and 0 for a non-placeholder. The class can be used as follows:
import glob, os
from hashing_predictor import Hashing

hashing = Hashing(threshold=15)

# images already tagged as placeholders
train_image_files = glob.glob(os.path.join("tagged_placeholder_images", "*.*"))
hashing.fit(train_image_files)

# positives: variations of the placeholders; negatives: valid product images
pos_image_files = glob.glob(os.path.join("similar_placeholder_images", "*.*"))
neg_image_files = glob.glob(os.path.join("product_images", "*.*"))

# score() prints the classification report itself
hashing.score(pos_image_files, neg_image_files)
We have not yet considered scaling this approach to large values of N, but that can be handled using either Locality-Sensitive Hashing (LSH) or a (distributed) KD-Tree. We will discuss large-scale image retrieval further in the next post.
The problem with the hash-based approach is that it works fine only for existing placeholder images or very similar-looking ones, while placeholders can come in many variations that hash-based matching will not be able to capture.
Another problem is that the number of tagged placeholder images is very small (around 60 or so).
Thus, instead of relying on finding similar placeholder images, our new approach is to "ask" a few "nominated" valid product images (around 10) from the same category as the new image whether the new image looks similar to them. If at least 5 of the votes say that the new image is very dissimilar to what valid images from that category look like, we tag the new image as a placeholder.
The idea is borrowed from the one-shot learning framework. The steps are:
- Create dataset with pairs of images.
- The original images were resized to 64x64 and values were scaled by multiplying with 1.0/255.
- Assuming that there are M different categories and each category has Q images, we can create 0.5*M*Q*(Q-1) positive image pairs.
- A label of 1 implies that both images of the pair belong to the same category.
- Similarly, we can randomly sample pairs of images from two different categories. There can be at most 0.5*M*(M-1)*Q^2 different negative pairs from M categories.
- A label of 0 implies that the two images of the pair do not belong to the same category.
- From both the positive and the negative pairs we sample an equal number of instances (around 50K).
- We have observed that augmenting the negative pairs with some tagged placeholder images improves the recall of the placeholder images.
- We could not have trained a binary CNN model on single input images, with label 1 indicating a placeholder and 0 a non-placeholder, because the number of tagged placeholder images was very small.
- We experimented with augmenting the placeholder images using Keras ImageDataGenerator, but the results did not show much improvement.
- Using pairs of images as input enables us to generate more data rather than relying on augmentation techniques.
- Train a Siamese CNN using these pairs of images.
- We followed two approaches: one where we trained CNN-MaxPooling layers from scratch and then added fully connected layers, and
- one where we used transfer learning from VGG16's pre-trained weights for the CNN layers and trained only the last fully connected layer.
- Transfer learning with VGG16 gave better and more stable results.
- The two inputs were combined by taking the element-wise absolute difference and passing through a sigmoid activation layer.
- Batch size was kept at 256.
- Number of image pairs used in training was around 100K, number of pairs used for validation was 20K.
- Placeholder images used in testing data were explicitly downloaded from Google images, thus these images are un-seen placeholders for the model.
- Number of pairs in testing data was around 20K, although the number of placeholder images among them was very small (around 100).
- The testing set precision and recall came out to be 96% (best model). The recall for the label 0 (i.e. placeholders) was 91%.
- Using the voting method, with 11 randomly selected valid product images to compare against, the recall for the placeholders came out to be around 95%.
- If at least 5 out of the 11 give a label of 0, we tag that image as a placeholder.
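The pair construction described in the dataset step above can be sketched as follows. This is a minimal illustration, not our actual data pipeline: the catalog, the image ids and `make_pairs` are hypothetical. It also enumerates every possible pair, which only works for a toy catalog; for real data one would sample pairs directly.

```python
import itertools
import random


def make_pairs(images_by_category, num_pairs, seed=42):
    """Return (pair, label) samples: label 1 = same category, 0 = different."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for images in images_by_category.values():
        # every unordered pair inside one category: Q*(Q-1)/2 per category
        positives.extend(itertools.combinations(images, 2))
    for cat_a, cat_b in itertools.combinations(images_by_category, 2):
        # every cross-category pair: Q*Q per pair of categories
        negatives.extend(itertools.product(images_by_category[cat_a],
                                           images_by_category[cat_b]))
    # sample the same number of positive and negative pairs
    k = min(num_pairs, len(positives), len(negatives))
    sample = [(p, 1) for p in rng.sample(positives, k)]
    sample += [(p, 0) for p in rng.sample(negatives, k)]
    rng.shuffle(sample)
    return sample, len(positives), len(negatives)


# Hypothetical catalog: M = 3 categories with Q = 4 image ids each.
catalog = {c: ["%s_%d" % (c, i) for i in range(4)] for c in ("shoes", "lamps", "mugs")}
pairs, n_pos, n_neg = make_pairs(catalog, num_pairs=10)
print(n_pos, n_neg, len(pairs))  # 18 48 20
```

With M = 3 and Q = 4 the counts match the formulas: 0.5*3*4*3 = 18 positive pairs and 0.5*3*2*16 = 48 negative pairs.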
The approach is pretty generic in that it can be used to detect whether an image in a product category is actually an image for that category. Detecting placeholders becomes a special use case.
Let's create the file siamese_predictor.py:
import math, collections
import numpy as np
from keras import backend as K
from keras import optimizers
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Flatten, Dense, BatchNormalization, Lambda
from keras.models import Model
from sklearn.metrics import classification_report
import data_generator as dg  # our own helper module with the generators

IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64


def get_shared_model_vgg(image_shape):
    inp = Input(shape=image_shape)
    base_model = VGG16(weights='imagenet', include_top=False, input_tensor=inp)
    # freeze the pre-trained convolutional layers
    for layer in base_model.layers:
        layer.trainable = False
    n_layer = base_model.output
    n_layer = Flatten()(n_layer)
    n_layer = Dense(4096, activation='relu')(n_layer)
    n_layer = BatchNormalization()(n_layer)
    model = Model(inputs=[inp], outputs=[n_layer])
    return model


class SiameseModel(object):
    def __init__(self, model_file_path, best_model_file_path, batch_size=256, training_samples=5000, validation_samples=5000, testing_samples=5000, use_vgg=False):
        self.model = None
        self.best_model_file_path = best_model_file_path
        self.batch_size = batch_size
        self.training_samples = training_samples
        self.validation_samples = validation_samples
        self.testing_samples = testing_samples

    def init_model(self):
        image_shape = (IMAGE_HEIGHT, IMAGE_WIDTH, 3)
        input_a, input_b = Input(shape=image_shape), Input(shape=image_shape)
        # both inputs go through the same network (shared weights)
        shared_model = get_shared_model_vgg(image_shape)
        shared_model_a, shared_model_b = shared_model(input_a), shared_model(input_b)
        # element-wise absolute difference of the two embeddings
        n_layer = Lambda(lambda x: K.abs(x[0]-x[1]))([shared_model_a, shared_model_b])
        n_layer = BatchNormalization()(n_layer)
        out = Dense(1, activation="sigmoid")(n_layer)
        self.model = Model(inputs=[input_a, input_b], outputs=[out])
        adam = optimizers.Adam(lr=0.001)
        self.model.compile(optimizer=adam, loss="binary_crossentropy", metrics=['accuracy'])

    def fit(self):
        self.init_model()
        train_num_batches = int(math.ceil(float(self.training_samples)/self.batch_size))
        valid_num_batches = int(math.ceil(float(self.validation_samples)/self.batch_size))
        self.model.fit_generator(dg.get_image_data_siamese(self.training_samples, 'train', batch_size=self.batch_size),
                                 steps_per_epoch=train_num_batches,
                                 validation_data=dg.get_image_data_siamese(self.validation_samples, 'validation', batch_size=self.batch_size),
                                 validation_steps=valid_num_batches,
                                 epochs=10, verbose=1)

    def predict_proba(self, image_data):
        return self.model.predict(image_data)

    def predict(self, image_data):
        return np.rint(self.predict_proba(image_data)).astype(int)

    def score(self):
        data_generator = dg.get_image_data_siamese(self.testing_samples, 'test', batch_size=self.batch_size)
        test_labels, pred_labels = [], []
        total_batches = int(math.ceil(float(self.testing_samples)/self.batch_size))
        num_batches = 0
        # the generator is infinite, so stop after one pass over the test data
        for batch_data, batch_labels in data_generator:
            test_labels += batch_labels.tolist()
            pred_labels += self.predict(batch_data).ravel().tolist()
            num_batches += 1
            if num_batches == total_batches:
                break
        print(classification_report(test_labels, pred_labels))

    def score_ensemble(self, test_image_arr, voting_images_arr, frac=0.5, probability_threshold=0.5):
        image_data_0, image_data_1 = [], []
        index_map = collections.defaultdict(list)
        # pair the test image with every voting image
        for v_img_arr in voting_images_arr:
            image_data_0.append(v_img_arr)
            image_data_1.append(test_image_arr)
        image_data_0, image_data_1 = np.array(image_data_0), np.array(image_data_1)
        proba = self.predict_proba([image_data_0, image_data_1])
        proba = np.array([x[0] for x in proba])
        # vote 1 (same category) if the probability exceeds the threshold
        pred = proba-probability_threshold
        pred[pred <= 0] = 0
        pred[pred > 0] = 1
        # 0 = placeholder: too few voters consider the test image similar
        return 0 if np.count_nonzero(pred) < frac*voting_images_arr.shape[0] else 1
I have omitted certain details from the code to make the code shorter and just enough to explain the above mentioned approach.
I am using Python generators to train and test the model in mini-batches. The code for the generator is defined in another file whose details I am omitting here, but the essential idea is this: there are around 100K training image pairs and each pair contains two images of size 64x64x3 (3 for the number of channels in RGB images), so loading the entire data in memory would take around 18 GB (assuming each value is stored as a 64-bit float).
Instead we can load batches of size 256, i.e. 391 batches for the 100K image pairs, and train using the 'fit_generator' method of Keras.
The Python data generator looks something like this (the code is incomplete):
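The memory and batch figures can be checked with a few lines of arithmetic. The one assumption here, which the post does not state, is that each pixel value is held as an 8-byte float (NumPy's default float type); with 4-byte floats the total would be half of that.

```python
import math

num_pairs = 100_000              # training image pairs
values_per_image = 64 * 64 * 3   # height x width x RGB channels
bytes_per_value = 8              # assumption: float64

total_bytes = num_pairs * 2 * values_per_image * bytes_per_value
total_gib = total_bytes / 2 ** 30
num_batches = math.ceil(num_pairs / 256)

print(round(total_gib, 1), num_batches)  # 18.3 391
```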
def get_image_data_siamese(num_samples, prefix='train', batch_size=256):
    # Loading data_pairs, image_data_1, image_data_2 and labels for the
    # given prefix ('train', 'validation' or 'test') is omitted here.
    n = min(num_samples, len(data_pairs))
    num_batches = int(math.ceil(float(n)/batch_size))
    np.random.seed(42)
    batch_num = 0
    while True:
        m = batch_num % num_batches
        if m == 0:
            # start of a new pass over the data: reshuffle the pairs
            p = np.random.permutation(n)
            image_data_1, image_data_2, labels = image_data_1[p], image_data_2[p], labels[p]
        start, end = m*batch_size, min((m+1)*batch_size, n)
        batch_num += 1
        yield [image_data_1[start:end], image_data_2[start:end]], labels[start:end]
In the 'SiameseModel' class above, we define a method 'score_ensemble', which accepts a test image array (64x64x3) and a set of 11 voting image arrays. The method compares the test image with each of the 11 voting images and counts the votes for label 1 (same category). If fewer than a fraction 'frac' of the votes are 1, i.e. enough voters say that the label should be 0, we classify the test image as a placeholder.
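The voting rule on its own, separated from the model, can be sketched in plain Python. The function name and the similarity scores below are hypothetical; in the real method the scores come from the Siamese model's predictions for the test image against each voting image.

```python
def vote_is_valid(probabilities, frac=0.5, probability_threshold=0.5):
    """Return 1 if enough voters say 'same category', else 0 (placeholder)."""
    # a voter says "same category" when its score exceeds the threshold
    same_votes = sum(1 for p in probabilities if p > probability_threshold)
    return 0 if same_votes < frac * len(probabilities) else 1


# Hypothetical similarity scores from 11 voting images.
scores = [0.9, 0.8, 0.2, 0.1, 0.3, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3]
print(vote_is_valid(scores))  # 0
```

Only 4 of the 11 scores exceed 0.5, which is fewer than 0.5*11 = 5.5, so the image is tagged as a placeholder. Lowering 'frac' makes the rule more lenient towards the test image, raising it makes the rule stricter.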