Stokastik

Machine Learning, AI and Programming

Understanding class activation maps with product color classification

In this post we are going to understand class activation maps (CAMs), a technique for visualising convolutional neural networks, by building a color classification model for e-commerce product images. The product images are tagged with the actual product color. Ideally a product image can have more than one color and should be tagged with all the colors present in it, but to keep the problem simple for visualisation purposes, we will tag all images with more than one color as 'Multicolor'.

Color classification is not a very complex problem, provided that the huge number of color shades and variations are merged into a few standard colors such as Red, Green, Blue, Black, White, Yellow, Brown, Pink, Purple etc. For our problem, the images are tagged with 19 standard colors plus an additional 'Multicolor' label.

The reason I chose this problem for demonstrating CAMs is that e-commerce product images have the following challenges:

  1. There can be a very large white background, which might fool any traditional color extraction algorithm.
  2. There can be other products present in the same image along with the actual product.
    • For example, if the actual product is a 'shoe', then the image might contain a person wearing a t-shirt and jeans along with the shoe.
    • For our purpose, all three of these, i.e. 't-shirt', 'jeans' and 'shoe', are separate product types.
    • The model should be able to understand which product type to focus on.
  3. If the product is a white t-shirt worn by a person on a white background, then the model will predict 'white' as the product color most of the time.
    • But it might be that the model focussed more on the white background than on the actual white t-shirt.
    • Using CAMs we can detect such anomalies in the model.

Challenges in color classification

In order to handle point 2 above, instead of doing simple color classification, we will first train a product type classifier on the training data and then, re-using the same convolution layers, fine-tune the CNN weights learnt by the product type model for the color model. This ensures that, while searching for the correct region of the image to focus on for color classification, the color model can start focussing on the regions where the product type model assigned higher weights.

Following are some of the training and testing parameters used by the models:

  • Total number of training images = 241K, total number of testing images = 27K
  • Total number of product types = 24
  • Total number of color categories = 20
  • Images have been resized to 128x128x3 (see the preprocessing sketch after this list)
  • Batch size used for training = 64
  • Number of epochs = 20
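
As a concrete illustration of these parameters, here is a minimal preprocessing sketch; the loader function, the use of PIL, and the file-path argument are my assumptions rather than the post's actual pipeline:

import numpy as np
from PIL import Image

image_size = 128

# Hypothetical loader: resize to 128x128x3 and scale pixel values to [0, 1]
def load_image(path):
    img = Image.open(path).convert('RGB').resize((image_size, image_size))
    return np.array(img, dtype=np.float32) / 255.0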

We are going to use the following CNN architecture for both product type and color models:

# Keras imports assumed for all code blocks in this post
from keras.layers import (Input, Conv2D, BatchNormalization, MaxPooling2D,
                          AveragePooling2D, Flatten, Dropout, Dense)
from keras.models import Model
from keras import optimizers

image_size = 128

def cnn_model():
    input = Input(shape=(image_size, image_size, 3))
    n_layer = input
    
    n_layer = Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(n_layer)
    #(128, 128, 64)
    n_layer = BatchNormalization()(n_layer)
    n_layer = MaxPooling2D(pool_size=(2, 2))(n_layer)
    #(64, 64, 64)
    
    n_layer = Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(n_layer)
    #(64, 64, 128)
    n_layer = BatchNormalization()(n_layer)
    n_layer = MaxPooling2D(pool_size=(2, 2))(n_layer)
    #(32, 32, 128)
    
    n_layer = Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(n_layer)
    #(32, 32, 128)
    n_layer = BatchNormalization()(n_layer)
    n_layer = MaxPooling2D(pool_size=(2, 2))(n_layer)
    #(16, 16, 128)
    
    n_layer = Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(n_layer)
    #(16, 16, 256)
    n_layer = BatchNormalization()(n_layer)
    n_layer = MaxPooling2D(pool_size=(2, 2))(n_layer)
    #(8, 8, 256)
    
    n_layer = Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(n_layer)
    #(8, 8, 256)
    n_layer = BatchNormalization()(n_layer)
    
    model = Model(inputs=input, outputs=n_layer)
    return model

The complete architecture of the product type model (including the fully connected and the output layers) is as follows:

def init_pt_model():
    input = Input(shape=(image_size, image_size, 3))
    n_layer = cnn_model()(input)
    
    n_layer = AveragePooling2D(pool_size=(8, 8))(n_layer)
    #(1, 1, 256)

    n_layer = Flatten()(n_layer)
    n_layer = Dropout(0.25)(n_layer)
    n_layer = BatchNormalization()(n_layer)
    
    out = Dense(24, activation="softmax")(n_layer)

    model = Model(inputs=input, outputs=out)

    adam = optimizers.Adam(lr=0.0005)
    model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=['accuracy'])
    
    return model

Observe that instead of directly flattening the outputs of the CNN model, we apply Global Average Pooling to them using an AveragePooling2D layer.

Remember what MaxPooling does. If each CNN feature map is of dimensions MxM, then a MaxPool of size (2,2) slides over all 2x2 sub-arrays in the MxM array and takes the maximum value in each. Thus an MxM array is reduced to an M/2 x M/2 array.

Max Pooling of size (2,2)

Similarly, an AveragePooling layer of size 2x2 will take the average of the values in each 2x2 sub-array and reduce the MxM array to an M/2 x M/2 array. In our case we take an AveragePooling layer of size 8x8 and apply it to the 8x8 feature maps output by the CNN. Effectively, AveragePooling of size MxM applied to an MxM array takes the average of all its values and reduces it to a scalar, which is exactly global average pooling.

Average Pooling of size (2,2)
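
To make both operations concrete, here is a tiny NumPy sketch (my own illustration, not part of the original code) of 2x2 max pooling, 2x2 average pooling and global average pooling on a 4x4 array:

import numpy as np

M = np.arange(16, dtype=float).reshape(4, 4)

# Group the 4x4 array into non-overlapping 2x2 blocks
blocks = M.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = blocks.max(axis=(2, 3))    # 2x2 result, like MaxPooling2D((2, 2))
avg_pooled = blocks.mean(axis=(2, 3))   # 2x2 result, like AveragePooling2D((2, 2))

# Global average pooling: an MxM average pool over an MxM array gives a scalar
global_avg = M.mean()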

But why is this necessary?

Before explaining this, let me explain what class activation maps, or CAMs, are. CAMs are a simple yet effective technique for visualizing CNNs. They allow us to visualize which regions of the image get most "activated" for a particular class. The most activated regions are highlighted using a heat-map over the original image.

In our CNN model above, the output is of size 8x8x256, i.e. 256 feature maps each of size 8x8. After AveragePooling it is reduced to 1x1x256 and then flattened to a 256-sized vector, which is projected onto the output layer with 24 classes. Thus each edge from the 256 units to a particular class C represents the weight of the corresponding feature map for class C.

Once we have the weights for each feature map, we can backtrack and compute the weighted sum of all 256 feature maps of size 8x8. Let the CNN output of size 8x8x256 be represented by the matrix A, and let W be the weight matrix from the 256 units (flattened layer) to the 24 classes, so W is of size 256x24. If we multiply A with W (contracting the filter dimension), we obtain a matrix B of size 8x8x24.

B = A.W

If the actual class label for an example is C, then B[:,:,C] is an 8x8 matrix, which is nothing but a heat-map for class label C: the weighted sum of all 256 feature maps of size 8x8, weighted by their importance for class C.
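
As a sanity check on the shapes, here is a minimal NumPy sketch of this weighted sum, with random stand-in values for A and W:

import numpy as np

A = np.random.rand(8, 8, 256)   # stand-in for the CNN output
W = np.random.rand(256, 24)     # stand-in for the dense layer weights

# Contract the filter axis: B[i, j, c] = sum over k of A[i, j, k] * W[k, c]
B = np.tensordot(A, W, axes=([2], [0]))   # shape (8, 8, 24)

C = 5                   # some class index
heat_map = B[:, :, C]   # 8x8 heat-map for class C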

Now suppose we had not used AveragePooling but had directly flattened the 8x8x256 CNN output into a vector of 16384 units and projected it onto the 24 classes. Then we could not backtrack and obtain the matrix B this way: W would be of size 16384x24, with a separate weight for every spatial position of every filter rather than a single weight per filter, so B cannot be obtained from A and W by the simple multiplication above.

Moreover, we have seen that using global AveragePooling and then flattening gives better performance than direct flattening, because with direct flattening of the CNN outputs we lose the image boundary information.

Following method is used to plot class activation maps (or heat-map) over the original image:

import numpy as np
import scipy.ndimage
import matplotlib
matplotlib.use('agg')
from matplotlib import pyplot as plt
from keras import backend as K

def cam(model, image_array, true_labels, label_encoder, out_dir):
    # Undo the 0-1 normalization applied during training to recover pixel values
    image_array = np.array(image_array*255, dtype=np.uint8)
    
    # W: weights of the final dense layer, shape (256, 24)
    class_weights = model.layers[-1].get_weights()[0]
    
    # Backend function returning the output of the nested CNN sub-model (layer 1)
    get_last_conv_output = K.function([model.layers[0].input, model.layers[1].get_input_at(0)], [model.layers[1].get_output_at(0)])
    
    conv_outputs = get_last_conv_output([image_array, image_array])[0]
    # Up-sample the 8x8 feature maps to 128x128 so they overlay the original image
    conv_outputs = scipy.ndimage.zoom(conv_outputs, (1, 16, 16, 1), order=1)
    
    for idx in range(image_array.shape[0]):
        t_label = label_encoder.inverse_transform(true_labels[idx:idx+1])[0]
        # Index of the true class in the one-hot label vector
        a = np.nonzero(true_labels[idx])
        
        if len(a) > 0 and len(a[0]) > 0:
            a = a[0][0]
            fig, ax = plt.subplots()

            ax.imshow(image_array[idx], alpha=0.5)

            # Heat-map for the true class: weighted sum of the up-sampled feature maps
            x = np.dot(conv_outputs[idx], class_weights[:,a])

            ax.imshow(x, cmap='jet', alpha=0.5)
            ax.axis('off')

            fig.savefig(out_dir + '/' + str(idx) + '_true_' + str(t_label).lower() + ".jpg")
            plt.close()

'image_array' is our input image numpy array of size (batch_size=64, 128, 128, 3). During training we normalized the pixel values to between 0 and 1 by dividing by 255, which is why we multiply by 255 in the above code.

'class_weights' gets the matrix W of size 256x24 from the model.

'conv_outputs' gets the output A of size 8x8x256 from the CNN model defined earlier. In order to visualize the heat-map as an overlay on the actual image, we need to convert the 8x8x256 matrix A into a 128x128x256 matrix, because our images are of size 128x128. Thus we use 'scipy.ndimage.zoom' to interpolate (up-sample) the pixel values.

Inside the for-loop, for each image we compute the dot product between A and W[:,C], where C is the actual class label for that image, and obtain a heat-map of dimensions 128x128.

If there are multiple dense layers between the Flatten() layer and the output layer, then we need to chain-multiply all the weight matrices in order to obtain the final output B (strictly speaking, this is exact only if the intermediate dense layers are linear; with non-linear activations in between, the chain product is an approximation). For example, if instead of:

Flatten() -> Dense(24)

we had:

Flatten() -> Dense_1(512) -> Dense_2(256) -> Dense_3(128) -> Dense_4(24)

Let the corresponding weight matrices be W1 (256x512), W2 (512x256), W3 (256x128) and W4 (128x24). Then, in order to obtain B from A:

B = A.W1.W2.W3.W4
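
The same NumPy sketch extended to the chained case, again with random stand-in matrices:

import numpy as np
from functools import reduce

A = np.random.rand(8, 8, 256)
Ws = [np.random.rand(256, 512),   # W1
      np.random.rand(512, 256),   # W2
      np.random.rand(256, 128),   # W3
      np.random.rand(128, 24)]    # W4

# Collapse the dense stack into one effective 256x24 matrix, then contract
W_eff = reduce(np.dot, Ws)
B = np.tensordot(A, W_eff, axes=([2], [0]))   # shape (8, 8, 24)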

Now, to do color classification by re-using the weights learnt by the CNN layers of the product type model, we simply initialize the color model's CNN layer weights with the final weights of the product type model, as defined in the code below:

def init_color_model():
    input = Input(shape=(image_size, image_size, 3))
    cnn = cnn_model()
    
    # Load the trained product type model and copy its CNN weights
    pt_model = init_pt_model()
    pt_model.load_weights('data/pt_model.h5')
    
    # Initialize the color model's CNN with the learnt product type weights
    cnn.set_weights(pt_model.layers[1].get_weights())

    n_layer = cnn(input)
    
    n_layer = AveragePooling2D(pool_size=(8, 8))(n_layer)
    #(1, 1, 256)
    
    n_layer = Flatten()(n_layer)
    n_layer = Dropout(0.25)(n_layer)
    n_layer = BatchNormalization()(n_layer)
    
    out = Dense(20, activation="softmax")(n_layer)

    model = Model(inputs=input, outputs=out)

    adam = optimizers.Adam(lr=0.0005)
    model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=['accuracy'])
    
    return model

Observe that everything is the same as in the product type model, except for the 'set_weights' call on the 'cnn_model' part and the output layer, which now has 20 color classes instead of 24 product types.

Using transfer learning from the product type model enables the color model to learn faster and better, because it starts off already focussing on the important regions of the image.
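
Putting the two stages together, here is a hedged sketch of the overall training workflow; the array names 'train_images', 'pt_labels' and 'color_labels' are hypothetical placeholders for the actual data pipeline in the full code:

# Stage 1: train the product type model and save its weights
pt_model = init_pt_model()
pt_model.fit(train_images, pt_labels, batch_size=64, epochs=20)
pt_model.save_weights('data/pt_model.h5')

# Stage 2: init_color_model() loads those weights into its CNN layers
# and fine-tunes them on the color labels
color_model = init_color_model()
color_model.fit(train_images, color_labels, batch_size=64, epochs=20)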

Following are some results from the CAM predictions:

CAM heat map on red color t-shirt

Observe that the heat-map in the right-side image is concentrated on the actual T-shirt region, and thus the model is able to detect the color of the T-shirt accurately.

CAM heat map on blue color jeans

There are certain activated regions outside of the actual product, but there are also activated regions on the product itself, i.e. the blue jeans. Although the model correctly predicts the color, its confidence would be low here.

Heat map on necklace and shorts.

Notice how the heat-map almost takes on the shape of the necklace in the left-side image. When the images for a certain product type look almost the same, the heat-map actually appears to take the shape of the object.

We obtain around 94% precision and 94% recall on the test dataset for the product type model, whereas we obtain around 87% precision and 80% recall for the color model. The color model performs worse for several reasons:

  • Tagging of colors was very subjective.
    • For example, it is very difficult to distinguish white from silver, red from pink or orange, dark blue from black, or purple from brown, and these cases fooled the human taggers too.
  • When the actual label is 'Multicolor', the model often detects only one of the constituent colors.
  • The model sometimes focusses on regions where the actual product is not present; this was diagnosed from the CAM images.

The full code is shared on my GitHub repository.
