# Understanding Variational AutoEncoders

This post is motivated by our search for better unsupervised vector representations of the questions that customers ask our agents. Earlier, in a series of posts, we saw how to design and implement a clustering framework for customer questions, so that we can efficiently find the most appropriate answer and, at the same time, find the most similar questions to recommend to the customer.

With a framework in place that can incorporate any kind of vector representation (TF-IDF, PCA, weighted word embeddings, hidden layer outputs from LSTM units and concatenations of individual vectors), our improvement strategy for the task at hand is to find better unsupervised representations that capture the semantic as well as the syntactic meaning of the questions.

Recently there has been a lot of buzz around Variational Autoencoders (VAEs), so we decided to give them a try for our clustering of similar questions. In this post we will look at how VAEs work and how they differ from a normal autoencoder or any other unsupervised algorithm. In the process, we will not limit ourselves to text (questions) but will also apply VAEs to image-related problems.

Variational Autoencoders are similar to any other autoencoder, i.e. a neural network architecture that has two parts, an encoder and a decoder.

Given an input vector X, the encoder network maps the input to a lower dimensional dense representation z (hidden layer) and then the decoder network takes the encoded input z and tries to reconstruct the original input. The reconstructed output X' from decoder network may not be exactly X, but can have some error.

The objective is to minimize the error $\frac{1}{2}(X-X')^T(X-X')$
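As a quick illustration of the error above, here is a minimal NumPy sketch. The function name `reconstruction_error` and the example vectors are made up for this illustration and are not part of the Keras code used in the rest of the post.

```python
import numpy as np

def reconstruction_error(x, x_rec):
    """Half the squared Euclidean distance between an input and its reconstruction."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_rec, dtype=float)
    return 0.5 * float(diff @ diff)

# Hypothetical input and a slightly imperfect reconstruction of it
x = np.array([1.0, 0.0, 0.5])
x_rec = np.array([0.8, 0.1, 0.5])

# 0.5 * (0.2^2 + 0.1^2 + 0^2), i.e. roughly 0.025
print(reconstruction_error(x, x_rec))
```

A perfect reconstruction gives an error of exactly 0, and the error grows quadratically with the size of each mismatch.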

Then what is so special about using neural networks to encode and then decode if, in the end, we only get back the same input with some noise?

Note that this is an unsupervised problem: instead of class labels, we use the same input at the output layer. The idea is to learn the representation 'z', which captures inter-dependencies between the different dimensions of the input. The raw input X may be a sparse TF-IDF or one-hot representation, which does not capture much information.

In short, an autoencoder trained with enough images can learn to reconstruct images similar to those it was trained on. The weights and biases learnt by the network are reusable.

One can also think of the encoding process as a dimensionality reduction strategy.

Below is a function written in Python using Keras libraries, that models a standard autoencoder described as above :

from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D, Lambda, Flatten, Reshape, LSTM, RepeatVector, Dropout
from keras.models import Model
from keras.models import model_from_json
from keras import backend as K
from keras.losses import mse, binary_crossentropy
import cv2, os, pickle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def standard_autoencoder(X_train):
    #Number of pixels per image, i.e. the size of the flattened input
    m = int(np.prod(X_train.shape[1:]))
    input_img = Input(shape=(X_train.shape[1], X_train.shape[2], X_train.shape[3]))

    #Image needs to be flattened in order to train with Dense layers
    x = Flatten()(input_img)

    #Encoder
    x = Dense(512, activation='relu')(x)
    encoded = Dense(256, activation='relu')(x)
    encoder = Model(input_img, encoded)

    #Decoder
    decoder_input = Input(shape=(256,))
    x = Dense(512, activation='relu')(decoder_input)
    x = Dense(m, activation='sigmoid')(x)
    decoded = Reshape((X_train.shape[1], X_train.shape[2], X_train.shape[3]))(x)
    decoder = Model(decoder_input, decoded)

    #Full Autoencoder
    autoencoder = Model(input_img, decoder(encoder(input_img)))
    #The model must be compiled before fit(); 'mse' matches the squared error objective, 'adam' is one reasonable optimizer
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit(X_train, X_train, epochs=50, shuffle=True, batch_size=32)

    return encoder, decoder, autoencoder

I am using images from MNIST handwritten digits. I have used 5000 handwritten digit images for training the above autoencoder.

But before training the autoencoder, I am normalizing the pixel values between 0 and 1. In order to train these images with a standard autoencoder as shown above, we need the pixels as a 1D vector and not as a matrix. Thus we need to flatten the images before training. This is achieved through the Flatten() layer in Keras.

Standard Autoencoder for digit 2.

At the output layer, we need to transform the flattened image back to a pixel matrix using the Reshape() layer.
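To make the pre-processing concrete, the following NumPy sketch mimics what the normalization, the Flatten() layer and the Reshape() layer do to the array shapes. It uses random pixel values as a stand-in for real MNIST images, so the data itself is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a small batch of 28x28 grayscale images with one channel
images = rng.integers(0, 256, size=(4, 28, 28, 1), dtype=np.uint8)

scaled = images.astype("float32") / 255.0    # pixel values now lie in [0, 1]
flat = scaled.reshape(len(scaled), -1)       # what Flatten() produces: shape (4, 784)
restored = flat.reshape(scaled.shape)        # what Reshape() does at the output layer

print(flat.shape, restored.shape)
```

Flattening and reshaping only rearrange the same values, so no pixel information is lost in the round trip.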

Code to read images and transform them.

from keras.datasets import mnist

def load_mnist_data():
    (X_train, train_labels), (X_test, test_labels) = mnist.load_data()

    #Scale pixel values to the range [0, 1]
    X_train = X_train.astype('float32') / 255
    X_test = X_test.astype('float32') / 255

    #Add a trailing channel dimension: (n, 28, 28) -> (n, 28, 28, 1)
    X_train = X_train.reshape((len(X_train), X_train.shape[1], X_train.shape[2], 1))
    X_test = X_test.reshape((len(X_test), X_test.shape[1], X_test.shape[2], 1))

    return X_train, X_test, train_labels, test_labels

To train the model with the image data :

X_train, X_test, train_labels, test_labels = load_mnist_data()
encoder, decoder, autoencoder = standard_autoencoder(X_train)

For prediction, there are two possible ways to achieve the same result. If we only have images and want the reconstructed images from the autoencoder :

decoded_imgs = autoencoder.predict(X_test)

Alternatively, we can first get the encoded representations for a set of images (a possible dimensionality reduction step for a classification problem) :

encoded_imgs = encoder.predict(X_test)

We get the same decoded images as above but using only the decoder with the encoded images as input.

decoded_imgs = decoder.predict(encoded_imgs)

We can reuse the encoded representations as inputs for classifying digits or if we assume that the images are unlabelled, then these encoded representations can be used for clustering.

Code to train a neural net with the encoded images as inputs and the class labels 0 to 9.

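from keras.utils import to_categorical

def train_classifier(encoded_imgs, labels):
    #One-hot encode the digit labels 0 to 9
    labels = to_categorical(labels, num_classes=10)
    inputs = Input(shape=(encoded_imgs.shape[1],))

    x = Dense(128, activation='relu')(inputs)
    x = Dense(64, activation='relu')(x)
    #softmax yields a single probability distribution over the 10 mutually exclusive classes
    outputs = Dense(10, activation='softmax')(x)

    model = Model(inputs, outputs)
    #The model must be compiled before fit(); 'adam' is one reasonable optimizer
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(encoded_imgs, labels, epochs=50, batch_size=32, shuffle=True)

    return model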

Using the encoded representations as inputs to the above neural network, we obtain a training accuracy of 99.99% on 5000 images and a testing accuracy of 93% on 5000 test images.

encoded_imgs = encoder.predict(X_train[:5000])
model = train_classifier(encoded_imgs, train_labels[:5000])

encoded_imgs = encoder.predict(X_test[:5000])
#The test labels must be one-hot encoded in the same way as the training labels
model.evaluate(encoded_imgs, to_categorical(test_labels[:5000], num_classes=10))

If we are already obtaining good enough vector representations, then why not standard autoencoders and why variational autoencoders ?

Variational Autoencoders are generative models, whereas standard autoencoders as described above are not.

By generative models, we mean models of the joint probability distribution P(X, z).

By the product rule of probability :

P(X, z) = P(X|z) * P(z) = P(z|X) * P(X),

and the marginal likelihood of X is given by :

$P(X) = \int_{z}P(X, z)\;dz = \int_{z}P(X|z)*P(z)\;dz$

Intuitively this means that, if we know the distributions P(z|X), P(z) and P(X), then we can reconstruct any input image or text or music etc. as given by the Bayes' Theorem :

$P(X|z) = \frac{P(z|X)*P(X)}{P(z)}$
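The identities above can be checked numerically on a small discrete example. The joint table below is entirely made up for illustration (two values of X and two values of z); in a real VAE both X and z are continuous and high dimensional.

```python
import numpy as np

# Hypothetical joint distribution P(X, z): rows index X, columns index z
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)               # marginal P(X), obtained by summing z out
p_z = joint.sum(axis=0)               # marginal P(z), obtained by summing X out
p_x_given_z = joint / p_z             # P(X|z) = P(X, z) / P(z), column by column
p_z_given_x = joint / p_x[:, None]    # P(z|X) = P(X, z) / P(X), row by row

# Bayes' theorem: P(X|z) = P(z|X) * P(X) / P(z)
bayes = p_z_given_x * p_x[:, None] / p_z

print(np.allclose(bayes, p_x_given_z))
```

The marginal P(X) here is a sum over z; the integral in the formula above is the continuous counterpart of that sum.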

How is this different from the above standard autoencoder scheme ?

A standard autoencoder does not model any probability distribution over the hidden representations z. Every time I pass the same image to a standard autoencoder, it generates the same encoded representation and the same decoded (reconstructed) image, because it "understands" only a single representation z for a given input and set of weights.

In a Variational Autoencoder, by contrast, the encoder learns a probability distribution over z for each input. Every time we pass the same image to the autoencoder, the encoder uses the learnt mean and variance of that distribution to sample a different z, which, when decoded by the decoder, introduces slight variations in the reconstructed image. (The term "variational" itself comes from variational inference, the technique of approximating a distribution that is hard to compute with a simpler, learnable one.)

This property of VAEs is useful in our question clustering task because, given two semantically similar questions Q1 and Q2, the distance between their encoded representations z1 and z2, computed with a standard autoencoder, may be higher than our maximum distance threshold of t=0.25.

But instead of a single representation z1 or z2, we sample 5 representations each for z1 and z2 :

$\{z_1^{(0)}, z_1^{(1)}, z_1^{(2)}, z_1^{(3)}, z_1^{(4)}\}$ and $\{z_2^{(0)}, z_2^{(1)}, z_2^{(2)}, z_2^{(3)}, z_2^{(4)}\}$

using a variational autoencoder and then compute the distance for each of the 25 pairs. If the distance for any pair (or for n out of the 25 pairs) is at most t=0.25, we consider the two questions similar. This will increase our recall of similar questions, but precision may decrease.
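The pairwise matching rule can be sketched as follows. This is only an illustration: the post does not say which distance is used, so cosine distance is an assumption here, the helper names `cosine_distance` and `is_similar` are hypothetical, and the random vectors stand in for encodings that would really come from the VAE encoder.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity (assumed metric; the actual one may differ)."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_similar(samples_1, samples_2, t=0.25, n=1):
    """True if at least n of the sample pairs lie within distance t."""
    close_pairs = sum(
        cosine_distance(a, b) <= t for a in samples_1 for b in samples_2
    )
    return close_pairs >= n

rng = np.random.default_rng(0)

# Stand-ins for 5 sampled encodings per question
base = rng.normal(size=16)
z1 = base + 0.1 * rng.normal(size=(5, 16))    # question 1
z2 = base + 0.1 * rng.normal(size=(5, 16))    # a question close to question 1
z3 = -base + 0.1 * rng.normal(size=(5, 16))   # an unrelated question

print(is_similar(z1, z2), is_similar(z1, z3))
```

Raising n makes the rule stricter, which trades some of the gained recall back for precision.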

In order to generate samples that are highly similar to examples in the dataset, one must maximize the likelihood P(X). In terms of loss function, the loss function to minimize is :

$L = -log \;P(X;\phi)$

where the parameters $\phi$ are the parameters of the model and we need to find the optimum values of $\phi$ that minimizes the above loss function.

From laws of marginal distribution :

$P(X;\phi) = \int_{z}P(X|z;\phi)*P(z)\;dz$

i.e. integration over all possible values of z. Note that we are using integration instead of a summation, because we need the distribution of z, P(z) to be continuous.

If P(z) was discrete then it would lead to scenario where there could be some z, sampled from P(z), which lies outside of the "variation" zone for any input example. If P(z) is continuous, then z is guaranteed to produce a variation of some input.

One possible approach is to consider a unit Gaussian, centered at 0 for P(z), i.e. P(z) = N(0, 1)

It is not possible to enumerate all possible values of z for the integration above. One way is to randomly sample M values of z, from P(z) = N(0, 1) and then approximate the integration with a summation :

$P(X;\phi) \approx \frac{1}{M}\sum_{j=1}^MP(X|z_j;\phi)$, where $z_j\;{\sim}\;N(0, 1)$

But with such a strategy we need to sample a very large number of values of z in order to approximate the integral well with the above summation.
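The sampling estimate can be tried out on a toy model where the exact answer is known. The sketch below assumes a one-dimensional z and a made-up "decoder" P(x|z) = N(x; z, 0.25), for which the marginal is P(x) = N(x; 0, 1.25) in closed form. None of this comes from the post's Keras models; it only illustrates the estimator.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mc_marginal_likelihood(x, m, noise_var=0.25, seed=0):
    """Estimate P(x) as the average of P(x|z_j) over m samples z_j ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(m)
    return float(np.mean(gaussian_pdf(x, z, noise_var)))

x = 1.0
exact = float(gaussian_pdf(x, 0.0, 1.0 + 0.25))   # closed form for this toy model

for m in (10, 1000, 100000):
    print(m, mc_marginal_likelihood(x, m), exact)
```

In one dimension the estimate settles quickly. With hundreds of latent dimensions, most values of z drawn from the prior contribute almost nothing to P(X), which is why the number of samples needed becomes impractical.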

Another strategy is to use a second model $Q(z|X;\theta)$ that generates the hidden representations z given the input X and the parameters $\theta$. Sampling z from $Q(z|X;\theta)$ concentrates on the values of z that are likely to have produced X, so far fewer samples are needed than when sampling blindly from P(z).

With similar reasoning as above for P(z), we choose the distribution $Q(z|X;\theta)$ to be a continuous Gaussian distribution, but it need not be constrained to a unit Gaussian centered at 0.

$Q(z|X;\theta) = N(\mu(X), {\sigma}^2(X);\theta)$

where $N(\mu(X), {\sigma}^2(X);\theta)$ is some arbitrary Gaussian. This allows us to use any value of z randomly sampled from any Gaussian distribution.

The parameters of the Gaussian, the mean $\mu(X)$ and variance ${\sigma}^2(X)$ are hidden layer vectors which are differentiable and learnable through back-propagation.

Since z is now sampled from $Q(z|X;\theta)$ rather than from the prior P(z), we need to add another term to the loss function, i.e. the KL divergence of $Q(z|X;\theta)$ from P(z), which penalizes Q for drifting away from the prior. The reconstruction term is evaluated on samples of z drawn from Q :

$L = -E_{z\sim Q(z|X;\theta)}[\log P(X|z;\phi)]\;+\;KL(Q(z|X;\theta)\,\|\,P(z))$

The closer $Q(z|X;\theta)$ is to P(z), the smaller the second term of the loss.

One can use Stochastic Gradient Descent to solve for the unknown parameters $\phi$ and $\theta$ for the above models, which are nothing but the weights for the decoder network and the encoder network respectively.

For each epoch :

• For each example $x_i$ :
1. Pass $x_i$ through the encoder network with weights $\theta$.
2. Randomly sample a value of z from the distribution $N(\mu(X), {\sigma}^2(X);\theta)$.
3. Use this z as an input to the decoder network with weights $\phi$.
4. Get the output $x'_i$ from the decoder network.
5. Compute the squared error $0.5 * (x_i - x'_i)^2$ or the cross-entropy error $-x_i * \log(x'_i)$.
6. Using back-propagation, update the weights $\phi$ and $\theta$, and with them the mean and variance produced by the encoder network. In the next iteration, the encoder network will generate samples of z from the updated distribution with new values of the mean $\mu(X)$ and variance ${\sigma}^2(X)$.
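Steps 1 to 5 of the loop above can be sketched in plain NumPy for a single example. This is only a forward pass under strong simplifying assumptions: the encoder and decoder are single linear maps with random weights (a real VAE uses deeper networks), the sizes are toy values, and step 6 (back-propagation) is left to a framework such as Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2    # input size and latent size (toy values)

# Hypothetical linear encoder weights (theta) and decoder weights (phi)
theta_mu = rng.normal(scale=0.1, size=(k, d))
theta_log_var = rng.normal(scale=0.1, size=(k, d))
phi = rng.normal(scale=0.1, size=(d, k))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    mu = theta_mu @ x                           # step 1: encoder outputs the mean ...
    log_var = theta_log_var @ x                 # ... and the log-variance
    z = rng.normal(mu, np.exp(0.5 * log_var))   # step 2: sample z ~ N(mu, sigma^2)
    x_rec = sigmoid(phi @ z)                    # steps 3 and 4: decode z into x'
    error = 0.5 * np.sum((x - x_rec) ** 2)      # step 5: squared reconstruction error
    return z, x_rec, float(error)

x = rng.uniform(size=d)
z, x_rec, error = forward(x)
print(z.shape, x_rec.shape, error)
```

Calling `forward(x)` twice on the same x gives two different values of z, which is exactly the behaviour that distinguishes a VAE from a standard autoencoder.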

Note that the sampling step in step 2 above is not a differentiable operation and for back-propagation to work we need every operation (that is not an input) to be differentiable. To get around this problem, a re-parameterization trick is used.

The idea is to move the sampling step out of the network and treat it as an input. Sample ${\epsilon}$ from N(0, 1) as a parallel input alongside the inputs X and then compute the value of z in step 2 as :

$z\;=\;\mu(X)\;+\;\sigma(X)*{\epsilon}$

${\epsilon}$ is just a constant for back-propagation.
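A small NumPy experiment shows both halves of the trick: the re-parameterized z has the intended distribution, and with ${\epsilon}$ held fixed it is an ordinary differentiable function of the mean and standard deviation. The values of mu and sigma below are arbitrary, and the derivatives are checked with finite differences rather than a real back-propagation pass.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, with eps drawn outside the network from N(0, 1)."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                  # arbitrary illustrative values
eps = rng.standard_normal(100000)
z = reparameterize(mu, sigma, eps)

# The samples behave like draws from N(mu, sigma^2)
print(z.mean(), z.std())

# With eps fixed: dz/dmu = 1 and dz/dsigma = eps (finite-difference check)
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps[0]) - z[0]) / h
dz_dsigma = (reparameterize(mu, sigma + h, eps[0]) - z[0]) / h
print(dz_dmu, dz_dsigma, eps[0])
```

Because the randomness now enters only through eps, gradients can flow through mu and sigma back into the encoder weights.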

Architecture of VAE (left : without re-parameterization, right : with re-parameterization)

The KL Divergence loss for each input $X_i$ is calculated as follows :

$-0.5*\sum_{k}[1+\log({\sigma}_k^2(X_i))-{\mu}_k^2(X_i)-{\sigma}_k^2(X_i)]$

where the summation over k runs over the units in the mean and variance layers.

Since an output of the variance layer could be 0 (or even negative), taking its log would throw an error. Instead we let the layer output the log-variance directly and rewrite the above equation as :

$-0.5*\sum_{k}[1+\text{log_var}_k(X_i)-{\mu}_k^2(X_i)-e^{\text{log_var}_k(X_i)}]$

where log_var = $\log({\sigma}^2)$
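The log-variance form of the KL term is easy to sanity-check in NumPy. The function name `kl_to_unit_gaussian` is made up for this sketch; it evaluates the same expression as above with the minus sign distributed into the bracket.

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """KL divergence of N(mu, exp(log_var)) from N(0, 1), summed over latent units."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Equivalent to -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

print(kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0]))   # 0.0, Q is already N(0, 1)
print(kl_to_unit_gaussian([1.0, 0.0], [0.0, 0.0]))   # 0.5, one mean shifted by 1
```

The term is zero only when every unit has mean 0 and variance 1, and it grows as the encoder's distribution moves away from the unit Gaussian.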

The python function to model a variational autoencoder is as follows :

def variational_autoencoder(X_train):
    m = int(np.prod(X_train.shape[1:]))
    input_img = Input(shape=(X_train.shape[1], X_train.shape[2], X_train.shape[3]))

    #Image needs to be flattened in order to train with Dense layers
    x = Flatten()(input_img)

    #Encoder
    x = Dense(512, activation='relu')(x)
    z_mean, z_log_var = Dense(256)(x), Dense(256)(x)
    z = Lambda(sampling, output_shape=(256,))([z_mean, z_log_var])
    encoder = Model(input_img, [z_mean, z_log_var, z])

    #Decoder
    decoder_input = Input(shape=(256,))
    x = Dense(512, activation='relu')(decoder_input)
    x = Dense(m, activation='sigmoid')(x)
    decoded = Reshape((X_train.shape[1], X_train.shape[2], X_train.shape[3]))(x)
    decoder = Model(decoder_input, decoded)

    #Full Autoencoder
    autoencoder_out = decoder(encoder(input_img)[2])
    #The custom layer takes the log-variance as its second input
    out = CustomVariationalLayer()([z_mean, z_log_var, input_img, autoencoder_out])
    autoencoder = Model(input_img, out)
    #The loss is attached by CustomVariationalLayer, so none is passed to compile(); 'adam' is one reasonable optimizer
    autoencoder.compile(optimizer='adam', loss=None)
    autoencoder.fit(X_train, shuffle=True, epochs=100, batch_size=32)

    return encoder, decoder, autoencoder

Notice that in the definition of the autoencoder model, we are using a custom variational autoencoder layer, which is just a dummy layer, that adds the custom loss function defined earlier to the autoencoder model. Here goes the definition of the CustomVariationalLayer (inherits the Keras Layer object) :

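from keras.layers import Layer

class CustomVariationalLayer(Layer):
    def vae_loss(self, z_mean, z_log_var, inputs, outputs):
        #Reconstruction term: per-pixel binary cross-entropy, summed over each example
        reconstruction_loss = K.sum(K.binary_crossentropy(K.batch_flatten(inputs), K.batch_flatten(outputs)), axis=-1)
        #KL term: divergence of N(z_mean, exp(z_log_var)) from N(0, 1), summed over the latent units
        kl_loss = - 0.5 * K.sum(1.0 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        return K.mean(reconstruction_loss + kl_loss)

    def call(self, inputs):
        z_mean, z_log_var, inp, out = inputs
        loss = self.vae_loss(z_mean, z_log_var, inp, out)
        #Register the loss with the model; without this call the layer has no effect on training
        self.add_loss(loss, inputs=inputs)
        return out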

The method to sample z from $Q(z|X;\theta) = N(\mu(X), {\sigma}^2(X))$, using the re-parameterization trick, is as follows :

def sampling(args):
    z_mean, z_log_var = args
    batch, dim = K.shape(z_mean)[0], K.int_shape(z_mean)[1]
    #epsilon is drawn from N(0, 1), as in the re-parameterization formula
    epsilon = K.random_normal(shape=(batch, dim), mean=0.0, stddev=1.0)

    #exp(0.5 * log_var) is the standard deviation
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

Note that in order to sample using re-parameterization trick explained earlier, we are calculating $e^{0.5*\text{log_var}}$, which is nothing but the standard deviation ${\sigma}$ as defined in :

$z\;=\;\mu(X)\;+\;\sigma(X)*{\epsilon}$

To train the VAE with the MNIST digits data :

X_train, X_test, train_labels, test_labels = load_mnist_data()
encoder, decoder, autoencoder = variational_autoencoder(X_train)

Variational Autoencoder

To save and load the models in production for inference, we will use Keras built in libraries instead of pickle library :

encoder.save("encoder.h5")
decoder.save("decoder.h5")
autoencoder.save("autoencoder.h5")

To load and compile the models during run time inference :

from keras.models import load_model

custom_objects = {'CustomVariationalLayer': CustomVariationalLayer}

encoder = load_model("encoder.h5", compile=False)
encoder.predict(X_test)

autoencoder = load_model("autoencoder.h5", custom_objects=custom_objects)
autoencoder.evaluate(x=X_test, y=None)

Note that to get predictions out of a model, such as the encoded representations, we call the .predict() method, which does not require the saved model to be recompiled and thus does not require a reference to any custom loss function. It only requires the learnt weights and biases.

In order to .evaluate() the model against a test input, however, the loss must be computed in addition to the predicted output, so the model needs re-compiling and we need to reference any custom loss function used to build it. In this case, we declare the 'custom_objects' variable with the CustomVariationalLayer custom KL Loss layer.

In order to demonstrate the idea behind the variational autoencoder, we take a single image and the learnt encoder and decoder models above, then pass the same image through the autoencoder several times. Each pass yields a different encoded representation z, sampled from $Q(z|X;\theta)$, and a reconstructed image with a very slight variation.

plt.imshow(X_train[4].reshape(28, 28))
plt.show()

n = 10
figure = np.zeros((28 * n, 28 * n))

for i in range(n):
    for j in range(n):
        #Each call samples a fresh z for the same input image
        encoded_img = encoder.predict(X_train[4:5])
        encoded_img = encoded_img[2]
        decoded_img = decoder.predict(encoded_img)

        #Place the decoded digit in cell (i, j) of the grid
        digit = decoded_img[0].reshape(28, 28)
        figure[i * 28: (i + 1) * 28, j * 28: (j + 1) * 28] = digit

plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='gnuplot2')
plt.show()

The encoder of the variational autoencoder returns a 3-tuple, [mean, log_var, z sampled from N(mean, var)]. We take the 3rd value, which is the encoded representation. We plot 100 different variations of one of the images on a 10x10 grid.

Slight Variations introduced by Variational Autoencoder

Although the images look identical to the naked eye, they were in fact generated from different encodings z, and if you look closely you can observe very slight variations between them.

One can also combine a variational autoencoder with a convolutional neural network, so that the image boundary information does not get lost due to flattening of the images. One possible way to combine a VAE with Conv2D layers is shown below :

from keras.layers import Conv2DTranspose

def conv_variational_autoencoder(X_train):
    input_img = Input(shape=(X_train.shape[1], X_train.shape[2], X_train.shape[3]))

    #Encoder
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    x = Conv2D(1, (3, 3), activation='relu', padding='same')(x)
    q = Flatten()(x)
    #The second Dense layer holds the log-variance, as expected by sampling()
    z_mean, z_log_var = Dense(196)(q), Dense(196)(q)
    z = Lambda(sampling, output_shape=(196,))([z_mean, z_log_var])
    encoder = Model(input_img, [z_mean, z_log_var, z])

    #Decoder
    decoder_input = Input(shape=(196,))
    #196 = 14 * 14, so the latent vector is reshaped into a 14x14 single-channel map
    p = Reshape((14, 14, 1))(decoder_input)
    x = Conv2DTranspose(32, (3, 3), activation='relu', padding='same')(p)
    #Up-sampling doubles 14x14 to 28x28, the size of the MNIST images
    x = UpSampling2D((2, 2))(x)
    dec_out = Conv2DTranspose(1, (3, 3), activation='sigmoid', padding='same')(x)
    decoder = Model(decoder_input, dec_out)

    #Full Autoencoder
    autoencoder_out = decoder(encoder(input_img)[2])
    out = CustomVariationalLayer()([z_mean, z_log_var, input_img, autoencoder_out])
    autoencoder = Model(input_img, out)

    #The loss is attached by CustomVariationalLayer, so none is passed to compile()
    autoencoder.compile(optimizer='adam', loss=None)
    autoencoder.fit(X_train, shuffle=True, epochs=50, batch_size=32)

    return encoder, decoder, autoencoder

The image is first passed through two convolutional layers and then flattened, and the flattened output feeds the mean and log-variance layers of the variational autoencoder. The decoder roughly mirrors this, i.e. a de-convolutional (transposed convolution) layer followed by an up-sampling layer and a final de-convolutional layer. The remaining code is similar to the variational autoencoder code demonstrated earlier.

Convolutional Variational Autoencoder

Edit 1 :

Since we were originally motivated to understand variational auto-encoders due to our problem of finding question representations, we would not be doing justice if we do not implement VAE for texts.

We have implemented a Sequence-To-Sequence Variational Autoencoder for text sentences, that is nothing more than a Variational Autoencoder layer sandwiched between stacked LSTM layers.

def seq2seq_vae(X_train):
    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))

    #Encoder
    x = LSTM(512)(inputs)
    x = Dense(256, activation='relu')(x)
    z_mean, z_log_var = Dense(128)(x), Dense(128)(x)
    z = Lambda(sampling, output_shape=(128,))([z_mean, z_log_var])
    encoder = Model(inputs, [z_mean, z_log_var, z])

    # Decoder
    decoder_input = Input(shape=(128,))
    x = Dense(256, activation='relu')(decoder_input)
    #Repeat the sentence encoding once per timestep so that the LSTM can unroll it
    x = RepeatVector(X_train.shape[1])(x)
    outputs = LSTM(X_train.shape[2], return_sequences=True)(x)
    decoder = Model(decoder_input, outputs)

    # Full Autoencoder
    autoencoder_out = decoder(encoder(inputs)[2])
    #The input tensor of this model is `inputs`
    out = CustomVariationalLayer()([z_mean, z_log_var, inputs, autoencoder_out])
    autoencoder = Model(inputs, out)

    #The loss is attached by CustomVariationalLayer, so none is passed to compile()
    autoencoder.compile(optimizer='adam', loss=None)
    autoencoder.fit(X_train, shuffle=True, epochs=50, batch_size=32)

    return encoder, decoder, autoencoder

The input to the sequence-to-sequence VAE is a sequence of word vectors for a sentence, i.e. if a sentence contains 15 words, then one input to the network is a sequence of 15 word vectors. The number of word vectors is known as the "timesteps". We use a maximum of 10 timesteps, i.e. the first 10 word vectors from a sentence.
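Preparing such fixed-length inputs could look like the sketch below. Keeping the first 10 word vectors follows the description above; zero-padding sentences shorter than 10 words, the 50-dimensional vector size and the helper name `to_fixed_timesteps` are assumptions made for this illustration, and random vectors stand in for real word embeddings.

```python
import numpy as np

def to_fixed_timesteps(word_vectors, max_timesteps=10):
    """Keep the first max_timesteps word vectors and zero-pad shorter sentences."""
    word_vectors = np.asarray(word_vectors, dtype="float32")
    out = np.zeros((max_timesteps, word_vectors.shape[1]), dtype="float32")
    n = min(len(word_vectors), max_timesteps)
    out[:n] = word_vectors[:n]
    return out

rng = np.random.default_rng(0)
long_sentence = rng.normal(size=(15, 50))    # 15 words, truncated to the first 10
short_sentence = rng.normal(size=(4, 50))    # 4 words, padded with 6 zero vectors

X = np.stack([to_fixed_timesteps(long_sentence), to_fixed_timesteps(short_sentence)])
print(X.shape)   # (sentences, timesteps, word vector size)
```

The resulting array has the (samples, timesteps, features) layout that the LSTM input layer expects.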

The output is again a sequence of word vectors (using "return_sequences" in LSTM).

Note that this method does not learn word vectors, since we assume that our inputs and outputs are word vectors, which are immutable.

What we learn is an encoding for the sentence, given by the output of the encoding layer. This gives us freedom from having to use TF-IDF weighted word vectors for a sentence.

Edit 2 :

The dataset that we are working on is a list of questions asked by real customers. Most questions have a structured beginning, i.e. the first few words look like "what is ...", "how can i...", "how much ...", "can i..." etc.

The sequence-to-sequence LSTM network defined above is prone to learning the beginning words of these questions very well and almost all the time the similar questions returned have the same first 'n' words.

To overcome this, we use a Bidirectional LSTM with dropout inputs, thus preventing overfitting towards starting or ending words in a question.

from keras.layers import Bidirectional

def train_seq2seq_vae_model(X_train):
    print("Defining architecture...")

    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))

    #Encoder
    x = Bidirectional(LSTM(512, dropout=0.2))(inputs)
    x = Dense(256, activation='relu')(x)
    z_mean, z_log_var = Dense(128)(x), Dense(128)(x)
    z = Lambda(sampling, output_shape=(128,))([z_mean, z_log_var])
    encoder = Model(inputs, [z_mean, z_log_var, z])

    # Decoder
    decoder_input = Input(shape=(128,))
    x = Dense(256, activation='relu')(decoder_input)
    x = RepeatVector(X_train.shape[1])(x)
    #merge_mode='sum' (an assumed choice) keeps the output width equal to the word vector size;
    #the default 'concat' would double it and no longer match the input in the loss
    outputs = Bidirectional(LSTM(X_train.shape[2], return_sequences=True, dropout=0.2), merge_mode='sum')(x)
    decoder = Model(decoder_input, outputs)

    # Full Autoencoder
    autoencoder_out = decoder(encoder(inputs)[2])
    #The input tensor of this model is `inputs`
    out = CustomVariationalLayer()([z_mean, z_log_var, inputs, autoencoder_out])
    autoencoder = Model(inputs, out)

    return encoder, decoder, autoencoder