Understanding Convolution for Deep Learning

With recent advances in Deep Learning and Artificial Intelligence, there has been growing interest among machine learning enthusiasts and data scientists in exploring, on small to medium scale applications, frontiers of artificial intelligence that only a few years ago were the realm of high-speed supercomputers owned by a handful of tech giants. A few such applications are image and speech recognition, language translation, automated image description, phrase detection in text, and face recognition. Most of the recent progress has been with image and textual data.

The most impactful has been training image and text data with deep neural networks. An artificial neural network is modeled on a concept similar to neurons in the human brain. The architecture consists of input neurons (input features), output neurons, and hidden layers in between.

Each neuron in the input, hidden, and output layers is assigned weights, which are updated using the back-propagation algorithm.

Given a set of weights, the input features, and the output, the errors are propagated backwards and the weights are updated such that the errors in successive iterations are reduced. We will implement such a neural network from scratch in a separate post.
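As a minimal illustration of that update rule, below is a sketch of one gradient-descent step for a single sigmoid neuron. The squared-error loss, the sigmoid activation, and all names are assumptions made for this example only; the full network is left to the separate post.

```cpp
#include <cmath>

// Sketch of one back-propagation update for a single sigmoid neuron with
// squared-error loss E = 0.5 * (y - target)^2. The activation, the loss,
// and the names here are illustrative assumptions.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Returns the weight after one gradient-descent step on a single example.
double update_weight(double w, double x, double target, double lr) {
    double y = sigmoid(w * x);        // forward pass
    double dE_dy = y - target;        // derivative of the squared error
    double dy_dz = y * (1.0 - y);     // derivative of the sigmoid
    double dE_dw = dE_dy * dy_dz * x; // chain rule back to the weight
    return w - lr * dE_dw;            // move against the gradient
}
```

Repeated updates drive the neuron's output toward the target, which is exactly the "errors in successive iterations are reduced" behaviour described above.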

Artificial Neural Network

A deep neural network is distinguished from a simple feedforward neural network in that it consists of multiple hidden layers. Each hidden layer maps features from the previous layer into lower dimensional features in the next layer.

For example, in image recognition the first hidden layer maps the input pixels of the image to features that represent object edges; the next hidden layer would similarly map the object edges into object shapes.

This is very similar to how our cognition works. When we see an object, we try to interpret its structure and shape to identify it. All of this happens within layers of neurons in our brain.

One of the components inside a hidden layer is the convolution layer, which plays the most important role in such feature mapping.

Convolution Layer

In each hidden layer, apart from the convolution layer (the first in the series), there is a pooling layer that sub-samples the output of the convolution layer, and a non-linear activation function such as tanh or ReLU that is applied to the pooled features.
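As an illustrative sketch of those two stages (not tied to any particular framework), 2x2 max pooling followed by a ReLU activation could look like this; the function name and the even-dimension assumption are mine:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch: 2x2 max pooling (stride 2) followed by a ReLU
// activation, applied to a matrix of convolution outputs. Assumes the
// input has even dimensions; the names are made up for the example.
std::vector<std::vector<double>>
pool_relu(const std::vector<std::vector<double>>& in) {
    int n = (int)in.size() / 2, m = (int)in[0].size() / 2;
    std::vector<std::vector<double>> out(n, std::vector<double>(m));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            // Sub-sample: keep the maximum of each 2x2 block.
            double mx = std::max({in[2*i][2*j],   in[2*i][2*j+1],
                                  in[2*i+1][2*j], in[2*i+1][2*j+1]});
            out[i][j] = std::max(mx, 0.0); // ReLU(x) = max(x, 0)
        }
    return out;
}
```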

This post is about understanding why and how convolution is applied.

Convolution is a concept derived from digital signal processing. Given a function (input signal) f and a filter function g evolving in time, convolution captures the overlap between f and g as the filter g 'slides over' f in time. The summation (or integral) of the overlap is termed the convolution.

Mathematically:

$(f*g)(\tau) = \int_{-\infty}^{\infty}f(t)\,g(\tau-t)\;dt$ for continuous f and g

At points where f or g is not defined within the domain of integration, the product inside the integral is taken to be 0.

Convolution of Functions

Although the function g moves right along the time axis, the region of g that overlaps with the function f evolves in the opposite direction; hence, in the integral we take the product of f with the reversed function of g, i.e. $f(t)g(\tau-t)$.

The above is easier to understand with discrete examples using vectors.

For example, assume f to be a vector of length n and g to be a vector of length m.

Vector convolution

In the above illustration, the vector f is of length 4 and g is of length 3. As the vector g "slides over" f, the value of the convolution at a particular instant is the dot product of the two vectors over the overlapping region.

The total number of possible values in the convolution vector c is n+m-1.

$c[i]=\sum_{j=0}^{n-1}f[j]*g[m+j-i-1]$

In the above expression, g[k] = 0 for any k < 0 or k >= m. Below is C++ code that computes the convolution of the vectors f and g, where g is the filter function:

std::vector<int> convolution(const std::vector<int>& f, const std::vector<int>& g) {
    int n = (int)f.size();
    int m = (int)g.size();

    std::vector<int> out;

    // The wide convolution has n + m - 1 values.
    for (int i = 0; i < n + m - 1; i++) {
        int sum = 0;

        for (int j = 0; j < n; j++) {
            // Skip terms where the index of g falls outside [0, m-1].
            if (m + j - i - 1 < 0 || m + j - i - 1 >= m) continue;
            sum += f[j] * g[m + j - i - 1];
        }

        out.push_back(sum);
    }

    return out;
}

The same process can be replicated for 2-D convolutions as well, where f and g are represented as matrices.

Matrix Convolution

In the above representation, the matrix f of dimensions 5x5 is:

$f=\begin{bmatrix}1&1&1&0&0\\0&1&1&1&0\\0&0&1&1&1\\0&0&1&1&0\\0&1&1&0&0\end{bmatrix}$

and the "kernel" matrix g of dimensions 3x3 (the yellow sub-matrix) is:

$g=\begin{bmatrix}1&0&1\\0&1&0\\1&0&1\end{bmatrix}$

Note that there are two possible ways to do convolution: the narrow convolution and the wide convolution.

In our 1-D vector example we computed the wide convolution, where even the boundary elements of f are taken into consideration. In the narrow convolution, as shown in the 2-D illustration, we only consider positions where the kernel g fully overlaps with some region of f.

In wide convolution, partial overlaps also count. The total number of convolutions in the narrow case for 1-D vectors is n-m+1.
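A narrow 1-D convolution, kept in the same sliding-dot-product convention as the illustrations, might look like the following sketch; the n - m + 1 output length falls out of the loop bounds:

```cpp
#include <vector>

// Narrow 1-D convolution: only positions where g fully overlaps f count,
// so the output has exactly n - m + 1 values (assumes m <= n).
std::vector<int> narrow_convolution(const std::vector<int>& f,
                                    const std::vector<int>& g) {
    int n = (int)f.size(), m = (int)g.size();
    std::vector<int> out;
    for (int i = 0; i + m <= n; i++) {  // n - m + 1 starting positions
        int sum = 0;
        for (int j = 0; j < m; j++)
            sum += f[i + j] * g[j];     // dot product over the overlap
        out.push_back(sum);
    }
    return out;
}
```

For f = [1, 2, 3, 4] and g = [1, 1, 1], this yields the two values [6, 9], matching n - m + 1 = 2.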

Narrow and Wide Convolutions

For 2-D convolutions, given matrices F (of dimensions KxL) and G (of dimensions PxQ), the wide convolution C is given as:

$C[i, j]=\sum_{k=0}^{K-1}\sum_{l=0}^{L-1}F[k,l]*G[P+k-i-1, Q+l-j-1]$

The narrow convolution is just a subset of the wide convolution.

As above, G[x, y] = 0 if x < 0, y < 0, x >= P, or y >= Q. But the question remains: why is this so important for machine learning (or AI)?
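The wide 2-D formula above can be sketched directly in C++; this is an illustrative translation (integer-valued matrices, G taken as 0 outside its bounds as stated), not an optimized implementation:

```cpp
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Direct sketch of the wide 2-D convolution formula
// C[i][j] = sum_{k,l} F[k][l] * G[P+k-i-1][Q+l-j-1],
// with G treated as 0 outside its bounds.
// The output C has dimensions (K+P-1) x (L+Q-1).
Matrix convolution2d(const Matrix& F, const Matrix& G) {
    int K = (int)F.size(), L = (int)F[0].size();
    int P = (int)G.size(), Q = (int)G[0].size();
    Matrix C(K + P - 1, std::vector<int>(L + Q - 1, 0));
    for (int i = 0; i < K + P - 1; i++)
        for (int j = 0; j < L + Q - 1; j++)
            for (int k = 0; k < K; k++)
                for (int l = 0; l < L; l++) {
                    int gi = P + k - i - 1, gj = Q + l - j - 1;
                    if (gi >= 0 && gi < P && gj >= 0 && gj < Q)
                        C[i][j] += F[k][l] * G[gi][gj];
                }
    return C;
}
```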

In image recognition, the filters that are convolved with a given image learn certain image characteristics that help the neural network correctly classify the image. For example, two such filters from image processing are:

$\text{blurFilter}=\frac{1}{9}\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}$

When convolved with a given image, the above filter blurs it: the filter essentially averages out the pixel values in a neighborhood. Another interesting filter is:

$\text{edgeDetectFilter}=\begin{bmatrix}1&1&1\\1&-8&1\\1&1&1\end{bmatrix}$

When convolved with a given image, the above filter results in an image with the object edges detected. If the neighborhood of a pixel contains similar pixels, they are cancelled out by the filter; if there is a sharp transition due to an edge in either direction, the filter does not cancel out the pixel values. This gives a clear distinction between regions containing similar pixels (no edges) and regions containing sharp transitions (edges).
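The cancellation can be checked numerically. Below is a small sketch that applies a filter in the narrow, sliding-dot-product sense (the edge filter is symmetric, so flipping it would change nothing): a flat patch of equal pixels produces 0, while a patch with a sharp transition does not.

```cpp
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Narrow application of a filter: the filter slides over the image and
// a dot product is taken wherever it fully overlaps (illustrative sketch).
Matrix apply_filter(const Matrix& img, const Matrix& filter) {
    int n = (int)img.size(), m = (int)img[0].size();
    int p = (int)filter.size(), q = (int)filter[0].size();
    Matrix out(n - p + 1, std::vector<int>(m - q + 1, 0));
    for (int i = 0; i + p <= n; i++)
        for (int j = 0; j + q <= m; j++)
            for (int k = 0; k < p; k++)
                for (int l = 0; l < q; l++)
                    out[i][j] += img[i + k][j + l] * filter[k][l];
    return out;
}
```

With the edge-detect filter, a uniform 3x3 patch gives 8·v - 8·v = 0, while a patch whose rightmost column jumps in value yields a non-zero response.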

Given the below image (me posing in front of the Eiffel Tower):

Below is an R program that reads this image, converts it into grayscale, and then applies the edge-detection filter to the pixel matrix:

read.grayscale <- function(img.file) {
    # Requires the 'jpeg' package; readJPEG returns an H x W x 3 array.
    myjpg <- jpeg::readJPEG(img.file)
    # Sum the three channels, then normalize to [0, 1].
    myjpg <- myjpg[,,1] + myjpg[,,2] + myjpg[,,3]
    myjpg <- myjpg / max(myjpg)
    myjpg
}

convolution <- function(img, filter) {
    n <- nrow(img)
    m <- ncol(img)

    p <- nrow(filter)
    q <- ncol(filter)

    # Narrow convolution: the output is (n-p+1) x (m-q+1).
    conv.img <- matrix(0, nrow = n - p + 1, ncol = m - q + 1)

    for (i in 1:nrow(conv.img)) {
        for (j in 1:ncol(conv.img)) {
            # Dot product of the filter with the overlapping image patch.
            conv.img[i, j] <- sum(img[i:(i + p - 1), j:(j + q - 1)] * filter)
        }
    }

    conv.img
}


The 'read.grayscale' function reads the image file, converts it into grayscale, and returns the grayscale pixel matrix. The 'convolution' function applies a filter to the resulting pixel matrix.

filter.edge <- matrix(c(1,1,1,1,-8,1,1,1,1), nrow = 3, byrow = T)

img <- read.grayscale("eiffel.jpg")  # hypothetical file name for the image above
conv.img <- convolution(img, filter.edge)