For the past few days, I have been reading quite a lot of research papers, articles and blogs related to artificial neural networks and their transition towards deep learning. With so many different concepts and experiments around selecting the best neural network architecture for a problem, the best learning hyper-parameters, the best optimization algorithm and so on, it becomes a little overwhelming to connect all the dots when you start to solve a problem with neural networks yourself. Fortunately there are very nice tutorials on the internet to help us with that. In this post my goal is to understand what makes neural networks so powerful by looking at them from the viewpoint of a universal function approximator.
In the most generic sense, a neural network can be thought of as a universal function approximator. What this means is that a neural network, with sufficient "tweaking", can be made to learn any mathematical function. Learning a function here means that, given a sufficient number of inputs and outputs from a mathematical function of n variables, we can make the network output values for a new set of inputs from the same domain that are "approximately" correct, without explicitly learning the representation or form of the function. For example, assume a function F:

F(x, y) = x^3 + y^2
We do not know what the function looks like, but we have access to N different input-output tuples of the format:

(x, y, F(x, y))
Based on the above I/O tuples, we can train a neural network to predict that the output for the input (0.7, 8.9) would be somewhere close to 79.55 (with some error), without explicitly telling the network that the function is a non-linear function with the exponent for the first variable being 3.0 and the exponent for the second variable being 2.0. If the function were instead some linear combination of the input variables, such as:

F(x, y) = a*x + b*y
then, given the corresponding I/O tuples for the above function, it becomes a straightforward weight-estimation problem that can be solved using standard regression techniques. But regression techniques (linear and logistic) cannot solve non-linear problems like the earlier function. Neural networks' capabilities extend well beyond such regression techniques, as they can learn both linear and non-linear functional forms.
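To see this limitation concretely, here is a small sketch using NumPy's least-squares solver (the coefficients 3 and 2 in the linear target are arbitrary choices for illustration): ordinary least squares recovers a linear target almost exactly, but leaves a substantial residual on a non-linear target such as x^3 + y^2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 500)
y = rng.uniform(0.0, 1.0, 500)
A = np.column_stack([x, y, np.ones_like(x)])  # design matrix with intercept

# linear target: least squares recovers it (near-)exactly
lin_target = 3.0 * x + 2.0 * y
_, res_lin, _, _ = np.linalg.lstsq(A, lin_target, rcond=None)

# non-linear target: the best linear fit leaves a large residual
nonlin_target = x**3 + y**2
_, res_nonlin, _, _ = np.linalg.lstsq(A, nonlin_target, rcond=None)

print(res_lin, res_nonlin)
```

The residual sum of squares on the linear target is essentially zero, while on the non-linear target it stays large no matter how the three weights are chosen.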
Most introductory tutorials and college courses on neural networks begin with the XOR function. This is a highly non-linear function. The inputs and outputs for this function can be represented using the following tuples:

(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)
If we represent the inputs on a two-dimensional plane as shown below, then the red dots correspond to the output 1 and the blue crosses correspond to the output 0. There is no way to draw a single line on the 2D plane that separates the blue crosses from the red dots: any line that puts (0, 1) and (1, 0) on its positive side must also put at least one of (0, 0) and (1, 1) there. Neural networks are useful for learning exactly this kind of non-linearity.
When we think of functions, we generally think of continuous outputs, but there are also functions that output only discrete values. For example, a function F that takes as inputs 2500 variables corresponding to the pixels of a 50x50 image and outputs 0 if the image is of a dog and 1 if it is of a cat; or a function with more than 2 discrete output values, which again takes 2500 pixel variables as inputs but this time outputs 0 (dog), 1 (cat), 2 (rabbit) or 3 (wolf). Thus we see how classification problems can be thought of as instances of generic function approximation.
Handling non-linearity is not the only thing that gives neural networks their strength. Neural networks can also compose functions. This means that using neural networks, we can learn functions that take other functions as inputs, for example:
F = F(G, H), where G = G(x, y) and H = H(x, y)
Although we could have represented the function F in terms of x and y by expanding the terms for G and H, note that it is simpler for a learning system to first learn the functions G and H and then learn the function F with G and H as inputs: F can be composed of arbitrary functions, and its expansion in terms of the actual inputs could be very complex and difficult to learn. But the data we have access to would not contain explicit outputs for G and H; it would only contain the inputs x and y and the outputs of F, i.e. the values of the functions G and H are "hidden" from us. This gives us a clear idea of why neural networks have "hidden" layers.
Note that we are not concerned with how complex the function representations are, but only with producing outputs that are very close to what the true function representations above would compute. The above problem with F, G, H, x and y can be solved using a neural network architecture with 2 input units (for x and y), 1 hidden layer with 2 units (for G and H) and an output layer with one unit (for F).
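As a toy sketch of this 2-2-1 architecture (the particular choices of G, H and the combining weights below are hypothetical, and the activations are kept linear for readability), the hidden layer computes G and H from x and y, and the output layer computes F from G and H:

```python
import numpy as np

x, y = 0.4, 0.1
v = np.array([x, y])

# hidden layer: row 0 computes G(x, y) = x + y, row 1 computes H(x, y) = x - y
W1 = np.array([[1.0, 1.0],
               [1.0, -1.0]])
hidden = W1 @ v          # [G, H]

# output layer: F(G, H) = 2*G + 3*H
W2 = np.array([2.0, 3.0])
out = W2 @ hidden        # equals 2*(x + y) + 3*(x - y) = 5*x - y
```

A real network would additionally apply a non-linear activation to the hidden values and learn W1 and W2 from data instead of having them hand-set.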
Similarly, for the XOR function, we can first separately learn two lines (again, "functions"): one that separates the top red dot from the blue crosses and one that separates the bottom red dot from the blue crosses. We can then use another function that takes these two lines (or "functions") as inputs and learns how to combine them in the 2D plane.
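This two-line construction can be written down exactly. The sketch below hand-sets the weights of a 2-2-1 network with step activations (these particular thresholds are one of many valid choices): the first hidden unit fires when x + y > 0.5 (above the lower line), the second when x + y > 1.5 (above the upper line), and the output fires only when exactly the first one is active.

```python
def step(z):
    # Heaviside step activation
    return 1.0 if z > 0 else 0.0

def xor_net(x, y):
    h1 = step(x + y - 0.5)   # above the lower separating line
    h2 = step(x + y - 1.5)   # above the upper separating line
    return step(h1 - h2 - 0.5)  # fires iff h1 = 1 and h2 = 0

for (x, y) in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), xor_net(x, y))
```

Each hidden unit is a line in the 2D plane; the output unit combines the two half-planes into the XOR region that no single line could carve out.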
The power of function composition is used heavily in complex image recognition and classification tasks. To recognize an image of a dog, we need to identify the eyes, the ears, the snout, the shape of the body, the four legs, a tail, etc. Think of the recognition of body parts as low-level functions that learn to recognize eyes, ears, nose, body shape and so on. These low-level functions, when composed or combined together (as in assembling the body parts) and used by high-level functions, ultimately help us distinguish pictures of dogs from cats. That is, roughly, how the neurons in our brain work.
Using the same concept of function composition, one can understand how increasing the number of hidden layers or hidden units can lead to overfitting. When modeling a function for a set of points, the more complex the model function, the higher the chance that it overfits the points and fails to generalize well to new ones. See the bias-variance tradeoff.
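A quick way to see this effect, using plain polynomial fitting as a stand-in for network capacity (the degrees, noise level and sample counts below are arbitrary choices), is to fit a low-degree and a near-interpolating polynomial to the same noisy samples of sin(2πx) and compare their errors on fresh points:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 15)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)   # noiseless targets for evaluation

test_mse = {}
for degree in (3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    preds = np.polyval(coeffs, x_test)
    test_mse[degree] = np.mean((preds - y_test) ** 2)

# the degree-14 polynomial passes (almost) exactly through the 15 noisy
# training points, but oscillates between them and generalizes worse
print(test_mse)
```

The higher-capacity model always achieves a lower training error, which is precisely why training error alone cannot detect overfitting.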
Hidden units deep in the network represent the function composition of hidden units earlier in the network; thus adding more hidden layers only increases the complexity of the composite function.
If F(x) = x^2 and G(x) = x^3 and H(x) = F(G(x)), then H(x) = x^6 is a higher-degree polynomial compared to F and G, because it is a composition of F and G.
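This degree-raising effect of composition can be sketched directly with two hypothetical polynomials:

```python
# hypothetical example polynomials: F has degree 2, G has degree 3
F = lambda x: x**2
G = lambda x: x**3
H = lambda x: F(G(x))   # the composition F(G(x)) has degree 2 * 3 = 6

print(H(2))   # (2**3)**2 = 64 = 2**6
```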
Below is a short Python function that constructs artificial data for the function:

F(x, y) = x^3 + y^2

with 1000 data points.
The 1000 instances are then split into 800 (training) and 200 (testing). A neural network with 2 hidden layers of 500 nodes each is trained on the training instances, with an initial learning rate of 0.8, the number of epochs set to 100, a momentum rate of 0.95 and full-batch training instead of mini-batches.
For regression problems such as this one, training on the full batch at a time was observed to give better results than mini-batches.
import numpy as np

def custom_fun():
    func = lambda x, y: x**3 + y**2
    a = np.random.uniform(0.0, 0.5, 1000)
    b = np.random.uniform(0.0, 0.5, 1000)
    c = [func(x, y) for (x, y) in zip(a, b)]
    trainX = np.array([a, b]).T
    trainY = np.array(c)
    n_samples = trainX.shape[0]
    X_train, y_train = trainX[:int(.8 * n_samples)], trainY[:int(.8 * n_samples)]
    X_test, y_test = trainX[int(.8 * n_samples):], trainY[int(.8 * n_samples):]
    print y_test
    hidden_layers = [500, 500]
    weights_learning_rate = 0.8
    bn_learning_rate = 0.9
    num_epochs = 100
    train_batch_size = 1000
    momentum_rate = 0.95
    dropout_rate = 0.0
    layers = hidden_layers + [1]  # one output unit for regression
    weights, biases, momentums, gamma, beta = NeuralNetwork.initialize(layers, X_train.shape)
    nn_model = NeuralNetwork.train_neural_network(X_train, y_train,
                                                  hidden_layers=hidden_layers,
                                                  weights_learning_rate=weights_learning_rate,
                                                  bn_learning_rate=bn_learning_rate,
                                                  num_epochs=num_epochs,
                                                  train_batch_size=train_batch_size,
                                                  momentum_rate=momentum_rate,
                                                  dropout_rate=dropout_rate,
                                                  ini_weights=weights,
                                                  ini_biases=biases,
                                                  ini_momentums=momentums,
                                                  ini_gamma=gamma,
                                                  ini_beta=beta,
                                                  type="regression")
    nn_predict = NeuralNetwork.predict_neural_network(X_test, nn_model, type="regression")
    print np.array([nn_predict, y_test]).T
    print NeuralNetwork.loss_reg(nn_predict, y_test)
    print ""
Following is a snippet from the result of the 200 test instances, with the predicted outputs in the first column and the actual in the second column.
[[array([ 0.09794416]) 0.12830553035218056]
 [array([ 0.18109904]) 0.16456003687636275]
 [array([ 0.29492958]) 0.3044874865459973]
 [array([ 0.09169924]) 0.09708847382411792]
 [array([ 0.22475072]) 0.22427391572651909]
 [array([ 0.06297007]) 0.049254779966471526]
 [array([ 0.21643774]) 0.2106450238758027]
 [array([ 0.17858005]) 0.20181847473506376]
 [array([ 0.03279526]) 0.04406216799846839]
 [array([ 0.02636673]) 0.016820582705981697]
 [array([ 0.11285015]) 0.13602471485335008]
 [array([ 0.09035625]) 0.08020536362400177]
 [array([ 0.10217008]) 0.10420876560343556]
 [array([ 0.03468824]) 0.021760401964940113]
 [array([ 0.0787135]) 0.12312145686018837]
 [array([ 0.02010603]) 0.012928433660894335]
 [array([ 0.13067931]) 0.1274879556530731]
 [array([ 0.09381108]) 0.08213737526727798]
 [array([ 0.10484451]) 0.10873124148225335]
 [array([ 0.1128415]) 0.10352914606312646]
 [array([ 0.12476749]) 0.13628966134554657]
 [array([ 0.0555351]) 0.03968318023762076]
 [array([ 0.06730468]) 0.09440282803461543]
 [array([ 0.11977753]) 0.11082907565669858]
 [array([ 0.15974003]) 0.1667682913779612]
 [array([ 0.08268617]) 0.07750909419281113]
 [array([ 0.08885391]) 0.07692247973326477]
 [array([ 0.21404035]) 0.21668720542595243]
 ...]
The average mean squared error is 0.73. Note that most of the predicted values are quite close to the actual function outputs. Thus the network above has approximately learnt the function F(x, y) without knowing what the function is.
Similarly, for the XOR function, I have defined the following:
def xor_gate():
    X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y_train = np.array([0, 1, 1, 0])
    hidden_layers = [10, 10]
    weights_learning_rate = 0.5
    bn_learning_rate = 0.9
    num_epochs = 100
    train_batch_size = 32
    momentum_rate = 0.95
    dropout_rate = 0.0
    layers = hidden_layers + [1]  # one output unit
    weights, biases, momentums, gamma, beta = NeuralNetwork.initialize(layers, X_train.shape)
    nn_model = NeuralNetwork.train_neural_network(X_train, y_train,
                                                  hidden_layers=hidden_layers,
                                                  weights_learning_rate=weights_learning_rate,
                                                  bn_learning_rate=bn_learning_rate,
                                                  num_epochs=num_epochs,
                                                  train_batch_size=train_batch_size,
                                                  momentum_rate=momentum_rate,
                                                  dropout_rate=dropout_rate,
                                                  ini_weights=weights,
                                                  ini_biases=biases,
                                                  ini_momentums=momentums,
                                                  ini_gamma=gamma,
                                                  ini_beta=beta,
                                                  type="regression")
    nn_predict = NeuralNetwork.predict_neural_network(X_train, nn_model, type="regression")
    print np.array([nn_predict, y_train]).T
    print ""
Following is the output for the predicted and the actual response of the function :
[[0.06003989 0]
 [0.97327438 1]
 [0.97113576 1]
 [0.02613045 0]]
As you can see, with 2 hidden layers of 10 nodes each, the network has learnt the XOR function quite well indeed.
The number of hidden layers and hidden units required to obtain a certain level of error depends on the complexity of the function.
Although a neural network is very powerful, as seen above, there is no free lunch. Training a neural network model involves many intricacies compared to other machine learning algorithms: there are many interacting hyper-parameters that need to be tweaked correctly in order to successfully train the network. There are a few good resources on the internet that explain how to train a neural network from scratch. I found the following particularly useful:
- Backpropagation chapter from neural networks book by Rojas
- Notes on BackPropagation.
- Efficient BackProp by LeCun
- Lecture on Neural Network Tips and Tricks by Richard Socher for CS224D course
- Tutorial on BackPropagation by Quoc V.Le
- Practical Recommendations for Gradient-Based Training of Deep Architectures
- CS231n course materials.
Much of the advancement in deep learning and related areas was initially driven by unsupervised pre-training algorithms for neural networks. Unsupervised pre-training helps to initialize the weights and biases for each layer of the network, as well as to learn hidden representations of the input features, which have proven useful in a variety of applications such as dimensionality reduction (similar to PCA, except that PCA is a linear dimensionality reduction), collaborative filtering, handling missing or corrupted input data, word embeddings, topic modeling, etc.
Without getting into their details, the purpose of both autoencoders and RBMs is to learn the weights and biases of each layer, along with each layer's hidden representation, in a greedy manner: first learn the weights and hidden representation of the first hidden layer using the original inputs; then, after these have converged, move on to the second hidden layer, whose inputs are now the outputs of the first hidden layer; and so on.
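Here is a rough sketch of the greedy scheme using a tied-weight autoencoder trained by plain gradient descent (the layer sizes, learning rate and epoch count are arbitrary choices, and real implementations such as stacked denoising autoencoders or RBMs are considerably more elaborate):

```python
import numpy as np

def recon_error(X, W):
    H = np.tanh(X @ W)
    return np.mean((H @ W.T - X) ** 2)

def train_autoencoder(X, n_hidden, lr=0.1, epochs=300, seed=0):
    """One tied-weight autoencoder layer: encode H = tanh(X W), decode R = H W^T."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, (X.shape[1], n_hidden))
    for _ in range(epochs):
        H = np.tanh(X @ W)
        err = H @ W.T - X                    # reconstruction error
        dH = (err @ W) * (1.0 - H ** 2)      # backprop through the tanh encoder
        grad = X.T @ dH + err.T @ H          # encoder path + decoder path
        W -= lr * grad / X.shape[0]
    return W

# greedy layer-wise pretraining: each layer is trained on the
# hidden representation produced by the previous layer
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
weights, inp = [], X
for n_hidden in [6, 4]:
    W = train_autoencoder(inp, n_hidden)
    weights.append(W)
    inp = np.tanh(inp @ W)                   # input for the next layer
```

After this loop, `weights` holds one initialization per hidden layer, and `inp` is the deepest hidden representation, both of which a supervised fine-tuning stage would then take over.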
The pre-training step is similar to Expectation-Maximization, since EM is also used to estimate the most likely values of the unknown parameters (the weights and biases in a neural network) in the presence of "hidden" or "incomplete" data.
Pre-training focuses on re-usability.
Think of building a software application. While building the application, you need to create some components, e.g. a "click" button. But you need to add click buttons of different shapes, sizes and colors at different parts of your application, depending on the requirements and feedback from your users.
For each new requirement, instead of writing code for creating a new button with the required shape, size and color, you would want to create a re-usable click button template, with shape, size and color as parameters. Similarly, weights and features learnt for detecting edges, arcs etc. can be re-used for learning complex features such as eyes, ears, nose, lips etc.
Why does pre-training work better?
Think of how you would compute the n-th Fibonacci number. One obvious way is to use recursion, i.e. define a recursive function:
def fibonacci(n):
    if n == 1 or n == 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
But as you know, this method is not efficient, as it performs redundant computations. To compute fibonacci(5), we call fibonacci(3) and fibonacci(4). But fibonacci(4) will again compute fibonacci(3) along with fibonacci(2). Thus the program "does not learn" that we have already computed fibonacci(3). One obvious solution is to use dynamic programming: instead of recursive calls, cache the intermediate results (the last two, in fact) and re-use these cached values.
def fibonacci(n):
    a, b = 1, 1
    if n == 1:
        return a
    elif n == 2:
        return b
    else:
        for m in range(3, n+1):
            c = b
            b = a + b
            a = c
        return b
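The same caching idea can also be expressed directly with Python's built-in functools.lru_cache, which memoizes the recursive version without restructuring it:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    # each fibonacci(k) is computed once and then served from the cache
    if n == 1 or n == 2:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(50))   # 12586269025, computed in linear time
```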
Useful resources on Autoencoders and RBMs :
- Applications of Autoencoders in Natural Language Processing
- Deep Autoencoders
- Learning by Denoising Part 1: What and why of denoising
- Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction
- Contractive Auto-Encoders
- Building Autoencoders in Keras
- A Tutorial on Deep Learning
- Self-Taught Learning to Deep Networks
- Learning Deep Generative Model
- Efficient Learning of Sparse Representations with an Energy-Based Model
- Introduction to Restricted Boltzmann Machines
- A Practical Guide to Training Restricted Boltzmann Machines