For the past few days, I have been reading quite a lot of research papers, articles and blogs related to artificial neural networks and its transition towards deep learning. With so many different methods of selecting the best neural network architecture for a problem, the optimal hyper-parameters, the best optimization algorithm and so on, it becomes a little overwhelming to connect all the dots together when we ourselves start to solve a problem using neural networks. Fortunately there are very nice tutorials over the internet to help us do that. In this post my goal is to understand what makes neural networks so powerful by looking at it from the viewpoint of a universal function approximator.
In the most generic sense, a neural network can be thought of as an universal function approximator. What this means is that a neural network with sufficient "tweaking" can be made to learn any mathematical function. Learning a function here means that, given sufficient number of inputs and outputs for a mathematical function of n-variables, we can make the network learn to output values for a new set of inputs coming from the same domain of the function, that are "approximately" correct, without explicitly learning the representation or form of the function. For example, assuming that the function F is :
Instead of knowing the function form of F, all we have access to are the N different input-output tuples of the format :
Based on the above I/O tuples, we can train a neural network to predict that the output for input (0.7, 8.9) would be somewhere close to 79.55 (maybe not exact), without explicitly defining F. If the function was some linear combination of the input variables such as :
Then given the corresponding I/O tuples for the above function, it becomes a straightforward weights estimation problem that can be solved using standard linear regression techniques. Neural network's capabilities extends beyond linear regression techniques as it can learn both linear and non-linear functional forms.
Most introductory tutorials and online courses on Neural Networks begin with the XOR function
This is a non-linear function. The inputs and outputs for this function can be represented using the following tuples :
If we represent the inputs on a two-dimensional plane as shown below, then the red dots corresponds to the output of 1 and the blue crosses corresponds to output 0. There is no way to draw a single line on the 2D plane that can separate the blue crosses from the red dots. Neural Networks are useful for learning such non-linearity.
When we think of functions, we generally think of continuous outputs. There could be functions that output only discrete values such as either 0 or 1 (for example a function F, that takes as inputs 2500 variables corresponding to the pixels of 50x50 images and outputs 0 if the image belongs to a dog or 1 if it belongs to a cat) or more than 2 discrete values (the same input as above but with multiple classes of animal pictures). Thus any classification problem can also be thought of as a mathematical function.
It's not only dealing with non-linearity in functions that gives neural networks its strength. Neural networks can compose functions. This means that using neural networks, we can learn functions that takes as inputs other functions too, for example :
, where and
Although we could have represented the function F in terms of 'x' and 'y' by expanding the terms for G and H, but note that it is simpler to learn the functions G and H first and then learn the function F with G and H as inputs. But the data we would have access to, would not have explicit outputs for G and H, it would only have the inputs x and y and the outputs F, i.e. the values of the functions G and H are "hidden" from us. Thus now we understand that why there are "hidden" layers in neural networks.
Note that we are not concerned with how complex the function representations are but only able to produce outputs that are very close to the actual outputs if we had used the true function definitions to compute them. The above problem with F, G, H and x and y can be represented using a neural network architecture, with 2 input units (for x and y), 1 hidden layer with 2 units (for G and H) and an output layer with one unit for F.
Similarly for the XOR function, first we can separately learn two lines (again "functions") that separates the top red dot from the blue crosses and the bottom red dot from the blue crosses respectively and then use another function that takes these two lines (or "functions") and learns how to align them in the 2D plane.
The power of function composition is used heavily in complex image recognition and classification tasks. To recognize an image of a dog, one needs to identify the eyes, the ears, the snout, shape of the body, four legs, a tail etc. Think of the recognition of body parts as low level functions. These low level functions when composed or combined together (as in assembling the body parts) and used by high level functions would ultimately help us classify pics of dogs from cats. That is how neurons in our brain works.
Using the same concept of function composition one can understand how increasing number of hidden layers or hidden units leads to overfitting. In order to model a function for a set of points, the more complex a function, there is a high chance that it might be overfitting and the model function may not generalize well to new points. Look at Bias-Variance tradeoffs.
Hidden units deep in the network represents the function composition of hidden units earlier in the network, thus adding more hidden layers only increases the complexity of the composite function.
If and and , then
is a higher degree polynomial compared to because it is a composition of F and G.
Below is short python function that constructs artificial training data for a function :
, with 1000 data-points.
The 1000 instances are then split into 800 (training) and 200 (testing). The training instances are trained with a neural network with 2 hidden layers of 500 nodes each, initial learning rate of 0.8, number of epochs set to 100, momentum rate of 0.95 and full-batch training instead of mini-batch.
For regression problems it is observed that instead of mini-batches, training the full batch at a time gives better results.
def custom_fun(): func = lambda x, y: x**3 + y**2 a = np.random.uniform(0.0, 0.5, 1000) b = np.random.uniform(0.0, 0.5, 1000) c = [func(x, y) for (x, y) in zip(a, b)] trainX = np.array([a, b]).T trainY = np.array(c) n_samples = trainX.shape X_train, y_train = trainX[:int(.8 * n_samples)], trainY[:int(.8 * n_samples)] X_test, y_test = trainX[int(.8 * n_samples):], trainY[int(.8 * n_samples):] print y_test hidden_layers = [500, 500] weights_learning_rate = 0.8 bn_learning_rate = 0.9 num_epochs = 100 train_batch_size = 1000 momentum_rate = 0.95 dropout_rate = 0.0 layers = hidden_layers +  weights, biases, momentums, gamma, beta = NeuralNetwork.initialize(layers, X_train.shape) nn_model = NeuralNetwork.train_neural_network(X_train, y_train, hidden_layers=hidden_layers, weights_learning_rate=weights_learning_rate, bn_learning_rate=bn_learning_rate, num_epochs=num_epochs, train_batch_size=train_batch_size, momentum_rate=momentum_rate, dropout_rate=dropout_rate, ini_weights=weights, ini_biases=biases, ini_momentums=momentums, ini_gamma=gamma, ini_beta=beta, type="regression") nn_predict = NeuralNetwork.predict_neural_network(X_test, nn_model, type="regression") print np.array([nn_predict, y_test]).T print NeuralNetwork.loss_reg(nn_predict, y_test) print ""
Following is a snippet from the result of the 200 test instances, with the predicted outputs in the first column and the actual in the second column.
[[array([ 0.09794416]) 0.12830553035218056] [array([ 0.18109904]) 0.16456003687636275] [array([ 0.29492958]) 0.3044874865459973] [array([ 0.09169924]) 0.09708847382411792] [array([ 0.22475072]) 0.22427391572651909] [array([ 0.06297007]) 0.049254779966471526] [array([ 0.21643774]) 0.2106450238758027] [array([ 0.17858005]) 0.20181847473506376] [array([ 0.03279526]) 0.04406216799846839] [array([ 0.02636673]) 0.016820582705981697] [array([ 0.11285015]) 0.13602471485335008] [array([ 0.09035625]) 0.08020536362400177] [array([ 0.10217008]) 0.10420876560343556] [array([ 0.03468824]) 0.021760401964940113] [array([ 0.0787135]) 0.12312145686018837] [array([ 0.02010603]) 0.012928433660894335] [array([ 0.13067931]) 0.1274879556530731] [array([ 0.09381108]) 0.08213737526727798] [array([ 0.10484451]) 0.10873124148225335] [array([ 0.1128415]) 0.10352914606312646] [array([ 0.12476749]) 0.13628966134554657] [array([ 0.0555351]) 0.03968318023762076] [array([ 0.06730468]) 0.09440282803461543] [array([ 0.11977753]) 0.11082907565669858] [array([ 0.15974003]) 0.1667682913779612] [array([ 0.08268617]) 0.07750909419281113] [array([ 0.08885391]) 0.07692247973326477] [array([ 0.21404035]) 0.21668720542595243]...]
The average mean squared error is 0.73. Note that most of the predicted values are quite close to the actual function response. Thus the network above has approximately learnt the function F(x, y) without knowing the functional form of F.
Similarly for the XOR function, I have defined the following function :
def xor_gate(): X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) y_train = np.array([0, 1, 1, 0]) hidden_layers = [10, 10] weights_learning_rate = 0.5 bn_learning_rate = 0.9 num_epochs = 100 train_batch_size = 32 momentum_rate = 0.95 dropout_rate = 0.0 layers = hidden_layers +  weights, biases, momentums, gamma, beta = NeuralNetwork.initialize(layers, X_train.shape) nn_model = NeuralNetwork.train_neural_network(X_train, y_train, hidden_layers=hidden_layers, weights_learning_rate=weights_learning_rate, bn_learning_rate=bn_learning_rate, num_epochs=num_epochs, train_batch_size=train_batch_size, momentum_rate=momentum_rate, dropout_rate=dropout_rate, ini_weights=weights, ini_biases=biases, ini_momentums=momentums, ini_gamma=gamma, ini_beta=beta, type="regression") nn_predict = NeuralNetwork.predict_neural_network(X_train, nn_model, type="regression") print np.array([nn_predict, y_train]).T print ""
Following is the output for the predicted and the actual response of the function :
[[0.06003989 0] [0.97327438 1] [0.97113576 1] [0.02613045 0]]
As you see that, with 2 hidden layers of 10 nodes each, we have learnt the XOR function quite well indeed.
The number of hidden layers and hidden units required to obtain certain level of error depends on the complexity of the function.
Although neural network is very powerful as seen from above, but there is no free lunch. Training a neural network model involves lots of intricacies as compared to other machine learning algorithms. There are too many interacting hyper-parameters that needs to be tweaked correctly in order to successfully train the network. There are a few good resources on the internet which explains on how to train a neural network from scratch. I found the following particularly useful :
- Backpropagation chapter from neural networks book by Rojas
- Notes on BackPropagation.
- Efficient BackProp by Le Cunn
- Lecture on Neural Network Tips and Tricks by Richard Socher for CS224D course
- Tutorial on BackPropagation by Quoc V.Le
- Practical Recommendations for Gradient-Based Training of Deep Architectures
- CS231n course materials.
Re-usability of Neural Networks
Most of the advancement towards deep learning and related areas was largely due to unsupervised pre-training algorithms for neural networks. Unsupervised pre-training helps to initialize weights and biases for each layer of the network as well as learn hidden representations for the input features, found to be useful in variety of applications such as dimensionality reduction (similar to PCA but PCA is linear dim. reduction), collaborative filtering, missing or corrupted input data, word embeddings, topic modeling etc.
Without getting into their details, the purpose of both autoencoders and RBMs is to learn the weights and biases for each layer and also the hidden layer representations of each layer in a greedy manner, i.e. first learn the weights and representation for the first hidden layer using the original inputs, then using these weights and hidden layer representation as inputs, learn the weights and representation for the second layer and so on.
The pre-training step is similar to Expectation-Maximization, since EM is also used to estimate the most likely values of the unknown parameters (weights and biases in neural network) in the presence of "hidden" or "incomplete" data.
Pre-training focuses on re-usability.
Think of building a software application. While building the application, you need to create some components, e.g. a "click" button. But you need to add click buttons of different shapes, sizes, color at different parts of your application depending on the requirement and feedback from your users.
For each new requirement, instead of writing code for creating a new button with the required shape, size and color, you would want to create a re-usable click button template, with shape, size and color as parameters. Similarly, weights and features learnt for detecting edges, arcs etc. can be re-used for learning complex features such as eyes, ears, nose, lips etc.
Why is pre-training a better and faster approach ?
Think of how will you compute the n-th Fibonacci number. One obvious way is to use recursion, i.e. we define a recursive function.
def fibonacci(n): if n == 1 or n == 2: return 1 else: return fibonacci(n-1) + fibonacci(n-2)
But as you know that this method is not efficient, as we are doing redundant computations. In order to compute fibonacci(5), we would be calling fibonacci(3) and fibonacci(4). But fibonacci(4) will in turn, compute fibonacci(3) along with fibonacci(2). Thus the program 'does not learn' that we have already computed fibonacci(3). One obvious solution is to use dynamic programming, i.e. instead of recursive calls, cache the intermediate results (the last two in-fact) and re-use these cached values.
def fibonacci(n): a, b = 1, 1 if n == 1: return a elif n == 2: return b else: for m in range(3, n+1): c = b b = a + b a = c return b
Useful resources on Autoencoders and RBMs :
- Applications of Autoencoders in Natural Language Processing
- Deep Autoencoders
- Learning by Denoising Part 1: What and why of denoising
- Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction
- Contractive Auto-Encoders
- Building Autoencoders in Keras
- A Tutorial on Deep Learning
- Self-Taught Learning to Deep Networks
- Learning Deep Generative Model
- Efficient Learning of Sparse Representations with an Energy-Based Model
- Introduction to Restricted Boltzmann Machines
- A Practical Guide to Training Restricted Boltzmann Machines