In this post I build an artificial neural network from scratch. Many advanced neural network libraries already exist in a variety of programming languages, so the goal is not to reinvent the wheel but to understand the components needed to make a working neural network. A full-fledged, industrial-scale neural network might require a lot of research and experimentation with the dataset. Building a simple neural network from scratch requires understanding the following concepts:

- The architecture - input layer, hidden layers and the output layer and the number of units in each of these layers.
- Cleaning, standardizing and/or reducing the dimensionality of the inputs, as with any other machine learning algorithm.
- Initialization of the weights and biases.
- Non-linear activation function at each of the hidden units - sigmoid, tanh, ReLU etc.
- The outputs - linear weighted combination of the inputs from the last hidden layer, or a logistic function or a probability distribution.
- The backpropagation algorithm for updating the weights and biases.
- Loss function to be minimized during training - mean squared error, logistic error or cross entropy.
- Regularization to prevent overfitting - L1, L2, dropout.
- Optimization algorithms for solving the optimization problem - gradient descent, conjugate gradient, Newton's method, L-BFGS etc.
- Learning rate update schemes - Momentum, Adagrad, Adam etc.

The following Python function trains a neural network:

```python
import math

import numpy as np


def train_neural_network(trainX, trainY, hidden_layers, num_epochs=10, weights_learning_rate=0.5,
                         bn_learning_rate=0.5, train_batch_size=32, momentum_rate=0.9, dropout_rate=0.2,
                         ini_weights=None, ini_biases=None, ini_momentums=None, ini_gamma=None, ini_beta=None,
                         type="classification"):
    if type == "classification":
        trainY = one_hot_encoding(trainY)
    else:
        trainY = np.array(trainY).reshape(len(trainY), -1)

    # The output layer has one unit per target column.
    layers = hidden_layers + [trainY.shape[1]]

    if ini_weights is None:
        weights, biases, momentums, gamma, beta = initialize(layers, trainX.shape[1])
    else:
        weights, biases, momentums, gamma, beta = ini_weights, ini_biases, ini_momentums, ini_gamma, ini_beta

    trainX_batches, trainY_batches = generate_batches(trainX, trainY, train_batch_size)

    losses = []

    expected_mean_linear_inp, expected_var_linear_inp = dict(), dict()
    exp_mean_linear_inp, exp_var_linear_inp = dict(), dict()

    for epoch in range(num_epochs):
        # Reset the running sums of the batch statistics for this epoch.
        for layer in range(len(layers)):
            expected_mean_linear_inp[layer] = np.zeros(weights[layer].shape[1])
            expected_var_linear_inp[layer] = np.zeros(weights[layer].shape[1])

        for batch in range(len(trainX_batches)):
            trainX_batch = trainX_batches[batch]
            trainY_batch = trainY_batches[batch]

            fwd_pass_data = train_forward_pass(trainX_batch, weights, biases, gamma, beta, dropout_rate, type)
            outputs, linear_inp, scaled_linear_inp, mean_linear_inp, var_linear_inp = fwd_pass_data

            for layer in range(len(layers)):
                expected_mean_linear_inp[layer] += mean_linear_inp[layer]
                expected_var_linear_inp[layer] += var_linear_inp[layer]

            backprop = error_backpropagation(trainX_batch, trainY_batch,
                                             outputs=outputs,
                                             linear_inp=linear_inp,
                                             scaled_linear_inp=scaled_linear_inp,
                                             mean_linear_inp=mean_linear_inp,
                                             var_linear_inp=var_linear_inp,
                                             weights=weights,
                                             biases=biases,
                                             momentums=momentums,
                                             gamma=gamma,
                                             beta=beta,
                                             bn_learning_rate=bn_learning_rate,
                                             weights_learning_rate=weights_learning_rate,
                                             momentum_rate=momentum_rate,
                                             type=type)

            weights, biases, momentums, gamma, beta = backprop

        m = train_batch_size

        # Average the batch statistics; the variance gets the m / (m - 1) correction.
        for layer in range(len(layers)):
            exp_mean_linear_inp[layer] = expected_mean_linear_inp[layer] / len(trainX_batches)

            if m > 1:
                exp_var_linear_inp[layer] = (float(m) / (m-1)) * expected_var_linear_inp[layer] / len(trainX_batches)
            else:
                exp_var_linear_inp[layer] = expected_var_linear_inp[layer] / len(trainX_batches)

        # Evaluate the loss on the full training set with dropout-scaled copies of the parameters.
        dummy_weights, dummy_biases = scale_weights_dropout(weights, biases, dropout_rate)

        outputs = test_forward_pass(trainX, weights=dummy_weights, biases=dummy_biases, gamma=gamma, beta=beta,
                                    mean_linear_inp=exp_mean_linear_inp, var_linear_inp=exp_var_linear_inp,
                                    type=type)

        if type == "classification":
            curr_loss = loss_class(outputs, trainY)
        else:
            curr_loss = loss_reg(outputs, trainY)

        # Halve the learning rate if the loss rose for two consecutive epochs.
        cond = len(losses) > 1 and curr_loss > losses[-1] > losses[-2]

        if cond:
            weights_learning_rate /= float(2.0)

        losses.append(curr_loss)

    weights, biases = scale_weights_dropout(weights, biases, dropout_rate)

    model = (weights, biases, momentums, gamma, beta, exp_mean_linear_inp, exp_var_linear_inp)

    return model
```

'trainX' is the training feature numpy matrix and 'trainY' is the training numpy array of the class labels (for classification) or response (in regression).

'hidden_layers' is a list whose length equals the number of desired hidden layers; each element is the number of units in that hidden layer. For example, [4, 5] means 4 units in the 1st hidden layer and 5 units in the 2nd hidden layer.

'num_epochs' is the number of times each training instance is used by the network.

'weights_learning_rate' is the learning rate for the weights and biases in the stochastic gradient descent updates. It stays constant unless the loss rises for two consecutive epochs, in which case it is halved.

We train the neural network using stochastic gradient descent with momentum, a simple and fast yet effective technique. It is easy to extend the optimization with the AdaGrad or Adam variants. Depending on the size of the data and the need for second-order optimization, one can also experiment with conjugate gradient or L-BFGS.

'momentum_rate' is the momentum factor associated with stochastic gradient descent updates for the weights and biases. Momentum helps to accelerate SGD convergence to local minima.
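To make the momentum update concrete, here is a minimal sketch on a one-dimensional toy problem. The function name `sgd_momentum_step` and the quadratic objective are assumptions chosen for illustration; only the update rule itself (velocity = momentum_rate * velocity - learning_rate * gradient, then weight += velocity) mirrors the training code in this post.

```python
def sgd_momentum_step(weight, velocity, grad, learning_rate=0.1, momentum_rate=0.9):
    """One update: the velocity accumulates past gradients, then moves the weight."""
    velocity = momentum_rate * velocity - learning_rate * grad
    return weight + velocity, velocity


# Toy objective (an assumption for illustration): f(w) = (w - 3)^2, minimized at w = 3.
w, v = 0.0, 0.0
for _ in range(300):
    grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
    w, v = sgd_momentum_step(w, v, grad)

print(round(w, 4))
```

The velocity term lets consecutive gradients that point the same way add up, which is where the acceleration comes from; on this toy problem the iterate overshoots and oscillates around the minimum before settling.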

'train_batch_size' is the size of the mini-batch used in stochastic gradient descent updates.

To improve the convergence speed of the network we use Batch Normalization to handle internal covariate shift in the inputs to each unit. The inputs are standardized to zero mean and unit variance (by the function below) and then scaled and shifted by the learnable parameters 'gamma' and 'beta' respectively, which appear in the function above.

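```python
def standardize_mean_var(mydata, mean=None, var=None):
    # Use the batch statistics unless precomputed ones are supplied (prediction time).
    if mean is None and var is None:
        mean = np.mean(mydata, axis=0)
        var = np.var(mydata, axis=0)

    # The small constant guards against division by zero.
    std_data = (mydata - mean) * (var + 1e-5) ** -0.5

    return std_data, mean, var
```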

'bn_learning_rate' refers to the learning rate for the Batch Normalization parameters 'gamma' and 'beta'. The pseudo-code in the original paper shows where 'gamma' and 'beta' come into the equation.
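For reference, the batch normalizing transform from the batch normalization paper (Ioffe and Szegedy, 2015) can be written as follows for the values $x_1, \dots, x_m$ that one unit's linear input takes over a mini-batch of size $m$. Here $\epsilon$ is a small constant (1e-5 in 'standardize_mean_var'), and $\hat{x}_i$ and $y_i$ correspond to what the code calls 'scaled_linear_inp' and 'shifted_inp'.

```latex
\begin{aligned}
\mu_B      &= \frac{1}{m} \sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i  &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i        &= \gamma \, \hat{x}_i + \beta
\end{aligned}
```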

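The variables 'expected_mean_linear_inp', 'expected_var_linear_inp', 'exp_mean_linear_inp' and 'exp_var_linear_inp' are used for batch normalization. For each layer they hold the expected mean and the expected variance of the inputs (before the non-linear function is applied), where the expectation is taken over the mini-batches of an epoch. The first two accumulate the per-batch statistics; the last two are the resulting averages, with the variance multiplied by m / (m - 1) for a mini-batch size m to make the estimate unbiased.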
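The averaging of batch statistics can be sketched in isolation as follows. The synthetic data, its shape and the variable names `sum_mean`, `sum_var` are assumptions for illustration; the arithmetic follows the training loop, including the m / (m - 1) correction applied to the averaged batch variance.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic "linear inputs" for one layer with 4 units (an assumption for illustration).
data = rng.normal(loc=2.0, scale=3.0, size=(640, 4))
batch_size = 32
batches = np.array_split(data, data.shape[0] // batch_size)  # 20 batches of 32 rows

# Running sums of the per-batch statistics, as accumulated during an epoch.
sum_mean = np.zeros(data.shape[1])
sum_var = np.zeros(data.shape[1])
for batch in batches:
    sum_mean += batch.mean(axis=0)
    sum_var += batch.var(axis=0)  # biased variance (divides by m)

m = batch_size
exp_mean = sum_mean / len(batches)
exp_var = (float(m) / (m - 1)) * sum_var / len(batches)  # unbiased correction

print(exp_mean)
print(exp_var)
```

With equally sized batches the averaged mean equals the mean of the whole dataset, and the corrected variance should land close to the true variance of 9 used to generate the data.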

To handle overfitting we use Dropout, i.e. each hidden unit in a layer is dropped at random with some probability so that hidden units do not co-adapt. During testing or prediction no units are dropped; instead, each weight is multiplied by the retention probability (1 - dropout probability).

The function below scales the weights by the retention probability of the units. It is called at the end of each epoch to evaluate the loss, and once more after all the weights and biases are learnt, when the model is ready for prediction.

```python
def scale_weights_dropout(weights, biases, dropout_rate):
    scaled_weights, scaled_biases = dict(), dict()

    # Return scaled copies; the original dictionaries are left untouched.
    for layer in weights:
        scaled_weights[layer] = weights[layer] * (1 - dropout_rate)
        scaled_biases[layer] = biases[layer] * (1 - dropout_rate)

    return scaled_weights, scaled_biases
```

'dropout_rate' is the probability with which a hidden unit is dropped. We are not using any L1 or L2 or Elastic Net regularization in our training.
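A small simulation shows why scaling by the retention probability is a reasonable substitute for dropping units at prediction time. The activation values and the number of simulated masks are made up for illustration; the point is only that the average of the randomly masked activations approaches the activations multiplied by (1 - dropout_rate).

```python
import numpy as np

rng = np.random.RandomState(0)
dropout_rate = 0.2
activations = np.array([1.0, 2.0, 3.0, 4.0])  # outputs of one hidden layer (made-up values)

# Training: each unit is kept with probability 1 - dropout_rate, independently per draw.
masks = rng.binomial(1, 1 - dropout_rate, size=(100000, activations.size))
mean_train_output = (masks * activations).mean(axis=0)

# Prediction: keep every unit but scale by the retention probability.
test_output = activations * (1 - dropout_rate)

print(mean_train_output)
print(test_output)
```

Scaling the activations this way has the same effect on the next layer's linear input as scaling that layer's weights, which is the form the scaling takes in the function above.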

'ini_weights', 'ini_biases' and 'ini_momentums' are the initial values of the weights, biases and momentums. If they are not supplied, the weights are drawn from a uniform distribution over a small symmetric range, and the biases and momentums start at zero.

'ini_gamma' and 'ini_beta' are the initial values for the gamma and beta for batch normalization.

```python
def initialize(layers, num_features):
    weights, biases, momentums, gamma, beta = dict(), dict(), dict(), dict(), dict()

    for layer in range(len(layers)):
        # The weight matrix of a layer has shape (inputs to the layer, units in the layer).
        if layer == 0:
            num_rows = num_features
            num_cols = layers[layer]
        else:
            num_rows = layers[layer - 1]
            num_cols = layers[layer]

        fan_in = num_rows

        if layer < len(layers)-1:
            fan_out = layers[layer + 1]
        else:
            fan_out = fan_in

        # Half-width of the uniform range for the initial weights.
        r = 4.0 * math.sqrt(float(6.0) / (fan_in + fan_out))

        weights[layer] = np.random.uniform(-r, r, num_rows * num_cols).reshape(num_rows, num_cols)
        momentums[layer] = np.zeros((num_rows, num_cols))
        biases[layer] = np.zeros(num_cols)
        gamma[layer] = np.ones(num_cols)
        beta[layer] = np.zeros(num_cols)

    return weights, biases, momentums, gamma, beta
```

We re-use the same training function for both classification and single-output regression problems by passing the parameter 'type'. For a classification problem, we one-hot encode the class label vector and use it at the output layer. For a regression problem, we use only a single output unit at the output layer.

The parameters weights, biases, etc. are initialized layer-wise.

```python
def one_hot_encoding(classes):
    num_classes = len(set(classes))
    targets = np.array([classes]).reshape(-1)

    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[targets]
```
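As a quick illustration of the identity-matrix trick used for one-hot encoding, here is a standalone sketch with made-up labels. It assumes, as the function above implicitly does, that the labels are the integers 0 to K-1.

```python
import numpy as np

labels = [2, 0, 1, 2]            # integer class labels, assumed to be 0 .. K-1
num_classes = len(set(labels))   # K = 3
one_hot = np.eye(num_classes)[np.array(labels)]

print(one_hot)
```

Each row has a single 1 in the column of its class. If the labels skipped a value (for example only 0 and 2 were present), the count of distinct labels would be smaller than the largest label plus one and the indexing would fail, so labels may need to be remapped first.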

Next we create the mini-batches for the training instances (we randomly shuffle the training instances).

```python
def generate_batches(trainX, trainY, batch_size):
    # Stack features and targets so that rows stay aligned while shuffling.
    concatenated = np.column_stack((trainX, trainY))
    np.random.shuffle(concatenated)

    trainX = concatenated[:,:trainX.shape[1]]
    trainY = concatenated[:,trainX.shape[1]:]

    # np.array_split needs an integer number of sections.
    num_batches = int(math.ceil(float(trainX.shape[0])/batch_size))

    return np.array_split(trainX, num_batches), np.array_split(trainY, num_batches)
```
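The shuffling and splitting can be tried on a tiny made-up dataset. The arrays `X` and `y` and the batch size of 4 are assumptions for illustration; the steps are the same column-stack, shuffle and split used above.

```python
import math

import numpy as np

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.arange(10).reshape(10, 1)   # one target column per sample

stacked = np.column_stack((X, y))  # keep each row's features and target together
np.random.shuffle(stacked)         # shuffles the rows in place
X_shuffled, y_shuffled = stacked[:, :2], stacked[:, 2:]

num_batches = int(math.ceil(X.shape[0] / 4.0))  # batch size 4 -> 3 batches
X_batches = np.array_split(X_shuffled, num_batches)
y_batches = np.array_split(y_shuffled, num_batches)

print([b.shape[0] for b in X_batches])  # [4, 3, 3]
```

Note that `np.array_split` balances the batch sizes rather than producing full batches followed by a short remainder, so individual batches can be smaller than the requested batch size.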

Then for each epoch, we pass all the mini-batches sequentially to the 'train_forward_pass' method, which does the feedforward step for the neural network. In the forward pass, the variables 'linear_inp', 'scaled_linear_inp', 'mean_linear_inp' and 'var_linear_inp' are cached values for batch normalization that are re-used for learning the parameters 'gamma' and 'beta' during backpropagation.

If the current layer is a hidden layer then we apply either sigmoid or ReLU activation function, else if the current layer is the output layer, then we apply either softmax (for classification) or linear weighted inputs (for regression).

The activations of a hidden layer are multiplied by a random 0/1 mask drawn from a binomial distribution, so that each unit is dropped with probability 'dropout_rate', independently for every training example.

```python
def train_forward_pass(trainX, weights, biases, gamma, beta, dropout_rate, type):
    outputs, linear_inp, scaled_linear_inp = dict(), dict(), dict()
    mean_linear_inp, var_linear_inp = dict(), dict()

    curr_input = trainX

    for layer in range(len(weights)):
        linear_inp[layer] = curr_input.dot(weights[layer]) + biases[layer]

        # Batch normalization: standardize, then scale by gamma and shift by beta.
        scaled_linear_inp[layer], mean_linear_inp[layer], var_linear_inp[layer] = standardize_mean_var(
            linear_inp[layer])

        shifted_inp = gamma[layer] * scaled_linear_inp[layer] + beta[layer]

        if layer == len(weights) - 1:
            if type == "classification":
                outputs[layer] = output_layer_activation_class_softmax(shifted_inp)
            else:
                outputs[layer] = output_layer_activation_reg(shifted_inp)
        else:
            # One independent keep/drop draw per example and per hidden unit.
            binomial_mat = np.random.binomial(1, 1 - dropout_rate,
                                              size=(trainX.shape[0], weights[layer].shape[1]))

            outputs[layer] = hidden_layer_activation_relu(shifted_inp) * binomial_mat

        curr_input = outputs[layer]

    return outputs, linear_inp, scaled_linear_inp, mean_linear_inp, var_linear_inp
```

After the feedforward step is done for each batch, we update the learnable parameters (weights, biases, momentums, gamma and beta) using the backpropagation algorithm below.

During the forward pass, we cache the non-linear outputs from each unit in each layer, as well as the scaled and shifted inputs to the non-linear function, since both are needed when computing the gradients for each unit. Most of the back-propagation code below goes into the batch normalization updates (refer to the paper).

Mini-batch stochastic gradient descent update rules are applied for updating the unknown parameters. The weights are updated using momentum.

```python
def error_backpropagation(trainX, trainY, outputs, linear_inp, scaled_linear_inp, mean_linear_inp, var_linear_inp,
                          weights, biases, momentums, gamma, beta, bn_learning_rate, weights_learning_rate,
                          momentum_rate, type):
    # bp_grads_2: gradient w.r.t. the scaled and shifted input of each layer.
    # bp_grads_1: gradient w.r.t. the linear input of each layer (before batch normalization).
    bp_grads_1, bp_grads_2 = dict(), dict()

    inverse_num_examples = float(1.0) / trainX.shape[0]

    for layer in reversed(range(len(weights))):
        denom = (var_linear_inp[layer] + 1e-5) ** -0.5
        numer = linear_inp[layer] - mean_linear_inp[layer]

        if layer == len(weights) - 1:
            if type == "classification":
                bp_grads_2[layer] = output_layer_grad_class_softmax(outputs[layer], trainY)
            else:
                bp_grads_2[layer] = output_layer_grad_reg(outputs[layer], trainY)
        else:
            bp_grads_2[layer] = hidden_layer_grad_relu(outputs[layer])
            next_layer_weights = weights[layer + 1]

            bp_grads_2[layer] *= bp_grads_1[layer + 1].dot(next_layer_weights.T)

        # Batch normalization gradients: a, b and c are the gradients w.r.t. the
        # standardized input, the batch variance and the batch mean respectively.
        a = bp_grads_2[layer] * gamma[layer]
        b = np.sum(a * (-0.5 * (denom ** 3.0)) * numer, axis=0)
        # The sum over the batch is taken per unit (axis=0).
        c = np.sum(-a * denom, axis=0) + b * np.sum(-2.0 * numer, axis=0) * inverse_num_examples

        bp_grads_1[layer] = a * denom + b * 2.0 * numer * inverse_num_examples + c * inverse_num_examples

        if layer > 0:
            total_err = outputs[layer - 1].T.dot(bp_grads_1[layer])
        else:
            total_err = trainX.T.dot(bp_grads_1[layer])

        beta[layer] -= bn_learning_rate * np.sum(bp_grads_2[layer], axis=0) * inverse_num_examples
        gamma[layer] -= bn_learning_rate * np.sum(bp_grads_2[layer] * scaled_linear_inp[layer],
                                                  axis=0) * inverse_num_examples

        # Momentum update for the weights; plain gradient step for the biases.
        momentums[layer] = momentum_rate * momentums[layer] - weights_learning_rate * total_err * inverse_num_examples
        weights[layer] += momentums[layer]

        biases[layer] -= weights_learning_rate * np.sum(bp_grads_1[layer], axis=0) * inverse_num_examples

    return weights, biases, momentums, gamma, beta
```
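As a reading aid for the batch normalization part of the backpropagation code, these are the gradients given in the batch normalization paper (Ioffe and Szegedy, 2015), written for a loss $\ell$, linear inputs $x_i$, standardized inputs $\hat{x}_i$ and outputs $y_i = \gamma \hat{x}_i + \beta$ over a mini-batch of size $m$. The mapping to the code is my reading of it rather than something the code states: 'bp_grads_2' plays the role of $\partial \ell / \partial y_i$, 'a' of $\partial \ell / \partial \hat{x}_i$, 'b' of $\partial \ell / \partial \sigma_B^2$, 'c' of $\partial \ell / \partial \mu_B$ and 'bp_grads_1' of $\partial \ell / \partial x_i$.

```latex
\begin{aligned}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i} \, \gamma \\
\frac{\partial \ell}{\partial \sigma_B^2} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \, (x_i - \mu_B) \, \left(-\tfrac{1}{2}\right) (\sigma_B^2 + \epsilon)^{-3/2} \\
\frac{\partial \ell}{\partial \mu_B} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \, \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}}
  + \frac{\partial \ell}{\partial \sigma_B^2} \, \frac{1}{m} \sum_{i=1}^{m} -2 (x_i - \mu_B) \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i} \, \frac{1}{\sqrt{\sigma_B^2 + \epsilon}}
  + \frac{\partial \ell}{\partial \sigma_B^2} \, \frac{2 (x_i - \mu_B)}{m}
  + \frac{\partial \ell}{\partial \mu_B} \, \frac{1}{m} \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \, \hat{x}_i \\
\frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
\end{aligned}
```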

After each epoch (i.e. after all batches), we compute the predicted outputs using the updated parameters.

Note that we re-use some of the values computed during 'train_forward_pass', such as the mean and variance of the inputs to each layer. Everything is almost the same as in 'train_forward_pass', except that we no longer drop units during testing (since the weights are already adjusted with the retention probabilities).

```python
def test_forward_pass(testX, weights, biases, gamma, beta, mean_linear_inp, var_linear_inp, type):
    outputs = dict()

    curr_input = testX

    for layer in range(len(weights)):
        linear_inp = curr_input.dot(weights[layer]) + biases[layer]

        # Standardize with the statistics estimated during training, not those of the test data.
        scaled_linear_inp, _, _ = standardize_mean_var(linear_inp, mean=mean_linear_inp[layer],
                                                       var=var_linear_inp[layer])

        shifted_inp = gamma[layer] * scaled_linear_inp + beta[layer]

        if layer == len(weights) - 1:
            if type == "classification":
                outputs[layer] = output_layer_activation_class_softmax(shifted_inp)
            else:
                outputs[layer] = output_layer_activation_reg(shifted_inp)
        else:
            outputs[layer] = hidden_layer_activation_relu(shifted_inp)

        curr_input = outputs[layer]

    return outputs
```

Use the predicted outputs to compute the current loss.

Compute the squared error loss if it is a regression problem, else compute the cross entropy loss for a classification problem. By using NumPy matrix operations instead of for-loops, all of the above and below operations become matrix multiplications and divisions, which speeds up training.

```python
def loss_cross_entropy(preds, actuals):
    # Mean cross entropy per example (base-2 logarithm).
    return np.sum(np.sum(-actuals * np.log2(preds), axis=0)) / preds.shape[0]


def loss_mse(preds, actuals):
    # Half the squared error, averaged over the examples.
    return np.sum(np.sum(0.5 * (preds - actuals) ** 2, axis=0)) / preds.shape[0]


def loss_class(outputs, targets):
    num_layers = len(outputs)
    predictions = outputs[num_layers - 1]

    total_loss = loss_cross_entropy(predictions, targets)

    return total_loss


def loss_reg(outputs, targets):
    num_layers = len(outputs)
    predictions = outputs[num_layers - 1]

    total_loss = loss_mse(predictions, targets)

    return total_loss
```

To check whether the learning rate is too high and causing the loss to increase, we add a condition: if the loss increases for two consecutive epochs, 'weights_learning_rate' is halved.
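The halving rule can be isolated into a small helper to see when it fires. The function name `maybe_halve_learning_rate` and the per-epoch losses are made up for illustration; the condition is the same one used inside the training loop.

```python
def maybe_halve_learning_rate(losses, curr_loss, learning_rate):
    """Halve the rate when the loss has risen for two consecutive epochs."""
    if len(losses) > 1 and curr_loss > losses[-1] > losses[-2]:
        learning_rate /= 2.0
    return learning_rate


# Made-up per-epoch losses for illustration.
epoch_losses = [0.90, 0.70, 0.75, 0.80, 0.60]
learning_rate, history = 0.5, []
for loss in epoch_losses:
    learning_rate = maybe_halve_learning_rate(history, loss, learning_rate)
    history.append(loss)

print(learning_rate)  # 0.25
```

The rate is halved exactly once here, at the fourth epoch, where 0.80 > 0.75 > 0.70. A single increase (0.70 to 0.75) is tolerated, which avoids reacting to the ordinary noise of mini-batch training.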

Coming to the non-linear activation functions, we use either sigmoid or ReLU (a leaky ReLU variant).

The following functions compute the activations with the sigmoid function and the gradient of the sigmoid activation, respectively, for the hidden units. Note that the gradient function expects the sigmoid outputs, not the raw inputs.

```python
def hidden_layer_activation_sigmoid(inputs):
    return (1.0 + np.exp(-inputs))**-1.0


def hidden_layer_grad_sigmoid(inputs):
    # `inputs` are the sigmoid activations s, so the derivative is s * (1 - s).
    return inputs * (1 - inputs)
```

Similarly, the following functions compute the activations and the gradient for the ReLU function. It is a leaky ReLU, with slope 0.9 for positive inputs and 0.1 for negative inputs.

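```python
def hidden_layer_activation_relu(inputs):
    # Leaky ReLU: slope 0.9 for positive inputs, 0.1 for negative inputs.
    return np.maximum(0.1 * inputs, 0.9 * inputs)


def hidden_layer_grad_relu(inputs):
    # Build a new array instead of overwriting `inputs` in place: assigning 0.1
    # first and then testing `> 0.0` would turn every entry into 0.9.
    return np.where(inputs > 0.0, 0.9, 0.1)
```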

The following functions compute the output layer outputs with the softmax function and the gradient of the cross entropy loss with respect to the softmax inputs (as in multi-class classification problems).

```python
def output_layer_activation_class_softmax(inputs):
    # Softmax is unchanged by subtracting a per-row constant; here the row mean is used.
    inputs = (inputs.T - np.mean(inputs, axis=1)).T
    out = np.exp(inputs)

    return (out.T/np.sum(out, axis=1)).T


def output_layer_grad_class_softmax(pred_outs, true_outs):
    return pred_outs - true_outs
```
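The compact gradient 'pred_outs - true_outs' can be checked numerically with finite differences. This is a standalone sketch, not the code from the post: the helper names, the random logits and the use of the row maximum for stabilization are my own choices, and the cross entropy here uses the natural logarithm and is summed over the batch.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # stabilizes the exponentials
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)


def cross_entropy(logits, targets):
    # Natural-log cross entropy, summed over the batch.
    return -np.sum(targets * np.log(softmax(logits)))


rng = np.random.RandomState(0)
logits = rng.normal(size=(3, 4))   # 3 examples, 4 classes (made-up values)
targets = np.eye(4)[[0, 2, 3]]     # one-hot targets

analytic = softmax(logits) - targets

# Central finite differences, one logit at a time.
numeric = np.zeros_like(logits)
eps = 1e-6
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        up, down = logits.copy(), logits.copy()
        up[i, j] += eps
        down[i, j] -= eps
        numeric[i, j] = (cross_entropy(up, targets) - cross_entropy(down, targets)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The two gradients agree to within numerical error. With a base-2 logarithm in the loss, as in 'loss_cross_entropy', the true gradient differs from 'pred_outs - true_outs' by a constant factor of 1 / ln 2, which only rescales the effective learning rate.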

Similarly, the following functions apply when the output layer passes its inputs through without any transformation (as in regression problems).

```python
def output_layer_activation_reg(inputs):
    return inputs


def output_layer_grad_reg(pred_outs, true_outs):
    return pred_outs - true_outs
```

After the model is built, we can use the model to do prediction with either class labels or class probabilities (if classification) or continuous outputs (if regression).

```python
def predict_neural_network(testX, model, type="classification"):
    weights, biases, _, gamma, beta, exp_mean_linear_inp, exp_var_linear_inp = model

    num_layers = len(weights)

    outputs = test_forward_pass(testX, weights=weights, biases=biases, gamma=gamma, beta=beta,
                                mean_linear_inp=exp_mean_linear_inp, var_linear_inp=exp_var_linear_inp,
                                type=type)

    preds = outputs[num_layers - 1]

    outs = []

    for row in range(preds.shape[0]):
        if type == "classification":
            # Predicted label is the class with the highest probability.
            outs += [np.argmax(preds[row,])]
        else:
            outs += [preds[row,]]

    return outs
```

The following function performs K-fold cross validation (specifically for classification) on the training examples for the neural network.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold


def train_nn_cv(trainX, trainY, hidden_layers, num_epochs=100, weights_learning_rate=0.1, bn_learning_rate=0.5,
                train_batch_size=32, momentum_rate=0.9, dropout_rate=0.2, num_cv=5, ini_weights=None,
                ini_biases=None, ini_momentums=None, ini_gamma=None, ini_beta=None):
    kf = KFold(n_splits=num_cv)

    for train_index, test_index in kf.split(trainX):
        trainX_batch, testX_batch = trainX[train_index], trainX[test_index]
        trainY_batch, testY_batch = trainY[train_index], trainY[test_index]

        model = train_neural_network(trainX_batch, trainY_batch,
                                     hidden_layers=hidden_layers,
                                     weights_learning_rate=weights_learning_rate,
                                     bn_learning_rate=bn_learning_rate,
                                     num_epochs=num_epochs,
                                     train_batch_size=train_batch_size,
                                     momentum_rate=momentum_rate,
                                     dropout_rate=dropout_rate,
                                     ini_weights=ini_weights,
                                     ini_biases=ini_biases,
                                     ini_momentums=ini_momentums,
                                     ini_gamma=ini_gamma,
                                     ini_beta=ini_beta,
                                     type="classification")

        preds_train = predict_neural_network(trainX_batch, model, type="classification")
        preds_test = predict_neural_network(testX_batch, model, type="classification")

        print("Train F1-Score = ", f1_score(trainY_batch, preds_train, average='weighted'))
        print("Train Accuracy = ", accuracy_score(trainY_batch, preds_train))

        print("Validation F1-Score = ", f1_score(testY_batch, preds_test, average='weighted'))
        print("Validation Accuracy = ", accuracy_score(testY_batch, preds_test))

        print("")
```

It prints the classification accuracy and the weighted F1-score for both the training and validation folds.

Instead of initializing the weights and biases used in the 'train_neural_network' function with uniform random numbers, we can use autoencoders to generate initial weights and biases for us.

The code is pretty simple, as it considers each layer one at a time and then calls the 'train_neural_network' method on each layer with the inputs and outputs being the same. After learning the weights and biases and other unknown parameters for each layer, we fine tune these parameters by again calling the 'train_neural_network' on the entire network (all the layers concatenated) but with the learnt parameters as their initial values.

```python
def train_autoencoder(trainX, hidden_layers, num_epochs, weights_learning_rate, bn_learning_rate, momentum_rate,
                      dropout_rate, ini_weights, ini_biases, ini_momentums, ini_gamma, ini_beta):
    layers = hidden_layers

    weights, biases, momentums, gamma, beta = ini_weights, ini_biases, ini_momentums, ini_gamma, ini_beta
    exp_mean_linear_inp, exp_var_linear_inp = dict(), dict()

    curr_input = trainX

    for layer in range(len(hidden_layers)):
        # A two-layer network: the hidden layer being pre-trained, then a reconstruction of its input.
        l_weights, l_biases, l_momentums, l_gamma, l_beta = initialize([layers[layer], curr_input.shape[1]],
                                                                       curr_input.shape[1])

        l_weights[0], l_biases[0], l_momentums[0], l_gamma[0], l_beta[0] = (
            ini_weights[layer], ini_biases[layer], ini_momentums[layer], ini_gamma[layer], ini_beta[layer])

        # Inputs and targets are the same: the network learns to reconstruct its input.
        model = train_neural_network(curr_input, curr_input,
                                     hidden_layers=[layers[layer]],
                                     num_epochs=num_epochs,
                                     weights_learning_rate=weights_learning_rate,
                                     bn_learning_rate=bn_learning_rate,
                                     train_batch_size=trainX.shape[0],
                                     momentum_rate=momentum_rate,
                                     dropout_rate=dropout_rate,
                                     ini_weights=l_weights,
                                     ini_biases=l_biases,
                                     ini_momentums=l_momentums,
                                     ini_gamma=l_gamma,
                                     ini_beta=l_beta,
                                     type="regression")

        m_weights, m_biases, m_momentums, m_gamma, m_beta, m_exp_mean_linear_inp, m_exp_var_linear_inp = model

        # Keep only the parameters of the hidden layer.
        weights[layer], biases[layer], momentums[layer], gamma[layer], beta[layer] = (
            m_weights[0], m_biases[0], m_momentums[0], m_gamma[0], m_beta[0])

        exp_mean_linear_inp[layer], exp_var_linear_inp[layer] = m_exp_mean_linear_inp[0], m_exp_var_linear_inp[0]

        outputs = test_forward_pass(curr_input, weights=m_weights, biases=m_biases, gamma=m_gamma, beta=m_beta,
                                    mean_linear_inp=m_exp_mean_linear_inp, var_linear_inp=m_exp_var_linear_inp,
                                    type="regression")

        # The hidden layer's outputs become the input for the next layer.
        curr_input = outputs[0]

    return weights, biases, momentums, gamma, beta, exp_mean_linear_inp, exp_var_linear_inp, curr_input


def train_autoencoder_reg(trainX, trainY, hidden_layers, num_epochs=100, weights_learning_rate=0.1,
                          train_batch_size=32, bn_learning_rate=0.5, momentum_rate=0.9, dropout_rate=0.2,
                          ini_weights=None, ini_biases=None, ini_momentums=None, ini_gamma=None, ini_beta=None,
                          type="classification"):
    if type == "classification":
        layers = hidden_layers + [len(set(trainY))]
    else:
        layers = hidden_layers + [1]

    if ini_weights is None:
        weights, biases, momentums, gamma, beta = initialize(layers, trainX.shape[1])
    else:
        weights, biases, momentums, gamma, beta = ini_weights, ini_biases, ini_momentums, ini_gamma, ini_beta

    # Layer-wise pre-training of the hidden layers.
    autoencoder = train_autoencoder(trainX,
                                    hidden_layers=hidden_layers,
                                    num_epochs=num_epochs,
                                    weights_learning_rate=weights_learning_rate,
                                    bn_learning_rate=bn_learning_rate,
                                    momentum_rate=momentum_rate,
                                    dropout_rate=dropout_rate,
                                    ini_weights=weights,
                                    ini_biases=biases,
                                    ini_momentums=momentums,
                                    ini_gamma=gamma,
                                    ini_beta=beta)

    m_weights, m_biases, m_momentums, m_gamma, m_beta, m_exp_mean_linear_inp, m_exp_var_linear_inp, _ = autoencoder

    for layer in range(len(m_weights)):
        weights[layer] = m_weights[layer]
        biases[layer] = m_biases[layer]
        momentums[layer] = m_momentums[layer]
        gamma[layer] = m_gamma[layer]
        beta[layer] = m_beta[layer]

    # Fine-tune the whole network starting from the pre-trained parameters.
    model = train_neural_network(trainX, trainY,
                                 hidden_layers=hidden_layers,
                                 num_epochs=num_epochs,
                                 weights_learning_rate=weights_learning_rate,
                                 bn_learning_rate=bn_learning_rate,
                                 train_batch_size=train_batch_size,
                                 momentum_rate=momentum_rate,
                                 dropout_rate=dropout_rate,
                                 ini_weights=weights,
                                 ini_biases=biases,
                                 ini_momentums=momentums,
                                 ini_gamma=gamma,
                                 ini_beta=beta,
                                 type=type)

    return model
```

To verify that the code works, we ran the neural network on the handwritten digits dataset bundled with scikit-learn ('datasets.load_digits', a small set of 8x8 digit images rather than the full MNIST dataset): first with 5-fold cross validation, then with randomly initialized parameters (weights, biases etc.), and then with parameters initialized by autoencoders. Note that the pixel values of each image are flattened from 2D to 1D in the dataset, so our classifier cannot use image boundary information.

We used 2 hidden layers of 700 units each, an initial learning rate of 0.5, 100 epochs, a mini-batch size of 32 and a dropout rate of 0.2.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils import shuffle

import NeuralNetwork  # assumed name of the module holding the functions defined above


def digits_classification():
    mydata = datasets.load_digits()

    trainX = mydata.data
    trainY = mydata.target

    n_samples = trainX.shape[0]

    trainX, trainY = shuffle(trainX, trainY, random_state=0)

    # 80-20 split into training and testing data.
    X_train, y_train = trainX[:int(.8 * n_samples)], trainY[:int(.8 * n_samples)]
    X_test, y_test = trainX[int(.8 * n_samples):], trainY[int(.8 * n_samples):]

    hidden_layers = [700, 700]
    weights_learning_rate = 0.5
    bn_learning_rate = 0.9
    num_epochs = 100
    train_batch_size = 32
    momentum_rate = 0.95
    dropout_rate = 0.2

    layers = hidden_layers + [len(set(y_train))]
    weights, biases, momentums, gamma, beta = NeuralNetwork.initialize(layers, X_train.shape[1])

    # 5-fold cross validation.
    NeuralNetwork.train_nn_cv(X_train, y_train,
                              hidden_layers=hidden_layers,
                              weights_learning_rate=weights_learning_rate,
                              bn_learning_rate=bn_learning_rate,
                              num_epochs=num_epochs,
                              train_batch_size=train_batch_size,
                              momentum_rate=momentum_rate,
                              dropout_rate=dropout_rate,
                              num_cv=5,
                              ini_weights=weights,
                              ini_biases=biases,
                              ini_momentums=momentums,
                              ini_gamma=gamma,
                              ini_beta=beta)

    # Randomly initialized parameters.
    nn_model = NeuralNetwork.train_neural_network(X_train, y_train,
                                                  hidden_layers=hidden_layers,
                                                  weights_learning_rate=weights_learning_rate,
                                                  bn_learning_rate=bn_learning_rate,
                                                  num_epochs=num_epochs,
                                                  train_batch_size=train_batch_size,
                                                  momentum_rate=momentum_rate,
                                                  dropout_rate=dropout_rate,
                                                  ini_weights=weights,
                                                  ini_biases=biases,
                                                  ini_momentums=momentums,
                                                  ini_gamma=gamma,
                                                  ini_beta=beta,
                                                  type="classification")

    nn_predict = NeuralNetwork.predict_neural_network(X_test, nn_model, type="classification")

    print("Test F1-Score = ", f1_score(y_test, nn_predict, average='weighted'))
    print("Test Accuracy = ", accuracy_score(y_test, nn_predict))
    print("")

    # Parameters initialized with autoencoders.
    autoencoder_model = NeuralNetwork.train_autoencoder_reg(X_train, y_train,
                                                            hidden_layers=hidden_layers,
                                                            weights_learning_rate=weights_learning_rate,
                                                            bn_learning_rate=bn_learning_rate,
                                                            num_epochs=num_epochs,
                                                            train_batch_size=train_batch_size,
                                                            momentum_rate=momentum_rate,
                                                            dropout_rate=dropout_rate,
                                                            ini_weights=weights,
                                                            ini_biases=biases,
                                                            ini_momentums=momentums,
                                                            ini_gamma=gamma,
                                                            ini_beta=beta,
                                                            type="classification")

    nn_predict = NeuralNetwork.predict_neural_network(X_test, autoencoder_model, type="classification")

    print("Test F1-Score = ", f1_score(y_test, nn_predict, average='weighted'))
    print("Test Accuracy = ", accuracy_score(y_test, nn_predict))
    print("")
```

The results from the 5-fold cross-validation are as follows.

```
Train F1-Score = 0.99139865542
Train Accuracy = 0.991296779809
Validation F1-Score = 0.955500340188
Validation Accuracy = 0.954861111111

Train F1-Score = 0.992222573873
Train Accuracy = 0.992167101828
Validation F1-Score = 0.965778737187
Validation Accuracy = 0.965277777778

Train F1-Score = 0.994793760014
Train Accuracy = 0.994782608696
Validation F1-Score = 0.982898602654
Validation Accuracy = 0.982578397213

Train F1-Score = 0.983469761968
Train Accuracy = 0.98347826087
Validation F1-Score = 0.958330678075
Validation Accuracy = 0.95818815331

Train F1-Score = 0.995658724042
Train Accuracy = 0.995652173913
Validation F1-Score = 0.993088393852
Validation Accuracy = 0.993031358885
```

With 80-20 split of the digits data into training and testing, we obtain the following performance numbers on the 20% testing data using randomly initialized weights and biases.

```
Test F1-Score = 0.978076202818
Test Accuracy = 0.977777777778
```

That is a pretty good accuracy for a crude implementation.

The test accuracy and the F1-score are almost the same when the parameters are initialized using the autoencoders. This is because we are already using batch normalization and momentum to improve the learning of the weights.

The entire code is hosted on my GitHub repo.

Following are some of the resources I found useful for building this neural network from scratch.

- Backpropagation chapter from neural networks book by Rojas
- Notes on BackPropagation
- Efficient BackProp by LeCun
- Lecture on Neural Network Tips and Tricks by Richard Socher for CS224D course
- Tutorial on BackPropagation by Quoc V.Le
- Practical Recommendations for Gradient-Based Training of Deep Architectures
