Are you someone who is fond of using machine learning and deep learning libraries such as Scikit-learn, TensorFlow, and PyTorch, but doesn’t really understand how a neural network works underneath those libraries? If so, this post explains what a neural network is, how it works, and how anyone can implement one with some NumPy basics.
1. What is a Neural Network?
- As shown in the image above, a neural network is essentially a single neuron (which computes a logistic regression) repeated multiple times
- In neural network notation, we don’t count the input layer, so the network shown above is a “2-layer NN”
2. Neural Network Representation
3. Computing a Neural Network’s Output
- The hidden layer has four neurons, so the output (activation) of the hidden layer is a column vector with four entries; it is denoted a[1] because it is the first layer of the network
- In logistic regression, for an input x, a neuron computes the following two equations to get z and a:
- z = w.T x + b
- a = sigmoid(z)
- yhat = a (output of the neuron/layer)
- The size and dimension of each layer depend on the number of neurons the layer contains
- Now, for a neural network with hidden layers and multiple neurons, this is how the output is computed from the weights and biases
- First, the input X is multiplied by the transpose of the first-layer weights W[1].T, the bias b[1] is added, and the result is passed through a sigmoid function to get the output of the first hidden layer (this implements logistic regression at the first layer)
- The same computation is repeated for the second layer, where the output of the first layer acts as the input to the second
- From the output of the output layer, the loss L is evaluated
- Final algorithm to compute the neural network’s output for one example at a time (a NumPy sketch follows this list)
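To make the steps above concrete, here is a minimal NumPy sketch of the forward pass for a single example. The hidden layer has four neurons as in the figure above; the three input features and the random initialization are illustrative assumptions, not part of the original figure.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed sizes: 3 input features, 4 hidden neurons, 1 output neuron
n_x, n_h, n_y = 3, 4, 1

rng = np.random.default_rng(0)
# Each column of W is the weight vector of one neuron, matching the W.T convention used in this post
W1 = rng.standard_normal((n_x, n_h)) * 0.01   # layer 1 weights, shape (3, 4)
b1 = np.zeros((n_h, 1))                       # layer 1 biases,  shape (4, 1)
W2 = rng.standard_normal((n_h, n_y)) * 0.01   # layer 2 weights, shape (4, 1)
b2 = np.zeros((n_y, 1))                       # layer 2 biases,  shape (1, 1)

x = rng.standard_normal((n_x, 1))             # one training example as a column vector

# Layer 1: z[1] = W[1].T x + b[1], a[1] = sigmoid(z[1])
z1 = W1.T @ x + b1                            # shape (4, 1)
a1 = sigmoid(z1)

# Layer 2: z[2] = W[2].T a[1] + b[2], yhat = a[2] = sigmoid(z[2])
z2 = W2.T @ a1 + b2                           # shape (1, 1)
yhat = sigmoid(z2)
```

Note that many implementations store each W already transposed (one row per neuron), in which case the explicit .T disappears; the sketch keeps the transpose to match the notation used in this post.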
4. Vectorizing across multiple examples
- As shown in the above image, to process all the samples of the training set at once, we stack the data points horizontally as the columns of a matrix X
- X = [x1 x2 … xm]
- Similarly, a z vector is evaluated using the formula z = w.T x + b for each column x of the matrix X, and these columns together form the matrix Z
- Z = [Z1 Z2 … Zm]
- For each layer, there is a separate Z matrix
- e.g. Z[1], Z[2] represent the first and second layers of the network
- The final output matrix of a layer A is the sigmoid of the matrix Z
- A = sigmoid(Z)
- This is the output of a layer and acts as the input to the next layer
- Going down vertically in a column of matrix A, the entries are the activations of the nodes of that hidden/output layer for a single example (see the vectorized sketch after this list)
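Here is a minimal sketch of the same two-layer forward pass vectorized across m examples; the sizes (3 features, 4 hidden nodes, 1 output, 5 examples) are assumptions chosen only to make the shapes visible.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed sizes: features, hidden nodes, output nodes, training examples
n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(1)

X = rng.standard_normal((n_x, m))             # each column is one training example
W1 = rng.standard_normal((n_x, n_h)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_h, n_y)) * 0.01
b2 = np.zeros((n_y, 1))

# Layer 1: Z[1] = W[1].T X + b[1]; b[1] is broadcast across the m columns
Z1 = W1.T @ X + b1                            # shape (4, 5): one column of activations per example
A1 = sigmoid(Z1)

# Layer 2: Z[2] = W[2].T A[1] + b[2]
Z2 = W2.T @ A1 + b2                           # shape (1, 5)
A2 = sigmoid(Z2)                              # yhat for all m examples at once
```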
Explanation of the dimensions of the matrices W, X, Z, and A
- The matrix X is formed by stacking all the data points horizontally as columns.
- X = [x1 x2 … xm], where “m” is the number of samples
- Dimension of X is (nx, m)
- nx: the number of features in a data point
- m: the number of training samples
- The matrix W is formed by stacking the weight vectors of the neurons/nodes in the layer side by side
- W = [w1 w2 … wk], where each column is the weight vector of one node, so the number of columns is the number of nodes in the layer and W has dimension (nx, k)
- So, W.T is the transpose of W to make it compatible for multiplication with X
- Dimension of W.T is (k, nx)
- k: the number of nodes in the layer
- nx: the number of features in a data point
- The matrix Z = W.T X + b, where b (of dimension (k, 1)) is broadcast across the m columns
- Its dimension is (k, nx) * (nx, m) = (k, m)
- k: the number of nodes in the layer
- m: the number of training samples
- The matrix A = sigmoid(Z)
- A is the result of applying an activation function to Z to bring the output into the range 0-1 (which is what the sigmoid does)
- Its dimension is the same as that of Z, i.e. (k, m) (a quick shape check is sketched below)
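To confirm this dimension bookkeeping, here is a tiny NumPy shape check; the concrete values of nx, k, and m are arbitrary assumptions.

```python
import numpy as np

nx, k, m = 3, 4, 5                 # assumed: features, nodes in the layer, training examples
X = np.zeros((nx, m))              # (nx, m): data points stacked as columns
W = np.zeros((nx, k))              # (nx, k): one column of weights per node
b = np.zeros((k, 1))               # (k, 1): one bias per node

Z = W.T @ X + b                    # (k, nx) @ (nx, m) -> (k, m); b broadcasts over the columns
A = 1 / (1 + np.exp(-Z))           # sigmoid keeps the shape

assert W.T.shape == (k, nx)
assert Z.shape == (k, m)
assert A.shape == (k, m)
```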
5. Activation Functions
- The hyperbolic tangent (tanh) function almost always works better than the sigmoid function for hidden layers
- Sigmoid has an output range (0, 1) and tanh function has an output range (-1, 1)
- The only place where a sigmoid is still useful is the output layer of a binary classifier, where you want the output to be between 0 and 1
One of the downsides of both the sigmoid and tanh functions is that when the value of z is very large or very small, the slope of the function approaches zero. This can drastically slow down gradient descent and hinder convergence in those cases.
- RULE OF THUMB
- Just use the ReLU (Rectified Linear Unit) function for all hidden layers, and use a sigmoid at the output layer only if you are implementing a binary classifier
- Sometimes leaky ReLU performs better than ReLU, but ReLU is the default choice in most cases (both are sketched below)
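Here is a minimal sketch of the activation functions discussed above; the 0.01 slope used for leaky ReLU is a common but assumed default, not something specified in this post.

```python
import numpy as np

def sigmoid(z):
    # Squashes z into (0, 1); useful at the output layer of a binary classifier
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes z into (-1, 1); usually preferred over sigmoid for hidden layers
    return np.tanh(z)

def relu(z):
    # max(0, z); the rule-of-thumb choice for hidden layers
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but with a small non-zero slope for z < 0 (slope value is an assumed default)
    return np.where(z > 0, z, slope * z)
```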
Why do we need to use non-linear activation functions?
The purpose of the activation function is to introduce non-linearity into the network
In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables
Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as an output that plots as a straight line; the word for that is affine).
Another way to think of it: without a non-linear activation function, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function.
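A quick numeric illustration of that last point: with no non-linearity between them, two stacked linear layers collapse into one equivalent linear layer. The sizes below are arbitrary assumptions, and biases are omitted for brevity (including them would make the map affine, which is still not non-linear).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 5))            # 3 features, 5 examples
W1 = rng.standard_normal((3, 4))           # "layer 1" weights, no activation applied
W2 = rng.standard_normal((4, 1))           # "layer 2" weights

two_layers = W2.T @ (W1.T @ X)             # two linear layers stacked
one_layer = (W1 @ W2).T @ X                # a single linear layer with combined weights

print(np.allclose(two_layers, one_layer))  # True: the extra layer added no expressive power
```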
6. Derivatives of Activation Functions for Backpropagation
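For reference, here is a sketch of the standard derivatives of the activation functions above, as they are used in backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1 - s)

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2

def relu_derivative(z):
    # 1 for z > 0, 0 otherwise (the value at z = 0 is conventionally taken as 0)
    return (z > 0).astype(float)

def leaky_relu_derivative(z, slope=0.01):
    # 1 for z > 0, slope otherwise (same assumed slope as in the earlier sketch)
    return np.where(z > 0, 1.0, slope)
```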
7. Gradient Descent for Neural Networks
- Formula for computing derivatives for backpropagation
- Deriving the derivative equations for gradient descent from scratch is quite involved and requires knowledge of linear algebra and matrix calculus; the standard results for a 2-layer network are sketched below
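As a concrete reference, here is a minimal sketch of one gradient-descent step for a 2-layer network with a tanh hidden layer and a sigmoid output, using the standard cross-entropy gradients. The layer sizes, learning rate, random data, and initialization are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed toy sizes: 3 features, 4 hidden (tanh) nodes, 1 sigmoid output, 5 examples
n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(3)
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(n_y, m))             # binary labels
W1 = rng.standard_normal((n_x, n_h)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_h, n_y)) * 0.01
b2 = np.zeros((n_y, 1))
learning_rate = 0.1                               # assumed value

# Forward pass (same Z = W.T X + b convention as the earlier sketches)
Z1 = W1.T @ X + b1
A1 = np.tanh(Z1)
Z2 = W2.T @ A1 + b2
A2 = sigmoid(Z2)                                  # yhat

# Backward pass: standard gradients for cross-entropy loss with a sigmoid output
dZ2 = A2 - Y                                      # (n_y, m)
dW2 = (A1 @ dZ2.T) / m                            # (n_h, n_y)
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2 @ dZ2) * (1 - A1 ** 2)                  # chain rule through tanh
dW1 = (X @ dZ1.T) / m                             # (n_x, n_h)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Gradient-descent parameter update
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
```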