
# Implementation of a Neural Network from scratch

Are you someone who likes using machine learning and deep learning libraries such as Scikit-learn, TensorFlow, and PyTorch, but doesn’t really understand how a neural network works underneath them? If so, this post explains what a neural network is, how it works, and how anyone can implement one with some NumPy basics.

### 1. What is a Neural Network?

• As shown in the above image, a neural network is basically a neuron (evaluated using Logistic Regression) repeated multiple times
• In neural network notation, we don’t count the input layer. So, the network shown above is a “2 layer NN”
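
To make the "neuron = logistic regression" idea concrete, here is a minimal NumPy sketch of a single neuron. The input values, weights, and bias are made up for illustration, and the helper names `sigmoid` and `neuron` are assumptions rather than anything defined in this post.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One neuron as logistic regression: a linear step followed by a sigmoid."""
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return sigmoid(z)      # activation a, which is also the prediction yhat

# Hypothetical input with three features and made-up parameters.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05

a = neuron(x, w, b)
print(a)  # a single number strictly between 0 and 1
```

A full network repeats this computation for every neuron in a layer, which is what the matrix form in the later sections does in one step.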

### 2. Neural Network Representation

### 3. Computing a Neural Network’s Output

• The hidden layer has four neurons, so the output (activation) of the hidden layer is a column vector with four scalar entries. It is denoted by a[1], since it comes from the first layer of the network
• In logistic regression, for an input x, a neuron computes the following two equations to get z and a:
• z = w.T x + b
• a = sigmoid(z)
• yhat = a (output of the neuron/layer)
• The size and dimensions of each layer depend on the number of neurons it contains

• For a neural network with hidden layers and multiple neurons, this is how the outputs are computed from the weights and biases:
• First, the input X is multiplied by the transpose of the first layer’s weights, W.T, the bias b is added, and the result is passed through a sigmoid function to get the output of the first hidden layer (this applies Logistic Regression at every neuron of the first layer)
• The same thing is repeated for the second layer, where the output of the first layer acts as the input to the second
• From the output of the output layer, the cost function L is evaluated

• Final algorithm to implement the neural network for one example at a time

### 4. Vectorizing across multiple examples

• As shown in the above image, to process all the samples of a training set at once, we stack the data points horizontally (one per column) to create a matrix X
• X = [X1 X2 … Xm]
• Similarly, the vector z = w.T x + b is evaluated for each column x of the matrix X, and the results are stacked as columns to form the matrix Z
• Z = [Z1 Z2 … Zm]
• For each layer, there is a separate Z matrix
• e.g. Z[1] and Z[2] represent the first and second layers of the network
• The final output matrix of a layer A is the sigmoid of the matrix Z
• A = sigmoid(Z)
• This is the output of a layer and acts as the input to the next layer
• Going down a column of matrix A gives the activations of the nodes of that hidden/output layer for a single training example; going across a row gives one node’s activation for each example
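
The forward pass described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the layer sizes (3 features, 4 hidden nodes, 1 output node, 5 examples), the random data, and the function name `forward` are all hypothetical, and each weight matrix is stored with one column per node so that `W.T` matches the W.T notation used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass of a 2 layer NN. Works for one column or for many columns."""
    Z1 = np.dot(W1.T, X) + b1    # linear step of the hidden layer
    A1 = sigmoid(Z1)             # hidden layer activations
    Z2 = np.dot(W2.T, A1) + b2   # linear step of the output layer
    A2 = sigmoid(Z2)             # network output (yhat)
    return A1, A2

# Hypothetical sizes and random data, chosen only for the demonstration.
rng = np.random.default_rng(0)
nx, k, m = 3, 4, 5
X = rng.normal(size=(nx, m))
W1, b1 = rng.normal(size=(nx, k)), np.zeros((k, 1))
W2, b2 = rng.normal(size=(k, 1)), np.zeros((1, 1))

# Vectorized: all m examples in one call.
A1, A2 = forward(X, W1, b1, W2, b2)

# One example at a time: feed each column separately, then stack the results.
A2_loop = np.hstack([forward(X[:, [i]], W1, b1, W2, b2)[1] for i in range(m)])

print(np.allclose(A2, A2_loop))  # the two approaches agree
```

The loop and the vectorized call compute the same numbers; the vectorized version simply lets NumPy handle all columns of X in a single matrix product.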

#### Explanation about the dimensions of the matrices W, X, Z, and A

• Matrix X is formed by stacking all the data points horizontally.
• X = [x1 x2 … xm], where “m” is the number of samples
• Dimension of X is (nx, m)
• nx: the number of features in a data point
• m: the number of training samples
• Matrix W is formed by stacking the weight vectors of the neurons/nodes in the layer side by side, one column per node
• W = [w1 w2 … wk], where each column wi holds the nx weights of one node, so the dimension of W is (nx, k)
• So, W.T is the transpose of W, which makes it compatible for multiplication with X
• Dimension of W.T is (k, nx)
• k: the number of nodes in the layer
• nx: the number of features in a data point
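• Matrix Z = W.T X + b (a matrix product, np.dot(W.T, X) in NumPy, with the bias b of dimension (k, 1) added to every column)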
• Its dimension is (k, nx) * (nx, m) = (k, m)
• k: the number of nodes in the layer
• m: the number of training samples
• Matrix A = sigmoid(Z)
• A is the result of applying an activation function to Z; with the sigmoid, every output lies in the range (0, 1)
• Dimension is the same as that of Z, i.e. (k, m)
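
The dimension bookkeeping above can be checked directly in NumPy. The sizes below (nx = 2, k = 3, m = 4) and the constant values are arbitrary assumptions picked so that the shapes are easy to read.

```python
import numpy as np

nx, k, m = 2, 3, 4           # features, nodes in the layer, training samples
X = np.ones((nx, m))         # one data point per column
W = np.full((nx, k), 0.5)    # one weight vector per column
b = np.zeros((k, 1))         # one bias per node

# (k, nx) . (nx, m) -> (k, m); b is broadcast across the m columns.
Z = np.dot(W.T, X) + b
A = 1.0 / (1.0 + np.exp(-Z))

shapes = {"X": X.shape, "W": W.shape, "W.T": W.T.shape, "Z": Z.shape, "A": A.shape}
print(shapes)

# Without the transpose the inner dimensions (k and nx) do not match.
try:
    np.dot(W, X)
    transpose_needed = False
except ValueError:
    transpose_needed = True
print(transpose_needed)
```

Printing the shapes like this while building a network is a cheap way to catch a missing transpose early.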

### 5. Activation Functions

• For hidden layers, the hyperbolic tangent (tanh) function almost always works better than the sigmoid function, because its outputs are centered around zero
• Sigmoid has an output range (0, 1) and tanh function has an output range (-1, 1)
• The only place where a sigmoid function is clearly useful is the output layer of a binary classifier, where you want the output to be between 0 and 1
• One downside of both the sigmoid and tanh functions is that when z is very large in magnitude (strongly positive or strongly negative), the slope of the function is close to zero. This can drastically slow down gradient descent and hinder convergence in those cases.
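
The saturation problem is easy to see numerically. The sketch below evaluates the slopes of sigmoid and tanh at a few hand-picked values of z; the helper names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_slope(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_slope(z))  # largest at z = 0 (0.25), tiny at both ends
print(tanh_slope(z))     # largest at z = 0 (1.0), tiny at both ends
```

Because gradient descent multiplies by these slopes, a neuron stuck at a large |z| receives almost no update.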

• RULE OF THUMB
• Use the ReLU (Rectified Linear Unit) function for all hidden layers, and use sigmoid only at the output layer when implementing a binary classifier
• Sometimes leaky ReLU performs better than ReLU, but ReLU is the default choice in most cases.
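
For reference, both functions fit in one line of NumPy each. The leak factor `alpha=0.01` is a commonly used value and is an assumption here, not something this post specifies.

```python
import numpy as np

def relu(z):
    """ReLU: passes positive values through and clips negatives to zero."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: like ReLU, but negatives keep a small slope alpha (assumed 0.01)."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # negatives become 0, positives are unchanged
print(leaky_relu(z))  # negatives are scaled by alpha instead of zeroed
```

Unlike sigmoid and tanh, the slope of ReLU stays at 1 for every positive z, so it does not saturate on that side.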

#### Why do we need to use non-linear activation functions?

• The purpose of the activation function is to introduce non-linearity into the network

• In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables

• Non-linear means that the output cannot be reproduced from a linear combination of the inputs (this is not the same as an output that plots as a straight line; the word for that is affine).

• Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function
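
This collapse can be verified numerically. The sketch below builds two layers with no activation function and shows that a single layer with combined weights gives the same output; the sizes and random values are hypothetical, and weights are stored with one column per node.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                       # 3 features, 5 examples
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(2, 1))

# Two layers with no (i.e. identity) activation function.
A1 = np.dot(W1.T, X) + b1
two_linear_layers = np.dot(W2.T, A1) + b2

# One layer with combined parameters: W = W1 W2 and b = W2.T b1 + b2.
W = np.dot(W1, W2)
b = np.dot(W2.T, b1) + b2
one_layer = np.dot(W.T, X) + b

print(np.allclose(two_linear_layers, one_layer))  # same function

# With a non-linearity (tanh) in between, the collapse no longer works.
with_tanh = np.dot(W2.T, np.tanh(np.dot(W1.T, X) + b1)) + b2
print(np.allclose(with_tanh, one_layer))
```

The extra layer only adds expressive power once a non-linear function sits between the two matrix products.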

### 6. Derivatives of Activation Functions for Backpropagation

### 7. Gradient Descent for Neural Networks
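
As a rough illustration of both topics, here is a sketch of the activation derivatives and one possible gradient descent loop for a 2 layer NN. It assumes a tanh hidden layer, a sigmoid output, and the binary cross-entropy cost, which is one common setup and not necessarily the exact formulas this post had in mind. The toy dataset, layer sizes, learning rate, and iteration count are all made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Derivatives of the activations. The first two take the activation a = g(z).
def sigmoid_prime(a):
    return a * (1.0 - a)

def tanh_prime(a):
    return 1.0 - a ** 2

def relu_prime(z):
    return (z > 0).astype(float)

def cost(A2, Y):
    """Binary cross-entropy, clipped to avoid log(0)."""
    A2 = np.clip(A2, 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

# Hypothetical toy data: label is 1 when the two features sum to more than 0.
rng = np.random.default_rng(0)
nx, k, m = 2, 4, 200
X = rng.normal(size=(nx, m))
Y = (X[0:1] + X[1:2] > 0).astype(float)        # shape (1, m)

# Small random weights (one column per node), zero biases.
W1, b1 = rng.normal(size=(nx, k)) * 0.01, np.zeros((k, 1))
W2, b2 = rng.normal(size=(k, 1)) * 0.01, np.zeros((1, 1))

learning_rate = 0.5   # assumed value
costs = []
for _ in range(300):
    # Forward pass
    A1 = np.tanh(np.dot(W1.T, X) + b1)
    A2 = sigmoid(np.dot(W2.T, A1) + b2)
    costs.append(cost(A2, Y))

    # Backward pass. For sigmoid + cross-entropy, dZ2 simplifies to A2 - Y
    # (the sigmoid_prime factor cancels against the derivative of the cost).
    dZ2 = A2 - Y
    dW2 = np.dot(A1, dZ2.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2, dZ2) * tanh_prime(A1)
    dW1 = np.dot(X, dZ1.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(costs[0], costs[-1])  # the cost should go down over training
```

Every gradient has the same shape as the parameter it updates, which is a useful sanity check when writing backpropagation by hand.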

• Formula for computing derivatives for backpropagation
• Deriving the derivative equations for gradient descent from scratch is quite complicated and requires knowledge of linear algebra and matrix calculus

Updated Apr 13, 2020

This post is written by Ashish Jaiswal