Implementation of a Neural Network from scratch

Do you regularly use machine learning and deep learning libraries such as Scikit-learn, TensorFlow, and PyTorch, but don’t really understand how a neural network works underneath them? If so, this post explains what a neural network is, how it works, and how anyone can implement one with some NumPy basics.

1. What is a Neural Network?

  • As shown in the above image, a neural network is essentially a single neuron (a Logistic Regression unit) repeated many times and arranged in layers
  • In neural network notation, we don’t count the input layer. So, the network shown above is a “2 layer NN”: one hidden layer plus the output layer

2. Neural Network Representation

3. Computing a Neural Network’s Output

  • The hidden layer has four neurons, so the output or activation from the hidden layer is a column vector with four single valued rows and is denoted by a[1] as it is the first layer of the network
  • In logistic regression, given an input x, a neuron computes the following two equations to get z and a:
    • z = w.T x + b
    • a = sigmoid(z)
    • yhat = a (output of the neuron/layer)
    • The size and dimensions of each layer depend on the number of neurons it contains
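The two equations for a single neuron can be sketched in NumPy as follows. This is a minimal illustration, not the post's original code; the function names and the example values of w, b, and x are made up.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(w, b, x):
    """One logistic-regression neuron: a linear step followed by a sigmoid."""
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    a = sigmoid(z)         # activation of the neuron
    return a               # yhat = a

# Illustrative values only (assumed, not from the post)
w = np.array([0.5, -0.25, 0.1])
b = 0.0
x = np.array([1.0, 2.0, 0.0])

yhat = neuron_forward(w, b, x)
print(yhat)
```

With these particular values the weighted sum is 0, so the neuron outputs sigmoid(0) = 0.5.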

  • Now, for a neural network with a hidden layer and multiple neurons, this is how the output is computed from the weights and biases
  • First, the input X is multiplied by the transpose of the first layer's weights, W[1].T, the bias b[1] is added, and the result is passed through a sigmoid function to get the output of the first hidden layer (this applies Logistic Regression at every neuron of that layer)
    • The same thing is repeated for the second layer, where the output of the first layer acts as the input to the second
  • From the output of the output layer, the loss function L is evaluated

  • Final algorithm to implement the neural network for one example at a time
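A rough sketch of that per-example forward pass is shown below. It assumes three input features, the four hidden neurons mentioned earlier, and a single output neuron. The weights are stored with shape (inputs, neurons) so that the W.T convention used in this post applies. The binary cross-entropy loss is an assumption, since the post does not name a specific loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_one_example(x, W1, b1, W2, b2):
    """Forward pass of a 2 layer NN for a single column vector x."""
    z1 = W1.T @ x + b1    # hidden layer, linear step: shape (4, 1)
    a1 = sigmoid(z1)      # hidden layer activations a[1]
    z2 = W2.T @ a1 + b2   # output layer, linear step: shape (1, 1)
    a2 = sigmoid(z2)      # network output, yhat
    return a1, a2

# Assumed sizes and random values, for illustration only
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
x = rng.normal(size=(n_x, 1))
W1 = rng.normal(size=(n_x, n_h)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_h, 1)) * 0.01
b2 = np.zeros((1, 1))

a1, yhat = forward_one_example(x, W1, b1, W2, b2)

# Assumed loss: binary cross-entropy for a true label y = 1
y = 1.0
loss = -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))
print(a1.shape, yhat.shape, loss.item())
```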

4. Vectorizing across multiple examples

  • As shown in the above image, to process all the samples of a training set at once, we stack the data points horizontally as the columns of a matrix X
    • X = [X1 X2 … Xm]
  • Similarly, the vector z is evaluated using the formula z = w.T x + b for each column x of the matrix X, and the results are stacked to form the matrix Z
    • Z = [Z1 Z2 … Zm]
    • For each layer, there is a separate Z matrix
    • e.g. Z[1], Z[2] represent the first and second layers of the network
  • The output matrix A of a layer is the sigmoid of the matrix Z
    • A = sigmoid(Z)
    • This is the output of a layer and acts as the input to the next layer
    • Going down a column of matrix A gives the activations of the different nodes of that hidden/output layer for one training example; going across a row gives one node's activation for each example
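To see that stacking the examples as columns gives the same result as handling them one at a time, here is a small comparison. The sizes (3 features, 4 hidden nodes, 5 examples) and the random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_loop(X, W, b):
    """Process one example (column) at a time, then stack the results."""
    m = X.shape[1]
    columns = [sigmoid(W.T @ X[:, [i]] + b) for i in range(m)]
    return np.hstack(columns)

def layer_vectorized(X, W, b):
    """Process all examples at once; b is broadcast across the columns."""
    Z = W.T @ X + b
    return sigmoid(Z)

# Assumed sizes: 3 features, 4 nodes in the layer, 5 training examples
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
W = rng.normal(size=(3, 4))
b = rng.normal(size=(4, 1))

A_loop = layer_loop(X, W, b)
A_vec = layer_vectorized(X, W, b)
print(A_vec.shape, np.allclose(A_loop, A_vec))
```

The vectorized version avoids the Python loop, which is usually much faster for large training sets.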

Explanation of the dimensions of the matrices W, X, Z, and A

  • Matrix X is formed by stacking all the data points horizontally.
    • X = [x1 x2 … xm], where “m” is the number of samples
    • Dimension of X is (nx, m)
      • nx: the number of features in a data point
      • m: the number of training samples
  • Matrix W is formed by stacking the weight vectors of the neurons/nodes in the layer horizontally, one column per node
    • W = [w1 w2 … wk], where each column has nx entries, so the dimension of W is (nx, k)
    • So, W.T is the transpose of W, which makes it compatible for multiplication with X
    • Dimension of W.T is (k, nx)
      • k: the number of nodes in the layer
      • nx: the number of features in a data point
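  • Matrix Z = W.T * X + b, where the bias b has dimension (k, 1) and is added to every column
    • The dimension of W.T * X is (k, nx) * (nx, m) = (k, m), so Z also has dimension (k, m)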
    • k: the number of nodes in the layer
    • m: the number of training samples
  • Matrix A = sigmoid(Z)
    • A is the result of applying an activation function to Z; with the sigmoid, every entry falls in the range 0-1
    • Dimension is the same as that of Z, i.e. (k, m)
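One quick way to check these dimensions is to build arrays of those shapes and let NumPy verify them. The sizes nx = 3, k = 4, and m = 5 are arbitrary choices for this sketch.

```python
import numpy as np

n_x, k, m = 3, 4, 5          # assumed sizes, for illustration only

X = np.ones((n_x, m))        # one column per training sample
W = np.ones((n_x, k))        # one column of weights per node
b = np.zeros((k, 1))         # one bias per node

Z = W.T @ X + b              # (k, n_x) @ (n_x, m) -> (k, m)
A = 1.0 / (1.0 + np.exp(-Z)) # same shape as Z

print(W.T.shape, X.shape, Z.shape, A.shape)

# Without the transpose the inner dimensions do not match,
# so NumPy refuses to multiply.
try:
    W @ X
    transpose_needed = False
except ValueError:
    transpose_needed = True
print(transpose_needed)
```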

5. Activation Functions

  • The hyperbolic tangent (tanh) function almost always works better than the sigmoid function in hidden layers
    • Sigmoid has an output range (0, 1) and tanh function has an output range (-1, 1)
    • The only place where a sigmoid function is really useful is the output layer of a binary classifier, where you want the output to be between 0 and 1
  • One downside of both the sigmoid and tanh functions is that when z is very large in magnitude (strongly positive or strongly negative), the slope of the function approaches zero. This can drastically slow down gradient descent and hinder convergence.

    • Just use the ReLU (Rectified Linear Unit) function for all hidden layers, and use sigmoid only at the output layer if you are implementing a binary classifier

  • Sometimes leaky ReLU performs better than ReLU, but ReLU is the default choice in most cases.
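The four activation functions mentioned above can be written in a few lines of NumPy. This is a minimal sketch; the leaky ReLU slope of 0.01 for negative inputs is a commonly used value and an assumption here, not something fixed by the post.

```python
import numpy as np

def sigmoid(z):
    """Output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Output in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope (alpha) for negative inputs."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-10.0, 0.0, 10.0])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(z))
```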

Why do we need to use non-linear activation functions?

  • The purpose of the activation function is to introduce non-linearity into the network

  • In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables

  • Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line; the word for this is affine).

  • Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function
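This collapse can be checked numerically. The sketch below builds two layers with no activation function and shows that a single layer with combined weights gives the same output. The sizes and random values are assumptions, and the weights are stored as (inputs, neurons) so that the W.T convention applies.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))    # 3 features, 5 examples (assumed sizes)

W1 = rng.normal(size=(3, 4))   # first layer: 4 nodes
b1 = rng.normal(size=(4, 1))
W2 = rng.normal(size=(4, 1))   # second layer: 1 node
b2 = rng.normal(size=(1, 1))

# Two layers with NO activation function in between
out_two_layers = W2.T @ (W1.T @ X + b1) + b2

# Equivalent single layer: W2.T (W1.T X + b1) + b2 = (W1 W2).T X + (W2.T b1 + b2)
W_eq = W1 @ W2                 # shape (3, 1)
b_eq = W2.T @ b1 + b2          # shape (1, 1)
out_one_layer = W_eq.T @ X + b_eq

print(np.allclose(out_two_layers, out_one_layer))
```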

6. Derivatives of Activation Functions for Backpropagation
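
For reference, the standard derivatives of the common activation functions can be sketched as below. These are the textbook formulas, not a reproduction of the post's original figure. Treating the ReLU slope at exactly z = 0 as 0 and using a leaky ReLU slope of 0.01 are conventions assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """d/dz sigmoid(z) = a * (1 - a), where a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_grad(z):
    """d/dz tanh(z) = 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    """1 for positive inputs, 0 otherwise (0 at z = 0 by convention)."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """1 for positive inputs, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-10.0, 0.0, 10.0])
for fn in (sigmoid_grad, tanh_grad, relu_grad, leaky_relu_grad):
    print(fn.__name__, fn(z))
```

Note how the sigmoid and tanh slopes are nearly zero at z = -10 and z = 10, while the ReLU slope stays at 1 for any positive input.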

7. Gradient Descent for Neural Networks

  • Formula for computing derivatives for backpropagation

  • Deriving the derivative equations for gradient descent from scratch is quite complicated and requires knowledge of linear algebra and matrix calculus
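
Putting the pieces together, here is a minimal sketch of gradient descent for a 2 layer NN. It rests on several assumptions that are not taken from the post: a tanh hidden layer, a sigmoid output, binary cross-entropy as the cost, a made-up toy dataset, and arbitrary values for the learning rate and number of steps. Weights are stored as (inputs, neurons) to match the W.T convention used above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, rng):
    # Small random weights break the symmetry between hidden neurons
    return {
        "W1": rng.normal(size=(n_x, n_h)) * 0.01, "b1": np.zeros((n_h, 1)),
        "W2": rng.normal(size=(n_h, 1)) * 0.01, "b2": np.zeros((1, 1)),
    }

def forward(X, p):
    A1 = np.tanh(p["W1"].T @ X + p["b1"])    # hidden activations, (n_h, m)
    A2 = sigmoid(p["W2"].T @ A1 + p["b2"])   # predictions, (1, m)
    return A1, A2

def cost(A2, Y):
    # Binary cross-entropy averaged over the examples (clipped to avoid log(0))
    A2 = np.clip(A2, 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, A1, A2, p):
    m = X.shape[1]
    dZ2 = A2 - Y                              # sigmoid output + cross-entropy
    dW2 = A1 @ dZ2.T / m                      # (n_h, 1)
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (p["W2"] @ dZ2) * (1 - A1 ** 2)     # tanh'(z) = 1 - a^2
    dW1 = X @ dZ1.T / m                       # (n_x, n_h)
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def train(X, Y, n_h=4, lr=0.5, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    p = init_params(X.shape[0], n_h, rng)
    costs = []
    for _ in range(steps):
        A1, A2 = forward(X, p)
        costs.append(cost(A2, Y))
        grads = backward(X, Y, A1, A2, p)
        for name in p:
            p[name] -= lr * grads[name]       # gradient descent update
    return p, costs

# Toy data (assumed): label is 1 when the two features sum to more than 0
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

params, costs = train(X, Y)
print(costs[0], costs[-1])
```

On this toy problem the cost should fall over the course of training; the exact numbers depend on the random seed and the hyperparameters.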

Updated Apr 13, 2020
This post is written by Ashish Jaiswal