Do you regularly use machine learning and deep learning libraries such as Scikit-learn, TensorFlow, and PyTorch without really understanding how a neural network works underneath them? If so, this post explains what a neural network is, how it works, and how you can implement one with some **NumPy** basics.

### 1. What is a Neural Network?

- As shown in the above image, a neural network is essentially a neuron (computed as in Logistic Regression) repeated multiple times
- In neural network notation, we don’t count the input layer. So, the neural network shown above is a **“2 layer NN”**

### 2. Neural Network Representation

### 3. Computing a Neural Network’s Output

- The hidden layer has four neurons, so the output (activation) of the hidden layer is a column vector with four rows, each holding a single value. It is denoted by **a[1]** because it comes from the first layer of the network
- In logistic regression, when the input is x, a neuron computes the following two equations to get **z & a**:
  - z = w.T x + b
  - a = sigmoid(z)
  - yhat = a (output of the neuron/layer)
- **The size and dimensions of a layer depend on the number of neurons it contains**
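
The two equations for a single neuron can be sketched in NumPy as follows. This is a minimal illustration, not a full implementation: the feature values, weights, and bias are made-up numbers, and `w` is assumed to be a column vector so that `w.T @ x` gives a single value.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data point with 3 features, plus made-up weights and bias.
x = np.array([[0.5], [-1.0], [2.0]])   # shape (3, 1)
w = np.array([[0.1], [0.2], [-0.3]])   # shape (3, 1)
b = 0.05

z = (w.T @ x + b).item()   # linear step: z = w.T x + b
a = sigmoid(z)             # activation step: a = sigmoid(z)
y_hat = a                  # the neuron's output

print(z, y_hat)
```

Note that the sigmoid is applied to `z`, not to the raw input `x`.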

- For a neural network with a hidden layer and multiple neurons, the output is computed layer by layer from the weights and biases
  - First, the input X is multiplied by the transpose of the first layer's weights, W[1].T, the bias b[1] is added, and the result is passed through a sigmoid function to get the output of the first hidden layer (this applies Logistic Regression at every neuron of the first layer)
  - The same thing is repeated for the second layer, where the output of the first layer acts as the input to the second

- The loss function L is then evaluated on the output of the output layer

- Final algorithm to implement the neural network for one example at a time
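
The forward pass for one example at a time might look like the sketch below. The layer sizes, the random initialization, and the choice of binary cross-entropy for the loss L are assumptions made for illustration; the weights are stored with shape (inputs, nodes) so that the `W.T` form used in this post applies directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 3, 4   # assumed: 3 input features, 4 hidden neurons

# Hypothetical parameters, initialized with small random values.
W1 = rng.standard_normal((n_x, n_h)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_h, 1)) * 0.01
b2 = np.zeros((1, 1))

x = rng.standard_normal((n_x, 1))   # one made-up example
y = 1                               # its made-up label

# Layer 1: z[1] = W[1].T x + b[1], a[1] = sigmoid(z[1])
z1 = W1.T @ x + b1        # shape (4, 1)
a1 = sigmoid(z1)

# Layer 2: the output of layer 1 is the input of layer 2
z2 = W2.T @ a1 + b2       # shape (1, 1)
a2 = sigmoid(z2)
y_hat = a2.item()

# Loss for this single example (binary cross-entropy, assumed here).
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```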

### 4. Vectorizing across multiple examples

- As shown in the above image, to train all the samples from a training set, we just stack each data point horizontally to create a matrix X
- X = [X1 X2 … Xm]

- Similarly, the formula **z = w.T x + b** is evaluated for each column x of the matrix X, and the results are stacked horizontally to form the matrix Z
  - Z = [Z1 Z2 … Zm]
- For each layer, there is a separate Z matrix
  - e.g. Z[1] and Z[2] belong to the first and second layers of the network

- The output matrix A of a layer is the sigmoid of that layer's matrix Z
- A = sigmoid(Z)
- This is the output of the layer and acts as the input to the next layer
- Going down a column of matrix A, each entry is the activation of one node of that hidden/output layer for a single training example
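
To see that stacking the examples into X gives the same result as processing them one by one, the following sketch compares both approaches for a single layer. The sizes and random values are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, k, m = 3, 4, 5   # assumed: features, nodes in the layer, samples

W = rng.standard_normal((n_x, k))
b = rng.standard_normal((k, 1))
X = rng.standard_normal((n_x, m))   # each column is one data point

# One example at a time: compute a column of A per data point.
columns = [sigmoid(W.T @ X[:, [i]] + b) for i in range(m)]
A_loop = np.hstack(columns)

# Vectorized: a single matrix product handles all m examples.
Z = W.T @ X + b       # b is broadcast across the m columns
A = sigmoid(Z)

print(np.allclose(A, A_loop))
```

The vectorized form avoids the Python loop, which is why it is preferred when training on the whole training set.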

#### Explanation about the dimensions of the matrices W, X, Z, and A

- Matrix X is formed by stacking all the data points horizontally.
  - X = [x1 x2 … xm], where “m” is the number of samples
  - Dimension of X is **(nx, m)**
    - nx: the number of features in a data point
    - m: the number of training samples

- Matrix W is formed by stacking the weight vectors of the neurons/nodes in the layer horizontally, one column per node
  - W = [w1 w2 … wk], where “k” is the number of nodes in the layer, so the dimension of W is (nx, k)
  - W.T is the transpose of W, which makes it compatible for multiplication with X
  - Dimension of W.T is **(k, nx)**
    - k: the number of nodes in the layer
    - nx: the number of features in a data point

- Matrix Z = W.T * X
  - Its dimension is (k, nx) * (nx, m) = **(k, m)**
    - k: the number of nodes in the layer
    - m: the number of training samples
- Matrix A = sigmoid(Z)
  - A is the result of applying an activation function to Z; the sigmoid squashes every entry into the range 0-1
  - Its dimension is the same as that of Z, i.e. **(k, m)**
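
A quick way to confirm these dimensions is to print the shapes in NumPy. The sizes below are hypothetical, and the bias `b` of shape (k, 1) is included as an assumption to show that adding it leaves the shape of Z unchanged.

```python
import numpy as np

n_x, k, m = 2, 4, 10   # hypothetical: features, nodes, samples

X = np.ones((n_x, m))
W = np.ones((n_x, k))
b = np.zeros((k, 1))

Z = W.T @ X + b                   # (k, n_x) @ (n_x, m) -> (k, m)
A = 1.0 / (1.0 + np.exp(-Z))      # sigmoid keeps the shape of Z

print(W.T.shape, X.shape, Z.shape, A.shape)

# Without the transpose the inner dimensions do not match,
# and NumPy refuses to multiply.
mismatch = None
try:
    W @ X
except ValueError as err:
    mismatch = str(err)
```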

### 5. Activation Functions

- The hyperbolic tangent (tanh) function almost always works better than the sigmoid function in hidden layers
- Sigmoid has an output range of (0, 1), while tanh has an output range of (-1, 1)
- The only place where a sigmoid function is useful is the output layer of a binary classifier, where you want the output to lie in (0, 1)

One of the downsides of both the sigmoid and tanh functions is that if the value of **z** is very large or very small, the slope of the function is nearly zero. This can drastically slow down gradient descent and hinder convergence.

**RULE OF THUMB**

- Just use the **ReLU (Rectified Linear Unit)** function for all hidden layers, and only use sigmoid at the output layer if you are implementing a binary classifier
- Sometimes leaky ReLU performs better than ReLU, but ReLU is the default choice in most cases.
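
For reference, the four activation functions mentioned in this section could be written in NumPy as below. The slope `alpha=0.01` for leaky ReLU is a commonly used value and is an assumption here, not something fixed by the method.

```python
import numpy as np

def sigmoid(z):
    """Output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Output in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z))
```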

#### Why do we need to use non-linear activation functions?

The purpose of the activation function is to introduce non-linearity into the network.

In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables.

Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line; the word for that is affine).

Another way to think of it: without a non-linear activation function, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function.
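
The collapse of stacked linear layers can be checked numerically. In this sketch (random weights and arbitrary sizes, all assumed for illustration), two layers without an activation function produce exactly the same output as one layer with weights `W1 @ W2` and bias `W2.T @ b1 + b2`.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_h, m = 3, 4, 6   # assumed sizes

W1 = rng.standard_normal((n_x, n_h))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((n_h, 1))
b2 = rng.standard_normal((1, 1))
X = rng.standard_normal((n_x, m))

# Two layers with no activation function (identity activation).
A1 = W1.T @ X + b1
Y_two_layers = W2.T @ A1 + b2

# One equivalent layer: W2.T (W1.T X + b1) + b2 = (W1 W2).T X + (W2.T b1 + b2)
W_eq = W1 @ W2             # shape (n_x, 1)
b_eq = W2.T @ b1 + b2      # shape (1, 1)
Y_one_layer = W_eq.T @ X + b_eq

print(np.allclose(Y_two_layers, Y_one_layer))
```

Inserting a non-linear function such as `np.tanh` between the two layers breaks this equivalence, which is what gives the extra layer its modelling power.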

### 6. Derivatives of Activation Functions for Backpropagation
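
The derivatives of the activation functions from the previous section are standard calculus results: sigmoid'(z) = a(1 - a) with a = sigmoid(z), tanh'(z) = 1 - tanh(z)^2, and ReLU'(z) is 1 for z > 0 and 0 for z < 0. A minimal NumPy sketch follows; the function names are hypothetical, and setting the ReLU derivative to 0 at exactly z = 0 is a convention, since the derivative is undefined there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Convention: derivative taken as 0 at z == 0.
    return (z > 0).astype(float)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

# Sanity check of the sigmoid derivative with a central finite difference.
z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, d_sigmoid(z), atol=1e-6))

# The slope of sigmoid and tanh nearly vanishes for large |z|.
print(d_sigmoid(np.array([10.0])), d_tanh(np.array([10.0])))
```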

### 7. Gradient Descent for Neural Networks

- Formula for computing derivatives for backpropagation

- Deriving the derivative equations for gradient descent from scratch is quite complicated and requires knowledge of linear algebra and matrix calculus
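
Without going through the derivation, gradient descent for a 2 layer NN can be sketched as below. Several things here are assumptions made for illustration: a tanh hidden layer, a sigmoid output, mean binary cross-entropy as the cost, a made-up linearly separable dataset, and arbitrary values for the learning rate and iteration count. Weights are stored with shape (inputs, nodes) to match the `W.T` convention used in this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1.T @ X + b1)      # hidden layer
    A2 = sigmoid(W2.T @ A1 + b2)     # output layer
    return A1, A2

def cost(A2, Y):
    # Mean binary cross-entropy over the m examples (assumed cost).
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1, A2 = forward(params, X)
    dZ2 = A2 - Y                              # (1, m)
    dW2 = A1 @ dZ2.T / m                      # (n_h, 1)
    db2 = dZ2.mean(axis=1, keepdims=True)     # (1, 1)
    dZ1 = (W2 @ dZ2) * (1.0 - A1 ** 2)        # (n_h, m); tanh'(z) = 1 - a^2
    dW1 = X @ dZ1.T / m                       # (n_x, n_h)
    db1 = dZ1.mean(axis=1, keepdims=True)     # (n_h, 1)
    return dW1, db1, dW2, db2

# Made-up dataset: label is 1 when the two features sum to more than 0.
rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 200
X = rng.standard_normal((n_x, m))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)

params = [rng.standard_normal((n_x, n_h)) * 0.5, np.zeros((n_h, 1)),
          rng.standard_normal((n_h, 1)) * 0.5, np.zeros((1, 1))]

learning_rate = 0.1
costs = []
for step in range(1000):
    costs.append(cost(forward(params, X)[1], Y))
    grads = backward(params, X, Y)
    # Gradient descent update: move each parameter against its gradient.
    params = [p - learning_rate * g for p, g in zip(params, grads)]

print(costs[0], costs[-1])
```

Each iteration runs a forward pass, computes the gradients of the cost with respect to every parameter, and takes a small step in the opposite direction, so the recorded cost should fall over the course of training.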