Optimization Algorithms in Deep Learning
Building a deep neural network is one thing, but optimizing it to train faster and reach better accuracy is a different problem altogether. It is therefore important to choose optimization algorithms that converge quickly to the desired accuracy. In this post, we discuss a few optimization algorithms that are commonly used to speed up training, even on very large training datasets. Which algorithm to use depends on the application and the type of dataset.
If you want to check out the implementation of optimization algorithms in deep neural networks, kindly visit this link here: Optimization Algorithms  DNN
Here are the main differences among the different variants of gradient descent, followed by the techniques that build on them.
1. Minibatch Gradient Descent
 Gradient descent is an algorithm in machine learning that is used to learn the parameters of a model by minimizing its cost function
 The main downside of (batch) gradient descent is that it has to go through the entire training set on each descent (or iteration)
 So, if the training dataset is very large, each step takes a huge amount of time
 To mitigate this problem, we use minibatch gradient descent, which takes one batch of the training data for each descent
 This lets gradient descent make progress without requiring the entire training dataset on each step (or descent)

A minibatch from the training set is represented as $X^{\{t\}}, Y^{\{t\}}$ (we use curly braces to represent the minibatch)
 Each step of the descent is on a minibatch instead of the whole training set
 The size of each minibatch is the number of training examples processed in each loop
 1 epoch = a single pass through the entire training set (going through all minibatches of the set once)
 The only difference is that the parameters get updated after each minibatch is processed within a running epoch, unlike full (batch) gradient descent, which updates them once per pass over the training set
Size of Minibatch
 Size = “m” (the number of training examples): Batch Gradient Descent (too slow on large datasets, although every step uses the exact gradient of the full training set)
 Size = “1”: Stochastic Gradient Descent
 Too prone to noise and outliers
 We lose the speed advantage of vectorization as each minibatch is only a single example

Size = “between 1 and m”: sizes commonly used in practice are powers of two such as 64, 128, 256, 512, and 1024
 Hence, the minibatch size is another hyperparameter to tune
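The splitting into minibatches can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation linked in this post: the function name `make_minibatches`, the `seed` argument, and the toy data are all assumptions made for the example.

```python
import random


def make_minibatches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) together and split them into minibatches.

    Hypothetical helper for illustration: X is a list of examples and
    Y is the list of matching labels.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of examples")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # one permutation shared by X and Y
    batches = []
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        batches.append(([X[i] for i in chunk], [Y[i] for i in chunk]))
    return batches


# Toy data (assumed): 10 examples with label = 2 * feature.
X = [[float(i)] for i in range(10)]
Y = [2.0 * i for i in range(10)]

batches = make_minibatches(X, Y, batch_size=4)
print([len(x_batch) for x_batch, _ in batches])  # sizes of the minibatches in one epoch
```

The last minibatch is smaller whenever the dataset size is not a multiple of the batch size, so the toy run yields batches of 4, 4 and 2 examples. One pass over `batches` corresponds to one epoch.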
2. Exponentially Weighted Averages
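 Used on basically any data that comes in a sequence
 It is also referred to as smoothing of the data (or time series)
$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$
 Here $\theta_t$ is the observed value at step $t$ and $v_t$ is its running average. Generally, we take $\beta = 0.9$ in practice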
Bias Correction
 In exponentially weighted averages, when the initial value $v_0 = 0$, it creates an unwanted bias that makes the initial averages much lower than the actual values. So, we use the following formula to correct the value of $v_t$
$v_t^{\text{corrected}} = v_t / (1 - \beta^t)$
 This bias correction keeps the initial values from being pulled towards zero; as $t$ grows, $\beta^t$ approaches 0 and the correction fades away
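To see the effect of bias correction on concrete numbers, here is a small sketch of the recurrence `v_t = beta * v_{t-1} + (1 - beta) * theta_t` with the division by `1 - beta**t`. The function name `ewa` and the constant toy series are assumptions chosen for the example.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v_0 = 0."""
    v = 0.0
    averages = []
    for t, theta in enumerate(values, start=1):  # t starts at 1 so that 1 - beta**t > 0
        v = beta * v + (1 - beta) * theta
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages


# Toy series (assumed): a signal that is constantly 10.
readings = [10.0] * 5

print(ewa(readings, bias_correction=False))  # starts far below 10
print(ewa(readings))                         # stays at about 10 from the first step
```

Without the correction, the first average is only about 1.0 for a signal of 10, because 90% of its weight sits on the initial zero. With the correction, the average matches the signal right away.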
3. Gradient Descent with Momentum
 While using minibatch gradient descent, the parameters get updated after each minibatch (with some variance in each update). This makes gradient descent oscillate a lot on its way to convergence
 So, gradient descent with momentum computes an exponentially weighted average of the gradients and then uses that average to update the weights instead
 This helps reduce the oscillations during gradient descent and makes convergence faster
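$v_{dw} = \beta v_{dw} + (1 - \beta)\,dw$
$v_{db} = \beta v_{db} + (1 - \beta)\,db$
$w = w - \alpha v_{dw}$
$b = b - \alpha v_{db}$
 So, here “$\beta$” is a new hyperparameter (alongside the learning rate $\alpha$) which basically carries out the exponentially weighted average on each update, making convergence faster
 Generally, the value of $\beta$ used in practice is ~0.9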
 Hence, this method basically takes the “exponentially weighted moving averages” method and merges it into the “minibatch gradient descent” algorithm
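A single momentum update can be sketched as follows for one scalar parameter. This is a minimal illustration under stated assumptions: the function name `momentum_step`, the learning rate of 0.1, and the toy cost `f(w) = w**2` are not from the original implementation.

```python
def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    """One update of gradient descent with momentum for a single parameter."""
    v = beta * v + (1 - beta) * grad  # exponentially weighted average of the gradients
    w = w - alpha * v                 # step along the averaged gradient
    return w, v


# Toy problem (assumed): minimize f(w) = w**2, whose gradient is dw = 2 * w.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=2 * w)

print(w)  # very close to the minimum at w = 0
```

In a real network the same update is applied to every weight matrix and bias vector, with one velocity term kept per parameter.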
4. RMSprop Optimizer
 RMSprop is quite similar to gradient descent with momentum, except that it explicitly restricts the oscillations in the vertical direction (the direction in which the gradients are large)
 This allows the descent to take greater leaps in the horizontal direction with a larger learning rate, as the vertical movement is restricted
 In this case, the exponentially weighted moving averages are computed over the squared gradients, as shown below
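$s_{dw} = \beta s_{dw} + (1 - \beta)\,dw^2$
$s_{db} = \beta s_{db} + (1 - \beta)\,db^2$
$w = w - \alpha \cdot dw / \sqrt{s_{dw}}$
$b = b - \alpha \cdot db / \sqrt{s_{db}}$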
 RMSprop and Momentum algorithms both decrease the vertical oscillations and increase horizontal speed, making the descent converge faster for a given cost function
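The RMSprop update can be sketched for one scalar parameter as below. The function name `rmsprop_step`, the toy cost `f(w) = w**2`, and the learning rate are assumptions for the example. The small constant `eps` in the denominator is also an addition of this sketch: it is a common safeguard against division by zero, not something stated in the formulas above.

```python
import math


def rmsprop_step(w, s, grad, alpha=0.05, beta=0.9, eps=1e-8):
    """One RMSprop update for a single parameter."""
    s = beta * s + (1 - beta) * grad ** 2        # running average of squared gradients
    w = w - alpha * grad / (math.sqrt(s) + eps)  # eps (assumed) avoids dividing by zero
    return w, s


# Toy problem (assumed): minimize f(w) = w**2, whose gradient is dw = 2 * w.
w, s = 5.0, 0.0
for _ in range(1000):
    w, s = rmsprop_step(w, s, grad=2 * w)

print(w)  # ends up in a small neighbourhood of the minimum at w = 0
```

Because the gradient is divided by its own running magnitude, the step length depends mostly on `alpha` rather than on how large the gradient is, which is what damps the steep direction.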
5. Adam Optimization Algorithm
 It combines the techniques from both the RMSprop and Momentum algorithms to compute the parameter updates
 The term Adam is derived from Adaptive Moment Estimation
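 First, it calculates the averaged gradients using the momentum method:
$v_{dw} = \beta_1 v_{dw} + (1 - \beta_1)\,dw$
$v_{db} = \beta_1 v_{db} + (1 - \beta_1)\,db$
$v_{dw}^{\text{corrected}} = v_{dw} / (1 - \beta_1^t)$, where $v_{dw}^{\text{corrected}}$ is the corrected form of $v_{dw}$
$v_{db}^{\text{corrected}} = v_{db} / (1 - \beta_1^t)$, where $v_{db}^{\text{corrected}}$ is the corrected form of $v_{db}$
 Then we have the averaged squared gradients using the RMSprop method:
$s_{dw} = \beta_2 s_{dw} + (1 - \beta_2)\,dw^2$
$s_{db} = \beta_2 s_{db} + (1 - \beta_2)\,db^2$
$s_{dw}^{\text{corrected}} = s_{dw} / (1 - \beta_2^t)$, where $s_{dw}^{\text{corrected}}$ is the corrected form of $s_{dw}$
$s_{db}^{\text{corrected}} = s_{db} / (1 - \beta_2^t)$, where $s_{db}^{\text{corrected}}$ is the corrected form of $s_{db}$
 Finally the weights are updated as follows:
$w = w - \alpha \cdot v_{dw}^{\text{corrected}} / (\sqrt{s_{dw}^{\text{corrected}}} + \epsilon)$
$b = b - \alpha \cdot v_{db}^{\text{corrected}} / (\sqrt{s_{db}^{\text{corrected}}} + \epsilon)$
 Here, the hyperparameters are $\alpha$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, with the values commonly used in practice ($\alpha$ still has to be tuned)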
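Putting the momentum term, the RMSprop term and bias correction together, one Adam update for a scalar parameter might look like the sketch below. The function name `adam_step`, the toy cost `f(w) = w**2`, and the learning rate of 0.1 are assumptions for illustration. The defaults for `beta1`, `beta2` and `eps` are the commonly used values 0.9, 0.999 and 1e-8.

```python
import math


def adam_step(w, v, s, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the step count, starting at 1."""
    v = beta1 * v + (1 - beta1) * grad       # momentum term
    s = beta2 * s + (1 - beta2) * grad ** 2  # RMSprop term
    v_corrected = v / (1 - beta1 ** t)       # bias correction for v
    s_corrected = s / (1 - beta2 ** t)       # bias correction for s
    w = w - alpha * v_corrected / (math.sqrt(s_corrected) + eps)
    return w, v, s


# Toy problem (assumed): minimize f(w) = w**2, whose gradient is dw = 2 * w.
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, v, s = adam_step(w, v, s, grad=2 * w, t=t)

print(w)  # close to the minimum at w = 0
```

On the very first step the corrected terms reduce to the gradient and its square, so the parameter moves by almost exactly `alpha` regardless of the gradient's size.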
6. Learning Rate Decay
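 During gradient descent, the pathway may keep oscillating around the minimum without converging if the learning rate is too large
 So, there is a technique to lower the learning rate as the descent approaches the minimum, so that it can settle into the minimum instead of bouncing around it
 The formula for a decaying learning rate is given below:
$\alpha = \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$
 Here, $\alpha_0$ is the initial learning rate, decay_rate is a hyperparameter, and epoch_num is the number of the current epoch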
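Other learning rate formulae that can be used
$\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$ (exponential decay; the base 0.95 is one typical choice)
 Discrete Staircase (the learning rate is cut by a fixed factor after a set number of epochs)
$\alpha = \dfrac{k}{\sqrt{\text{epoch\_num}}}\,\alpha_0$ or $\alpha = \dfrac{k}{\sqrt{t}}\,\alpha_0$, where $k$ is a constant and $t$ is the minibatch number
 One option is to decay the learning rate manually during training, which is not feasible most of the time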
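Two of these schedules can be sketched as small functions. The code assumes the common inverse-time form `alpha0 / (1 + decay_rate * epoch_num)` for the decaying learning rate and a halving rule for the discrete staircase. The function names, the argument names and the numbers in the demo are all hypothetical.

```python
def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """Learning rate that shrinks smoothly as the epoch number grows."""
    return alpha0 / (1 + decay_rate * epoch_num)


def staircase_decay(alpha0, drop, epochs_per_drop, epoch_num):
    """Learning rate multiplied by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch_num // epochs_per_drop)


# Demo values (assumed): initial learning rate 0.2, decay rate 1.0.
for epoch_num in range(4):
    print(epoch_num, inverse_time_decay(0.2, 1.0, epoch_num))

# Staircase (assumed): halve the learning rate every 10 epochs.
print(staircase_decay(0.2, 0.5, 10, 25))
```

With these demo values the inverse-time schedule halves the learning rate after the first epoch and then decays more and more slowly, while the staircase keeps it constant inside each block of 10 epochs.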
NOTE
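 Generally there are points during gradient descent where the gradient is close to zero, such as local optima and saddle points (these are not the same thing: a saddle point is a minimum along some directions and a maximum along others), where the pathway of the descent may get stuck without converging to a good minimum
 So, we should be aware of such points in the descent
 Also, plateaus (regions where the gradient stays close to zero for a long stretch) may make the learning slow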
To conclude, these were some of the popular optimization algorithms that are used to speed up convergence in deep neural networks. Incorporating these algorithms can, in favorable cases, cut the training time from days to hours or even minutes. Don’t forget to check out other posts related to machine learning and deep learning on my blog. Thank you for reading and cheers for your next machine learning model.