This post surveys popular gradient-based optimization algorithms and gives a practical sense of how they behave, where they work well, and where they fall short. It begins with the three gradient descent variants, batch, stochastic, and mini-batch gradient descent, and their trade-offs, then discusses training challenges such as choosing an appropriate learning rate and avoiding getting stuck in suboptimal local minima. The core of the post walks through the motivations and update rules of Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, and Nadam. It also covers strategies for parallelizing and distributing SGD, along with additional techniques such as shuffling, curriculum learning, batch normalization, and early stopping. The aim throughout is a practical understanding of these algorithms and how they are used to optimize neural networks.
https://www.ruder.io/optimizing-gradient-descent/
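As a rough illustration of the kind of update rules the post describes (this is not code from the post), here is a minimal NumPy sketch of a classical momentum step, an Adam step, and a mini-batch training loop on a toy least-squares problem. The helper names (`momentum_step`, `adam_step`), the hyperparameter defaults, and the toy data are illustrative assumptions, not taken from the source.

```python
import numpy as np

# Illustrative update rules in plain NumPy (not the post's own code).

def momentum_step(theta, grad, v, lr=0.01, gamma=0.9):
    """Classical momentum: v_t = gamma * v_{t-1} + lr * grad; theta_t = theta_{t-1} - v_t."""
    v = gamma * v + lr * grad
    return theta - v, v

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates scale each parameter's step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Mini-batch gradient descent on a toy least-squares problem (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
m, v, t = np.zeros(3), np.zeros(3), 0
batch_size = 32

for epoch in range(20):
    perm = rng.permutation(len(X))  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient of mean squared error
        t += 1
        theta, m, v = adam_step(theta, grad, m, v, t)

print("estimated weights:", np.round(theta, 3))  # should be close to [2.0, -1.0, 0.5]
```

Swapping `adam_step` for `momentum_step` (tracking a single velocity vector instead of `m`, `v`, `t`) gives a feel for how the adaptive per-parameter scaling in Adam differs from a single global learning rate with momentum.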