The latest overview of deep learning optimization algorithms in 2018

** gradient descent algorithm**is a very widely used optimization algorithm in machine learning and is the most commonly used optimization method among many machine learning algorithms. Almost every current advanced (state-of-the-art) machine learning library or deep learning library will include different variants of the gradient descent algorithm implementation.

nevertheless， They are like a black box optimizer， It's hard to get them superior drawback Practical explanation of the。

An overview of gradient descent optimization algorithms

This article aims to provide an introduction to the different variants in the gradient descent algorithm to help users use it according to their specific needs. A detailed comparison of the different variants in the gradient descent algorithm and helps the user to use it according to their specific needs. Recently Ruder has compiled a list of 2018 deep learning optimization research highlights based on previous overviews in response to some new approaches to optimization algorithms in 2018, which are worth watching.

The ultimate goal of deep learning is to find a minimum value that generalizes well, and it would certainly be nice if this value could be found quickly and reliably. The main method is the stochastic gradient descent (SGD) method, which is 60 years old (Robbins and Monro, 1951) and it is very important for current backpropagation algorithms for deep learning.

Different optimization algorithms have been proposed in recent years, each using a different formulation to update the parameters of the model. Adam (Kingma and Ba, 2015) has remained one of the most commonly used optimization algorithms until today since it was introduced in 2015. This suggests that from the perspective of a machine learning practitioner, the best approach to deep learning optimization has not changed much to a large extent.

nevertheless， This year has already yielded some new ideas to improve deep learning optimization methods， This may be the way to optimize our model in the future。 In this blog post， I'll dive into the most exciting highlights and most promising directions for deep learning。 Please note， This blog post assumes in advance that you are already familiar withSGD and adaptive learning rate methods。 To increase the speed of learning， Please first understand the existing Gradient descent optimization algorithm， See my other blog post（http://ruder.io/optimizing-gradient-descent/index.html）。

** Understanding Generalization**

Optimization is closely related to generalization, as the minimum value of model convergence defines the degree of model generalization. Thus, boosting optimization is closely related to theoretical advances in understanding this minimal value generalization behavior and, more generally, a deeper understanding of the generalization capabilities of deep learning.

However, our understanding of the generalization behavior of deep neural networks is still very shallow. Recent work has shown that the number of possible local minima increases exponentially with the number of parameters (Kawaguchi, 2016). Considering the huge number of parameters in current deep learning architectures, it seems amazing that such models can converge and generalize quite well, especially considering that they can completely remember random inputs (Zhang et al., 2017).

Keskar et al. (2017) identify the clarity of minima as a source of poor generalization: in particular, they show that explicit minima found by batch gradient descent have a high generalization error. This is intuitive because we would normally expect our functions to be smooth, with explicit minima indicating that the corresponding error surface is highly irregular. However, recent work suggests that clarity may not be a good metric, as it suggests that local minima can generalize well (Dinh et al, 2017). Eric Jang's quora Q&A also discusses these articles.

An ICLR 2018 contributed paper shows through a series of ablation analyses that a model's dependence on a single direction in the activation space (i.e., activation of a single unit or feature map) is a good indication of its generalization performance. They demonstrate the applicability of this model to models with different datasets and varying degrees of label corruption. Also they found that dropout did not help with this problem and that batch normalization hindered unidirectional dependencies.

While these findings suggest that there is still much we do not know about deep learning optimization, it is important to remember that convergence guarantees and a great deal of the work that exists in convex optimization, and to some extent existing ideas and insights can be applied to non-convex optimization as well. The large-scale optimization tutorial in NIPS 2016 provides a fascinating overview of more theoretical work in the field.

** Gradient descent optimization framework**

** Batch gradient descent (Batch gradient descent)**

The model parameters are updated each time using all the training set samples, i.e.

θ=θ−η⋅∇θJ(θ)

where the gradient of the loss function is calculated using the full training set of samples each time, and then the learning rate is used to update each parameter of the model in the opposite direction of the gradient.

Batch gradient descent uses the entire training set for each learning, so its** advantages** in that each update will go in the right direction and will eventually be guaranteed to converge to the extrema (convex functions converge to the global extrema, non-convex functions may converge to the local extrema), but their** drawback** in that each learning time is too long and if the training set is so large that it consumes a lot of memory, and that batch gradient descent does not allow for online model parameter updates.

** Stochastic gradient descent (STD)**

The stochastic gradient descent algorithm randomly selects one sample at a time from the training set for learning, i.e.

θ=θ−η⋅∇θJ(θ;xi;yi)

The batch gradient descent algorithm uses the full training samples each time, so the computation is redundant because it uses the exact same set of samples each time.

And random gradient descent algorithm Only one sample is randomly selected at a time to update the model parameters， So each study is very quick， And it can be updated online。

The stochastic gradient drops the largest** drawback** in that each update may not follow the right direction and can therefore introduce optimization fluctuations, as follows.

Figure 1 SGD

On the other hand, however, one advantage of the fluctuations caused by stochastic gradient descent is that for basin-like regions (i.e., many local minima) then this fluctuating feature may make the direction of optimization jump from the current local minima to another better local minima, so that it may then eventually converge to a better local minima, or even a global minima, for non-convex functions.

Due to fluctuations, it therefore makes the number of iterations (learning) increase, i.e., convergence slows down.

Ultimately though its going to have the same convergence as the full volume gradient descent algorithm, i.e. the convex function converges to the global extremum and the non-convex loss function converges to the local extremum.

** Warm restarts (Warm restarts)**

SGD with restarts (hot restart)

SGDR (Loshchilov andHutter, 2017) is an effective method recently proposed as an SGD method that uses thermal restart instead of learning rate annealing. On each restart, the learning rate is initialized to a certain value and will decrease. It is important that the restart is a hot restart because the optimization does not start from scratch, but from the last step in which the model converges on the parameters. The key factor is to bring down the learning rate with an aggressive cosine annealing scheme, which will rapidly reduce the learning rate as follows.

Learning rate hot restart solution

The initial high learning rate after restart is used to essentially bounce the parameters from their previously converged minima to a different loss surface. Radical annealing allows the model to converge quickly to a new and better solution. The authors found empirically that hot restart SGD takes 2 to 4 times less time than learning rate annealing and achieves comparable or better performance.

Learning rate annealing with hot restart is also known as periodic learning rate and was originally proposed by Smith (2017). Two other articles by students of fast.ai (who have recently started teaching this method) discussing hot starts and cyclic learning rates are at (https://medium.com/@bushaev/improving-the-way-we-work-with-learning-rate-5e99554f163b) and (http://teleported.in/posts/cyclic-learning-rate/).

** Mini-batch gradient descent **

Mini-batch Gradient descent synthesizes thebatch Gradient descent withstochastic gradient decreasing， Striking a balance between the speed of each update and the number of updates， Its each update is randomly selected from the training setm（ among othersm<n） A sample of the study， i.e.：

θ=θ−η⋅∇θJ(θ;xi:i+m;yi:i+m)

Compared to stochastic gradient descent, Mini-batch gradient descent reduces convergence volatility, i.e., it reduces the variance of parameter updates, making them more stable.

relative to batch gradient descent, which improves the speed of each learning. And it doesn't have to worry about memory bottlenecks so it can use matrix operations for efficient computation.

In general [50,256] samples are randomly selected for learning for each update, but they should also be chosen according to the specific problem, and in practice several trials can be conducted to choose a sample number that is more suitable for both the update speed and the number of updates.

mini-batch gradient descent can guarantee convergence though. mini-batch gradient descent is commonly used in neural networks.

** Issues and challenges**

While gradient descent algorithms are effective and widely used, at the same time they have a number of challenges and problems that need to be addressed.

If the learning rate is too small, it leads to a very slow convergence rate. If the learning rate is too large, then its will hinder convergence, i.e., it will oscillate near the extrema.__Choosing a reasonable rate of study is difficult__(also known as Learning rate schedules) [11] Attempt to vary the learning rate, such as annealing, during each update. Some sort of pre-determined strategy is generally used or a smaller threshold is decayed in each iteration. Regardless of the tuning method, a fixed setup is required beforehand, and this side then cannot adapt itself to the characteristics of the dataset learned each time [10].__Learning rate adjustment__**Fine-tuning the learning rate (Tuning the learning rate)**

In many cases, except for the hyperparameters our model does not need to be improved and tuned.

Recent examples of language modeling have demonstrated that simply adjusting LSTM parameters (Melis et al., 2017)[20] and regularization parameters (Merity et al., 2017) can produce better results compared to more complex models. The learning rate η is an important optimization hyperparameter in deep learning. In fact, SGD has been shown to require a learning rate annealing scheme to converge to a good minimum. It is often assumed that adaptive learning rate methods like Adam's are more robust to different learning rates because they update the learning rate themselves. However, even for these methods, there can be a significant difference between a good learning rate and an optimal learning rate.

Zhang et al. (2017) show that SGDs with tuned learning rate annealing schemes and momentum parameters can not only compete with Adam but also converge faster. On the other hand, while we may argue that the adaptation of Adam's learning rate can mimic learning rate annealing, it is still beneficial to explicitly use the annealing scheme: if we add SGD to Adam for learning rate annealing, it converges faster in the machine translation task (Denkowski and Neubig, 2017).

In fact, learning rate annealing schemes seem to be the new feature engineering, as we can often find improved learning rate annealing schemes that improve the final convergence behavior of our models. An interesting example is e.g. Vaswani et al. (2017).

While it is common to see a model's hyperparameters subjected to large-scale hyperparameter optimization, it is interesting to see the learning rate annealing scheme as the focus of the same attention to detail: the authors use the parameters β1= 0.9, the non-default β2= 0.98, and ε= 10^-9 in Adam, and the learning rate η is probably one of the most complex annealing schemes.

where dmodel is the number of model parameters and warmup_steps=4000 Another recent paper by Smith et al. (2017) shows that there is a link between learning rate and batch size, and that the two hyperparameters are usually considered to be independent of each other: they show that decaying the learning rate is equivalent to increasing the batch size, yet the batches can increase in parallel.

Instead, we can reduce the number of model updates, thus speeding up training by increasing the learning speed and scaling the batch. This has implications for large-scale deep learning, where existing training schemes can now be retuned without the need to adjust hyperparameters.

If the data features are sparse or each feature has different fetch statistics features and spaces, then the same learning rate cannot be used for each parameter in each update, and those features that rarely occur should use a relatively large learning rate.__All parameters of the model are updated each time using the same learning rate__As in the neural net. So how do you avoid it. Dauphin [3] pointed out that the more serious problem is not the local extremal points, but the saddle points (These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.).__For non-convex objective functions, it is easy to get caught in those suboptimal local extrema__

** Gradient descent optimization algorithm**

Some gradient optimization methods that are frequently used in the deep learning community to solve appeal problems are discussed below, although algorithms that are not feasible in high-dimensional data, such as Newton's method, are not included.

**Momentum **

If in canyon areas (much steeper in some directions than in others, commonly at local extrema) [1], SGD will oscillate near these places, resulting in slow convergence. This situation is then solved by Momentum (Momentum) [2]. Momentum is added to the parameter update term with a single update (i.e., the momentum term), i.e.

νt=γνt−1+η ∇θJ(θ)

θ=θ−νt

where the momentum term hyperparametersγ<1 Generally it is less than or equal to0.9。

its role As shown in the figure below.

Figure 2 No momentum

Figure 3 plus momentum

Adding the momentum term is like rolling a ball down from the top of a hill, the ball accumulates the momentum ahead of it as it rolls down (the momentum keeps increasing), so the speed becomes faster and faster until it reaches the end.

Similarly, when updating the model parameters, for those parameters whose current gradient direction is the same as the last gradient direction, then enhancements are made, i.e., faster in those directions; for those parameters whose current gradient direction is different from the last gradient direction, then cuts are made, i.e., slower in those directions. Thus faster convergence with less oscillation can be obtained.

**NAG**[7]

A ball rolling downhill from the top of a hill will blindly choose the slope. A better way should be that you should slow down before you encounter a tilt up.

Nesterov accelerated gradient (NAG, Nesterov gradient acceleration) not only adds the momentum term, but also subtracts it from the loss function when calculating the gradient of the parameter, i.e., calculating ∇θJ(θ-γνt-1), in such a way that it predicts where the parameter will be located next time. To wit.

νt=γνt−1+η⋅∇θJ(θ−γνt−1)

θ=θ−νt

As shown in the figure below.

Figure 4 NAG update

A detailed description can be found in Ilya Sutskever's PhD thesis [9]. Assuming the momentum factor parameter γ = 0.9, the current gradient term is first calculated, as in the small blue vector above, and then the momentum term is added so that a large jump is obtained, as in the large blue vector above. This is the update that contains only the momentum term. And the NAG first comes with a large jump (momentum term) and then adds a small correction using the current gradient calculated with momentum (red vector above) to get the green vector above. This can prevent too-fast updates to improve responsiveness, as in RNNs [8].

With the above two methods, it can be done to accelerate the convergence of SGD by being able to do adaptive updates based on the slope of the loss function during each learning process. The next step then requires the respective adaptive update of each parameter according to the importance of the parameter.

**Adagrad**

Adagrad [3] is also a gradient-based optimization algorithm that adapts to different learning rates for each parameter, getting large learning updates for sparse features and smaller learning updates for non-sparse features, making this optimization algorithm suitable for handling sparse feature data. Dean et al [4] found that Adagrad can improve the robustness of SGD well, and google then used it to train large-scale neural networks (watch videos to recognize cats: recognize cats in Youtube videos). Pennington et al [5] then used Adagrad in GloVe to train to obtain Word Embeddings, where frequently occurring words were given smaller updates and infrequently occurring words were given larger updates.

In the previous, each model parameter θi uses the same learning rate η, while Adagrad uses a different learning rate ηi for each model parameter θi at each update step, and let the gradient of the parameter θi of the objective function at the tth update step be gt,i, i.e.

gt,i=∇θJ(θi)

Then the SGD update equation is.

θt+1,i=θt,i−η⋅gt,i

While Adagrad uses a different learning rate for each parameter, its update equation is

where Gt ∈ Rd×d is a diagonal matrix, where the diagonal element eii in the ith row is the sum of squares of the gradients from the past to the current ith parameter θi , and epsilon is a smoothing parameter, in order to make the denominator non-zero (usually ϵ = 1e-8), and in addition the algorithm performance would be poor if the denominator were not open-rooted.

Further, all elements of Gt,ii,gt,i are written as vectors Gt,gt so that the vector dot product operation can be used.

Adagrad The main advantage is its ability to adapt different learning rates for each parameter， And the general manual is set to0.01。 whilst drawback In the need to calculate the sum of squares of the parameter gradient sequence， And the learning rate trend is decaying eventually reaching a very small value。

**RMSprop**

In fact, RMSprop is an intermediate form of Adadelta, also to reduce the problem of too fast decay of learning rate in Adagrad, viz.

Hinton suggests γ=0.9,η=0.001.

**Adam**

Adaptive Moment Estimation (Adam) is also an adaptive different learning rate method with different parameters, and differs from Adadelta and RMSprop in that it calculates the historical gradient decay differently, without using historical squared decay, and its decay is momentum-like, as follows.

mt=β1mt−1+(1−β1)gt

vt=β2vt−1+(1−beta2)g2t

mt and vt are the weighted average and weighted biased variance of the gradients, respectively, initially in the 0 vector, and Adam's authors find that they tend to be in the 0 vector (close to the 0 vector), especially when the decay factor (decay rate) β1,β2 is close to 1. To improve this problem, bias-correction (bias-corrected) is applied to mt vs. vt.

Ultimately, Adam's updated equation is

The default values suggested in the paper: β1 = 0.9, β2 = 0.999, ϵ = 10-8. The paper compares Adam to several other adaptive learning rates, all with better results.

** Comparison of optimization methods **

The following two diagrams visually compare each of the above optimization methods, see here for details, as shown in the figure.

Fig. 5 Performance of each SGD optimization method on lossy surfaces

As can be seen from the above figure, Adagrad, Adadelta and RMSprop are able to immediately transfer to the correct direction of movement on the loss surface to achieve fast convergence. And Momentum with NAG causes deviation (off-track). Also NAG is able to correct its course quickly after deviations as it improves responsiveness based on gradient corrections.

Fig. 6 Performance of each SGD optimization method on the lossy surface at the saddle point

As can be seen from the above figure, at the saddle points (i.e., zero gradient in some dimensions and non-zero gradient in some dimensions), SGD, Momentum & NAG keep oscillating in the direction of zero gradient at the saddle points, making it difficult to break the symmetry of the saddle point locations; Adagrad, RMSprop & Adadelta are able to shift quickly to the direction of non-zero gradient.

As can be seen from the two graphs above, the adaptive learning rate methods (Adagrad, Adadelta, RMSprop & Adam) have better convergence speed and convergence in these scenarios.

** How to choose an SGD optimizer**

If your data features are sparse, then you are better off using adaptive learning rate SGD optimization methods (Adagrad, Adadelta, RMSprop & Adam) because you don't need to manually adjust the learning rate during the iterations.

RMSprop is an extension of Adagrad that is similar to Adadelta, but an improved version of Adadelta uses RMS to automatically update the learning rate and does not require the initial learning rate to be set. And Adam is using momentum with bias correction on top of RMSprop. RMSprop, Adadelta and Adam perform about the same in similar situations. Kingma [15] states that the gain is slightly better than RMSprop for bias correction, as its gradient becomes more sparse as it approaches convergence. Therefore, Adam is probably the best SGD optimization method available.

What's interesting is that， Many recent papers have been written using the originalSGD gradient descent algorithm， and using a simple learning rate annealing adjustment（ No momentum term）。 What is available has been shown：SGD Able to converge to the minimum value point， But as opposed to the otherSGD， It can take longer.， and relies on robust initial values and a learning rate annealing adjustment strategy， and easily fall into local minima， Even saddle point.。 therefore， If you care about convergence speed or training a deep or complex network， You should choose an adaptive learning rateSGD Optimization methods。

** Parallel and Distributed SGD **

If you are dealing with a very large dataset and have clusters of machines available, then parallel or distributed SGD is a very good choice as the speed can be increased significantly. The nature of the SGD algorithm dictates that it is serial (step-by-step). So how to do asynchronous processing is a problem. Although serial can guarantee convergence, speed is a bottleneck if the training set is large. If asynchronous updates are performed, then this may lead to non-convergence. In the following, we will discuss how to perform parallel or distributed SGD, where parallel generally refers to multi-core parallelism on the same machine and distributed refers to clustered processing.

**Hogwild**

Niu [23] proposed the parallel SGD method known as Hogwild. The method performs parallelism at multiple CPU times. The processor accesses the parameters through shared memory and these parameters are not locked. It allocates a non-overlapping portion of the parameters to each cpu (allocation mutex), and each cpu updates only the parameters it is responsible for. This method is only suitable for dealing with data features that are sparse. This method achieves an almost optimal convergence rate because the same information is not rewritten between cpu's.

**Downpour SGD**

Downpour SGD is an asynchronous variant of SGD proposed by Dean [4] for use in DistBelief (the predecessor of Google TensorFlow). It trains multiple copies of the model at the same time on a training subset. These replicas send their respective updates to a parameter server (PS, parameter server), each of which updates only a mutually exclusive portion of the parameters, and the replicas do not communicate with each other. Thus it may lead to parameter divergence to the detriment of convergence.

**Delay-tolerant Algorithms for SGD**

McMahan & Streeter [12] extend AdaGrad by developing delay-tolerant algorithms, which not only adapts to past gradients, but also updates the delay. The method has been shown to be effective in practice.

**TensorFlow**

TensorFlow [13] is a large-scale machine learning library open sourced by Google, which was formerly known as DistBelief. It has been used on a large number of mobile devices or in large distributed clusters and has been tested in practice. Its distributed implementation is based on graph computation, which partitions the graph into multiple subgraphs, with each computational entity acting as a computational node in the graph, and they communicate via Rend/Receive. See here for details.

**Elastic Averaging SGD**

Zhang et al [14] proposed Elastic Averaging SGD (EASGD), which connects each work through an elastic force (a parameter server center that stores parameters) to perform parameter asynchronous updates. This allows the local variables to fluctuate further from the center variable, which in theory allows for more exploration of the parameter space. They show empirically that this increased capacity for exploration leads to improved performance by finding new local optima. Don't really understand this sentence, need to go read the paper.

** More SGD optimization strategies **

Next. More SGD optimization strategies to further improveSGD performance。 There are also numerous other optimization strategies， See also[22]。

**Shuffling and Curriculum Learning**

To make the learning process more unbiased, the samples in the training set should be randomly disrupted in each iteration.

On the other hand, in many cases we solve the problem step by step, and arranging the training set in some meaningful order will improve the performance of the model and the convergence of SGD, and how to build a meaningful arrangement of the training set is called Curriculum Learning [16].

Zaremba & Sutskever [17], in using Curriculum Learning to train LSTMs to solve some simple problems, showed that a combined or hybrid strategy is better than sorting the training set in increasing order of training difficulty.

**Batch normalization**

In order to facilitate training, we usually initialize the parameters according to the 0 mean 1 variance, and as we continue to train, the parameters are updated to varying degrees, so that these parameters lose the distribution properties of the 0 mean 1 variance, which will reduce the training speed and amplify the parameter changes as the network structure deepens.

Batch normalization [18] renormalizes the parameters to 0 mean 1 variance after each mini-batch backpropagation. This allows the use of a larger learning rate, as well as spending less effort on parameter initialization points. Batch normalization acts as a regularization, reducing or even eliminating the need for Dropout.

**Early stopping**

On the validation set if the loss function no longer decreases significantly during successive multiple iterations, then training should be ended early, see the NIPS 2015 Tutorial slides for details, or see some methods to prevent overfitting.

**Gradient noise**

Gradient noise [21] i.e., a Gaussian distribution N(0,σ2t) of random errors is added to each iteration of the computational gradient, i.e. :

gt,i=gt,i+N(0,σ2t)

The variance of the Gaussian error requires annealing.

Adding random errors to the gradient increases the robustness of the model, even if the initial parameter values are poorly chosen, and lends itself to training for particularly deep and responsible networks. The reason for this is that adding random noise will make it more likely that the local extrema will be skipped and go to a better local extremum, a possibility that is more common in deeper networks.

** conclude**

In the above, three frameworks for gradient descent algorithms are described and mini-batch gradient descent is the most widely used. We then highlight some optimization methods for SGD: Momentum, NAG, Adagrad, Adadelta, RMSprop & Adam, as well as some asynchronous SGD methods.

Finally, some other optimization suggestions to improve SGD performance are presented, such as: random shuffling and curriculum learning of the training set, batch normalization,, early stopping and Gradient noise.

I hope this article has given you some guidance on how to use different gradient optimization algorithms. Any more suggestions or ways to optimize would be appreciated? Or what tips and tricks do you use to better train SGD to be able to communicate together? Thanks.

[1] Sutton, R. S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks. Proc. 8th Annual Conf. Cognitive Science Society.

[2] Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.http://doi.org/10.1016/S0893-6080(98)00116-6.

[3] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved fromhttp://jmlr.org/papers/v12/duchi11a.html.

[4] Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V, … Ng, A. Y. (2012). Large Scale Distributed Deep Networks. NIPS 2012: Neural Information Processing Systems, 1–11.http://doi.org/10.1109/ICDAR.2011.95.

[5] Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162.

[6] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. Retrieved fromhttp://arxiv.org/abs/1212.5701.

[7] Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). Doklady ANSSSR (translated as Soviet.Math.Docl.), vol. 269, pp. 543– 547.

[8] Bengio, Y., Boulanger-Lewandowski, N., & Pascanu, R. (2012). Advances in Optimizing Recurrent Networks. Retrieved from http://arxiv.org/abs/1212.0901.

[9] Sutskever, I. (2013). Training Recurrent neural Networks. PhD Thesis.

[10] Darken, C., Chang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop, (September), 1–11. http://doi.org/10.1109/NNSP.1992.253713.

[11] H. Robinds and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, vol. 22, pp. 400–407, 1951.

[12] Mcmahan, H. B., & Streeter, M. (2014). Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning. Advances in Neural Information Processing Systems (Proceedings of NIPS), 1–9. Retrieved from http://papers.nips.cc/paper/5242-delay-tolerant-algorithms-for-asynchronous-distributed-online-learning.pdf.

[13] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Zheng, X. (2015). TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems.

[14] Zhang, S., Choromanska, A., & LeCun, Y. (2015). Deep learning with Elastic Averaging SGD. Neural Information Processing Systems Conference (NIPS 2015), 1–24. Retrieved fromhttp://arxiv.org/abs/1412.6651.

[15] Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13

[16] Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning, 41–48.http://doi.org/10.1145/1553374.1553380.

[17] Zaremba, W., & Sutskever, I. (2014). Learning to Execute, 1–25. Retrieved fromhttp://arxiv.org/abs/1410.4615.

[18] Ioffe, S., & Szegedy, C. (2015). Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv Preprint arXiv:1502.03167v3.

[19] Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, 1–14. Retrieved from http://arxiv.org/abs/1406.2572.

[20] Sutskever, I., & Martens, J. (2013). On the importance of initialization and momentum in deep learning. http://doi.org/10.1109/ICASSP.2013.6639346.

[21] Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2015). Adding Gradient Noise Improves Learning for Very Deep Networks, 1–11. Retrieved fromhttp://arxiv.org/abs/1511.06807.

[22] LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade, 1524, 9–50. http://doi.org/10.1007/3-540-49430-8_2.

[23] Niu, F., Recht, B., Christopher, R., & Wright, S. J. (2011). Hogwild ! : A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 1–22.

[24] Duchi et al. [3] give this matrix as an alternative to the full matrix containing the outer products of all previous gradients, as the computation of the matrix square root is infeasible even for a moderate number of parameters dd.