Jun 5, 2018

On layer-level control of DNN training and its impact on generalization

Simon Carbonnelle, Christophe De Vleeschouwer
FNRS research fellows
ICTEAM, Université catholique de Louvain
Louvain-La-Neuve, Belgium
{simon.carbonnelle, christophe.devleeschouwer}@uclouvain.be

Abstract
The generalization ability of a neural network depends on the optimization procedure used for training it. For practitioners and theoreticians, it is essential to identify which properties of the optimization procedure influence generalization. In this paper, we observe that prioritizing the training of distinct layers in a network significantly impacts its generalization ability, sometimes causing differences of up to 30% in test accuracy. In order to better monitor and control such prioritization, we propose to define layer-level training speed as the rotation rate of the layer’s weight vector (denoted by layer rotation rate hereafter), and develop Layca, an optimization algorithm that enables direct control over it through each layer’s learning rate parameter, without being affected by gradient propagation phenomena (e.g. vanishing gradients). We show that controlling layer rotation rates enables Layca to significantly outperform SGD with the same amount of learning rate tuning on three different tasks (up to 10% test error improvement). Furthermore, we provide experiments that suggest that several intriguing observations related to the training of deep models, i.e. the presence of plateaus in learning curves, the impact of weight decay, and the bad generalization properties of adaptive gradient methods, are all due to specific configurations of layer rotation rates. Overall, our work reveals that layer rotation rates are an important factor for generalization, and that monitoring them should be a key component of any deep learning experiment.

Preprint. Work in progress.

1 Introduction
Generalization and gradient propagation are two popular themes in the deep learning literature that motivated the questions studied in this paper. On the one hand, it has been observed that a network’s ability to generalize depends on a subtle interaction between the training data and stochastic gradient descent [36, 4]. Considering this idea at layer level, each layer of a deep network might behave differently with respect to generalization since the training of each layer is guided by different input and feedback signals. On the other hand, several works have shown that the norm of gradients can gradually increase or decrease with the layers, creating inequalities amongst layers during SGD training: some are trained more than others (vanishing and exploding gradients are two well-known examples [5, 16, 13]). This work explores an interaction between generalization and the intricate nature of gradient propagation, and focuses on the following research question: how do inequalities between layers during training influence generalization?

To study this question in depth, we need a method for controlling and monitoring the layer-level training speeds that is robust to gradient propagation problems. In our first experiment, we achieve this by focusing on an extreme scenario where only one layer of the network is trained. Hence, controlling the layer-level training speeds reduces to selecting the layer to train. When switching to full training, we propose the rotation rate of the layer’s weight vector (denoted by layer rotation rate hereafter) as a measure of layer-level training speed, and develop the LAYer-level Controlled Amount of weight rotation algorithm (Layca) that enables control over it through the layer-wise learning rate parameters. In both scenarios, the conclusion is clear: controlling the training speed on a per-layer basis has an influence on the generalization ability of the network.

Motivated by this observation, we then study the configurations of layer rotation rates that emerge from training with commonly used optimizers and practices. On the one hand, we show that minima that generalize well and are easily found by Layca may require extensive tuning of layer-wise learning rates with SGD due to the intricate way gradients propagate in deep networks. With the same amount of learning rate tuning, we show that Layca significantly outperforms SGD on three different tasks (up to 10% test error improvement). On the other hand, we show that many mysterious observations around deep learning, such as the bad generalization ability of adaptive gradient methods [34], could be due to differences in layer rotation rate configurations. For example, we show that with the same experiment settings as in [34], applying Layca makes the training curves of adaptive methods and their non-adaptive equivalents indistinguishable.

In addition to revealing a novel factor, unique to deep learning, that influences generalization, we expect that our work, and the development of Layca in particular, will also help practitioners by reducing the hyper-parameter tuning required to train state of the art networks.
Moreover, through our analysis of previous observations around deep learning, we show that our work can also provide a common ground for investigating and clarifying at first sight unrelated deep learning mysteries, thus facilitating the work of theoreticians. Source code to reproduce all the figures of this paper is provided at https://github.com/Simoncarbo/Layer-level-control-of-DNN-training (code uses the Keras [7] and TensorFlow [2] libraries).

2 Related work
Recent works have demonstrated that generalization in deep neural networks is mostly due to a puzzling interaction between the training data and the optimization method [36, 4]. Our paper discloses a way by which optimization influences generalization in deep learning: by prioritizing the training of specific layers. This novel factor complements batch size and global learning rate, two parameters of SGD that have been extensively studied in the light of generalization [22, 20, 32, 31, 17].

The works studying the vanishing and exploding gradients problems [5, 16, 13] heavily inspired this paper. These works introduce two ideas which are central to our investigation: the intuitive notion of layer-level training speed and the fact that SGD does not necessarily treat all layers equally during training. Our work explores the same phenomena, but studies them in the light of generalization instead of trainability and speed of convergence.

Our paper also proposes Layca, an algorithm to control inequalities in training speed across layers. It is thus related to the works that sought solutions to the gradient propagation problems at optimization level [27, 14, 30]. Again, our work differs by focusing on the impact on generalization instead of on convergence speed and trainability. Recently, a series of papers proposed optimization algorithms similar to Layca and observed an impact on generalization [35, 37, 12]. Our paper complements these works by providing an extensive analysis of the reasons behind such observations.

3 Layer-level analysis matters when studying generalization
Many previous works on generalization in deep learning do not consider the decomposition of networks in layers, but treat networks as black box functions without internal structure. The goal of this section is to show through a toy example that apprehending the network at layer-level granularity is necessary when studying generalization in deep nets. In particular, it constitutes a first motivating example showing that prioritizing training of specific layers can influence generalization.

The model used in this experiment is an eleven-layer deep MLP (multi-layer perceptron) where all hidden layers are composed of 784 units with ReLU activation [26]. The data is the MNIST dataset [25], where only 10 randomly selected examples per class are used for training. The particularity of this toy example is that, first of all, the hidden layers are equivalent up to parameter initialization: each layer has 784 inputs, 784 outputs, and a linear+ReLU mapping between them. Second, training only one layer is sufficient to reach 100% accuracy on the training set. This setting allows us to ask the question: will all layers, if trained in isolation, reach the same accuracy on the test set? In other words, are all layers equivalent with respect to generalization?

3.1 Which layer would you train?
The generalization ability of a model depends on the training data used. In our toy example, the training of different hidden layers is guided by different inputs and feedback signals. At initialization, the difference lies in the number of random non-linear transformations applied to the network inputs (forward pass) and model errors (backward pass) before reaching the layer. Do forward and backward signals degrade (w.r.t. generalization) as they go through random non-linear transformations? Are both forward and backward signals equally robust/sensitive to random transformations? These questions are at the heart of the proposed dilemma: if you could train only a single layer, which one would you train to reach maximum test accuracy?

Figure 1a shows, as a function of the trained layer index, the test accuracy obtained by training a single layer in our toy example network. The results are averaged over 10 experiments. There is a clear trend: the test accuracy degrades with the depth of the trained layer, with a difference of nearly 20% in test accuracy. Layers are thus not made equal with respect to generalization, and in our toy example, training the first layers of the network tends to generalize better than training layers at the end of the network.
3.2 Improving its own feedback: the first layers’ secret trick
To understand why different layers have different generalization properties, Section 3.1 highlights the fact that each layer’s training is guided by different inputs and feedback signals. Here, using the same toy example, we present another key element that comes into play. Training a layer in isolation does not impact the inputs the layer receives. However, it impacts its feedback in a non-negligible way: a layer’s training influences the way errors propagate through every subsequent layer, because the backward pass is dependent on the activations of the forward pass. Our analysis reveals that this is a key ingredient that enables the first layers’ performance. Figure 1b shows, using the Silhouette coefficient [21], how the feedback received by the first layer gets more correlated with the classes/targets through training, a sign of improved feedback quality. Furthermore, Figure 1c reports the test accuracies obtained when influence on feedback is prevented (see footnote 1). We observe that the trend is inverted: the last layers generalize better than the first layers. In conclusion, based on this toy example, we have revealed novel mechanisms unique to deep networks that influence generalization in complex ways and that need to be taken into account for a theory of generalization in deep learning.

4 Monitoring and controlling layer-level training speed
In Section 3, we show that all layers are not made equal with respect to generalization, through a toy experiment where one layer is trained at a time. In a realistic situation, however, all layers of a deep network are trained simultaneously. In this case, the speed at which each layer trains could also have an impact on the final generalization ability of the network (training only one layer being an extreme configuration). However, the notion of layer-level training speed is still unclear, and its control through SGD is potentially difficult because of the intricate nature of gradient propagation (cf. vanishing and exploding gradients). The goal of this section is to present tools to monitor and control training speed at layer level, such that its impact on generalization can be studied in Section 5.

Footnote 1: We prevent feedback improvement by storing ReLU’s regime (operating or saturated for positive and negative pre-activations respectively) for each neuron and for each sample of the training and test sets at initialization and keeping it fixed during training even if the activation of a sample crosses the zero threshold.

[Figure 1 plots. (a) Test accuracy vs. layer index; (b) Silhouette score vs. epoch; (c) Test accuracy vs. layer index.]

Figure 1: Series of results on a toy example, designed to study if different layers relate similarly to generalization. An eleven-layer MLP network composed of 10 identical layers is trained on a reduced MNIST dataset, such that training any of the 10 layers in isolation results in 100% train accuracy. (a) Test accuracy as a function of the index of the trained layer (in forward pass ordering), averaged over 10 experiments. (b) Shows how the correlation between the first layer’s feedback and the output targets improves over training, even when only the first layer is trained. (c) Test accuracy as a function of layer index when feedback improvement is prevented.
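As a rough illustration of the single-layer training protocol used in this toy example, the NumPy sketch below performs one gradient step that updates only a chosen layer of a bias-free ReLU MLP, leaving every other layer untouched. The function names (`init_mlp`, `train_step_single_layer`), the He-style initialization, the softmax cross-entropy loss, the plain gradient step and the small layer width are assumptions made for illustration; they are not taken from the paper's released code.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Bias-free weight matrices with He-style initialization (an assumption)."""
    return [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def train_step_single_layer(weights, x, y_onehot, layer_idx, lr):
    """One gradient step on softmax cross-entropy that updates only one layer."""
    # Forward pass: ReLU on hidden layers, raw logits on the last layer.
    acts = [x]
    for i, w in enumerate(weights):
        z = acts[-1] @ w
        acts.append(z if i == len(weights) - 1 else np.maximum(z, 0.0))
    logits = acts[-1] - acts[-1].max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Backward pass: propagate the error down to the trained layer only.
    # The ReLU masks (acts[i] > 0) come from the forward pass, which is why
    # training an early layer changes the feedback it receives.
    delta = (probs - y_onehot) / x.shape[0]
    for i in range(len(weights) - 1, layer_idx, -1):
        delta = (delta @ weights[i].T) * (acts[i] > 0)
    grad = acts[layer_idx].T @ delta

    # Every other layer keeps its current weights.
    new_weights = list(weights)
    new_weights[layer_idx] = weights[layer_idx] - lr * grad
    return new_weights

# Tiny stand-in for the toy network (the paper uses 784 units per hidden layer).
rng = np.random.default_rng(0)
weights = init_mlp([16, 16, 16, 16, 3], rng)
x = rng.normal(size=(8, 16))
y = np.eye(3)[rng.integers(0, 3, size=8)]
updated = train_step_single_layer(weights, x, y, layer_idx=1, lr=0.1)
```

Repeating such a step for each candidate `layer_idx`, starting from the same initialization, would give one trained network per layer, which is the comparison summarized in Figure 1a.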

4.1 How to define training speed at layer level?
Training speed can be understood as the speed with which a model converges to its optimal solution. It is not to be confused with the learning rate, which is only one of the parameters that affect training speed in current deep learning applications. The notion of layer-level training speed is ill-posed, since a layer does not have a loss of its own: all layers optimize the same global loss function. Given a training step, how can we know by how much each layer’s update contributed to the improvement of the global loss? Previous work on vanishing and exploding gradients focused on the norm and variance of gradients as a measure of layer-level training speed [5, 16, 13]. Given the empirical work on activation and weight binarization during [8, 28, 18] or after training [3, 6], we argue that changes to the norm of a weight vector do not matter, and that only its orientation matters. Therefore, we suggest measuring training speed through the rotation rate of a layer’s weight vector (also denoted by layer rotation rate in this paper; see footnote 2).

4.2 Layca: an algorithm for layer-level control
Given our definition of layer-level training speed, we now develop an algorithm to control it. Ideally, the layer rotation rates should be directly controllable with the layer-wise learning rates, ignoring the peculiarities of gradient propagation. We propose Layca (SGD-guided LAYer-level Controlled Amount of weight rotation), an algorithm where the layer-wise learning rates directly determine the amount of rotation performed by each layer’s weight vector during an optimization step, in a direction specified by an optimizer (SGD being the default choice).
Inspired by techniques for optimization on manifolds [1], and on spheres in particular, Layca is composed of 4 operations, applied individually to each layer: projection of the optimizer’s step on the space orthogonal to the current weights, rotation-based normalization of the step, performing the update scaled by the learning rate, and projecting the resulting weights back onto the sphere. Algorithm 1 details these operations.
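The four operations of Section 4.2 can be sketched for a single layer in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Keras implementation: the function name `layca_update`, its signature, the flattening of the weight tensor into one vector and the early return for a vanishing projected step are all hypothetical choices.

```python
import numpy as np

def layca_update(w, step, lr, w0_norm, eps=1e-12):
    """One Layca-style update for a single layer.

    w       : current weights of the layer (any shape)
    step    : step proposed by the base optimizer (e.g. -gradient for SGD)
    lr      : layer-wise learning rate rho_l(t)
    w0_norm : L2 norm of the layer's initial weights
    """
    shape = np.shape(w)
    w = np.asarray(w, dtype=float).ravel()
    s = np.asarray(step, dtype=float).ravel()

    # 1. Project the step on the space orthogonal to the current weights.
    s = s - (s @ w) / (w @ w) * w

    # 2. Rotation-based normalization: give the step the same norm as w.
    s_norm = np.linalg.norm(s)
    if s_norm < eps:  # assumption: leave weights unchanged if the step vanishes
        return w.reshape(shape)
    s = s * np.linalg.norm(w) / s_norm

    # 3. Perform the update, scaled by the layer-wise learning rate.
    w_new = w + lr * s

    # 4. Project the weights back on the sphere of radius ||w_0||.
    w_new = w_new * w0_norm / np.linalg.norm(w_new)
    return w_new.reshape(shape)

# Example: one update of a random 4x3 kernel with an SGD step (-gradient).
rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 3))
grad = rng.normal(size=(4, 3))
w1 = layca_update(w0, -grad, lr=0.1, w0_norm=np.linalg.norm(w0))
```

Because the normalized step is orthogonal to the weights and has the same norm, the angle between the weights before and after the update is arctan(lr) in this sketch, whatever the magnitude of the optimizer's step. This is the sense in which the learning rate directly sets the amount of rotation.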

Footnote 2: It is worth noting that our measure focuses on weights that multiply the inputs of a layer (typically kernels of fully connected and convolutional layers). Additive weights (biases) are not used in our models, and we leave their study as future work.

Algorithm 1 Layca, an algorithm that enables control over the layer rotation rates through their learning rate parameter (see Section 4.2).
Require: $o$, an optimizer (SGD is the default choice)
Require: $T$, the number of training steps
$L$ is the number of layers in the network
for $l = 0$ to $L-1$ do
    Require: $\rho_l(t)$, a layer’s learning rate schedule
    Require: $w_0^l$, the initial weights of layer $l$
end for
$t \leftarrow 0$
while $t < T$ do
    $s_t^0, ..., s_t^{L-1} = \mathrm{getStep}(o, w_t^0, ..., w_t^{L-1})$ (get the updates of the selected optimizer)
    for $l = 0$ to $L-1$ do
        $s_t^l \leftarrow s_t^l - \frac{(s_t^l \cdot w_t^l)\, w_t^l}{w_t^l \cdot w_t^l}$ (project step on space orthogonal to $w_t^l$)
        $s_t^l \leftarrow \frac{s_t^l \, \|w_t^l\|_2}{\|s_t^l\|_2}$ (rotation-based normalization)
        $w_{t+1}^l \leftarrow w_t^l + \rho_l(t)\, s_t^l$ (perform update)
        $w_{t+1}^l \leftarrow w_{t+1}^l \, \frac{\|w_0^l\|_2}{\|w_{t+1}^l\|_2}$ (project weights back on sphere)
    end for
    $t \leftarrow t + 1$
end while

5 A study of layer prioritization during end-to-end training
Section 4.2 provides a tool (Layca) to control layer rotation rate, a tentative definition of layer-level training speed designed to facilitate control over layer prioritization during end-to-end training. In this section, we analyse how prioritizing the training of specific layers impacts generalization by varying the layer-wise learning rates used by Layca. In order to evaluate Layca’s benefit, we perform the same experiments with SGD. The experiments are conducted on a 25-layer deep VGG-style CNN [29], ResNet-32 [15] and an 11-layer deep CNN trained on CIFAR-10 [24], CIFAR-100 [24] and the Tiny ImageNet dataset [10, 9] respectively. All our networks use batch normalization [19]. More information about networks and training procedure can be found in Supplementary Material.

5.1 Layer-wise learning rate configurations
In this paper, we restrict ourselves to a static configuration of layer-wise learning rates where they exponentially increase/decrease with layer depth. The learning rate $\rho_l(t)$ of layer $l$ is parametrized by $\alpha \in [-1, 1]$ as follows:

$$\rho_l(t) = \begin{cases} (1-\alpha)^{5\frac{L-1-l}{L-1}}\, \rho(t) & \text{if } \alpha > 0 \\ (1+\alpha)^{5\frac{l}{L-1}}\, \rho(t) & \text{if } \alpha \le 0 \end{cases} \qquad (1)$$

where $l = 0, 1, ..., L-1$ is the index of the layer in forward pass ordering and $\rho(t)$ is a global learning rate schedule parametrized by $t$, the current training step. Values of α close to −1 correspond to prioritizing first layers, 0 corresponds to no prioritization, and values close to 1 to prioritization of last layers. We study 13 values of α in our experiments. Visualization of the layer-wise learning rates as a function of the studied α values is provided in Supplementary Material. The initial global learning rate, $\rho(0)$, is determined by grid search over 10 values ($3^{-7}, 3^{-6}, ..., 3^{2}$) in the α = 0 setting (SGD’s detailed results are available in Supplementary Material).

5.2 Controlling layer prioritization with Layca
Figure 2 shows, for the three tasks, how the test accuracy evolves with α when Layca is used for training. Test accuracies are only reported for α values for which at least 98% training accuracy could be obtained. First of all, we observe that prioritizing early layers (negative α values) generalizes better than prioritizing the last layers (positive α values), which is in line with the observations of Section 3. Moreover, we observe that the best test accuracies are systematically obtained for α = 0, i.e. for uniform rotation rates, without prioritization. This observation might appear contradictory with respect to Section 3, where the first layers’ superior ability to promote generalization is highlighted.
There is, however, no contradiction, as the mechanisms governing the improvement of the forward and backward signals w.r.t. generalization during training are different in isolated and simultaneous training settings. In particular, when all layers are trained simultaneously, training a layer will impact the inputs it receives (with a delay of two training steps), while it won’t when the layer is trained in isolation. A thorough explanation of our observations is left as future work. We hope that our toy example (see Section 3) will provide the necessary intuitions to understand these complex phenomena.

5.3 Comparison of Layca and SGD
The same experiment is performed when SGD is used for training, and the results are shown in Figure 2. We discuss two key differences between SGD’s and Layca’s performances, and both suggest Layca’s superior ability to control layer prioritization. First of all, while the evolution of the test accuracy as a function of α is consistent for all tasks when Layca is used, it is not for SGD. In particular, while a clear rule emerged from the Layca experiment (choosing α = 0), SGD’s experiment does not provide any convincing recommendation on how to set α to maximize generalization. Secondly, Layca is able to cover a larger range of test accuracies with the same α values and, importantly, reaches test accuracies that exceed those of SGD by on the order of 10%.

In order to analyse further the different behaviours of Layca and SGD, we monitor the rotation of each layer’s weight vector across training for α = 0. We argue in Section 4 that rotation of a layer’s weight vector constitutes the bulk of DNN training. Figure 3 shows the evolution of the cosine distance between each layer’s weight vector and its initialization (denoted by layer-wise angle deviation curves hereafter). For the three tasks, Layca training exhibits a similar pattern where most layers’ weight vectors are rotated significantly and synchronously. Our results suggest that such behaviour, induced by high and uniform layer rotation rates, is indicative of good generalization performance. SGD does not exhibit such dynamics as, first of all, gradient propagation phenomena induce different layer rotation rates even when layers use the same learning rate (especially visible on the CIFAR-10 task). Second, the total amount of rotation resulting from SGD training is significantly smaller than with Layca, although the global learning rate used is determined by grid search in both cases (especially visible on the Tiny ImageNet task).
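The layer-wise angle deviation curves described above could be monitored with a helper along the following lines. The sketch assumes the standard definition of cosine distance (one minus cosine similarity) and flattens each layer's kernel into a single vector; the function names are hypothetical and the paper's own monitoring code may differ.

```python
import numpy as np

def cosine_distance(w, w0):
    """Cosine distance (1 - cosine similarity) between two weight tensors."""
    w = np.ravel(w).astype(float)
    w0 = np.ravel(w0).astype(float)
    return 1.0 - (w @ w0) / (np.linalg.norm(w) * np.linalg.norm(w0))

def layerwise_angle_deviation(weights, initial_weights):
    """One cosine distance per layer, between current and initial weights."""
    return [cosine_distance(w, w0) for w, w0 in zip(weights, initial_weights)]

# Example: record one point of the curves for a two-layer toy model.
rng = np.random.default_rng(0)
init = [rng.normal(size=(5, 4)), rng.normal(size=(4, 2))]
current = [w + 0.1 * rng.normal(size=w.shape) for w in init]
deviations = layerwise_angle_deviation(current, init)
```

Calling `layerwise_angle_deviation` at the end of every epoch and plotting the resulting values per layer gives curves of the kind shown in Figure 3. Since the measure ignores the norm of the weights, rescaling a layer leaves its curve unchanged.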

[Figure 2 plots. Three panels (CIFAR10, CIFAR100, tiny ImageNet), each showing test accuracy as a function of α for SGD and Layca.]

Figure 2: Test accuracies obtained with different layer-wise learning rate configurations (parametrised by α) when SGD and Layca are used for training, on three different tasks. Prioritizing the training of the first layers (negative α values) generalizes better than prioritizing last layers (positive α values), reinforcing the observations of Section 3. Moreover, Layca’s superior ability to control prioritization is suggested by its consistent behaviour across tasks, and its ability to reach test accuracies that significantly outperform SGD.
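The layer-wise learning rate configurations compared in Figure 2 follow the parametrization of Section 5.1 (Equation 1). A small sketch of the per-layer multipliers applied to the global learning rate is given below; the names `layerwise_multipliers` and `GLOBAL_LR_GRID` are hypothetical, and the grid simply enumerates the 10 powers of 3 mentioned in Section 5.1.

```python
import numpy as np

def layerwise_multipliers(alpha, num_layers):
    """Per-layer factors applied to the global learning rate rho(t), Eq. (1)."""
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [-1, 1]")
    if num_layers < 2:
        raise ValueError("at least two layers are needed")
    l = np.arange(num_layers)  # layer index in forward pass ordering
    if alpha > 0:
        # Last layers prioritized: the multiplier grows with depth, up to 1.
        return (1.0 - alpha) ** (5.0 * (num_layers - 1 - l) / (num_layers - 1))
    # First layers prioritized (or no prioritization when alpha == 0).
    return (1.0 + alpha) ** (5.0 * l / (num_layers - 1))

# Grid of initial global learning rates: 3^-7, 3^-6, ..., 3^2.
GLOBAL_LR_GRID = 3.0 ** np.arange(-7, 3)

# Example: learning rates of an 11-layer network for alpha = -0.4.
rates = GLOBAL_LR_GRID[5] * layerwise_multipliers(-0.4, num_layers=11)
```

With this parametrization the most prioritized layer always uses the global learning rate itself, and the least prioritized layer uses it scaled by (1 − |α|)^5.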

[Figure 3 plots. Three columns (CIFAR10, tiny ImageNet, CIFAR100), each showing cosine distance (0 to 1) as a function of epoch, one curve per layer; legend labels: Deep, Undeep.]

Figure 3: Layer-wise angle deviation curves: evolution of the cosine distance between each layer’s weight vector and its initialization across training when all layers use the same learning rate for Layca (first row) and SGD (second row). Layca training induces a significant and synchronized rotation of the layers’ weight vectors. Our results suggest that such behaviour, induced by high and uniform layer rotation rates, is indicative of good generalization performance.

6 A second look at previous observations around deep learning
Section 5 demonstrates that layer rotation rates and Layca provide a way to monitor and control layer prioritization during training, which influences generalization. In this section, we use these tools to shed new light on previous observations concerning the training dynamics and generalization properties of deep nets.

6.1 Occurrence of plateaus in learning curves
The learning curves of most state of the art networks exhibit a curious phenomenon: the appearance of plateaus that are escaped by a reduction of the learning rate (e.g. [15]). To our knowledge, there are currently no explanations for this behaviour. While generalization is the main focus of our paper, we’ve observed through our experiments that layer rotation rate configurations are closely related to the emergence of such training plateaus. Figure 5a shows that when Layca is used on the tiny ImageNet task (see Section 5), the layer-wise learning rate configurations (parametrized by α as described in Section 5) directly influence the height of the plateaus. More specifically, we observe that the higher and more uniform the rotation rates are (α closer to 0), the higher the plateaus, or equivalently the more difficult it is to reduce the loss in the early stages of training. The fact that plateaus are the most prominent when all layers are trained with a high learning rate suggests that they are caused by some kind of interference between the layers during training. The same observation on the CIFAR-10 and CIFAR-100 tasks is presented in Supplementary Material.

6.2 Parameter-level adaptivity in deep learning
It has recently been shown that adaptive gradient methods exhibit worse generalization properties than SGD in typical deep learning applications [34]. The key characteristic of adaptive methods is their tuning of the learning rate at parameter level based on the statistics of each parameter’s partial derivative. In this section, we argue that these approaches, in typical deep learning applications, result in learning rates that differ mostly across layers and negligibly inside layers, and that the observed generalization drop is due to the different layer rotation rate configurations that emerge from these methods. When the same layer rotation rate configuration is enforced by Layca, we thus expect the drop in generalization to disappear. To verify this hypothesis, we train the same convolutional network as [34] and show that when Layca is applied with the different optimization algorithms, the test curves of adaptive methods (RMSProp [33], Adagrad [11], Adam [23]) and their non-adaptive equivalents (SGD, SGD_AMom; see footnote 3) become indistinguishable (Figure 4). Moreover, Layca enables all methods to reach the best test accuracy reported by [34].

6.3 The impact of weight decay on training of deep networks
In the experiment described in Section 6.2, SGD generalizes as well as Layca.
Moreover, in Section 6.1, we make the observation that learning curve plateaus appear during training of most state of the art networks. These observations are consistent with our previous experiments only if, with state of the art models whose meta-parameters were carefully tuned, SGD is able to generate high and uniform layer rotation rates, acquiring the key strength of Layca. We verify this by analysing the layer-wise angle deviation curves emerging from SGD training of the network used in Section 6.2. We observe in Figure 5b that indeed, training the network with SGD resulted in significant and synchronized rotation of each layer’s weight vectors. Further analysis showed that this beneficial behaviour was very sensitive to the tuning of weight decay. Indeed, training the same network without weight decay resulted in very different layer-wise angle deviation curves (Figure 5b), and a 4% drop in test accuracy. This suggests that in some cases, through some mysterious mechanism whose study we leave as future work, weight decay enables SGD to induce high and uniform layer rotation rates and thus to generalize as well as Layca. Interestingly, Layca reached SGD’s performance without the need for weight decay.

Footnote 3: SGD_AMom is a version of SGD with a momentum scheme similar to Adam (see Supplementary Material).

[Figure 4 plots. Two panels (Without Layca, With Layca), each showing test accuracy as a function of epoch for SGD, SGD_AMom, RMSprop, Adagrad and Adam.]

Figure 4: Test accuracy curves for different adaptive methods and their non-adaptive equivalents. Recent work [34] has shown that these methods differed in terms of generalization ability (left). We demonstrate that when Layca is used with the different optimization algorithms, the differences vanish (right). This result suggests that the observations in [34] were caused by the different layer rotation rates emerging from these methods. The training and layer-wise angle deviation curves are provided in Supplementary Material.

[Figure 5 plots. (a) Training loss as a function of epoch for α = 0.0, 0.1, 0.2, 0.3, 0.4 and 0.6; (b) cosine distance as a function of epoch, with weight decay and without weight decay.]

Figure 5: Left: training curves as a function of α, showing that higher and more uniform layer rotation rates induce higher plateaus, which suggests that an interference between the different layers is at the origin of such plateaus. The drop at epoch 70 is caused by a reduction of the learning rate by a factor 0.2. Right: the drastic impact of weight decay on layer-wise angle deviation curves during SGD training.

7 Conclusion
Inspired by the works on generalization and gradient propagation in deep networks, this paper tackles the following research question: how does prioritizing layers during training influence generalization? A toy experiment in Section 3 demonstrates the importance of this research direction. In order to extend our analysis to common training settings, we propose to define layer-level training speed as the rotation rate of a layer’s weights (i.e. the layer rotation rate), and develop Layca, an algorithm that provides control over it. In Section 5, we demonstrate that Layca enables a more pertinent study of the impact of layer prioritization on generalization compared to SGD, which is subject to the intricate gradient propagation in deep nets. Moreover, Layca’s built-in ability to use high and uniform layer rotation rates empowers it to significantly outperform SGD in terms of test accuracy on three different tasks. In Section 6, we show that Layca enables a precise control of the height of plateaus emerging in training curves, that Layca can eliminate the differences in generalization between adaptive methods and their non-adaptive equivalents, and finally, that state of the art models can exhibit, through tuning of the weight decay, high and uniform rotation rates, enabling SGD to generalize as well as Layca. Overall, the observations of Sections 5 and 6 provide evidence that layer rotation rates are a pertinent definition of layer-level training speed, and that as such, they are important indicators of the generalization properties that will emerge from training. Our hope is that this discovery will facilitate training of state of the art networks for practitioners, and provide guidance for theoreticians to solve deep learning’s remaining mysteries.

Acknowledgements
Special thanks to the reddit r/MachineLearning community for enabling outsiders to stay up to date with the latest discoveries and discussions of our fast-moving field.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization On Manifolds: Methods And Applications. In Recent Advances in Optimization and its Applications in Engineering, pages 125–144. Springer, 2010.
[2] Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.
[3] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the Performance of Multilayer Neural Networks for Object Recognition. In ECCV, pages 329–344, 2014.
[4] Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A Closer Look at Memorization in Deep Networks. In ICML, 2017.
[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[6] Simon Carbonnelle and Christophe De Vleeschouwer. Discovering the mechanics of hidden neurons. https://openreview.net/forum?id=H1srNebAZ, 2018.
[7] François Chollet et al. Keras, 2015.
[8] Matthieu Courbariaux and Jean-Pierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
[9] Stanford CS231N. Tiny ImageNet Visual Recognition Challenge. https://tiny-imagenet.herokuapp.com/, 2016.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, pages 248–255, 2009.
[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[12] Boris Ginsburg, Igor Gitman, and Yang You. Large Batch Training of Convolutional Networks with Layer-wise Adaptive Rate Scaling, 2018.
[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
[14] Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In NIPS, pages 1594–1602, 2015.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
[16] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. IJUFKS, 6(2):1–10, 1998.
[17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pages 1729–1739, 2017.
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks. In NIPS, 2016.
[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, pages 448–456, 2015.
[20] Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv:1711.04623, 2017.
[21] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[22] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In ICLR, 2017.
[23] Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[24] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
[25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
[26] Vinod Nair and Geoffrey E Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, pages 807–814, 2010.
[27] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML, pages 1310–1318, 2013.
[28] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.
[29] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
[30] Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-specific adaptive learning rates for deep networks. In ICMLA, pages 364–368, 2015.
[31] Leslie N Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates. arXiv:1708.07120, 2017.
[32] Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic gradient descent. In Proceedings of Second workshop on Bayesian Deep Learning (NIPS 2017), 2017.
[33] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[34] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In NIPS, pages 4151–4161, 2017.
[35] Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Normalized gradient with adaptive stepsize method for deep neural network training. arXiv:1707.04822, 2017.
[36] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires re-thinking generalization. In ICLR, 2017.
[37] Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized Direction-preserving Adam. arXiv:1709.04546, 2017.

Supplementary Material

A Additional notes

A.1 Information about models and training procedure

Please refer to the source code, provided at https://github.com/Simoncarbo/Layer-level-control-of-DNN-trainin

A.2 Some recommendations when using Layca

1. A learning rate of $3^{-3}$ (approximately 0.037) was optimal in all our experiments with Layca, and thus constitutes a good default value.
2. Using batch normalization is recommended. Early experiments suggest that removing batch normalization sometimes disables Layca’s ability to perform significant and synchronized rotation of the layers’ weight vectors.
3. Staying on plateaus for a large number of epochs (in other words, waiting before reducing the learning rate) systematically improved generalization performance (this has also been observed for SGD in [17]).
4. Layca’s operations are prone to numerical instabilities. Any NaN values that appear in the update must be replaced with zeros.
5. Layca was not evaluated for networks with additive biases. We suggest removing biases for now (also in the batch normalization layers). If you use biases anyway, do not initialize them to zero: rotation of a zero vector makes no sense.

A.3 SGD_AMom

SGD_AMom was designed for Section 6.2 as a non-adaptive equivalent of Adam. In particular, SGD_AMom uses the same momentum scheme as Adam:

$$
\begin{aligned}
v_t &= m \cdot v_{t-1} + (1 - m) \cdot g_t \\
w_t &= w_{t-1} - \rho \cdot v_t
\end{aligned}
$$

where $g_t$ is the gradient at step $t$, $\rho$ the learning rate, and $m$ the momentum parameter.
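The SGD_AMom update described above can be sketched in a few lines of NumPy. This is a minimal illustration of the two update equations as written, not the authors' implementation: the function name `sgd_amom_step`, the default momentum of 0.9, and the toy quadratic objective are assumptions made for the example. The sketch does not include Adam's bias correction, since the equations above do not show one.

```python
import numpy as np


def sgd_amom_step(w, v, grad, lr, momentum=0.9):
    """One SGD_AMom update (hypothetical helper, named for this example).

    v_t = m * v_{t-1} + (1 - m) * g_t
    w_t = w_{t-1} - rho * v_t
    """
    v_new = momentum * v + (1.0 - momentum) * grad  # moving average of gradients
    w_new = w - lr * v_new                          # step along the averaged gradient
    return w_new, v_new


# Toy usage (assumed objective): minimise f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_amom_step(w, v, grad=w, lr=0.1, momentum=0.9)
```

Unlike classical momentum, which accumulates `m * v + g`, this scheme weights the new gradient by `(1 - m)`, so `v` stays on the scale of a single gradient and the learning rate keeps a comparable meaning across values of `m`.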

B Supplementary Figures

[Figure 6 plot. Panels: "Positive α" and "Negative α". x-axis: l/L (0 to 1); y-axis: Multiplier (0 to 1); line colors: |α| from 0 to 0.8.]

Figure 6: Supplementary figure for Section 5. Visualization of the prioritization schemes parametrized by α (as defined in Section 5). The y-axis corresponds to the factor that multiplies the global learning rate. The line colors represent the absolute value of α.

[Figure 7 plot. Panels: cifar10, cifar100, tinyImagenet. x-axis: Learning rate ($3^{-7}$ to $3^{2}$); y-axis: accuracy; curves: validation, train.]

Figure 7: Supplementary figure for Section 5. Results of the grid search for finding optimal global learning rates for SGD in the experiments of Section 5. The grid search is only performed for α = 0, and the same learning rate is then used for other values of α.

[Figure 8 plot. Panels: cifar10, cifar100. x-axis: Epoch; y-axis: Training loss; curves: α = 0.0, 0.1, 0.2, 0.3, 0.4.]

Figure 8: Supplementary figure for Section 6.1. Study of the impact of the α parameter on the height of plateaus emerging during training on the CIFAR10 and CIFAR100 datasets, using the same networks and training parameters as in Section 5.

[Figure 9 plot. Two panels. x-axis: Epoch (0 to 250); y-axis: Training accuracy; curves: SGD, RMSprop, Adagrad, Adam, SGD_AMom.]

Figure 9: Supplementary figure for Section 6.2. Training curves of the different methods, without Layca (left) and with Layca (right). Adaptive methods become indistinguishable from their non-adaptive equivalents when Layca is applied.

[Figure 10 plot. Panels: Adagrad, SGD, SGD_AMom, RMSprop, Adam. x-axis: Epoch (0 to 250); y-axis: Cosine distance (0 to 1).]

Figure 10: Supplementary figure for Section 6.2. Layer-wise angle deviation curves for the different optimization methods.
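Curves of this kind can be produced by logging, for each layer, the cosine distance between its current weight vector and a reference vector. The sketch below assumes that the reference is the layer's weights at initialization, which the caption does not state. The function names, the dictionary layout, and the layer names are hypothetical.

```python
import numpy as np


def cosine_distance(w_ref, w_current):
    """Cosine distance (1 - cosine similarity) between two flattened weight tensors."""
    a = np.ravel(w_ref).astype(float)
    b = np.ravel(w_current).astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # The angle to a zero vector is undefined (e.g. zero-initialized biases).
        return float("nan")
    return 1.0 - float(np.dot(a, b) / denom)


def layer_rotations(reference_weights, current_weights):
    """Per-layer cosine distance; both arguments map layer name -> weight array."""
    return {
        name: cosine_distance(reference_weights[name], current_weights[name])
        for name in reference_weights
    }


# Toy usage with made-up layer names and random weights.
rng = np.random.default_rng(0)
init = {"conv1": rng.normal(size=(3, 3, 8)), "dense": rng.normal(size=(16, 4))}
later = {name: w + 0.1 * rng.normal(size=w.shape) for name, w in init.items()}
rotations = layer_rotations(init, later)
```

Calling `layer_rotations` once per epoch and plotting each layer's value against the epoch index gives one curve per layer. A value of 0 means the direction is unchanged, and 1 means the current weights are orthogonal to the reference.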
