@@ -73,10 +73,8 @@ Also, most of the captures are taken from ``Ranger21`` paper.
Adaptive Gradient Clipping (AGC)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- | This idea originally proposed in ``NFNet (Normalized-Free Network)``
- paper.
- | AGC (Adaptive Gradient Clipping) clips gradients based on the
- ``unit-wise ratio of gradient norms to parameter norms ``.
+ | This idea was originally proposed in the ``NFNet (Normalizer-Free Network)`` paper.
+ | AGC (Adaptive Gradient Clipping) clips gradients based on the ``unit-wise ratio of gradient norms to parameter norms``.
- code :
`github <https://github.com/deepmind/deepmind-research/tree/master/nfnets >`__
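As a rough illustration of the clipping rule described above, here is a minimal PyTorch sketch of unit-wise gradient clipping. The function name ``agc_`` and the default ``clip_factor`` / ``eps`` values are assumptions for the example; the NFNet repository linked above contains the reference implementation.

.. code-block:: python

    import torch

    def agc_(param: torch.Tensor, clip_factor: float = 1e-2, eps: float = 1e-3) -> None:
        """Clip ``param.grad`` in place using the unit-wise ratio of gradient norm to parameter norm."""
        if param.grad is None:
            return
        # unit-wise: one norm per output unit (per first-dimension slice) for >= 2D tensors
        dims = tuple(range(1, param.ndim))
        if dims:
            p_norm = param.detach().norm(dim=dims, keepdim=True).clamp(min=eps)
            g_norm = param.grad.norm(dim=dims, keepdim=True)
        else:
            p_norm = param.detach().norm().clamp(min=eps)
            g_norm = param.grad.norm()
        max_norm = p_norm * clip_factor
        # rescale only the units whose gradient norm exceeds the allowed maximum
        scale = torch.where(g_norm > max_norm, max_norm / g_norm.clamp(min=1e-6), torch.ones_like(g_norm))
        param.grad.mul_(scale)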
@@ -99,8 +97,7 @@ centralizing the gradient to have zero mean.
Softplus Transformation
~~~~~~~~~~~~~~~~~~~~~~~
- By running the final variance denom through the softplus function, it
- lifts extremely tiny values to keep them viable.
+ Running the final variance denominator through the softplus function lifts extremely tiny values, keeping the update numerically stable.
- paper : `arXiv <https://arxiv.org/abs/1908.00700 >`__
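A hedged sketch of where this transformation sits in an Adam-style update: the square root of the bias-corrected second moment is passed through softplus instead of adding ``eps``. The helper name ``softplus_denom`` and the ``beta`` value are illustrative assumptions; see the paper linked above for the recommended setting.

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def softplus_denom(v_hat: torch.Tensor, beta: float = 50.0) -> torch.Tensor:
        # A plain Adam denominator would be ``v_hat.sqrt() + eps``.
        # softplus leaves large values nearly unchanged but lifts values close to
        # zero, so a tiny second-moment estimate cannot blow up the step size.
        return F.softplus(v_hat.sqrt(), beta=beta)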
@@ -123,8 +120,7 @@ Positive-Negative Momentum
| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/positive_negative_momentum.png |
+--------------------------------------------------------------------------------------------------------------------+
- - code :
- `github <https://github.com/zeke-xie/Positive-Negative-Momentum >`__
+ - code : `github <https://github.com/zeke-xie/Positive-Negative-Momentum >`__
- paper : `arXiv <https://arxiv.org/abs/2103.17182 >`__
Linear learning-rate warm-up
@@ -143,8 +139,7 @@ Stable weight decay
| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/stable_weight_decay.png |
+-------------------------------------------------------------------------------------------------------------+
- - code :
- `github <https://github.com/zeke-xie/stable-weight-decay-regularization >`__
+ - code : `github <https://github.com/zeke-xie/stable-weight-decay-regularization >`__
- paper : `arXiv <https://arxiv.org/abs/2011.11152 >`__
Explore-exploit learning-rate schedule
@@ -154,18 +149,14 @@ Explore-exploit learning-rate schedule
| .. image:: https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/explore_exploit_lr_schedule.png |
+---------------------------------------------------------------------------------------------------------------------+
-
- - code :
- `github <https://github.com/nikhil-iyer-97/wide-minima-density-hypothesis >`__
+ - code : `github <https://github.com/nikhil-iyer-97/wide-minima-density-hypothesis >`__
- paper : `arXiv <https://arxiv.org/abs/2003.03977 >`__
Lookahead
~~~~~~~~~
- | ``k`` steps forward, 1 step back. ``Lookahead`` consisting of keeping
- an exponential moving average of the weights that is
- | updated and substituted to the current weights every ``k_{lookahead}``
- steps (5 by default).
+ | ``k`` steps forward, 1 step back. ``Lookahead`` consists of keeping an exponential moving average of the weights that is
+ | updated and substituted to the current weights every ``k_{lookahead}`` steps (5 by default).
- code : `github <https://github.com/alphadl/lookahead.pytorch >`__
- paper : `arXiv <https://arxiv.org/abs/1907.08610v2 >`__
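A minimal sketch of the "``k`` steps forward, 1 step back" loop, assuming a hypothetical wrapper class ``LookaheadSketch`` and an ``alpha`` interpolation factor of 0.5; the repository linked above is the reference implementation.

.. code-block:: python

    import torch

    class LookaheadSketch:
        """Slow weights track the fast weights and replace them every ``k`` steps."""

        def __init__(self, base_optimizer: torch.optim.Optimizer, k: int = 5, alpha: float = 0.5):
            self.base, self.k, self.alpha, self.counter = base_optimizer, k, alpha, 0
            # slow weights start as a detached copy of the current (fast) weights
            self.slow = [[p.detach().clone() for p in g["params"]] for g in base_optimizer.param_groups]

        def zero_grad(self) -> None:
            self.base.zero_grad()

        def step(self) -> None:
            self.base.step()                  # one "fast" step with the inner optimizer
            self.counter += 1
            if self.counter % self.k == 0:    # every k steps: one step "back"
                with torch.no_grad():
                    for group, slow_group in zip(self.base.param_groups, self.slow):
                        for fast, slow in zip(group["params"], slow_group):
                            slow.add_(fast - slow, alpha=self.alpha)  # exponential moving average of the weights
                            fast.copy_(slow)                          # substitute the slow weights back in

Wrapping, e.g., ``LookaheadSketch(torch.optim.SGD(model.parameters(), lr=0.1), k=5)`` keeps the inner optimizer's behaviour for the fast steps and only intervenes on every ``k``-th call.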
@@ -180,10 +171,8 @@ Acceleration via Fractal Learning Rate Schedules
(Adaptive) Sharpness-Aware Minimization (A/SAM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- | Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value
- and loss sharpness.
- | In particular, it seeks parameters that lie in neighborhoods having
- uniformly low loss.
+ | Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
+ | In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
- SAM paper : `paper <https://arxiv.org/abs/2010.01412 >`__
- ASAM paper : `paper <https://arxiv.org/abs/2102.11600 >`__
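A hedged sketch of the two-pass SAM step: a first backward pass gives the ascent direction, the weights are perturbed by ``rho`` along it, a second backward pass at the perturbed point supplies the gradient that is actually applied, and the weights are restored before the base optimizer steps. The ``sam_step`` name, its signature, and the ``rho`` default are assumptions for illustration, not this library's interface.

.. code-block:: python

    import torch

    def sam_step(model: torch.nn.Module, base_optimizer: torch.optim.Optimizer,
                 loss_fn, inputs, targets, rho: float = 0.05) -> torch.Tensor:
        base_optimizer.zero_grad()

        # 1) gradient at the current weights (ascent direction)
        loss = loss_fn(model(inputs), targets)
        loss.backward()

        params = [p for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params]))
        scale = rho / (grad_norm + 1e-12)

        # 2) climb to the (approximate) worst point in the rho-neighbourhood
        eps_list = []
        with torch.no_grad():
            for p in params:
                eps = p.grad * scale
                p.add_(eps)
                eps_list.append(eps)
        base_optimizer.zero_grad()

        # 3) gradient of the loss at the perturbed weights
        loss_fn(model(inputs), targets).backward()

        # 4) restore the original weights, then descend with the sharpness-aware gradient
        with torch.no_grad():
            for p, eps in zip(params, eps_list):
                p.sub_(eps)
        base_optimizer.step()
        return loss.detach()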