Code accompanying the paper *Training Deep Learning Models with Norm-Constrained LMOs*.
- `scion.py`: Contains the `Scion` and `ScionLight` reference implementations along with various norm choices. `ScionLight` is a memory-efficient variant that reuses `p.grad`.
- `examples/`: Example usage, containing nanoGPT experiments with and without weight sharing.
The `Scion` optimizer comes with a couple of hyperparameters:

- `momentum`: The parameter is `1 - usual_momentum` of, e.g., the PyTorch implementation of SGD with momentum. A good default is 0.1. Higher values (e.g. 0.5) seem to work better for short training runs with low noise, as also supported by theory.
- `scale`: Controls the per-layer constraint radius factor. The layerwise radius can be tuned on a small proxy model, similarly to the input and output scaling factors of µP.
- `lr`: The learning rate, which can similarly be tuned on a small proxy model (corresponds to γ in the paper).
- `unconstrained`: When set to `False`, the constrained variant of Scion is used, which guarantees that the iterates stay bounded. The flag is useful for numerical stability in long training runs and for avoiding overfitting. See Section 3 of the paper for a discussion of the connection with weight decay.
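To make the momentum convention concrete, here is the implied exponential-moving-average update as a minimal sketch (`buf` and `grad` are hypothetical names; see `scion.py` for the reference implementation):

```python
# momentum = 0.1 in Scion corresponds to usual_momentum = 0.9 in PyTorch SGD.
# Hypothetical EMA-style buffer update implied by the convention above:
buf = (1 - momentum) * buf + momentum * grad
```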
Architectural changes:
- Scale activation functions (ReLU, GELU) by √2 to maintain the input variance (see the sketch below).
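A minimal sketch of such a scaled activation, assuming standard PyTorch (the module name `ScaledGELU` is ours):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledGELU(nn.Module):
    """GELU scaled by sqrt(2) to approximately preserve the input variance."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return math.sqrt(2.0) * F.gelu(x)
```

A scaled ReLU is analogous, with `F.relu` in place of `F.gelu`.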
For runnable examples, see `examples/`.
Below are some pseudocode configurations for different architectures and domains (see Appendix E.3 for exact parameter choices):
- nanoGPT with weight sharing:

  ```python
  radius = 50.0
  optim_groups = [{
      'params': model.transformer.h.parameters(),
      'norm': 'Spectral',
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': model.lm_head.parameters(),
      'norm': 'Sign',
      'norm_kwargs': {},
      'scale': radius * 60.0,
  }]
  optimizer = Scion(optim_groups, lr=2**-12, momentum=0.1, unconstrained=False)
  ```
- MLP:

  ```python
  radius = 1.0
  optim_groups = [{
      'params': input_layer,
      'norm': 'Linear',
      'norm_kwargs': {'max': True},
      'scale': radius,
  }, {
      'params': hidden_layers,
      'norm': 'Linear',
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': output_layer,
      'norm': 'Sign',
      'norm_kwargs': {'normalized': True},
      'scale': radius * 100.0,
  }]
  optimizer = Scion(optim_groups, lr=2**-6, momentum=0.1)
  optimizer.init()
  ```
- CNN:

  ```python
  radius = 1.0
  optim_groups = [{
      'params': remaining_parameters,
      'norm': 'Auto',  # picks the layerwise norm based on the parameter shape
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': output_layer,
      'norm': 'Sign',
      'norm_kwargs': {'normalized': True},
      'scale': radius * 100.0,
  }]
  optimizer = Scion(optim_groups, lr=2**-4, momentum=0.5)
  optimizer.init()
  ```
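Once a parameter-group configuration is chosen, the optimizer drops into a standard PyTorch training loop. A minimal sketch, assuming `model`, `loss_fn`, and `dataloader` are defined elsewhere (hyperparameter values taken from the CNN configuration above):

```python
from scion import Scion

optimizer = Scion(optim_groups, lr=2**-4, momentum=0.5)
optimizer.init()  # initialize the iterates, as in the configurations above

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```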
If you find this work useful, please cite it as follows:
```bibtex
@article{pethick2025training,
  title={Training Deep Learning Models with Norm-Constrained LMOs},
  author={Pethick, Thomas and Xie, Wanyun and Antonakopoulos, Kimon and Zhu, Zhenyu and Silveti-Falls, Antonio and Cevher, Volkan},
  journal={arXiv preprint arXiv:2502.07529},
  year={2025}
}
```