Code accompanying the paper *Training Deep Learning Models with Norm-Constrained LMOs*.
- `scion.py`: Contains the `Scion` and `ScionLight` reference implementations along with various norm choices. `ScionLight` is a memory-efficient variant that reuses `p.grad`.
- `examples/`: Example usage, containing nanoGPT experiments with and without weight sharing.
The `Scion` optimizer comes with a couple of hyperparameters:

- `momentum`: The parameter is `1 - usual_momentum` of, e.g., the PyTorch implementation of SGD with momentum. A good default is 0.1. Higher values (e.g. 0.5) seem to work better for short training runs with low noise, as also supported by theory.
- `scale`: Controls the per-layer constraint radius factor. The layerwise radius can be tuned on a small proxy model, similarly to the input and output scaling factors of µP.
- `lr`: The learning rate, which can similarly be tuned on a small proxy model (corresponds to γ in the paper).
- `unconstrained`: When set to `False`, the constrained variant of Scion is used, which guarantees that the iterates stay bounded. The flag is useful for numerical stability in long training runs and for avoiding overfitting. See Section 3 of the paper for a discussion of the connection with weight decay.
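To make the momentum convention concrete, here is the implied exponential-moving-average update as a minimal sketch (`buf` and `grad` are hypothetical names; see `scion.py` for the reference implementation):

```python
# momentum = 0.1 in Scion corresponds to usual_momentum = 0.9 in PyTorch SGD.
# Hypothetical EMA-style buffer update implied by the convention above:
buf = (1 - momentum) * buf + momentum * grad
```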
Architectural changes:
- Scale activation functions (ReLU, GELU) by √2 to maintain the input variance (see the sketch below).
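A minimal sketch of such a scaled activation, assuming standard PyTorch (the module name `ScaledGELU` is ours):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledGELU(nn.Module):
    """GELU scaled by sqrt(2) to approximately preserve the input variance."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return math.sqrt(2.0) * F.gelu(x)
```

A scaled ReLU is analogous, with `F.relu` in place of `F.gelu`.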
For runnable examples, see `examples/`.
Below are some pseudocode configurations for different architectures and domains (see Appendix E.3 for exact parameter choices):
- nanoGPT with weight sharing:

  ```python
  radius = 50.0
  optim_groups = [{
      'params': model.transformer.h.parameters(),
      'norm': 'Spectral',
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': model.lm_head.parameters(),
      'norm': 'Sign',
      'norm_kwargs': {},
      'scale': radius * 60.0,
  }]
  optimizer = Scion(optim_groups, lr=2**-12, momentum=0.1, unconstrained=False)
  ```
- MLP:

  ```python
  radius = 1.0
  optim_groups = [{
      'params': input_layer,
      'norm': 'Linear',
      'norm_kwargs': {'max': True},
      'scale': radius,
  }, {
      'params': hidden_layers,
      'norm': 'Linear',
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': output_layer,
      'norm': 'Sign',
      'norm_kwargs': {'normalized': True},
      'scale': radius * 100.0,
  }]
  optimizer = Scion(optim_groups, lr=2**-6, momentum=0.1)
  optimizer.init()
  ```
- CNN:

  ```python
  radius = 1.0
  optim_groups = [{
      'params': remaining_parameters,
      'norm': 'Auto',  # picks the layerwise norm based on the parameter shape
      'norm_kwargs': {},
      'scale': radius,
  }, {
      'params': output_layer,
      'norm': 'Sign',
      'norm_kwargs': {'normalized': True},
      'scale': radius * 100.0,
  }]
  optimizer = Scion(optim_groups, lr=2**-4, momentum=0.5)
  optimizer.init()
  ```
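Once a parameter-group configuration is chosen, the optimizer drops into a standard PyTorch training loop. A minimal sketch, assuming `model`, `loss_fn`, and `dataloader` are defined elsewhere (hyperparameter values taken from the CNN configuration above):

```python
from scion import Scion

optimizer = Scion(optim_groups, lr=2**-4, momentum=0.5)
optimizer.init()  # initialize the iterates, as in the configurations above

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```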
If you find this work useful, please cite it as follows:
```bibtex
@article{pethick2025training,
  title={Training Deep Learning Models with Norm-Constrained LMOs},
  author={Pethick, Thomas and Xie, Wanyun and Antonakopoulos, Kimon and Zhu, Zhenyu and Silveti-Falls, Antonio and Cevher, Volkan},
  journal={arXiv preprint arXiv:2502.07529},
  year={2025}
}
```