Scion

Code accompanying the paper Training Deep Learning Models with Norm-Constrained LMOs.

Repository structure

  • scion.py: Contains the Scion and ScionLight reference implementations along with various norm choices. ScionLight is a memory-efficient variant that reuses p.grad (a usage sketch follows this list).
  • examples/: Example usage containing nanoGPT experiments with and without weight sharing.
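
A minimal usage sketch, assuming Scion follows the standard torch.optim.Optimizer interface; model, loader, loss_fn, and the hyperparameter values below are placeholders, not recommendations:

    from scion import Scion  # reference implementation in scion.py

    optimizer = Scion(model.parameters(), lr=2**-10, momentum=0.1)
    optimizer.init()  # initialization step used in the examples below
    for inputs, targets in loader:
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()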

Notes

The Scion optimizer exposes a few hyperparameters:

  • momentum: This parameter is 1 - usual_momentum, i.e. the complement of the momentum coefficient in, e.g., the PyTorch implementation of SGD with momentum (see the sketch after this list). A good default is 0.1. Higher values (e.g. 0.5) seem to work better for short training runs with low noise, as also supported by theory.
  • scale: Controls the per-layer constraint radius factor. The layerwise radius can be tuned on a small proxy model, similarly to the input and output scaling factors of µP.
  • lr: The learning rate can similarly be tuned on a small proxy model (corresponds to γ in the paper).
  • unconstrained: When set to False, the constrained variant of Scion is used, which guarantees that the iterates stay bounded. This is useful for numerical stability in long training runs and for avoiding overfitting. See Section 3 of the paper for a discussion of the connection with weight decay.
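
The momentum convention in code (a two-line illustration; beta here is the conventional momentum coefficient, not a Scion argument):

    # Scion's `momentum` equals 1 - beta, where beta is the usual momentum
    # coefficient as used by, e.g., torch.optim.SGD.
    beta = 0.9                   # conventional momentum value
    scion_momentum = 1.0 - beta  # -> 0.1, the suggested Scion default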

Architectural changes:

  • Scale activation functions (ReLU, GELU) by √2 to maintain the input variance, as in the sketch below.
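
A minimal PyTorch sketch of such a scaled activation (the class name is illustrative and not part of this repository):

    import math
    import torch
    import torch.nn as nn

    class ScaledReLU(nn.Module):
        # Scale ReLU by sqrt(2) so that, for zero-mean Gaussian inputs,
        # the activation preserves the input's second-moment scale.
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return math.sqrt(2.0) * torch.relu(x)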

Examples

For runnable examples, see examples/. Below are some pseudocode configurations for different architectures and domains (see Appendix E.3 of the paper for exact parameter choices):

  • nanoGPT with weight sharing:

    radius = 50.0
    optim_groups = [{
        'params': model.transformer.h.parameters(),
        'norm': 'Spectral',
        'norm_kwargs': {},
        'scale': radius,
    }, {
        'params': model.lm_head.parameters(),
        'norm': 'Sign',
        'norm_kwargs': {},
        'scale': radius*60.0,
    }]
    optimizer = Scion(optim_groups, lr=2**-12, momentum=0.1, unconstrained=False)
  • MLP:

    radius = 1.0
    optim_groups = [{
        'params': input_layer,
        'norm': 'Linear',
        'norm_kwargs': {'max': True},
        'scale': radius,
    }, {
        'params': hidden_layers,
        'norm': 'Linear',
        'norm_kwargs': {},
        'scale': radius,
    }, {
        'params': output_layer,
        'norm': 'Sign',
        'norm_kwargs': {'normalized': True},
        'scale': radius*100.0,
    }]
    optimizer = Scion(optim_groups, lr=2**-6, momentum=0.1)
    optimizer.init()
  • CNN:

    radius = 1.0
    optim_groups = [{
        'params': remaining_parameters,
        'norm': 'Auto', # Picks layerwise norm based on the parameter shape
        'norm_kwargs': {},
        'scale': radius,
    }, {
        'params': output_layer,
        'norm': 'Sign',
        'norm_kwargs': {'normalized': True},
        'scale': radius*100.0,
    }]
    optimizer = Scion(optim_groups, lr=2**-4, momentum=0.5)
    optimizer.init()

Citation

If you find this work useful, please cite it as follows:

@article{pethick2025training,
  title={Training Deep Learning Models with Norm-Constrained LMOs},
  author={Pethick, Thomas and Xie, Wanyun and Antonakopoulos, Kimon and Zhu, Zhenyu and Silveti-Falls, Antonio and Cevher, Volkan},
  journal={arXiv preprint arXiv:2502.07529},
  year={2025}
}
