This repo is just my fork of ashleve/lightning-hydra-template, where I modified some setup parameters so it is ready to go directly after cloning. See the Modified from template section for the full list of changes. Otherwise, the main modifications are the following:
- I set up `hydra-submitit-launcher` for easier usage of SLURM, and add example config setups for clusters (JUWELS; Terrabyte to come).
- I commit to using the Weights & Biases (W&B) logger. Other loggers are still possible to use, but everything is set up by default for W&B.
- I use `wandb_osh` to support offline, real-time logging of my runs on W&B. In this template, setting up `wandb_osh` is as easy as this (see the sketch right after this list):
  - Switch `logger.wandb.offline` to `True`
  - Have the "Farm" running on the login node, i.e., with the command `wandb-osh`
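For illustration, a typical offline-logging session could look like this (the experiment name is just the template's example, adapt it to yours):
# On the login node: start the wandb-osh "Farm" that syncs offline runs to W&B in near real time
wandb-osh
# In your SLURM job (or another shell): launch the run with W&B set to offline
python src/train.py experiment=example logger.wandb.offline=True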
There are two different ways you can set up your new repository: by keeping track of the template, or by starting a fresh new git repo with all the files from the template.
If you plan to host your code on DLR GitLab, make sure that, when you create the new repository, you create it as a "blank project", select <your_user> (not <your_group>) in the Project URL, and uncheck "Initialize repository with a README". Also, use the HTTPS URLs.
In both cases, you first need to clone the template and rename the folder to <your_project_name>:
# Clone the template
git clone https://github.com/CedricLeon/Setup_Lightning_Hydra_template.git
# Rename the folder with your project name
mv Setup_Lightning_Hydra_template/ <your_project_name>/
cd <your_project_name>/
Then you can either delete the remote and the commit history of the template; this is the most straightforward way:
# Reset the git repository
rm -rf .git/
git init --initial-branch=main
# Add your remote
git remote add origin <your_remote_URL>
# Stage and commit all files + set origin main as upstream
git add .
git commit -m "Initial commit"
git push --set-upstream origin main
Or you can keep the remote but rename it to `template` and add a new `origin`.
I describe how to do that below, but you should know that it is just a homemade version of GitHub's template repository feature. It is less clean, but it allows hosting the new repo on a server that isn't GitHub (I didn't find a way to do that using the GitHub template feature). If someone has a cleaner way of doing it, please enlighten me.
# Rename the template remote
git remote rename origin template
# Add your new repository remote. So, yes, you need to create it before
git remote add origin <your_remote_URL>
git remote -v
# Synchronize your (empty new repo) with a rebase to avoid non-fast-forward errors
git pull --rebase origin main
# Push the commit history and all the template files on the new repo (also set the origin/main branch as upstream)
git push --set-upstream origin main
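With this setup, you can later pull updates from the template into your own project. A minimal sketch, assuming the template's default branch is main (expect to resolve a few conflicts):
# Fetch the latest template commits and merge them into your branch
git fetch template
git merge template/main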
# Force python version 3.11 for compatibility reasons (pytorch)
conda create -n <your_env_name> python=3.11
conda activate <your_env_name>
# /!\ install pytorch with GPU support, see https://pytorch.org/get-started/
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install requirements
pip install -r requirements.txt
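A quick sanity check that the GPU build of PyTorch was picked up (should print True on a GPU node):
# Verify that PyTorch was installed with CUDA support
python -c "import torch; print(torch.cuda.is_available())"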
If you don't know pre-commit hooks, they do exactly what the name suggests: they keep you from committing stupid typos and perform code linting for you in the background. Check the docs for more details.
So, in case you deleted `.git` after cloning the template, you have to reinstall pre-commit.
It's also a good idea to run it against all files (if you have any) for the first time.
pre-commit install
pre-commit run --all-files
You can test that pre-commit is nicely setup with a dummy commit, or just by committing the changes of the next sections.
Note: if you are using the VSCode commit system, the output logs are redirected to the `OUTPUT/Git` console. Nevertheless, you should still get an error message if you messed something up. Spoiler: the error message is not helpful, but it redirects you towards the git logs.
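If you want a throwaway test for the dummy commit mentioned above, something like this works (the file name is arbitrary, and it assumes the template's default hooks, e.g., trailing-whitespace, are configured):
# Commit a dummy file with trailing whitespace: the hook should reject the commit
# and fix the file, which you can then re-stage and commit (or just delete)
echo "dummy line   " > precommit_test.txt
git add precommit_test.txt
git commit -m "test: trigger pre-commit hooks"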
I have set some project-specific parameters to generic names (e.g., `logger.wandb.project: "lightning-hydra-template"`). Here is a list you should check and replace:
- Documentation: Change the title of this `README.md` (and most likely also delete the crap I wrote 😉)
- W&B: In `configs/logger/wandb.yaml`, `team: "my-wandb-team"` and `project: "lightning-hydra-template"` (see the excerpt a few lines below)
- Submitit (if you plan to use multiruns):
  - In `configs/hydra/launcher/`, change your account settings in the different cluster setups: `account: "your_juwels_project"` (if necessary, also update your favorite `partition`).
  - You can specify the launcher through your command, with the option `hydra.launcher.partition=juwels_single_gpu` for example.
  - Otherwise, in the experiment file, add a configuration for `hydra-submitit-launcher`:
# Just after defaults:
- override /hydra/launcher: juwels_single_gpu # for example
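Coming back to the W&B point above, only the two values discussed need touching; the rest of `configs/logger/wandb.yaml` can stay as it is in the template. An excerpt of the idea:
# configs/logger/wandb.yaml (excerpt)
wandb:
  team: "my-wandb-team"                # replace with your W&B team
  project: "lightning-hydra-template"  # replace with your project name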
As a general comment, I advise running a mock run (/!\ not with `debug=fdr` /!\, it hides most of the config) and having a careful look at your config. @TODO
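One convenient way to take that careful look is Hydra's built-in `--cfg` flag, which prints the composed config and exits without launching anything:
# Print the fully composed job config and exit (no training is started)
python src/train.py experiment=example --cfg job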
You can try running a 10-epoch training of a SimpleDenseNet on the MNIST classification problem to check that everything runs smoothly. If you are already logged in to W&B on your system, you should not need to do anything else for the setup to be complete.
# Run on cpu by default
python src/train.py experiment=example
# If on a cluster, you can open an interactive session and run on gpu
python src/train.py experiment=example trainer=gpu
# Otherwise, you can run in "multirun" mode from the login node
# /!\ Remember to specify the submitit launcher, and if necessary to set the run `offline`, otherwise W&B will crash the run /!\
python src/train.py -m experiment=example trainer=gpu hydra.launcher.partition=develbooster # or logger.wandb.offline=True
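Since `--multirun` is already in play, note that it also sweeps over comma-separated values, submitting one SLURM job per combination. A hypothetical example, assuming the template's default MNIST model:
# Sweep two learning rates in one command: each value becomes its own job
python src/train.py -m experiment=example trainer=gpu logger.wandb.offline=True hydra.launcher.partition=develbooster model.optimizer.lr=0.001,0.0001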
@TODO: refine the usage examples with how I use experiments, etc.
The method I describe below is my preferred way of using this template. Of course, that's only a theory and you are free to organize yourself differently, the repository is very flexible. However, after trying out different setups I often found myself lost, e.g., trying to find out why a parameter kept its old value when I was overriding it. In any case, the Hydra documentation is your best friend. Now that you are warned, here are my best practices.
In short, I recommend always creating runs from an experiment config. This enforces better hierarchy and organization, while having the advantage of grouping "all" modifications in a single file, making modifications easy.
See below an example of running with a chosen experiment configuration from `configs/experiment/`:
python src/train.py experiment=example
From here, you can override minor parameters from the CLI for a quick check or a specific run:
python src/train.py experiment=example trainer.max_epochs=2
Whenever you find yourself running similar commands several times with a high number of overrides, it is probably a good time to create a new `experiment.yaml`.
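For reference, a new experiment file is just another config under `configs/experiment/`. A minimal sketch, loosely mirroring the template's `example.yaml` (names and values are placeholders):
# configs/experiment/my_experiment.yaml
# @package _global_
# run with: python src/train.py experiment=my_experiment

defaults:
  - override /data: mnist
  - override /model: mnist
  - override /trainer: gpu
  - override /logger: wandb

tags: ["mnist", "my_experiment"]
seed: 12345

trainer:
  max_epochs: 10

model:
  optimizer:
    lr: 0.001

data:
  batch_size: 64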
Sometimes you might want to change big parts of your experiment config without redefining a new experiment; in that case, you can override Config Group options. Non-exhaustive examples include estimating results on a different dataset, checking run time on different hardware, or logging to csv because you're a boomer.
# Train on CPU
python src/train.py experiment=example trainer=cpu
# Quickly test another dataset
python src/train.py experiment=example data=kodak
# Change the logger
python src/train.py experiment=example logger=csv
Debugging is an instance of the previous case, where you override the debug package from the CLI. However, it's so common and important that it deserves its own section.
Firstly, whenever you specify `debug`, there won't be any logging or callbacks, and the run will be executed without multithreading on CPU.
The best example is the `fast_dev_run` option of the Lightning Trainer, which will run 1 step of training, validation, and test. This is what I use 99% of the time.
python src/train.py experiment=example debug=fdr
If you still want some logging, or want to debug on GPU, etc., you can always specify that after your debug setup.
python src/train.py experiment=example debug=default trainer=gpu
This section simply lists the major changes I brought to the original template ashleve/lightning-hydra-template. It's also here that I give a big shoutout to ashleve: in addition to the impressive work behind such a repo, he is also on most of the Issues and PRs I came across when I was setting up this fork.
- Add deterministic training support (can be unset from config)
- Add the W&B offline management using `wandb_osh` (automatically adds the Lightning Callback when the run is set offline)
- Redirect logs to subdirectories specific to each experiment (see `task_name`)
- Automate job submission on cluster using `hydra-submitit-launcher` through `--multirun` mode
- Uncomment my favorite logger (wandb) in `environment.yaml` as well as in `requirements.txt`
- Add additional requirements: `hydra-submitit-launcher`, `torchgeo`, `wandb_osh` (Wandb Offline Sync Hook)
- Uncomment `sh` in `requirements.txt` to allow the tests in `test_sweeps.py`
- To execute all tests (requires GPUs): run `pytest` on a compute node (e.g., with an interactive session) to validate `@RunIf(min_gpus=1)` in `test_train.py` (make sure PyTorch is installed with GPU support) => all tests get executed and none skipped.
- I removed the MacOS and Windows deployment tests, as well as most of the different versions of Python tested (reason: save compute resources)
- The tests executed in CI/CD are the `"not slow"` ones, for the same reason mentioned above
- Add a submitit setup for Terrabyte
- Increase test coverage, and provide classic examples to test Lightning DataModules and Modules.
- Upgrade and "automate" the `task_name` parameter generation (see the sketch right after this list):
  - Either by using a specific name parameter in each Config Group option (config file) and `**kwargs` in the corresponding Modules.
  - Or by making it general and global in the "root" config file using the Hydra interpolation system. Not that easy, because it's impossible to interpolate in the Default List, see this stackoverflow.
- Add my Lightning Callback plotting reconstructions/predictions every $N$ epochs
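For the interpolation route above, the idea would roughly be the following (hypothetical sketch; it assumes each data/model Config Group option defines a `name` field, which the template does not do out of the box):
# In the "root" config (e.g., configs/train.yaml): build task_name from other config values
task_name: "${data.name}_${model.name}"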
@TODO: Specify how to add tests and provide examples. But nobody likes testing.
# run all tests
pytest
# run all tests except the ones marked as slow
pytest -k "not slow"