[DO NOT MERGE] Create temporary files for doc review #30
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
# Execute NeMo-Run | ||
|
||
After configuring NeMo-Run, the next step is to execute it. NeMo-Run decouples configuration from execution, allowing you to configure a function or task once and then execute it across multiple environments. You can execute a single task, or run multiple tasks simultaneously on different remote clusters and manage them under one experiment. This brings us to the core building blocks for execution: `run.Executor` and `run.Experiment`. | ||
|
||
Each execution of a single configured task requires an executor. NeMo-Run provides `run.Executor`, a set of APIs for configuring your remote executor and setting up the packaging of your code. The following executors are currently supported: | ||
Review comment: fix punctuation. Each execution of a single configured task requires an executor. Nemo-Run provides |
||
- `run.LocalExecutor` | ||
- `run.SlurmExecutor` with an optional `SSHTunnel` for executing on Slurm clusters from your local machine | ||
- `run.SkypilotExecutor` (available under the optional feature `skypilot` in the Python package) | ||
|
||
A task paired with an executor forms an execution unit. A key goal of NeMo-Run is to let you mix and match tasks and executors to define execution units arbitrarily. | ||
|
||
Once an execution unit is created, the next step is to run it. The `run.run` function executes a single task, whereas `run.Experiment` offers more fine-grained control for defining complex experiments. `run.run` wraps `run.Experiment` with a single task. `run.Experiment` is an API for launching and managing multiple tasks, all in pure Python. | ||
`run.Experiment` takes care of storing the run metadata, launching the run on the specified cluster, and syncing the logs. It also provides management tools to inspect and reproduce past experiments. `run.Experiment` is inspired by [xmanager](https://github.com/google-deepmind/xmanager/tree/main) and uses [TorchX](https://pytorch.org/torchx/latest/) under the hood to handle execution. | ||
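
The two entry points described above can be sketched roughly as follows. This is a minimal illustration, assuming NeMo-Run is installed and importable as `nemo_run`; the `add` function, the experiment name, and the task names are hypothetical placeholders, and the calls used (`run.Partial`, `run.run`, `run.Experiment`, `exp.add`, `exp.run`) should be checked against the API reference for your installed version.

```python
def add(a: int, b: int) -> int:
    """Hypothetical task used only for illustration."""
    result = a + b
    print(f"{a} + {b} = {result}")
    return result


def run_single_task() -> None:
    # Imported lazily so this file can be loaded without NeMo-Run installed.
    import nemo_run as run

    task = run.Partial(add, a=1, b=2)            # configure the task once
    run.run(task, executor=run.LocalExecutor())  # execute a single task


def run_experiment() -> None:
    import nemo_run as run

    executor = run.LocalExecutor()
    # Group several tasks under one experiment (the name is a placeholder).
    with run.Experiment("add-experiment") as exp:
        exp.add(run.Partial(add, a=1, b=2), executor=executor, name="first")
        exp.add(run.Partial(add, a=3, b=4), executor=executor, name="second")
        exp.run()
```

Calling `run_single_task()` exercises the `run.run` path, while `run_experiment()` shows how the same configured function can be added several times to one experiment. Swapping `run.LocalExecutor()` for another executor leaves the task definitions unchanged.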
|
||
> **_NOTE:_** NeMo-Run assumes familiarity with Docker and uses a Docker image as the environment for remote execution. This means you must provide a Docker image that includes all necessary dependencies and configurations when using a remote executor. | ||
|
||
> **_NOTE:_** All experiment metadata is stored in the directory named by the `NEMORUN_HOME` environment variable on the machine where you launch the experiments. By default, `NEMORUN_HOME` is `~/.run`. Be sure to change this according to your needs. | ||
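
To check where metadata will land before launching, a small helper like the one below can be used. It is a sketch that only mirrors the documented default (`~/.run`) and does not reproduce NeMo-Run's internal lookup; the function name `nemorun_home` is hypothetical.

```python
import os
from pathlib import Path


def nemorun_home() -> Path:
    """Return the directory named by NEMORUN_HOME, or ~/.run if it is unset."""
    return Path(os.environ.get("NEMORUN_HOME", "~/.run")).expanduser()


print(nemorun_home())
```

To store metadata elsewhere, set the variable in your shell before launching, for example `export NEMORUN_HOME=/path/to/experiments` (the path is a placeholder).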
|
||
## Executors | ||
Executors are dataclasses that configure your remote executor and set up the packaging of your code. All supported executors inherit from the base class `run.Executor`, but have configuration parameters specific to their execution environment. There is an initial cost to understanding the specifics of your executor and setting it up, but this effort is easily amortized over time. | ||
|
||
Each `run.Executor` has two attributes: `packager` and `launcher`. The `packager` specifies how to package the code for execution, while the `launcher` determines which tool to use for launching the task. | ||
|
||
### Launchers | ||
The following launchers are supported: | ||
- `default` or `None`: Launches your task directly, without any special launcher. Set `executor.launcher = None` (the default value) if you don't want to use a specific launcher. | ||
- `torchrun` or `run.Torchrun`: Launches the task using `torchrun`. See the `Torchrun` class for configuration options. To use it, set `executor.launcher = "torchrun"` or `executor.launcher = Torchrun(...)`. | ||
- `ft` or `run.core.execution.FaultTolerance`: Launches the task using NVIDIA's fault-tolerant launcher. See the `FaultTolerance` class for configuration options. To use it, set `executor.launcher = "ft"` or `executor.launcher = FaultTolerance(...)`. | ||
|
||
> **_NOTE:_** Launchers may not work well with `run.Script`. Please report any issues at https://github.com/NVIDIA/NeMo-Run/issues. | ||
|
||
### Packagers | ||
|
||
The packager support matrix is described below: | ||
|
||
| Executor | Packagers | | ||
|----------|----------| | ||
| LocalExecutor | run.Packager | | ||
| SlurmExecutor | run.GitArchivePackager | | ||
| SkypilotExecutor | run.GitArchivePackager | | ||
|
||
`run.Packager` is a passthrough base packager. `run.GitArchivePackager` uses `git archive` to package your code. Refer to the API reference for `run.GitArchivePackager` to see the exact mechanics of packaging using `git archive`. | ||
At a high level, it works in the following way: | ||
1. base_path = `git rev-parse --show-toplevel`. | ||
2. Optionally define a subpath as `base_path/GitArchivePackager.subpath` by setting `subpath` attribute on `GitArchivePackager`. | ||
3. `cd base_path && git archive --format=tar.gz --output={output_file} {GitArchivePackager.subpath}:{subpath}` | ||
|
||
The extracted contents of this tar file become the working directory for your job. As an example, given the following directory structure with `subpath="src"`: | ||
``` | ||
- docs | ||
- src | ||
  - your_library | ||
- tests | ||
``` | ||
Your working directory at the time of execution will look like: | ||
``` | ||
- your_library | ||
``` | ||
If you're executing a Python function, this working directory will automatically be included in your Python path. | ||
|
||
> **_NOTE:_** `git archive` doesn't package uncommitted changes. In the future, we may add support for including uncommitted changes while honoring `.gitignore`. | ||
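
The packaging steps above can be sketched as a small command builder. This is an illustration only, not the packager's implementation: the function name `git_archive_command` and the `ref` parameter (defaulting to `HEAD`) are assumptions. It relies on standard git syntax, where `<ref>:<path>` names a subtree of a commit, so archiving `HEAD:src` produces an archive whose top level is the contents of `src`.

```python
import shlex
from typing import List


def git_archive_command(output_file: str, subpath: str = "", ref: str = "HEAD") -> List[str]:
    """Build a `git archive` invocation for a commit or one of its subtrees.

    `ref` is an assumption made for illustration; see the API reference for
    `run.GitArchivePackager` for the exact mechanics.
    """
    # "<ref>:<path>" selects a subtree; a bare ref archives the whole repository.
    tree_ish = f"{ref}:{subpath}" if subpath else ref
    return ["git", "archive", "--format=tar.gz", f"--output={output_file}", tree_ish]


cmd = git_archive_command("/tmp/code.tar.gz", subpath="src")
print(shlex.join(cmd))
# prints: git archive --format=tar.gz --output=/tmp/code.tar.gz HEAD:src
```

Because the archive is built from a commit rather than from the working tree, uncommitted edits are not part of it, which is why changes need to be committed before launching.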
|
||
### Defining Executors | ||
Review comment: change title to an imperative verb. Define Executors |
||
The following sections describe how to set up each executor. | ||
|
||
#### LocalExecutor | ||
|
||
The LocalExecutor is the simplest executor. It executes your task locally in a separate process or group from your current working directory. | ||
|
||
The easiest way to define one is to call `run.LocalExecutor()`. | ||
|
||
#### SlurmExecutor | ||
|
||
The SlurmExecutor enables launching the configured task on a Slurm cluster with Pyxis. Additionally, you can configure a `run.SSHTunnel`, which enables you to execute tasks on the Slurm cluster from your local machine while NeMo-Run manages the SSH connection for you. This setup supports use cases such as launching the same task on multiple Slurm clusters. | ||
|
||
Below is an example of configuring a `SlurmExecutor`: | ||
```python | ||
import nemo_run as run | ||
|
||
# DEFAULT_IMAGE, common_envs(), and mounts_for_your_hubs() are placeholders | ||
# that you define for your own environment. | ||
def your_slurm_executor(nodes: int = 1, container_image: str = DEFAULT_IMAGE): | ||
    # SSH tunnel to use when launching from your local machine | ||
    ssh_tunnel = run.SSHTunnel( | ||
        host="your-slurm-host", | ||
        user="your-user", | ||
        job_dir="directory-to-store-runs-on-the-slurm-cluster", | ||
        identity="optional-path-to-your-key-for-auth", | ||
    ) | ||
    # Local tunnel to use if you're already on the cluster; | ||
    # pass tunnel=local_tunnel below in that case | ||
    local_tunnel = run.LocalTunnel() | ||
|
||
    packager = run.GitArchivePackager( | ||
        # This will also be the working directory in your task. | ||
        # If empty, the working directory will be the top level of your git repo. | ||
        subpath="optional-subpath-from-toplevel-of-your-git-repo" | ||
    ) | ||
|
||
    executor = run.SlurmExecutor( | ||
        # Most of these parameters are specific to Slurm | ||
        account="your-account", | ||
        partition="your-partition", | ||
        ntasks_per_node=8, | ||
        gpus_per_node=8, | ||
        nodes=nodes, | ||
        tunnel=ssh_tunnel, | ||
        container_image=container_image, | ||
        time="00:30:00", | ||
        env_vars=common_envs(), | ||
        container_mounts=mounts_for_your_hubs(), | ||
        packager=packager, | ||
    ) | ||
    return executor | ||
|
||
# You can then create the executor in your script like this | ||
executor = your_slurm_executor(nodes=8, container_image="your-nemo-image") | ||
``` | ||
|
||
Use the SSH Tunnel when launching from your local machine, or the Local Tunnel if you’re already on the Slurm cluster. | ||
|
||
#### SkypilotExecutor | ||
This executor is used to configure [Skypilot](https://skypilot.readthedocs.io/en/latest/docs/index.html). Make sure Skypilot is installed and at least one cloud is configured using `sky check`. | ||
Review comment: fix spacing. This executor is used to configure Skypilot. Make sure Skypilot is installed and at least one cloud is configured using |
||
|
||
Here's an example of the `SkypilotExecutor` for Kubernetes: | ||
```python | ||
# common_envs() is a placeholder that you define for your own environment. | ||
def your_skypilot_executor(nodes: int, devices: int, container_image: str): | ||
    return run.SkypilotExecutor( | ||
        gpus="RTX5880-ADA-GENERATION", | ||
        gpus_per_node=devices, | ||
        nodes=nodes, | ||
        env_vars=common_envs(), | ||
        container_image=container_image, | ||
        cloud="kubernetes", | ||
        # Optional, to reuse a Skypilot cluster | ||
        cluster_name="tester", | ||
        setup=""" | ||
    conda deactivate | ||
    nvidia-smi | ||
    ls -al ./ | ||
    """, | ||
    ) | ||
|
||
# You can then create the executor in your script like this | ||
executor = your_skypilot_executor(nodes=8, devices=8, container_image="your-nemo-image") | ||
``` | ||
|
||
As demonstrated in the examples, defining executors in Python offers great flexibility. You can easily mix and match things like common environment variables, and the separation of tasks from executors enables you to run the same configured task on any supported executor. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
.. NeMo-Run documentation master file, created by | ||
sphinx-quickstart on Thu Jul 25 17:57:46 2024. | ||
You can adapt this file completely to your liking, but it should at least | ||
contain the root `toctree` directive. | ||
|
||
NeMo-Run documentation | ||
Review comment: revise title and remove "documentation." Suggest changing the title to "Get Started with NeMo-Run". Get Started with NeMo-Run |
||
====================== | ||
|
||
NeMo-Run is a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments. NeMo-Run has three core responsibilities: | ||
Review comment: Fix punctuation and capitalization. NeMo-Run is a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments. NeMo-Run has three core responsibilities: |
||
|
||
1. `Configuration <./guides/configuration.html>`_ | ||
2. `Execution <./guides/execution.html>`_ | ||
3. `Management <./guides/management.html>`_ | ||
|
||
Please click on each link to learn more. | ||
This is also the typical order that NeMo-Run users follow to set up and launch experiments. | ||
Review comment on lines +15 to +16: revise: Please click on each link to learn more. This sequence also represents the typical order that NeMo-Run users follow to set up and launch experiments. |
||
|
||
.. toctree:: | ||
:glob: | ||
:maxdepth: 1 | ||
|
||
api/index.rst | ||
faq* | ||
|
||
Installation | ||
------------ | ||
Review comment on lines +25 to +26: revise title to an imperative verb. Install NeMo-Run |
||
To install the project, use the following command: | ||
|
||
``pip install git+https://github.com/NVIDIA/NeMo-Run.git`` | ||
|
||
To install Skypilot, use one of the following optional features. | ||
Review comment: revise. To install Skypilot, we have three options available: |
||
|
||
``pip install "nemo_run[skypilot] @ git+https://github.com/NVIDIA/NeMo-Run.git"`` | ||
installs Skypilot with Kubernetes support. | ||
|
||
``pip install "nemo_run[skypilot-all] @ git+https://github.com/NVIDIA/NeMo-Run.git"`` | ||
installs Skypilot with support for all clouds. | ||
|
||
You can also manually install Skypilot from https://skypilot.readthedocs.io/en/latest/getting-started/installation.html | ||
Review comment on lines +32 to +39: revise section, add bullets, and fix punctuation.
|
||
|
||
Make sure you have ``pip`` installed and configured properly. | ||
|
||
|
||
Tutorials | ||
--------- | ||
|
||
The ``hello_world`` tutorial series provides a comprehensive introduction to NeMo-Run, demonstrating its capabilities through a simple example. The tutorial covers: | ||
Review comment: fix punctuation. The |
||
|
||
- Configuring Python functions using ``Partial`` and ``Config`` classes. | ||
- Executing configured functions locally and on remote clusters. | ||
- Visualizing configurations with ``graphviz``. | ||
- Creating and managing experiments using ``run.Experiment``. | ||
|
||
You can find the tutorial series below: | ||
|
||
1. `Part 1 <examples/hello-world/hello_world.ipynb>`_ | ||
2. `Part 2 <examples/hello-world/hello_experiments.ipynb>`_ | ||
3. `Part 3 <examples/hello-world/hello_scripts.py>`_ | ||
Review comment on lines +56 to +58: change list to bullets and suggest adding tutorial names to this list.
|
Review comment: fix punctuation. Execute NeMo-Run