dlcalc

Assorted command line tools for deep learning.

Installation

pip install dlcalc

or

git clone https://github.com/jfc4050/dlcalc
cd dlcalc
pip install -e .

After this, the command line tools described below should be available. If they are not found, you may need to add --user to your pip install command so that the tools end up on your $PATH.

Tools

Performance Modeling

3D Training Calculator

Calculator for estimating various performance characteristics of 3D parallel transformer model training:

memory consumption
pipeline bubble
communication overhead
compute intensity
etc.

This calculator is focused primarily on pretraining, so you won't find calculations for things like LoRA or RLHF/DPO/etc.
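
As a rough illustration of one of the quantities listed above, the pipeline bubble for a non-interleaved schedule (GPipe or 1F1B) is commonly estimated as (p - 1) / (m + p - 1), where p is the number of pipeline stages and m is the number of microbatches per step. The sketch below is a back-of-the-envelope version under that assumption, with equal-cost stages and microbatches. The function name is hypothetical, and this is not necessarily how 3dtrn computes it.

```python
def pipeline_bubble_fraction(pp_size: int, n_microbatches: int) -> float:
    """Fraction of a training step that each pipeline stage spends idle.

    Assumes a non-interleaved schedule where every microbatch and every
    stage costs the same amount of time (an idealization).
    """
    if pp_size < 1 or n_microbatches < 1:
        raise ValueError("pp_size and n_microbatches must both be >= 1")
    # (p - 1) idle slots out of (m + p - 1) total slots per stage.
    return (pp_size - 1) / (n_microbatches + pp_size - 1)


if __name__ == "__main__":
    # 4 pipeline stages, 13 microbatches: 3 idle slots out of 16.
    print(pipeline_bubble_fraction(4, 13))
```

Increasing the number of microbatches relative to the number of stages shrinks the bubble, which is one reason to compare this estimate against the idle gaps seen in a profiler trace.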

For more details run:

3dtrn -h

We've included a sample config you can use to see what the output looks like:

3dtrn examples/llama3_70b.yaml

We recommend pairing this with a profiler of your choice (NVIDIA Nsight Systems and PyTorch Profiler Traces are two good ones), and checking that what you see in the profiler is in the same ballpark as the theoretical values estimated by the calculator. If they are way off, you now know where to spend your investigation/debugging time.

NOTE: If you look at the sample config, you'll see that it takes an instance type, which is used to derive various hardware specifications like intra/inter node bandwidth, theoretical FLOPS per device, number of accelerators per node, etc. You can check hardware.py to see which instance types are supported. If the instance type you're interested in isn't represented, you'll have to add it there.

Topology

Topology Visualizer

Given a Kubernetes pod name prefix for some compute cluster, retrieves AWS network topology info and plots it. For more details run:

topoviz -h

Topology Evaluator

Evaluates how optimal a given training job's physical topology is (in terms of network hops in each DP ring). For more details run:

topoeval -h
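
To make the idea of "network hops in each DP ring" concrete, here is a minimal sketch under stated assumptions. It assumes each instance is described by a list of network node IDs ordered from the top layer down to the layer closest to the instance (the kind of data EC2's DescribeInstanceTopology API returns), and it counts the switches on the path between two instances through their lowest shared node. Both function names and the hop metric itself are hypothetical; topoeval may score topologies differently.

```python
from typing import Sequence


def switch_hops(path_a: Sequence[str], path_b: Sequence[str]) -> int:
    """Number of network nodes on the path between two instances.

    Each path lists network node IDs from the top layer to the bottom layer.
    If the paths share no node, a common node one layer above is assumed.
    """
    depth = len(path_a)
    if len(path_b) != depth:
        raise ValueError("both paths must have the same number of layers")
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    # Climb (depth - shared) nodes, cross one shared node, descend the same.
    return 2 * (depth - shared) + 1


def ring_hops(ring: Sequence[Sequence[str]]) -> int:
    """Total hops over all neighbor links of a ring, including wraparound."""
    n = len(ring)
    if n < 2:
        return 0
    return sum(switch_hops(ring[i], ring[(i + 1) % n]) for i in range(n))


if __name__ == "__main__":
    a = ["s1", "m1", "l1"]
    b = ["s1", "m1", "l1"]
    c = ["s1", "m1", "l2"]
    d = ["s1", "m2", "l3"]
    # Keeping topologically close instances adjacent lowers the total.
    print(ring_hops([a, b, c, d]), ring_hops([a, c, b, d]))
```

Under this metric, a rank ordering that keeps instances on the same low-level switch next to each other in the ring scores lower than one that interleaves them.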

Topology-Aware Scheduler

Topology-aware instance selection and rank assignment for maximally efficient DP communication. For more details run:

topoassign -h

KPIs

Samples/Sec -> Tokens/Day Converter

Pretty self-explanatory. For more details run:

sps2tpd -h
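
The underlying arithmetic can be sketched as follows, assuming every sample is a single sequence of exactly seq_len tokens (for example, packed pretraining data). The function name is hypothetical and is not taken from the sps2tpd source.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_sec_to_tokens_per_day(samples_per_sec: float, seq_len: int) -> float:
    """Convert training throughput from samples/sec to tokens/day.

    Assumes every sample contains exactly seq_len tokens.
    """
    return samples_per_sec * seq_len * SECONDS_PER_DAY


if __name__ == "__main__":
    # 10 samples/sec at a sequence length of 4096 tokens.
    print(samples_per_sec_to_tokens_per_day(10, 4096))
```

With those example inputs the result is 3,538,944,000, or roughly 3.5 billion tokens per day.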

Samples/Sec -> MFU Converter

If you're not familiar with what Model FLOPs Utilization (MFU) means, refer to Google's PaLM paper. Otherwise pretty self-explanatory. For more details run:

sps2mfu -h
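
For intuition, MFU is the ratio of the model FLOPs actually processed per second to the hardware's theoretical peak. The sketch below is a simplified estimate, not the sps2mfu implementation: it assumes a dense decoder-only transformer costing about 6 FLOPs per parameter per token (forward plus backward) and ignores the attention term that the PaLM paper includes. All names and parameters are hypothetical.

```python
def estimate_mfu(
    samples_per_sec: float,
    seq_len: int,
    n_params: float,
    n_devices: int,
    peak_flops_per_device: float,
) -> float:
    """Approximate Model FLOPs Utilization as a fraction in [0, 1].

    Assumes 6 * n_params FLOPs per token (forward + backward) and ignores
    attention FLOPs, so it underestimates MFU at long sequence lengths.
    """
    tokens_per_sec = samples_per_sec * seq_len
    achieved_flops_per_sec = tokens_per_sec * 6 * n_params
    peak_flops_per_sec = n_devices * peak_flops_per_device
    return achieved_flops_per_sec / peak_flops_per_sec


if __name__ == "__main__":
    # Illustrative round numbers only, not real hardware specifications.
    print(estimate_mfu(1.0, 1000, 1e9, 1, 1e14))
```

The peak FLOPS value should match the precision you train in, since the datasheet figure differs between, say, FP32 and BF16.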

Misc.

Checkpoint Summarizer

Gives a human-readable summarization of keys, values, and tensor shapes in a given (PyTorch) model checkpoint. For more details run:

ckpt-summarize -h
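
A minimal sketch of the general idea, assuming the checkpoint is a (possibly nested) dict whose leaves are either tensors or plain Python values. It duck-types tensors through their shape and dtype attributes so that it runs without PyTorch installed. The summarize function is hypothetical and is not the ckpt-summarize implementation.

```python
from typing import Any, Iterator, Tuple


def summarize(obj: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted key path, description) for each leaf of a nested dict."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from summarize(value, child)
    elif hasattr(obj, "shape") and hasattr(obj, "dtype"):
        # Tensor-like leaf: report type, shape, and dtype instead of values.
        desc = f"{type(obj).__name__} shape={tuple(obj.shape)} dtype={obj.dtype}"
        yield prefix, desc
    else:
        # Any other leaf (ints, strings, lists, ...) is shown via repr.
        yield prefix, repr(obj)


if __name__ == "__main__":
    class FakeTensor:
        """Stand-in for a tensor so the example runs without PyTorch."""
        shape = (2, 3)
        dtype = "float32"

    ckpt = {"model": {"w": FakeTensor()}, "step": 10}
    for path, desc in summarize(ckpt):
        print(f"{path}: {desc}")
```

With PyTorch available, the same function could be applied to the result of torch.load(path, map_location="cpu").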

Development

Install development dependencies:

pip install -e .[dev]

Static checks can be run with:

bash checks
