TorchWatcher: Track layer-wise metrics from PyTorch models to Neptune #18

LeoRoccoBreedt · 2025-06-02T14:28:22Z

Description

Include a summary of the changes and the related issue.

Related to: <ClickUp/JIRA task name>

Any expected test failures?

Add a [X] to relevant checklist items

❔ This change

adds a new feature
fixes breaking code
is cosmetic (refactoring/reformatting)

✔️ Pre-merge checklist

Refactored code (sourcery)
Tested code locally
Precommit installed and run before pushing changes
Added code to GitHub tests (notebooks, scripts)
Updated GitHub README
Updated the projects overview page on Notion

🧪 Test Configuration

OS:
Python version:
Neptune version:
Affected libraries with version:

Summary by Sourcery

Add a new PyTorch monitoring integration by introducing the TorchWatcher package with supporting notebooks, example script, and documentation, and update CI to test the new notebooks.

New Features:

Introduce TorchWatcher module to automatically track layer-wise activations, gradients, and parameters in PyTorch models
Add interactive notebooks demonstrating how to debug PyTorch training with layer-wise metrics using Neptune
Provide a standalone example script showing TorchWatcher integration and usage

CI:

Update GitHub workflow to include the new PyTorch debugging notebook in test-notebooks.yml

Documentation:

Add README documentation for TorchWatcher installation, features, and usage instructions

sourcery-ai · 2025-06-02T14:46:33Z

Reviewer's Guide

This PR introduces a PyTorch layer-wise monitoring integration for Neptune by implementing a standalone TorchWatcher utility, accompanied by tutorial and how-to notebooks, documentation, example scripts, and updates to the CI workflow to execute these new assets.

Sequence Diagram for TorchWatcher's watch() method

sequenceDiagram
    participant TL as Training Loop
    participant TW as TorchWatcher
    participant HM as HookManager
    participant PM as PyTorchModel (nn.Module)
    participant NR as NeptuneRun

    TL->>TW: watch(step, track_activations_flag, track_gradients_flag, track_parameters_flag)
    TW->>TW: Clear internal metrics buffer

    opt track_activations_flag is true
        TW->>HM: get_activations()
        activate HM
        HM-->>TW: activation_tensors
        deactivate HM
        TW->>TW: Process activation_tensors (compute stats, add to buffer)
    end

    opt track_gradients_flag is true
        TW->>HM: get_gradients()
        activate HM
        HM-->>TW: gradient_tensors
        deactivate HM
        TW->>TW: Process gradient_tensors (compute stats, add to buffer)
    end

    opt track_parameters_flag is true
        TW->>PM: Access parameter gradients (param.grad)
        activate PM
        PM-->>TW: parameter_gradient_tensors
        deactivate PM
        TW->>TW: Process parameter_gradient_tensors (compute stats, add to buffer)
    end

    TW->>NR: log_metrics(buffered_metrics, step)
    TW->>HM: clear() (clear stored activations/gradients in HookManager)
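
The flow in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeRun`, `watch_step`, and the `debug/...` metric names are hypothetical, and the run object is assumed only to expose `log_metrics(data, step)` as shown in the class diagram. Activations and gradients are passed in as plain dictionaries in place of the HookManager stores.

```python
import torch
from torch import nn


class FakeRun:
    """Hypothetical stand-in for the Neptune run: records calls instead of sending them."""

    def __init__(self):
        self.logged = []

    def log_metrics(self, data, step):
        self.logged.append((step, dict(data)))


def watch_step(model, run, activations, gradients, step,
               track_activations=True, track_gradients=True, track_parameters=True):
    """Buffer per-layer statistics, log them in one call, then clear the stores."""
    buffer = {}  # fresh metrics buffer for this step
    if track_activations:
        for name, tensor in activations.items():
            buffer[f"debug/activation/{name}_mean"] = tensor.mean().item()
    if track_gradients:
        for name, tensor in gradients.items():
            buffer[f"debug/gradient/{name}_norm"] = tensor.norm().item()
    if track_parameters:
        # As in the diagram, parameter tracking reads param.grad from the model.
        for name, param in model.named_parameters():
            if param.grad is not None:
                buffer[f"debug/parameter/{name}_grad_norm"] = param.grad.norm().item()
    run.log_metrics(data=buffer, step=step)
    # Clear stored tensors so the next step starts empty.
    activations.clear()
    gradients.clear()
    return buffer


model = nn.Linear(4, 2)
out = model(torch.randn(3, 4))
out.sum().backward()

run = FakeRun()
acts = {"linear": out.detach()}
grads = {"linear": torch.ones_like(out)}
metrics = watch_step(model, run, acts, grads, step=0)
```

Logging once per call, after all three optional stages have filled the buffer, keeps every metric for a step under the same `step` value.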

Class Diagram for TorchWatcher and HookManager

classDiagram
    class TorchWatcher {
        -model: nn.Module
        -run: NeptuneRun
        -hm: HookManager
        -debug_metrics: Dict
        -base_namespace: str
        -tensor_stats: Dict
        +__init__(model, run, track_layers, tensor_stats, base_namespace)
        -_safe_tensor_stats(tensor) Dict
        -_track_metric(metric_type, data, namespace)
        +track_activations(namespace)
        +track_gradients(namespace)
        +track_parameters(namespace)
        +watch(step, track_gradients, track_parameters, track_activations, namespace)
    }

    class HookManager {
        -model: nn.Module
        -hooks: List
        -activations: Dict
        -gradients: Dict
        -track_layers: List
        +__init__(model, track_layers)
        +save_activation(name) Callable
        +save_gradient(name) Callable
        +register_hooks(track_activations, track_gradients)
        +remove_hooks()
        +clear()
        +get_activations() Dict
        +get_gradients() Dict
        +__del__()
    }

    class NeptuneRun {
        <<Service Interface>>
        +log_metrics(data, step)
    }

    class nn.Module {
        <<PyTorch Library>>
        +named_parameters()
        +register_forward_hook()
        +register_full_backward_hook()
    }

    TorchWatcher "1" *-- "1" HookManager : creates & owns
    TorchWatcher ..> nn.Module : uses
    TorchWatcher ..> NeptuneRun : logs to
    HookManager ..> nn.Module : registers hooks on & uses
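
To make the HookManager's role more concrete, here is a minimal sketch of how forward and full backward hooks can capture per-layer tensors. It is an assumption-laden illustration rather than the class added in this PR: `MiniHookManager` and its layer-filtering rule are hypothetical, and only `register_forward_hook` and `register_full_backward_hook` come from the diagram.

```python
import torch
from torch import nn


class MiniHookManager:
    """Hypothetical, simplified hook manager for illustration only."""

    def __init__(self, model, track_layers=None):
        self.model = model
        self.track_layers = track_layers  # None is assumed to mean "all leaf layers"
        self.hooks = []
        self.activations = {}
        self.gradients = {}

    def _save_activation(self, name):
        def hook(module, inputs, output):
            self.activations[name] = output.detach()
        return hook

    def _save_gradient(self, name):
        def hook(module, grad_input, grad_output):
            self.gradients[name] = grad_output[0].detach()
        return hook

    def register_hooks(self):
        for name, module in self.model.named_modules():
            if len(list(module.children())) > 0:
                continue  # skip containers such as nn.Sequential
            if self.track_layers is not None and not isinstance(
                module, tuple(self.track_layers)
            ):
                continue
            self.hooks.append(module.register_forward_hook(self._save_activation(name)))
            self.hooks.append(
                module.register_full_backward_hook(self._save_gradient(name))
            )

    def remove_hooks(self):
        for handle in self.hooks:
            handle.remove()
        self.hooks.clear()

    def clear(self):
        self.activations.clear()
        self.gradients.clear()


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
hm = MiniHookManager(model, track_layers=[nn.Linear])
hm.register_hooks()

x = torch.randn(5, 4, requires_grad=True)
model(x).sum().backward()

activation_shapes = {name: tuple(t.shape) for name, t in hm.activations.items()}
gradient_shapes = {name: tuple(t.shape) for name, t in hm.gradients.items()}
hm.remove_hooks()
```

Storing detached tensors keeps the captured values out of the autograd graph, and keeping the hook handles allows `remove_hooks()` to detach everything once tracking is no longer needed.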

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Integrate PyTorch tracking notebooks and update CI workflow | Added a PyTorch debugging notebook demonstrating TorchWatcher integration; added a tutorial notebook for gradient-norm tracking with Neptune; updated test-notebooks GitHub Actions workflow to include the new notebooks | `.github/workflows/test-notebooks.yml` `integrations-and-supported-tools/pytorch/notebooks/pytorch_text_model_debugging.ipynb` `how-to-guides/debug-model-training-runs/debug_training_runs.ipynb` |
| Implement TorchWatcher library for metric tracking | Added HookManager to register forward/backward hooks on model layers; built TorchWatcher to compute and log configurable tensor statistics; enabled flexible tracking of activations, gradients, and parameters | `integrations-and-supported-tools/pytorch/notebooks/TorchWatcher.py` |
| Add documentation and usage examples | Created README with installation, usage, and namespace guidelines; added example script demonstrating TorchWatcher in a simple training loop | `integrations-and-supported-tools/pytorch/notebooks/README.md` `integrations-and-supported-tools/pytorch/notebooks/torch_watcher_example.py` |
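
As a rough idea of what "configurable tensor statistics" could look like, the sketch below computes a user-selected set of statistics and skips non-finite values. The names `STAT_FUNCTIONS` and `safe_tensor_stats`, the available statistics, and the choice to drop NaN/inf results are all assumptions for illustration; the PR's `_safe_tensor_stats` may behave differently.

```python
import math

import torch

# Hypothetical registry of statistic names to functions over a 1-D tensor.
STAT_FUNCTIONS = {
    "mean": lambda t: t.mean(),
    "std": lambda t: t.std(),
    "norm": lambda t: t.norm(),
    "min": lambda t: t.min(),
    "max": lambda t: t.max(),
}


def safe_tensor_stats(tensor, stats=("mean", "norm")):
    """Return the requested statistics, omitting any that are not finite."""
    flat = tensor.detach().float().flatten()
    finite = flat[torch.isfinite(flat)]  # drop NaN and inf entries first
    if finite.numel() == 0:
        return {}
    result = {}
    for name in stats:
        value = STAT_FUNCTIONS[name](finite).item()
        if math.isfinite(value):
            result[name] = value
    return result


basic = safe_tensor_stats(torch.tensor([3.0, 4.0]))
all_bad = safe_tensor_stats(torch.tensor([float("nan"), float("inf")]))
partial = safe_tensor_stats(torch.tensor([1.0, float("nan")]), stats=("mean", "max"))
```

Filtering before logging avoids sending NaN or infinite values to the tracker, at the cost of silently hiding them; a real implementation might prefer to log a separate counter of non-finite entries instead.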

LeoRoccoBreedt added 30 commits February 26, 2025 11:07

feat: Added initial Pytorch example to monitor batching and per layer metrics

b837338

refactor: Update introduction section for more clarity on notebook use

9994528

chore: change how the custom run id gets automatically generated

223fde1

chore: update instructions on how users can get and set their API token and project name

ceb3a24

chore: update the introduction to be more foundation model training orientated

39f5bb9

chore: update dataset section with a better description

69fdfba

refactor: update training loop where grads, norms and activations are tracked

1cde287

refactor: update batch size and edit gradient norm logging code

2da28f4

chore: add data file to ignore for pytorch example

123fea2

refactor: update model architecture layers and update training loop

86b6d5d

refactor: update accuracy calculation to not output the percentage

a39077a

refactor: update model architecture layers, accuracy calculation and calculate gradient norms for batch (step) rather than epoch

3e2d453

feat: Added a pytorch text-based example that is used to demonstrate debugging when building LLMs

8a3da57

refactor: add validation and test loss calculation for each epoch

d03fd43

refactor: update configs and parameters

b7f8e08

refactor: update logged configs

b113a36

refactor: calculate activations per layer

7258993

refactor: add tracking for grad norms

ae330e6

refactor: add gradient tracking per epoch

d4d160c

chore: remove unneeded section

890bf26

refactor: add fully connected layer to model for more complexity

ebd34c7

chore: fix activation saving for layers

ae9abc6

refactor: update packages for example

6f23761

refactor: update dataset to be used in example

2c82a0f

refactor: update evaluation function that calculates the validation losses using the data loader

48451d6

refactor: update training loop to work with new data

5b5fbdd

chore: remove unused sections

8ff6f30

chore: re-organize notebook layout

b32b27b

chore: cleanup and add parameters in right place

4d659a2

refactor: add all debugging metrics to the same dictionary variable

5e7ec6e

LeoRoccoBreedt added 28 commits March 31, 2025 16:27

feat: create a TorchWatcher package

f44323d

- package to initialize hooks for Pytorch models, replacing the HookManager class
- add readme.md for using the package
- update the debugging pytorch example to use the new package

style: update intro

2b001af

update TorchWatcher package to default to tracking all available layers as well as allowing a user to specify which layers to track

fdfca80

chore: update readme

84a7ca3

Merge commit 'ee5a98f111eb90400a373c292b591e3684762fcc' into lb/pytorch_example_v1

2baa709

chore: cleanup comments and add TODO for future improvements

75265ec

refactor: update TorchWatcher to use a common _track_metrics() method

ebde389

refactor: update the watch() method to accept boolean inputs for the metrics to track rather than named values

9f83fbf

feat: add ability to specify base_namespace and namespace during training loop.

603af72

- more control on namespace logged during training

chore: update readme and example for namespace feature

6bb4879

chore: pre-commit hooks changes

50b0a6d

refactor: update example to use the improved TorchWatcher package

e67e1f6

chore: pre-commit cleanup

a2d0d50

feat: notebook tutorial on debugging training runs with Neptune

fcf3657

chore: pre-commit changes

82f4d8b

chore: update colab link for branch

8c56296

refactor: keep notebook more self-contained for use in Colab

247679a

fix: f string compatibility in Colab

408ba02

fix: ensure model object is on correct device

314e8a2

chore: Add header links to GH, Neptune and docs

62227bc

These need to be updated to the final branch when merged

Merge commit 'fc4bc5ee2d6e3297c8611991e60f372ea785d213' into lb/debugging_model_training

f64d54d

style: minor updates to markdown

505f41c

chore: remove unused code

bed4899

style: remove image reference

75885bb

style: update ending steps and add button links to examples

33d3305

style: update text for better readability

494379f

style: Update markdown section of explanations for better readability and flow

0b8847b

chore: pre-commit cleanup

69bdb16

LeoRoccoBreedt self-assigned this Jun 6, 2025
