
TorchWatcher: Track layer-wise metrics from PyTorch models to Neptune #18


Draft: wants to merge 105 commits into main


LeoRoccoBreedt (Contributor) commented Jun 2, 2025

Description

Include a summary of the changes and the related issue.

Related to: <ClickUp/JIRA task name>

Any expected test failures?


Add a [X] to relevant checklist items

❔ This change

  • adds a new feature
  • fixes breaking code
  • is cosmetic (refactoring/reformatting)

✔️ Pre-merge checklist

  • Refactored code (sourcery)
  • Tested code locally
  • Precommit installed and run before pushing changes
  • Added code to GitHub tests (notebooks, scripts)
  • Updated GitHub README
  • Updated the projects overview page on Notion

🧪 Test Configuration

  • OS:
  • Python version:
  • Neptune version:
  • Affected libraries with version:

Summary by Sourcery

Add a new PyTorch monitoring integration by introducing the TorchWatcher package with supporting notebooks, an example script, and documentation, and update CI to test the new notebooks.

New Features:

  • Introduce TorchWatcher module to automatically track layer-wise activations, gradients, and parameters in PyTorch models
  • Add interactive notebooks demonstrating how to debug PyTorch training with layer-wise metrics using Neptune
  • Provide a standalone example script showing TorchWatcher integration and usage
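The basic mechanism behind layer-wise activation tracking can be sketched with PyTorch forward hooks. This is a minimal illustration, not TorchWatcher's implementation: the toy model, the `activations` dict, and the `activations/<layer>/<stat>` key layout are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Toy model; the same pattern applies to any nn.Module.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
activations = {}  # layer name -> output of the most recent forward pass


def save_activation(name):
    # Forward hooks are called as hook(module, inputs, output) after forward() runs.
    def hook(module, inputs, output):
        activations[name] = output.detach()  # detach so no autograd graph is retained
    return hook


handles = [
    module.register_forward_hook(save_activation(name))
    for name, module in model.named_modules()
    if name  # the root module is reported with the empty name ""
]

with torch.no_grad():
    model(torch.randn(16, 4))

# Reduce each layer's output to scalars that a logger can plot per step.
layer_stats = {}
for name, act in activations.items():
    layer_stats[f"activations/{name}/mean"] = act.mean().item()
    layer_stats[f"activations/{name}/std"] = act.std().item()

for handle in handles:
    handle.remove()  # hooks stay registered until explicitly removed
```

Removing the handles matters in notebooks: a hook that is never removed keeps firing (and keeps references alive) on every later forward pass.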

CI:

  • Update GitHub workflow to include the new PyTorch debugging notebook in test-notebooks.yml

Documentation:

  • Add README documentation for TorchWatcher installation, features, and usage instructions

…calculate gradient norms for batch (step) rather than epoch
- package to initialize hooks for PyTorch models, replacing the HookManager class
- add README.md describing how to use the package
- update the debugging PyTorch example to use the new package
…rs as well as allowing a user to specify which layers to track
…ning loop.

- more control on namespace logged during training
These need to be updated to the final branch when merged
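The per-step gradient norms and layer selection described in these commits could look roughly like the sketch below. It is an assumption-based illustration rather than the package's code: `gradient_norms`, its substring-based `track_layers` filter, and the `debug/gradients` namespace are hypothetical names chosen for the example.

```python
import torch
import torch.nn as nn


def gradient_norms(model, track_layers=None, namespace="debug/gradients"):
    """Return {f"{namespace}/{param_name}": L2 norm of param.grad} for one step.

    Hypothetical filter: a parameter is kept only if its name contains one of
    the substrings in `track_layers`; None means track every parameter.
    """
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue  # frozen or unused parameters have no gradient
        if track_layers is not None and not any(t in name for t in track_layers):
            continue
        norms[f"{namespace}/{name}"] = param.grad.norm(2).item()
    return norms


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
history = []  # one dict per batch, i.e. per step rather than per epoch

for step in range(3):
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Read gradients after backward() and before the next zero_grad() clears them.
    history.append(gradient_norms(model, track_layers=["0."]))
    optimizer.step()
```

Computing the norms inside the batch loop is what makes them per-step values; aggregating them once per epoch would smooth over the behaviour of individual batches.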

sourcery-ai bot (Contributor) commented Jun 2, 2025

Reviewer's Guide

This PR introduces a PyTorch layer-wise monitoring integration for Neptune by implementing a standalone TorchWatcher utility, accompanied by tutorial and how-to notebooks, documentation, example scripts, and updates to the CI workflow to execute these new assets.

Sequence Diagram for TorchWatcher's watch() method

sequenceDiagram
    participant TL as Training Loop
    participant TW as TorchWatcher
    participant HM as HookManager
    participant PM as PyTorchModel (nn.Module)
    participant NR as NeptuneRun

    TL->>TW: watch(step, track_activations_flag, track_gradients_flag, track_parameters_flag)
    TW->>TW: Clear internal metrics buffer

    opt track_activations_flag is true
        TW->>HM: get_activations()
        activate HM
        HM-->>TW: activation_tensors
        deactivate HM
        TW->>TW: Process activation_tensors (compute stats, add to buffer)
    end

    opt track_gradients_flag is true
        TW->>HM: get_gradients()
        activate HM
        HM-->>TW: gradient_tensors
        deactivate HM
        TW->>TW: Process gradient_tensors (compute stats, add to buffer)
    end

    opt track_parameters_flag is true
        TW->>PM: Access parameter gradients (param.grad)
        activate PM
        PM-->>TW: parameter_gradient_tensors
        deactivate PM
        TW->>TW: Process parameter_gradient_tensors (compute stats, add to buffer)
    end

    TW->>NR: log_metrics(buffered_metrics, step)
    TW->>HM: clear() (clear stored activations/gradients in HookManager)
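To make the diagram's control flow concrete, here is a self-contained sketch of a `watch()`-style function. It assumes a run object exposing `log_metrics(data, step)`, replaced here by the hypothetical `PrintRun` stub so the example runs without Neptune; plain dicts stand in for the HookManager, and the `debug/<type>/<layer>/<stat>` key layout is illustrative. For brevity, gradients are captured with a tensor hook registered inside the forward hook rather than with `register_full_backward_hook`.

```python
import torch
import torch.nn as nn


class PrintRun:
    """Hypothetical stand-in for a Neptune run; only log_metrics(data, step) is assumed."""

    def __init__(self):
        self.logged = []

    def log_metrics(self, data, step):
        self.logged.append((step, dict(data)))


def tensor_stats(t):
    # Reduce a tensor to a few scalars; float() guards against integer tensors.
    t = t.detach().float()
    return {"mean": t.mean().item(), "norm": t.norm(2).item()}


def watch(model, run, activations, gradients, step,
          track_activations=True, track_gradients=True, track_parameters=True):
    buffer = {}  # fresh buffer per call, so metrics never leak across steps
    sources = []
    if track_activations:
        sources.append(("activations", activations.items()))
    if track_gradients:
        sources.append(("gradients", gradients.items()))
    if track_parameters:
        # As in the diagram, "parameters" means each parameter's .grad tensor.
        sources.append(("parameters", [(n, p.grad) for n, p in model.named_parameters()
                                       if p.grad is not None]))
    for metric_type, items in sources:
        for name, tensor in items:
            for stat, value in tensor_stats(tensor).items():
                buffer[f"debug/{metric_type}/{name}/{stat}"] = value

    run.log_metrics(buffer, step=step)  # one logging call per step
    # Equivalent of HookManager.clear(): drop stored tensors before the next step.
    activations.clear()
    gradients.clear()


# Usage: capture activations and output gradients with hooks, then call watch().
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
activations, gradients = {}, {}


def capture(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
        # Tensor hook: called with d(loss)/d(output) during backward().
        output.register_hook(lambda g: gradients.__setitem__(name, g.detach()))
    return hook


for name, module in model.named_children():
    module.register_forward_hook(capture(name))

run = PrintRun()
model(torch.randn(8, 4)).pow(2).mean().backward()
watch(model, run, activations, gradients, step=0)
```

Batching every metric into a single `log_metrics` call per step keeps all layers aligned on the same step value.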

Class Diagram for TorchWatcher and HookManager

classDiagram
    class TorchWatcher {
        -model: nn.Module
        -run: NeptuneRun
        -hm: HookManager
        -debug_metrics: Dict
        -base_namespace: str
        -tensor_stats: Dict
        +__init__(model, run, track_layers, tensor_stats, base_namespace)
        -_safe_tensor_stats(tensor) Dict
        -_track_metric(metric_type, data, namespace)
        +track_activations(namespace)
        +track_gradients(namespace)
        +track_parameters(namespace)
        +watch(step, track_gradients, track_parameters, track_activations, namespace)
    }

    class HookManager {
        -model: nn.Module
        -hooks: List
        -activations: Dict
        -gradients: Dict
        -track_layers: List
        +__init__(model, track_layers)
        +save_activation(name) Callable
        +save_gradient(name) Callable
        +register_hooks(track_activations, track_gradients)
        +remove_hooks()
        +clear()
        +get_activations() Dict
        +get_gradients() Dict
        +__del__()
    }

    class NeptuneRun {
        <<Service Interface>>
        +log_metrics(data, step)
    }

    class nn.Module {
        <<PyTorch Library>>
        +named_parameters()
        +register_forward_hook()
        +register_full_backward_hook()
    }

    TorchWatcher "1" *-- "1" HookManager : creates & owns
    TorchWatcher ..> nn.Module : uses
    TorchWatcher ..> NeptuneRun : logs to
    HookManager ..> nn.Module : registers hooks on & uses
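A HookManager shaped after the class diagram might look like the sketch below. Only the method names come from the diagram; the bodies, the meaning of `track_layers` (exact module names from `named_modules()`, or every leaf module when `None`), and the choice to store `grad_output[0]` are assumptions.

```python
import torch
import torch.nn as nn


class HookManager:
    """Sketch following the class diagram's method names; not the PR's code."""

    def __init__(self, model, track_layers=None):
        self.model = model
        self.track_layers = track_layers  # assumed: list of module names, or None
        self.hooks = []        # RemovableHandle objects from register_*_hook
        self.activations = {}  # layer name -> output of the latest forward pass
        self.gradients = {}    # layer name -> gradient w.r.t. that output

    def save_activation(self, name):
        def hook(module, inputs, output):
            self.activations[name] = output.detach()
        return hook

    def save_gradient(self, name):
        # Full backward hooks are called as hook(module, grad_input, grad_output).
        def hook(module, grad_input, grad_output):
            if grad_output[0] is not None:
                self.gradients[name] = grad_output[0].detach()
        return hook

    def _selected_modules(self):
        for name, module in self.model.named_modules():
            if self.track_layers is None:
                if len(list(module.children())) == 0:  # leaf modules only
                    yield name, module
            elif name in self.track_layers:
                yield name, module

    def register_hooks(self, track_activations=True, track_gradients=True):
        for name, module in self._selected_modules():
            if track_activations:
                self.hooks.append(module.register_forward_hook(self.save_activation(name)))
            if track_gradients:
                self.hooks.append(module.register_full_backward_hook(self.save_gradient(name)))

    def remove_hooks(self):
        for handle in self.hooks:
            handle.remove()
        self.hooks.clear()

    def clear(self):
        self.activations.clear()
        self.gradients.clear()

    def get_activations(self):
        return self.activations

    def get_gradients(self):
        return self.gradients

    def __del__(self):
        # Best effort: do not let hooks outlive the manager.
        self.remove_hooks()


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
hm = HookManager(model, track_layers=["0", "2"])
hm.register_hooks()
model(torch.randn(8, 4)).sum().backward()
```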

File-Level Changes

Integrate PyTorch tracking notebooks and update CI workflow
  Details:
  • Added a PyTorch debugging notebook demonstrating TorchWatcher integration
  • Added a tutorial notebook for gradient-norm tracking with Neptune
  • Updated the test-notebooks GitHub Actions workflow to include the new notebooks
  Files:
  • .github/workflows/test-notebooks.yml
  • integrations-and-supported-tools/pytorch/notebooks/pytorch_text_model_debugging.ipynb
  • how-to-guides/debug-model-training-runs/debug_training_runs.ipynb

Implement TorchWatcher library for metric tracking
  Details:
  • Added HookManager to register forward/backward hooks on model layers
  • Built TorchWatcher to compute and log configurable tensor statistics
  • Enabled flexible tracking of activations, gradients, and parameters
  Files:
  • integrations-and-supported-tools/pytorch/notebooks/TorchWatcher.py

Add documentation and usage examples
  Details:
  • Created README with installation, usage, and namespace guidelines
  • Added example script demonstrating TorchWatcher in a simple training loop
  Files:
  • integrations-and-supported-tools/pytorch/notebooks/README.md
  • integrations-and-supported-tools/pytorch/notebooks/torch_watcher_example.py
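The "configurable tensor statistics" item could be realised as a registry of reductions plus a guard against empty or non-finite results, in the spirit of the diagram's `tensor_stats` and `_safe_tensor_stats` members. The registry contents, the function names `safe_tensor_stats` and `namespaced`, and the key layout below are hypothetical.

```python
import math

import torch

# Hypothetical registry: stat name -> reduction over a float tensor.
AVAILABLE_STATS = {
    "mean": lambda t: t.mean(),
    "std": lambda t: t.std(),
    "norm": lambda t: t.norm(2),
    "min": lambda t: t.min(),
    "max": lambda t: t.max(),
}


def safe_tensor_stats(tensor, stats=("mean", "std", "norm")):
    """Compute only the requested stats, dropping any value that is not finite."""
    t = tensor.detach().float()
    out = {}
    if t.numel() == 0:
        return out  # nothing to summarise
    for name in stats:
        value = AVAILABLE_STATS[name](t).item()
        if math.isfinite(value):  # e.g. std of a single element is NaN
            out[name] = value
    return out


def namespaced(stats, base_namespace, metric_type, layer):
    """Flatten to keys such as 'debug/gradients/fc1/mean' (illustrative layout)."""
    return {f"{base_namespace}/{metric_type}/{layer}/{k}": v for k, v in stats.items()}
```

Skipping non-finite values keeps a single NaN batch from producing gaps or errors in the logged series; whether to drop or surface such values is a design choice the real implementation may make differently.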


LeoRoccoBreedt self-assigned this Jun 6, 2025
Labels
None yet
Projects
None yet

2 participants