Skip to content

Update W&B exporter #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jun 16, 2025
Merged

Update W&B exporter #21

merged 10 commits into from
Jun 16, 2025

Conversation

SiddhantSadangi
Copy link
Member

@SiddhantSadangi SiddhantSadangi commented Jun 13, 2025

Description

  • Added console logs and file support.
  • Updated namespace of hardware metrics to runtime from system.
  • Updated script to use only CLI arguments.
  • Replaced . in system metric names with / to match the Neptune metric namespace format.

Any expected test failures?


Add a [X] to relevant checklist items

❔ This change

  • adds a new feature
  • fixes breaking code
  • is cosmetic (refactoring/reformatting)

✔️ Pre-merge checklist

  • Refactored code (sourcery)
  • Tested code locally
  • Precommit installed and run before pushing changes
  • Added code to GitHub tests (notebooks, scripts)
  • Updated GitHub README
  • Updated the projects overview page on Notion

🧪 Test Configuration

  • OS:
  • Python version:
  • Neptune version:
  • Affected libraries with version:

Summary by Sourcery

Enhance the W&B to Neptune migration script by adding file and console log support, converting interactive prompts to CLI arguments, refining metric namespaces, and improving logging, error handling, and configuration options.

New Features:

  • Add support for copying console output, source code, requirements, checkpoints, and other files to Neptune
  • Fully migrate script inputs to CLI arguments including entity, workspace, projects, worker count, log level, and timeout
  • Introduce a timeout mechanism for run and project migrations

Enhancements:

  • Switch hardware metric namespace from system to runtime and replace dots with slashes in metric names
  • Default --num-workers to 1 and expose --log-level for configurable logging verbosity
  • Insert detailed debug and info logs throughout the migration process and upgrade log formatting
  • Aggregate failures for runs and projects and report them at the end

Documentation:

  • Add a changelog section, update usage instructions to CLI-based workflow, and expand documentation of file and metric mappings in README

@SiddhantSadangi SiddhantSadangi requested a review from a team as a code owner June 13, 2025 15:44
Copy link
Contributor

sourcery-ai bot commented Jun 13, 2025

Reviewer's Guide

This PR refactors the W&B-to-Neptune migration script by replacing interactive prompts with a full argparse-driven CLI, enhancing logging configuration and error handling (including timeouts and aggregated failure reporting), adding support for uploading console logs and various run-related files to Neptune, migrating metric namespaces and separators to the new runtime/ format, and refreshing documentation to reflect these updates.

Sequence diagram for file upload and logging during run migration

sequenceDiagram
    participant Script
    participant W&B_Run as W&B Run
    participant Neptune_Run as Neptune Run
    participant FileSystem as File System

    Script->>W&B_Run: List files()
    loop For each file
        W&B_Run-->>Script: File metadata
        Script->>FileSystem: Download file
        FileSystem-->>Script: File path
        alt output.log
            Script->>Neptune_Run: log_string_series(runtime/stdout)
        else code/*
            Script->>Neptune_Run: assign_files(source_code/files/<filename>)
        else requirements.txt
            Script->>Neptune_Run: assign_files(source_code/requirements.txt)
        else checkpoint/ckpt
            Script->>Neptune_Run: assign_files(checkpoints/<filename>)
        else other
            Script->>Neptune_Run: assign_files(files/<filename>)
        end
    end
Loading

Class diagram for updated file handling and logging in migration script

classDiagram
    class Run {
        +log_configs(dict)
        +add_tags(list, group_tags)
        +log_string_series(dict, step)
        +assign_files(dict)
        +log_metrics(dict, step, timestamp)
    }
    class wandb_run {
        +id
        +name
        +project
        +notes
        +_attrs
        +summary
        +scan_history()
        +history(stream, pandas)
        +files()
        +tags
        +group
    }
    class File {
        +source
        +mime_type
    }
    class MigrationScript {
        +copy_run(wandb_run, wandb_project_name)
        +copy_config(neptune_run, wandb_run)
        +copy_summary(neptune_run, wandb_run)
        +copy_metrics(neptune_run, wandb_run)
        +copy_system_metrics(neptune_run, wandb_run)
        +copy_console_output(neptune_run, download_path)
        +copy_source_code(neptune_run, download_path, filename)
        +copy_requirements(neptune_run, download_path)
        +copy_other_files(neptune_run, download_path, filename, namespace)
        +copy_files(neptune_run, wandb_run, download_folder)
    }
    MigrationScript --> Run
    MigrationScript --> wandb_run
    MigrationScript --> File
Loading

Class diagram for CLI argument parsing and logging setup

classDiagram
    class argparse.ArgumentParser {
        +add_argument(...)
        +parse_args()
    }
    class Logger {
        +debug(msg)
        +info(msg)
        +error(msg)
        +setLevel(level)
    }
    class MigrationScript {
        +setup_logging(log_level)
    }
    MigrationScript --> Logger
    MigrationScript --> argparse.ArgumentParser
Loading

File-Level Changes

Change Details Files
Swapped interactive prompts for a comprehensive argparse-based CLI
  • Removed manual input functions
  • Introduced argparse arguments for entity, workspace, projects, worker count, log level, and timeout
  • Validated required arguments and raised errors if missing
utils/migration_tools/from_wandb/wandb_to_neptune.py
Enhanced logging configuration and suppression
  • Replaced hard-coded INFO level with dynamic log_level parameter
  • Improved log message formatting with function names and line numbers
  • Silenced noisy third-party loggers in a loop
utils/migration_tools/from_wandb/wandb_to_neptune.py
Migrated metric namespace and naming conventions
  • Changed hardware/system metrics namespace from system to runtime
  • Replaced . in metric names with / to match Neptune conventions
utils/migration_tools/from_wandb/wandb_to_neptune.py
utils/migration_tools/from_wandb/README.md
Added support for uploading console logs and files
  • Introduced EXCLUDED_PATHS and File type from neptune_scale.types
  • Implemented functions to copy console output, source code, requirements, checkpoints, and other files
  • Integrated new file-upload functions into the parallel copy flow
utils/migration_tools/from_wandb/wandb_to_neptune.py
Improved concurrency control and error handling
  • Added timeout parameter to ThreadPoolExecutor futures
  • Collected and reported FAILED_RUNS and FAILED_PROJECTS lists
  • Logged and printed aggregated failures with non-zero exit code on project errors
utils/migration_tools/from_wandb/wandb_to_neptune.py
Refreshed documentation with changelog and usage details
  • Expanded README with versioned changelog
  • Documented all CLI arguments with examples
  • Updated metadata mapping to include new runtime fields and file types
utils/migration_tools/from_wandb/README.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@SiddhantSadangi SiddhantSadangi changed the title Ss/wandb_exporter Update W&B exporter Jun 13, 2025
@SiddhantSadangi SiddhantSadangi self-assigned this Jun 13, 2025
@SiddhantSadangi SiddhantSadangi requested a review from Copilot June 13, 2025 15:44
@SiddhantSadangi SiddhantSadangi added the enhancement New feature or request label Jun 13, 2025
Copilot

This comment was marked as outdated.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SiddhantSadangi - I've reviewed your changes and found some issues that need to be addressed.

  • Avoid relying on a global tmpdirname inside copy_files—pass the download directory explicitly to keep the function pure and testable.
  • Using threadsafe_change_directory in a multithreaded context can lead to unpredictable cwd changes across threads—prefer uploading files by absolute path instead.
  • Consider cleaning up the temporary directory after completion (or using a context manager) and exiting with a non-zero code on failures to make automation more robust.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Avoid relying on a global `tmpdirname` inside `copy_files`—pass the download directory explicitly to keep the function pure and testable.
- Using `threadsafe_change_directory` in a multithreaded context can lead to unpredictable cwd changes across threads—prefer uploading files by absolute path instead.
- Consider cleaning up the temporary directory after completion (or using a context manager) and exiting with a non-zero code on failures to make automation more robust.

## Individual Comments

### Comment 1
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:268` </location>
<code_context>
-    #         filename.replace("code/", ""), wait=True
-    #     )
-    raise NotImplementedError("Source code logging is not implemented in Neptune yet")
+    with threadsafe_change_directory(os.path.join(download_path.replace(filename, ""), "code")):
+        filename = filename.replace("code/", "")
+        neptune_run.assign_files(
</code_context>

<issue_to_address>
Use `os.path.dirname` instead of `replace` for directories

Using string replacement on paths can lead to errors if the filename appears elsewhere in the path. `os.path.dirname(download_path)` is a safer and more reliable way to get the parent directory.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    with threadsafe_change_directory(os.path.join(download_path.replace(filename, ""), "code")):
        filename = filename.replace("code/", "")
        neptune_run.assign_files(
            {f"source_code/files/{filename}": File(source=filename, mime_type="text/plain")}
        )
=======
    with threadsafe_change_directory(os.path.join(os.path.dirname(download_path), "code")):
        filename = filename.replace("code/", "")
        neptune_run.assign_files(
            {f"source_code/files/{filename}": File(source=filename, mime_type="text/plain")}
        )
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:282` </location>
<code_context>
-    # with threadsafe_change_directory(download_path.replace(filename, "")):
-    #     neptune_run[namespace].upload_files(filename, wait=True)
-    raise NotImplementedError("File logging is not implemented in Neptune yet")
+    with threadsafe_change_directory(download_path.replace(filename, "")):
+        neptune_run.assign_files({f"{namespace}/{filename}": File(source=filename)})

</code_context>

<issue_to_address>
Use `os.path.dirname` to determine the directory

`os.path.dirname(download_path)` is more reliable than using `replace(filename, "")` for extracting the parent directory.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
-    with threadsafe_change_directory(download_path.replace(filename, "")):
-        neptune_run.assign_files({f"{namespace}/{filename}": File(source=filename)})
=======
+    with threadsafe_change_directory(os.path.dirname(download_path)):
+        neptune_run.assign_files({f"{namespace}/{filename}": File(source=filename)})
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:395` </location>
<code_context>
-
-    logger, log_filename = setup_logging()
+    wandb_entity = args.wandb_entity
+    neptune_workspace = args.neptune_workspace
+    num_workers = args.num_workers
+    timeout = args.timeout
</code_context>

<issue_to_address>
Validate that Neptune workspace is provided

Add a check to ensure `neptune_workspace` is set, and raise an error if not, to prevent downstream failures.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    wandb_entity = args.wandb_entity
    neptune_workspace = args.neptune_workspace
    num_workers = args.num_workers
    timeout = args.timeout

    logger, log_filename = setup_logging(args.log_level)
=======
    wandb_entity = args.wandb_entity
    neptune_workspace = args.neptune_workspace
    num_workers = args.num_workers
    timeout = args.timeout

    if not neptune_workspace:
        raise ValueError("Neptune workspace must be provided via --neptune-workspace to avoid downstream failures.")

    logger, log_filename = setup_logging(args.log_level)
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

…ved logging and error handling in the migration script to ensure smoother operation and clearer feedback during execution.
Copilot

This comment was marked as outdated.

@SiddhantSadangi
Copy link
Member Author

@sourcery-ai review

sourcery-ai[bot]

This comment was marked as outdated.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SiddhantSadangi - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:148` </location>
<code_context>
+                        neptune_run.log_configs({f"wandb/{attr}": getattr(wandb_run, attr)})
+                except TypeError:
+                    pass
+                except Exception as e:
+                    logger.error(
+                        f"[{copy_run.__name__}]\tFailed to copy '{attr}' from W&B run '{wandb_run.name}' due to exception:\n{e}"
</code_context>

<issue_to_address>
Use `logger.exception` for full traceback

`logger.exception` will log the stack trace, making it easier to debug unexpected errors.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                except Exception as e:
                    logger.error(
                        f"[{copy_run.__name__}]\tFailed to copy '{attr}' from W&B run '{wandb_run.name}' due to exception:\n{e}"
                    )
=======
                except Exception as e:
                    logger.exception(
                        f"[{copy_run.__name__}]\tFailed to copy '{attr}' from W&B run '{wandb_run.name}' due to exception:\n{e}"
                    )
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SiddhantSadangi - I've reviewed your changes - here's some feedback:

  • Rather than scattering module‐level globals like neptune_workspace, timeout, and num_workers, consider bundling them into a single config or argument object and passing it explicitly to functions for better clarity and testability.
  • Instead of creating a timestamped directory in the current working directory, you could use Python’s tempfile.TemporaryDirectory to avoid clutter and ensure automatic cleanup.
  • Accessing the protected attribute neptune_run._project to build the run URL is brittle—switch to a public API or cache the project name separately to prevent breakage if internals change.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Rather than scattering module‐level globals like `neptune_workspace`, `timeout`, and `num_workers`, consider bundling them into a single config or argument object and passing it explicitly to functions for better clarity and testability.
- Instead of creating a timestamped directory in the current working directory, you could use Python’s tempfile.TemporaryDirectory to avoid clutter and ensure automatic cleanup.
- Accessing the protected attribute `neptune_run._project` to build the run URL is brittle—switch to a public API or cache the project name separately to prevent breakage if internals change.

## Individual Comments

### Comment 1
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:347` </location>
<code_context>
             try:
-                future.result()
+                future.result(timeout=timeout)
+            except TimeoutError:
+                logger.error(f"Timeout copying run {wandb_run.project}/{wandb_run.name}")
+                FAILED_RUNS.append((wandb_run.project, wandb_run.name, "Timeout"))
</code_context>

<issue_to_address>
Catch the correct TimeoutError for futures

Use concurrent.futures.TimeoutError instead of the built-in TimeoutError to correctly catch timeouts from future.result(timeout=...).
</issue_to_address>

### Comment 2
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:254` </location>
<code_context>
-    #     for line in f:
-    #         neptune_run["monitoring/stdout"].append(line)
-    raise NotImplementedError("Console logging is not implemented in Neptune yet")
+    with open(download_path) as f:
+        for step, line in enumerate(f):
+            neptune_run.log_string_series({"runtime/stdout": line}, step=step)
</code_context>

<issue_to_address>
Explicitly declare encoding when opening files

Not specifying an encoding may cause errors with non-UTF8 files. Use open(download_path, 'r', encoding='utf-8', errors='ignore') or another suitable encoding.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    with open(download_path) as f:
        for step, line in enumerate(f):
            neptune_run.log_string_series({"runtime/stdout": line}, step=step)
=======
    with open(download_path, 'r', encoding='utf-8', errors='ignore') as f:
        for step, line in enumerate(f):
            neptune_run.log_string_series({"runtime/stdout": line}, step=step)
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `utils/migration_tools/from_wandb/README.md:58` </location>
<code_context>
+
+### Note
+- If they don't exist yet, Neptune projects corresponding to the W&B projects will be created as [*private*][docs-project-access] with project description set to *Exported from <W&B project URL>*. To change the project visibility or description, use the web app or update the `create_project()` function in `copy_project()`.  
+- A `tmp_<timestamp>` directory is created in the current working directory to store the files and logs. Ensure that the current working directory is writable, and that there is enough disk space. You can delete this directory after the script has finished running and the required sanity checks have been performed.

 ## Metadata mapping from W&B to Neptune
</code_context>

<issue_to_address>
Clarify log file location.

Please specify whether `wandb_to_neptune_<timestamp>.log` is located inside the `tmp_<timestamp>` directory or elsewhere.
</issue_to_address>

### Comment 4
<location> `utils/migration_tools/from_wandb/README.md:57` </location>
<code_context>
+```
+
+### Note
+- If they don't exist yet, Neptune projects corresponding to the W&B projects will be created as [*private*][docs-project-access] with project description set to *Exported from <W&B project URL>*. To change the project visibility or description, use the web app or update the `create_project()` function in `copy_project()`.  
+- A `tmp_<timestamp>` directory is created in the current working directory to store the files and logs. Ensure that the current working directory is writable, and that there is enough disk space. You can delete this directory after the script has finished running and the required sanity checks have been performed.

</code_context>

<issue_to_address>
Consider rephrasing guidance on modifying script functions.

Rephrase to clarify that modifying `create_project()` is an advanced customization, not the standard approach, to avoid confusion for typical users.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
- If they don't exist yet, Neptune projects corresponding to the W&B projects will be created as [*private*][docs-project-access] with project description set to *Exported from <W&B project URL>*. To change the project visibility or description, use the web app or update the `create_project()` function in `copy_project()`.  
=======
- If they don't exist yet, Neptune projects corresponding to the W&B projects will be created as [*private*][docs-project-access] with project description set to *Exported from <W&B project URL>*. To change the project visibility or description, use the web app. Advanced users may also customize the `create_project()` function in `copy_project()`, but this is not required for typical usage.  
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

…irectory usage. Enhance `copy_console_output` function in migration script to log console output to Neptune runs.
@SiddhantSadangi SiddhantSadangi requested a review from Copilot June 16, 2025 10:29
@SiddhantSadangi
Copy link
Member Author

@sourcery-ai review

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

The PR modernizes the W&B → Neptune migration script by adding full file and console log support, converting interactive prompts to CLI arguments, and standardizing metric namespaces and logging.

  • Implements downloading and uploading of run files (logs, source code, requirements, checkpoints, etc.).
  • Switches configuration entirely to CLI flags with default values and enhanced logging.
  • Renames system metric namespace from system.* to runtime/* and updates matching documentation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
utils/migration_tools/from_wandb/wandb_to_neptune.py Add CLI parsing, file and console log copying, update metric namespace and logging
utils/migration_tools/from_wandb/README.md Document new features, changelog, usage instructions, and CLI arguments
Comments suppressed due to low confidence (5)

utils/migration_tools/from_wandb/wandb_to_neptune.py:331

  • Consider renaming the local list FAILED_RUNS to failed_runs to follow standard snake_case naming for mutable variables.
    FAILED_RUNS = []

utils/migration_tools/from_wandb/wandb_to_neptune.py:438

  • This local variable FAILED_PROJECTS should be renamed to failed_projects to adhere to snake_case conventions.
    FAILED_PROJECTS = []

utils/migration_tools/from_wandb/wandb_to_neptune.py:164

  • Accessing the private attribute _project may break if the internal API changes. Consider using a public property or method on neptune_run to construct the target URL.
                    f"Copied {wandb_run.url} to https://scale.neptune.ai/{neptune_run._project}/runs/table?viewId=standard-view&query=(%60sys%2Fname%60%3Astring%20MATCHES%20%22{wandb_run.name}%22)&experimentsOnly=false&runsLineage=FULL&lbViewUnpacked=true"

utils/migration_tools/from_wandb/wandb_to_neptune.py:254

  • New file-reading and line-by-line logging logic has been introduced here. It would help maintain reliability if unit or integration tests covered copy_console_output handling of missing or malformed log files.
    with open(download_path, encoding="utf-8", errors="ignore") as f:

utils/migration_tools/from_wandb/wandb_to_neptune.py:58

  • [nitpick] Consider adding a brief docstring explaining the expected format of log_level values and the return tuple contents (logger, log_filename).
def setup_logging(log_level: str) -> tuple[logging.Logger, str]:

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SiddhantSadangi - I've reviewed your changes - here's some feedback:

  • Consider using a tempfile.TemporaryDirectory (or cleaning up tmpdirname in a finally block) so that the temporary folder is removed after migration instead of littering the working directory.
  • Rather than relying on globals like neptune_workspace, wandb_entity, and timeout inside copy_project, pass them explicitly as function parameters to improve clarity and testability.
  • The EXCLUDED_PATHS set mixes prefixes and exact filenames (e.g. config.yaml); to avoid accidental partial matches you might normalize paths and check equality for files while keeping prefix checks for directories.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider using a tempfile.TemporaryDirectory (or cleaning up `tmpdirname` in a finally block) so that the temporary folder is removed after migration instead of littering the working directory.
- Rather than relying on globals like `neptune_workspace`, `wandb_entity`, and `timeout` inside `copy_project`, pass them explicitly as function parameters to improve clarity and testability.
- The EXCLUDED_PATHS set mixes prefixes and exact filenames (e.g. `config.yaml`); to avoid accidental partial matches you might normalize paths and check equality for files while keeping prefix checks for directories.

## Individual Comments

### Comment 1
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:425` </location>
<code_context>
-
-    if selected_projects == "":
-        selected_projects = wandb_project_names
+    if args.projects:
+        selected_projects = [p.strip() for p in args.projects.split(",") if p.strip()]
     else:
-        selected_projects = [project.strip() for project in selected_projects.split(",")]
+        selected_projects = wandb_project_names

     if not_found := set(selected_projects) - set(wandb_project_names):
</code_context>

<issue_to_address>
No check for empty selected_projects after filtering.

Consider exiting or prompting the user if selected_projects is empty to prevent unnecessary execution.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    if args.projects:
        selected_projects = [p.strip() for p in args.projects.split(",") if p.strip()]
    else:
        selected_projects = wandb_project_names

    if not_found := set(selected_projects) - set(wandb_project_names):
        print(f"Projects not found: {not_found}")
    print(f"Copying {len(selected_projects)} projects: {selected_projects}\n")
    logger.info(f"Copying {len(selected_projects)} projects: {selected_projects}")
=======
    if args.projects:
        selected_projects = [p.strip() for p in args.projects.split(",") if p.strip()]
    else:
        selected_projects = wandb_project_names

    if not selected_projects:
        print("No valid projects selected after filtering. Exiting.")
        exit(1)

    if not_found := set(selected_projects) - set(wandb_project_names):
        print(f"Projects not found: {not_found}")
    print(f"Copying {len(selected_projects)} projects: {selected_projects}\n")
    logger.info(f"Copying {len(selected_projects)} projects: {selected_projects}")
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SiddhantSadangi - I've reviewed your changes - here's some feedback:

  • Refactor copy_run into smaller subfunctions for attribute handling, metrics copying, and file uploads to reduce complexity and improve readability.
  • Avoid relying on module-level globals (e.g., neptune_workspace, timeout) inside core functions by passing them explicitly as parameters to improve modularity and testability.
  • Consider batching or streaming console log lines rather than logging each line individually to reduce API calls and improve performance when handling large log files.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Refactor `copy_run` into smaller subfunctions for attribute handling, metrics copying, and file uploads to reduce complexity and improve readability.
- Avoid relying on module-level globals (e.g., `neptune_workspace`, `timeout`) inside core functions by passing them explicitly as parameters to improve modularity and testability.
- Consider batching or streaming console log lines rather than logging each line individually to reduce API calls and improve performance when handling large log files.

## Individual Comments

### Comment 1
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:283` </location>
<code_context>
+
+
+def copy_files(neptune_run: Run, wandb_run, download_folder: str) -> None:  # type: ignore
+    for file in wandb_run.files():
+        if file.size and not any(file.name.startswith(path) for path in EXCLUDED_PATHS):
+            download_path = os.path.join(download_folder, file.name)
+            try:
+                logger.debug(
+                    f"Downloading file {file.name} for run {wandb_run.name} to {download_path}"
+                )
+                file.download(root=download_folder, replace=True, exist_ok=True)
+                if file.name == "output.log":
</code_context>

<issue_to_address>
No handling for file name collisions in copy_files.

Files with identical names from different subdirectories may overwrite each other in the download_folder. Please implement logic to preserve directory structure or handle name collisions to prevent data loss.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SiddhantSadangi - I've reviewed your changes - here's some feedback:

  • Remove unused helper functions and commented-out code (e.g., the threadsafe_change_directory context manager) to reduce clutter and maintainability burden.
  • Add cleanup logic to delete the temporary directory after migration completes to prevent leftover files from accumulating on disk.
  • Rename local mutable lists FAILED_RUNS and FAILED_PROJECTS to lowercase (e.g., failed_runs, failed_projects) to follow Python naming conventions for non-constant variables.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Remove unused helper functions and commented-out code (e.g., the threadsafe_change_directory context manager) to reduce clutter and maintainability burden.
- Add cleanup logic to delete the temporary directory after migration completes to prevent leftover files from accumulating on disk.
- Rename local mutable lists FAILED_RUNS and FAILED_PROJECTS to lowercase (e.g., failed_runs, failed_projects) to follow Python naming conventions for non-constant variables.

## Individual Comments

### Comment 1
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:347` </location>
<code_context>
             try:
-                future.result()
+                future.result(timeout=timeout)
+            except TimeoutError:
+                logger.error(f"Timeout copying run {wandb_run.project}/{wandb_run.name}")
+                FAILED_RUNS.append((wandb_run.project, wandb_run.name, "Timeout"))
</code_context>

<issue_to_address>
Ambiguous TimeoutError catch

`future.result(timeout=timeout)` raises `concurrent.futures.TimeoutError`, not the built-in `TimeoutError`. Please import and catch `concurrent.futures.TimeoutError` explicitly.
</issue_to_address>

### Comment 2
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:46` </location>
<code_context>
+EXCLUDED_PATHS = {"artifact/", "config.yaml", "media/", "wandb-"}
+

 def exc_handler(exctype, value, tb):
     logger.exception("".join(traceback.format_exception(exctype, value, tb)))
</code_context>

<issue_to_address>
exc_handler never registered

Set `sys.excepthook = exc_handler` after defining the handler to ensure it processes uncaught exceptions.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def exc_handler(exctype, value, tb):
    logger.exception("".join(traceback.format_exception(exctype, value, tb)))
=======
def exc_handler(exctype, value, tb):
    logger.exception("".join(traceback.format_exception(exctype, value, tb)))

import sys
sys.excepthook = exc_handler
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `utils/migration_tools/from_wandb/wandb_to_neptune.py:149` </location>
<code_context>
+                except TypeError:
+                    pass
+                except Exception as e:
+                    logger.error(
+                        f"[{copy_run.__name__}]\tFailed to copy '{attr}' from W&B run '{wandb_run.name}' due to exception:\n{e}"
+                    )
</code_context>

<issue_to_address>
Use logger.exception to capture traceback

Using `logger.exception` will automatically include the stack trace, making debugging easier.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                except Exception as e:
                    logger.error(
                        f"[{copy_run.__name__}]\tFailed to copy '{attr}' from W&B run '{wandb_run.name}' due to exception:\n{e}"
                    )
=======
                except Exception:
                    logger.exception(
                        f"[{copy_run.__name__}]\tFailed to copy '{attr}' from W&B run '{wandb_run.name}'"
                    )
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

szaganek
szaganek previously approved these changes Jun 16, 2025
@SiddhantSadangi SiddhantSadangi merged commit e856870 into main Jun 16, 2025
7 checks passed
@SiddhantSadangi SiddhantSadangi deleted the ss/wandb_exporter branch June 16, 2025 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants