
refactor data mover: switch to BatchJob with auto cleanup and sleep after every run #265


Merged
9 commits merged into NVIDIA-NeMo:main on Jun 11, 2025

Conversation

prekshivyas (Contributor)

No description provided.

@roclark (Collaborator) left a comment:

Thanks for putting this together, @prekshivyas! This looks good to me - just some small comments on extra wait states.

raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
current_job = client.job.get(job_id)
current_job_status = current_job.status.state
if count > 0 and current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed, LeptonJobState.Unknown]:
roclark (Collaborator):

Do we need the count > 0 check? If the job is immediately in one of the acceptable states, figured we could break out right away, but I suppose it could be in Unknown prior to running?

prekshivyas (Contributor):

So I had put some logs in to check which states come up when the job is just scheduling or starting. Sometimes it would be Unknown, like in the initial tests - I saw it come up more than 2 times - but in my later tests it would just come up as Starting and not Unknown anymore. Just to be safe and avoid this randomness, I put in the count.
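
For context, here is a minimal sketch of the polling loop being discussed, reconstructed from the quoted diff. The wait_for_job name, its parameters, and the LeptonJobState import path are assumptions for illustration, not the PR's actual code; the count > 0 guard is what skips the very first poll, where a freshly scheduled job can transiently report Unknown.

import time

# Import path assumed; adjust to your leptonai version.
from leptonai.api.v1.types.job import LeptonJobState

def wait_for_job(client, job_id, timeout, sleep):
    """Poll a Lepton job until it reaches a terminal state (sketch only)."""
    start = time.time()
    count = 0
    while True:
        if time.time() - start > timeout:
            raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
        current_job = client.job.get(job_id)
        current_job_status = current_job.status.state
        # Skip the terminal-state check on the very first poll: a job that
        # is still scheduling can briefly report Unknown before it starts.
        if count > 0 and current_job_status in [
            LeptonJobState.Completed,
            LeptonJobState.Failed,
            LeptonJobState.Unknown,
        ]:
            break
        count += 1
        time.sleep(sleep)
    if current_job_status != LeptonJobState.Completed:
        raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")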

roclark (Collaborator):

That makes sense, thanks! We can keep the count check then. If we want to be super efficient, we could add another branch for if current_job_status == LeptonJobState.Unknown and count == 0, sleep/retry in that state, and only check for Completed and Failed without the count here, but I'm not too particular about it.
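
For illustration, that alternative might look like the fragment below (same assumed names as the sketch above; it would replace the body of the while-loop). One trade-off: with Unknown dropped from the terminal list, a job stuck in Unknown after its first poll is only caught by the timeout, which is an argument for keeping the simpler count check.

# Loop-body fragment (goes inside the while-loop of the sketch above):
if current_job_status == LeptonJobState.Unknown and count == 0:
    count += 1
    time.sleep(sleep)
    continue  # re-poll rather than treating pre-run Unknown as terminal
# From then on, Completed/Failed end the wait with no count guard.
if current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed]:
    break
time.sleep(sleep)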

if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")

time.sleep(sleep)
roclark (Collaborator):

I lean towards not putting a sleep here. If the data has been uploaded to the remote FS and the data-mover is marked as Completed, I'd say we continue immediately to the next stages to reduce downtime. Theoretically, we should only make it to this line if everything is ready to continue.

prekshivyas (Contributor):

Hmm, I see! Yeah, we could remove the sleep, so I will do that now.
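
The agreed change would just drop the trailing sleep after the status check (fragment, same assumptions as the sketches above):

# After the polling loop exits with a terminal state:
if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")
# time.sleep(sleep)  # removed per review: once the data mover is Completed,
#                    # downstream stages can start immediately.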

prekshivyas added 4 commits June 9, 2025 12:57
prekshivyas added 2 commits June 11, 2025 13:22
@roclark (Collaborator) left a comment:

LGTM, thanks!

@roclark merged commit caf3f12 into NVIDIA-NeMo:main on Jun 11, 2025
18 of 20 checks passed