
refactor data mover: switch to BatchJob with auto cleanup and sleep after every run #265


Merged
9 commits merged into NVIDIA-NeMo:main on Jun 11, 2025

Conversation

prekshivyas (Contributor)

No description provided.

@roclark (Collaborator) left a comment:

Thanks for putting this together, @prekshivyas! This looks good to me - just some small comments on extra wait states.

raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
current_job = client.job.get(job_id)
current_job_status = current_job.status.state
if count > 0 and current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed, LeptonJobState.Unknown]:
roclark (Collaborator):

Do we need the count > 0 check? If the job is immediately in one of the acceptable states, figured we could break out right away, but I suppose it could be in Unknown prior to running?

prekshivyas (Contributor):

So I had put some logs in to check which states come up when the job is just scheduling or starting. Sometimes it would be Unknown, like in the initial tests - I saw it come up more than 2 times - but in my later tests it would just come up as Starting and not Unknown anymore. Just to be safe and avoid this randomness, I put in the count.
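
For context, here is a minimal sketch of the polling loop being discussed, reconstructed from the quoted diff. The wait_for_job name, its parameters, and the LeptonJobState import path are assumptions for illustration, not the PR's actual code; the count > 0 guard is what skips the very first poll, where a freshly scheduled job can transiently report Unknown.

import time

# Import path assumed; adjust to your leptonai version.
from leptonai.api.v1.types.job import LeptonJobState

def wait_for_job(client, job_id, timeout, sleep):
    """Poll a Lepton job until it reaches a terminal state (sketch only)."""
    start = time.time()
    count = 0
    while True:
        if time.time() - start > timeout:
            raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
        current_job = client.job.get(job_id)
        current_job_status = current_job.status.state
        # Skip the terminal-state check on the very first poll: a job that
        # is still scheduling can briefly report Unknown before it starts.
        if count > 0 and current_job_status in [
            LeptonJobState.Completed,
            LeptonJobState.Failed,
            LeptonJobState.Unknown,
        ]:
            break
        count += 1
        time.sleep(sleep)
    if current_job_status != LeptonJobState.Completed:
        raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")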

roclark (Collaborator):

That makes sense, thanks! We can keep the count check then. If we want to be super efficient, we could add another branch for if current_job_status == LeptonJobState.Unknown and count == 0, sleep/retry in that state, and only check for Completed and Failed without the count here, but I'm not too particular about it.
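
For illustration, that alternative might look like the fragment below (same assumed names as the sketch above; it would replace the body of the while-loop). One trade-off: with Unknown dropped from the terminal list, a job stuck in Unknown after its first poll is only caught by the timeout, which is an argument for keeping the simpler count check.

# Loop-body fragment (goes inside the while-loop of the sketch above):
if current_job_status == LeptonJobState.Unknown and count == 0:
    count += 1
    time.sleep(sleep)
    continue  # re-poll rather than treating pre-run Unknown as terminal
# From then on, Completed/Failed end the wait with no count guard.
if current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed]:
    break
time.sleep(sleep)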

if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")

time.sleep(sleep)
roclark (Collaborator):

I lean towards not putting a sleep here. If the data has been uploaded to the remote FS and the data-mover is marked as Completed, I'd say we continue immediately to the next stages to reduce downtime. Theoretically, we should only make it to this line if everything is ready to continue.

prekshivyas (Contributor):

Hmm, I see! Yeah, we could remove the sleep, so I will do that now.
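
The agreed change would just drop the trailing sleep after the status check (fragment, same assumptions as the sketches above):

# After the polling loop exits with a terminal state:
if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")
# time.sleep(sleep)  # removed per review: once the data mover is Completed,
#                    # downstream stages can start immediately.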

prekshivyas added 4 commits June 9, 2025 12:57
prekshivyas added 2 commits June 11, 2025 13:22
@roclark (Collaborator) left a comment:

LGTM, thanks!

@roclark merged commit caf3f12 into NVIDIA-NeMo:main on Jun 11, 2025
18 of 20 checks passed