refactor data mover: switch to BatchJob with auto cleanup and sleep after every run #265
Conversation
…fter every run Signed-off-by: prekshivyas <[email protected]>
Thanks for putting this together, @prekshivyas! This looks good to me - just some small comments on extra wait states.
nemo_run/core/execution/lepton.py
Outdated
raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.") | ||
current_job = client.job.get(job_id) | ||
current_job_status = current_job.status.state | ||
if count > 0 and current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed, LeptonJobState.Unknown]: |
Do we need the `count > 0` check? If the job is immediately in one of the acceptable states, figured we could break out right away, but I suppose it could be in `Unknown` prior to running?
So I had put some logs in to check what states come up when the job is just scheduling or starting. Sometimes it would be `Unknown`, like in the initial tests - I saw it come up more than 2 times - but in my later tests it would just come up as `Starting` and not `Unknown` anymore. Just to be safe and avoid this randomness, I put in a count.
That makes sense, thanks! We can keep the count check then. If we want to be super efficient, we could do another branch for `if current_job_status == LeptonJobState.Unknown and count == 0`, sleep/retry in that state, and only check for `Completed` and `Failed` without the count here, but I'm not too particular about it.
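For reference, a rough sketch of what that suggested polling loop might look like. The function name `wait_for_lepton_job`, the `poll_interval` parameter, and the import path for `LeptonJobState` are assumptions for illustration; only `client.job.get(job_id)` and `current_job.status.state` come from the diff above:

```python
import time

# Assumed import path; in nemo_run/core/execution/lepton.py the LeptonJobState
# enum is presumably already imported from the Lepton SDK.
from leptonai.api.v1.types.job import LeptonJobState


def wait_for_lepton_job(client, job_id: str, timeout: float = 3600.0, poll_interval: float = 10.0):
    """Sketch: poll a Lepton job until it reaches a terminal state.

    Early Unknown states (job still scheduling/starting) simply fall through
    and get re-polled, so no separate count variable is needed.
    """
    start = time.monotonic()
    while True:
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
        current_job = client.job.get(job_id)
        current_job_status = current_job.status.state
        # Only Completed/Failed are treated as terminal here; Unknown and
        # Starting drop down to the sleep below and are checked again.
        if current_job_status in (LeptonJobState.Completed, LeptonJobState.Failed):
            return current_job_status
        time.sleep(poll_interval)
```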
nemo_run/core/execution/lepton.py
Outdated
if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")

time.sleep(sleep)
I lean towards not putting a `sleep` here. If the data has been uploaded to the remote FS and the data-mover is marked as `Completed`, I'd say we continue immediately to the next stages to reduce downtime. Theoretically, we should only make it to this line if everything is ready to continue.
Hmm, I see! Yeah, we could remove the sleep, so I will do that now.
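The tail of the wait logic would then presumably look something like this (a sketch mirroring the diff above, with the trailing sleep dropped; `job_id` and `current_job_status` come from the surrounding code):

```python
# After the polling loop: fail loudly on anything other than Completed, and
# otherwise continue immediately (no trailing time.sleep) so the next stages
# can start as soon as the data mover finishes.
if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")
```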
LGTM, thanks!
No description provided.