feat: Set maximum job duration for Azure Batch #5996
Adds a maximum job duration for Azure Batch, after which jobs terminate automatically and stop counting against quota. This acts as a kill switch to prevent the job quota being used up if Nextflow dies unexpectedly. Signed-off-by: adamrtalbot <[email protected]>
Why not just use the |
Isn't the task already constrained by the container max wall time 👉 5c11a0d? In any case, I agree with Ben: we should avoid introducing yet another setting |
No, that is the time limit for the individual task, not the job. Remember, a job is a queue in Azure Batch, which means Nextflow does not manage it in any way other than closing it down when the pipeline closes. See this issue for details: #3926. This is a problem because Azure Batch allows only a finite number of active jobs (e.g. a maximum of 100). Azure seems to be getting stricter about this, which is causing problems, so we need to be more frugal. When Nextflow exits abruptly, it leaves jobs in the active state, where they use up quota and have to be deleted manually. With a finite time limit, orphaned jobs (queues) are terminated after a fixed period, which makes them much less likely to fill up the quota and block the creation of new jobs and tasks. |
Adam is right, queues in Azure are called "jobs", and apparently they aren't meant to be kept around after the workflow is finished. Still, I would try to find a better name for the setting because Nextflow users will be confused by |
Options:
|
how about |
Java doc for task says:
Java doc for job says:
I understand the rationale
However, this may break long-running jobs (though 7 days would be quite unusual). Would enabling cleanup by default not solve the problem? nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy Lines 995 to 998 in de05ecc
|
The proposed timeout is for when a pipeline is killed before it can clean up the jobs |
Here's the refined version:
Correct, this is why I've set a long default value. It's essentially a dead man's switch, and we could also set it to 30, 60, or even 365 days. Setting it to
We already handle job termination and deletion effectively during normal pipeline runs, but we still encounter orphaned active jobs, typically when Nextflow is killed abruptly and cannot shut down properly. With this change, jobs terminate themselves once the maximum wall clock time has elapsed, which means you eventually return to zero active jobs. |
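The effect of the limit on quota can be illustrated with a small model. This is only a sketch: `Job`, `isActiveAt`, and `activeJobs` are hypothetical names invented here, not Nextflow or Azure SDK types, and it assumes the limit is measured from job creation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

class JobQuotaSketch {
    /** Hypothetical stand-in for an Azure Batch job (a queue of tasks). */
    static final class Job {
        final String id;
        final Instant createdAt;
        final Duration maxWallClockTime; // null means no limit is set

        Job(String id, Instant createdAt, Duration maxWallClockTime) {
            this.id = id;
            this.createdAt = createdAt;
            this.maxWallClockTime = maxWallClockTime;
        }

        /** An orphaned job without a limit stays active until someone deletes it. */
        boolean isActiveAt(Instant now) {
            if (maxWallClockTime == null) {
                return true;
            }
            return now.isBefore(createdAt.plus(maxWallClockTime));
        }
    }

    /** Counts the jobs that would still occupy active-job quota at the given time. */
    static long activeJobs(List<Job> jobs, Instant now) {
        return jobs.stream().filter(j -> j.isActiveAt(now)).count();
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2025-01-01T00:00:00Z");
        // Two jobs orphaned by an abrupt exit: one unlimited, one with a 7-day limit.
        List<Job> orphaned = List.of(
            new Job("no-limit", start, null),
            new Job("seven-days", start, Duration.ofDays(7)));
        System.out.println(activeJobs(orphaned, start.plus(Duration.ofDays(1)))); // 2
        System.out.println(activeJobs(orphaned, start.plus(Duration.ofDays(8)))); // 1
    }
}
```

Only the job with a limit frees its quota slot without manual intervention, which is the dead man's switch behaviour described above.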
Let's use |
Signed-off-by: adamrtalbot <[email protected]>
Done and done. |
when:
// This is what happens inside createJob0 for setting constraints
def content = new BatchJobCreateContent('test-job', new BatchPoolInfo(poolId: 'test-pool'))
if (exec.getConfig().batch().jobMaxWallClockTime) {
    final constraints = new BatchJobConstraints()
    final long millis = exec.getConfig().batch().jobMaxWallClockTime.toMillis()
    final java.time.Duration maxWallTime = java.time.Duration.ofMillis(millis)
    constraints.setMaxWallClockTime(maxWallTime)
    content.setConstraints(constraints)
}
This doesn't feel like a very good test... but I can't see how to mock the methods inside createJob0. Or is this an indication that I should split the method into a createJobConstraints method and a createJob0 method?
Done in a3dbc7c
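For illustration, the split discussed above could take roughly the following shape. This is a hedged sketch, not the code from the commit: `Constraints` is a hypothetical stand-in for the SDK's constraints object, and treating a missing or non-positive value as "no limit" is an assumption made here.

```java
import java.time.Duration;
import java.util.Optional;

class JobConstraintsSketch {
    /** Hypothetical stand-in for the SDK's job-constraints object. */
    static final class Constraints {
        private Duration maxWallClockTime;

        Constraints setMaxWallClockTime(Duration value) {
            this.maxWallClockTime = value;
            return this;
        }

        Duration getMaxWallClockTime() {
            return maxWallClockTime;
        }
    }

    /**
     * Builds constraints from a configured limit in milliseconds.
     * Returns empty when no limit is configured (assumption: null or
     * non-positive), so the caller can skip setting constraints on the job.
     */
    static Optional<Constraints> createJobConstraints(Long maxWallClockMillis) {
        if (maxWallClockMillis == null || maxWallClockMillis <= 0) {
            return Optional.empty();
        }
        Duration limit = Duration.ofMillis(maxWallClockMillis);
        return Optional.of(new Constraints().setMaxWallClockTime(limit));
    }

    public static void main(String[] args) {
        long sevenDays = Duration.ofDays(7).toMillis();
        System.out.println(createJobConstraints(sevenDays).get().getMaxWallClockTime()); // PT168H
        System.out.println(createJobConstraints(null).isPresent()); // false
    }
}
```

Because the helper depends only on its argument, it can be tested directly without mocking the job-creation call.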
…olation Signed-off-by: adamrtalbot <[email protected]>
Signed-off-by: adamrtalbot <[email protected]>
Docs look great.
@@ -945,7 +961,11 @@ class AzBatchService implements Closeable {
apply(() -> client.updateJob(jobId, jobParameter))
}
catch (Exception e) {
log.warn "Unable to terminate Azure Batch job ${jobId} - Reason: ${e.message ?: e}"
if (e.message?.contains('Status code 409') && e.message?.contains('JobCompleted')) {
Using a static exception type may be better, see for example
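The example the reviewer linked is not preserved on this page. As an independent sketch of the general idea, the check can rely on a typed exception that carries the status and error code instead of matching substrings of the message. `BatchHttpException` and `isJobAlreadyCompleted` are hypothetical names; which exception type the Batch client actually throws here, and how it exposes these fields, would need to be confirmed against the SDK (azure-core clients generally surface HTTP failures as `HttpResponseException`, whose response exposes `getStatusCode()`).

```java
class TerminateJobSketch {
    /** Hypothetical typed HTTP error carrying the status and the service error code. */
    static final class BatchHttpException extends RuntimeException {
        final int statusCode;
        final String errorCode;

        BatchHttpException(int statusCode, String errorCode, String message) {
            super(message);
            this.statusCode = statusCode;
            this.errorCode = errorCode;
        }
    }

    /** True only when the failure means the job had already completed. */
    static boolean isJobAlreadyCompleted(Exception e) {
        if (!(e instanceof BatchHttpException)) {
            return false;
        }
        BatchHttpException http = (BatchHttpException) e;
        return http.statusCode == 409 && "JobCompleted".equals(http.errorCode);
    }

    public static void main(String[] args) {
        Exception conflict = new BatchHttpException(409, "JobCompleted", "conflict");
        Exception other = new RuntimeException("Status code 409 JobCompleted");
        System.out.println(isJobAlreadyCompleted(conflict)); // true
        System.out.println(isJobAlreadyCompleted(other));    // false: message text is ignored
    }
}
```

Matching on type and fields keeps the check stable if the wording of the error message changes.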
Co-authored-by: Paolo Di Tommaso <[email protected]> Signed-off-by: Adam Talbot <[email protected]>
Signed-off-by: adamrtalbot <[email protected]>