
feat: Set maximum job duration for Azure Batch #5996


Open
wants to merge 9 commits into master

Conversation

adamrtalbot
Collaborator

Adds a maximum job duration for Azure Batch, after which the Batch service terminates jobs automatically and removes them from quota. This acts as a kill switch to prevent the job quota from being used up if Nextflow dies unexpectedly.

Signed-off-by: adamrtalbot <[email protected]>

adamrtalbot requested a review from a team as a code owner on April 22, 2025 at 09:27

@bentsherman
Member

Why not just use the time directive?

@adamrtalbot
Collaborator Author

adamrtalbot commented Apr 22, 2025

Why not just use the time directive?

That is for a task, not a job. Jobs are queues in Azure Batch, and we keep leaving them hanging around forever until things break.

@pditommaso
Member

Isn't the task already constrained by the container max wall time 👉 5c11a0d? In any case, I agree with Ben: we should avoid introducing yet another setting.

@adamrtalbot
Collaborator Author

adamrtalbot commented Apr 22, 2025

Isn't the task already constrained by the container max wall time 👉 5c11a0d? In any case, I agree with Ben: we should avoid introducing yet another setting.

No, that is the time limit for the individual task, not the job. Remember, a job is a queue in Azure Batch, which means Nextflow does not handle it in any way beyond closing it down when the pipeline finishes. See this issue for details: #3926

This is a problem because Azure Batch allows only a finite number of active jobs (e.g. a maximum of 100). Azure seems to be becoming stricter about this, which is causing problems, so we need to be more frugal. When Nextflow exits abruptly, it leaves jobs in the active state, which uses up quota and requires them to be deleted manually.

By setting a finite time limit, orphaned jobs (queues) are terminated automatically after a period of time, which makes them much less likely to fill up the quota and block new jobs and tasks from being created.
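
To make the task-vs-job distinction concrete, here is a rough sketch of the two levels of time limit (the job-level option name shown is the one settled on further down this thread, and the values are purely illustrative):

// main.nf - task-level limit: the existing `time` directive, applied to each task
process foo {
    time '2h'

    script:
    'echo hello'
}

// nextflow.config - job-level limit (this PR): a constraint on the Azure Batch job
// (the queue) itself, so an orphaned job is terminated by the Batch service even
// if Nextflow never gets the chance to clean it up
azure.batch.jobMaxWallClockTime = '7d'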

@bentsherman
Member

Adam is right, queues in Azure are called "jobs", and apparently they aren't meant to be kept around after the workflow is finished. Still, I would try to find a better name for the setting because Nextflow users will be confused by maxJobTime vs the time directive

@adamrtalbot
Collaborator Author

Adam is right, queues in Azure are called "jobs", and apparently they aren't meant to be kept around after the workflow is finished. Still, I would try to find a better name for the setting because Nextflow users will be confused by maxJobTime vs the time directive

Options:

  • jobInactivityTimeout <- my favourite because it says what it's for not what it is
  • jobMaxWallClockTime <- most accurate
  • jobTimeoutLimit & maxJobDuration <- very similar to process.time

@bentsherman
Member

How about jobIdleTimeout?

@pditommaso
Member

Java doc for task says:

Set the maxWallClockTime property: The maximum elapsed time that the Task may run, measured from the time the Task starts. If the Task does not complete within the time limit, the Batch service terminates it. If this is not specified, there is no time limit on how long the Task may run.

Java doc for job says:

Set the maxWallClockTime property: The maximum elapsed time that the Job may run, measured from the time the Job is created. If the Job does not complete within the time limit, the Batch service terminates it and any Tasks that are still running. In this case, the termination reason will be MaxWallClockTimeExpiry. If this property is not specified, there is no time limit on how long the Job may run.

I understand the rationale

When Nextflow exits abruptly, it leaves them in active state which uses up quota and requires them to be manually deleted.

However, this may break long-running jobs (though 7 days could be quite unusual).

Wouldn't enabling cleanup by default improve the problem?

// delete all jobs
if( config.batch().deleteJobsOnCompletion ) {
    cleanupJobs()
}
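
(For context, that flag corresponds to the existing azure.batch.deleteJobsOnCompletion config option; turning it on would look roughly like this:)

// nextflow.config
azure.batch.deleteJobsOnCompletion = true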

@bentsherman
Member

bentsherman commented Apr 22, 2025

The proposed timeout is for when a pipeline is killed before it can clean up the jobs.

@adamrtalbot
Collaborator Author

However, this may break long-running jobs (though 7 days could be quite unusual).

Correct, this is why I've set a long default value. It's essentially a dead man's switch, and we could also set it to 30, 60, or even 365 days. Setting it to null means the constraint won't be applied.

Wouldn't enabling cleanup by default improve the problem?

We already handle job termination and deletion effectively during normal pipeline runs, but we still encounter orphaned active jobs, typically when Nextflow is killed abruptly and isn't able to shut down properly. This change causes jobs to terminate themselves after a fixed period, which means you will eventually return to zero active jobs.

@pditommaso
Member

Let's use jobMaxWallClockTime, which is consistent with the Batch API, with 30d as the default.

@adamrtalbot
Collaborator Author

Let's use jobMaxWallClockTime, which is consistent with the Batch API, with 30d as the default.

Done and done.
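
For anyone landing here later, the resulting configuration would look something like this (a sketch based on the discussion above; 30d is the agreed default, and null disables the constraint):

// nextflow.config
azure {
    batch {
        // Terminate any Azure Batch job (queue) still alive after this long, even if
        // Nextflow died before it could clean up. Set to null for no job-level limit.
        jobMaxWallClockTime = '30d'
    }
}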

Comment on lines 955 to 964
when:
// This is what happens inside createJob0 for setting constraints
def content = new BatchJobCreateContent('test-job', new BatchPoolInfo(poolId: 'test-pool'))
if (exec.getConfig().batch().jobMaxWallClockTime) {
    final constraints = new BatchJobConstraints()
    final long millis = exec.getConfig().batch().jobMaxWallClockTime.toMillis()
    final java.time.Duration maxWallTime = java.time.Duration.ofMillis(millis)
    constraints.setMaxWallClockTime(maxWallTime)
    content.setConstraints(constraints)
}
Collaborator Author

This doesn't feel like a very good test... but I can't see how to mock the methods inside createJob0. Or is this an indication that I should split the method into createJobConstraints and createJob0 methods?
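
For illustration, a split along those lines might look roughly like the sketch below, so the constraint logic can be unit-tested without mocking createJob0 (the helper name and signature are assumptions; the actual refactor is whatever landed in the follow-up commit):

protected BatchJobConstraints createJobConstraints(Duration maxWallClockTime) {
    final constraints = new BatchJobConstraints()
    if( maxWallClockTime ) {
        // convert Nextflow's Duration to the java.time.Duration expected by the Batch SDK
        constraints.setMaxWallClockTime( java.time.Duration.ofMillis(maxWallClockTime.toMillis()) )
    }
    return constraints
}

// createJob0 would then only need:
// content.setConstraints( createJobConstraints(config.batch().jobMaxWallClockTime) )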

Collaborator Author

Done in a3dbc7c

@christopher-hakkaart
Collaborator

Docs look great.

@@ -945,7 +961,11 @@ class AzBatchService implements Closeable {
        apply(() -> client.updateJob(jobId, jobParameter))
    }
    catch (Exception e) {
        log.warn "Unable to terminate Azure Batch job ${jobId} - Reason: ${e.message ?: e}"
        if (e.message?.contains('Status code 409') && e.message?.contains('JobCompleted')) {
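
Reading between the lines of the hunk above, the guard presumably quiets the case where a job has already completed (for example because maxWallClockTime expired); a hedged sketch of how that catch block might read, with the exact messages and log levels being assumptions rather than the PR's final code:

catch (Exception e) {
    if (e.message?.contains('Status code 409') && e.message?.contains('JobCompleted')) {
        // the job already finished (e.g. the maxWallClockTime constraint fired),
        // so a failed terminate call is expected and not worth a warning
        log.debug "Azure Batch job ${jobId} is already completed - not terminating"
    }
    else {
        log.warn "Unable to terminate Azure Batch job ${jobId} - Reason: ${e.message ?: e}"
    }
}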