OCPNODE-2982: add an OCP enhancement for Kueue #1759
base: master
Conversation
@kannon92: This pull request references OCPNODE-2982 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
### Non-Goals
- UX improvements on the Kueue APIs as part of the operator.
Though UX is out of scope, we would have to discuss how we would like to migrate the Distributed Workloads UI, which mostly consists of Kueue metrics. We can keep that as part of RHOAI, but it probably makes sense to expose those metrics in the OCP dashboard instead for standalone installations of Kueue without RHOAI.
I think you could maybe reuse the same Prometheus Rules that you use for these metrics.
Yes, but having Kueue metrics on the OCP dashboard seems to make sense for users trying to install Kueue in stand-alone mode without RHOAI. We would have to discuss the role of the DW dashboard in the long term.
I agree, but I'm not sure every customer installation would want the same Prometheus rules that RHOAI has.
This bullet point was mostly about how to improve the Kueue APIs or add easier ways to create Kueue resources. We want to call that out of scope for this operator.
Metrics will be exported into the OCP dashboard as part of our deployment. What RHOAI chooses to build for Prometheus rules or dashboards may need some discussion as to whether it is general for all kinds of Kueue installations or tailor-made for RHOAI.
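For illustration, a minimal sketch of how a deployment could surface Kueue's metrics endpoint to the OpenShift monitoring stack with a ServiceMonitor; the namespace, label selector, and port name here are assumptions, not the operator's actual manifests:

```yaml
# Hypothetical ServiceMonitor wiring Kueue's metrics endpoint into
# OpenShift's Prometheus stack. Names and labels are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kueue-controller-manager        # hypothetical name
  namespace: openshift-kueue-operator   # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kueue     # assumed service label
  endpoints:
    - port: metrics                     # assumed metrics port name
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
```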
- Topology-Aware Scheduling
Kueue uses [Resource Flavors](https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/) to describe what resources are available on a cluster, to better support heterogeneous clusters.
Resource flavors are cluster-scoped and set up by the cluster administrator.
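For context, a ResourceFlavor is a small cluster-scoped object; this sketch follows the upstream API shape, with the flavor name, node label, and taint keys purely illustrative:

```yaml
# Illustrative ResourceFlavor mapping quota to GPU nodes.
# The label and taint keys are examples, not prescribed values.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor            # hypothetical flavor name
spec:
  nodeLabels:
    example.com/accelerator: nvidia-a100   # hypothetical node label
  nodeTaints:
    - key: example.com/gpu
      value: "true"
      effect: NoSchedule
  tolerations:
    - key: example.com/gpu
      operator: Exists
      effect: NoSchedule
```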
How do we expect resource flavours to work in HCP clusters? Will they be within the workload cluster API rather than the management cluster?
IIUC, autoscaling exists in the management plane, and so, does the autoscaler look at provisioning requests in the HCP workload cluster API, or management cluster API?
Kueue would create provisioning requests in the worker cluster. ProvisioningRequests are scoped to the namespace of the workload, so I would expect these to be coming from the customer environment.
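For reference, a sketch of what such a ProvisioningRequest might look like in the workload's namespace, following the upstream autoscaler API; the object and template names are hypothetical, and `check-capacity.autoscaling.x-k8s.io` is one of the upstream provisioning classes:

```yaml
# Sketch of a ProvisioningRequest as Kueue might create it in the
# workload's namespace. Object and template names are hypothetical.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: training-job-prov      # hypothetical name
  namespace: team-a            # the workload's namespace
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  podSets:
    - count: 4
      podTemplateRef:
        name: training-job-pod-template   # hypothetical PodTemplate
```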
I have a section on HyperShift below where I mention that this would probably live on an infra node on the customer clusters in HCP cases.
@elmiko Do you happen to know if the ProvisioningRequest is expected to be in the workload cluster from the perspective of the Cluster Autoscaler, where it distinguishes different clients for workload and management clusters?
Hmm, it would appear that the cluster autoscaler is going to use whichever kubeconfig is specified in the `--kubeconfig` flag (or a similar mechanism, e.g. an env variable or service account), so we might need to rethink how this works if we want to support multiple topologies upstream.
For example, users have some discretion with the clusterapi provider as to where they put the clusterapi resources; this would definitely impact how the autoscaler is deployed with respect to ProvisioningRequests.
> How do we expect resource flavours to work in HCP clusters? Will they be within the workload cluster API rather than the management cluster?
> IIUC, autoscaling exists in the management plane, and so, does the autoscaler look at provisioning requests in the HCP workload cluster API, or management cluster API?
> Do you happen to know if the ProvisioningRequest is expected to be in the workload cluster from the perspective of the Cluster Autoscaler, where it distinguishes different clients for workload and management clusters?
I see the Kueue + ProvisioningRequest setup as a cluster admin task. From Kueue's POV it's just an OpenShift cluster. I don't foresee anything specific about HCP beyond the aforementioned webhook performance considerations.
I looked at the cluster-autoscaler code a little more to confirm that the provisioning request client uses the kubeconfig from the flag argument, and it appears that it does.
This shouldn't be a problem on OpenShift as we set up these kubeconfigs for the user, but it is something that will require some documentation upstream.
Ok, so which kubeconfig is that? The same one it uses to fetch pods and nodes, or the one it uses to fetch CAPI resources?
This is a core cluster autoscaler feature, so it must be able to use the guest cluster config, just like Pods and Nodes. Whether we also want to enable the option to watch this on the management side in conjunction with the CAPI provider I see as an orthogonal topic.
I wasn't planning for them to be viewed from the management side at present; I just wanted to confirm that the CAPI autoscaler expectation is that these are guest cluster side, as that is what is needed for this use case.
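To make the thread concrete, here is a hedged sketch of the client split being described; the `--kubeconfig`/`--cloud-config` convention is from the upstream clusterapi provider docs, while the paths and image tag are assumptions:

```yaml
# Hypothetical fragment of a cluster-autoscaler Deployment.
# --kubeconfig points the core autoscaler (Pods, Nodes, and
# ProvisioningRequests) at the guest/workload cluster, while
# --cloud-config points the clusterapi provider at the
# management cluster where the CAPI resources live.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # example tag
    command:
      - /cluster-autoscaler
    args:
      - --cloud-provider=clusterapi
      - --kubeconfig=/etc/kubernetes/guest-kubeconfig    # guest cluster API
      - --cloud-config=/etc/kubernetes/mgmt-kubeconfig   # management cluster API
      - --enable-provisioning-requests=true
```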
Due to this, it is requested that the Kueue Operator still allow the changing of feature gates to provide more advanced functionality.
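Kueue consumes feature gates via the standard Kubernetes flag convention, so what the operator would be toggling might look roughly like this; the deployment fragment, image tag, and gate choice are illustrative assumptions:

```yaml
# Hypothetical fragment of the kueue-controller-manager Deployment
# showing how feature gates are passed; the gate shown is an example.
containers:
  - name: manager
    image: registry.k8s.io/kueue/kueue:v0.10.0   # example tag
    args:
      - --feature-gates=TopologyAwareScheduling=true
```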
Is there a way to specify only a subset of FGs that can strictly make the cluster non-upgradable, and allow cluster upgrades for others? Is this a pattern that is supported by OCP, and are there any other layered operators that do this?
| ? | GA | 4.19 | ? | Y |
Kueue releases roughly six times a year.
They will have roughly two releases per Kubernetes version, so we can take the latest version that is built with the Kubernetes version that OCP comes with.
Following up from the previous comment: we probably should also chart out a release schedule for the Kueue Operator and map Kueue <-> Kueue Operator versions?
We have three phases for this, and right now I am working on getting a tech preview out that would focus on a single Kueue version per operator version. For a tech preview of an operator we won't necessarily handle upgrades (there is no version earlier than this).
Phase 2 would be RHOAI integration, where we figure out versioning and should have a better idea of how to use Konflux to release Kueue + Kueue Operator. We still need to create file-based catalogs and figure out what kind of stream we want our operator to be.
Right now, 0.1 of the operator would use Kueue 0.10 or 0.11, depending on when we get everything signed off for Kueue. We are still waiting on production tasks to officially release Kueue.
We are still figuring out the rollout of this.
Kueue achieves this via a `WaitForPodsReady` field in its configuration to wait for all the pods to have quota.
Kueue will only admit the workload if all pods are able to be admitted at once.
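For reference, the upstream Configuration stanza being discussed looks roughly like this (field names per the Kueue docs linked later in the thread; values are illustrative):

```yaml
# Sketch of the waitForPodsReady stanza in Kueue's Configuration.
# Timeout and backoff values are illustrative.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m           # evict and requeue if pods are not ready in time
  blockAdmission: true   # admit one workload at a time while waiting
  requeuingStrategy:
    timestamp: Eviction
    backoffLimitCount: 5 # deactivate the workload after 5 requeues
```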
`waitForPodsReady` makes it so that Kueue only considers a workload successfully admitted if all the pods for that workload get to the ready state. Kueue Workloads are enough to ensure gang admission. I find the Kueue documentation here confusing.
Given you have experience here, could you explain, as an end user, in what situation would I/would I not want this? Any concrete use cases you can talk to?
In my experience, users usually want workloads to be admitted if pods are ready.
My mental model of Kube:
- schedule
- init containers succeed
- main containers pull
- probes pass
- pod is ready for action
My experience is that users assume that if a pod is scheduled, it is ready for action. Scheduling only means that a pod has been assigned a node; it then starts running the pod lifecycle. So in most cases a pod is scheduled, but it could take a long time for the pod to actually be ready.
So most workloads I've seen have started to treat "WaitForReady" as a way to enforce that pods are actually able to run.
We did this in JobSet to make sure that leader jobs are ready before going to the worker jobs.
This doc seems to explain it well: https://kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready/
I think @kannon92 can speak more to the concrete use cases. I think the main benefit is the backoff/retry semantics introduced for groups of pods. It can capture cases where pods may never get to the ready state due to some service being offline. In that case, Kueue can deactivate the workload after a certain number of retries so that the capacity being used for the admitted workload can be used for workloads that will progress.
@KPostOffice Are there any plans to use this right now? Or could we just remove it from the operator?
Default this to disabled and leave it as a future addition if there is a need for it?
Right, so IIUC, WaitForPodsReady tells Kueue that it must wait for all of the pods it just created to be ready before it considers the capacity to be used? And if it decides that some of the pods never become ready, it will kill the workloads after ????, and that means it can then schedule something else instead?
> Kueue will only admit the workload if all pods are able to be admitted at once.
What does it mean for Kueue to admit a workload, because this doesn't seem to tally with what I've just written above if I use my mental model of admission. From this sentence I expect that Kueue is blocking the pod scheduling until it can see that there's enough capacity for all of the pods in the group to be scheduled simultaneously, which I guess prevents some being scheduled earlier than others and therefore minimises the time they must wait?
It almost sounds like this option is more, wait for capacity, than wait for pods ready? Or am I confusing two different concepts here?
Reviewing the docs from upstream, it really seems like this is a batch scheduling problem; it's like they want an option of `jobScheduling: Parallel | Sequential`, where `Parallel` means that it will just schedule all pods from all jobs at once, and `Sequential` is where it will wait for the pods of a particular job to be ready before moving to the next.
The option as it's described currently really focuses on the implementation and not the visible feature for the end user, which is why I think it's hard to understand what it's achieving.
Correct. Essentially the problem is that Kubernetes has not yet added gang scheduling as a core feature. There is a scheduling plugin called Coscheduling that people have used, but plugins are a bit more difficult to support and people don't usually mess with the kube-scheduler.
So Kueue defined its own method to at least have some basic support for gang scheduling by waiting for ready pods.
> Sequential is where it will wait for the pods ready from a particular job, before moving to the next
That sounds more like you are thinking of a workflow (i.e. Argo Workflows, Tekton, or Kubeflow Pipelines). Kueue is aiming to focus on a single workload, assuming that its pods should all run together. Kueue has some issues around proper workflow support, but they have not made much progress in that avenue.
Can you define what you mean by workload in this context? I think the term is fairly overloaded and I'm not sure I see the difference between what the two of us are saying
If `waitForPodsReady` is enabled and `Parallel`, then Kueue will admit a Workload[^1], but it also watches the resulting pods and evicts and requeues the Workload if the timeout is exceeded. If `waitForPodsReady` and `Sequential` are enabled, then Kueue will wait until the pods either get to a ready state or surpass their number of retries before moving on to the next Workload in that ClusterQueue.
[^1]: All references to Workload are references to Kueue's Workload CR.
7b7f54e to 7720791 (compare)
could use a comb through after new GA dates and accelerated timelines, but generally looks good 👍
my preference would be to squash the commits, but if we want to leave them as-is that's fine too. LGTM (I can tag if we want to leave them)
Rebased and please tag.
/lgtm
@kannon92: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.