OCPNODE-2982: add an OCP enhancement for Kueue #1759
base: master
Conversation
@kannon92: This pull request references OCPNODE-2982 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
### Non-Goals
- UX improvements on the Kueue APIs as part of the operator.
Though UX is out of scope, we would have to discuss how we would like to migrate the Distributed Workloads UI, which mostly consists of Kueue metrics. We can keep that as part of RHOAI, but it probably makes sense to expose those metrics in the OCP dashboard instead for standalone installations of Kueue without RHOAI.
I think you could maybe reuse the same Prometheus Rules that you use for these metrics.
Yes, but having Kueue metrics on the OCP dashboard seems to make sense for users trying to install Kueue in stand-alone mode without RHOAI. We would have to discuss the role of the DW dashboard in the long term.
I agree, but I'm not sure every customer installation would want the same Prometheus rules that RHOAI has.
This bullet point was mostly about how to improve the Kueue APIs or add easier ways to create Kueue resources. We want to call that out of scope for this operator.
Metrics will be exported into the OCP dashboard as part of our deployment. What RHOAI chooses to build for Prometheus rules or dashboards may need some discussion as to whether it is general for all kinds of Kueue installations or tailor-made for RHOAI.
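For illustration, a minimal sketch of how a deployment could surface Kueue's metrics endpoint to the OpenShift monitoring stack with a ServiceMonitor; the namespace, label selector, and port name here are assumptions, not the operator's actual manifests:

```yaml
# Hypothetical ServiceMonitor wiring Kueue's metrics endpoint into
# OpenShift's Prometheus stack. Names and labels are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kueue-controller-manager        # hypothetical name
  namespace: openshift-kueue-operator   # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kueue     # assumed service label
  endpoints:
    - port: metrics                     # assumed metrics port name
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
```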
- Topology-Aware Scheduling
Kueue uses [Resource Flavors](https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/) to describe what resources are available on a cluster, to better support heterogeneous clusters.
Resource flavors are cluster-scoped and set up by the cluster administrator.
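For context, a ResourceFlavor is a small cluster-scoped object; this sketch follows the upstream API shape, with the flavor name, node label, and taint keys purely illustrative:

```yaml
# Illustrative ResourceFlavor mapping quota to GPU nodes.
# The label and taint keys are examples, not prescribed values.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor            # hypothetical flavor name
spec:
  nodeLabels:
    example.com/accelerator: nvidia-a100   # hypothetical node label
  nodeTaints:
    - key: example.com/gpu
      value: "true"
      effect: NoSchedule
  tolerations:
    - key: example.com/gpu
      operator: Exists
      effect: NoSchedule
```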
How do we expect resource flavours to work in HCP clusters? Will they be within the workload cluster API rather than the management cluster?
IIUC, autoscaling exists in the management plane, and so, does the autoscaler look at provisioning requests in the HCP workload cluster API, or management cluster API?
Kueue would create provisioning requests in the worker cluster. ProvisioningRequests are scoped to the namespace of the workload, so I would expect these to be coming from the customer environment.
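For reference, a sketch of what such a ProvisioningRequest might look like in the workload's namespace, following the upstream autoscaler API; the object and template names are hypothetical, and `check-capacity.autoscaling.x-k8s.io` is one of the upstream provisioning classes:

```yaml
# Sketch of a ProvisioningRequest as Kueue might create it in the
# workload's namespace. Object and template names are hypothetical.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: training-job-prov      # hypothetical name
  namespace: team-a            # the workload's namespace
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  podSets:
    - count: 4
      podTemplateRef:
        name: training-job-pod-template   # hypothetical PodTemplate
```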
I have a section on HyperShift below where I mention that this would probably live on an infra node on the customer clusters in HCP cases.
@elmiko Do you happen to know if the ProvisioningRequest is expected to be in the workload cluster from the perspective of the Cluster Autoscaler, where it distinguishes different clients for workload and management clusters?
Hmm, it would appear that the cluster autoscaler is going to use whichever kubeconfig is specified in the `--kubeconfig` flag (or a similar mechanism, e.g. an env variable or service account), so we might need to rethink how this works if we want to support multiple topologies upstream.
For example, users have some discretion with the clusterapi provider as to where they put the clusterapi resources; this would definitely impact how the autoscaler is deployed with respect to ProvisioningRequests.
> How do we expect resource flavours to work in HCP clusters? Will they be within the workload cluster API rather than the management cluster?
> IIUC, autoscaling exists in the management plane, and so, does the autoscaler look at provisioning requests in the HCP workload cluster API, or management cluster API?
> Do you happen to know if the ProvisioningRequest is expected to be in the workload cluster from the perspective of the Cluster Autoscaler, where it distinguishes different clients for workload and management clusters?
I see the Kueue + ProvisioningRequest setup as a cluster admin task. From Kueue's POV it's just an OpenShift cluster. I don't foresee anything specific about HCP beyond the aforementioned webhook performance considerations.
I looked at the cluster-autoscaler code a little more to confirm that the provisioning request client uses the kubeconfig from the flag argument, and it appears that it does.
This shouldn't be a problem on OpenShift as we set up these kubeconfigs for the user, but it is something that will require some documentation upstream.
Ok, so which kubeconfig is that? The same one it uses to fetch pods and nodes, or the one it uses to fetch CAPI resources?
This is a core cluster autoscaler feature, so it must be able to use the guest cluster config, just like Pods and Nodes. Whether we also want to enable the option to watch this on the management side in conjunction with the CAPI provider I see as an orthogonal topic.
I wasn't planning for them to be viewed from the management side at present; I just wanted to confirm that the CAPI autoscaler expectation is that these are guest cluster side, as that is what is needed for this use case.
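To make the thread concrete, here is a hedged sketch of the client split being described; the `--kubeconfig`/`--cloud-config` convention is from the upstream clusterapi provider docs, while the paths and image tag are assumptions:

```yaml
# Hypothetical fragment of a cluster-autoscaler Deployment.
# --kubeconfig points the core autoscaler (Pods, Nodes, and
# ProvisioningRequests) at the guest/workload cluster, while
# --cloud-config points the clusterapi provider at the
# management cluster where the CAPI resources live.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # example tag
    command:
      - /cluster-autoscaler
    args:
      - --cloud-provider=clusterapi
      - --kubeconfig=/etc/kubernetes/guest-kubeconfig    # guest cluster API
      - --cloud-config=/etc/kubernetes/mgmt-kubeconfig   # management cluster API
      - --enable-provisioning-requests=true
```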
Due to this, it is requested that the Kueue Operator still allow the changing of feature gates to provide more advanced functionality.
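Kueue consumes feature gates via the standard Kubernetes flag convention, so what the operator would be toggling might look roughly like this; the deployment fragment, image tag, and gate choice are illustrative assumptions:

```yaml
# Hypothetical fragment of the kueue-controller-manager Deployment
# showing how feature gates are passed; the gate shown is an example.
containers:
  - name: manager
    image: registry.k8s.io/kueue/kueue:v0.10.0   # example tag
    args:
      - --feature-gates=TopologyAwareScheduling=true
```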
Is there a way to specify only a subset of FGs that can strictly make the cluster non-upgradable, and allow cluster upgrades for others? Is this a pattern that is supported by OCP, and are there any other layered operators that do this?
| ? | GA | 4.19 | ? | Y |
Kueue releases roughly six times a year.
They will have roughly two releases per Kubernetes version, so we can take the latest version that is built with the Kubernetes version that OCP comes with.
Following up from the previous comment: we probably should also chart out a release schedule for the Kueue Operator and map Kueue <-> Kueue Operator versions?
We have three phases for this, and right now I am working on getting a tech preview out that would focus on a single Kueue version per operator version. For a tech preview of an operator we won't necessarily handle upgrades (there is no version earlier than this).
Phase 2 would be RHOAI integration, where we figure out versioning and should have a better idea of how to use Konflux to release Kueue + Kueue Operator. We still need to create file-based catalogs and figure out what kind of stream we want our operator to be.
Right now, 0.1 of the operator would use Kueue 0.10 or 0.11, depending on when we get everything signed off for Kueue. We are still waiting on production tasks to officially release Kueue.
We are still figuring out the rollout of this.
Kueue achieves this via a `WaitForPodsReady` field in its configuration to wait for all the pods to have quota.
Kueue will only admit the workload if all pods are able to be admitted at once.
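For reference, the upstream Configuration stanza being discussed looks roughly like this (field names per the Kueue docs linked later in the thread; values are illustrative):

```yaml
# Sketch of the waitForPodsReady stanza in Kueue's Configuration.
# Timeout and backoff values are illustrative.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m           # evict and requeue if pods are not ready in time
  blockAdmission: true   # admit one workload at a time while waiting
  requeuingStrategy:
    timestamp: Eviction
    backoffLimitCount: 5 # deactivate the workload after 5 requeues
```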
`waitForPodsReady` makes it so that Kueue only considers a workload successfully admitted if all the pods for that workload get to the ready state. Kueue Workloads are enough to ensure gang admission. I find the Kueue documentation here confusing.
Given you have experience here, could you explain, as an end user, in what situation would I/would I not want this? Any concrete use cases you can talk to?
In my experience, users usually want workloads to be admitted if pods are ready.
My mental model of Kube:
- schedule
- init containers succeed
- main containers pull
- probes pass
- pod is ready for action
My experience is that users assume that if a pod is scheduled, it is ready for action. Scheduling only means that a pod has been assigned a node; it then starts running the pod lifecycle. So in most cases a pod is scheduled, but it could take a long time for the pod to actually be ready.
So most workloads I've seen have started to treat "WaitForReady" as a way to enforce that pods are actually able to run.
We did this in JobSet to make sure that leader jobs are ready before going to the worker jobs.
This doc seems to explain it well: https://kueue.sigs.k8s.io/docs/tasks/manage/setup_wait_for_pods_ready/
I think @kannon92 can speak more to the concrete use cases. I think the main benefit is the backoff/retry semantics introduced for groups of pods. It can capture cases where pods may never get to the ready state due to some service being offline. In that case, Kueue can deactivate the workload after a certain number of retries so that the capacity being used for the admitted workload can be used for workloads that will progress.
@KPostOffice Are there any plans to use this right now? Or could we just remove it from the operator?
Default this to disabled and leave it as a future addition if there is a need for it?
Right, so IIUC, WaitForPodsReady tells Kueue that it must wait for all of the pods it just created to be ready before it considers the capacity to be used? And if it decides that some of the pods never become ready, it will kill the workloads after ????, and that means it can then schedule something else instead?
> Kueue will only admit the workload if all pods are able to be admitted at once.
What does it mean for Kueue to admit a workload, because this doesn't seem to tally with what I've just written above if I use my mental model of admission. From this sentence I expect that Kueue is blocking the pod scheduling until it can see that there's enough capacity for all of the pods in the group to be scheduled simultaneously, which I guess prevents some being scheduled earlier than others and therefore minimises the time they must wait?
It almost sounds like this option is more, wait for capacity, than wait for pods ready? Or am I confusing two different concepts here?
Reviewing the docs from upstream, it really seems like this is a batch scheduling problem; it's like they want an option of `jobScheduling: Parallel | Sequential`, where `Parallel` means that it will just schedule all pods from all jobs at once, and `Sequential` is where it will wait for the pods of a particular job to be ready before moving to the next.
The option as it's described currently really focuses on the implementation and not the visible feature for the end user, which is why I think it's hard to understand what it's achieving.
Correct. Essentially the problem is that Kubernetes has not yet added gang scheduling as a core feature. There is a scheduling plugin called Coscheduling that people have used, but plugins are a bit more difficult to support and people don't usually mess with the kube-scheduler.
So Kueue defined its own method to at least have some basic support for gang scheduling by waiting for ready pods.
> Sequential is where it will wait for the pods ready from a particular job, before moving to the next
That sounds more like you are thinking of a workflow (i.e. Argo Workflows, Tekton, or Kubeflow Pipelines). Kueue is aiming to focus on a single workload, assuming that its pods should all run together. Kueue has some issues around proper workflow support, but they have not made much progress in that avenue.
Can you define what you mean by workload in this context? I think the term is fairly overloaded and I'm not sure I see the difference between what the two of us are saying
If `waitForPodsReady` is enabled and `Parallel`, then Kueue will admit a Workload[^1], but it also watches the resulting pods and evicts and requeues the Workload if the timeout is exceeded. If `waitForPodsReady` and `Sequential` are enabled, then Kueue will wait until the pods either get to a ready state or surpass their number of retries before moving on to the next Workload in that ClusterQueue.
[^1]: All references to Workload are references to Kueue's Workload CR.
7b7f54e to 7720791 (compare)
could use a comb through after new GA dates and accelerated timelines, but generally looks good 👍
my preference would be to squash the commits, but if we want to leave them as-is that's fine too. LGTM (I can tag if we want to leave them)
Rebased and please tag.
/lgtm
@kannon92: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.