GEP 3388 Retry Budget API Implementation #3607

ericdbishop · 2025-02-10T18:04:43Z

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

Adds a new BackendTrafficPolicy with the ability to configure budgeted retries.

k8s-ci-robot · 2025-02-10T18:04:53Z

Hi @ericdbishop. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ericdbishop · 2025-02-10T18:14:31Z

apis/v1alpha2/backendtrafficpolicy_types.go

+	// Retry defines the configuration for when to retry a request to a target
+	// backend.
+	//
+	// Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
+	// TCP failure) if a retry stanza is configured.
+	//
+	// Support: Extended
+	//
+	// +optional
+	// <gateway:experimental>
+	Retry *CommonRetryPolicy `json:"retry,omitempty"`


Planning to correct this description (previously discussed here), but I'm also considering changing Retry to RetryBudget so we can better capture the distinction between a constrained budget on retries, versus the static count retries that are configured within HTTPRoute. I think CommonRetryPolicy is okay but would also be curious if we think RetryBudgetPolicy would be more self-explanatory.

CommonRetryPolicy was originally an abstraction from the initial "two possible approaches" proposal just to minimize duplication - agreed that the Common* prefix is probably no longer appropriate, but not quite sure what the correct name should be here:

I feel like *Policy implies a top-level resource like BackendTrafficPolicy that is actually an impl of the policy attachment pattern, not a sub-resource.

We could just collapse the fields into BackendTrafficPolicy inline, but I like the way SessionPersistence is broken out currently - it feels like it will be more composable if we add additional functionality to BackendTrafficPolicy

I'm not quite sure yet if we do indeed want to narrow the scope down to RetryBudget or choose a name that could allow additional fields within this stanza.

I agree with those points. I think it makes sense to leave the retry budget configuration broken out from BackendTrafficPolicy instead of inline. I could see replacing the name CommonRetryPolicy with something like RetryConstraint if we wanted to allow for the possibility, down the line, of constraining retries based on something other than a budget.

I took some liberty here in renaming Retry and the CommonRetryPolicy struct to RetryConstraint, but open to better suggestions.
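To make the naming discussion concrete, here is a minimal sketch of the "broken out" layout: a RetryConstraint sub-struct referenced from the policy spec rather than fields collapsed inline. The field set follows the names used in this thread (BudgetPercent, BudgetInterval, MinRetryRate, Count, Interval). The `budgetInterval`, `minRetryRate`, and `retryConstraint` JSON tags, the use of plain strings for durations, and the `ptr` and `specJSON` helpers are assumptions made for illustration and are not taken from the merged API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RequestRate expresses "Count requests per Interval".
// Interval is a plain string here; the PR uses a Duration type instead.
type RequestRate struct {
	Count    *int    `json:"count,omitempty"`
	Interval *string `json:"interval,omitempty"`
}

// RetryConstraint groups the retry budget settings in their own struct so the
// spec stays composable if other constraints are added later.
type RetryConstraint struct {
	BudgetPercent  *int         `json:"budgetPercent,omitempty"`
	BudgetInterval *string      `json:"budgetInterval,omitempty"` // assumed tag
	MinRetryRate   *RequestRate `json:"minRetryRate,omitempty"`   // assumed tag
}

// BackendTrafficPolicySpec is a trimmed, hypothetical spec with only the
// field relevant to this discussion.
type BackendTrafficPolicySpec struct {
	RetryConstraint *RetryConstraint `json:"retryConstraint,omitempty"` // assumed tag
}

// ptr is a hypothetical helper that returns a pointer to any value.
func ptr[T any](v T) *T { return &v }

// specJSON builds an example spec and returns its JSON encoding.
func specJSON() string {
	spec := BackendTrafficPolicySpec{
		RetryConstraint: &RetryConstraint{
			BudgetPercent:  ptr(20),
			BudgetInterval: ptr("10s"),
			MinRetryRate:   &RequestRate{Count: ptr(10), Interval: ptr("1s")},
		},
	}
	out, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Println(specJSON())
}
```

Keeping the constraint in its own struct means a user-facing stanza nests under a single key, which is the same shape SessionPersistence has today.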

ericdbishop · 2025-02-10T18:21:32Z

apis/v1alpha2/backendtrafficpolicy_types.go

+	// Support: Extended
+	//
+	// +optional
+	BudgetPercent *int `json:"budgetPercent,omitempty"`


Previous comment on validation. The maximum valid argument for BudgetPercent should be 100 as that is effectively the same as having no retry budget at all, but should the minimum value we allow be 0? Should users be allowed to block all retries in that way?

I set the minimum as 0 for the time being.
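As a rough illustration of the bounds discussed above, the following sketch checks BudgetPercent against an inclusive 0 to 100 range. In the API types this would more likely be expressed with `+kubebuilder:validation:Minimum` and `+kubebuilder:validation:Maximum` markers; the `validateBudgetPercent` and `intPtr` functions are hypothetical and exist only to show the intended behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errBudgetPercentRange is returned when the value falls outside 0..100.
var errBudgetPercentRange = errors.New("budgetPercent must be between 0 and 100")

// intPtr is a hypothetical helper for building optional int fields.
func intPtr(v int) *int { return &v }

// validateBudgetPercent accepts nil (the field is optional), 0 (which would
// block all budgeted retries), and 100 (effectively no budget at all).
func validateBudgetPercent(p *int) error {
	if p == nil {
		return nil
	}
	if *p < 0 || *p > 100 {
		return errBudgetPercentRange
	}
	return nil
}

func main() {
	for _, v := range []int{-1, 0, 20, 100, 101} {
		fmt.Printf("budgetPercent=%d valid=%t\n", v, validateBudgetPercent(intPtr(v)) == nil)
	}
}
```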

ericdbishop · 2025-02-10T18:32:21Z

apis/v1alpha2/backendtrafficpolicy_types.go

+// CommonRetryPolicy defines the configuration for when to retry a request.
+type CommonRetryPolicy struct {


What's the minimum viable set of fields here for an implementation to say that they support retry budgets?

Link to comment.

Given confirmation that Envoy's retry_budget spec could be modified to include a parameter that matches BudgetInterval, I think it would be safe to require that implementations should include all fields to be considered supporting retry budgets.

But that being said, I could see how BudgetInterval could be excluded to match Envoy's existing retry budget behavior which @mikemorris detailed here, making only MinRetryRate and BudgetPercent truly necessary.

I could see how BudgetInterval could be excluded to match Envoy's existing retry budget behavior

In the context of @tonya11en's comment at #3573 (comment) and envoyproxy/envoy#30205 (comment), even though this could be possible to enable, I'm unsure if it would actually be desirable even for Envoy-based implementations of Gateway API?

This additionally has some bearing on the semantic meaning of budgetInterval: 0 (weird, effectively a rate with a division by zero unless we use it as a shorthand for Envoy's current behavior) vs if we want to prescribe a default interval when omitting the field entirely (which could make UX more concise).

I do think having a default budgetInterval would make sense. Going off of the default I see for Linkerd's ttl parameter, maybe 10s is reasonable? I agree that the meaning of budgetInterval: 0 is strange. Also, since it is a duration, it would require a unit of time. After seeing that additional context, it does seem more desirable to require implementations to include all three parameters.
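One way to read how the three parameters interact is sketched below. It assumes that, within one budget interval, retries are capped at BudgetPercent of the observed requests, and that MinRetryRate acts as a floor for low-volume traffic. This is an interpretation for illustration only: the `retryBudget` type and its methods are hypothetical, and real implementations (Envoy, Linkerd) may count requests and retries differently.

```go
package main

import "fmt"

// retryBudget holds already-resolved values for a single budget interval.
type retryBudget struct {
	percent    int // BudgetPercent, expected to be in 0..100
	minRetries int // MinRetryRate count, normalized to one budget interval
}

// allowed returns the number of retries permitted in an interval that saw the
// given number of requests: the percentage share, but never less than the floor.
func (b retryBudget) allowed(requests int) int {
	fromPercent := requests * b.percent / 100
	if fromPercent < b.minRetries {
		return b.minRetries
	}
	return fromPercent
}

// canRetry reports whether one more retry fits in the budget, given the
// requests and retries already observed in the current interval.
func (b retryBudget) canRetry(requests, retries int) bool {
	return retries < b.allowed(requests)
}

func main() {
	b := retryBudget{percent: 20, minRetries: 10}
	fmt.Println(b.allowed(1000)) // high volume: the percentage dominates
	fmt.Println(b.allowed(10))   // low volume: the floor dominates
	fmt.Println(b.canRetry(1000, 200))
}
```

With these assumptions, a budgetInterval of zero has no sensible meaning, since there would be no window over which to count requests, which matches the concern raised above.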

Circling back to this, I'm not quite sure if we would need to remove the default here (to allow the field to be omitted without implicitly setting the value to 10s), or if kubebuilder defaults aren't even applicable for pointer types (and thus the annotation should be removed because it's irrelevant)?
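If the default were resolved by the implementation rather than by a schema annotation, it could look like the sketch below. This does not answer the kubebuilder question above; it only shows the alternative of treating an omitted (nil) interval as 10s at read time. The `effectiveInterval` function and `defaultBudgetInterval` constant are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// defaultBudgetInterval is the 10s value floated in this thread.
const defaultBudgetInterval = 10 * time.Second

// effectiveInterval returns the configured interval, or the default when the
// optional field was omitted and the pointer is therefore nil.
func effectiveInterval(interval *time.Duration) time.Duration {
	if interval == nil {
		return defaultBudgetInterval
	}
	return *interval
}

func main() {
	custom := 30 * time.Second
	fmt.Println(effectiveInterval(nil))
	fmt.Println(effectiveInterval(&custom))
}
```

Resolving the default this way keeps the stored object unchanged when the field is omitted, at the cost of every implementation having to agree on the same fallback value.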

apisx/v1alpha2/backendtrafficpolicy.go


ericdbishop · 2025-02-27T21:45:31Z

/retest

robscott

Thanks @ericdbishop! Largely LGTM, just a couple comments about validation

robscott · 2025-02-28T20:26:44Z

apisx/v1alpha2/shared_types.go

+	//
+	// Support: Extended
+	// +kubebuilder:validation:Minimum=0
+	Count *int `json:"count,omitempty"`


We need to avoid unbounded values here. I'd recommend some kind of max, even if that max is very high.

The context in which this is currently used is for MinRetryRate as "an override for very low volume traffic" as @htuch described in https://github.com/kubernetes-sigs/gateway-api/pull/3607/files#r1964340430 but this struct type is (intentionally) sufficiently generic that it could be re-used in the future for #326. As such, I would defer to @htuch on what a "safe" maximum might entail for this at scale with a potential denominator as short as 1ms (I'm assuming there may be integer type maximums to keep in mind for implementations too)

Given this is going to have a variable denominator, e.g. hour, I think you're just going to have to go with some type based limit or something that would make sense over a long period.
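To illustrate why the denominator matters when picking a maximum for Count, the sketch below normalizes a count and interval to a per-second rate. The `maxCount` value is a made-up placeholder, not a value proposed in this PR, and `validateCount` and `perSecond` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// maxCount is a hypothetical upper bound chosen only for this example.
const maxCount = 1_000_000

// validateCount enforces a bounded, non-negative count.
func validateCount(count int) error {
	if count < 0 || count > maxCount {
		return fmt.Errorf("count must be between 0 and %d", maxCount)
	}
	return nil
}

// perSecond converts "count per intervalMillis" into requests per second.
// The interval is given in milliseconds because 1ms is the smallest unit
// expressible in the Duration format discussed in this thread.
func perSecond(count int, intervalMillis int64) (float64, error) {
	if intervalMillis <= 0 {
		return 0, errors.New("interval must be positive")
	}
	return float64(count) * 1000 / float64(intervalMillis), nil
}

func main() {
	// The same count implies very different rates depending on the interval.
	for _, ms := range []int64{1, 1000, 3_600_000} {
		rate, _ := perSecond(maxCount, ms)
		fmt.Printf("count=%d interval=%dms rate=%g/s\n", maxCount, ms, rate)
	}
}
```

Because the effective rate scales with the inverse of the interval, a single Count maximum cannot be "safe" for every denominator, which is the point made above about falling back to a type-based limit.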

robscott · 2025-02-28T20:27:01Z

apisx/v1alpha2/shared_types.go

+	// time during which the given count of requests occur.
+	//
+	// Support: Extended
+	Interval *Duration `json:"interval,omitempty"`


This also feels like something that should have a min and max value. I'm assuming 1s is acceptable for a min, unclear what a good max is. Any ideas @htuch or @mikemorris?

Can we actually enforce min/max constraints on GEP-2257 Duration types?

If so, I think 1ms is the minimum expressible through that format (even if logically a per-second rate may be more common, expressing rates with a denominator in milliseconds may be helpful for higher-throughput use cases to work around a count max constraint), assuming we want to exclude the "divide by zero" shorthand for Envoy's current behavior.

The max expressible appears to be 99999h. In practice I'd expect even a rate per hour may be unlikely to be configured, instead preferring to enforce a more normal distribution over longer time spans, but maybe 1h, 12h or 24h is a reasonable max?

Good question, I think we'd have to define min+max with CEL using > and < comparisons. A rough example is here: https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/httproute_types.go#L320.

I agree that any of 1h, 12h, and 24h would be reasonable here, but I'd prefer to start with the lower value 1h since it would be much harder to tighten this validation retroactively than to loosen it.

The maximum GEP-2257 Duration is actually 99999h99999m99999s99999ms, which I would agree is quite a bit larger than would probably make sense here. I'm honestly not quite sure what kind of math we can do in CEL here, but a really simple check might be to disallow anything with more than two digits before h, or more than four before m or s?

Fortunately CEL supports a duration type that is compatible with our duration, so we should be able to have pretty reasonable comparisons here.

https://github.com/kubernetes-sigs/gateway-api/pull/3607/files#diff-1cb23e8f5b439a1f7ccf83b31ed45000f8ca373d9f27fd16ccfba3d9d0a0a8f3R37
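The min and max comparison discussed above can be sketched outside of CEL as follows. The regular expression approximates the GEP-2257 format as described in this thread (up to four integer components of at most five digits, with units h, m, s, or ms), and the 1ms and 1h bounds are the candidate values from the discussion, not necessarily what was merged. `validateInterval` and `validateRateInterval` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// gepDuration approximates the GEP-2257 duration format (an assumption).
var gepDuration = regexp.MustCompile(`^([0-9]{1,5}(h|m|s|ms)){1,4}$`)

// Candidate bounds taken from the review discussion.
const (
	minInterval = time.Millisecond
	maxInterval = time.Hour
)

// validateInterval checks the format first, then compares the parsed value
// against inclusive lower and upper bounds.
func validateInterval(s string, lo, hi time.Duration) error {
	if !gepDuration.MatchString(s) {
		return fmt.Errorf("%q is not a valid duration", s)
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	if d < lo || d > hi {
		return fmt.Errorf("%q must be between %s and %s", s, lo, hi)
	}
	return nil
}

// validateRateInterval applies the candidate bounds to an interval string.
func validateRateInterval(s string) error {
	return validateInterval(s, minInterval, maxInterval)
}

func main() {
	for _, s := range []string{"1ms", "10s", "1h", "0s", "2h", "1.5h"} {
		fmt.Printf("%-5s valid=%t\n", s, validateRateInterval(s) == nil)
	}
}
```

Comparing parsed durations avoids the digit-counting heuristic suggested earlier, which is the same advantage that a native duration type gives a CEL rule.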

robscott

Thanks @ericdbishop! This is really close to the wire for v1.3. If you're able to squeeze this in today, I think we're good to go; otherwise it may need to wait until the next release.

/approve

apisx/v1alpha2/shared_types.go

apisx/v1alpha2/backendtrafficpolicy.go

k8s-ci-robot · 2025-03-03T22:21:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ericdbishop, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [robscott]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-authored-by: Rob Scott <[email protected]>

kflynn · 2025-03-03T22:43:53Z

Assuming the question of where the validation expression should go gets resolved, I'm good with this, thanks for all the work @ericdbishop! 🙂

youngnick · 2025-03-04T01:33:05Z

/lgtm

Nice work pushing this through!

youngnick · 2025-03-04T01:35:46Z

/unhold

k8s-ci-robot requested review from robscott and youngnick February 10, 2025 18:04

ericdbishop changed the title ~~Gep 3388 retry budget api implementation~~ GEP 3388 Retry Budget API Implementation Feb 10, 2025

ericdbishop added 10 commits February 13, 2025 09:08

apis: add implementation for GEP-3388 HTTPRoute Retry Budget

4b4fd67

fmt and add descriptions for parameters

1993417

Move GEP 3388 to Experimental

81ee318

make generate

7b61c97

Minor change

6661e47

Require both parameters of RequestRate

e359c3e

Begin fixing Retry description. Add defaults, some validation, in CommonRetryPolicy

b166a68

Taking the liberty of renaming CommonRetryPolicy to RetryConstraint

1a122fa

Shamelessly copying from backendlbpolicy and backendtlspolicy to conform with api structure

5a02c8e

Fleshing out the description for RetryConstraint

5f2b55b

ericdbishop requested a review from robscott February 27, 2025 21:37

fix imports

d3741e2

robscott approved these changes Feb 28, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2025

Modify descriptions; add greater validation

c624de3

robscott reviewed Mar 3, 2025

Update apisx/v1alpha2/backendtrafficpolicy.go

7afef1e

Co-authored-by: Rob Scott <[email protected]>

ericdbishop added 2 commits March 3, 2025 17:51

Not complete, but adding CEL tests for backend traffic policy

c4910de

fix missing retryConstraint in test struct; add tests for invalid config

ad75b71

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 3, 2025

ericdbishop added 8 commits March 3, 2025 18:54

move to apis/v1alpha1

4d7bc3f

add main_test

7c8fe59

rename file

6fc4796

fix missing quotes :)

bc35cad

fix CEL condition

9a1f63b

move validation message to interval field directly

46431b3

remove unused apisx/v1alpha2 file

a7a263b

remove more files from before merging experimental api versions

9a1d005

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 4, 2025

k8s-ci-robot assigned youngnick Mar 4, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2025

k8s-ci-robot merged commit 29a5dd8 into kubernetes-sigs:main Mar 4, 2025
13 checks passed
