GEP 3388 Retry Budget API Implementation #3607
Hi @ericdbishop. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// Retry defines the configuration for when to retry a request to a target
// backend.
//
// Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
// TCP failure) if a retry stanza is configured.
//
// Support: Extended
//
// +optional
// <gateway:experimental>
Retry *CommonRetryPolicy `json:"retry,omitempty"`
Planning to correct this description (previously discussed here), but I'm also considering changing Retry to RetryBudget so we can better capture the distinction between a constrained budget on retries, versus the static count retries that are configured within HTTPRoute. I think CommonRetryPolicy is okay but would also be curious if we think RetryBudgetPolicy would be more self-explanatory.
CommonRetryPolicy was originally an abstraction from the initial "two possible approaches" proposal just to minimize duplication - agreed that the Common* prefix is probably no longer appropriate, but not quite sure what the correct name should be here:
- I feel like *Policy implies a top-level resource like BackendTrafficPolicy that is actually an impl of the policy attachment pattern, not a sub-resource.
- We could just collapse the fields into BackendTrafficPolicy inline, but I like the way SessionPersistence is broken out currently - it feels like it will be more composable if we add additional functionality to BackendTrafficPolicy.
- I'm not quite sure yet if we do indeed want to narrow the scope down to RetryBudget or choose a name that could allow additional fields within this stanza.
I agree with those points, I think it makes sense to leave the retry budget configuration broken out from BackendTrafficPolicy instead of inline. I could see replacing the name CommonRetryPolicy with something like RetryConstraint if we wanted to allow for the possibility down the line of constraining retries based off of something other than a budget.
I took some liberty here in renaming Retry and the CommonRetryPolicy struct to RetryConstraint, but open to better suggestions.
// Support: Extended
//
// +optional
BudgetPercent *int `json:"budgetPercent,omitempty"`
Previous comment on validation. The maximum valid argument for BudgetPercent should be 100 as that is effectively the same as having no retry budget at all, but should the minimum value we allow be 0? Should users be allowed to block all retries in that way?
I set the minimum as 0 for the time being.
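For illustration only (not code from this PR): a minimal sketch of how the 0 to 100 bounds discussed above might be expressed with kubebuilder validation markers, plus an equivalent plain-Go check. The struct name RetryBudgetSketch and the function validateBudgetPercent are hypothetical, and the default of 20 is taken from the "20% (the default)" wording in the RetryConstraint description.

```go
package main

import "fmt"

// RetryBudgetSketch is a hypothetical stand-in for the struct under review.
// Only the BudgetPercent field comes from the PR; the Minimum/Maximum/default
// markers are assumptions based on the bounds discussed in this thread.
type RetryBudgetSketch struct {
	// +optional
	// +kubebuilder:default=20
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	BudgetPercent *int `json:"budgetPercent,omitempty"`
}

// validateBudgetPercent mirrors the markers in plain Go: nil means "use the
// default", 0 blocks all retries, and 100 is effectively no budget at all.
func validateBudgetPercent(p *int) error {
	if p == nil {
		return nil
	}
	if *p < 0 || *p > 100 {
		return fmt.Errorf("budgetPercent must be between 0 and 100, got %d", *p)
	}
	return nil
}

func main() {
	for _, v := range []int{0, 20, 100, 101} {
		v := v
		s := RetryBudgetSketch{BudgetPercent: &v}
		fmt.Println(v, validateBudgetPercent(s.BudgetPercent))
	}
}
```

Allowing 0 keeps the option of blocking retries entirely, which matches the minimum chosen above for the time being.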
// CommonRetryPolicy defines the configuration for when to retry a request.
type CommonRetryPolicy struct {
What's the minimum viable set of fields here for an implementation to say that they support retry budgets?
Given confirmation that Envoy's retry_budget spec could be modified to include a parameter that matches BudgetInterval, I think it would be safe to require that implementations should include all fields to be considered supporting retry budgets.
But that being said, I could see how BudgetInterval could be excluded to match Envoy's existing retry budget behavior which @mikemorris detailed here, making only MinRetryRate and BudgetPercent truly necessary.
"I could see how BudgetInterval could be excluded to match Envoy's existing retry budget behavior"
In the context of @tonya11en's comment at #3573 (comment) and envoyproxy/envoy#30205 (comment), even though this could be possible to enable, I'm unsure if it would actually be desirable even for Envoy-based implementations of Gateway API?
This additionally has some bearing on the semantic meaning of budgetInterval: 0 (weird, effectively a rate with a division by zero unless we use it as a shorthand for Envoy's current behavior) vs if we want to prescribe a default interval when omitting the field entirely (which could make UX more concise).
I do think having a default budgetInterval would make sense, going off of the default I see for Linkerd's ttl parameter, maybe 10s is reasonable? Agree about the meaning of budgetInterval: 0 being strange. Also, since it is a duration, it would require a unit of time. It does seem more desirable to require implementations to include all three parameters after seeing that additional context.
// RetryConstraint defines the configuration for when to allow or prevent
// further retries to a target backend by dynamically calculating a 'retry
// budget'. This budget is calculated based on the percentage of incoming
// traffic composed of retries over a given time interval. Once the budget
// is exceeded, additional retries will be rejected by the backend.
//
// For example, if the retry budget interval is 10 seconds, there have been
// 1000 active requests in the past 10 seconds, and the allowed percentage
// of requests that can be retried is 20% (the default), then 200 of those
// requests may be composed of retries. Active requests will only be
// considered for the duration of the interval when calculating the retry
// budget.
//
// Configuring a RetryConstraint in BackendTrafficPolicy is compatible with
// HTTPRoute Retry settings for each HTTPRouteRule that targets the same
// backend. While the HTTPRouteRule Retry stanza can specify whether a
// request should be retried and the number of retry attempts each client
// may perform, RetryConstraint helps prevent cascading failures, such as
// retry storms, during periods of consistent failures.
//
// After the retry budget has been exceeded, additional retries to the
// backend must return a 503 response to the client.
//
// Additional configurations for defining a constraint on retries MAY be
// defined in the future.
This entire description requires wordsmithing.
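As a possible aid to that wordsmithing, the arithmetic in the example above (1000 active requests, 20% budget, 200 retries) can be sketched as follows. This is illustrative only: allowedRetries is a hypothetical helper, and treating the minimum retry rate as a simple floor (minRetries) is an assumption about how MinRetryRate interacts with the percentage, not something the PR text states.

```go
package main

import "fmt"

// allowedRetries returns how many of the active requests observed during the
// budget interval may be retries. minRetries is a hypothetical floor standing
// in for a minimum retry rate already scaled to the same interval.
func allowedRetries(activeRequests, budgetPercent, minRetries int) int {
	allowed := activeRequests * budgetPercent / 100
	if allowed < minRetries {
		return minRetries
	}
	return allowed
}

func main() {
	// The example from the description: 1000 requests in 10s at 20%.
	fmt.Println(allowedRetries(1000, 20, 0)) // 200
	// Low-volume traffic: 20% of 10 requests is 2, so the floor of 3 applies.
	fmt.Println(allowedRetries(10, 20, 3)) // 3
}
```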
…orm with api structure
@mikemorris @robscott @kflynn @youngnick @dprotaso Hi team, would appreciate an initial review as this PR is already pretty large. I cherry-picked/followed some of @dprotaso's changes from #3588 so would appreciate clarification if I correctly created a separate API group, thanks!
Thanks @ericdbishop!
// BudgetPercent defines the maximum percentage of active requests that may
// be made up of retries.
//
// Support: Extended
We need some more work on this, but I'd argue that we should have a concept of "this field MUST be supported if you support this feature" (retry budgets in this case).
// Support: Extended
//
// +optional
// +kubebuilder:default=10s
Seems like something we'll want to define a min and max value for.
I believe we were thinking that 0s could potentially be used as shorthand for the current behavior offered by Envoy. As far as a maximum I could pick some arbitrarily high value, but I'm not sure what would be appropriate. Thoughts @mikemorris?
@htuch @tonya11en Are y'all interested in enabling the existing "in-flight" Envoy functionality, or would a strict over-time measurement generally be preferable for most use cases? (Thinking if we actually just want to exclude 0s altogether)
Prefer not to have magical 0s value and just skip interval if the implementation doesn't support that capability.
I agree with @htuch (Hi, Harvey! 😂) that avoiding 0s being magical is a good idea.
It doesn't seem unreasonable to support existing Envoy behavior to me and just say skip budget interval in that case.
I think the only tricky part with that would be that making interval an optional field makes it potentially easier to create invalid configurations for impls requiring an interval and not supporting "in-flight" - might need to add an additional conformance feature name for this case.
This is somewhat the inverse of how features would typically be implemented by using an optional field rather than omitting it (and having the optional field required for implementations not supporting the feature), not sure if that might cause any conformance difficulties.
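To make the "strict over-time measurement" option in this thread concrete, here is a rough sketch of a sliding-window retry budget. It is not how Envoy, Linkerd, or this PR implements anything; the budget type, its methods, and the simulate helper are all hypothetical, and a real data plane would likely use cheaper bookkeeping than a slice of events.

```go
package main

import (
	"fmt"
	"time"
)

// event records one request and whether it was a retry.
type event struct {
	at    time.Time
	retry bool
}

// budget is a hypothetical sliding-window retry budget: retries may make up
// at most percent of all requests seen within the last interval.
type budget struct {
	interval time.Duration
	percent  int
	events   []event
}

// prune drops events that fall outside the interval ending at now.
func (b *budget) prune(now time.Time) {
	cutoff := now.Add(-b.interval)
	i := 0
	for i < len(b.events) && !b.events[i].at.After(cutoff) {
		i++
	}
	b.events = b.events[i:]
}

// record notes an original (non-retry) request.
func (b *budget) record(now time.Time) {
	b.prune(now)
	b.events = append(b.events, event{at: now, retry: false})
}

// allowRetry admits a retry only if, counting it, retries stay within budget.
func (b *budget) allowRetry(now time.Time) bool {
	b.prune(now)
	retries := 0
	for _, e := range b.events {
		if e.retry {
			retries++
		}
	}
	if (retries+1)*100 > (len(b.events)+1)*b.percent {
		return false
	}
	b.events = append(b.events, event{at: now, retry: true})
	return true
}

// simulate sends originals requests, then attempts retries at the same
// instant, and returns how many retries were admitted.
func simulate(originals, attempts, percent int) int {
	now := time.Unix(0, 0)
	b := &budget{interval: 10 * time.Second, percent: percent}
	for i := 0; i < originals; i++ {
		b.record(now)
	}
	accepted := 0
	for i := 0; i < attempts; i++ {
		if b.allowRetry(now) {
			accepted++
		}
	}
	return accepted
}

func main() {
	fmt.Println(simulate(80, 30, 20)) // 20 retries out of 100 total requests
	fmt.Println(simulate(0, 5, 20))   // 0: no traffic means no budget
}
```

The second case, where an idle window admits no retries at all, is the kind of low-volume situation a minimum retry rate would be meant to cover.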
/ok-to-test
/retest
Thanks @ericdbishop! Largely LGTM, just a couple comments about validation
//
// Support: Extended
// +kubebuilder:validation:Minimum=0
Count *int `json:"count,omitempty"`
We need to avoid unbounded values here, I'd recommend some kind of max here, even if that max is very high.
The context in which this is currently used is for MinRetryRate as "an override for very low volume traffic" as @htuch described in https://github.com/kubernetes-sigs/gateway-api/pull/3607/files#r1964340430 but this struct type is (intentionally) sufficiently generic that it could be re-used in the future for #326. As such, I would defer to @htuch on what a "safe" maximum might entail for this at scale with a potential denominator as short as 1ms (I'm assuming there may be integer type maximums to keep in mind for implementations too).
Given this is going to have a variable denominator, e.g. hour, I think you're just going to have to go with some type based limit or something that would make sense over a long period.
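A small sketch of why an upper bound on count matters once the denominator is variable: scaling a count-per-interval rate to another window multiplies the count by a nanosecond duration, which can overflow without a cap. The RequestRate type here only mirrors the Count and Interval fields under review; maxCount, validate, and countOver are hypothetical, and the value of maxCount is an arbitrary assumption since no maximum has been agreed in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// maxCount is a hypothetical upper bound chosen for illustration only.
const maxCount = 1000000

// RequestRate is a simplified stand-in: Count requests per Interval.
type RequestRate struct {
	Count    int
	Interval time.Duration
}

// validate rejects unbounded counts and non-positive intervals.
func (r RequestRate) validate() error {
	if r.Count < 0 || r.Count > maxCount {
		return fmt.Errorf("count must be between 0 and %d, got %d", maxCount, r.Count)
	}
	if r.Interval <= 0 {
		return fmt.Errorf("interval must be positive, got %s", r.Interval)
	}
	return nil
}

// countOver scales the rate to a window, e.g. 10 per 1s over 10s gives 100.
// With Count <= maxCount and window <= 1h (3.6e12 ns) the product stays
// below 3.6e18, which fits in an int64.
func (r RequestRate) countOver(window time.Duration) int64 {
	if r.Interval <= 0 {
		return 0
	}
	return int64(r.Count) * int64(window) / int64(r.Interval)
}

// Example values used in main below.
var (
	exampleRate   = RequestRate{Count: 10, Interval: time.Second}
	exampleWindow = 10 * time.Second
)

func main() {
	fmt.Println(exampleRate.validate(), exampleRate.countOver(exampleWindow))
	fmt.Println(RequestRate{Count: maxCount + 1, Interval: time.Hour}.validate())
}
```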
// time during which the given count of requests occur.
//
// Support: Extended
Interval *Duration `json:"interval,omitempty"`
This also feels like something that should have a min and max value. I'm assuming 1s is acceptable for a min, unclear what a good max is. Any ideas @htuch or @mikemorris?
Can we actually enforce min/max constraints on GEP-2257 Duration types?
If so, I think 1ms is the minimum expressible through that format (even if logically a per-second rate may be more common, expressing rates with a denominator in milliseconds may be helpful for higher-throughput use cases to work around a count max constraint), assuming we want to exclude the "divide by zero" shorthand for Envoy's current behavior.
The max expressible appears to be 99999h - in practice I'd expect even a rate per hour may be unlikely to be configured, instead preferring to enforce a more normal distribution over longer time spans, but maybe 1h, 12h or 24h is a reasonable max?
Good question, I think we'd have to define min+max with CEL using > and < comparisons. A rough example is here: https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/httproute_types.go#L320.
I agree that any of 1h, 12h, and 24h would be reasonable here, but I'd prefer to start with the lower value 1h since it would be much harder to tighten this validation retroactively than to loosen it.
The maximum GEP-2257 Duration is actually 99999h99999m99999s99999ms, which I would agree is quite a bit larger than would probably make sense here. I'm honestly not quite sure what kind of math we can do in CEL here, but a really simple check might be to disallow anything with more than two digits before h, or more than four before m or s?
Fortunately CEL supports a duration type that is compatible with our duration, so we should be able to have pretty reasonable comparisons here.
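A sketch of what such a CEL comparison could look like on the interval field, alongside a plain-Go mirror of the same check. This is untested against the actual CRD generation in this repository: the 1ms and 1h bounds are only the values floated in this thread, IntervalSketch and validateInterval are hypothetical names, and time.ParseDuration accepts a superset of the GEP-2257 format, so the Go check is looser about syntax than the API type.

```go
package main

import (
	"fmt"
	"time"
)

// IntervalSketch is a hypothetical stand-in for the struct holding Interval.
// The XValidation rule is an assumption modeled on the duration() comparisons
// already used for HTTPRoute timeouts.
type IntervalSketch struct {
	// +optional
	// +kubebuilder:validation:XValidation:message="interval must be between 1ms and 1h",rule="duration(self) >= duration('1ms') && duration(self) <= duration('1h')"
	Interval *string `json:"interval,omitempty"`
}

// Bounds assumed from the discussion above; not final values.
const (
	minInterval = time.Millisecond
	maxInterval = time.Hour
)

// validateInterval applies the same bounds in plain Go.
func validateInterval(s string) error {
	d, err := time.ParseDuration(s)
	if err != nil {
		return fmt.Errorf("invalid duration %q: %w", s, err)
	}
	if d < minInterval || d > maxInterval {
		return fmt.Errorf("interval %s must be between %s and %s", s, minInterval, maxInterval)
	}
	return nil
}

func main() {
	for _, s := range []string{"1ms", "10s", "1h", "0s", "2h", "99999h"} {
		fmt.Println(s, validateInterval(s))
	}
}
```

Starting with the tighter 1h bound follows the reasoning above that loosening validation later is easier than tightening it.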
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ericdbishop, robscott The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
Implements GEP-3388: Retry Budgets
Which issue(s) this PR fixes:
Fixes #3388
Does this PR introduce a user-facing change?: