iam: separate backoffs, add jitter, and increase for conflicts #10786
Conversation
Within our project, we can have multiple instances of Terraform modifying IAM policies, and in many cases these instances are kicked off at exactly the same time. We're running into errors where we exceed the backoff max (which in reality is 16 seconds, not 30). Also, Google recommends that backoffs contain jitter [1] to prevent clients from retrying all at once in synchronized waves. This change (1) separates the 3 distinct backoffs used in the IAM policy read-modify-write cycle, (2) introduces jitter on each retry, and (3) increases the conflict max backoff to 5 minutes. [1] https://cloud.google.com/iot/docs/how-tos/exponential-backoff
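As a rough illustration of the retry shape this change describes, here is a minimal Go sketch of exponential backoff with "full" jitter and a per-use-case cap. Everything in it is an assumption for illustration (the function names, the 1-second initial wait, the full-jitter strategy); it is not the provider's actual code.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op until it succeeds or the overall deadline
// passes. The nominal wait doubles each attempt up to maxBackoff, and
// "full jitter" sleeps a random duration in [0, wait) so clients that
// started in lockstep spread out instead of retrying in synchronized waves.
func retryWithJitter(op func() error, maxBackoff, deadline time.Duration) error {
	start := time.Now()
	wait := time.Second // assumed initial backoff
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Since(start) > deadline {
			return fmt.Errorf("retries exhausted: %w", err)
		}
		// Full jitter: sleep a random fraction of the current backoff.
		time.Sleep(time.Duration(rand.Int63n(int64(wait))))
		wait *= 2
		if wait > maxBackoff {
			wait = maxBackoff
		}
	}
}

func main() {
	// Per the PR description: conflict-prone IAM writes get the large
	// 5-minute cap, while the other steps of the read-modify-write cycle
	// can keep their own, shorter caps.
	_ = retryWithJitter(func() error { return nil }, 5*time.Minute, 10*time.Minute)
}
```

The point of separating the three backoffs, as the description puts it, is that each step of the read-modify-write cycle can then carry its own cap instead of sharing one setting.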
Force-pushed from 100cd5f to 1ad1a77.
@rileykarson Any chance you could have a look at this PR? 🙏
@slevenick do you mind taking a look? I think you're more familiar with the IAM framework than I am.
What symptoms are you seeing that you think jitter will fix? My understanding is that Terraform already introduces small differences in time when dispatching the processes that handle individual resources, and everything within an individual resource is handled serially. Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?
We tend to kick off a bunch of separate terraform applies at the same time in OpenShift CI (as many as 30, I think). We could certainly solve our problem at a higher abstraction layer if you'd prefer to leave the current code as is.
Even though 16s is ~5 retries, that's a relatively short window when we have so many concurrently running applies.
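A back-of-the-envelope for that "~5 retries" figure, assuming a 1-second initial wait that doubles each attempt (the initial value is an assumption; the PR text only states the 16-second cap): 1s + 2s + 4s + 8s + 16s ≈ 31s of cumulative sleep before the cap is exhausted. With ~30 applies started simultaneously on a deterministic schedule, every process retries at nearly the same instants inside that half-minute window, which is exactly the synchronized-wave pattern jitter is meant to break up.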
If you're kicking the runs off as separate terraform applies, then adding jitter at this level shouldn't have an impact across different runs of Terraform? It seems like the issue comes down to conflicting changes being made during this process. Is there a reason these resources are managed in different Terraform configs and handled by different CI jobs at the same time? It feels like a solution could be to manage all similar resources in a single file, as we have implemented batching for IAM resources for just this sort of situation.
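For readers unfamiliar with the batching being referenced: the provider coalesces IAM requests internally, and while the sketch below is not that implementation (every name in it is hypothetical), it shows the general idea of merging concurrent modifications to one policy into a single read-modify-write instead of N conflicting ones.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// modify is one caller's edit to a shared IAM-policy-like map.
type modify func(policy map[string]string)

type batcher struct {
	mu      sync.Mutex
	pending []modify
	timer   *time.Timer
	window  time.Duration
	apply   func([]modify)
}

// enqueue records a modification and arms a short window timer on the first
// request; everything that arrives before the window closes is flushed in a
// single apply call.
func (b *batcher) enqueue(m modify) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, m)
	if b.timer == nil {
		b.timer = time.AfterFunc(b.window, b.flush)
	}
}

func (b *batcher) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending, b.timer = nil, nil
	b.mu.Unlock()
	b.apply(batch)
}

func main() {
	b := &batcher{
		window: 50 * time.Millisecond,
		apply: func(batch []modify) {
			// One read-modify-write for the whole batch.
			policy := map[string]string{}
			for _, m := range batch {
				m(policy)
			}
			fmt.Printf("applied %d modifications in one write: %v\n", len(batch), policy)
		},
	}
	b.enqueue(func(p map[string]string) { p["roles/viewer"] = "user:a@example.com" })
	b.enqueue(func(p map[string]string) { p["roles/editor"] = "user:b@example.com" })
	time.Sleep(200 * time.Millisecond) // demo only: let the window elapse
}
```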
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.