Skip to content

Do not resync blocks in running store gateways during rollout deployment and container restart #5363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

justinjung04
Copy link
Contributor

@justinjung04 justinjung04 commented May 25, 2023

What this PR does:

Problem

Currently the logic for existing store gateway instances to re-sync block is too aggressive, leading to unnecessary re-sync when pods are restarted due to OOM or when doing rollout deployment.
Unnecessary re-sync increases chance of OOM and query unavailability.

Current Behavior

  • ring.HasReplicationSetChanged is used to determine whether re-sync should be performed (true if number of instances, or state/address/token of any instance changes)
  • Store gateway is scaled up => Number of instances in the ring is increased, then the instance state changes from JOINING -> ACTIVE => existing instances re-sync blocks at most twice, once for number of instance change and once for state change
  • Store gateway is scaled down => Instance state changes from ACTIVE -> LEAVING, then number of instances in the ring is decreased => existing instances re-sync blocks at most twice, once for state change and once for number of instance change
  • Store gateway instance is restarted (OOM) => Instance state changes from ACTIVE -> JOINING -> ACTIVE => existing instances re-sync blocks at most twice, both for state change
  • Store gateway instance is replaced by a new instance (rollout deployment) => Instance state changes from ACTIVE -> LEAVING, instance address changes, then new instance state changes from JOINING -> ACTIVE => existing instances re-sync blocks at most three times, twice for state change and once for address change

Proposed Behavior

  • ring.HasReplicationSetChangedWithoutStateAndAddress is used to determine whether re-sync should be performed (true if number of instances, or token of any instance changes)
  • Store gateway is scaled up => Number of instances in the ring is increased, then the instance state changes from JOINING -> ACTIVE => existing instances re-sync blocks at most once for number of instance change
  • Store gateway is scaled down => Instance state changes from ACTIVE -> LEAVING, then number of instances in the ring is decreased => existing instances re-sync blocks at most once for number of instance change
  • Store gateway instance is restarted (OOM) => Instance state changes from ACTIVE -> JOINING -> ACTIVE => existing instances do not re-sync blocks
  • Store gateway instance is replaced by a new instance (rollout deployment) => Instance state changes from ACTIVE -> LEAVING, instance address changes, then new instance state changes from JOINING -> ACTIVE => existing instances do not re-sync blocks if tokens_file_path is set. Otherwise, existing instances re-sync blocks once

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Justin Jung <[email protected]>
@justinjung04 justinjung04 force-pushed the store-gateway-sync-improvement branch from 65edfa6 to 309ea10 Compare May 25, 2023 23:13
…be less than the replication factor

Signed-off-by: Justin Jung <[email protected]>
@justinjung04 justinjung04 marked this pull request as ready for review May 31, 2023 14:58
@justinjung04 justinjung04 force-pushed the store-gateway-sync-improvement branch from a83fb9c to 5843bcb Compare May 31, 2023 17:43
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. Thanks for the detailed writeup and contribution!

@yeya24
Copy link
Contributor

yeya24 commented Jun 2, 2023

cc @alvinlin123 or @alanprot for another review.

@alanprot
Copy link
Member

alanprot commented Jun 5, 2023

Nice!! This LGTM.

@alanprot alanprot merged commit d988472 into cortexproject:master Jun 5, 2023
cathy-qiu pushed a commit to cathy-qiu/cortex that referenced this pull request Jun 7, 2023
…ent and container restart (cortexproject#5363)

* Do not resync blocks in running store gateways during rollout deployment and container restart

Signed-off-by: Justin Jung <[email protected]>

* Add changelog

Signed-off-by: Justin Jung <[email protected]>

* Change numScaleUpGateways in RingTopologyChangedAfterScaleUp test to be less than the replication factor

Signed-off-by: Justin Jung <[email protected]>

* Ignore RegisteredTimestamp in HasReplicationSetChangedWithoutStateAndAddress

Signed-off-by: Justin Jung <[email protected]>

---------

Signed-off-by: Justin Jung <[email protected]>
Signed-off-by: Cathy Qiu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants