Support for decommissioning and recommissioning a zone #3402

Open
@Bukhtawar

Is your feature request related to a problem? Please describe.
There are use cases where decommissioning a zone/rack might be beneficial:

  1. In a multi-zone deployment, it might be good to support zonal deployment rather than a rolling restart per node, which can be too slow for a big cluster. In such a case the zone serves as the unit of deployment.
  2. An andon cord during zonal outages is another handy mechanism: it enables graceful traffic shutdown in the impacted zone/rack, especially when certain nodes are still operating in a degraded manner.

Background
With #2859 we intend to weigh away shard search traffic. However, since OpenSearch follows a synchronous replication model, replication requests can get stuck due to any impairment in the write path. The current health check mechanisms for detecting and remediating a bad node are only a best-effort strategy and do not cover deeper health checks across all network paths. For predictability, we propose pulling an andon cord to cut off inter-zone replication traffic, which can be achieved by decommissioning the nodes in the impacted zone.

Implications
Decommissioning a zone may cause shards in that zone that were taking write traffic to be failed, so that data consistency semantics are honoured and stale shard copies are marked unavailable. To make sure no in-flight requests fail, we need to weigh away shard search traffic as part of #2859. In setups without dedicated coordinator nodes, we also need to ensure that no HTTP traffic is being sent and that all traffic is drained before the decommission API is triggered.

Describe the solution you'd like
A graceful mechanism to:

  1. Decommission a zone/rack
  2. Recommission a zone/rack

POST /_cluster/decommission
{
 "awareness_attribute" : {"zone" : "A-0"}
}

DELETE /_cluster/decommission
{
 "awareness_attribute" : {"zone" : "A-0"}
}
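To make the proposed request shape concrete, here is a minimal Python sketch that builds (but does not send) the two requests above. The endpoint and body are the ones proposed in this issue, not an API confirmed to exist in any release; `BASE_URL` and the helper name `build_decommission_request` are hypothetical and only for illustration.

```python
import json
import urllib.request

# Hypothetical cluster address, used only for illustration.
BASE_URL = "http://localhost:9200"


def build_decommission_request(attribute, value, recommission=False, base_url=BASE_URL):
    """Build the proposed request without sending it.

    POST decommissions the given awareness attribute value,
    DELETE recommissions it, as proposed in this issue.
    """
    body = json.dumps({"awareness_attribute": {attribute: value}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/decommission",
        data=body,
        method="DELETE" if recommission else "POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    for recommission in (False, True):
        req = build_decommission_request("zone", "A-0", recommission=recommission)
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Both operations share the same path and body and differ only in the HTTP verb, which mirrors the proposal above.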

Constraints

  1. The attribute value should be one of the values in union(forces_zone, discovered_zone)
  2. There should be only one active zone under decommission or recommission at any time
  3. The shard request weights on the decommissioned zone from #2859 ([Feature] Support for weighted zonal search request routing policy) should be set to zero
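The three constraints can be sketched as a validation step. This is only an illustration under stated assumptions: the `ZoneState` type and `validate_decommission` function are hypothetical, constraint 3 is treated here as a precondition (it could equally be an effect of the decommission call), and a zone with no recorded weight is assumed to have a non-zero weight.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ZoneState:
    """Hypothetical snapshot of the cluster state relevant to the constraints."""
    forced_zones: Set[str]
    discovered_zones: Set[str]
    # Zone currently under decommission or recommission, if any.
    active_zone: Optional[str] = None
    # Search routing weights per zone, in the spirit of #2859.
    search_weights: Dict[str, float] = field(default_factory=dict)


def validate_decommission(state: ZoneState, zone: str) -> List[str]:
    """Return a list of constraint violations; an empty list means the request is acceptable."""
    errors = []
    # Constraint 1: the value must be a forced or discovered zone.
    if zone not in state.forced_zones | state.discovered_zones:
        errors.append(f"unknown zone: {zone}")
    # Constraint 2: only one zone may be under decommission/recommission.
    if state.active_zone is not None and state.active_zone != zone:
        errors.append(f"zone {state.active_zone} is already being processed")
    # Constraint 3 (assumed to be a precondition): weight must be zero.
    # A missing weight is assumed to be non-zero.
    if state.search_weights.get(zone, 1.0) != 0:
        errors.append(f"search weight for zone {zone} is not zero")
    return errors


if __name__ == "__main__":
    state = ZoneState(
        forced_zones={"A-0", "A-1"},
        discovered_zones={"A-2"},
        search_weights={"A-0": 0.0, "A-1": 1.0},
    )
    print(validate_decommission(state, "A-0"))
    print(validate_decommission(state, "A-1"))
```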

Labels: discuss, enhancement
