Description
Is your feature request related to a problem? Please describe.
There are use cases where decommissioning a zone/rack is beneficial:
- In a multi-zone deployment, it may be better to deploy zone by zone rather than with a rolling restart per node, which can take too long on a large cluster. In that case, the zone serves as the unit of deployment.
- An andon cord during zonal outages is another handy mechanism: it enables graceful traffic shutdown in the impacted zone/rack, especially when certain nodes are still operating in a degraded manner.
Background
With #2859 we intend to weigh away shard search traffic. However, since OpenSearch follows a synchronous replication model, replication requests can get stuck due to any impairment in the write path. The current health check mechanisms to detect and remediate a bad node are only a best-effort strategy and don't cover deeper health checks across all network paths. For predictability, we propose pulling an andon cord to cut off inter-zone replication traffic, which can be achieved by decommissioning the nodes in the impacted zone.
Implications
As a result of decommissioning a zone, all shards that were taking in write traffic might fail; this ensures that data consistency semantics are honoured and stale shards are marked unavailable. To make sure no in-flight requests fail, we need to weigh away shard search traffic as part of #2859. In setups without dedicated coordinator nodes, we need to ensure that no HTTP traffic is being sent and that all traffic is drained before the decommission API is triggered.
Describe the solution you'd like
A graceful mechanism to:
- Decommission a zone/rack
- Recommission a zone/rack
POST /_cluster/decommission
{
  "awareness_attribute" : {"zone" : "A-0"}
}

DELETE /_cluster/decommission
{
  "awareness_attribute" : {"zone" : "A-0"}
}
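As a rough illustration of how a client might call the two proposed endpoints, here is a minimal Python sketch that only builds the requests and does not send them. The endpoint and body shape come from this proposal and do not exist yet; `BASE_URL` and the helper name `build_decommission_request` are assumptions made up for this example.

```python
import json
from urllib.request import Request

# Assumption: address of a local cluster, used for illustration only.
BASE_URL = "http://localhost:9200"


def build_decommission_request(attribute: str, value: str,
                               recommission: bool = False) -> Request:
    """Build (but do not send) a request for the proposed API.

    POST decommissions the given awareness attribute value,
    DELETE recommissions it, mirroring the examples above.
    """
    body = json.dumps({"awareness_attribute": {attribute: value}})
    return Request(
        url=f"{BASE_URL}/_cluster/decommission",
        data=body.encode("utf-8"),
        method="DELETE" if recommission else "POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    for recommission in (False, True):
        req = build_decommission_request("zone", "A-0", recommission)
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to check the payload before any traffic reaches the cluster.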
Constraints
- The attribute value should be one of the values in union(forces_zone, discovered_zone)
- Only one zone should be under decommission or recommission at any time
- The shard request weights on the decommissioned zone, from "[Feature] Support for weighted zonal search request routing policy" (#2859), should be set to zero
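The constraints above could be checked before a decommission is accepted, roughly as in the following Python sketch. This is not how OpenSearch implements it; the function name, its parameters, and the choice to treat the zero weight as a precondition (with a missing weight counted as non-zero) are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def validate_decommission(zone: str,
                          forced_zones: Set[str],
                          discovered_zones: Set[str],
                          active_zone: Optional[str],
                          search_weights: Dict[str, float]) -> None:
    """Raise ValueError if a decommission request violates a constraint.

    All inputs are hypothetical: in a real cluster they would come from
    the awareness settings, the discovered node attributes, and the
    weighted routing configuration from #2859.
    """
    # Constraint 1: value must be in union(forced zones, discovered zones).
    if zone not in (forced_zones | discovered_zones):
        raise ValueError(f"unknown zone: {zone!r}")

    # Constraint 2: only one zone may be under decommission/recommission.
    if active_zone is not None:
        raise ValueError(f"zone {active_zone!r} is already in progress")

    # Constraint 3: search weight for the zone must be zero.
    # Assumption: a missing weight is treated as non-zero.
    if search_weights.get(zone) != 0:
        raise ValueError(f"search weight for {zone!r} is not zero")


if __name__ == "__main__":
    validate_decommission(
        zone="A-0",
        forced_zones={"A-0", "A-1"},
        discovered_zones={"A-0", "A-1", "A-2"},
        active_zone=None,
        search_weights={"A-0": 0, "A-1": 1, "A-2": 1},
    )
    print("request accepted")
```

Whether the weight should be a precondition or something the decommission API sets itself is left open by the constraint; the sketch picks the stricter reading.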