Description
Is your feature request related to a problem? Please describe
Background
We have had a few great RFCs over the last year or so (#7258 and #14596) with the goals of separating indexing and search traffic within a cluster to achieve independent scalability and failure/workload isolation.
This issue sets out next steps for implementation using the ideas from these RFCs, with the goal of achieving full isolation. It is opinionated in that it focuses on clusters leveraging remote storage in order to achieve that full isolation. Variants are possible for local clusters (node-to-node segrep) that achieve varying levels of isolation, but they are not in the initial scope.
Note - One point of contention on the original RFC was whether to make this feature pluggable if it is only geared toward cloud-native use cases. At this time I don’t believe it makes sense to start with a plugin, as the only functionality we could reasonably pull out is allocation with ClusterPlugin. There are building blocks here that will be beneficial to all clusters using segment replication.
Desired Architecture:
Requirements:
- Create or restore an index with specialized indexer and searcher shards.
- Allow users the option to physically separate these indexers and searchers onto separate hardware.
- Searcher shard counts are independently tunable via API.
- Scale to zero - turn off all read or write traffic independently.
- Users should be able to leverage this isolation without much additional configuration.
- Code changes should be backward compatible with 2.x and not rely on a 3.0 release.
Out of scope (at least initially):
- Stateless writers / Dynamic resharding - adding/removing primary shards per index is not possible without rollover.
- Shard splitting - Splitting writer shards into multiple search shards or vice versa.
- Coordinator layer changes. To achieve separation at the coordinator layer, users will need to route their traffic to separate node groups. By default, all shards act as coordinators for all operation types, and rerouting between shard types does not achieve isolation. In the future, however, we can provide trimmed-down coordinators that only handle certain request types, but that is not an initial concern.
- Cluster state filtering/splitting to specific shard types. This is out of scope for the first version but is needed to achieve full isolation.
Describe the solution you'd like
Implementation
The first step in achieving our desired architecture is introducing the concept of a “search” shard, while existing primaries and replicas remain specialized for indexing to preserve write availability. A search shard is internally treated as a separate replica type within an IndexShard’s routing table. By default, search shards need not be physically isolated to separate hardware, nor are they required to serve all search requests; however, we can use them to achieve the desired separation. Search replicas will have the following properties that make them lower overhead than normal replicas (related issue):
- They are not primary eligible
- They do not receive any replicated operations; in particular, they skip the primary term check on the write path.
- Interface only with remote storage to pull segments for recovery and replication flows.
Even without physical separation this takes a step toward decoupling write paths from search paths.
Routing layer changes & disabling primary eligibility
As called out in @shwetathareja’s RFC, in order to introduce the search replica we need to differentiate at the shard routing layer. We can do this with a simple flag in ShardRouting to denote the shard as “searchOnly”, similar to how a routing entry is flagged as “primary”. Once we have this flag we can add additional ShardRouting entries based on a count configured through the API, similar to primary/replica counts. The create index flow would be updated accordingly, with similar updates to the restore and update index flows.
Once the shard routing table is constructed with separate replica types, we can easily leverage this within allocation, primary selection, and request routing logic to filter out search shards appropriately.
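For illustration only, the routing table exposed via the existing cluster state API might then carry the proposed flag roughly as below. The searchOnly field name and its placement are assumptions of this sketch, not a finalized format:

curl "http://localhost:9200/_cluster/state/routing_table/test?pretty"

  "routing_table" : {
    "indices" : {
      "test" : {
        "shards" : {
          "0" : [
            { "state" : "STARTED", "primary" : true,  "searchOnly" : false, "node" : "...", "shard" : 0, "index" : "test" },
            { "state" : "STARTED", "primary" : false, "searchOnly" : false, "node" : "...", "shard" : 0, "index" : "test" },
            { "state" : "STARTED", "primary" : false, "searchOnly" : true,  "node" : "...", "shard" : 0, "index" : "test" }
          ]
        }
      }
    }
  }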
Recovery
In remote store clusters, replicas recover by fetching segments from the remote store; however, this recovery is still initiated as a peer recovery and still runs all the steps required to bring the shard in sync with its primary. Search shards do not need these steps and can leverage the existing RemoteStoreRecoverySource to recover directly from the remote store.
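As a sanity check once this lands, the existing _cat/recovery API already reports the recovery type per shard; search shard recoveries should then show a remote-store based recovery source instead of peer. The exact type label shown below is illustrative, not confirmed:

curl "http://localhost:9200/_cat/recovery/test?v&h=index,shard,type,stage,bytes_percent"
index shard type         stage bytes_percent
test  0     remote_store done  100.0%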
Replication
To make search replicas pull-based, we use scheduled tasks to trigger segment replication cycles through SegmentReplicationTargetService.forceReplication. At the beginning of a SegRep cycle we fetch the latest metadata from the remote store, compute a diff, and fetch the required segments. This scheduled task will run on a configurable interval based on freshness requirements and continue until the index is marked as read-only.
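Assuming the index-level interval setting proposed under New Configuration below (index.segment.replication.interval, not yet available), tuning search replica freshness for an index would be a plain settings update:

curl -XPUT "http://localhost:9200/test/_settings" -H 'Content-Type: application/json' -d'
{
  "index.segment.replication.interval": "5s"
}'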
Write & refresh cycle:
Segrep Stats & backpressure:
We will also need to make changes to how we track replica status to serve stats APIs and fail replicas if they fall too far behind. Today these stats and operations are performed from primary shards only.
- Compute lag/bytes behind for _stats APIs at the coordinator by returning the current ReplicationCheckpoint from each shard. This object already includes all the data required to compute the desired stats (see the illustrative output after this list).
- Disable normal SegRep backpressure, as primaries will not know where search replicas are. We can instead rely on remote store pressure, which applies backpressure on upload delays.
- To enforce freshness among the search shards we can internally fail them if they are too far behind.
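For monitoring, the existing _cat/segment_replication API is the natural place to surface the coordinator-computed lag described above. An illustrative (not verbatim) request and output:

curl "http://localhost:9200/_cat/segment_replication?v"
shardId   target_node  checkpoints_behind  bytes_behind  current_lag
[test][0] runTask-1    0                   0b            0s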
Node Separation
Allocation
Up until this point search replicas can live side by side with other shard types. In order to make node separation opt-in, an additional setting is required to force allocation to a set of nodes, either by node attribute or by role. I propose we separate the logic for search shards into a new setting for use with FilterAllocationDecider, specific to shard type across the cluster. This would allow more flexibility to allocate based on any node attribute and could be extended to optionally support node role. This requires two changes: first, honor role as a filter by adding _role as a special keyword within the filter, similar to _host, _name, etc.; second, add a new require filter based on shard type. To enforce separation the user would then set one of the following (a fuller example follows below):
search.shard.routing.allocation.require.<node-attr>: <attr-value>
or
search.shard.routing.allocation.require._role: <node role (ex search)>
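For example, assuming the proposed require filter is registered as a dynamic cluster setting, enforcing separation onto nodes tagged with a custom attribute could look like the following. The node_type attribute name is just a placeholder; node.attr tagging and the cluster settings API work this way today, and only the search.shard.* setting is new:

# opensearch.yml on the dedicated search nodes (existing mechanism)
node.attr.node_type: search

# apply the proposed filter cluster-wide
curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.shard.routing.allocation.require.node_type": "search"
  }
}'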
Alternative:
We can leverage the search node role directly and allocate any search shard to nodes with that role through the TargetPoolAllocationDecider, treating any search shard as REMOTE_ELIGIBLE and thereby allocating it to search nodes. While perhaps simpler out of the box, this doesn’t give users much flexibility. Further, the search role is currently intended for searchable snapshot shards, and there is some work pending to have hot/warm shards live on the same box.
Request Routing:
Within a non-isolated cluster with search replicas (no filter present): requests without any preference passed can hit any active shard.
Within a cluster with node separation (a filter is present): requests without any preference passed will hit only search replicas.
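In other words, no client-side change is needed to benefit from the separation; the same plain query is simply routed differently depending on whether a filter is in place:

# No filter: may be served by a primary, writer replica, or search replica.
# Filter present: served only by search replicas.
curl "http://localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}'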
Scale to zero:
Search:
With the feature enabled we will be able to easily scale searchers to zero, since existing primary and replica groups take no dependency on searcher presence. However, we will need to update operation routing so that any search request arriving without a primary preference is rejected with a notice that the index is not accepting reads.
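For example, scaling the search side to zero would then be a plain settings update using the proposed number_of_search_only_shards setting, leaving writer shards untouched:

curl -XPUT "http://localhost:9200/test/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_search_only_shards": 0
  }
}'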
Index:
TBD - will create a separate issue for these changes.
API Changes
Create/update/restore index will all need to support a separate parameter for the number of search only shards. Example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "number_of_search_only_shards": 1
    }
  }
}'
number_of_replicas will control only the count of writer replicas, while number_of_search_only_shards will control the count of search replicas.
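Restore from snapshot would accept the same setting through the existing index_settings override, roughly as follows (my-repo and my-snapshot are placeholders; only number_of_search_only_shards is new):

curl -XPOST "http://localhost:9200/_snapshot/my-repo/my-snapshot/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "test",
  "index_settings": {
    "index.number_of_search_only_shards": 2
  }
}'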
New Configuration
opensearch.experimental.feature.search_replica.enabled - feature flag
index.number_of_search_only_shards - to control search replica count (default: 0)
search.shard.routing.allocation.require - AllocationFilter to optionally force search replicas to specific set of nodes. (default off)
index.segment.replication.interval - interval at which to initiate replication cycles. (default 10s)
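Grouping the proposed settings by where they would be applied (a sketch; the scopes are my assumption based on analogous existing settings):

# opensearch.yml (static, per node) - assumed mechanism for the feature flag
opensearch.experimental.feature.search_replica.enabled: true

# index-level settings (per index, e.g. in the create index request above)
index.number_of_search_only_shards: 1
index.segment.replication.interval: 10s

# cluster-level, dynamic (via PUT _cluster/settings)
search.shard.routing.allocation.require._role: search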
Additional API Changes:
Cluster health:
Cluster health today has limitations in that the color codes can be misleading and are based only on allocated shards. I propose we initially leave the definition of this status as is and adopt any new changes as part of this issue to improve cluster health.
The color-coded cluster status will then mean:
green = all desired shards are active.
yellow = primaries are assigned but not all replicas or all search shards are active. Ex. 1 of 2 desired searchers per primary are active.
red = a primary is not active (no write availability)
This would mean that if a user has zero read availability the cluster would still report as yellow. Users will need to use existing APIs (_cat/shards), metrics, etc., to diagnose the issue.
Alternative:
To further assist in debugging, we could include more detailed information on search vs. indexing status. We could also provide more granular “search” shard counts, but to start this will help us diagnose which side is having issues and quickly find unassigned shards with the cat API. Something like:
{
  "cluster_name" : "opensearch-cluster",
  "status" : "green",
  "search_status" : "green",
  "indexing_status" : "green",
  ...

All remaining shard counts will now include search shards:

  "active_primary_shards" : 6,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  ...
Cat shards:
To differentiate search replicas from regular replicas, we can mark them with [s] in the cat shards output.
curl localhost:9200/_cat/shards
test2 0 p STARTED 0 208b 127.0.0.1 runTask-0
test2 0 s STARTED 0 208b 127.0.0.1 runTask-1
test2 0 r STARTED 0 208b 127.0.0.1 runTask-3
Task breakdown & Rollout (to convert to meta):
- Introduce new search replica concept and changes to create API
- Add Allocation filter based on shard type.
- Changes to update index API to support toggling search replica counts.
- Updates to restore from snapshot with added search replicas
- Replication updates to make search replica pull based.
- Update Segment replication stats APIs and backpressure.
-- Experimental (target 2.17)
- Update recovery for search replicas to be directly from remote store.
- Update API specs & clients
- Updates to cat API
- Updates to cluster health API
- scale to zero writer changes (TBD)
- ISM integration to auto scale down writers on index migration
-- GA (target - pending feedback)
Related component
Other
Describe alternatives you've considered
See RFCs
Additional context
No response