
[RFC] Reader and Writer Separation in OpenSearch #7258

@shwetathareja

Description

Background:
Today, in OpenSearch, indexing and search are supported via the same “data” role. This causes the indexing and search workloads to adversely impact each other: an expensive query can consume all available memory and CPU, causing indexing requests to fail or slow down. On one side are log analytics customers who want to scale their indexing to millions of docs per second; on the other are search customers who configure a large number of replicas (e.g. 20) to support their search traffic. What both really want is to scale for their respective workloads, but currently OpenSearch doesn’t support isolating indexing and search workloads.

Goal:
Support reader and writer separation in OpenSearch to provide predictable performance for indexing and search workloads, so that a sudden surge in one doesn’t impact the other. This document discusses how we can achieve this separation in OpenSearch.

Terminology:

  1. Reader denotes Search
  2. Writer denotes Indexing

Benefits:

  1. Workload isolation to provide more predictable performance for indexing and search traffic.
  2. Configure and scale indexing and search independent of each other.
  3. Cost savings and better performance by using specialized hardware instances (e.g. compute instances for indexing and high memory instances for search) for different workload.
  4. Tuning buffers and caches based on workload on the node to get optimal performance.

Proposal
In order to achieve indexing and search separation, we would build on the new node role “search”. With searchable snapshots, work has already started on the “search” role for searching read-only indices from snapshots (#4689). This RFC will expand on that effort. Nodes with the “search” role would act as dedicated search nodes. The term “replica” would split in the core logic into “write-replica” and “search-replica”. The primary/write-replica would be assigned to nodes with the “data” role, whereas the search-replica would be assigned to nodes with the “search” role.

Request Handling and Routing
The primary copy or write-replica can support both reads and writes, and this would be the default configuration where the user has not explicitly configured any “search-replica” or “search” nodes. The user would have control at the request level to decide whether the request should be handled by a “search-replica” only, or fall back to the primary or a “write-replica” when no search-replica is available. The latter would be the default mode.

(Diagram: reader-writer request flow)

Let’s take a look at the flow for a bulk/search request (RequestRouting):

  1. The customer sends a bulk request, which reaches the coordinator node.
  2. The coordinator checks the RoutingTable in the ClusterState, finds the node hosting the primary, and routes the request to that node for indexing.
  3. The primary sends the replication request to the write-replicas only.
  4. For search traffic, based on the user preference, the coordinator checks the RoutingTable, finds the nodes with the “search” role, and routes the request to them.
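
The routing flow above can be sketched as follows (a hypothetical simplification, not actual OpenSearch code; the `ShardCopy` and `route` names are illustrative):

```python
# Hypothetical sketch of coordinator-side routing once replicas are split
# into write-replicas and search-replicas (names are illustrative).
from dataclasses import dataclass

@dataclass
class ShardCopy:
    node: str
    kind: str  # "primary", "write-replica", or "search-replica"

def route(request_type, routing_table, fallback=True):
    """Pick target nodes for a bulk or search request."""
    if request_type == "bulk":
        # Writes go to the primary, which then replicates to write-replicas only.
        return [c.node for c in routing_table if c.kind == "primary"]
    # Search: prefer search-replicas...
    search_nodes = [c.node for c in routing_table if c.kind == "search-replica"]
    if search_nodes:
        return search_nodes
    if fallback:  # ...falling back to primary/write-replica (the proposed default)
        return [c.node for c in routing_table
                if c.kind in ("primary", "write-replica")]
    return []  # strict mode: search-replica only
```

The `fallback` flag mirrors the per-request preference described above: fall back to writable copies by default, or restrict a request to search-replicas only.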

Search role

The search role on the nodes can be configured in 2 ways:

  1. User assigns dedicated search role to a node. That means this node will not have data role. All good, no issues.
  2. User configures both data and search role to the same node. Then there are 2 choices,
    a. either have strict logic to prevent the same node from having both data and search roles
    b. or, let them configure both, but they might not get optimal performance and workload separation

For #2, the proposal is to go with 2.a: we shouldn’t allow the same node to have both the search and data roles, as that would defeat the purpose of introducing this separation in the first place. On further thought, 2.b essentially reduces to the current model, where any shard can land on any data node and serve both reads and writes.
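
As a sketch, the strict check in 2.a could look like the following (hypothetical; not actual OpenSearch validation code):

```python
# Hypothetical sketch of the strict validation in option 2.a: reject a
# node configuration that combines the "data" and "search" roles.
def validate_roles(roles):
    if "data" in roles and "search" in roles:
        raise ValueError(
            "a node cannot have both 'data' and 'search' roles: "
            "co-locating indexing and search defeats workload isolation")
    return roles
```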

This separation would enable users to think about their indexing and search workloads independently of each other and plan their capacity accordingly, e.g. search nodes can be added/removed on demand without impacting indexing throughput, and vice versa.

Characteristics of different shard copies

  1. primary - The primary can serve both indexing and search traffic, but its primary function is indexing. With reader/writer separation, the primary could also become optional if no writes are expected on an index, e.g. for read-only/warm indices.
  2. write-replica - The first question that comes to mind is: why do we need it at all? The write-replica serves a different purpose depending on the underlying storage. With local storage, the write-replica provides durability against a primary failure. With remote storage, the write-replica provides faster fail-over recovery and is optional.
  3. search-replica - The search-replica should strictly take search traffic. This means it can never be promoted to “primary”. In order to provide true workload isolation and predictable performance, it is critical that indexing and search don’t take place on the same node; otherwise it would defeat the purpose. This also means a search-replica can afford to lag further behind the primary when replicating writes, compared to a write-replica.

Cluster State changes

Another important aspect is ShardAllocation. With the new “search” role and the different replica types, the shard allocation logic will also need changes to accommodate them.

From the leader’s perspective when assigning shards:

  1. It will differentiate write-replicas and search-replicas based on a new parameter.
  2. It will always assign the primary and write-replicas to nodes with the data role.
  3. Search-replicas would be assigned to nodes with the search role.
  4. In case of a primary copy failure, it will find a write-replica and promote it to primary.
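
The allocation and promotion rules above can be sketched as follows (hypothetical helper names, not actual allocator code):

```python
# Hypothetical sketch of the leader's allocation and promotion rules:
# primaries/write-replicas land on "data" nodes, search-replicas on
# "search" nodes, and only write-replicas are promotable.
def eligible_nodes(copy_kind, nodes):
    """nodes: dict mapping node name -> set of roles."""
    wanted = "search" if copy_kind == "search-replica" else "data"
    return sorted(name for name, roles in nodes.items() if wanted in roles)

def promote_on_primary_failure(copies):
    """Promote the first write-replica; search-replicas are never promoted."""
    for copy in copies:
        if copy["kind"] == "write-replica":
            copy["kind"] = "primary"
            return copy
    return None  # no writable copy left
```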

Today, the leader maintains the list of up-to-date copies of a shard in the cluster state as “in-sync allocations”. With reader/writer separation, search-replicas can be more flexible and lag further behind the primary than write-replicas. There should be a mechanism for the leader to track search-replicas and fail them in case they are not able to catch up within the configured thresholds.
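
Such a tracking mechanism could be as simple as comparing replication checkpoints against a configured lag threshold (a sketch with illustrative names):

```python
# Hypothetical sketch: flag search-replicas whose replication checkpoint
# lags the primary by more than a configured threshold, so the leader
# can fail and reallocate them.
def stale_search_replicas(primary_checkpoint, replica_checkpoints, max_lag):
    """replica_checkpoints: dict mapping replica id -> last checkpoint."""
    return sorted(rid for rid, cp in replica_checkpoints.items()
                  if primary_checkpoint - cp > max_lag)
```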

Index (Cluster) Status - Green/ Yellow/ Red

Today, in OpenSearch, if all copies of a shard are assigned, the index status is green. If any replica is unassigned it turns yellow, and if all copies (including the primary) are unassigned it turns red. This works well when a single copy can serve both functions, i.e. indexing and search. With reader/writer separation, red would be a panic signal only for writable copies; it wouldn’t indicate that all read copies (search-replicas) are unassigned. We would need a way to indicate the health of read and write copies separately.
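
One way to report the two dimensions separately (a sketch only; the exact status model is an open question):

```python
# Hypothetical sketch of separate read/write health for an index.
def health(assigned, total):
    if total == 0:
        return "green"  # nothing expected, e.g. a read-only index with 0 write copies
    if assigned == 0:
        return "red"
    return "green" if assigned == total else "yellow"

def index_health(write_assigned, write_total, search_assigned, search_total):
    return {"write": health(write_assigned, write_total),
            "read": health(search_assigned, search_total)}
```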

User configurations

  1. The default configuration would use the same primary/replica copy for both read and write traffic. No configuration is needed if the user is not looking for reader/writer separation.
  2. If users want reader/writer separation, they would configure search-replicas explicitly. This would be a dynamic setting that can be configured at any time on new/existing indices; search-replicas can come and go at any time. The current replica count would by default mean write-replicas only and could be configured to 0.
  3. During any search request, the user can specify in the preference which copy should serve the request, i.e. search-replica only, or fall back to the primary/write-replica. If the user has not specified a preference explicitly per request, the auto preference will always check search-replicas first.
  4. If the user has specified search-replicas for an index but has not configured any search nodes, searches will default to using the primary/write-replica as well.
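
To make the configuration model concrete, here is a sketch of the two modes (the `index.number_of_search_replicas` setting name is illustrative only, not a final API):

```python
# Hypothetical index settings sketch; only "index.number_of_replicas"
# exists today, "index.number_of_search_replicas" is illustrative.
default_settings = {
    # Without separation: replicas serve both reads and writes.
    "index.number_of_replicas": 1,
}
separated_settings = {
    # With separation: "replicas" means write-replicas and may be 0...
    "index.number_of_replicas": 0,
    # ...while search-replicas come from a new dynamic setting.
    "index.number_of_search_replicas": 2,
}
```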

Why do we need to configure search-replicas with new setting?
The current “replica” setting is overloaded to act as “write-replica” in the case of role separation. There is no way to configure a search-replica without adding a new setting. The current replica setting could have meant search-replica instead of write-replica, but this would complicate the default configuration: where the user has not configured any node with the “search” role, replicas would go unassigned.

How to turn off primary shard copy in case search-replicas are enabled?
With reader/writer separation, users can choose to turn off the primary and set write-replicas to 0 if they don’t expect any writes to an index. E.g. a log analytics customer, after rolling over an index, can set all write copies of the old index to 0 and only configure “search-replicas” as needed. Today, OpenSearch doesn’t offer any setting to turn off the primary. This could be an explicit setting, or derived implicitly when an index is marked read-only.

Tuning of buffers/ caching based on workload
There are various node-level buffers and caching thresholds that are configured assuming the same node will serve both indexing and search workloads. With the separation, these could now be tuned per workload for optimal performance.

Auto management with ISM
The Index State Management (ISM) plugin will simplify some of these aspects for users, so they don’t need to configure the different replica types explicitly; ISM policies can take care of it. E.g. during migration of an index from hot to warm, a policy can set the primary and write-replicas to 0 and set search-replicas to n, as configured in the policy.
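
A sketch of what such a policy state might look like (`replica_count` is an existing ISM action; the `search_replica_count` action is purely illustrative):

```python
# Hypothetical ISM "warm" state: drop write copies, keep search-replicas.
# "replica_count" is a real ISM action; "search_replica_count" is not.
warm_state = {
    "name": "warm",
    "actions": [
        {"replica_count": {"number_of_replicas": 0}},
        {"search_replica_count": {"number_of_search_replicas": 2}},  # illustrative
    ],
}
```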

A note on the underlying storage
Ideally, reader and writer separation is independent of the underlying storage (local or remote) and the index replication algorithm (document or segment replication). But current document replication with local storage can’t truly offer isolation between writer and reader, as both node types (data & search) would do the same amount of indexing work. There is no concrete plan yet for remote store with document replication. So, in the first phase, it will be segment-replication-enabled indices that benefit from this reader/writer separation.

Separating Coordinators

Today, the coordinator role handles the responsibility of coordinating both indexing and search requests. The above proposal doesn’t yet cover separating coordinators in OpenSearch. I will create a separate issue to discuss coordinator role separation in detail.

Thanks @dblock for calling out that this should be captured separately. For the discussion, see the linked comment.

Future Improvements
Reader and writer separation would lay the groundwork for many more improvements in the future, like auto-scaling of read and write copies, partial fetching via block-level fetch, or fetching a segment/shard on demand only when there is read or write traffic. In the future, we could also look into moving segment merges to dedicated nodes. Some of these improvements are also discussed in #6528.


How can you help?
Any feedback on the overall proposal is welcome. If you have specific requirements/ use cases around indexing and search separation which are not addressed by the above proposal, please let us know.
