Background
Currently, in OpenSearch, we have various rejection mechanisms for in-flight and new requests when the cluster is overloaded.
We have backpressure mechanisms that react to node duress conditions, circuit breakers in data nodes that reject requests based on real-time memory usage, and queue-size-based rejections. But each of these solutions has gaps.
We don’t have a state-based admission control framework capable of rejecting incoming requests before we start the execution chain.
Existing throttling solutions
1. Search backpressure
Search backpressure in OpenSearch currently cancels resource-intensive tasks when a node is in duress. The coordinator node cancels the search tasks and the data nodes cancel the shard tasks.
Challenges
- 1.1 There is no server-side rejection mechanism for incoming requests when a target node is in duress. Refer to Server side rejection of in-coming search requests based on resource consumption #1180
- 1.2 The existing routing methods don’t intelligently route away from stressed nodes based on their resource utilization. Refer to the similar issue Adaptive replica Selection - Factor in node resource utilisation #1183
2. Indexing backpressure
OpenSearch has node-level and shard-level indexing backpressure that dynamically rejects indexing requests when the cluster is under strain.
Challenges
- 2.1 Indexing backpressure mechanisms only account for memory usage and throughput degradation of the shards / nodes to reject requests. They don’t consider other resource utilization parameters such as CPU, I/O, etc. Refer to [RFC] Shard Indexing backpressure mechanism should also protect from any CPU contention on nodes #7638
- 2.2 Since the coordinator takes a local decision for its downstream nodes, it does not reflect the collective view that all nodes have of the same downstream node.
3. Circuit breaker
We have circuit breakers in data nodes that reject requests based on real-time memory usage. This is the last line of defence, and it prevents nodes from going down further.
Challenges
- 3.1 Circuit breaker limits cannot be configured per request type.
- 3.2 Circuit breakers currently work only on memory usage and don’t support other resource parameters such as CPU, I/O, etc.
4. Queue size based rejection
In OpenSearch, we currently have different queues for different operations such as search and indexing, and we reject new requests if the respective queue is full.
Challenges
- 4.1 Queue size is not an accurate representation of the load. For example, bulk requests can vary widely in request size and the number of documents to be indexed. Similarly, search requests can vary widely in resource consumption based on the type of the query, e.g. aggregation queries, large term queries, etc.
- 4.2 Queue size limits are intentionally not honoured during a few framework-level actions, such as ReplicationAction, to maintain consistency and reduce wasted work. For example, TransportReplicationAction allows replica nodes to enqueue replication requests even if the write queue is full or beyond its defined limit.
Proposal
We propose to implement an admission control framework for OpenSearch that rejects incoming requests based on the resource utilization stats of the nodes. This will allow real-time, state-based admission control on the nodes.
We will build a new admission control core plugin which can intercept and reject requests in the REST layer and the transport layer. We will extend the ‘ResponseCollectorService’ to maintain the performance utilization of downstream nodes on the coordinator node.
Goals and benefits
- Build an extensible admission control framework, as an OpenSearch core plugin / module, that can intercept requests at both the REST layer and transport layer entry points and reject them when the cluster is overloaded.
- This will help extend search backpressure by rejecting incoming requests when a node is already in duress (solves 1.1).
- We will be able to rate-limit any request on data nodes / coordinator nodes. One use case is [RFC] Admission Control mechanism for Cluster Manager APIs #7520
- Build a resource utilization / health status view of downstream nodes on coordinator nodes.
- This allows us to fail requests fast on the coordinator when the associated target nodes are under stress.
- This can help in cases where rejection has to be based on the target nodes’ resource utilization, such as I/O.
- This can help in optimizing routing decisions in the search flow.
- Intelligently route search requests away from stressed nodes where possible by enhancing existing routing methods such as ARS. We will also adjust the stats of stressed nodes so that new requests are periodically retried on them.
- This will improve the performance of search requests as we route away from stressed nodes.
- The framework should be configurable per request type and resource type. It should be extensible to any new resource parameter on which we may reject requests in the future.
- Rejections will be fairer, as limits and rejections are based on each request type (solves 3.1).
- The framework should provide a half-open state to retry requests on the stressed nodes in a time-based / count-based manner.
- This will help with coordinator-node rejections: requests can be retried once stressed nodes have recovered.
- The framework will initially provide admission control based on CPU, JVM, I/O and request size.
- This can help extend the existing rejection solutions by considering CPU, I/O, etc. (solves 2.1, 3.2).
- I/O will be a completely new backpressure / admission control resource parameter, which we will build from the ground up.
High-level design
Admission control plugin
We can add a new admission control core OpenSearch module / plugin that implements ‘NetworkPlugin’ and intercepts requests at the REST layer and transport layer to perform rejections (a sketch follows this list).
- The plugin can override ‘getRestHandlerWrapper’ to wrap incoming requests and perform admission control on them. We’ll initially perform AC only for search and indexing REST calls, and can add more based on need.
- The plugin can override ‘getTransportInterceptors’ to add a new AC transport interceptor which intercepts requests based on the transport action name and performs admission control.
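To make the interception points concrete, here is a minimal sketch. Names such as ‘AdmissionControlPlugin’ and ‘AdmissionController’ are hypothetical, not the final implementation; in the existing plugin API, ‘getRestHandlerWrapper’ is declared on ‘ActionPlugin’ and ‘getTransportInterceptors’ on ‘NetworkPlugin’, so the sketch implements both.

```java
import java.util.List;
import java.util.function.UnaryOperator;

import org.opensearch.common.io.stream.NamedWriteableRegistry;
import org.opensearch.common.util.concurrent.ThreadContext;
import org.opensearch.plugins.ActionPlugin;
import org.opensearch.plugins.NetworkPlugin;
import org.opensearch.plugins.Plugin;
import org.opensearch.rest.RestHandler;
import org.opensearch.transport.TransportInterceptor;
import org.opensearch.transport.TransportRequest;
import org.opensearch.transport.TransportRequestHandler;

// Hypothetical plugin sketch; the AC-specific names are illustrative.
public class AdmissionControlPlugin extends Plugin implements NetworkPlugin, ActionPlugin {

    // Hypothetical decision point; throws a rejection exception on threshold breach.
    interface AdmissionController {
        void admitOrReject(String context);
    }

    private final AdmissionController controller;

    public AdmissionControlPlugin(AdmissionController controller) {
        this.controller = controller;
    }

    @Override
    public UnaryOperator<RestHandler> getRestHandlerWrapper(ThreadContext threadContext) {
        // Wrap each REST handler so admission control runs before the handler executes.
        return handler -> (request, channel, client) -> {
            controller.admitOrReject(request.path());
            handler.handleRequest(request, channel, client);
        };
    }

    @Override
    public List<TransportInterceptor> getTransportInterceptors(
            NamedWriteableRegistry namedWriteableRegistry, ThreadContext threadContext) {
        // Intercept transport handlers by action name (e.g. indices:data/write/bulk).
        return List.of(new TransportInterceptor() {
            @Override
            public <T extends TransportRequest> TransportRequestHandler<T> interceptHandler(
                    String action, String executor, boolean forceExecution,
                    TransportRequestHandler<T> actualHandler) {
                return (request, channel, task) -> {
                    controller.admitOrReject(action);
                    actualHandler.messageReceived(request, channel, task);
                };
            }
        });
    }
}
```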
Admission control service
- This service will aid in rejecting incoming requests based on the resource utilization view of the nodes.
- We will utilize the response collector service to get the performance stats of the target nodes.
- We will maintain a state machine that helps reject requests when performance thresholds are breached, with an option to retry requests periodically to check whether the target node is still in duress (a sketch follows this list).
- CLOSED → This state allows all incoming requests (default).
- OPEN → This state rejects all incoming requests.
- HALF_OPEN → This state allows X percent (configurable) of the requests and rejects the rest.
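A minimal sketch of this state machine, with an illustrative probabilistic gate for the HALF_OPEN state; the class name, transition triggers, and percentage mechanism are assumptions, not the final design.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of the proposed admission control state machine.
public class AdmissionControlStateMachine {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final int halfOpenAllowPercent; // the configurable X percent

    public AdmissionControlStateMachine(int halfOpenAllowPercent) {
        this.halfOpenAllowPercent = halfOpenAllowPercent;
    }

    // Called by the resource monitors on threshold breach / recovery.
    public void transitionTo(State newState) {
        state.set(newState);
    }

    public boolean shouldAdmit() {
        switch (state.get()) {
            case CLOSED:
                return true;  // allow all incoming requests (default)
            case OPEN:
                return false; // reject all incoming requests
            case HALF_OPEN:
                // probabilistically admit X percent of requests to probe recovery
                return ThreadLocalRandom.current().nextInt(100) < halfOpenAllowPercent;
            default:
                return true;
        }
    }
}
```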
Building resource utilization view
Response collector service
We’ll extend the existing ‘ResponseCollectorService’ to collect performance statistics such as CPU, JVM and I/O of downstream nodes on the coordinator node.
We will also record node unresponsiveness / timeouts when a request fails; these will be treated with more severity.
The coordinator can use this service at any time to get the resource utilization of the downstream nodes (see the sketch below).
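For illustration, the extended view could boil down to a concurrent per-node stats record like the following. Only ‘ResponseCollectorService’ itself exists today; the class and method names below are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical extension of ResponseCollectorService: keeps the latest
// resource stats reported by each downstream node, and records
// unresponsiveness with more severity.
public class NodeResourceUsageCollector {

    public static final class NodeResourceStats {
        public final double cpuPercent;
        public final double jvmHeapPercent;
        public final double ioUtilPercent;
        public final long timestampMillis;

        NodeResourceStats(double cpuPercent, double jvmHeapPercent,
                          double ioUtilPercent, long timestampMillis) {
            this.cpuPercent = cpuPercent;
            this.jvmHeapPercent = jvmHeapPercent;
            this.ioUtilPercent = ioUtilPercent;
            this.timestampMillis = timestampMillis;
        }
    }

    private final Map<String, NodeResourceStats> statsByNode = new ConcurrentHashMap<>();

    public void addNodeStats(String nodeId, double cpu, double jvm, double io) {
        statsByNode.put(nodeId, new NodeResourceStats(cpu, jvm, io, System.currentTimeMillis()));
    }

    // Timeouts / unresponsiveness are recorded as fully saturated stats so the
    // admission controller treats the node as severely stressed.
    public void markUnresponsive(String nodeId) {
        statsByNode.put(nodeId, new NodeResourceStats(100.0, 100.0, 100.0, System.currentTimeMillis()));
    }

    public NodeResourceStats getStats(String nodeId) {
        return statsByNode.get(nodeId);
    }
}
```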
Local node resource monitoring
We will reuse the node stats monitors such as process, jvm and fs, which already monitor node resources at a 1-second interval (sketch below).
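For reference, the same probes those monitors read can be sampled directly; ‘ProcessProbe’ and ‘JvmStats’ are existing OpenSearch classes, while the wrapper below is purely illustrative.

```java
import org.opensearch.monitor.jvm.JvmStats;
import org.opensearch.monitor.process.ProcessProbe;

// Illustrative sampler; the actual design reuses the scheduled monitors
// (process, jvm, fs) rather than polling probes ad hoc.
public class LocalResourceSampler {

    // Same source as the "process" node-stats monitor.
    public double processCpuPercent() {
        return ProcessProbe.getInstance().getProcessCpuPercent();
    }

    // Heap usage as a percentage, as reported by the "jvm" monitor.
    public double jvmHeapUsedPercent() {
        return JvmStats.jvmStats().getMem().getHeapUsedPercent();
    }
}
```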
Track the resource utilization of the downstream nodes
We will enhance the search and indexing flows to get the downstream node performance stats.
Approach 1 - Use the thread context to get the required stats from downstream nodes
- For every indexing / search request, we will add the performance stats to the thread context response headers on all the nodes (primary & replica) where the request is processed.
- Once the request is completed, we will read the perf stats from the thread context and update them in the response collector service on the coordinator node. After that, we will filter these perf stats out of the response headers before we return the response to the client (a sketch follows the risks below).
Pros
This approach has no regression / backward-compatibility risks as we don’t alter any schema.
Risks
We need to check whether there are any security implications in carrying perf stats as part of the thread context.
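A minimal sketch of Approach 1, assuming a hypothetical header key and encoding; ‘ThreadContext#addResponseHeader’ and ‘ThreadContext#getResponseHeaders’ are existing APIs, while the collector interface stands in for the extended response collector service described above.

```java
import java.util.List;
import java.util.Map;

import org.opensearch.common.util.concurrent.ThreadContext;

public class PerfStatsPropagation {

    static final String PERF_STATS_HEADER = "ac_perf_stats"; // hypothetical key

    // On the downstream node (primary / replica), before returning the response:
    public static void attachLocalStats(ThreadContext threadContext, String nodeId,
                                        double cpuPercent, double jvmPercent, double ioUtil) {
        threadContext.addResponseHeader(PERF_STATS_HEADER,
            nodeId + ":" + cpuPercent + ":" + jvmPercent + ":" + ioUtil);
    }

    // On the coordinator, once the request completes: read and record the stats.
    // The header must then be filtered out before the response reaches the client.
    public static void collectStats(ThreadContext threadContext, ResponseCollector collector) {
        Map<String, List<String>> headers = threadContext.getResponseHeaders();
        for (String entry : headers.getOrDefault(PERF_STATS_HEADER, List.of())) {
            String[] parts = entry.split(":");
            collector.addNodePerfStats(parts[0], Double.parseDouble(parts[1]),
                Double.parseDouble(parts[2]), Double.parseDouble(parts[3]));
        }
    }

    // Hypothetical hook into the extended response collector service.
    interface ResponseCollector {
        void addNodePerfStats(String nodeId, double cpu, double jvm, double io);
    }
}
```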
Approach 2 - Schema change
Search flow
- Enhance the ‘QuerySearchResult’ schema to carry the target nodes’ resource utilization stats.
- We already get queue size, service time, etc. as part of ‘QuerySearchResult’, and hence it’s a good fit for additional performance stats of the node.
Indexing flow
- Enhance ‘BulkShardResponse’ to return the target nodes’ resource utilization stats.
Risks
- Currently the ‘BulkShardResponse’ schema doesn’t carry any perf parameters of the target nodes. We have to make serialization / deserialization changes that hide the perf stats from the user (a sketch follows below).
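As an illustration of the serialization change, a hypothetical perf-stats payload could look like the sketch below. On the wire it would be written only when ‘out.getVersion()’ is on or after the version that introduces it, and it would never be rendered into the REST (XContent) response, which keeps it hidden from the user. Import paths vary across OpenSearch versions.

```java
import java.io.IOException;

import org.opensearch.common.io.stream.StreamInput;
import org.opensearch.common.io.stream.StreamOutput;
import org.opensearch.common.io.stream.Writeable;

// Hypothetical payload that BulkShardResponse / QuerySearchResult could carry.
public class NodePerfStats implements Writeable {

    final double cpuPercent;
    final double jvmHeapPercent;
    final double ioUtilPercent;

    public NodePerfStats(double cpuPercent, double jvmHeapPercent, double ioUtilPercent) {
        this.cpuPercent = cpuPercent;
        this.jvmHeapPercent = jvmHeapPercent;
        this.ioUtilPercent = ioUtilPercent;
    }

    // Deserialization constructor used by the transport layer.
    public NodePerfStats(StreamInput in) throws IOException {
        this.cpuPercent = in.readDouble();
        this.jvmHeapPercent = in.readDouble();
        this.ioUtilPercent = in.readDouble();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeDouble(cpuPercent);
        out.writeDouble(jvmHeapPercent);
        out.writeDouble(ioUtilPercent);
    }
}
```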
Other approaches considered
We can enhance follower check / leader check APIs to propagate performance stats of the nodes to all other nodes.
Cons
This builds a dependency on the cluster manager and might impact the cluster manager node’s performance.
These health checks are critical, so any regression would be quite problematic.
Search flow enhancements - #8913
Indexing flow enhancements - #8911
Co-authored by @ajaymovva