
[RFC] Admission Controller framework for OpenSearch #8910

@bharath-techie

Description


Background

Currently, in OpenSearch, we have various rejection mechanisms for in-flight and new requests when the cluster is overloaded.
We have backpressure mechanisms that react to node duress conditions, circuit breakers in data nodes that reject requests based on real-time memory usage, and queue-size-based rejections. But there are gaps in each of these solutions.
We don't have a state-based admission control framework capable of rejecting incoming requests before we start the execution chain.

Existing throttling solutions

1. Search backpressure

Search backpressure in OpenSearch currently cancels resource-intensive tasks when a node is in duress. The coordinator node cancels search tasks and data nodes cancel shard tasks.

Challenges

2. Indexing backpressure

OpenSearch has node-level and shard-level indexing backpressure that dynamically rejects indexing requests when the cluster is under strain.

Challenges

3. Circuit breaker

We have circuit breakers in data nodes that reject requests based on real-time memory usage. This is the last line of defence and prevents nodes from going down further.

Challenges

  • 3.1 Circuit breaker limits cannot be configured per request type
  • 3.2 Circuit breakers currently work only on memory usage and don't support other resource parameters such as CPU, I/O, etc.

4. Queue size based rejection

In OpenSearch, we currently have different queues for different operations such as search, indexing, etc., and we reject new requests if the respective queue is full.

Challenges

  • 4.1 Queue size is not an accurate representation of load. For example, bulk requests can vary widely based on request size and the number of documents to be indexed. Similarly, search requests can vary widely in resource consumption based on the type of query, e.g. aggregation queries, large terms queries, etc.
  • 4.2 Queue size limits are intentionally not honoured during a few framework-level actions, such as ReplicationAction, to maintain consistency and reduce wasted work. For example, the TransportReplicationAction allows replica nodes to enqueue replication requests even if the write queue is full or beyond its defined limit.

Proposal

We propose to implement an admission control framework for OpenSearch that rejects incoming requests based on the resource utilization stats of the nodes. This will allow real-time, state-based admission control on the nodes.
We will build a new admission control core plugin that can intercept and reject requests in the REST layer and the transport layer. We will extend the 'ResponseCollectorService' to maintain a performance utilization view of downstream nodes on the coordinator node.

Goals and benefits

  1. Build an extensible admission control framework as an OpenSearch core plugin / module that can intercept requests at both the REST layer and transport layer entry points and reject them when the cluster is overloaded.
  2. Build a resource utilization / health status view of downstream nodes on the coordinator nodes.
    • This allows us to fail requests fast on the coordinator when the associated target nodes are under stress.
    • This can help for cases where rejection has to be based on target node resource utilization, such as I/O.
    • This can help optimize routing decisions in the search flow.
  3. Intelligently route search requests away from stressed nodes where possible, by enhancing existing routing methods such as ARS. We will also adjust the stats of stressed nodes so that new requests are periodically retried on them.
    • This will improve search request performance as we route away from stressed nodes.
  4. The framework should be configurable per request type and resource type, and extensible to new resource parameters on which we may reject requests in the future.
    • Rejections will be fairer, as limits and rejections are based on each request type (solves 3.1).
  5. The framework should provide a half-open state to retry requests on stressed nodes in a time-based / count-based manner.
    • This will help with coordinator-node rejections, by retrying once stressed nodes have recovered.
  6. The framework will initially provide admission control based on CPU, JVM, IO and request size.
    • This can help extend existing rejection solutions by considering CPU, I/O, etc. (solves 2.1, 3.2).
    • I/O will be a completely new backpressure / admission control resource parameter which we will build from the ground up.

High level design

Admission control plugin

We can add a new admission control core OpenSearch module / plugin that extends 'NetworkPlugin' and intercepts requests at the REST layer and transport layer to perform rejections.

  1. The plugin can override 'getRestHandlerWrapper' to wrap incoming requests and perform admission control on them. We'll initially perform AC only for search and indexing REST calls and can add more based on need.
  2. The plugin can override 'getTransportInterceptors' to add a new AC transport interceptor which intercepts requests based on the transport action name and performs admission control (see the sketch after this list).
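A minimal sketch of how such a plugin could wire in both hooks is below. 'getRestHandlerWrapper' and 'getTransportInterceptors' are the existing plugin extension points; AdmissionControlService, AdmissionControlRestHandler and AdmissionControlTransportInterceptor are hypothetical names used only for illustration, and package locations may differ between OpenSearch versions.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Package locations shown are indicative and may differ between OpenSearch versions.
import org.opensearch.common.io.stream.NamedWriteableRegistry;
import org.opensearch.common.util.concurrent.ThreadContext;
import org.opensearch.plugins.ActionPlugin;
import org.opensearch.plugins.NetworkPlugin;
import org.opensearch.plugins.Plugin;
import org.opensearch.rest.RestHandler;
import org.opensearch.transport.TransportInterceptor;

public class AdmissionControlPlugin extends Plugin implements NetworkPlugin, ActionPlugin {

    // Hypothetical service that makes the admit / reject decision (see "Admission control service").
    private final AdmissionControlService admissionControlService = new AdmissionControlService();

    // Wrap every REST handler so search / indexing REST calls go through admission control first.
    @Override
    public UnaryOperator<RestHandler> getRestHandlerWrapper(ThreadContext threadContext) {
        return handler -> new AdmissionControlRestHandler(handler, admissionControlService);
    }

    // Intercept transport requests by action name and apply admission control before execution.
    @Override
    public List<TransportInterceptor> getTransportInterceptors(
            NamedWriteableRegistry namedWriteableRegistry, ThreadContext threadContext) {
        return Collections.singletonList(new AdmissionControlTransportInterceptor(admissionControlService));
    }
}
```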

[Figure: AdmissionControllerEntireFlow]

Admission control service

  1. This service will aid in rejecting incoming requests based on the resource utilization view of the nodes.
  2. We will utilize the response collector service to get the performance stats of the target nodes.
  3. We will maintain a state machine that helps reject requests when performance thresholds are breached, with an option to retry requests periodically to check whether the target node is still in duress (see the sketch after this list).
    1. CLOSED → This state allows all incoming requests (default).
    2. OPEN → This state rejects all incoming requests.
    3. HALF_OPEN → This state allows X percent (configurable) of requests and rejects the rest.
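A rough sketch of this state machine is below; the class, the transition methods and the probabilistic half-open admission are illustrative assumptions, not the final design.

```java
import java.util.Random;

// Illustrative sketch of the admission controller state machine.
public class AdmissionControllerStateMachine {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private volatile State state = State.CLOSED;    // CLOSED is the default: admit everything
    private final double halfOpenAdmitPercent;      // X percent of requests admitted in HALF_OPEN
    private final Random random = new Random();

    public AdmissionControllerStateMachine(double halfOpenAdmitPercent) {
        this.halfOpenAdmitPercent = halfOpenAdmitPercent;
    }

    // Driven by the resource utilization view of the target node.
    public void onThresholdBreached()   { state = State.OPEN; }       // start rejecting
    public void onRetryWindowElapsed()  { state = State.HALF_OPEN; }  // probe with a fraction of traffic
    public void onTargetNodeRecovered() { state = State.CLOSED; }     // admit everything again

    // Decision made for every incoming request before the execution chain starts.
    public boolean shouldReject() {
        switch (state) {
            case CLOSED:
                return false;                  // allow all requests
            case OPEN:
                return true;                   // reject all requests
            case HALF_OPEN:
            default:
                return random.nextDouble() * 100 >= halfOpenAdmitPercent;   // admit roughly X percent
        }
    }
}
```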

[Figure: StateManagementAdmissionController]

Building resource utilization view

Response collector service

We’ll extend the existing ‘ResponseCollectorService’ to collect performance statistics such as CPU, JVM and IO of downstream nodes on the coordinator node.
We will also record node unresponsiveness / timeouts when a request fails; these will be treated with more severity.
The coordinator can use this service at any time to get the resource utilization of the downstream nodes.
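A sketch of the kind of per-node view the extended service could maintain is below; 'NodeResourceUsageCollector', its fields and its treatment of unresponsive nodes are hypothetical and only illustrate the shape of the data the coordinator would keep alongside the existing ResponseCollectorService statistics.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical companion to ResponseCollectorService; names and fields are illustrative.
public class NodeResourceUsageCollector {

    // Latest resource utilization reported by each downstream node, keyed by node id.
    private final ConcurrentMap<String, NodeResourceUsage> usageByNode = new ConcurrentHashMap<>();

    public static final class NodeResourceUsage {
        public final double cpuPercent;
        public final double jvmHeapPercent;
        public final double ioUtilizationPercent;
        public final long timestampMillis;   // when the stats were sampled on the target node

        public NodeResourceUsage(double cpuPercent, double jvmHeapPercent,
                                 double ioUtilizationPercent, long timestampMillis) {
            this.cpuPercent = cpuPercent;
            this.jvmHeapPercent = jvmHeapPercent;
            this.ioUtilizationPercent = ioUtilizationPercent;
            this.timestampMillis = timestampMillis;
        }
    }

    // Called on the coordinator when a downstream response carries perf stats.
    public void collectNodeResourceUsage(String nodeId, NodeResourceUsage usage) {
        usageByNode.put(nodeId, usage);
    }

    // Called when a request to a node times out; unresponsiveness is treated with more
    // severity than a threshold breach, e.g. by marking the node as fully stressed.
    public void onNodeUnresponsive(String nodeId) {
        usageByNode.put(nodeId, new NodeResourceUsage(100.0, 100.0, 100.0, System.currentTimeMillis()));
    }

    // Used by the admission control service and by routing (e.g. ARS adjustments).
    public Optional<NodeResourceUsage> getUsage(String nodeId) {
        return Optional.ofNullable(usageByNode.get(nodeId));
    }
}
```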

Local node resource monitoring

We will reuse the node stats monitors such as process, jvm and fs, which already monitor node resources at a 1-second interval.

Track the resource utilization of the downstream nodes

We will enhance the search and indexing flows to get the downstream node performance stats.

Approach 1 - Use the thread context to get the required stats from downstream nodes

  1. For every indexing / search request we will add the performance stats to the thread context response headers on all the nodes (primary & replica) where the request is processed.
  2. Once the request is completed, we will get the perf stats from the thread context and update them in the response collector service on the coordinator node. After that, we will filter these perf stats out of the response headers before returning the response to the client (see the sketch after this list).
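A sketch of how this could look with the existing 'ThreadContext' response-header APIs ('addResponseHeader' / 'getResponseHeaders') is below; the header key and the "nodeId:cpu:jvm" encoding are hypothetical, and the final filtering of the header from the client response is left to the REST layer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.opensearch.common.util.concurrent.ThreadContext;

// Sketch of Approach 1; header name and stats encoding are hypothetical.
public final class PerfStatsThreadContextPropagator {

    static final String PERF_STATS_HEADER = "ac_perf_stats";   // hypothetical header key

    // On the primary / replica node: attach local utilization to the response headers.
    public static void attachLocalStats(ThreadContext threadContext, String nodeId,
                                        double cpuPercent, double jvmHeapPercent) {
        threadContext.addResponseHeader(PERF_STATS_HEADER, nodeId + ":" + cpuPercent + ":" + jvmHeapPercent);
    }

    // On the coordinator, once the request completes: read the stats back out of the
    // thread context so they can be fed into the response collector service. The header
    // itself must still be filtered out before the response is returned to the client.
    public static Map<String, double[]> consumeStats(ThreadContext threadContext) {
        Map<String, double[]> statsByNode = new HashMap<>();
        List<String> entries = threadContext.getResponseHeaders()
            .getOrDefault(PERF_STATS_HEADER, Collections.emptyList());
        for (String entry : entries) {
            String[] parts = entry.split(":");
            statsByNode.put(parts[0], new double[] {
                Double.parseDouble(parts[1]),   // CPU percent
                Double.parseDouble(parts[2])    // JVM heap percent
            });
        }
        return statsByNode;
    }
}
```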

Pros

This approach has no regression / backward compatibility risks as we don't alter any schema.

Risks

We need to check whether there are any security implications in carrying perf stats as part of the thread context.

Approach 2 - Schema change

Search flow

  1. Enhance the ‘QuerySearchResult’ schema to return the target node's resource utilization stats.
  2. We already get queue size, service time, etc. as part of ‘QuerySearchResult’, so it is a good fit for additional node performance stats.

Indexing flow

  1. Enhance ‘BulkShardResponse’ to return the target node's resource utilization stats.

Risks

  1. Currently the ‘BulkShardResponse’ schema doesn't carry any perf parameters of the target nodes. We have to change serialization / deserialization to hide the perf stats from the user (a sketch of such a change follows).
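As an illustration of the serialization work involved, a hypothetical 'NodePerfStats' writeable is sketched below. The class, its fields and its placement inside 'QuerySearchResult' / 'BulkShardResponse' are assumptions; a real change would be gated on the transport version for backward compatibility, and package locations may differ between OpenSearch versions.

```java
import java.io.IOException;

// Package locations shown are indicative and may differ between OpenSearch versions.
import org.opensearch.common.io.stream.StreamInput;
import org.opensearch.common.io.stream.StreamOutput;
import org.opensearch.common.io.stream.Writeable;

// Hypothetical perf-stats payload that QuerySearchResult / BulkShardResponse could carry.
public class NodePerfStats implements Writeable {

    private final double cpuPercent;
    private final double jvmHeapPercent;
    private final double ioUtilizationPercent;

    public NodePerfStats(double cpuPercent, double jvmHeapPercent, double ioUtilizationPercent) {
        this.cpuPercent = cpuPercent;
        this.jvmHeapPercent = jvmHeapPercent;
        this.ioUtilizationPercent = ioUtilizationPercent;
    }

    // Deserialization constructor used on the coordinator.
    public NodePerfStats(StreamInput in) throws IOException {
        this.cpuPercent = in.readDouble();
        this.jvmHeapPercent = in.readDouble();
        this.ioUtilizationPercent = in.readDouble();
    }

    // Serialization on the data node; the enclosing response would write this only when
    // the destination node is on a version that understands the new field.
    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeDouble(cpuPercent);
        out.writeDouble(jvmHeapPercent);
        out.writeDouble(ioUtilizationPercent);
    }

    // Intentionally no toXContent: the stats are stripped before the client response.
}
```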

Other approaches considered

We can enhance follower check / leader check APIs to propagate performance stats of the nodes to all other nodes.

Cons

This builds a dependency on the cluster manager and might have an impact on the cluster manager node's performance.
These health checks are very critical, so any regression would be quite problematic.

Search flow enhancements - #8913
Indexing flow enhancements - #8911

Co-authored by @ajaymovva
