Description
Is your feature request related to a problem? Please describe
The current _wlm/stats API in OpenSearch provides query group statistics across nodes in a single response, which scales poorly as cluster size increases. Similar to _cat APIs (e.g., _cat/indices, _cat/shards), this API suffers from large response sizes, high latency, and increased CPU/memory consumption. This makes it difficult for users to efficiently retrieve and process query group statistics, especially in large clusters.
Pagination is needed to:
- Limit response size, reducing memory usage and response latency.
- Prevent unnecessary aggregation of statistics for all nodes at once.
- Enable efficient navigation of query group statistics, similar to paginated APIs like _list/indices and _list/shards.
The issues and approaches discussed in the following OpenSearch GitHub issues are particularly relevant:
- OpenSearch Issue #14257: Discusses pagination for _cat APIs, highlighting the impact of large responses on cluster performance.
- OpenSearch Issue #15014: Tracks the introduction of _list APIs to replace _cat APIs, ensuring efficient pagination with next_token.
- OpenSearch Issue #14258: Discusses pagination strategies, emphasizing deterministic sorting keys for stable pagination behavior.
Describe the solution you'd like
To address the issues of large response sizes and high resource consumption in _wlm/stats, we propose introducing a new API endpoint (/_list/wlm_stats) with token-based pagination. This follows the approach used in OpenSearch Issue #14257 and OpenSearch Issue #15014, where _list APIs were introduced for paginating large _cat responses.
Key Features
- Token-Based Pagination (next_token): Users can fetch query group statistics in smaller chunks, reducing resource consumption.
- Sorting Support: Users can sort results by Node ID or Query Group, ensuring a stable and predictable pagination order.
- Tabular Output: The response is structured similarly to _cat APIs, making it easy to read and process.
- Scalability: Limits the amount of data retrieved per request, preventing excessive load on the cluster.
Sorting Options
Since CPU and memory usage fluctuate frequently, sorting by these values is not supported because it would cause inconsistent pagination results. Instead, sorting will be restricted to stable attributes:
- node_id (Default): Sorts results lexicographically by Node ID, then by Query Group. Ensures structured browsing.
- query_group: Groups results by Query Group, useful for analyzing workload behavior.
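The two sort modes can be sketched as composite sort keys. This is only an illustration, not the proposed implementation: the `StatsRow` shape and its field names are hypothetical, and using `node_id` as the tie-breaker for `query_group` sorting is an assumption (the proposal only specifies the tie-breaker for the `node_id` mode).

```python
from typing import NamedTuple


class StatsRow(NamedTuple):
    # Hypothetical row shape; field names are illustrative only.
    node_id: str
    query_group: str
    cpu_usage: float
    memory_usage: float


def sort_key(row: StatsRow, sort: str = "node_id") -> tuple:
    """Build a composite key from stable attributes only."""
    if sort == "node_id":
        # Node ID first, then Query Group, as described above.
        return (row.node_id, row.query_group)
    if sort == "query_group":
        # Assumption: ties within a query group are broken by node ID.
        return (row.query_group, row.node_id)
    # Fluctuating values such as CPU or memory usage are rejected.
    raise ValueError(f"unsupported sort field: {sort}")


def sort_rows(rows, sort: str = "node_id", order: str = "asc"):
    """Return rows in a deterministic order for the given sort mode."""
    return sorted(rows, key=lambda r: sort_key(r, sort), reverse=(order == "desc"))
```

Because every key is a unique (node, query group) pair, the order is total, which is what keeps page boundaries stable between requests.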
Example API Calls
Fetch First Page (Sorted by Query Group)
GET /_list/wlm_stats?size=50&sort=query_group&order=asc
Returns results grouped by Query Group, making it easier to analyze workload performance.
Fetch First Page (Sorted by Node ID)
GET /_list/wlm_stats?size=50&sort=node_id&order=asc
Sorts results by Node ID, providing a stable, structured overview.
Fetch Next Page
GET /_list/wlm_stats?size=50&sort=node_id&order=asc&next_token=Base64EncodedCursor
Uses next_token to fetch the next 50 results in a stable order.
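One way the next_token flow could work is keyset pagination, where the token encodes the last (node_id, query_group) key returned. The sketch below is a minimal in-memory illustration under that assumption; the token layout (Base64-encoded JSON) and the helper names `encode_token`, `decode_token`, and `list_page` are hypothetical and not taken from the existing _list APIs.

```python
import base64
import json


def encode_token(last_key):
    """Encode the last returned (node_id, query_group) key as an opaque token."""
    raw = json.dumps(list(last_key)).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the (node_id, query_group) key from a token."""
    return tuple(json.loads(base64.urlsafe_b64decode(token.encode("ascii"))))


def list_page(rows, size, next_token=None):
    """Return one page of (node_id, query_group) rows plus the next token.

    rows is an iterable of (node_id, query_group) tuples; the token is None
    once the last page has been served.
    """
    ordered = sorted(rows)  # deterministic order: node_id, then query_group
    if next_token is not None:
        last_key = decode_token(next_token)
        # Resume strictly after the last key seen on the previous page.
        ordered = [r for r in ordered if r > last_key]
    page = ordered[:size]
    has_more = len(ordered) > size
    token = encode_token(page[-1]) if has_more else None
    return page, token
```

A client would call `list_page(rows, 50)` and keep passing the returned token back until it is `None`. Resuming from a key rather than an offset means rows added or removed between requests do not shift the remaining pages.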
Related component
Search
Describe alternatives you've considered
An alternative solution is to enhance the existing _wlm/stats API with filtering options, ensuring that only the most relevant statistics are retrieved.
Key Features
- Targeted Data Retrieval: Users can filter results by Node ID, Query Group, CPU Usage, and Memory Usage to retrieve only relevant information.
- Sorting Support: Supports sorting by CPU Usage, Memory Usage, Node ID, and Query Group for better analysis.
- Tabular Output: Maintains structured, easy-to-read output similar to _cat APIs.
- Performance Optimization: Eliminates unnecessary data retrieval, improving query response times.
Example API Calls
Fetch Nodes with CPU Usage Above 50%
GET /_wlm/stats?cpu_threshold=50
Returns only nodes consuming more than 50% CPU.
Fetch Nodes with High Memory Usage
GET /_wlm/stats?memory_threshold=70
Retrieves only nodes using more than 70% memory.
Fetch Query Groups for a Specific Node
GET /_wlm/stats?node_id=jPPwGjW-TA2NZB6Gn7RZtg
Returns query group statistics for the given node.
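The filtering semantics of this alternative could look roughly like the sketch below. It assumes each row is a dict with hypothetical keys `node_id`, `cpu_usage`, and `memory_usage`, with usage expressed as a percentage, and that thresholds are exclusive ("more than"), as the example descriptions suggest.

```python
def filter_stats(rows, cpu_threshold=None, memory_threshold=None, node_id=None):
    """Keep only the rows that satisfy every filter that was supplied.

    Row keys (node_id, cpu_usage, memory_usage) are hypothetical, and usage
    values are assumed to be percentages in the range 0 to 100.
    """
    result = []
    for row in rows:
        if node_id is not None and row["node_id"] != node_id:
            continue
        # Thresholds are exclusive: a row must be strictly above the value.
        if cpu_threshold is not None and row["cpu_usage"] <= cpu_threshold:
            continue
        if memory_threshold is not None and row["memory_usage"] <= memory_threshold:
            continue
        result.append(row)
    return result
```

Note that filtering reduces the response only when few rows match; a broad filter on a large cluster would still produce an unbounded response, which is the case pagination is meant to handle.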
Additional context
No response