
[Feature Request] Paginating _wlm/stats API #17592

Open
@Lindsay-00

Description


Is your feature request related to a problem? Please describe

The current _wlm/stats API in OpenSearch returns query group statistics for all nodes in a single response, which scales poorly as cluster size increases. Like the _cat APIs (e.g., _cat/indices, _cat/shards), it suffers from large response sizes, high latency, and increased CPU and memory consumption. This makes it difficult for users to retrieve and process query group statistics efficiently, especially in large clusters.

Pagination is needed to:

  1. Limit response size, reducing memory usage and response latency.
  2. Prevent unnecessary aggregation of statistics for all nodes at once.
  3. Enable efficient navigation of query group statistics, similar to paginated APIs like _list/indices and _list/shards.

The issues and approaches discussed in the following OpenSearch GitHub issues are particularly relevant:

  1. OpenSearch Issue #14257: Discusses pagination for _cat APIs, highlighting the impact of large responses on cluster performance.
  2. OpenSearch Issue #15014: Tracks the introduction of _list APIs to replace _cat APIs, ensuring efficient pagination with next_token.
  3. OpenSearch Issue #14258: Discusses pagination strategies, emphasizing deterministic sorting keys for stable pagination behavior.

Describe the solution you'd like

To address the issues of large response sizes and high resource consumption in _wlm/stats, we propose introducing a new API endpoint (/_list/wlm_stats) with token-based pagination. This follows the approach used in OpenSearch Issue #14257 and OpenSearch Issue #15014, where _list APIs were introduced for paginating large _cat responses.

Key Features

  1. Token-Based Pagination (next_token): Users can fetch query group statistics in smaller chunks, reducing resource consumption.
  2. Sorting Support: Users can sort results by Node ID or Query Group, ensuring a stable and predictable pagination order.
  3. Tabular Output: The response is structured similarly to _cat APIs, making it easy to read and process.
  4. Scalability: Limits the amount of data retrieved per request, preventing excessive load on the cluster.

Sorting Options

Sorting by CPU or memory usage is not supported. These values fluctuate between requests, so the row order could change from one page to the next and pagination could skip or repeat rows. Instead, sorting will be restricted to stable attributes:

  1. node_id (Default): Sorts results lexicographically by Node ID, then by Query Group. Ensures structured browsing.
  2. query_group: Groups results by Query Group, useful for analyzing workload behavior.
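The two sort modes above can be sketched as composite sort keys. This is only an illustrative sketch, not the proposed implementation: the `StatRow` type and the helper names are hypothetical, and using node ID as the tie-breaker for the `query_group` mode is an assumption the proposal does not state.

```python
from typing import List, NamedTuple, Tuple


class StatRow(NamedTuple):
    """Hypothetical row shape: one query group's stats on one node."""
    node_id: str
    query_group: str
    cpu_usage: float     # fluctuates, so it is deliberately not a sort key
    memory_usage: float  # fluctuates, so it is deliberately not a sort key


def sort_key(row: StatRow, sort: str = "node_id") -> Tuple[str, str]:
    """Build a composite key from stable attributes only."""
    if sort == "node_id":
        # Default: node ID first, then query group.
        return (row.node_id, row.query_group)
    if sort == "query_group":
        # Assumption: node ID breaks ties so the order stays total.
        return (row.query_group, row.node_id)
    raise ValueError(f"unsupported sort field: {sort}")


def sort_rows(rows: List[StatRow], sort: str = "node_id",
              order: str = "asc") -> List[StatRow]:
    """Return rows in a deterministic order for the given sort mode."""
    return sorted(rows, key=lambda r: sort_key(r, sort),
                  reverse=(order == "desc"))
```

Because every (node, query group) pair yields a distinct key, the order is total, which is what keeps page boundaries stable across requests.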

Example API Calls

Fetch First Page (Sorted by Query Group)
GET /_list/wlm_stats?size=50&sort=query_group&order=asc

Returns results grouped by Query Group, making it easier to analyze workload performance.

Fetch First Page (Sorted by Node ID)
GET /_list/wlm_stats?size=50&sort=node_id&order=asc

Sorts results by Node ID, providing a stable, structured overview.

Fetch Next Page
GET /_list/wlm_stats?size=50&sort=node_id&order=asc&next_token=Base64EncodedCursor

Uses next_token to fetch the next 50 results in a stable order.
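One way the next_token flow above could work is sketched below. The issue does not define the token format, so everything here is an assumption for illustration: the token is taken to be a Base64-encoded JSON array holding the last (node_id, query_group) key returned, only ascending order is handled, and all function names are hypothetical.

```python
import base64
import json
from typing import Iterator, List, Optional, Tuple

Key = Tuple[str, str]  # (node_id, query_group), a stable sort key


def encode_token(last_key: Key) -> str:
    """Encode the last key of a page as an opaque Base64 cursor."""
    raw = json.dumps(list(last_key)).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> Key:
    """Recover the last-seen key from a cursor."""
    node_id, query_group = json.loads(base64.urlsafe_b64decode(token))
    return (node_id, query_group)


def get_page(rows: List[Key], size: int,
             next_token: Optional[str] = None
             ) -> Tuple[List[Key], Optional[str]]:
    """Return one page (ascending) and the token for the next, if any."""
    ordered = sorted(rows)
    if next_token is not None:
        last_key = decode_token(next_token)
        # Resume strictly after the last key the client has seen.
        ordered = [r for r in ordered if r > last_key]
    page = ordered[:size]
    has_more = len(ordered) > size
    return page, (encode_token(page[-1]) if has_more else None)


def iterate_all(rows: List[Key], size: int) -> Iterator[Key]:
    """Client-side loop: follow next_token until it is absent."""
    token = None
    while True:
        page, token = get_page(rows, size, token)
        yield from page
        if token is None:
            break
```

Resuming from the last key seen, rather than from a numeric offset, means rows added or removed between requests do not shift later pages.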

Related component

Search

Describe alternatives you've considered

An alternative solution is to enhance the existing _wlm/stats API with filtering options, ensuring that only the most relevant statistics are retrieved.

Key Features

  1. Targeted Data Retrieval: Users can filter results by Node ID, Query Group, CPU Usage, and Memory Usage to retrieve only relevant information.
  2. Sorting Support: Supports sorting by CPU Usage, Memory Usage, Node ID, and Query Group for better analysis.
  3. Tabular Output: Maintains structured, easy-to-read output similar to _cat APIs.
  4. Performance Optimization: Eliminates unnecessary data retrieval, improving query response times.

Example API Calls

Fetch Nodes with CPU Usage Above 50%

GET /_wlm/stats?cpu_threshold=50

Returns only nodes consuming more than 50% CPU.

Fetch Nodes with High Memory Usage

GET /_wlm/stats?memory_threshold=70

Retrieves only nodes using more than 70% memory.

Fetch Query Groups for a Specific Node

GET /_wlm/stats?node_id=jPPwGjW-TA2NZB6Gn7RZtg

Returns query group statistics for the given node.
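The filtering behavior in these examples could look roughly like the sketch below. It is an assumption-laden illustration, not the existing _wlm/stats implementation: the `NodeGroupStat` type and `filter_stats` helper are hypothetical, thresholds are treated as percentages with a strict "more than" comparison, and filters are applied per (node, query group) row, which the issue does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeGroupStat:
    """Hypothetical row: one query group's usage on one node."""
    node_id: str
    query_group: str
    cpu_usage: float     # percent, 0-100 (assumed unit)
    memory_usage: float  # percent, 0-100 (assumed unit)


def filter_stats(stats: List[NodeGroupStat],
                 node_id: Optional[str] = None,
                 cpu_threshold: Optional[float] = None,
                 memory_threshold: Optional[float] = None
                 ) -> List[NodeGroupStat]:
    """Keep rows that satisfy every filter that was supplied."""
    result = []
    for s in stats:
        if node_id is not None and s.node_id != node_id:
            continue
        # "More than N%" is read as a strict comparison.
        if cpu_threshold is not None and s.cpu_usage <= cpu_threshold:
            continue
        if memory_threshold is not None and s.memory_usage <= memory_threshold:
            continue
        result.append(s)
    return result
```

Filtering trims the response only when the filters are selective; with no filters supplied, the full result set is still returned in one response.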

Additional context

No response

Labels: Search, enhancement