
[BUG] Deeply nested aggregations are not terminable by any mechanism and cause Out of Memory errors in data nodes. #15413

Closed
@Pigueiras2

Description

Describe the bug

We have a cluster with 12 data nodes and 31 GB reserved for the JVM. We were experiencing sporadic Out of Memory errors and managed to isolate the issue to some dashboards that were using nested aggregations with arbitrarily large sizes. We tried different approaches to terminate these client searches before they could crash some of the nodes in the cluster, but none of them worked (as described below).

The query running behind the scenes in Grafana/Dashboards was something similar to:

POST /<index>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "metadata.timestamp": {
              "gte": 1723737975837,
              "lte": 1724342775837,
              "format": "epoch_millis"
            }
          }
        },
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "..."
          }
        }
      ]
    }
  },
  "aggs": {
    "3": {
      "terms": {
        "field": "data.dst_experiment_site",
        "size": 500000000,  # the arbitrary big size
        "order": {
          "_key": "desc"
        },
        "min_doc_count": 1
      },
      "aggs": {
        "4": {
          "terms": {
            "field": "data.dst_hostname",
            "size": 500000000,  # the arbitrary big size
            "order": {
              "_key": "desc"
            },
            "min_doc_count": 1
          },
          "aggs": {
            "5": {
              "terms": {
                "field": "data.metric_name",
                "size": 500000000,  # the arbitrary big size
                "order": {
                  "_key": "desc"
                },
                "min_doc_count": 1
              },
              "aggs": {
                "2": {
                  "date_histogram": {
                    "interval": "5m",  # this one is also very small and would create a lot of buckets
                    "field": "metadata.timestamp",
                    "min_doc_count": 1,
                    "extended_bounds": {
                      "min": 1723737975837,
                      "max": 1724342775837
                    },
                    "format": "epoch_millis"
                  },
                  "aggs": {
                    "1": {
                      "max": {
                        "field": "data.status_code"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

We tried the following settings in our cluster:

GET /_cluster/settings
...
    "indices.breaker.request.limit": "40%",
    "indices.breaker.request.overhead": "1.5",
    "indices.breaker.total.limit": "70%",
    "search.default_search_timeout": "3s",
    "search.cancel_after_time_interval": "3s",
    "search.max_buckets": 65535,
    "search.low_level_cancellation": true,
    "search_backpressure": {
      "node_duress": {
        "heap_threshold": 0.5,
        "num_successive_breaches": 1
      }
  • default_search_timeout and cancel_after_time_interval don’t have any effect. You can see this in the task monitoring:
GET /_tasks?actions=*search&detailed'"
...
"action": "indices:data/read/search",
"start_time_in_millis": 1724420308855,
"running_time_in_nanos": 12680749959,
"cancellable": true,
"cancelled": true,
"cancellation_time_millis": 1724420311870 <--- this is 3s after the start_time_in_millis and it never gets killed
...

For example, a task like this keeps running for 2-3 minutes before crashing the data nodes:

GET /_cat/tasks?v
...
indices:data/read/search                     RldgtOhvQU69uOSumtdnRA:48608   -                            transport 1724421475326 13:57:55  50.6s  	XXX.XXXX.129.208 XXXX-monit-backup1_client5
indices:data/read/search[phase/query]        le60EDYsQB-tyYYjXC8nYw:2830    RldgtOhvQU69uOSumtdnRA:48608 transport 1724421475349 13:57:55  50.6s  	XXX.XXXX.128.25  XXXX-monit-backup1_data4
indices:data/read/search[phase/query]        AQ3e1uc9S1W-Hv6fx1NYLA:3607    RldgtOhvQU69uOSumtdnRA:48608 transport 1724421475350 13:57:55  50.6s  	XXXX.XXX.129.208 XXXX-monit-backup1_data3 

If you try to kill the tasks manually with POST /_tasks/<node_id>:<task_id>/_cancel, the cluster simply ignores the request.
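As a concrete sketch, this is the kind of cancel call we mean; the parent task ID is the one from the _cat/tasks output above:

POST /_tasks/RldgtOhvQU69uOSumtdnRA:48608/_cancel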

  • Circuit breaker settings (indices.breaker.request.limit, indices.breaker.request.overhead, ...) are designed to prevent out-of-memory errors by estimating the memory usage of requests. However, OpenSearch does not appear to account for these aggregations when estimating memory usage in advance, so the query is accepted even though it eventually consumes a huge amount of memory.
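
One way to check this while such a query runs is the per-node breaker stats, which report each breaker's estimated size (a sketch using the standard node stats endpoint):

# per-node circuit breaker stats; the "request" and "parent" breakers expose
# estimated_size_in_bytes, which can be compared with the actual heap usage
GET /_nodes/stats/breaker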

  • Backpressure is triggered, but it never actually kills the problematic query. The message about “heap usage not dominated by search requests” suggests that aggregation memory is tracked through a completely different path in OpenSearch, which would explain why it is not handled by the circuit breakers or backpressure mechanisms.

...
[2024-08-22T15:56:12,212][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 64%
...
[2024-08-22T15:56:16,416][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 76%
...
[2024-08-22T15:56:18,418][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 82%
[2024-08-22T15:56:18,690][DEBUG][o.o.s.b.t.HeapUsageTracker] [osabackup101-monit-backup1_data2] heap usage not dominated by search requests [0/4992899481]

-----> backpressure did cancel tasks, but it didn't make a difference here
[2024-08-22T15:56:18,692][WARN ][o.o.s.b.SearchBackpressureService] [osabackup101-monit-backup1_data2] [enforced mode] cancelling task [2269] due to high resource consumption [cpu usage exceeded [1.6m >= 15s], elapsed time exceeded [1.7m >= 30s]]
[2024-08-22T15:56:18,693][WARN ][o.o.s.b.SearchBackpressureService] [osabackup101-monit-backup1_data2] [enforced mode] cancelling task [2270] due to high resource consumption 
[elapsed time exceeded [1.7m >= 30s]]

[2024-08-22T15:56:18,996][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 83%
...
[2024-08-22T15:56:22,699][DEBUG][o.o.s.b.t.HeapUsageTracker] [osabackup101-monit-backup1_data2] heap usage not dominated by search requests [0/4992899481]
[2024-08-22T15:56:22,998][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 94%
[2024-08-22T15:56:23,399][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [osabackup101-monit-backup1_data2] attempting to trigger G1GC due to high heap usage [31819582928]
[2024-08-22T15:56:23,512][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 95%
...
[2024-08-22T15:56:24,513][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 99%
...
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
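
The backpressure service also exposes its own per-node stats, which show what resource usage it attributes to search tasks (a sketch; this is the search backpressure stats endpoint):

# per-node search backpressure stats (task resource tracking, cancellation counts)
GET /_nodes/stats/search_backpressure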
  • max_buckets doesn’t seem to have an effect because it is only checked during the reduce phase. The limit can only be hit if the aggregation "size" is reasonable enough for OpenSearch to actually finish computing the query…
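
For contrast, here is an illustrative bounded version of the same top-level terms aggregation; with a size like this the shards can finish their work, the reduce phase runs, and the bucket limit becomes reachable at all:

POST /<index>/_search
{
  "size": 0,
  "aggs": {
    "3": {
      "terms": {
        "field": "data.dst_experiment_site",
        "size": 100,  # bounded size: the query can complete, so the reduce-phase max_buckets check is reached
        "order": {
          "_key": "desc"
        },
        "min_doc_count": 1
      }
    }
  }
}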

We've run out of ideas, so please let us know if there's something really missing from OpenSearch or if you have any other suggestions to try. We would appreciate it! 😄

Related component

Search:Aggregations

To Reproduce

  1. Create a query with large sizes and several levels of terms aggregations over high-cardinality fields (a compact sketch follows these steps)
  2. Wait for the data nodes to run out of memory
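
A compact sketch of the kind of query that reproduces this (the field names are the ones from our dashboards; any high-cardinality keyword fields should behave the same):

POST /<index>/_search
{
  "size": 0,
  "aggs": {
    "a": {
      "terms": { "field": "data.dst_experiment_site", "size": 500000000 },
      "aggs": {
        "b": {
          "terms": { "field": "data.dst_hostname", "size": 500000000 },
          "aggs": {
            "c": { "terms": { "field": "data.metric_name", "size": 500000000 } }
          }
        }
      }
    }
  }
}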

Expected behavior

  • If the circuit breaker/backpressure mechanisms do not take aggregations into account, there should be a separate mechanism to handle them. This is important because non-expert users could potentially break a cluster by running these deeply nested aggregations.
  • If search.cancel_after_time_interval worked, it would be very useful: if a query takes more than 30 seconds, something is likely wrong. A query of this type was running for more than 2 minutes before it killed some data nodes in the cluster.

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
repository-s3

Host/Environment:

  • OS: AlmaLinux 9.4
  • Version: v2.15
