
[BUG] Deeply nested aggregations are not terminable by any mechanism and cause Out of Memory errors in data nodes. #15413

Closed
@Pigueiras2

Description

Describe the bug

We have a cluster with 12 data nodes and 31 GB reserved for the JVM. We were experiencing sporadic Out of Memory errors and managed to isolate the issue to some dashboards that were using nested aggregations with arbitrarily large sizes. We tried different approaches to terminate these client searches before they could crash some of the nodes in the cluster, but none of them worked (as described below).

The query running behind the scenes in Grafana/Dashboards was something similar to:

POST /<index>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "metadata.timestamp": {
              "gte": 1723737975837,
              "lte": 1724342775837,
              "format": "epoch_millis"
            }
          }
        },
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "..."
          }
        }
      ]
    }
  },
  "aggs": {
    "3": {
      "terms": {
        "field": "data.dst_experiment_site",
        "size": 500000000,  # the arbitrary big size
        "order": {
          "_key": "desc"
        },
        "min_doc_count": 1
      },
      "aggs": {
        "4": {
          "terms": {
            "field": "data.dst_hostname",
            "size": 500000000,  # the arbitrary big size
            "order": {
              "_key": "desc"
            },
            "min_doc_count": 1
          },
          "aggs": {
            "5": {
              "terms": {
                "field": "data.metric_name",
                "size": 500000000,  # the arbitrary big size
                "order": {
                  "_key": "desc"
                },
                "min_doc_count": 1
              },
              "aggs": {
                "2": {
                  "date_histogram": {
                    "interval": "5m",  # this one is also very small and would create a lot of buckets
                    "field": "metadata.timestamp",
                    "min_doc_count": 1,
                    "extended_bounds": {
                      "min": 1723737975837,
                      "max": 1724342775837
                    },
                    "format": "epoch_millis"
                  },
                  "aggs": {
                    "1": {
                      "max": {
                        "field": "data.status_code"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

We tried the following settings in our cluster:

GET /_cluster/settings
...
    "indices.breaker.request.limit": "40%",
    "indices.breaker.request.overhead": "1.5",
    "indices.breaker.total.limit": "70%",
    "search.default_search_timeout": "3s",
    "search.cancel_after_time_interval": "3s",
    "search.max_buckets": 65535,
    "search.low_level_cancellation": true,
    "search_backpressure": {
      "node_duress": {
        "heap_threshold": 0.5,
        "num_successive_breaches": 1
      }
  • default_search_timeout and cancel_after_time_interval don’t have any effect. You can see this in the task monitoring:
GET /_tasks?actions=*search&detailed'"
...
"action": "indices:data/read/search",
"start_time_in_millis": 1724420308855,
"running_time_in_nanos": 12680749959,
"cancellable": true,
"cancelled": true,
"cancellation_time_millis": 1724420311870 <--- this is 3s after the start_time_in_millis and it never gets killed
...

For example, a task like this keeps running for 2-3 minutes before crashing the data nodes:

GET /_cat/tasks?v
...
indices:data/read/search                     RldgtOhvQU69uOSumtdnRA:48608   -                            transport 1724421475326 13:57:55  50.6s  	XXX.XXXX.129.208 XXXX-monit-backup1_client5
indices:data/read/search[phase/query]        le60EDYsQB-tyYYjXC8nYw:2830    RldgtOhvQU69uOSumtdnRA:48608 transport 1724421475349 13:57:55  50.6s  	XXX.XXXX.128.25  XXXX-monit-backup1_data4
indices:data/read/search[phase/query]        AQ3e1uc9S1W-Hv6fx1NYLA:3607    RldgtOhvQU69uOSumtdnRA:48608 transport 1724421475350 13:57:55  50.6s  	XXXX.XXX.129.208 XXXX-monit-backup1_data3 

If you try to kill the tasks manually with POST /_tasks/<node_id>:<task_id>/_cancel, the cluster simply ignores the request.
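As a concrete sketch, this is the kind of cancel call we mean; the parent task ID is the one from the _cat/tasks output above:

POST /_tasks/RldgtOhvQU69uOSumtdnRA:48608/_cancel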

  • Circuit breaker settings (indices.breaker.request.limit, indices.breaker.request.overhead, ...) are designed to prevent out-of-memory errors by estimating the memory usage of requests. However, OpenSearch does not appear to account for these aggregations when estimating memory usage in advance, so the query is accepted even though it eventually consumes a huge amount of memory.
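
One way to check this while such a query runs is the per-node breaker stats, which report each breaker's estimated size (a sketch using the standard node stats endpoint):

# per-node circuit breaker stats; the "request" and "parent" breakers expose
# estimated_size_in_bytes, which can be compared with the actual heap usage
GET /_nodes/stats/breaker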

  • Backpressure is triggered, but it never actually kills the problematic query. The message about “heap usage not dominated by search requests” suggests that aggregation memory is tracked through a completely different path in OpenSearch, which would explain why it is not handled by the circuit breakers or backpressure mechanisms.

...
[2024-08-22T15:56:12,212][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 64%
...
[2024-08-22T15:56:16,416][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 76%
...
[2024-08-22T15:56:18,418][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 82%
[2024-08-22T15:56:18,690][DEBUG][o.o.s.b.t.HeapUsageTracker] [osabackup101-monit-backup1_data2] heap usage not dominated by search requests [0/4992899481]

-----> backpressure did cancel tasks, but it didn't make a difference here
[2024-08-22T15:56:18,692][WARN ][o.o.s.b.SearchBackpressureService] [osabackup101-monit-backup1_data2] [enforced mode] cancelling task [2269] due to high resource consumption [cpu usage exceeded [1.6m >= 15s], elapsed time exceeded [1.7m >= 30s]]
[2024-08-22T15:56:18,693][WARN ][o.o.s.b.SearchBackpressureService] [osabackup101-monit-backup1_data2] [enforced mode] cancelling task [2270] due to high resource consumption 
[elapsed time exceeded [1.7m >= 30s]]

[2024-08-22T15:56:18,996][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 83%
...
[2024-08-22T15:56:22,699][DEBUG][o.o.s.b.t.HeapUsageTracker] [osabackup101-monit-backup1_data2] heap usage not dominated by search requests [0/4992899481]
[2024-08-22T15:56:22,998][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 94%
[2024-08-22T15:56:23,399][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [osabackup101-monit-backup1_data2] attempting to trigger G1GC due to high heap usage [31819582928]
[2024-08-22T15:56:23,512][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 95%
...
[2024-08-22T15:56:24,513][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [osabackup101-monit-backup1_data2] Recording memory usage: 99%
...
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
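
The backpressure service also exposes its own per-node stats, which show what resource usage it attributes to search tasks (a sketch; this is the search backpressure stats endpoint):

# per-node search backpressure stats (task resource tracking, cancellation counts)
GET /_nodes/stats/search_backpressure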
  • max_buckets doesn’t seem to have an effect because it is only checked during the reduce phase. The limit can only be hit if the aggregation "size" is reasonable enough for OpenSearch to actually finish computing the query…
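
For contrast, here is an illustrative bounded version of the same top-level terms aggregation; with a size like this the shards can finish their work, the reduce phase runs, and the bucket limit becomes reachable at all:

POST /<index>/_search
{
  "size": 0,
  "aggs": {
    "3": {
      "terms": {
        "field": "data.dst_experiment_site",
        "size": 100,  # bounded size: the query can complete, so the reduce-phase max_buckets check is reached
        "order": {
          "_key": "desc"
        },
        "min_doc_count": 1
      }
    }
  }
}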

We've run out of ideas, so please let us know if there's something really missing from OpenSearch or if you have any other suggestions to try. We would appreciate it! 😄

Related component

Search:Aggregations

To Reproduce

  1. Create a query with large sizes and several levels of terms aggregations over high-cardinality fields (a compact sketch follows these steps)
  2. Wait for the data nodes to run out of memory
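
A compact sketch of the kind of query that reproduces this (the field names are the ones from our dashboards; any high-cardinality keyword fields should behave the same):

POST /<index>/_search
{
  "size": 0,
  "aggs": {
    "a": {
      "terms": { "field": "data.dst_experiment_site", "size": 500000000 },
      "aggs": {
        "b": {
          "terms": { "field": "data.dst_hostname", "size": 500000000 },
          "aggs": {
            "c": { "terms": { "field": "data.metric_name", "size": 500000000 } }
          }
        }
      }
    }
  }
}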

Expected behavior

  • If the circuit breaker/backpressure mechanisms do not take aggregations into account, there should be a separate mechanism to handle them. This is important because non-expert users could potentially break a cluster by running these deeply nested aggregations.
  • If search.cancel_after_time_interval worked, it would be very useful: if a query takes more than 30 seconds, something is likely wrong. A query of this type was running for more than 2 minutes before it killed some data nodes in the cluster.

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
repository-s3

Host/Environment:

  • OS: AlmaLinux 9.4
  • Version: v2.15
