Description
Describe the bug
After S3 remote-backed storage was fixed in v3, we enabled it together with segment replication. Shortly afterwards, we noticed our OpenSearch cluster began issuing millions of S3 ListObjectsV2 and GetObject requests per hour against the configured S3 bucket.
This behavior appears pathological and is triggered only after the shutdown of a single OpenSearch node, leading to:
- Persistent get and list requests to the /<index_uuid>/0/segments/metadata/metadata prefixes across all indices
- Sustained high CPU usage on one surviving node (likely due to the looping requests)
This causes a HUGE spike in cluster costs ($1,000+/month) even for our very small dataset (<100MB): over the course of a month it generated roughly 200M requests to S3, and we got fairly lucky, since it wasn't occurring all the time.
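For context, that dollar figure lines up with the request volume if one assumes S3 standard list-request pricing in us-east-2 of roughly $0.005 per 1,000 requests (our assumption; GetObject calls are billed much lower, so treat this as an upper-bound sketch):

# Rough upper-bound estimate, assuming all 200M requests were billed as LIST-class calls
echo "scale=2; 200000000 / 1000 * 0.005" | bc    # ~= 1000 USD over the month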
In our debugging, here is what we have observed:
- This behavior is not observed during normal operations, only when a node is lost.
- It doesn't appear to matter whether the lost node is a leader or follower.
- In a three-node cluster, only one of the two remaining nodes is impacted (see the sketch after this list for a quick way to pinpoint it).
- It doesn't occur all the time (~25% incident rate for us).
- The issue cannot be fixed without deleting the malfunctioning node.
- Oddly, the cluster itself appears to be functioning fine even while this is occurring -- even for indices with shards on the affected node.
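A quick way to spot which surviving node is stuck in the loop is the standard nodes hot threads API; the host, port, and credentials below are placeholders for our setup:

# Check which node's threads are busy with the looping remote-store requests
curl -sk -u "admin:$OPENSEARCH_ADMIN_PASSWORD" "https://localhost:9200/_nodes/hot_threads?threads=5"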
Related component
Storage:Remote
To Reproduce
- Deploy a 3-node OpenSearch cluster on Kubernetes using these opensearch.yml settings (a sketch for verifying they took effect follows this reproduction list):
"cluster.allocator.existing_shards_allocator.batch_enabled": true
"cluster.indices.replication.strategy": "SEGMENT"
"cluster.initial_cluster_manager_nodes":
- "opensearch-7d0b-0"
- "opensearch-7d0b-1"
- "opensearch-7d0b-2"
"cluster.name": "opensearch-7d0b"
"cluster.remote_store.publication.enabled": true
"cluster.remote_store.routing_table.path.prefix": "routing/"
"cluster.remote_store.state.enabled": true
"cluster.remote_store.state.path.prefix": "state/"
"cluster.routing.allocation.balance.prefer_primary": true
"cluster.routing.allocation.disk.watermark.high": "90%"
"cluster.routing.allocation.shard_movement_strategy": "PRIMARY_FIRST"
"discovery.seed_hosts":
- "opensearch-7d0b-headless"
"indices.recovery.max_bytes_per_sec": "0mb"
"logger._root": "WARN"
"logger.org.opensearch.alerting.util.destinationmigration.DestinationMigrationCoordinator": "WARN"
"network.host": "0.0.0.0"
"node.attr.remote_store.repository.s3.settings.bucket": "default-opensearch-7d0b-storage-15cf133175f73b63"
"node.attr.remote_store.repository.s3.settings.region": "us-east-2"
"node.attr.remote_store.repository.s3.type": "s3"
"node.attr.remote_store.routing_table.repository": "s3"
"node.attr.remote_store.segment.repository": "s3"
"node.attr.remote_store.state.repository": "s3"
"node.attr.remote_store.translog.repository": "s3"
"node.roles":
- "cluster_manager"
- "ingest"
- "data"
- "remote_cluster_client"
"s3.client.default.endpoint": "s3.us-east-2.amazonaws.com"
"s3.client.default.region": "us-east-2"
"segrep.pressure.enabled": true
- Launch each node with the following bash script:
#!/usr/bin/env bash
set -euo pipefail
./bin/opensearch-plugin install -b repository-s3
echo "$AWS_ACCESS_KEY_ID" | ./bin/opensearch-keystore add --stdin --force s3.client.default.access_key
echo "$AWS_SECRET_ACCESS_KEY" | ./bin/opensearch-keystore add --stdin --force s3.client.default.secret_key
# Set JVM heap size dynamically to 50% of the container memory request (bytes -> MB)
HEAP_SIZE=$((CONTAINER_MEMORY_REQUEST / 2000000))
export OPENSEARCH_JAVA_OPTS="-Xmx${HEAP_SIZE}M -Xms${HEAP_SIZE}M --enable-native-access=ALL-UNNAMED"
./bin/opensearch \
-Enode.name="$POD_NAME" \
-Eplugins.query.datasources.encryption.masterkey="$OPENSEARCH_MASTER_KEY"
- Deploy the demo data that comes with OpenSearch Dashboards.
- Restart a node (doesn't occur every restart -- looks to occur ~25% of the time for us).
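To confirm the remote-store node attributes and segment replication actually took effect before restarting, something like the following can be used (standard _cat/nodeattrs and index settings APIs; host and credentials are placeholders):

# Confirm the remote_store node attributes were applied on every node
curl -sk -u "admin:$OPENSEARCH_ADMIN_PASSWORD" "https://localhost:9200/_cat/nodeattrs?v" | grep remote_store

# Confirm indices were created with segment replication (expect index.replication.type = SEGMENT)
curl -sk -u "admin:$OPENSEARCH_ADMIN_PASSWORD" "https://localhost:9200/_settings?flat_settings=true" \
  | grep -o '"index.replication.type":"[A-Z]*"' | sort | uniq -c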
Expected behavior
The infinite loop of requests should not occur.
Additional Details
Plugins
- repository-s3
Screenshots
Host/Environment (please complete the following information):
- Environment: AWS EKS 1.30
- OS: Bottlerocket
- Version: 3.0.0
Additional context
Count of the list requests that occurred in a single minute, based on the CloudTrail event logs (one way to extract these counts is sketched after the list):
1220 1vO1bkUERFWvJp2p9vNo9w/0/segments/metadata/metadata
1195 eFMZ2857RDKzqWOKoEhjmg/0/segments/metadata/metadata
1166 LZcXfsUHQoOfZ-VAe82Smw/0/segments/metadata/metadata
1142 UZA4K5ruT-Opx6EmbOKbvg/0/segments/metadata/metadata
1138 TaxiBYmZS4ypJ2DrGdxZEw/0/segments/metadata/metadata
1105 NlAaRqFzRku2H2A7ABCYlA/0/segments/metadata/metadata
1088 EkPLvxHwTZOWIWkS6_UgEA/0/segments/metadata/metadata
1077 tJPUeD9ESA6X1XsPZxUxcg/0/segments/metadata/metadata
1068 04mQ2jgvSMejOlTaUB9k6Q/0/segments/metadata/metadata
1050 OJMqg9FsSy2m76o3vxZvVQ/0/segments/metadata/metadata
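These counts can be reproduced from the raw CloudTrail S3 data-event logs with a short pipeline; this is a rough sketch assuming the events are delivered as gzipped JSON files and that the list calls carry the key prefix in requestParameters.prefix:

# Count list-style requests per prefix from CloudTrail S3 data events (gzipped JSON delivery assumed)
zcat *.json.gz \
  | jq -r '.Records[]
           | select(.eventSource == "s3.amazonaws.com" and .requestParameters.prefix != null)
           | .requestParameters.prefix' \
  | sort | uniq -c | sort -rn | head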
I also captured all the DEBUG logs across the entire cluster during a 5-minute window when this issue occurred. Due to the sheer volume of requests, the log file is quite large (2GB): log file.