
Multi-Writer Prevention: Conditional Upload Flow & Logic #18522


Open
wants to merge 9 commits into
base: main

@x-INFiN1TY-x x-INFiN1TY-x commented Jun 15, 2025

Related Issues/RFCs:


Problem Statement

OpenSearch clusters using remote-backed storage are susceptible to data inconsistencies when more than one node acts as primary for the same shard and uploads segment metadata concurrently, particularly during network partitions or primary failovers. A stale (previously active) primary might overwrite metadata written by the newly promoted one. This risk undermines cluster safety and complicates automation in recovery flows.

Current multi-writer detection mechanisms are not robust enough to handle this reliably.


Solution Overview

This PR introduces ETag-based conditional writes to the remote segment metadata upload process. ETags (version identifiers from cloud storage systems like S3, GCS, or Azure) allow OpenSearch to safely coordinate access to shared resources. This mechanism ensures only the correct primary shard can write metadata, while stale primaries self-detect and fence themselves.

Key enhancements:

  1. ETag-Based Conditional Writes:
    Primary shards attach the known ETag to each metadata upload using the If-Match condition. If the ETag doesn't match the current version in the remote store, the write is rejected (HTTP 412 Precondition Failed).

  2. Fixed Metadata Filename:
    To enable ETag-based coordination, segment metadata is now always written to a fixed filename (e.g., "segment_metadata") instead of legacy dynamic filenames.

  3. Stale Primary Self-Fencing:

    • On promotion, the new primary performs a non-conditional (forced) metadata upload after clearing its local ETag knowledge. This updates the remote file and its ETag.
    • If the old primary tries to write using a stale ETag, the write fails. This triggers a controlled failShard() operation, fencing off the stale node.

  4. ETag Lifecycle Managed at Shard Level:
    IndexShard now caches the ETag for its segment metadata file and updates it based on the success/failure of remote operations.

This design moves writer validation out of OpenSearch and into the remote store’s atomic conditional writes, which improves correctness and simplifies state coordination.
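The fencing behavior described above can be illustrated with a minimal, self-contained sketch. It is not the PR's code: `InMemoryConditionalBlob`, its version-counter ETags, and `StalePrimaryDemo` are hypothetical stand-ins for a remote store that honors If-Match, used only to show why a stale primary's write is rejected after a new primary's forced upload.

```java
/** Hypothetical in-memory stand-in for one remote blob that supports If-Match writes. */
final class InMemoryConditionalBlob {
    private byte[] content;
    private long version = 0; // drives the ETag here; real stores derive ETags differently

    /** Models an HTTP 412 Precondition Failed response. */
    static final class PreconditionFailedException extends RuntimeException {
        PreconditionFailedException(String message) {
            super(message);
        }
    }

    /** Returns the current ETag, or null if nothing has been written yet. */
    synchronized String currentETag() {
        return version == 0 ? null : "v" + version;
    }

    /**
     * Writes the blob and returns its new ETag. A null expectedETag means an
     * unconditional (forced) write; otherwise the write succeeds only if
     * expectedETag equals the current ETag.
     */
    synchronized String write(byte[] data, String expectedETag) {
        if (expectedETag != null && !expectedETag.equals(currentETag())) {
            throw new PreconditionFailedException(
                "412: expected " + expectedETag + " but found " + currentETag());
        }
        content = data.clone();
        version++;
        return currentETag();
    }
}

final class StalePrimaryDemo {
    public static void main(String[] args) {
        InMemoryConditionalBlob metadata = new InMemoryConditionalBlob();

        // Old primary: first upload is unconditional, later uploads use If-Match.
        String oldPrimaryETag = metadata.write("gen-1".getBytes(), null);
        oldPrimaryETag = metadata.write("gen-2".getBytes(), oldPrimaryETag);

        // New primary is promoted and claims the file with a forced upload.
        String newPrimaryETag = metadata.write("gen-3".getBytes(), null);

        // Old primary still holds the previous ETag, so its next write is rejected.
        try {
            metadata.write("gen-2b".getBytes(), oldPrimaryETag);
        } catch (InMemoryConditionalBlob.PreconditionFailedException e) {
            System.out.println("stale write rejected: " + e.getMessage());
        }
        System.out.println("current ETag: " + newPrimaryETag);
    }
}
```

In this model the rejection is decided entirely inside `write`, which mirrors the point that the remote store, not OpenSearch, arbitrates between writers.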


Key Implementation Details

IndexShard

  • Introduces a MetadataETagCache per shard to hold the latest known ETag.

  • Provides methods:

    • getMetadataETag()
    • updateMetadataETag()
    • clearMetadataETag()
  • On primary promotion, invokes initiateNonConditionalRemoteMetadataUpload():

    • Clears cached ETag to trigger an unconditional upload.
    • Performs an overwrite that establishes a new ETag and “claims” primary ownership.
    • Handles transient errors gracefully, relying on future refreshes to retry.
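A rough sketch of the cache and the promotion-time claim, under stated assumptions: the method names follow the list above, but their signatures, the `AtomicReference` backing, the `MetadataUploaderSketch` interface, and the choice to leave the cache empty after a failed claim are all assumptions for illustration, not the PR's actual implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative per-shard holder for the last known metadata ETag (null means unknown). */
final class MetadataETagCacheSketch {
    private final AtomicReference<String> eTag = new AtomicReference<>();

    String getMetadataETag() {
        return eTag.get();
    }

    void updateMetadataETag(String newETag) {
        eTag.set(newETag);
    }

    void clearMetadataETag() {
        eTag.set(null);
    }
}

/** Hypothetical uploader: takes the expected ETag (or null) and returns the new ETag. */
interface MetadataUploaderSketch {
    String upload(String expectedETag) throws Exception;
}

final class PromotionSketch {
    /**
     * Models the promotion-time claim: clear the cached ETag so the upload is
     * unconditional, then remember the ETag the store returns.
     * Returns false on failure so the caller can let a later refresh retry.
     */
    static boolean claimOwnership(MetadataETagCacheSketch cache, MetadataUploaderSketch uploader) {
        cache.clearMetadataETag();
        try {
            // A null expected ETag stands for "no precondition" in this sketch.
            String newETag = uploader.upload(cache.getMetadataETag());
            cache.updateMetadataETag(newETag);
            return true;
        } catch (Exception e) {
            // Assumed handling: keep the cache empty and do not fail the shard.
            return false;
        }
    }
}
```

An `AtomicReference` is used here only so that reads and updates from different threads see a consistent value; the PR may synchronize differently.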

RemoteStoreRefreshListener

  • During each metadata upload:

    • Retrieves the current ETag from the shard.
    • Invokes uploadMetadata(...) with the ETag and a structured ActionListener.
  • On success: Updates shard’s cached ETag.

  • On Precondition Failed: Treats this as a stale primary detection, clears ETag, and calls failShard() for fencing.

  • Logs other failures without failing the shard.
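The three outcomes above could be dispatched along these lines. This is a simplified sketch: `UploadOutcome`, `ShardStateSketch`, and `UploadResultHandlerSketch` are hypothetical names, and the real listener works through an ActionListener callback rather than an enum.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical outcome categories for one metadata upload attempt. */
enum UploadOutcome { SUCCESS, PRECONDITION_FAILED, OTHER_FAILURE }

/** Hypothetical stand-in for the shard state the listener touches. */
final class ShardStateSketch {
    String cachedETag;                                // last known ETag, null if unknown
    boolean failed;                                   // set once failShard() has been called
    final List<String> warnings = new ArrayList<>();  // stands in for log output

    void failShard(String reason) {
        failed = true;
        warnings.add("shard failed: " + reason);
    }
}

final class UploadResultHandlerSketch {
    /** Applies the per-outcome handling described in the list above. */
    static void onUploadFinished(ShardStateSketch shard, UploadOutcome outcome,
                                 String newETag, String detail) {
        switch (outcome) {
            case SUCCESS:
                // Remember the ETag returned by the store for the next conditional write.
                shard.cachedETag = newETag;
                break;
            case PRECONDITION_FAILED:
                // Another writer changed the file: treat this node as a stale primary.
                shard.cachedETag = null;
                shard.failShard("stale primary detected: " + detail);
                break;
            case OTHER_FAILURE:
                // Log only; the shard stays up and a later refresh can retry.
                shard.warnings.add("metadata upload failed: " + detail);
                break;
        }
    }
}
```

Only the precondition failure fences the shard; other errors leave the cached ETag untouched, so the next attempt is still conditional on it.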

RemoteSegmentStoreDirectory

  • Accepts a versionIdentifier (ETag) and enhanced ActionListener.

  • Constructs ConditionalWriteOptions based on the ETag:

    • If ETag is present → ifMatch
    • If ETag is null → unconditional upload
  • Always uses "segment_metadata" as the remote filename.
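The ETag-to-options mapping can be pictured as follows. `WriteOptionsSketch` is a hypothetical class written for illustration; it is not the `ConditionalWriteOptions` type the PR relies on, whose actual API is defined in the dependent changes.

```java
import java.util.Objects;

/** Hypothetical, simplified model of write options for one metadata upload. */
final class WriteOptionsSketch {
    /** Fixed remote filename, as named in the description above. */
    static final String METADATA_FILENAME = "segment_metadata";

    private final String ifMatchETag; // null means unconditional

    private WriteOptionsSketch(String ifMatchETag) {
        this.ifMatchETag = ifMatchETag;
    }

    static WriteOptionsSketch ifMatch(String eTag) {
        return new WriteOptionsSketch(Objects.requireNonNull(eTag, "eTag"));
    }

    static WriteOptionsSketch unconditional() {
        return new WriteOptionsSketch(null);
    }

    /** Present ETag: If-Match write. Null ETag: unconditional write. */
    static WriteOptionsSketch fromVersionIdentifier(String versionIdentifier) {
        return versionIdentifier == null ? unconditional() : ifMatch(versionIdentifier);
    }

    boolean isConditional() {
        return ifMatchETag != null;
    }

    String ifMatchETag() {
        return ifMatchETag;
    }
}
```

Keeping the null check in a single factory method means callers pass along whatever the shard cache holds without branching on it themselves.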

RemoteDirectory & BlobStore

  • copyFrom() method now takes ConditionalWriteOptions.
  • Passes them through to the underlying blobContainer.writeBlobConditionally(...) for storage-provider-specific handling.

Testing

Unit tests in RemoteSegmentStoreDirectoryTests have been expanded to verify:

  • ETag propagation and conditional write correctness.
  • Proper fencing behavior on ETag mismatches.
  • Correct switching between conditional and unconditional uploads.

Related Issues

Check List

  • New functionality has been documented.
  • Public documentation issue/PR created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Visualizing the Changes

  1. Sequence Diagram: Stale Primary Self-Fencing Mechanism
    Demonstrates how an outdated ETag causes a 412 failure, triggering the stale primary’s self-initiated failShard().

  2. Architecture: Core Components for Conditional Metadata Upload
    Shows interaction flow between IndexShard, RemoteStoreRefreshListener, RemoteSegmentStoreDirectory, RemoteDirectory, and BlobStore, including ETag usage.

  3. Flowchart: New Primary’s Metadata Ownership Claim
    Shows how a new primary clears its ETag cache, performs a non-conditional upload, and updates its local ETag before assuming control.

@x-INFiN1TY-x
Author

Please note that this PR depends on the following changes, which are currently under review. Until they are merged, the Gradle build will fail:

opensearch-project/OpenSearch #18064

opensearch-project/OpenSearch #18092

opensearch-project/OpenSearch #18093

Contributor

❌ Gradle check result for f1024f3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@x-INFiN1TY-x x-INFiN1TY-x requested review from sohami, VachaShah and a team as code owners June 15, 2025 17:38