[META] Implement Conditional APIs for Multi-Writer Detection in Remote Store Clusters #17859

Open
x-INFiN1TY-x opened this issue Apr 9, 2025 · 1 comment
Labels: Meta (Meta issue, not directly linked to a PR)

x-INFiN1TY-x commented Apr 9, 2025

Please describe the end goal of this project

This META issue tracks the implementation of "Approach 2: Versioned & Mutable Metadata File with Conditional Writes" (see RFC #17763). The aim is to simplify coordination between the writer and replicas during writes, prevent multi-writer conflicts, and simplify recovery for red clusters.
In a remote-store-enabled cluster, the primary shard writes data (translog entries and segment information) to an external object store (Amazon S3, Google Cloud Storage, Azure Blob Storage), while replicas source their data from that store. The existing synchronous primary term validation mechanism relies on blocking inter-node calls, which adds latency and complexity. This proposal leverages the remote store’s native conditional APIs to enforce writer integrity atomically, thereby reducing coordination overhead. A conditional write succeeds only if the metadata file's current ETag matches the one the writer last observed, so a stale primary is rejected by the store itself.
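
To make the conditional-write idea concrete, here is a minimal in-memory sketch of If-Match semantics. It is not the S3 client and not OpenSearch code: `InMemoryConditionalStore`, `PutResult`, and the rule that a null `ifMatch` means "the object must not exist yet" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** In-memory stand-in for an object store that honors If-Match on PUT (hypothetical, illustration only). */
class InMemoryConditionalStore {

    /** HTTP-like status plus the ETag that is current for the key after the call. */
    static final class PutResult {
        final int status;
        final String etag;

        PutResult(int status, String etag) {
            this.status = status;
            this.etag = etag;
        }
    }

    private final Map<String, String> bodies = new HashMap<>();
    private final Map<String, String> etags = new HashMap<>();
    private long version = 0;

    /**
     * Writes the body only if ifMatch equals the stored ETag.
     * Assumption of this model: a null ifMatch means "the object must not exist yet".
     */
    synchronized PutResult put(String key, String body, String ifMatch) {
        String current = etags.get(key);
        if (!Objects.equals(current, ifMatch)) {
            return new PutResult(412, current); // precondition failed, nothing is written
        }
        String next = "etag-" + (++version);
        bodies.put(key, body);
        etags.put(key, next);
        return new PutResult(200, next);
    }

    synchronized String get(String key) {
        return bodies.get(key);
    }

    public static void main(String[] args) {
        InMemoryConditionalStore store = new InMemoryConditionalStore();
        String seen = store.put("metadata_segments", "generation-1", null).etag;

        // Two writers both believe they are primary and both hold the same ETag.
        PutResult a = store.put("metadata_segments", "generation-2-from-A", seen);
        PutResult b = store.put("metadata_segments", "generation-2-from-B", seen);

        System.out.println("writer A: " + a.status); // 200
        System.out.println("writer B: " + b.status); // 412
    }
}
```

Whichever writer reaches the store first wins, and the other receives 412 without any inter-node call, which is the property the proposal relies on.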

Key Components/Tasks

  1. S3 Client Enhancement
  • Add support for conditional writes to AWS S3 via If-Match-based PUT calls
  • Implement conditional PUTs for both single-part and multipart uploads
  • Add structured response handling for 412 (Precondition Failed) and 409 (Conflict)
  • Record and return the ETag value from the PUT response
  • Implement retry logic with appropriate backoff
  2. ETag Caching Mechanism
  • Implement thread-safe infrastructure for managing ETag values at the shard level
  • Add functionality to retrieve and cache new ETag values from S3 responses
  • Ensure thread-safe access to the latest ETag value for the primary shard
  • Add comprehensive logging for transitions and conflicts
  3. Metadata File Structure Refactoring
  • Replace the need for multiple metadata files with a single versioned metadata file
  • Convert translog metadata files to use a fixed name structure (metadata_translog)
  • Convert segment metadata files to use a fixed name structure (metadata_segments)
  • Add support for retrieving versioning information from the remote store
  • Build path resolution adapters for the new structure
  • Implement support for reading legacy metadata formats
  4. Primary Shard Bootstrap Process
  • Retrieve and cache the ETag from the response headers during metadata download at shard initialization
  • Update bootstrap logic to enable recovery via the versioned metadata file from the remote store
  5. Write Operation Flow Changes
  • Modify the upload process to use conditional PUTs, passing the cached ETag in the If-Match header
  • Implement differentiated shard failure handling based on routing state:
    • For STARTED state: Fail the shard immediately on a 412 response
    • For INITIALIZING state: Implement retry logic (max 3 attempts) with ETag refresh
  • Cache the new ETag after a successful write (200)
  • Add detailed logging for conditional operation success/failure tracking
  6. Shard Failure Handling
  • Initiate the self-failover mechanism on 412 Precondition Failed responses, which indicate a stale primary
  • Implement appropriate client response handling during shard failures
  • Add detailed logging for failure scenarios to aid debugging
  7. Recovery and Restore Process Updates
  • Update the _remotestore/restore API to work with versioned metadata files
  • Add header retrieval while syncing segments from the remote store
  • Implement proper handling of version conflicts during restore operations
  • Update the recovery process to fetch and validate the latest ETag from the response headers before proceeding
  8. Garbage Collection Adjustments
  • Adapt GC logic to handle the fixed-name versioned metadata file
  • Add validation steps to prevent deletion of active versions
  • Replace ListObjects calls with ListObjectVersions calls
  9. Snapshot V2 Integration
  • Update the snapshot V2 process to work with versioned metadata files
  • Ensure consistent point-in-time snapshots with versioned files
  10. Testing Infrastructure
  • Create unit tests for single-part and multipart conditional PUT operations
  • Implement integration tests for multi-writer scenarios
  • Add specific tests for shard failure handling in both STARTED and INITIALIZING states
  • Test retry logic for INITIALIZING shards
  • Design performance tests to measure impact on write latency
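
The routing-state rule under "Write Operation Flow Changes" (fail a STARTED shard on the first 412, let an INITIALIZING shard retry up to 3 times with a refreshed ETag) could look roughly like the sketch below. `ConditionalUploadPolicy`, `upload`, and the two callback parameters are hypothetical names, not existing OpenSearch APIs, and backoff between attempts is left out for brevity.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper: decides how to react to a failed conditional PUT based on shard routing state. */
class ConditionalUploadPolicy {

    enum RoutingState { STARTED, INITIALIZING }

    static final int MAX_INITIALIZING_ATTEMPTS = 3;

    /**
     * @param conditionalPut takes the expected ETag; returns the new ETag on 200, or empty on 412
     * @param refreshEtag    re-reads the current ETag of the metadata file from the remote store
     * @return the ETag to cache after a successful write, or empty if the shard should be failed
     */
    static Optional<String> upload(RoutingState state,
                                   String cachedEtag,
                                   Function<String, Optional<String>> conditionalPut,
                                   Supplier<String> refreshEtag) {
        // A STARTED primary gets one attempt; an INITIALIZING shard may retry with a fresh ETag.
        int attempts = (state == RoutingState.INITIALIZING) ? MAX_INITIALIZING_ATTEMPTS : 1;
        String etag = cachedEtag;
        for (int attempt = 1; attempt <= attempts; attempt++) {
            Optional<String> newEtag = conditionalPut.apply(etag);
            if (newEtag.isPresent()) {
                return newEtag;
            }
            if (attempt < attempts) {
                etag = refreshEtag.get();
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulated remote state: the file is currently at "etag-7"; the shard's cache is stale.
        Function<String, Optional<String>> put =
            expected -> "etag-7".equals(expected) ? Optional.of("etag-8") : Optional.empty();

        System.out.println(upload(RoutingState.STARTED, "etag-6", put, () -> "etag-7"));
        System.out.println(upload(RoutingState.INITIALIZING, "etag-6", put, () -> "etag-7"));
    }
}
```

In this simulation the STARTED call returns empty after a single attempt, while the INITIALIZING call succeeds on its second attempt once the ETag has been refreshed.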

Supporting References

Issues

  1. Conditional Operations Implementation
  • [CO-1] Support Conditional Writes for Single-Part S3 Uploads
    Implement conditional operations on the S3 PutObject API for single-part uploads.
  • [CO-2] Support Conditional Writes for Multipart S3 Uploads
    Extend the conditional write capability to S3 multipart uploads using conditional headers.
  • [CO-3] Tests
  2. ETag Management and Metadata Refactoring
  • [ET-1] Develop a Thread-Safe ETag Caching System
    Implement an ETag cache at the shard level to store and manage the last known ETag/generation number from remote storage.
  • [MD-1] Refactor Metadata File Naming to Fixed, Versioned Structure
    Migrate from dynamic naming to fixed names for metadata files and adapt the affected functions to the new structure.
  • [ET-2] Integrate ETag Handling into Shard Bootstrap Process
    Update the bootstrap logic to retrieve and cache the latest ETag from the metadata file synced during shard initialization.
  3. Failure Handling and Primary Handoff
  • [FH-1] Implement Automatic Failure Handling for Conditional Write Conflicts
    Detect conditional PUT failures (HTTP 412/409) and trigger self-failover mechanisms.
  • [FH-2] Enhance Primary Relocation Process for ETag Handoff
  4. Data Lifecycle, Recovery, and Snapshot Integration
  • [DL-1] Update Garbage Collection for Fixed, Versioned Metadata
  • [DL-2] Integrate Versioned Metadata with Snapshot V2
  • [DL-3] Adapt Remote Store Restore API for Versioned Metadata
  5. Testing and Verification
  • [TV-1] Develop a Comprehensive Test Suite for Multi-Writer Detection
    Create unit, integration, and performance tests to simulate multi-writer scenarios, exercise conditional write operations, and validate recovery workflows.
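
For [ET-1], one possible shape for a shard-level, thread-safe ETag holder is sketched below. `ShardEtagCache` and its methods are hypothetical names chosen for illustration; the actual design may differ, for example by also tracking a generation number.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical per-shard holder for the last ETag observed for a metadata file. */
class ShardEtagCache {

    private final AtomicReference<String> latest = new AtomicReference<>();

    /** Returns the most recently recorded ETag, or null if none has been seen yet. */
    String get() {
        return latest.get();
    }

    /** Unconditionally records an ETag, e.g. the one read from response headers during bootstrap. */
    void set(String etag) {
        latest.set(etag);
    }

    /**
     * Moves the cache from expected to next only if it still holds expected.
     * AtomicReference.compareAndSet compares references, so the value is compared
     * with equals() first and the CAS is then done against the instance actually read.
     */
    boolean advance(String expected, String next) {
        while (true) {
            String current = latest.get();
            if (!Objects.equals(current, expected)) {
                return false;
            }
            if (latest.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    public static void main(String[] args) {
        ShardEtagCache cache = new ShardEtagCache();
        cache.set("etag-1");
        System.out.println(cache.advance("etag-1", "etag-2")); // true
        System.out.println(cache.advance("etag-1", "etag-3")); // false, the cache already moved on
        System.out.println(cache.get());                        // etag-2
    }
}
```

Using a compare-and-advance step rather than a plain setter means that a late response from an older upload cannot overwrite a newer cached ETag.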

Related component

Storage:Remote

x-INFiN1TY-x added the Meta (Meta issue, not directly linked to a PR) and untriaged labels Apr 9, 2025
x-INFiN1TY-x pushed a commit to x-INFiN1TY-x/OpenSearch_Local that referenced this issue Apr 15, 2025
x-INFiN1TY-x pushed a commit to x-INFiN1TY-x/OpenSearch_Local that referenced this issue Apr 15, 2025
x-INFiN1TY-x pushed a commit to x-INFiN1TY-x/OpenSearch_Local that referenced this issue Apr 23, 2025
x-INFiN1TY-x pushed a commit to x-INFiN1TY-x/OpenSearch_Local that referenced this issue Apr 27, 2025
andrross (Member) commented

Catch All Triage - 1
