[Discuss] Remote Storage File Format #8437

Closed
@Bukhtawar

1. Blob Store Directory Structure

|__index-uuid
            |__shard
                  |__segments
                       |__metadata
                            |__<file-prefix>_metadata_<file-gen>_<version>
                       |__data
                            |__segments_<N>__<file-gen>
                            |__<N>.si__<file-gen>
                            |__<N>.cfe__<file-gen>
                            |__<N>.cfs__<file-gen>
                  |__translogs
                       |__metadata
                            |__<file-prefix>_metadata_<file-gen>_<version>
                       |__data
                            |__primary-term
                                |__translog-<file-gen>.tlog (with checkpoint blob metadata)
  1. file-gen : Monotonically increasing file generation. At every relocation/recovery this starts from the last checkpoint.

  2. version : The version of the metadata file. Although it is also captured in the file contents, having it in the name might be helpful if we decide to switch metadata to another format such as Avro.

  3. file-prefix : A prefix that speeds up lookups such as fetching the latest metadata or the files at a particular point in time. The S3 LIST API is guaranteed to return results in UTF-8 binary sort order, Azure sorts LIST results alphabetically, and GCP also returns results in lexicographical order. Based on this, file names can use a format that factors in the timestamp, as below. Since our dominant access pattern is fetching the most recently created files first, it is imperative that we coerce the sort order by using Long.MAX_VALUE - term/timestamp, referred to as inverted sort. The alternative is blob versioning, but that gives less control over retaining and searching files based on timestamps.

     <inverted_primary_term>_<inverted_generation>_<inverted_timestamp>_<file_gen> : [Preferred]
    

    Benefits
    i. Fetch latest metadata files in constant time
    ii. Fetch data at a particular timestamp using binary search

    Other alternatives
    i. <inverted_timestamp><file_gen>
    ii. <inverted_timestamp>
    <inverted_primary_term><inverted_generation><file_gen>

  4. Blob metadata : Where the corresponding file metadata is smaller than 1 KB, it is more efficient to attach it to the blob as blob metadata (e.g. translog.ckp), which significantly reduces the number of PUT calls.
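
To make the inverted sort concrete, here is a minimal Python sketch of the preferred prefix format. It is an illustration under stated assumptions, not the proposed implementation: the helper names (`invert`, `file_prefix`, `list_blobs`, `at_or_before`) are hypothetical, the 19-digit zero padding is an assumption added so that lexicographic order matches numeric order, and `list_blobs` only simulates a blob store LIST call by sorting names in memory. The timestamp lookup further assumes that timestamps grow together with primary term and generation.

```python
import bisect

LONG_MAX = 2**63 - 1        # same value as Java's Long.MAX_VALUE
WIDTH = len(str(LONG_MAX))  # 19 digits; fixed-width padding is an assumption of this sketch


def invert(value):
    """Return Long.MAX_VALUE - value, zero-padded so string order equals numeric order."""
    if not 0 <= value <= LONG_MAX:
        raise ValueError("value must fit in a non-negative signed 64-bit long")
    return str(LONG_MAX - value).zfill(WIDTH)


def file_prefix(primary_term, generation, timestamp, file_gen):
    """Build <inverted_primary_term>_<inverted_generation>_<inverted_timestamp>_<file_gen>."""
    return "_".join(
        [invert(primary_term), invert(generation), invert(timestamp), str(file_gen)]
    )


def list_blobs(names):
    """Stand-in for a blob store LIST call that returns names in lexicographic order."""
    return sorted(names)


def at_or_before(sorted_names, timestamp):
    """Binary-search the listing for the newest file created at or before `timestamp`."""
    # Inverted timestamps ascend through the listing because real timestamps descend.
    inverted_ts = [name.split("_")[2] for name in sorted_names]
    i = bisect.bisect_left(inverted_ts, invert(timestamp))
    return sorted_names[i] if i < len(sorted_names) else None


old = file_prefix(primary_term=1, generation=5, timestamp=1000, file_gen=1)
mid = file_prefix(primary_term=1, generation=6, timestamp=2000, file_gen=2)
new = file_prefix(primary_term=2, generation=7, timestamp=3000, file_gen=3)

listing = list_blobs([old, mid, new])
latest = listing[0]                           # newest file is the first LIST result
point_in_time = at_or_before(listing, 2500)   # newest file not newer than t=2500
```

With this layout the latest file is simply the first entry of the listing, and a point-in-time lookup needs only a binary search over the sorted names.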

2. Access Patterns

Translogs

  1. High append-only writes
  2. Full recoveries during failovers from the latest file
  3. Point-in-time restores
  4. Garbage collect unreferenced files
  5. Version upgrades using raw translogs

Segments

  1. New segment writes to remote store
  2. New segment downloads from remote store
  3. Full recovery during peer recovery
  4. Delta recovery during failover
  5. Point-in-time restores
  6. Garbage collect unreferenced files

3. Metadata File Formats

Translog

{
    "CURRENT_VERSION": 1,
    "METADATA_CODEC": "md",
    "primaryTerm": 3,
    "generation": 160,
    "minTranslogGeneration": 157,
    "generationToPrimaryTermMapper": {
        "160": 3,
        "159": 3,
        "158": 3,
        "157": 2
    },
    "checksum": "c7h5rwdgs423fsdae570s$%dk9",
    "contentLength": 10
}
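
As a rough sketch of how a full recovery might use this file, the Python snippet below resolves every translog generation between minTranslogGeneration and generation to a blob path under the directory structure from section 1. The function name `translog_blob_paths` is hypothetical, and two things are assumed rather than stated in the proposal: the metadata is serialized as valid JSON with generation keys as strings, and the `<file-gen>` in the translog file name equals the translog generation.

```python
import json


def translog_blob_paths(metadata):
    """List the translog blobs needed for recovery, oldest generation first.

    Paths follow translogs/data/<primary-term>/translog-<file-gen>.tlog.
    """
    # JSON object keys are strings, so convert generations back to integers.
    term_of = {
        int(gen): term
        for gen, term in metadata["generationToPrimaryTermMapper"].items()
    }
    paths = []
    for gen in range(metadata["minTranslogGeneration"], metadata["generation"] + 1):
        # A missing generation raises KeyError, which flags incomplete metadata.
        paths.append(f"translogs/data/{term_of[gen]}/translog-{gen}.tlog")
    return paths


example = json.loads("""
{
    "primaryTerm": 3,
    "generation": 160,
    "minTranslogGeneration": 157,
    "generationToPrimaryTermMapper": {"160": 3, "159": 3, "158": 3, "157": 2}
}
""")

paths = translog_blob_paths(example)
```

For the example values, generation 157 resolves to the directory of primary term 2 and generations 158 to 160 resolve to primary term 3.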

Segments

{
    "CURRENT_VERSION": 1,
    "METADATA_CODEC": "md",
    "generation": 160,
    "metadata": {
        "_0.si": {
            "originalFilename": "_0.si",
            "uploadedFilename": "_0.si__<primary_term>",
            "checksum": "238765",
            "length": 1234,
            "writtenBy": "9.6.0"
        },
        "_1.cfs": {
            "originalFilename": "_1.cfs",
            "uploadedFilename": "_1.cfs__<primary_term>",
            "checksum": "199345",
            "length": 5678,
            "writtenBy": "9.7.0"
        }
    },
    "segmentInfosBytes": "<byte array>",
    "checksum": "c7h5rwdgs423fsdae570s$%dk9",
    "contentLength": 10
}
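
Two of the segment access patterns from section 2, delta recovery and garbage collection of unreferenced files, can be sketched directly against this metadata. The Python below is only an illustration: the function names `files_to_download` and `unreferenced_blobs` are hypothetical, local files are assumed to be described by a (checksum, length) pair, and the uploaded file names use a made-up suffix `__3` in place of the `<primary_term>` placeholder.

```python
def files_to_download(metadata, local_files):
    """Delta recovery: return uploaded names whose local copy is missing or differs.

    `local_files` maps an original filename to a (checksum, length) tuple.
    """
    needed = []
    for name, entry in metadata["metadata"].items():
        if local_files.get(name) != (entry["checksum"], entry["length"]):
            needed.append(entry["uploadedFilename"])
    return sorted(needed)


def unreferenced_blobs(remote_blobs, retained_metadata):
    """Garbage collection: return remote data blobs that no retained metadata file references."""
    referenced = {
        entry["uploadedFilename"]
        for md in retained_metadata
        for entry in md["metadata"].values()
    }
    return sorted(set(remote_blobs) - referenced)


segment_metadata = {
    "generation": 160,
    "metadata": {
        "_0.si": {"uploadedFilename": "_0.si__3", "checksum": "238765", "length": 1234},
        "_1.cfs": {"uploadedFilename": "_1.cfs__3", "checksum": "199345", "length": 5678},
    },
}

# The local shard already holds an identical _0.si but has no _1.cfs.
local = {"_0.si": ("238765", 1234)}
to_fetch = files_to_download(segment_metadata, local)

# _9.cfs__2 exists remotely but is not referenced by any retained metadata file.
remote = ["_0.si__3", "_1.cfs__3", "_9.cfs__2"]
stale = unreferenced_blobs(remote, [segment_metadata])
```

Comparing checksum and length keeps the download set minimal, and computing references across all retained metadata files, rather than only the latest one, avoids deleting blobs that a point-in-time restore may still need.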
