Description
1. Blob Store Directory Structure
|__index-uuid
   |__shard
      |__segments
      |  |__metadata
      |  |  |__<file-prefix>_metadata_<file-gen>_<version>
      |  |__data
      |     |__segments_<N>__<file-gen>
      |     |__<N>.si__<file-gen>
      |     |__<N>.cfe__<file-gen>
      |     |__<N>.cfs__<file-gen>
      |__translogs
         |__metadata
         |  |__<file-prefix>_metadata_<file-gen>_<version>
         |__data
            |__primary-term
               |__translog-<file-gen>.tlog (with checkpoint blob metadata)
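To make the layout concrete, here is a minimal sketch of how blob keys could be assembled from it. The class name `ShardPaths`, its method names, and the use of `/` as the path separator are assumptions made for illustration; only the directory and file name patterns come from the tree above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardPaths:
    """Hypothetical helper that builds blob keys for one shard."""

    index_uuid: str
    shard_id: str

    def _dir(self, kind: str, part: str) -> str:
        # kind is "segments" or "translogs"; part is "metadata" or "data".
        return "/".join([self.index_uuid, self.shard_id, kind, part])

    def segment_data(self, file_name: str, file_gen: int) -> str:
        # Matches <N>.si__<file-gen>, <N>.cfs__<file-gen>, and so on.
        return f"{self._dir('segments', 'data')}/{file_name}__{file_gen}"

    def segment_metadata(self, file_prefix: str, file_gen: int, version: int) -> str:
        # Matches <file-prefix>_metadata_<file-gen>_<version>.
        name = f"{file_prefix}_metadata_{file_gen}_{version}"
        return f"{self._dir('segments', 'metadata')}/{name}"

    def translog_data(self, primary_term: int, file_gen: int) -> str:
        # Translog data files are grouped under their primary term.
        return f"{self._dir('translogs', 'data')}/{primary_term}/translog-{file_gen}.tlog"

    def translog_metadata(self, file_prefix: str, file_gen: int, version: int) -> str:
        name = f"{file_prefix}_metadata_{file_gen}_{version}"
        return f"{self._dir('translogs', 'metadata')}/{name}"


paths = ShardPaths(index_uuid="idx-uuid", shard_id="0")
print(paths.segment_data("_0.si", 7))
print(paths.translog_data(primary_term=3, file_gen=160))
```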
- file-gen : Monotonically increasing file generation. At every relocation/recovery this starts from the last checkpoint.
- version : The version of the metadata file. It is also captured in the file contents, but having it in the name might help if we decide to switch the metadata to another format such as Avro.
- file-prefix : A prefix that speeds up lookups such as finding the latest metadata or the files at a particular point in time. The S3 LIST API is guaranteed to return results in UTF-8 binary sort order, Azure sorts LIST results alphabetically, and GCP also lists in lexicographical order. Based on this, file names can use a format that factors in the timestamp, as shown below. Because the dominant access pattern is fetching the most recently created files first, it is imperative that we coerce the sort order using Long.MAX - term/timestamp, referred to as inverted sort. The alternative is blob versioning, but that gives less control over retaining and searching files by timestamp.
<inverted_primary_term>_<inverted_generation>_<inverted_timestamp>_<file_gen> : [Preferred]
Benefits
i. Fetch latest metadata files in constant time
ii. Fetch data at a particular timestamp using binary search
Other alternatives
i. <inverted_timestamp><file_gen>
ii. <inverted_timestamp><inverted_primary_term><inverted_generation><file_gen>
- Blob metadata : When a file's metadata is smaller than 1 KB, it is more efficient to attach it to the blob as blob metadata (e.g. translog.ckp), which significantly reduces the number of PUT calls.
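The inverted sort used by the preferred file prefix can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the function names are hypothetical, values are zero-padded to 19 digits (the width of Java's `Long.MAX_VALUE`) so that lexicographic order matches numeric order, and timestamps are assumed to increase together with primary term and generation, which is what makes the binary search valid.

```python
from typing import List, Optional

LONG_MAX = 2**63 - 1          # Java's Long.MAX_VALUE
WIDTH = len(str(LONG_MAX))    # 19 digits; fixed width is an assumption


def invert(value: int) -> str:
    """Return Long.MAX - value, zero-padded so newer values sort first."""
    if not 0 <= value <= LONG_MAX:
        raise ValueError("value must fit in a non-negative Java long")
    return str(LONG_MAX - value).zfill(WIDTH)


def file_prefix(primary_term: int, generation: int, timestamp_ms: int, file_gen: int) -> str:
    """Build <inverted_primary_term>_<inverted_generation>_<inverted_timestamp>_<file_gen>."""
    inverted = [invert(v) for v in (primary_term, generation, timestamp_ms)]
    return "_".join(inverted + [str(file_gen)])


def latest(sorted_names: List[str]) -> Optional[str]:
    """With inverted sort, the first entry of an ascending listing is the newest."""
    return sorted_names[0] if sorted_names else None


def newest_at_or_before(sorted_names: List[str], timestamp_ms: int) -> Optional[str]:
    """Binary search for the newest name whose timestamp is <= timestamp_ms."""
    target = invert(timestamp_ms)
    lo, hi = 0, len(sorted_names)
    while lo < hi:
        mid = (lo + hi) // 2
        # Field 2 is the inverted timestamp; smaller means newer.
        if sorted_names[mid].split("_")[2] < target:
            lo = mid + 1
        else:
            hi = mid
    return sorted_names[lo] if lo < len(sorted_names) else None


# A blob store LIST call would return names already sorted like this.
names = sorted([
    file_prefix(1, 5, 1000, 1),
    file_prefix(1, 6, 2000, 2),
    file_prefix(2, 7, 3000, 3),
])
print(latest(names))
print(newest_at_or_before(names, 2500))
```

Fetching the latest file needs only the first result of a LIST call, and the point-in-time lookup touches a logarithmic number of names.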
2. Access Patterns
Translogs
- High append-only writes
- Full recoveries during failovers from the latest file
- Point-in-time restores
- Garbage collect unreferenced files
- Version upgrades using raw translogs
Segments
- New segment writes to remote store
- New segment downloads from remote store
- Full recovery during peer recovery
- Delta recovery during failover
- Point-in-time restores
- Garbage collect unreferenced files
3. Metadata File Formats
Translog
{
  "CURRENT_VERSION": 1,
  "METADATA_CODEC": "md",
  "primaryTerm": 3,
  "generation": 160,
  "minTranslogGeneration": 157,
  "generationToPrimaryTermMapper": {
    "160": 3,
    "159": 3,
    "158": 3,
    "157": 2
  },
  "checksum": "c7h5rwdgs423fsdae570s$%dk9",
  "contentLength": 10
}
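As a sketch of how this metadata could drive a full recovery, the snippet below lists the translog blobs to download, from `minTranslogGeneration` through `generation`, and places each under its primary term. The function name is hypothetical, and it assumes that the translog generation is used as the `<file-gen>` in `translog-<file-gen>.tlog` and that the mapper keys arrive as strings once the file is parsed as JSON.

```python
import json
from typing import List


def translog_files_for_recovery(metadata: dict) -> List[str]:
    """Return relative blob paths, oldest generation first."""
    mapper = metadata["generationToPrimaryTermMapper"]
    paths = []
    for gen in range(metadata["minTranslogGeneration"], metadata["generation"] + 1):
        term = mapper[str(gen)]  # JSON object keys are strings
        paths.append(f"{term}/translog-{gen}.tlog")
    return paths


sample = json.loads("""
{
  "primaryTerm": 3,
  "generation": 160,
  "minTranslogGeneration": 157,
  "generationToPrimaryTermMapper": {"160": 3, "159": 3, "158": 3, "157": 2}
}
""")
print(translog_files_for_recovery(sample))
```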
Segments
{
  "CURRENT_VERSION": 1,
  "METADATA_CODEC": "md",
  "generation": 160,
  "metadata": {
    "_0.si": {
      "originalFilename": "_0.si",
      "uploadedFilename": "_0.si__<primary_term>",
      "checksum": "238765",
      "length": 1234,
      "writtenBy": "9.6.0"
    },
    "_1.cfs": {
      "originalFilename": "_1.cfs",
      "uploadedFilename": "_1.cfs__<primary_term>",
      "checksum": "199345",
      "length": 5678,
      "writtenBy": "9.7.0"
    }
  },
  "segmentInfosBytes": [Byte Array],
  "checksum": "c7h5rwdgs423fsdae570s$%dk9",
  "contentLength": 10
}
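The segment metadata can support both delta recovery and garbage collection. The sketch below is one possible approach under stated assumptions: the function names are hypothetical, local files are compared by checksum only, a file counts as unreferenced when no retained metadata file lists its `uploadedFilename`, and the `__3` suffix in the sample stands in for the `<primary_term>` placeholder.

```python
from typing import Dict, Iterable, List


def files_to_download(remote_metadata: dict, local_checksums: Dict[str, str]) -> List[str]:
    """Delta recovery: remote files that are missing locally or differ by checksum."""
    return sorted(
        entry["uploadedFilename"]
        for name, entry in remote_metadata["metadata"].items()
        if local_checksums.get(name) != entry["checksum"]
    )


def unreferenced_files(remote_listing: Iterable[str], retained_metadata: Iterable[dict]) -> List[str]:
    """Garbage collection: blobs that no retained metadata file refers to."""
    referenced = {
        entry["uploadedFilename"]
        for metadata in retained_metadata
        for entry in metadata["metadata"].values()
    }
    return sorted(set(remote_listing) - referenced)


remote = {
    "generation": 160,
    "metadata": {
        "_0.si": {"uploadedFilename": "_0.si__3", "checksum": "238765"},
        "_1.cfs": {"uploadedFilename": "_1.cfs__3", "checksum": "199345"},
    },
}
print(files_to_download(remote, {"_0.si": "238765"}))
print(unreferenced_files(["_0.si__3", "_1.cfs__3", "_2.si__2"], [remote]))
```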