[Execution State] Avoid creating separate MTrie state during checkpoint creation for about -200GB peak RAM use and -32 minutes duration #2286

Closed
@fxamacker

Description

EDIT: When deployed on August 24, 2022, the PR reduced peak RAM use by over 200GB (out of over 300GB total reduction). The initial estimate of -150GB was based on an older checkpoint file; by August the checkpoint file had grown substantially, so the memory savings were larger. Checkpoint duration is about 16 minutes today (Sep 7), down from 46-58 minutes in mid-August and 11-17 hours in Dec 2021, depending on system load.

Problem

A recent increase in transactions is causing WAL files to be created more frequently, which makes checkpoints happen more often and increases both the checkpoint file size and the size of the ledger state in memory. As a result, checkpointing consumes too much RAM and takes more than twice as long as it did earlier this year.

Date            File Size    Checkpoint Frequency
Early 2022      53 GB        0-2 times per day
July 8, 2022    126 GB       every 2 hours

Without PR #1944, checkpointing would currently be:

  • taking well over 20-30 hours each time, making it impossible to complete every 2 hours
  • requiring more operational RAM, making OOM crashes very frequent
  • creating billions more allocations and GC pressure, consuming CPU cycles and slowing down EN

PR #1944 reduced the MTrie flattening and serialization phase to under 5 minutes (it sometimes took 17 hours on mainnet16). As a result, creating a separate MTrie state now accounts for most of the duration and memory used by checkpointing. This opens up new possibilities, such as reusing the ledger state to significantly reduce the duration and operational RAM of checkpointing again.

Updates epic #1744

The Proposed Solution

We can avoid creating a separate MTrie state during checkpoint creation. This can reduce peak RAM use by roughly 150GB and reduce checkpoint duration by about 24 minutes (estimates based on a snapshot from July 8, 2022). Memory savings will increase over time.

Determine if it's feasible to avoid creating a separate MTrie state during checkpoint creation. If the proof-of-concept doesn't reveal showstoppers, then proceed with a new PR.
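
To illustrate why reusing the ledger state could save memory, here is a minimal toy sketch in Go. It is not flow-go code: `node`, `trie`, `write`, `replay`, `shareRoots`, and `countUnique` are hypothetical names, and it assumes trie nodes are immutable once created and that updates copy only the path they touch. Under those assumptions, a checkpointer that borrows the root pointers of the live tries adds no new nodes, while one that rebuilds a separate state from the same writes duplicates every node.

```go
package main

import "fmt"

// node is a toy immutable trie node (assumption: never modified after creation).
type node struct {
	left, right *node
	payload     string
}

// trie is identified by its root node.
type trie struct{ root *node }

// write is a hypothetical ledger update: a bit path and a payload.
type write struct {
	path    []bool
	payload string
}

// set returns a new root with the leaf at path replaced. Only nodes on the
// path are copied; all other subtrees are shared with the previous version.
func set(n *node, path []bool, payload string) *node {
	if len(path) == 0 {
		return &node{payload: payload}
	}
	c := &node{}
	if n != nil {
		*c = *n // shallow copy keeps pointers to untouched subtrees
	}
	if path[0] {
		c.right = set(c.right, path[1:], payload)
	} else {
		c.left = set(c.left, path[1:], payload)
	}
	return c
}

// replay applies writes one by one and returns every intermediate trie,
// loosely mimicking how a state is rebuilt from a log of updates.
func replay(ws []write) []trie {
	var out []trie
	cur := trie{}
	for _, w := range ws {
		cur = trie{root: set(cur.root, w.path, w.payload)}
		out = append(out, cur)
	}
	return out
}

// shareRoots copies only the slice of root pointers, not the nodes.
func shareRoots(live []trie) []trie {
	return append([]trie(nil), live...)
}

// countUnique counts distinct nodes reachable from the given tries.
func countUnique(tries []trie) int {
	seen := map[*node]struct{}{}
	var walk func(n *node)
	walk = func(n *node) {
		if n == nil {
			return
		}
		if _, ok := seen[n]; ok {
			return
		}
		seen[n] = struct{}{}
		walk(n.left)
		walk(n.right)
	}
	for _, t := range tries {
		walk(t.root)
	}
	return len(seen)
}

func main() {
	ws := []write{
		{[]bool{false, false}, "a"},
		{[]bool{false, true}, "b"},
		{[]bool{true, false}, "c"},
	}
	live := replay(ws)         // in-memory ledger state
	shared := shareRoots(live) // checkpointer reuses the live tries
	separate := replay(ws)     // checkpointer rebuilds its own copy

	fmt.Println("live only:       ", countUnique(live))
	fmt.Println("live + shared:   ", countUnique(append(shareRoots(live), shared...)))
	fmt.Println("live + separate: ", countUnique(append(shareRoots(live), separate...)))
}
```

In this toy model the node count with shared roots equals the count for the live state alone, while the separately rebuilt state doubles it. Whether the real ledger can be reused this way is exactly what the proof-of-concept needs to establish.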
