STAR-13 UnifiedCompactionStrategy #132

Merged: 1 commit, Jul 15, 2021

@Gerrrr Gerrrr commented Apr 28, 2021

dtests PR

Ported changes are taken from:

  • DB-4422 includes the majority of commits. It might be useful to compare this PR with the internal PRs.
  • DB-4762
  • DB-4823
  • DB-4853
  • DB-4764
  • DB-4759
  • DB-4640
  • DB-4711 (got merged into STAR-13 Implement compaction shards )
  • isTrulyWrapAround from APOLLO-6

Side changes:

@Gerrrr Gerrrr force-pushed the STAR-13-UCS branch 5 times, most recently from 6baf229 to bb8a8bd on April 30, 2021 14:35
@Gerrrr Gerrrr force-pushed the STAR-13-UCS branch 2 times, most recently from 689cbc5 to ed83b73 on May 19, 2021 11:59
@Gerrrr Gerrrr force-pushed the STAR-13-UCS branch 4 times, most recently from ee2c9c5 to eb984f9 on May 31, 2021 15:50
@Gerrrr Gerrrr force-pushed the STAR-13-UCS branch 6 times, most recently from 2f710d7 to 78ddd39 on June 9, 2021 08:59
@Gerrrr Gerrrr requested a review from blambov June 9, 2021 10:54

#### UCS-tiered vs STCS

SizeTieredCompactionStrategy is pretty close to UCS. However, it defines buckets/levels by looking for sstables of similar size. This can result in some odd selections of buckets, possibly spanning sstables of wildly different sizes, while UCS's selection is more stable and predictable.

Something I recently realized: this STCS bucketing can result in stranded sstables, i.e. sstables that end up in a bucket that is not part of the normal progression (e.g. at a size like m*8). They won't be compacted and can cause all sorts of inefficiencies (increased RA, as well as problems with tombstones, the time-ordered read path, etc.).

@blambov blambov changed the base branch from ds-trunk to STAR-13-base June 16, 2021 08:27

@blambov blambov left a comment

Second batch

blambov commented Jun 16, 2021

I had to change the base to continue the review as the ds-trunk rebase brought in a lot of other changes.

@blambov blambov left a comment

Third batch, the rest of the files except UCS itself and tests.

@@ -86,6 +86,9 @@ public void handleFSError(FSError e)
Keyspace.removeUnreadableSSTables(directory);
}
break;
case die:

This feels like the wrong place to have a fix like this. This also applies to the bloom filter tracking changes and anything else that is a valuable fix on its own.
Do we have separate tickets for them? Let's make sure we have and prioritize them so that they are committed to OSS before we propose this. (To clarify, I do not want to remove any of these from this patch.)

@blambov blambov left a comment

This concludes my first pass over the code (I haven't had a chance to look at any tests, or any of the updates yet).

Looks like I have created a lot of extra work... Let me know if you would like me to take some of that load.

jacek-lewandowski pushed a commit that referenced this pull request Apr 19, 2022
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy: to encapsulate the behavior of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.
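
The shard assignment described above can be sketched roughly as follows. This is an illustration only, not the code from this patch: it assumes tokens are plain `long` values in a single contiguous local range, and the names `ShardSketch`, `boundaries`, and `shardOf` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: split one local token range into equal shards and
// assign an sstable to a shard by the token of its first partition key.
final class ShardSketch
{
    // Returns the (numShards - 1) internal boundaries of the range [min, max).
    static long[] boundaries(long min, long max, int numShards)
    {
        long[] result = new long[numShards - 1];
        double width = ((double) max - (double) min) / numShards;
        for (int i = 1; i < numShards; i++)
            result[i - 1] = min + (long) (width * i);
        return result;
    }

    // Index of the shard containing firstToken. A token equal to a boundary
    // belongs to the shard that starts at that boundary.
    static int shardOf(long[] boundaries, long firstToken)
    {
        int pos = Arrays.binarySearch(boundaries, firstToken);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

With a range of [0, 100) and 4 shards, the sketch places boundaries at 25, 50 and 75, so an sstable whose first token is 60 lands in shard 2 regardless of where its last token falls.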

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.
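
A minimal sketch of that split rule, assuming partitions arrive in token order and sizes are tracked in bytes. `SplitDecisionSketch` and its members are hypothetical names, not the specialized writer introduced by this patch.

```java
// Hypothetical sketch: decides when to start a new sstable while partitions
// are appended in token order.
final class SplitDecisionSketch
{
    private final long[] boundaries;    // sorted shard boundaries (tokens)
    private final long minSSTableBytes; // assumed minimum sstable size option
    private int nextBoundary = 0;
    private long bytesInCurrent = 0;

    SplitDecisionSketch(long[] boundaries, long minSSTableBytes)
    {
        this.boundaries = boundaries.clone();
        this.minSSTableBytes = minSSTableBytes;
    }

    // Returns true if a new sstable is started before writing this partition.
    boolean append(long token, long partitionBytes)
    {
        boolean crossed = false;
        while (nextBoundary < boundaries.length && token >= boundaries[nextBoundary])
        {
            nextBoundary++;
            crossed = true;
        }
        // Split only at a boundary, and only if the current sstable is
        // already larger than the minimum size.
        boolean split = crossed && bytesInCurrent > minSSTableBytes;
        if (split)
            bytesInCurrent = 0;
        bytesInCurrent += partitionBytes;
        return split;
    }
}
```

In this sketch a small sstable simply continues past the boundary, which is how a single sstable can end up spanning several shards.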

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.
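
For reference, plain reservoir sampling looks roughly like the sketch below. It shows only the uniform selection of up to `k` candidates in one pass; the "non-oversized alternative" bookkeeping mentioned above is not modeled, and `ReservoirSketch` is a hypothetical name rather than a class from this patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of reservoir sampling (Algorithm R).
final class ReservoirSketch
{
    // Picks up to k elements uniformly at random from candidates in one pass.
    static <T> List<T> sample(Iterable<T> candidates, int k, Random random)
    {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T candidate : candidates)
        {
            seen++;
            if (reservoir.size() < k)
            {
                reservoir.add(candidate);
            }
            else
            {
                // The i-th candidate replaces a kept one with probability k / i.
                int slot = random.nextInt(seen); // uniform in [0, seen)
                if (slot < k)
                    reservoir.set(slot, candidate);
            }
        }
        return reservoir;
    }
}
```

The appeal of this approach is that the number of eligible shards does not need to be known up front, since every candidate ends up selected with the same probability.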

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code from CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards spanned for the purposes of deciding in which level to put it;
    - the levels hierarchy starts at the average flush size for the table;
    - selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.
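
The first two points can be illustrated with a small sketch. It assumes that each level covers a size range `fanout` times larger than the previous one; that growth factor, and the names `LevelSketch` and `levelOf`, are assumptions made for illustration and are not taken from this patch.

```java
// Hypothetical sketch: pick the level for an sstable from its per-shard size,
// with level 0 starting at the average flush size.
final class LevelSketch
{
    static int levelOf(long sizeBytes, int shardsSpanned, long avgFlushBytes, int fanout)
    {
        if (avgFlushBytes <= 0 || fanout < 2)
            throw new IllegalArgumentException("need avgFlushBytes > 0 and fanout >= 2");

        // A shard-spanning sstable counts as its size divided by the shards it spans.
        long effective = sizeBytes / Math.max(1, shardsSpanned);

        // Integer form of floor(log_fanout(effective / avgFlushBytes)), clamped at 0.
        long ratio = effective / avgFlushBytes;
        int level = 0;
        while (ratio >= fanout)
        {
            ratio /= fanout;
            level++;
        }
        return level;
    }
}
```

Under these assumptions, with a 100-byte average flush and a fanout of 4, a 1600-byte sstable in one shard sits on level 2, while the same sstable spanning 4 shards is treated as 400 bytes and sits on level 1.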

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling time, which caused many runs to be scheduled, none of which checked whether another had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions
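
A sample-based exponential moving average can be as small as the sketch below. The smoothing factor of `1 / periodInSamples` and the class name `SampleEmaSketch` are assumptions for illustration; the patch may choose its constants differently.

```java
// Hypothetical sketch: exponential moving average whose averaging period is
// expressed as a number of updates rather than as wall-clock time.
final class SampleEmaSketch
{
    private final double alpha;
    private double average;
    private boolean initialized;

    SampleEmaSketch(int periodInSamples)
    {
        if (periodInSamples < 1)
            throw new IllegalArgumentException("periodInSamples must be >= 1");
        this.alpha = 1.0 / periodInSamples; // assumed smoothing factor
    }

    // Folds one new sample into the average and returns the updated value.
    double update(double sample)
    {
        if (!initialized)
        {
            average = sample; // seed with the first sample
            initialized = true;
        }
        else
        {
            average += alpha * (sample - average);
        }
        return average;
    }
}
```

No timestamps are needed, which is what makes this form simpler to get right; the tradeoff, as noted above, is that the effective window depends on how often updates arrive.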

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.
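
As a rough sketch of that split: each level receives the integer part of threads / levels, and the leftover threads are handed out at random. How the remainder is actually distributed is not spelled out here, so the uniform random choice below, like the name `ThreadShareSketch`, is an assumption.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: reserve an equal whole share of compaction threads
// per level and distribute the remainder randomly.
final class ThreadShareSketch
{
    static int[] perLevelLimits(int threads, int levels, Random random)
    {
        if (levels < 1)
            throw new IllegalArgumentException("levels must be >= 1");
        int[] limits = new int[levels];
        Arrays.fill(limits, threads / levels);   // reserved share per level
        for (int i = 0; i < threads % levels; i++)
            limits[random.nextInt(levels)]++;    // leftover threads, random level
        return limits;
    }
}
```

With 10 threads and 4 levels, every level is guaranteed 2 threads and the remaining 2 go to randomly chosen levels, so a long-running top-level compaction cannot occupy the threads reserved for level 0.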

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when single compactions are
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level sstables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.
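
The admission check could look something like the sketch below. It assumes that `max_space_overhead` is a fraction of the dataset size and that the limit is compared against the total bytes of running compactions plus the candidate; both the interpretation and the names `SpaceOverheadSketch` and `admit` are illustrative assumptions.

```java
// Hypothetical sketch: admit a compaction only if the aggregate size of
// running compactions stays within the allowed space overhead.
final class SpaceOverheadSketch
{
    static boolean admit(long runningBytes, long candidateBytes,
                         long datasetBytes, double maxSpaceOverhead)
    {
        long limit = (long) (datasetBytes * maxSpaceOverhead);
        // A single compaction above the limit is rejected too, even when
        // nothing else is running.
        return runningBytes + candidateBytes <= limit;
    }
}
```

Under these assumptions, a 1000-byte dataset with an overhead of 0.2 allows at most 200 bytes of compactions in flight, so a lone 300-byte compaction would be rejected and would presumably surface as the warning mentioned above.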

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals
of recent metrics return 0. Tracking is done at CFS instance instead of
per-SSTableReader to reduce overhead. SSTableReaders set correct
BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
jacek-lewandowski pushed a commit that referenced this pull request May 26, 2022
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
jacek-lewandowski pushed a commit that referenced this pull request May 27, 2022
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy: to encapsulate the behavior of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.
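
The shard assignment described above can be sketched as follows. This is a minimal illustration, not the code in this patch: it assumes tokens are plain `long` values and that the node owns a single contiguous range, and `ShardSketch`, `shardBoundaries` and `shardFor` are hypothetical names.

```java
/** Hypothetical sketch: tokens are plain longs and the node owns one contiguous range. */
final class ShardSketch
{
    /** Returns the exclusive upper boundary of each shard; the last one equals rangeEnd. */
    static long[] shardBoundaries(long rangeStart, long rangeEnd, int numShards)
    {
        if (numShards < 1 || rangeEnd <= rangeStart)
            throw new IllegalArgumentException("need numShards >= 1 and a non-empty range");
        long width = rangeEnd - rangeStart; // assumes width * numShards fits in a long
        long[] boundaries = new long[numShards];
        for (int i = 0; i < numShards; i++)
            boundaries[i] = rangeStart + width * (i + 1) / numShards;
        return boundaries;
    }

    /** Assigns an sstable to a shard by the token of its first partition key. */
    static int shardFor(long[] boundaries, long firstToken)
    {
        for (int i = 0; i < boundaries.length; i++)
            if (firstToken < boundaries[i])
                return i;
        return boundaries.length - 1; // tokens at or past the end fall into the last shard
    }
}
```

With a range of [0, 100) and 4 shards, the boundaries are 25, 50, 75 and 100, so an sstable whose first token is 25 lands in the second shard.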

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.
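
The splitting rule can be sketched roughly as below, again with `long` tokens and hypothetical names (`ShardedWriterSketch`, `write`). The sketch only tracks output sizes; it is an illustration of the rule as described here, not the writer added by this patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a writer that switches output at shard boundaries. */
final class ShardedWriterSketch
{
    /**
     * tokens must be ascending; boundaries are the ascending, exclusive upper bounds of the shards.
     * Returns the size of each sstable that would be written.
     */
    static List<Long> write(long[] tokens, long[] partitionSizes, long[] boundaries, long minSstableSize)
    {
        List<Long> outputs = new ArrayList<>();
        long currentSize = 0;
        int shard = 0;
        for (int i = 0; i < tokens.length; i++)
        {
            // Advance to the shard that contains this token, noting whether a boundary was crossed.
            boolean crossed = false;
            while (shard < boundaries.length - 1 && tokens[i] >= boundaries[shard])
            {
                shard++;
                crossed = true;
            }
            // Split only if the current sstable is already larger than the minimum size.
            if (crossed && currentSize > minSstableSize)
            {
                outputs.add(currentSize);
                currentSize = 0;
            }
            currentSize += partitionSizes[i];
        }
        if (currentSize > 0)
            outputs.add(currentSize);
        return outputs;
    }
}
```

Note that a boundary crossed while the current sstable is still at or below the minimum size does not cause a split, so small sstables may span more than one shard.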

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.
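
For reference, the reservoir sampling part can be sketched with the classic single-pass algorithm below. This is an assumption about the general technique only: `ReservoirSketch` and `sample` are hypothetical names, and the "non-oversized alternative" bookkeeping mentioned above is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: pick at most `limit` candidates uniformly at random in one pass. */
final class ReservoirSketch
{
    static <T> List<T> sample(Iterable<T> candidates, int limit, Random random)
    {
        if (limit < 0)
            throw new IllegalArgumentException("limit must be non-negative");
        List<T> reservoir = new ArrayList<>(limit);
        int seen = 0;
        for (T candidate : candidates)
        {
            seen++;
            if (reservoir.size() < limit)
            {
                reservoir.add(candidate); // fill the reservoir first
            }
            else
            {
                // Replace a random slot with probability limit / seen.
                int slot = random.nextInt(seen);
                if (slot < limit)
                    reservoir.set(slot, candidate);
            }
        }
        return reservoir;
    }
}
```

Each candidate ends up in the result with the same probability, regardless of the order in which the shards are visited.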

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code from CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock so that it can be passed to and used by mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards it spans when deciding which level to put it in;
    - the level hierarchy starts at the average flush size for the table;
    - the tasks to run are selected randomly, to give each level and shard an equal chance to run its compaction;
    - shard-spanning compactions are given more chances to be selected.
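
The first two points can be illustrated with the sketch below. It assumes, as a simplification not stated in this message, that level boundaries grow geometrically by a fan factor F starting from the average flush size; `LevelSketch` and `levelOf` are hypothetical names.

```java
/** Hypothetical sketch: levels grow geometrically by fanFactor, starting at the average flush size. */
final class LevelSketch
{
    static int levelOf(long totalSize, int shardsSpanned, long avgFlushSize, int fanFactor)
    {
        if (shardsSpanned < 1 || avgFlushSize < 1 || fanFactor < 2)
            throw new IllegalArgumentException("invalid arguments");
        // A shard-spanning compaction is judged by its size per shard, not its total size.
        double perShard = (double) totalSize / shardsSpanned;
        int level = 0;
        double upper = (double) avgFlushSize * fanFactor; // level 0 covers sizes below flush * F
        while (perShard >= upper)
        {
            level++;
            upper *= fanFactor;
        }
        return level;
    }
}
```

With an average flush size of 100 and F = 4, a 1600-byte compaction confined to one shard falls in level 2, while the same 1600 bytes spread over 4 shards is treated as 400 bytes per shard and falls in level 1.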

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling time, which caused many calls to be scheduled, and none of them checked whether another had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions
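
A sample-based exponential moving average can be sketched as below. The smoothing factor of 1 / period is an assumption made for illustration, and `SampleEma` is a hypothetical name; the point is that the average is updated once per sample and never consults a clock.

```java
/** Hypothetical sketch: an exponential moving average whose period is counted in updates. */
final class SampleEma
{
    private final double alpha;
    private double average;
    private boolean initialized;

    SampleEma(int periodInUpdates)
    {
        if (periodInUpdates < 1)
            throw new IllegalArgumentException("period must be at least 1 update");
        this.alpha = 1.0 / periodInUpdates; // assumed smoothing factor
    }

    /** Folds one sample into the average and returns the new value. */
    double update(double sample)
    {
        if (!initialized)
        {
            average = sample; // the first sample seeds the average
            initialized = true;
        }
        else
        {
            average += alpha * (sample - average);
        }
        return average;
    }

    double get()
    {
        return average;
    }
}
```

With a period of 2 updates, feeding the samples 10, 20, 20 yields 10, 15 and 17.5.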

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.
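
The reservation arithmetic can be sketched as follows. `ThreadQuotaSketch` and `threadsPerLevel` are hypothetical names, and handing each leftover thread to a randomly chosen level is a simplification of "distributed randomly".

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: reserve floor(threads / levels) per level, spread the remainder randomly. */
final class ThreadQuotaSketch
{
    static int[] threadsPerLevel(int threads, int levels, Random random)
    {
        if (levels < 1 || threads < 0)
            throw new IllegalArgumentException("need levels >= 1 and threads >= 0");
        int[] quota = new int[levels];
        Arrays.fill(quota, threads / levels); // the whole part is reserved for every level
        for (int i = 0; i < threads % levels; i++)
            quota[random.nextInt(levels)]++;  // leftover threads go to random levels
        return quota;
    }
}
```

With 10 compaction threads and 4 levels, every level is guaranteed 2 threads and the remaining 2 are handed out at random, so a long-running task on one level cannot take all 10.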

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when single compactions are
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level tables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals
of recent metrics return 0. Tracking is done per CFS instance instead of
per SSTableReader to reduce overhead. SSTableReaders set the correct
BloomFilterTracker in setupOnline. Before the correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
jacek-lewandowski pushed a commit that referenced this pull request Oct 17, 2022
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
jacek-lewandowski pushed a commit that referenced this pull request Oct 18, 2022
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy: to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code from CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock so that it can be passed to and used by mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards it spans when deciding which level to put it in;
    - the level hierarchy starts at the average flush size for the table;
    - the tasks to run are selected randomly, to give each level and shard an equal chance to run its compaction;
    - shard-spanning compactions are given more chances to be selected.

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling time, which caused many calls to be scheduled, and none of them checked whether another had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when single compactions are
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level tables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals
of recent metrics return 0. Tracking is done at CFS instance instead of
per-SSTableReader to reduce overhead. SSTableReaders set correct
BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
mfleming pushed a commit that referenced this pull request Jul 10, 2023
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy : to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.
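The shard assignment described above can be illustrated with a minimal sketch. It assumes tokens are plain `long` values and that the local range is a single contiguous interval; the class and method names are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;

/** Hypothetical helper, not the actual UCS code. Tokens are modelled as plain longs. */
final class ShardAssignmentSketch
{
    /**
     * Splits [start, end) into numShards parts of equal width and returns the
     * start token of every shard except the first. Assumes end - start does not overflow.
     */
    static long[] computeBoundaries(long start, long end, int numShards)
    {
        long[] boundaries = new long[numShards - 1];
        long width = (end - start) / numShards;
        for (int i = 1; i < numShards; i++)
            boundaries[i - 1] = start + i * width;
        return boundaries;
    }

    /** Returns the index of the shard that contains the sstable's first token. */
    static int shardFor(long firstToken, long[] boundaries)
    {
        int pos = Arrays.binarySearch(boundaries, firstToken);
        // Exact match: the token is the first token of shard pos + 1.
        // Otherwise binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

Only the first token is consulted, so an sstable that spans a boundary is still attributed to the shard where it starts.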

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.
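The last rule in the list above can be sketched as follows. This is an illustration only: the names are hypothetical, and the bucket is modelled as a plain list rather than the strategy's own types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of the "compact the whole bucket" rule. */
final class BucketSelectionSketch
{
    /**
     * Returns every sstable in the bucket once the bucket holds at least
     * `threshold` (T) sstables, and nothing otherwise. The selection is not
     * capped at T or F.
     */
    static <S> List<S> selectForCompaction(List<S> bucket, int threshold)
    {
        if (bucket.size() < threshold)
            return Collections.emptyList();
        return new ArrayList<>(bucket);
    }
}
```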

The documentation for the unified strategy has been added as a markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.
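For readers unfamiliar with reservoir sampling, here is a minimal sketch of the generic technique (Algorithm R): it picks at most `limit` items uniformly at random from a stream of unknown length in one pass. It does not reproduce the patch's handling of a non-oversized alternative, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Generic reservoir sampling (Algorithm R); not the actual UCS code. */
final class ReservoirSamplingSketch
{
    static <T> List<T> sample(Iterable<T> candidates, int limit, Random random)
    {
        List<T> reservoir = new ArrayList<>(limit);
        int seen = 0;
        for (T candidate : candidates)
        {
            seen++;
            if (reservoir.size() < limit)
            {
                // Fill the reservoir with the first `limit` candidates.
                reservoir.add(candidate);
            }
            else
            {
                // Replace an existing entry with probability limit / seen.
                int slot = random.nextInt(seen);
                if (slot < limit)
                    reservoir.set(slot, candidate);
            }
        }
        return reservoir;
    }
}
```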

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code in CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of shard-spanning compactions is divided by the number of shards spanned for the purposes of deciding in which level to put them;
    - the levels hierarchy starts at the average flush size for the table;
    - selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.
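The first two points can be pictured with a small sketch. It assumes, purely for illustration, that level boundaries grow geometrically by a fan factor starting from the average flush size; the commit message does not state the exact formula, and every name below is hypothetical.

```java
/** Hypothetical level calculation; the geometric growth by fanFactor is an assumption. */
final class LevelSelectionSketch
{
    /**
     * Level 0 holds per-shard sizes below avgFlushSizeBytes * fanFactor,
     * level 1 holds sizes up to the next multiple of fanFactor, and so on.
     * A shard-spanning size is first divided by the number of shards spanned.
     */
    static int levelFor(long sizeBytes, int shardsSpanned, long avgFlushSizeBytes, int fanFactor)
    {
        double perShardSize = (double) sizeBytes / shardsSpanned;
        double upperBound = (double) avgFlushSizeBytes * fanFactor;
        int level = 0;
        while (perShardSize >= upperBound)
        {
            upperBound *= fanFactor;
            level++;
        }
        return level;
    }
}
```

Under these assumptions a 1600-byte compaction spanning four shards lands in the same level as a 400-byte compaction confined to one shard.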

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked if one didn't run before;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions
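A sample-based exponential moving average can be as small as the sketch below. The class name and the choice of smoothing factor `2 / (N + 1)` are assumptions made for illustration, not details taken from the patch.

```java
/** Hypothetical sample-based EMA; the averaging period is a number of updates, not a time span. */
final class SampleBasedMovingAverageSketch
{
    private final double alpha;
    private double average;
    private boolean initialized;

    SampleBasedMovingAverageSketch(int samples)
    {
        // Common smoothing factor for an N-sample EMA (an assumption here).
        this.alpha = 2.0 / (samples + 1);
    }

    void update(double sample)
    {
        if (!initialized)
        {
            // Seed with the first sample so the average does not start from zero.
            average = sample;
            initialized = true;
        }
        else
        {
            average += alpha * (sample - average);
        }
    }

    double get()
    {
        return average;
    }
}
```

Because each update carries the same weight regardless of when it arrives, no clock or decay timer is needed, which is the simplification the commit message refers to.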

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.
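The reservation arithmetic can be sketched as below. Handing each leftover thread to a randomly chosen level is a simplification assumed for this example, and the names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of per-level thread reservation. */
final class ThreadReservationSketch
{
    /**
     * Reserves floor(threads / levels) threads for every level and hands
     * each of the remaining threads to a randomly chosen level.
     */
    static int[] distribute(int threads, int levels, Random random)
    {
        int[] perLevel = new int[levels];
        Arrays.fill(perLevel, threads / levels);
        int remainder = threads % levels;
        for (int i = 0; i < remainder; i++)
            perLevel[random.nextInt(levels)]++;
        return perLevel;
    }
}
```

With 10 threads and 4 levels, every level is guaranteed 2 threads and the other 2 are placed at random. With fewer threads than levels, nothing is reserved and all threads are placed at random.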

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when single compactions are
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level tables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.
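A limit on the aggregate size of running compactions might look like the following sketch. The class and its methods are hypothetical, and the real code presumably also logs the warning mentioned above.

```java
/** Hypothetical admission check based on the aggregate size of running compactions. */
final class SpaceOverheadLimitSketch
{
    private final long maxSpaceOverheadBytes;
    private long runningBytes;

    SpaceOverheadLimitSketch(long maxSpaceOverheadBytes)
    {
        this.maxSpaceOverheadBytes = maxSpaceOverheadBytes;
    }

    /** Admits the compaction only if the running total stays within the limit. */
    synchronized boolean tryStart(long compactionBytes)
    {
        // A single compaction larger than the limit is rejected as well.
        if (runningBytes + compactionBytes > maxSpaceOverheadBytes)
            return false;
        runningBytes += compactionBytes;
        return true;
    }

    /** Releases the space accounted for a finished compaction. */
    synchronized void finish(long compactionBytes)
    {
        runningBytes -= compactionBytes;
    }
}
```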

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals
of recent metrics return 0. Tracking is done at CFS instance instead of
per-SSTableReader to reduce overhead. SSTableReaders set correct
BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
djatnieks pushed a commit that referenced this pull request Jul 24, 2023
djatnieks pushed a commit that referenced this pull request Aug 21, 2023
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy : to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.
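
As a rough illustration of the shard assignment described above, here is a minimal Python sketch (not the Java code in this patch). It assumes a single contiguous local range with integer tokens, and that a boundary token belongs to the shard that starts at it; `shard_boundaries` and `shard_for` are hypothetical names.

```python
from bisect import bisect_right

def shard_boundaries(range_start, range_end, num_shards):
    """Split [range_start, range_end) evenly and return the interior boundaries."""
    width = range_end - range_start
    return [range_start + width * i // num_shards for i in range(1, num_shards)]

def shard_for(first_token, boundaries):
    """Index of the shard that contains the sstable's first token."""
    return bisect_right(boundaries, first_token)

boundaries = shard_boundaries(0, 1000, 4)
# Assign four sstables to shards using only their first token.
shards = [shard_for(token, boundaries) for token in (10, 250, 499, 900)]
```

Only the first key is consulted, so an sstable that spans a boundary is still counted in the shard where it starts.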

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.
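
The splitting rule can be sketched as follows. This is an illustrative Python model under stated assumptions, not the specialized writer itself: partitions arrive in token order as `(token, size)` pairs, and `split_at_boundaries` is a hypothetical helper name.

```python
def split_at_boundaries(partitions, boundaries, min_sstable_size):
    """Group partitions into output sstables, switching to a new sstable at a
    shard boundary only if the current one is larger than min_sstable_size."""
    sstables, current, current_size = [], [], 0
    next_boundary = 0
    for token, size in partitions:
        crossed = False
        # Advance past every boundary that this token has reached.
        while next_boundary < len(boundaries) and token >= boundaries[next_boundary]:
            next_boundary += 1
            crossed = True
        if crossed and current and current_size > min_sstable_size:
            sstables.append(current)
            current, current_size = [], 0
        current.append(token)
        current_size += size
    if current:
        sstables.append(current)
    return sstables

data = [(10, 5), (100, 5), (260, 5), (300, 5), (510, 5), (800, 5)]
split = split_at_boundaries(data, [250, 500, 750], min_sstable_size=8)
unsplit = split_at_boundaries(data, [250, 500, 750], min_sstable_size=100)
```

In the first call the last sstable spans two shards, because it was still below the minimum size when the boundary at 750 was reached.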

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.
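
A minimal sketch of the last rule, in Python for illustration only. `pick_bucket` is a hypothetical name and `threshold` stands in for T.

```python
def pick_bucket(bucket, threshold):
    """Return the sstables to compact from one bucket, or [] if it is below T."""
    if len(bucket) >= threshold:
        return list(bucket)  # take everything, not just the first T sstables
    return []

picked = pick_bucket(["a", "b", "c", "d", "e", "f"], threshold=4)
skipped = pick_bucket(["a", "b", "c"], threshold=4)
```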

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.
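
For readers unfamiliar with reservoir sampling, the textbook form (Algorithm R) is sketched below in Python. It shows only the uniform selection of at most `k` candidates from a stream of unknown length; the "non-oversized alternative" bookkeeping mentioned above is not modeled, and `reservoir_sample` is a hypothetical name.

```python
import random

def reservoir_sample(candidates, k, rng):
    """Pick up to k items uniformly at random from an iterable in one pass."""
    reservoir = []
    for i, item in enumerate(candidates):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100), 5, random.Random(0))
small = reservoir_sample(range(3), 5, random.Random(0))
```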

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code from CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards spanned for the purposes of deciding in which level to put it;
    - the levels hierarchy starts at the average flush size for the table;
    - selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.

    Additionally, this applies a couple of fixes:
    - getNextCompaction ran too many times: the decision whether to run it was made at scheduling time, so many runs were scheduled, and none of them checked whether another had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates
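
The level assignment for shard-spanning compactions can be sketched as below. This is a Python illustration under assumptions that the commit message does not state: levels are taken to grow geometrically by a fixed `fanout`, with level 0 covering sizes below `flush_size * fanout`. `level_for` is a hypothetical name.

```python
def level_for(total_size, shards_spanned, flush_size, fanout):
    """Level for a compaction, judged by its size per spanned shard."""
    per_shard = total_size // shards_spanned
    level, upper = 0, flush_size * fanout
    # Integer arithmetic avoids rounding surprises from floating-point logs.
    while per_shard >= upper:
        level += 1
        upper *= fanout
    return level

spanning = level_for(1600, shards_spanned=4, flush_size=100, fanout=4)
single = level_for(1600, shards_spanned=1, flush_size=100, fanout=4)
```

The same 1600-byte compaction lands one level lower when it spans four shards, which is the effect the first bullet describes.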

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions
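
A sample-based exponential moving average can be sketched in a few lines of Python. The smoothing factor `1 / period_in_updates` is an assumption made for illustration; the patch may derive it differently. `SampleEMA` is a hypothetical name.

```python
class SampleEMA:
    """Exponential moving average whose period is a number of updates, not a time span."""

    def __init__(self, period_in_updates):
        self.alpha = 1.0 / period_in_updates
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = float(sample)  # the first sample seeds the average
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value

ema = SampleEMA(period_in_updates=4)
readings = [ema.update(rate) for rate in (100, 200, 200)]
```

No timestamps are needed, which is what makes this form simple to get right; the cost is that the average decays per update rather than per second.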

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 (or any other
    level) by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the integer part of the ratio of compaction threads
    to number of levels is reserved for each level, and any remainder is distributed
    randomly as before.
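
    The reservation arithmetic can be illustrated with a short Python sketch. It models only the thread counts, not the executor, and `allocate_threads` is a hypothetical name.

```python
import random

def allocate_threads(num_threads, num_levels, rng):
    """Reserve floor(threads / levels) per level, then hand out the remainder at random."""
    reserved = num_threads // num_levels
    allocation = [reserved] * num_levels
    for _ in range(num_threads % num_levels):
        allocation[rng.randrange(num_levels)] += 1
    return allocation

uneven = allocate_threads(10, 4, random.Random(1))
even = allocate_threads(8, 4, random.Random(1))
```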

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when a single compaction is
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level sstables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.
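
    The aggregate-size limit described above can be sketched as follows, in Python for illustration. Sizes are plain numbers, `limit` stands in for the space budget, and `admit` is a hypothetical name; how the real limit is derived from the configured option is not modeled.

```python
def admit(candidate_sizes, running_total, limit):
    """Admit candidates while the total size of running compactions stays within limit."""
    admitted = []
    for size in candidate_sizes:
        if running_total + size > limit:
            continue  # rejected, even if this single compaction alone exceeds the limit
        admitted.append(size)
        running_total += size
    return admitted

accepted = admit([50, 200, 30, 40], running_total=0, limit=100)
```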

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation where subsequent retrievals
of recent metrics return 0. Tracking is done per CFS instance instead of
per SSTableReader to reduce overhead. SSTableReaders set the correct
BloomFilterTracker in setupOnline. Until the correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)
djatnieks pushed a commit that referenced this pull request Sep 12, 2023

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager

(cherry picked from commit 8c7c128)
djatnieks pushed a commit that referenced this pull request Jan 16, 2024
This is the implementation of UnifiedCompactionStrategy, whish is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy : to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to enacapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in an bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.

The documentation for the unified strategy has been added as a mark down
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformally chosen from the available shards.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair related codes in CompactionStrategyContainer, PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - shard-spanning compactions's size is divided by the number of shards spanned for the purposes of deciding in which level to put them;
    - the levels hierarchy starts at the average flush size for the table;
    - selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked if one didn't run before;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions
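
A sample-based exponential moving average can be sketched as below. This is an illustration only: the class name is hypothetical, and the choice of `alpha = 1 / periodInSamples` is an assumption (other conventions exist), not necessarily what backgroundCompactions uses.

```java
/** Hypothetical sketch: an average whose period is counted in updates, not in time. */
final class SampleBasedEma
{
    private final double alpha;   // weight given to the newest sample, in (0, 1]
    private double average;
    private boolean initialized;

    SampleBasedEma(int periodInSamples)
    {
        if (periodInSamples < 1)
            throw new IllegalArgumentException("period must be at least 1 sample");
        this.alpha = 1.0 / periodInSamples;   // assumption, see the note above
    }

    /** Moves the average a fixed fraction of the way toward the new sample. */
    void update(double sample)
    {
        if (!initialized)
        {
            average = sample;     // the first sample seeds the average
            initialized = true;
        }
        else
        {
            average += alpha * (sample - average);
        }
    }

    double get()
    {
        return average;
    }
}
```

No timestamps are involved, so there is no decay-over-elapsed-time logic to get wrong. The cost is that the averaging window shrinks or stretches in wall-clock terms with the update frequency.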

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0, or any other
    level, by reserving compaction threads for each level of the hierarchy. More
    specifically, the integer part of the ratio of compaction threads to the number of
    levels is reserved for each level, and any remainder is distributed randomly as before.
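
The reservation arithmetic can be sketched as follows. The names are hypothetical, and handing each spare thread to a random level is only one possible reading of "distributed randomly"; the patch itself may pick among pending tasks instead.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of per-level thread reservation. */
final class LevelThreadReservation
{
    /** Returns how many compaction threads each level may use. */
    static int[] perLevelLimits(int threads, int levels, Random random)
    {
        if (threads < 0 || levels <= 0)
            throw new IllegalArgumentException("threads must be >= 0 and levels > 0");

        int[] limits = new int[levels];
        // Every level is guaranteed the integer part of threads / levels.
        Arrays.fill(limits, threads / levels);

        // Assumption: each spare thread goes to a uniformly chosen level.
        for (int spare = threads % levels; spare > 0; spare--)
            limits[random.nextInt(levels)]++;

        return limits;
    }
}
```

With 10 threads and 4 levels, every level is guaranteed 2 threads and the remaining 2 are handed out at random, so a long-running task in the top level can no longer occupy all slots.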

    Replaces the oversized compaction mechanism with a simple limit on the aggregate
    size of running compactions, which is also applied when a single compaction is
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level sstables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.
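
A minimal sketch of such an aggregate limit, under the assumption that the limit is tracked as a byte budget; the class and method names are hypothetical and do not come from the patch.

```java
/** Hypothetical sketch: admits compactions only while their total size fits a byte budget. */
final class RunningCompactionsLimit
{
    private final long maxAggregateBytes;
    private long runningBytes;

    RunningCompactionsLimit(long maxAggregateBytes)
    {
        this.maxAggregateBytes = maxAggregateBytes;
    }

    /** Returns false, and starts nothing, if the compaction would exceed the budget. */
    synchronized boolean tryStart(long compactionBytes)
    {
        // This also rejects a single compaction that is larger than the whole budget.
        if (compactionBytes > maxAggregateBytes - runningBytes)
            return false;
        runningBytes += compactionBytes;
        return true;
    }

    /** Releases the budget held by a finished compaction. */
    synchronized void finished(long compactionBytes)
    {
        runningBytes -= compactionBytes;
    }
}
```

A rejected compaction stays pending, which is where the temporary increase in read amplification comes from.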

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation where subsequent retrievals
of recent metrics return 0. Tracking is done at the CFS instance level instead of
per SSTableReader to reduce overhead. SSTableReaders set the correct
BloomFilterTracker in setupOnline. Until the correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager

(cherry picked from commit 8c7c128)
(cherry picked from commit 55cfeb0)
djatnieks pushed a commit that referenced this pull request Mar 29, 2024
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy: to encapsulate the behavior of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.
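
As an illustration of the assignment rule, the sketch below maps an sstable to a shard by the token of its first partition key. It simplifies tokens to `long` values and takes the shard boundaries as a sorted array of distinct values; both are assumptions made for the example, and the names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch: shard lookup by the token of an sstable's first key. */
final class ShardAssignment
{
    /**
     * @param firstToken token of the sstable's first partition key
     * @param boundaries sorted, distinct tokens at which a new shard starts;
     *                   n boundaries define n + 1 shards
     * @return index of the shard that owns the sstable
     */
    static int shardOf(long firstToken, long[] boundaries)
    {
        int pos = Arrays.binarySearch(boundaries, firstToken);
        // Exact hit: the boundary token opens the next shard.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point equals the number of boundaries below the token.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

An sstable that starts in one shard but extends past its boundary is still assigned to the shard of its first key, which is why the writers split sstables at shard boundaries.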

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to leveled.

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling, plus keeping a non-oversized alternative, to
ensure that the limited number of oversized compactions that will
be submitted are chosen uniformly from the available shards.
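
For reference, a generic reservoir sampler looks like the sketch below. It covers only the uniform selection of up to `limit` candidates in a single pass; the "non-oversized alternative" bookkeeping is not modeled, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of reservoir sampling over compaction candidates. */
final class ReservoirSampler
{
    /** Picks up to {@code limit} candidates, each with equal probability, in one pass. */
    static <T> List<T> sample(Iterable<T> candidates, int limit, Random random)
    {
        List<T> reservoir = new ArrayList<>(limit);
        int seen = 0;
        for (T candidate : candidates)
        {
            seen++;
            if (reservoir.size() < limit)
            {
                reservoir.add(candidate);          // fill the reservoir first
            }
            else
            {
                // Keep the newcomer with probability limit / seen by drawing
                // a slot in [0, seen) and replacing only if it falls inside.
                int slot = random.nextInt(seen);
                if (slot < limit)
                    reservoir.set(slot, candidate);
            }
        }
        return reservoir;
    }
}
```

The total number of candidates does not need to be known up front, which suits a loop that discovers oversized compactions shard by shard.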

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code from CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards it spans when deciding which level to put it in;
    - the level hierarchy starts at the average flush size for the table;
    - tasks to run are selected randomly, to give each level and shard an equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether to run it was made at scheduling time, which caused many calls to be scheduled, none of which checked whether an earlier one had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data;
    - multiple getNextAggregates calls are no longer run.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to a sample-based exponential moving average, which is much simpler to implement correctly,
    at the expense of expressing averaging periods as a number of updates instead of a time span;
    - add debug logging of the compaction task count decisions

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0, or any other
    level, by reserving compaction threads for each level of the hierarchy. More
    specifically, the integer part of the ratio of compaction threads to the number of
    levels is reserved for each level, and any remainder is distributed randomly as before.

    Replaces the oversized compaction mechanism with a simple limit on the aggregate
    size of running compactions, which is also applied when a single compaction is
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level sstables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation where subsequent retrievals
of recent metrics return 0. Tracking is done at the CFS instance level instead of
per SSTableReader to reduce overhead. SSTableReaders set the correct
BloomFilterTracker in setupOnline. Until the correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager

(cherry picked from commit 8c7c128)
(cherry picked from commit 55cfeb0)

STAR-13 General compile fixes: unused imports, TimeUUID instead of UUID, long instead of int gcBefore, replace System calls with Clock.Global, etc.
* Add json-simple dependency used in CompactionLogAnalyzer
* Add UCS properties to CassandraRelevantProperties
* Checkstyle allow certain uses of java.util.concurrent.CompletableFuture in CC code: CompactionManager.BackgroundCompactionCandidate
* Restore C* 5.0 methods in CompactionStrategy getPerLevelSizeBytes, isLevelCompaction, and getSSTableCountPerTWCSBucket used in TableStatsHolder
* Restore C* 5.0 CompactionStrategy.getEstimatedRemainingTasks used in StreamSession
* Restore C* 5.0 BloomFilterTracker lastFalsePositiveCount, lastTruePositiveCount, and lastTrueNegativeCount used in BloomFilterMetrics
* Add TableMetrics.bloomFilterFalseRatio used in CC ColumnFamilyStore and RealEnvironment

STAR-13 Fix out-of-order enum values in CassandraRelevantProperties.
djatnieks pushed a commit that referenced this pull request Apr 1, 2024
This is the implementation of UnifiedCompactionStrategy, whish is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy : to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to enacapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in an bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.

The documentation for the unified strategy has been added as a mark down
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformally chosen from the available shards.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair related codes in CompactionStrategyContainer, PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - shard-spanning compactions's size is divided by the number of shards spanned for the purposes of deciding in which level to put them;
    - the levels hierarchy starts at the average flush size for the table;
    - selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked if one didn't run before;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when single compactions are
    above that limit. This should prevent running out of space at the expense of several
    highest-level tables extra, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals
of recent metrics return 0. Tracking is done at CFS instance instead of
per-SSTableReader to reduce overhead. SSTableReaders set correct
BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than  F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager

(cherry picked from commit 8c7c128)
(cherry picked from commit 55cfeb0)

STAR-13 General compile fixes: unused imports, TimeUUID instead of UUID, long instead of int gcBefore, replace System calls with Clock.Global, etc.
* Add json-simple dependency used in CompactionLogAnalyzer
* Add UCS properties to CassandraRelevantProperties
* Checkstyle allow certain uses of java.util.concurrent.CompletableFuture in CC code: CompactionManager.BackgroundCompactionCandidate
* Restore C* 5.0 methods in CompactionStrategy getPerLevelSizeBytes, isLevelCompaction, and getSSTableCountPerTWCSBucket used in TableStatsHolder
* Restore C* 5.0 CompactionStrategy.getEstimatedRemainingTasks used in StreamSession
* Restore C* 5.0 BloomFilterTracker lastFalsePositiveCount, lastTruePositiveCount, and lastTrueNegativeCount used in BloomFilterMetrics
* Add TableMetrics.bloomFilterFalseRatio used in CC ColumnFamilyStore and RealEnvironment

STAR-13 Fix out-of-order enum values in CassandraRelevantProperties.
djatnieks pushed a commit that referenced this pull request Apr 16, 2024
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy: to encapsulate the behavior of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.
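
The shard assignment described above can be sketched as follows. This is a minimal illustration, not the code from the patch: it assumes a single contiguous local range of integer tokens, and the helper names `shard_boundaries` and `shard_for` are hypothetical. Treating a token equal to a boundary as belonging to the next shard is also an assumption.

```python
from bisect import bisect_right


def shard_boundaries(range_start, range_end, num_shards):
    """Return the num_shards - 1 interior boundaries that split
    [range_start, range_end) into equally wide shards (assumed layout)."""
    width = range_end - range_start
    return [range_start + width * i // num_shards for i in range(1, num_shards)]


def shard_for(first_token, boundaries):
    """Index of the shard that contains an sstable's first token.
    A token equal to a boundary is placed in the following shard (assumption)."""
    return bisect_right(boundaries, first_token)


boundaries = shard_boundaries(0, 1000, 4)
print(boundaries, shard_for(260, boundaries))
```

Using only the first partition key keeps the assignment cheap: an sstable is never inspected beyond its metadata.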

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.
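
A rough sketch of the splitting rule in the paragraph above, under stated assumptions: partitions arrive in token order as `(token, size)` pairs, and the function name `split_at_boundaries` and its parameters are hypothetical rather than taken from the patch.

```python
def split_at_boundaries(partitions, boundaries, min_sstable_size):
    """Group token-ordered (token, size) partitions into output sstables.

    A new sstable is started only when a shard boundary has been crossed
    AND the sstable being written is larger than min_sstable_size.
    """
    outputs, current, current_size = [], [], 0
    next_boundary = 0
    for token, size in partitions:
        crossed = False
        # Advance past every boundary that this token has reached.
        while next_boundary < len(boundaries) and token >= boundaries[next_boundary]:
            next_boundary += 1
            crossed = True
        if crossed and current_size > min_sstable_size:
            outputs.append(current)
            current, current_size = [], 0
        current.append(token)
        current_size += size
    if current:
        outputs.append(current)
    return outputs


print(split_at_boundaries([(10, 5), (20, 5), (30, 5), (40, 5)], [15, 25, 35], 8))
```

In this sketch a small sstable simply continues across a boundary, so it can span more than one shard; only sufficiently large ones are cut.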

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to leveled.

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.
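
For reference, uniform selection of a bounded number of candidates from a stream can be done with the classic reservoir sampling algorithm (Algorithm R). The sketch below shows only that generic technique; the function name is hypothetical, and the "non-oversized alternative" bookkeeping mentioned above is not modeled.

```python
import random


def reservoir_sample(items, k, rng=random):
    """Pick up to k items uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: choose 2 of 8 hypothetical shard indexes for oversized compactions.
print(reservoir_sample(range(8), 2, random.Random(42)))
```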

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code in CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards spanned for the purposes of deciding in which level to put it;
    - the levels hierarchy starts at the average flush size for the table;
    - selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
    - shard spanning compactions are given more chances to be selected.
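
One way to picture the first two points is the sketch below. It is an assumption-laden illustration, not the formula used by the patch: it supposes that levels grow geometrically by a fanout factor starting from the average flush size, and the name `level_for` and its parameters are hypothetical.

```python
def level_for(total_size, shards_spanned, avg_flush_size, fanout):
    """Level index for a compaction, judged by its per-shard size.

    Level 0 holds sizes below avg_flush_size * fanout; each further level
    is fanout times larger (assumed geometric hierarchy, fanout > 1).
    """
    per_shard = total_size / shards_spanned
    level, threshold = 0, avg_flush_size * fanout
    while per_shard >= threshold:
        level += 1
        threshold *= fanout
    return level


# The same 1600 units of data land on a lower level when spread over 4 shards.
print(level_for(1600, 1, 100, 4), level_for(1600, 4, 100, 4))
```

Dividing by the number of shards spanned keeps a wide but shallow compaction from being treated as if it were a deep one.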

    Additionally, this applies a couple of fixes:
    - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling time, which caused many runs to be scheduled, none of which checked whether an earlier one had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
    - Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to sample-based exponential moving average, which is much simpler to implement correctly
    at the expense of expressing averaging periods in terms of updates count instead of time;
    - add debug logging of the compaction task count decisions
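
A sample-based exponential moving average of the kind mentioned above can be sketched like this. The class name and the choice of smoothing factor `1 / samples` are assumptions made for illustration; the patch may weight samples differently.

```python
class SampleEMA:
    """Exponential moving average whose period is a number of updates, not a time span."""

    def __init__(self, samples):
        # Each new sample contributes 1/samples of the average (assumed weighting).
        self.alpha = 1.0 / samples
        self.value = None

    def update(self, sample):
        if self.value is None:
            # The first sample initializes the average.
            self.value = sample
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value


rate = SampleEMA(samples=4)
rate.update(8.0)
print(rate.update(4.0))
```

No timestamps or decay clocks are needed, which is what makes this variant simpler to get right; the cost is that a quiet period does not age the average.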

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.
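
The reservation rule can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, and handing each leftover thread to a distinct randomly chosen level is one possible reading of "distributed randomly".

```python
import random


def reserve_threads(num_threads, num_levels, rng=random):
    """Return the number of compaction threads available to each level."""
    # The whole part of threads / levels is reserved for every level.
    per_level, remainder = divmod(num_threads, num_levels)
    slots = [per_level] * num_levels
    # Leftover threads go to randomly chosen levels (assumed: distinct levels).
    for level in rng.sample(range(num_levels), remainder):
        slots[level] += 1
    return slots


print(reserve_threads(10, 4, random.Random(0)))
```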

    Replaces the oversized compaction mechanism with a simple limit for the aggregate
    size of running compactions, which is now also applied when single compactions are
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level tables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals
of recent metrics return 0. Tracking is done at CFS instance instead of
per-SSTableReader to reduce overhead. SSTableReaders set correct
BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set,
SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager

(cherry picked from commit 8c7c128)
(cherry picked from commit 55cfeb0)

STAR-13 General compile fixes: unused imports, TimeUUID instead of UUID, long instead of int gcBefore, replace System calls with Clock.Global, etc.
* Add json-simple dependency used in CompactionLogAnalyzer
* Add UCS properties to CassandraRelevantProperties
* Checkstyle allow certain uses of java.util.concurrent.CompletableFuture in CC code: CompactionManager.BackgroundCompactionCandidate
* Restore C* 5.0 methods in CompactionStrategy getPerLevelSizeBytes, isLevelCompaction, and getSSTableCountPerTWCSBucket used in TableStatsHolder
* Restore C* 5.0 CompactionStrategy.getEstimatedRemainingTasks used in StreamSession
* Restore C* 5.0 BloomFilterTracker lastFalsePositiveCount, lastTruePositiveCount, and lastTrueNegativeCount used in BloomFilterMetrics
* Add TableMetrics.bloomFilterFalseRatio used in CC ColumnFamilyStore and RealEnvironment

STAR-13 Fix out-of-order enum values in CassandraRelevantProperties.
djatnieks pushed a commit that referenced this pull request Jan 30, 2025
djatnieks pushed a commit that referenced this pull request May 18, 2025
This is the implementation of UnifiedCompactionStrategy, which is
intended to not only replace all other compaction strategies and CSM,
but also to optimally choose the configuration that will result in the
minimum read and write costs, given a specific workload and dataset size.

The strategy will choose either leveled or tiered merge policies
depending on the workload and the costs associated with user queries and
inserts.

This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

ant compactionSim

With cmd line arguments:

ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:

- CompactionStrategy: to encapsulate the behavior of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been
encapsulated in the compaction package.

In a future patch, UnifiedCompactionStrategy will also implement these interfaces,
therefore standing on its own, without the need of CSM, which should eventually be removed along
with the legacy strategies.

The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation.
CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from
CompactionStrategyManager by implementing its own thin compaction
container.

When a compaction candidate needs to be produced, the strategy takes a
snapshot of the eligible sstables and disk boundaries. It then applies
a set of equivalence classes, which implement the same partitioning of
sstables currently performed by CSM.

One of the equivalence classes splits sstables across token range
boundaries, or shards. The number of shards is specified in the
compaction properties, and the shards are created by splitting the
local ranges by this number. sstables are assigned to a shard by
looking at their first partition key.

When flushing or compacting, a specialized writer splits sstables
at the shard boundaries. If the current sstable is larger than a
minimum sstable size that can be specified in the compaction properties,
then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by
the compaction executor asynchronously.

The following minor updates have been performed:

- the adaptive algorithm now searches for better choices of W every 5 minutes
  rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions
  no longer stop at T or F. This may skip levels but has proven very effective
  in tests when switching from tiered to levelled.

The documentation for the unified strategy has been added as a Markdown
document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or
out of date, even if the corresponding table is just being reloaded
due to a metadata change.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing
TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of
UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy.
The previous behavior resulted in every compaction task being interrupted
while autocompaction is disabled.

Fixes the failing
TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized"
compactions, in order to limit size amplification. "Oversized"
here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions
based on a configurable option for max allowed (tolerable) SA as a fraction
of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to
ensure that the limited number of oversized compactions that will
be submitted are uniformly chosen from the available shards.
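
For reference, a generic reservoir sampler (the classic "Algorithm R") looks roughly like the sketch below. It only illustrates the uniform choice of at most `limit` candidates from a stream of unknown length; it does not model the non-oversized alternative mentioned above, and `ReservoirSketch` is a hypothetical name rather than a class from this patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of reservoir sampling (Algorithm R). */
final class ReservoirSketch
{
    /**
     * Returns at most limit elements chosen uniformly at random from
     * candidates, visiting each candidate exactly once.
     */
    static <T> List<T> sample(Iterable<T> candidates, int limit, Random random)
    {
        List<T> reservoir = new ArrayList<>(limit);
        int seen = 0;
        for (T candidate : candidates)
        {
            seen++;
            if (reservoir.size() < limit)
            {
                // Fill the reservoir with the first limit candidates.
                reservoir.add(candidate);
            }
            else
            {
                // Keep the new candidate with probability limit / seen by
                // replacing a uniformly chosen slot.
                int slot = random.nextInt(seen);
                if (slot < limit)
                    reservoir.set(slot, candidate);
            }
        }
        return reservoir;
    }
}
```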

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
    Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
    Move repair-related code from CompactionStrategyContainer and PendingRepairManager to LocalSessions

    CompactionStrategyContainer:
    Added a method to acquire the ReentrantReadWriteLock.WriteLock so that it can be passed to and used by mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify
    the code and reduce duplication and add some of the statistics
    (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
    - the size of a shard-spanning compaction is divided by the number of shards it spans when deciding which level to put it in;
    - the level hierarchy starts at the average flush size for the table;
    - the tasks to run are selected randomly, to give each level and shard an equal chance to run its compaction;
    - shard-spanning compactions are given more chances to be selected.

    Additionally, this applies a couple of fixes:
    - getNextCompaction ran too many times: the decision whether to run it was made at scheduling time, so many calls were scheduled, and none of them checked whether an earlier call had already run;
    - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data;
    - do not run multiple getNextAggregates calls.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be
    aggregate metrics instead of per-table metrics. This will
    make them easier to record/monitor in Fallout/Grafana, and
    will also enable computing them more efficiently from a
    cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

    Also:
    - switch to a sample-based exponential moving average, which is much simpler to implement correctly,
    at the expense of expressing averaging periods as a number of updates instead of a time span;
    - add debug logging of the compaction task count decisions
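
A sample-based exponential moving average can be as small as the sketch below. The class name `SampleBasedEma` and the conventional smoothing factor `2 / (period + 1)` are assumptions for illustration; the patch may use a different weighting.

```java
/** Hypothetical sketch of an exponential moving average driven by sample count. */
final class SampleBasedEma
{
    private final double alpha;
    private double average;
    private boolean initialized;

    /**
     * @param periodInSamples averaging period expressed as a number of
     *                        updates, not as a duration
     */
    SampleBasedEma(int periodInSamples)
    {
        // Conventional smoothing factor for an N-sample period (an assumption).
        this.alpha = 2.0 / (periodInSamples + 1);
    }

    /** Folds one new sample into the average; no clock is consulted. */
    void update(double sample)
    {
        if (!initialized)
        {
            average = sample;
            initialized = true;
        }
        else
        {
            average += alpha * (sample - average);
        }
    }

    double get()
    {
        return average;
    }
}
```

Because each update has a fixed weight, no timestamps or decay-by-elapsed-time arithmetic are needed, which is the simplification the commit refers to; the cost is that a quiet period does not age the average.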

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

    Fixes the problem of long-running higher-level tasks starving level 0 or any level
    from continuing by reserving compaction threads for each of the levels of the
    hierarchy. More specifically, the whole part of the ratio between compaction threads
    and number of levels is reserved for each level, and any remainder is distributed
    randomly as before.
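
As a simplified sketch of that reservation rule: each level receives the integer part of threads divided by levels, and each leftover thread goes to a randomly chosen level. The names are hypothetical, and the real strategy distributes the remainder among pending tasks rather than plain level indexes.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of spreading compaction threads among levels. */
final class LevelThreadReservationSketch
{
    /**
     * Returns the number of threads each level may use: threads / levels is
     * reserved for every level and the remainder is handed out at random.
     */
    static int[] allocate(int threads, int levels, Random random)
    {
        int[] perLevel = new int[levels];
        Arrays.fill(perLevel, threads / levels);
        for (int leftover = threads % levels; leftover > 0; leftover--)
            perLevel[random.nextInt(levels)]++;
        return perLevel;
    }
}
```

With 8 threads and 3 levels, every level is guaranteed 2 threads, so a long-running compaction on a high level cannot take all of them, and the remaining 2 are assigned randomly.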

    Replaces the oversized compaction mechanism with a simple limit on the aggregate
    size of running compactions, which is now also applied when a single compaction is
    above that limit. This should prevent running out of space at the expense of several
    extra highest-level sstables, i.e. slightly higher read amplification, until someone
    reacts to the warning, which I think is a sensible tradeoff.

    Also removes unsupported options from the documentation markdown and adds
    max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and
    bucket identifiers when rejecting a compaction bigger than the
    max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters so that consecutive retrievals of recent metrics
do not return 0. Tracking is done per CFS instance instead of per SSTableReader
to reduce overhead. SSTableReaders set the correct BloomFilterTracker in
setupOnline; until it is set, they use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and
a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets()
at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers.
This patch makes sure that JMX changes that alter the container type
(CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes
that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager

(cherry picked from commit 8c7c128)
(cherry picked from commit 55cfeb0)

STAR-13 General compile fixes: unused imports, TimeUUID instead of UUID, long instead of int gcBefore, replace System calls with Clock.Global, etc.
* Add json-simple dependency used in CompactionLogAnalyzer
* Add UCS properties to CassandraRelevantProperties
* Checkstyle: allow certain uses of java.util.concurrent.CompletableFuture in CC code: CompactionManager.BackgroundCompactionCandidate
* Restore C* 5.0 methods in CompactionStrategy getPerLevelSizeBytes, isLevelCompaction, and getSSTableCountPerTWCSBucket used in TableStatsHolder
* Restore C* 5.0 CompactionStrategy.getEstimatedRemainingTasks used in StreamSession
* Restore C* 5.0 BloomFilterTracker lastFalsePositiveCount, lastTruePositiveCount, and lastTrueNegativeCount used in BloomFilterMetrics
* Add TableMetrics.bloomFilterFalseRatio used in CC ColumnFamilyStore and RealEnvironment

STAR-13 Fix out-of-order enum values in CassandraRelevantProperties.