STAR-13 UnifiedCompactionStrategy #132

Gerrrr · 2021-04-28T09:38:56Z

dtests PR

Ported changes are taken from:

DB-4422 includes the majority of commits. It might be useful to compare this PR with the internal PRs.
DB-4762
DB-4823
DB-4853
DB-4764
DB-4759
DB-4640
DB-4711 (merged into "STAR-13 Implement compaction shards")
isTrulyWrapAround from APOLLO-6

Side changes:

The default FSErrorHandler now handles the "die" disk failure policy

doc/unified_compaction.md

blambov · 2021-06-10T13:02:09Z

doc/unified_compaction.md

+
+#### UCS-tiered vs STCS
+
+SizeTieredCompactionStrategy is pretty close to UCS. However, it defines buckets/levels by looking for sstables of similar size. This can result in some odd selections of buckets, possibly spanning sstables of wildly different sizes, while UCS's selection is more stable and predictable.


Something I recently realized: this bucketing of STCS can result in stranded sstables, i.e. sstables that end up in a bucket that's not part of the normal progression (e.g. at a size like m*8). They won't be compacted and can cause all sorts of inefficiencies (increased read amplification, as well as problems with tombstones, the time-ordered read path, etc.).
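
To illustrate the stranding described above, here is a minimal sketch of similarity-based bucketing in the style of STCS. It is a simplification written for illustration, not the actual SizeTieredCompactionStrategy code: the class and method names are hypothetical, sizes are plain numbers, and the 0.5/1.5 bounds are assumed to correspond to the usual bucket_low/bucket_high defaults.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Cassandra code: groups sstable sizes into buckets of "similar" size.
final class StcsBucketSketch {
    // A size joins the first bucket whose running average avg satisfies avg*low < size < avg*high;
    // otherwise it starts a new bucket of its own.
    static List<List<Long>> buckets(long[] sizes, double low, double high) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        List<List<Long>> buckets = new ArrayList<>();
        List<Double> averages = new ArrayList<>();
        for (long size : sorted) {
            boolean placed = false;
            for (int i = 0; i < buckets.size() && !placed; i++) {
                double avg = averages.get(i);
                if (size > avg * low && size < avg * high) {
                    List<Long> bucket = buckets.get(i);
                    bucket.add(size);
                    // Update the running average of the bucket with the new member.
                    averages.set(i, (avg * (bucket.size() - 1) + size) / bucket.size());
                    placed = true;
                }
            }
            if (!placed) {
                buckets.add(new ArrayList<>(List.of(size)));
                averages.add((double) size);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Sizes follow a progression of m=100, 4m and 16m, plus one outlier at 8m.
        long[] sizes = {100, 100, 400, 400, 400, 800, 1600, 1600};
        // prints [[100, 100], [400, 400, 400], [800], [1600, 1600]]
        System.out.println(buckets(sizes, 0.5, 1.5));
    }
}
```

In this run the single sstable of size 800 sits in a bucket of its own. If compaction outputs keep landing near 400 and 1600, that bucket may never collect enough peers to reach the minimum compaction threshold, which is the stranding effect.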

doc/unified_compaction.md

src/java/org/apache/cassandra/db/ColumnFamilyStore.java

blambov

Second batch

src/java/org/apache/cassandra/db/compaction/ArenaSelector.java

src/java/org/apache/cassandra/db/compaction/BackgroundCompactions.java

src/java/org/apache/cassandra/db/compaction/CompactionAggregate.java

src/java/org/apache/cassandra/db/compaction/CompactionStrategyFactory.java

src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java

blambov · 2021-06-16T09:11:37Z

I had to change the base to continue the review as the ds-trunk rebase brought in a lot of other changes.

blambov

Third batch, the rest of the files except UCS itself and tests.

src/java/org/apache/cassandra/db/compaction/CompactionStrategyOptions.java

src/java/org/apache/cassandra/db/compaction/LegacyAbstractCompactionStrategy.java

src/java/org/apache/cassandra/metrics/CompactionMetrics.java

src/java/org/apache/cassandra/metrics/TableMetrics.java

src/java/org/apache/cassandra/schema/TableParams.java

blambov · 2021-06-17T08:33:52Z

src/java/org/apache/cassandra/service/DefaultFSErrorHandler.java

@@ -86,6 +86,9 @@ public void handleFSError(FSError e)
                        Keyspace.removeUnreadableSSTables(directory);
                }
                break;
+            case die:


This feels like the wrong place to have a fix like this. This also applies to the bloom filter tracking changes and anything else that is a valuable fix on its own.
Do we have separate tickets for them? Let's make sure we do, and prioritize them so that they are committed to OSS before we propose this. (To clarify, I do not want to remove any of these from this patch.)

src/java/org/apache/cassandra/utils/ExpMovingAverage.java

src/java/org/apache/cassandra/tools/CompactionLogAnalyzer.java

src/java/org/apache/cassandra/db/compaction/UnifiedCompactionContainer.java

src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStatistics.java

src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.java

blambov

This concludes my first pass over the code (I haven't had a chance to look at any tests, or any of the updates yet).

Looks like I have created a lot of extra work... Let me know if you would like me to take some of that load.

src/java/org/apache/cassandra/db/compaction/unified/UnifiedCompactionTask.java

src/java/org/apache/cassandra/db/compaction/unified/ShardedCompactionWriter.java

src/java/org/apache/cassandra/db/compaction/unified/ShardedMultiWriter.java

src/java/org/apache/cassandra/db/compaction/unified/Controller.java

src/java/org/apache/cassandra/db/compaction/unified/AdaptiveController.java

src/java/org/apache/cassandra/db/compaction/unified/Controller.java

src/java/org/apache/cassandra/db/compaction/unified/AdaptiveController.java

src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.java

This is the implementation of UnifiedCompactionStrategy, which is intended not only to replace all other compaction strategies and CSM, but also to optimally choose the configuration that will result in the minimum read and write costs, given a specific workload and dataset size. The strategy will choose either leveled or tiered merge policies depending on the workload and the costs associated with user queries and inserts. This strategy should be considered experimental at this stage.

--

This patch also introduces a compaction simulation that can be run with:

    ant compactionSim

With cmd line arguments:

    ant compactionSim--args="-wl R50_W50 -t adaptive"

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.

This patch introduces 3 interfaces:
- CompactionStrategy: to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)

CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been encapsulated in the compaction package. In a future patch, UnifiedCompactionStrategy will also implement these interfaces, therefore standing on its own, without the need of CSM, which should eventually be removed along with the legacy strategies. The factory will take care of instantiating either the new strategy or CSM.

This patch also introduces CompactionStrategyOptions, for option validation. CompactionParams now uses this class.

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards

The unified compaction strategy is now decoupled from CompactionStrategyManager by implementing its own thin compaction container.

When a compaction candidate needs to be produced, the strategy takes a snapshot of the eligible sstables and disk boundaries. It then applies a set of equivalence classes, which implement the same partitioning of sstables currently performed by CSM. One of the equivalence classes splits sstables across token range boundaries, or shards. The number of shards is specified in the compaction properties, and the shards are created by splitting the local ranges by this number. sstables are assigned to a shard by looking at their first partition key.

When flushing or compacting, a specialized writer splits sstables at the shard boundaries. If the current sstable is larger than a minimum sstable size that can be specified in the compaction properties, then it is split when a boundary is reached.

getNextBackgroundTask() now returns a list of tasks, which are processed by the compaction executor asynchronously.

The following minor updates have been performed:
- the adaptive algorithm now searches for better choices of W every 5 minutes rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions no longer stop at T or F. This may skip levels but has proven very effective in tests when switching from tiered to leveled.

The documentation for the unified strategy has been added as a Markdown document.

Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov

Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or out of date, even if the corresponding table is just being reloaded due to a metadata change.

Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of UnifiedCompactionContainer similar to that of CompactionStrategyManager.

This includes always starting up the backing compaction strategy. The previous behavior resulted in every compaction task being interrupted while autocompaction is disabled.

Fixes the failing TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized" compactions, in order to limit size amplification.

"Oversized" here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions based on a configurable option for max allowed (tolerable) SA as a fraction of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to ensure that the limited number of oversized compactions that will be submitted are uniformly chosen from the available shards.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
- Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
- Move repair-related code in CompactionStrategyContainer, PendingRepairManager to LocalSessions
- CompactionStrategyContainer: Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify the code and reduce duplication and add some of the statistics (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
- the size of shard-spanning compactions is divided by the number of shards spanned for the purposes of deciding in which level to put them;
- the levels hierarchy starts at the average flush size for the table;
- selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
- shard-spanning compactions are given more chances to be selected.

Additionally, this applies a couple of fixes:
- getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked if one didn't run before;
- early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
- Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be aggregate metrics instead of per-table metrics.

This will make them easier to record/monitor in Fallout/Grafana, and will also enable computing them more efficiently from a cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions

Also:
- switch to sample-based exponential moving average, which is much simpler to implement correctly at the expense of expressing averaging periods in terms of updates count instead of time;
- add debug logging of the compaction task count decisions

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels

Fixes the problem of long-running higher-level tasks starving level 0 or any level from continuing by reserving compaction threads for each of the levels of the hierarchy. More specifically, the whole part of the ratio between compaction threads and number of levels is reserved for each level, and any remainder is distributed randomly as before.

Replaces the oversized compaction mechanism with a simple limit for the aggregate size of running compactions, which is now also applied when single compactions are above that limit. This should prevent running out of space at the expense of several highest-level tables extra, i.e. slightly higher read amplification, until someone reacts to the warning, which I think is a sensible tradeoff.

Also removes unsupported options from the documentation markdown and adds max_space_overhead description.

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and bucket identifiers when rejecting a compaction bigger than the max space overhead.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals of recent metrics return 0. Tracking is done at CFS instance instead of per-SSTableReader to reduce overhead. SSTableReaders set correct BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set, SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and a layout-preserving mode

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets() at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers. This patch makes sure that JMX changes that alter the container type (CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes that are unrelated to compaction.
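
The thread reservation described in the "Spread compaction threads equally among the levels" commit message can be sketched roughly as follows. This is only an illustration of the arithmetic under stated assumptions, not the code from the patch: the class and method names are hypothetical, and handing each leftover thread to a uniformly random level is one possible reading of "distributed randomly".

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch, not the UCS implementation: splits compaction threads among levels.
final class LevelThreadSketch {
    // Every level gets floor(threads / levels) reserved threads;
    // the remaining threads % levels are each given to a randomly chosen level.
    static int[] threadsPerLevel(int threads, int levels, Random random) {
        if (threads < 0 || levels <= 0)
            throw new IllegalArgumentException("need threads >= 0 and levels > 0");
        int[] perLevel = new int[levels];
        Arrays.fill(perLevel, threads / levels);
        for (int i = 0; i < threads % levels; i++)
            perLevel[random.nextInt(levels)]++;
        return perLevel;
    }

    public static void main(String[] args) {
        // With 10 threads and 4 levels, each level has 2 reserved threads and 2 threads are left over.
        System.out.println(Arrays.toString(threadsPerLevel(10, 4, new Random())));
    }
}
```

With the reserved share in place, a long-running compaction on a high level cannot occupy the threads set aside for level 0, whichever levels receive the leftover threads.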

STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>

(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)

This is the implementation of UnifiedCompactionStrategy, whish is intended to not only replace all other compaction strategies and CSM, but also to optimally choose the configuration that will result in the minimum read and write costs, given a specific workload and dataset size. The strategy will choose either leveled or tiered merge policies depending on the workload and the costs associated with user queries and inserts. This strategy should be considered experimental at this stage. -- This patch also introduces a compaction simulation that can be run with: ant compactionSim With cmd line arguments: ant compactionSim--args="-wl R50_W50 -t adaptive" Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Introduce compaction interfaces. This patch introduces 3 interfaces: - CompactionStrategy : to encapsulate the behaviour of a compaction strategy - CompactionStrategyContainer: to enacapsulate the additional behavior of CSM - CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS) CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been encapsulated in the compaction package. In a future patch, UnifiedCompactionStrategy will also implement these interfaces, therefore standing on its own, without the need of CSM, which should eventually be removed along with the legacy strategies. The factory will take care of instantiating either the new strategy or CSM. This patch also introduces CompactionStrategyOptions, for option validation. CompactionParams now uses this class. Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Implement compaction shards The unified compaction strategy is now decoupled from CompactionStrategyManager by implementing its own thin compaction container. 
When a compaction candidate needs to be produced, the strategy takes a snapshot of the eligible sstables and disk boundaries. It then applies a set of equivalence classes, which implement the same partitioning of sstables currently performed by CSM. One of the equivalence classes splits sstables across token range boundaries, or shards. The number of shards is specified in the compaction properties, and the shards are created by splitting the local ranges by this number. sstables are assigned to a shard by looking at their first partition key. When flushing or compacting, a specialized writer splits sstables at the shard boundaries. If the current sstable is larger than a minimum sstable size that can be specified in the compaction properties, then it is split when a boundary is reached. getNextBackgroundTask() now returns a list of tasks, which are processed by the compaction executor asynchronously. The following minor updates have been performed: - the adaptive algorithm now searches for better choices of W every 5 minutes rather than 2 minutes; - the cost calculator now uses a read multiplier of 0.1 rather than 0.25; - all sstables in an bucket are compacted if their number is >= T. Compactions no longer stop at T or F. This may skip levels but has proven very effective in tests when switching from tiered to levelled. The documentation for the unified strategy has been added as a mark down document. Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Add a section in the UCS markdown about differences with STCS and LCS. Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Update disk boundaries if current boundaries are null or out of date, even if the corresponding table is just being reloaded due to a metadata change. 
Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies. Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Make the enable/disable and isEnabled/isActive behavior of UnifiedCompactionContainer similar to that of CompactionStrategyManager. This includes always starting up the backing compaction strategy. The previous behavior resulted in every compaction task being interrupted while autocompaction is disabled. Fixes the failing TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Make compaction shards split inside disks and apply disk boundaries Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Permit early open in UnifiedCompactionStrategy Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Limit the number of concurrently running "oversized" compactions, in order to limit size amplification. "Oversized" here is defined as close to the maximum shard size. - Calculate the limit for the number of concurrent "oversized" compactions based on a configurable option for max allowed (tolerable) SA as a fraction of the expected uncompacted dataset size. - Use reservoir sampling + keeping a non-oversized alternative to ensure that the limited number of oversized compactions that will be submitted are uniformally chosen from the available shards. 
Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Remove CompactionStatistics STAR-13 Port isTrulyWrapAround Co-authored-by: Sylvain Lebresne <[email protected]> STAR-13 Fix UCS tests STAR-13 Refactor repair out from compaction strategies Some major refactorings: Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore Move repair related codes in CompactionStrategyContainer, PendingRepairManager to LocalSessions CompactionStrategyContainer: Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired Co-authored-by: Justin Chu <[email protected]> STAR-13 Refactor compaction statistics in order to simplify the code and reduce duplication and add some of the statistics (already available in CompactionAggregateStatistics) to TableMetrics. Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy: - shard-spanning compactions's size is divided by the number of shards spanned for the purposes of deciding in which level to put them; - the levels hierarchy starts at the average flush size for the table; - selects the tasks to run randomly to give each level and shard equal chance to run its compaction; - shard spanning compactions are given more chances to be selected. Additionally, this applies a couple of fixes: - getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked if one didn't run before; - early opened sstables could be selected for compaction, causing multiple compaction passes over the same data. - Do not run multiple getNextAggregates Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Change the newly introduced compaction metrics to be aggregate metrics instead of per-table metrics. 
This will make them easier to record/monitor in Fallout/Grafana, and will also enable computing them more efficiently from a cached value of the AggregateCompactions metric. Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Update UCS defaults Co-authored-by: Justin Chu <[email protected]> STAR-13 Track compaction rate in backgroundCompactions Also: - switch to sample-based exponential moving average, which is much simpler to implement correctly at the expense of expressing averaging periods in terms of updates count instead of time; - add debug logging of the compaction task count decisions Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Spread compaction threads equally among the levels Fixes the problem of long-running higher-level tasks starving level 0 or any level from continuing by reserving compaction threads for each of the levels of the hierarchy. More specifically, the whole part of the ratio between compaction threads and number of levels is reserved for each level, and any remainder is distributed randomly as before. Replaces the oversized compaction mechanism with a simple limit for the aggregate size of running compactions, which is now also applied when single compactions are above that limit. This should prevent running out of space at the expense of several highest-level tables extra, i.e. slightly higher read amplification, until someone reacts to the warning, which I think is a sensible tradeoff. Also removes unsupported options from the documentation markdown and adds max_space_overhead description. Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter STAR-13 Unable to set max_space_overhead in UCS Co-authored-by: Justin Chu <[email protected]> STAR-13 Log the number of compacted SSTables, and the shard and bucket identifiers when rejecting a compaction bigger than the max space overhead. 
Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Fix Bloom Filter tracking BloomFilterTracker uses meters to avoid the situation when subsequent retrievals of recent metrics return 0. Tracking is done at CFS instance instead of per-SSTableReader to reduce overhead. SSTableReaders set correct BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set, SSTableReaders use NoopBloomFilterTracker. STAR-13 Introduce a limit on the number of sstables in a compaction and a layout-preserving mode Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Log shard and bucket details on each getShardsWithBuckets() at TRACE level instead of at DEBUG level. Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 TTL-based SSTable expiration in UCS STAR-13 Inherit BackgroundCompactions when recreating UCS STAR-13 Compaction log analyzer Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Always create mocked SSTables with stubOnly STAR-13 Fix compaction strategy reload Prevent resetting JMX changes when we create new strategy containers. This patch makes sure that JMX changes that alter the container type (CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes that are unrelated to compaction. 
STAR-13 Handle IOErrors during background compaction task execution as FSErrors Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket STAR-13 Use descriptor passed to ShardedMultiWriter STAR-13 Fix condition for triggering layout compactions in case of non-uniform W STAR-13 computeShardBoundaries handles no splitter and no disk boundaries STAR-13 Check for partitioner mismatch before splitting local ranges STAR-13 Use avg bucket size to adjust maxSSTablesToCompact STAR-13 Limit the number of SSTables to compact in one operation STAR-13 Switch CompactionsBytemanTest to signals Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Fix ShardedMultiWriterTest with compression Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Move releaseRepairData to LocalSessions STAR-13 Decouple condition for switching writers from append Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Shutdown previous strategy container on reload STAR-13 Remove unused maxConcurrentOversizedCompactions STAR-13 Metrics hold reference of a single instance controller STAR-13 Move Controller params check to validateOptions STAR-13 Fix target size estimation STAR-13 Use correct bucket index for picks that contain only expired SSTables STAR-13 Update shard index on disk change STAR-13 Remove redundant shutdown call on keyspace drop STAR-13 Fix amplification estimation STAR-13 Fix calculation of the number of pending picks per bucket Co-authored-by: Branimir Lambov <[email protected]> Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Dimitar Dimitrov <[email protected]> Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit 8cb7905) (cherry picked from commit 069b6ba) (cherry picked from commit d73878b)

This is the implementation of UnifiedCompactionStrategy, whish is intended to not only replace all other compaction strategies and CSM, but also to optimally choose the configuration that will result in the minimum read and write costs, given a specific workload and dataset size. The strategy will choose either leveled or tiered merge policies depending on the workload and the costs associated with user queries and inserts. This strategy should be considered experimental at this stage. -- This patch also introduces a compaction simulation that can be run with: ant compactionSim With cmd line arguments: ant compactionSim--args="-wl R50_W50 -t adaptive" Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Introduce compaction interfaces. This patch introduces 3 interfaces: - CompactionStrategy : to encapsulate the behaviour of a compaction strategy - CompactionStrategyContainer: to enacapsulate the additional behavior of CSM - CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS) CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been encapsulated in the compaction package. In a future patch, UnifiedCompactionStrategy will also implement these interfaces, therefore standing on its own, without the need of CSM, which should eventually be removed along with the legacy strategies. The factory will take care of instantiating either the new strategy or CSM. This patch also introduces CompactionStrategyOptions, for option validation. CompactionParams now uses this class. Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Implement compaction shards The unified compaction strategy is now decoupled from CompactionStrategyManager by implementing its own thin compaction container. 
When a compaction candidate needs to be produced, the strategy takes a snapshot of the eligible sstables and disk boundaries. It then applies a set of equivalence classes, which implement the same partitioning of sstables currently performed by CSM. One of the equivalence classes splits sstables across token range boundaries, or shards. The number of shards is specified in the compaction properties, and the shards are created by splitting the local ranges by this number. sstables are assigned to a shard by looking at their first partition key. When flushing or compacting, a specialized writer splits sstables at the shard boundaries. If the current sstable is larger than a minimum sstable size that can be specified in the compaction properties, then it is split when a boundary is reached. getNextBackgroundTask() now returns a list of tasks, which are processed by the compaction executor asynchronously. The following minor updates have been performed: - the adaptive algorithm now searches for better choices of W every 5 minutes rather than 2 minutes; - the cost calculator now uses a read multiplier of 0.1 rather than 0.25; - all sstables in an bucket are compacted if their number is >= T. Compactions no longer stop at T or F. This may skip levels but has proven very effective in tests when switching from tiered to levelled. The documentation for the unified strategy has been added as a mark down document. Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Add a section in the UCS markdown about differences with STCS and LCS. Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Update disk boundaries if current boundaries are null or out of date, even if the corresponding table is just being reloaded due to a metadata change. 
Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test

Co-authored-by: Dimitar Dimitrov

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.

Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test

Co-authored-by: Dimitar Dimitrov

STAR-13 Make the enable/disable and isEnabled/isActive behavior of UnifiedCompactionContainer similar to that of CompactionStrategyManager.

This includes always starting up the backing compaction strategy. The previous behavior resulted in every compaction task being interrupted while autocompaction is disabled.

Fixes the failing TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test

Co-authored-by: Dimitar Dimitrov

STAR-13 Make compaction shards split inside disks and apply disk boundaries

Co-authored-by: Branimir Lambov

STAR-13 Permit early open in UnifiedCompactionStrategy

Co-authored-by: Branimir Lambov

STAR-13 Limit the number of concurrently running "oversized" compactions, in order to limit size amplification.

"Oversized" here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions based on a configurable option for max allowed (tolerable) SA as a fraction of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to ensure that the limited number of oversized compactions that will be submitted are uniformly chosen from the available shards.
Co-authored-by: Dimitar Dimitrov

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround

Co-authored-by: Sylvain Lebresne

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies

Some major refactorings:
- Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
- Move repair-related code in CompactionStrategyContainer and PendingRepairManager to LocalSessions
- CompactionStrategyContainer: added a method to acquire the ReentrantReadWriteLock.WriteLock so that it can be passed to and used by mutateRepaired

Co-authored-by: Justin Chu

STAR-13 Refactor compaction statistics in order to simplify the code and reduce duplication and add some of the statistics (already available in CompactionAggregateStatistics) to TableMetrics.

Co-authored-by: Dimitar Dimitrov

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
- the size of a shard-spanning compaction is divided by the number of shards spanned for the purposes of deciding in which level to put it;
- the levels hierarchy starts at the average flush size for the table;
- selects the tasks to run randomly to give each level and shard an equal chance to run its compaction;
- shard-spanning compactions are given more chances to be selected.

Additionally, this applies a couple of fixes:
- getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked whether one had already run;
- early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
- Do not run multiple getNextAggregates

Co-authored-by: Branimir Lambov

STAR-13 Change the newly introduced compaction metrics to be aggregate metrics instead of per-table metrics.
This will make them easier to record/monitor in Fallout/Grafana, and will also enable computing them more efficiently from a cached value of the AggregateCompactions metric.

Co-authored-by: Dimitar Dimitrov

STAR-13 Update UCS defaults

Co-authored-by: Justin Chu

STAR-13 Track compaction rate in backgroundCompactions

Also:
- switch to sample-based exponential moving average, which is much simpler to implement correctly at the expense of expressing averaging periods in terms of update counts instead of time;
- add debug logging of the compaction task count decisions

Co-authored-by: Branimir Lambov

STAR-13 Spread compaction threads equally among the levels

Fixes the problem of long-running higher-level tasks starving level 0 or any level from continuing by reserving compaction threads for each of the levels of the hierarchy. More specifically, the whole part of the ratio between compaction threads and number of levels is reserved for each level, and any remainder is distributed randomly as before.

Replaces the oversized compaction mechanism with a simple limit for the aggregate size of running compactions, which is now also applied when single compactions are above that limit. This should prevent running out of space at the expense of several extra highest-level tables, i.e. slightly higher read amplification, until someone reacts to the warning, which I think is a sensible tradeoff.

Also removes unsupported options from the documentation markdown and adds max_space_overhead description.

Co-authored-by: Branimir Lambov

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS

Co-authored-by: Justin Chu

STAR-13 Log the number of compacted SSTables, and the shard and bucket identifiers when rejecting a compaction bigger than the max space overhead.
Co-authored-by: Dimitar Dimitrov

STAR-13 Fix Bloom Filter tracking

BloomFilterTracker uses meters to avoid the situation when subsequent retrievals of recent metrics return 0. Tracking is done at the CFS instance instead of per-SSTableReader to reduce overhead. SSTableReaders set the correct BloomFilterTracker in setupOnline. Before the correct BloomFilterTracker is set, SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and a layout-preserving mode

Co-authored-by: Branimir Lambov

STAR-13 Log shard and bucket details on each getShardsWithBuckets() at TRACE level instead of at DEBUG level.

Co-authored-by: Dimitar Dimitrov

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer

Co-authored-by: Branimir Lambov

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload

Prevent resetting JMX changes when we create new strategy containers. This patch makes sure that JMX changes that alter the container type (CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes that are unrelated to compaction.
STAR-13 Handle IOErrors during background compaction task execution as FSErrors

Co-authored-by: Dimitar Dimitrov

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket

STAR-13 Use descriptor passed to ShardedMultiWriter

STAR-13 Fix condition for triggering layout compactions in case of non-uniform W

STAR-13 computeShardBoundaries handles no splitter and no disk boundaries

STAR-13 Check for partitioner mismatch before splitting local ranges

STAR-13 Use avg bucket size to adjust maxSSTablesToCompact

STAR-13 Limit the number of SSTables to compact in one operation

STAR-13 Switch CompactionsBytemanTest to signals

Co-authored-by: Branimir Lambov

STAR-13 Fix ShardedMultiWriterTest with compression

Co-authored-by: Branimir Lambov

STAR-13 Move releaseRepairData to LocalSessions

STAR-13 Decouple condition for switching writers from append

Co-authored-by: Branimir Lambov

STAR-13 Shutdown previous strategy container on reload

STAR-13 Remove unused maxConcurrentOversizedCompactions

STAR-13 Metrics hold reference of a single instance controller

STAR-13 Move Controller params check to validateOptions

STAR-13 Fix target size estimation

STAR-13 Use correct bucket index for picks that contain only expired SSTables

STAR-13 Update shard index on disk change

STAR-13 Remove redundant shutdown call on keyspace drop

STAR-13 Fix amplification estimation

STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov
Co-authored-by: Stefania Alborghetti
Co-authored-by: Dimitar Dimitrov
Co-authored-by: Justin Chu

(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
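
For readers unfamiliar with the sharding described in "Implement compaction shards", here is a minimal sketch of the idea: split a token range into a fixed number of shards and assign an sstable to a shard by the token of its first partition key. This is not the code from the patch. The class `ShardSketch`, both method names, the use of plain `long` tokens, the single contiguous range, and the choice that a token equal to a boundary belongs to the shard on its right are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and boundary semantics are assumptions.
class ShardSketch {
    // Splits [min, max) into `shards` equal pieces and returns the shards - 1
    // interior boundaries. Assumes max - min does not overflow a long.
    static long[] boundaries(long min, long max, int shards) {
        long step = (max - min) / shards;
        long[] result = new long[shards - 1];
        for (int i = 1; i < shards; i++) {
            result[i - 1] = min + step * i;
        }
        return result;
    }

    // Returns the index of the shard that owns the token of an sstable's
    // first partition key. Shard i covers [boundary[i-1], boundary[i]).
    static int shardFor(long firstToken, long[] boundaries) {
        int idx = Arrays.binarySearch(boundaries, firstToken);
        // Exact hit on a boundary: the token starts the next shard.
        // Miss: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        long[] b = boundaries(0, 100, 4);        // boundaries 25, 50, 75
        System.out.println(shardFor(49, b));     // prints 1
    }
}
```

An sstable that starts in one shard but extends past its upper boundary would still be assigned to the shard of its first key under this scheme, which is why the commit message pairs it with a writer that splits output at shard boundaries.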

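The thread reservation described in "Spread compaction threads equally among the levels" comes down to integer arithmetic: each level gets the whole part of threads divided by levels, and the remainder is handed out randomly. The sketch below shows that arithmetic under stated assumptions. The class `LevelThreadSketch` and the method `reserve` are hypothetical, and giving each leftover thread to a uniformly random level is one reading of "distributed randomly", not necessarily what the patch does.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; not the implementation from the patch.
class LevelThreadSketch {
    // Returns how many compaction threads each level may use.
    static int[] reserve(int threads, int levels, Random random) {
        int[] perLevel = new int[levels];
        // Whole part of threads / levels is reserved for every level.
        Arrays.fill(perLevel, threads / levels);
        // Assumption: each leftover thread goes to a uniformly random level.
        int remainder = threads % levels;
        for (int i = 0; i < remainder; i++) {
            perLevel[random.nextInt(levels)]++;
        }
        return perLevel;
    }

    public static void main(String[] args) {
        // 10 threads over 4 levels: every level gets at least 2.
        System.out.println(Arrays.toString(reserve(10, 4, new Random())));
    }
}
```

With fewer threads than levels the reserved share is zero and every thread is assigned randomly, so the guarantee against starvation only holds when there is at least one thread per level.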


This is the implementation of UnifiedCompactionStrategy, which is intended not only to replace all other compaction strategies and CSM, but also to optimally choose the configuration that will result in the minimum read and write costs, given a specific workload and dataset size. The strategy will choose either leveled or tiered merge policies depending on the workload and the costs associated with user queries and inserts. This strategy should be considered experimental at this stage.
--
This patch also introduces a compaction simulation that can be run with:
ant compactionSim
With cmd line arguments:
ant compactionSim--args="-wl R50_W50 -t adaptive"
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Introduce compaction interfaces.
This patch introduces 3 interfaces:
- CompactionStrategy: to encapsulate the behaviour of a compaction strategy
- CompactionStrategyContainer: to encapsulate the additional behavior of CSM
- CompactionStrategyFactory: to create the right compaction container (CSM or the unified CS)
CompactionStrategyManager and AbstractCompactionStrategy now implement these interfaces and have been encapsulated in the compaction package. In a future patch, UnifiedCompactionStrategy will also implement these interfaces, therefore standing on its own, without the need of CSM, which should eventually be removed along with the legacy strategies. The factory will take care of instantiating either the new strategy or CSM.
This patch also introduces CompactionStrategyOptions, for option validation. CompactionParams now uses this class.
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Implement compaction shards
The unified compaction strategy is now decoupled from CompactionStrategyManager by implementing its own thin compaction container.
When a compaction candidate needs to be produced, the strategy takes a snapshot of the eligible sstables and disk boundaries. It then applies a set of equivalence classes, which implement the same partitioning of sstables currently performed by CSM.
One of the equivalence classes splits sstables across token range boundaries, or shards. The number of shards is specified in the compaction properties, and the shards are created by splitting the local ranges by this number. sstables are assigned to a shard by looking at their first partition key.
When flushing or compacting, a specialized writer splits sstables at the shard boundaries. If the current sstable is larger than a minimum sstable size that can be specified in the compaction properties, then it is split when a boundary is reached.
getNextBackgroundTask() now returns a list of tasks, which are processed by the compaction executor asynchronously.
The following minor updates have been performed:
- the adaptive algorithm now searches for better choices of W every 5 minutes rather than 2 minutes;
- the cost calculator now uses a read multiplier of 0.1 rather than 0.25;
- all sstables in a bucket are compacted if their number is >= T. Compactions no longer stop at T or F. This may skip levels but has proven very effective in tests when switching from tiered to levelled.
The documentation for the unified strategy has been added as a Markdown document.
Reviewed by Branimir Lambov, Dimitar Dimitrov, Justin Chu, Aleksandr Sorokoumov
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Add a section in the UCS markdown about differences with STCS and LCS.
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update disk boundaries if current boundaries are null or out of date, even if the corresponding table is just being reloaded due to a metadata change.
Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.compaction_strategy_switching_test
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Simplify container creation and reloading and allow inheriting the state of the previous container when switching strategies.
Fixes the failing TestCompaction_with_UnifiedCompactionStrategy.disable_autocompaction_alter_and_nodetool_test
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make the enable/disable and isEnabled/isActive behavior of UnifiedCompactionContainer similar to that of CompactionStrategyManager.
This includes always starting up the backing compaction strategy. The previous behavior resulted in every compaction task being interrupted while autocompaction is disabled.
Fixes the failing TestDiskBalanceAfterJoiningRing.disk_balance_after_joining_ring_ucs_test
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Make compaction shards split inside disks and apply disk boundaries
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Permit early open in UnifiedCompactionStrategy
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Limit the number of concurrently running "oversized" compactions, in order to limit size amplification.
"Oversized" here is defined as close to the maximum shard size.
- Calculate the limit for the number of concurrent "oversized" compactions based on a configurable option for max allowed (tolerable) SA as a fraction of the expected uncompacted dataset size.
- Use reservoir sampling + keeping a non-oversized alternative to ensure that the limited number of oversized compactions that will be submitted are uniformly chosen from the available shards.
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Remove CompactionStatistics

STAR-13 Port isTrulyWrapAround
Co-authored-by: Sylvain Lebresne <[email protected]>

STAR-13 Fix UCS tests

STAR-13 Refactor repair out from compaction strategies
Some major refactorings:
Move mutateRepaired from CompactionStrategyManager to ColumnFamilyStore
Move repair-related code in CompactionStrategyContainer, PendingRepairManager to LocalSessions
CompactionStrategyContainer: Added a method to acquire the ReentrantReadWriteLock.WriteLock such that it can be passed and used for mutateRepaired
Co-authored-by: Justin Chu <[email protected]>

STAR-13 Refactor compaction statistics in order to simplify the code and reduce duplication and add some of the statistics (already available in CompactionAggregateStatistics) to TableMetrics.
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Improves the handling of compactions at the lowest levels of the compaction hierarchy:
- the size of shard-spanning compactions is divided by the number of shards spanned for the purposes of deciding in which level to put them;
- the levels hierarchy starts at the average flush size for the table;
- selects the tasks to run randomly to give each level and shard equal chance to run its compaction;
- shard spanning compactions are given more chances to be selected.
Additionally, this applies a couple of fixes:
- getNextCompaction was run too many times because the decision whether or not to run it was made at scheduling, which caused many to be scheduled, and they never checked if one didn't run before;
- early opened sstables could be selected for compaction, causing multiple compaction passes over the same data.
- Do not run multiple getNextAggregates
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Change the newly introduced compaction metrics to be aggregate metrics instead of per-table metrics.
This will make them easier to record/monitor in Fallout/Grafana, and will also enable computing them more efficiently from a cached value of the AggregateCompactions metric.
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Update UCS defaults
Co-authored-by: Justin Chu <[email protected]>

STAR-13 Track compaction rate in backgroundCompactions
Also:
- switch to sample-based exponential moving average, which is much simpler to implement correctly at the expense of expressing averaging periods in terms of updates count instead of time;
- add debug logging of the compaction task count decisions
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Spread compaction threads equally among the levels
Fixes the problem of long-running higher-level tasks starving level 0 or any level from continuing by reserving compaction threads for each of the levels of the hierarchy. More specifically, the whole part of the ratio between compaction threads and number of levels is reserved for each level, and any remainder is distributed randomly as before.
Replaces the oversized compaction mechanism with a simple limit for the aggregate size of running compactions, which is now also applied when single compactions are above that limit. This should prevent running out of space at the expense of several highest-level tables extra, i.e. slightly higher read amplification, until someone reacts to the warning, which I think is a sensible tradeoff.
Also removes unsupported options from the documentation markdown and adds max_space_overhead description.
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix Bloom Filter key estimation for ShardedCompactionWriter

STAR-13 Unable to set max_space_overhead in UCS
Co-authored-by: Justin Chu <[email protected]>

STAR-13 Log the number of compacted SSTables, and the shard and bucket identifiers when rejecting a compaction bigger than the max space overhead.
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Fix Bloom Filter tracking
BloomFilterTracker uses meters to avoid the situation when subsequent retrievals of recent metrics return 0. Tracking is done at CFS instance instead of per-SSTableReader to reduce overhead. SSTableReaders set correct BloomFilterTracker in setupOnline. Before correct BloomFilterTracker is set, SSTableReaders use NoopBloomFilterTracker.

STAR-13 Introduce a limit on the number of sstables in a compaction and a layout-preserving mode
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Log shard and bucket details on each getShardsWithBuckets() at TRACE level instead of at DEBUG level.
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 TTL-based SSTable expiration in UCS

STAR-13 Inherit BackgroundCompactions when recreating UCS

STAR-13 Compaction log analyzer
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Always create mocked SSTables with stubOnly

STAR-13 Fix compaction strategy reload
Prevent resetting JMX changes when we create new strategy containers. This patch makes sure that JMX changes that alter the container type (CSM->UCS or UCS->CSM) are not overwritten by subsequent metadata changes that are unrelated to compaction.

STAR-13 Handle IOErrors during background compaction task execution as FSErrors
Co-authored-by: Dimitar Dimitrov <[email protected]>

STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket
STAR-13 Use descriptor passed to ShardedMultiWriter
STAR-13 Fix condition for triggering layout compactions in case of non-uniform W
STAR-13 computeShardBoundaries handles no splitter and no disk boundaries
STAR-13 Check for partitioner mismatch before splitting local ranges
STAR-13 Use avg bucket size to adjust maxSSTablesToCompact
STAR-13 Limit the number of SSTables to compact in one operation
STAR-13 Switch CompactionsBytemanTest to signals
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Fix ShardedMultiWriterTest with compression
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Move releaseRepairData to LocalSessions
STAR-13 Decouple condition for switching writers from append
Co-authored-by: Branimir Lambov <[email protected]>

STAR-13 Shutdown previous strategy container on reload
STAR-13 Remove unused maxConcurrentOversizedCompactions
STAR-13 Metrics hold reference of a single instance controller
STAR-13 Move Controller params check to validateOptions
STAR-13 Fix target size estimation
STAR-13 Use correct bucket index for picks that contain only expired SSTables
STAR-13 Update shard index on disk change
STAR-13 Remove redundant shutdown call on keyspace drop
STAR-13 Fix amplification estimation
STAR-13 Fix calculation of the number of pending picks per bucket

Co-authored-by: Branimir Lambov <[email protected]>
Co-authored-by: Stefania Alborghetti <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 8cb7905)
(cherry picked from commit 069b6ba)
(cherry picked from commit d73878b)
(cherry picked from commit f41152b)
(cherry picked from commit fb05a5a)
(cherry picked from commit dc58b0b)

STAR-13 Fix TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager
(cherry picked from commit 8c7c128)
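
The thread reservation described under "Spread compaction threads equally among the levels" can be illustrated with a minimal sketch. This is not the code from the patch: the class name, the method `reserveThreads` and its parameters are hypothetical, and it only assumes what the commit message states, namely that each level gets the whole part of threads / levels and that the remainder is handed out at random.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; names are not taken from the UCS sources.
class LevelThreadReservation
{
    /**
     * Returns, for each level, the number of compaction threads it may use.
     * Every level is guaranteed floor(compactionThreads / levels) threads;
     * the remaining threads are assigned to randomly chosen levels.
     */
    static int[] reserveThreads(int compactionThreads, int levels, Random random)
    {
        if (levels <= 0 || compactionThreads < 0)
            throw new IllegalArgumentException("levels must be positive and threads non-negative");

        int[] perLevel = new int[levels];
        Arrays.fill(perLevel, compactionThreads / levels); // reserved whole part

        int remainder = compactionThreads % levels;
        for (int i = 0; i < remainder; i++)
            perLevel[random.nextInt(levels)]++;            // remainder distributed randomly

        return perLevel;
    }

    public static void main(String[] args)
    {
        // 10 threads over 4 levels: each level is guaranteed 2, and 2 extra land on random levels.
        System.out.println(Arrays.toString(reserveThreads(10, 4, new Random())));
    }
}
```

With this scheme a long-running task at a high level can occupy at most its own reservation plus whatever part of the remainder it was given, so level 0 always keeps at least its reserved share when there are at least as many threads as levels.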

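The "sample-based exponential moving average" mentioned under "Track compaction rate in backgroundCompactions" can be sketched as follows. This is an assumption-laden illustration rather than the patch's implementation: the class `SampleBasedMovingAverage`, its constructor argument and the choice of smoothing factor 1 / period are all hypothetical. The point it shows is that the average is advanced once per update, so the averaging period is expressed as a number of updates and no timestamps are needed.

```java
// Hypothetical illustration only; names and the smoothing factor are assumptions.
class SampleBasedMovingAverage
{
    private final double alpha;   // weight given to each new sample
    private double average;
    private boolean initialized;

    /** @param periodInUpdates averaging period, counted in updates rather than in time */
    SampleBasedMovingAverage(int periodInUpdates)
    {
        if (periodInUpdates < 1)
            throw new IllegalArgumentException("period must be at least 1 update");
        this.alpha = 1.0 / periodInUpdates;
    }

    /** Folds one new sample (e.g. a measured compaction rate) into the average. */
    void update(double sample)
    {
        if (!initialized)
        {
            average = sample;     // the first sample seeds the average
            initialized = true;
        }
        else
        {
            average += alpha * (sample - average);
        }
    }

    double get()
    {
        return average;
    }

    public static void main(String[] args)
    {
        SampleBasedMovingAverage rate = new SampleBasedMovingAverage(4);
        rate.update(100.0);
        rate.update(200.0);
        System.out.println(rate.get()); // 100 + 0.25 * (200 - 100) = 125.0
    }
}
```

The tradeoff matches the one named in the commit message: the update is a single arithmetic step with no clock handling, but irregularly spaced updates are all weighted the same regardless of how much time passed between them.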

STAR-13 Handle IOErrors during background compaction task execution as FSErrors Co-authored-by: Dimitar Dimitrov <[email protected]> STAR-13 Trigger layout compactions automatically when there are more than F*F SSTables in a bucket STAR-13 Use descriptor passed to ShardedMultiWriter STAR-13 Fix condition for triggering layout compactions in case of non-uniform W STAR-13 computeShardBoundaries handles no splitter and no disk boundaries STAR-13 Check for partitioner mismatch before splitting local ranges STAR-13 Use avg bucket size to adjust maxSSTablesToCompact STAR-13 Limit the number of SSTables to compact in one operation STAR-13 Switch CompactionsBytemanTest to signals Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Fix ShardedMultiWriterTest with compression Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Move releaseRepairData to LocalSessions STAR-13 Decouple condition for switching writers from append Co-authored-by: Branimir Lambov <[email protected]> STAR-13 Shutdown previous strategy container on reload STAR-13 Remove unused maxConcurrentOversizedCompactions STAR-13 Metrics hold reference of a single instance controller STAR-13 Move Controller params check to validateOptions STAR-13 Fix target size estimation STAR-13 Use correct bucket index for picks that contain only expired SSTables STAR-13 Update shard index on disk change STAR-13 Remove redundant shutdown call on keyspace drop STAR-13 Fix amplification estimation STAR-13 Fix calculation of the number of pending picks per bucket Co-authored-by: Branimir Lambov <[email protected]> Co-authored-by: Stefania Alborghetti <[email protected]> Co-authored-by: Dimitar Dimitrov <[email protected]> Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit 8cb7905) (cherry picked from commit 069b6ba) (cherry picked from commit d73878b) (cherry picked from commit f41152b) (cherry picked from commit fb05a5a) (cherry picked from commit dc58b0b) STAR-13 Fix 
TimeWindowCompactionStrategyTest.testGroupForAntiCompaction usage of CompactionStrategyManager (cherry picked from commit 8c7c128) (cherry picked from commit 55cfeb0) STAR-13 General compile fixes: unused imports, TimeUUID instead of UUID, long instead of int gcBefore, replace System calls with Clock.Global, etc. * Add json-simple dependency used in CompactionLogAnalyzer * Add UCS properties to CassandraRelevantProperties * Checkstyle allow certain uses of java.util.concurrent.CompletableFuture in CC code: CompactionManager.BackgroundCompactionCandidate * Restore C* 5.0 methods in CompactionStrategy getPerLevelSizeBytes, isLevelCompaction, and getSSTableCountPerTWCSBucket used in TableStatsHolder * Restore C* 5.0 CompactionStrategy.getEstimatedRemainingTasks used in StreamSession * Restore C* 5.0 BloomFilterTracker lastFalsePositiveCount, lastTruePositiveCount, and lastTrueNegativeCount used in BloomFilterMetrics * Add TableMetrics.bloomFilterFalseRation used in CC ColumnFamilyStore and RealEnvironment STAR-13 Fix out-of-order enum values in CassandreRelevantProperties.
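
The "Spread compaction threads equally among the levels" message above describes reserving the whole part of threads divided by levels for each level and handing out the remainder randomly. A minimal sketch of that arithmetic follows. It is not the code from this PR: the class `LevelThreadBudget` and the method `allocate` are hypothetical names chosen for illustration, and the real strategy may track running tasks rather than computing a fixed array.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of the thread split described in the commit message;
// not the implementation from this PR.
final class LevelThreadBudget
{
    /**
     * Returns how many compaction threads each level may use:
     * floor(threads / levels) is reserved for every level, and each
     * leftover thread is given to a randomly chosen level.
     */
    static int[] allocate(int threads, int levels, Random random)
    {
        if (threads < 0 || levels <= 0)
            throw new IllegalArgumentException("threads must be >= 0 and levels must be > 0");

        int[] perLevel = new int[levels];
        Arrays.fill(perLevel, threads / levels);   // whole part, reserved for every level

        int remainder = threads % levels;
        for (int i = 0; i < remainder; i++)
            perLevel[random.nextInt(levels)]++;    // leftover threads go to random levels

        return perLevel;
    }
}
```

With 10 threads and 4 levels, `LevelThreadBudget.allocate(10, 4, new Random())` guarantees 2 threads to every level and places the remaining 2 on randomly chosen levels, so a long-running top-level compaction cannot occupy the slots that level 0 needs. When there are fewer threads than levels, the reserved share in this sketch is zero and all threads are handed out randomly.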

Gerrrr force-pushed the STAR-13-UCS branch 5 times, most recently from 6baf229 to bb8a8bd, on April 30, 2021 14:35

Gerrrr force-pushed the STAR-13-UCS branch 2 times, most recently from 689cbc5 to ed83b73, on May 19, 2021 11:59

Gerrrr force-pushed the STAR-13-UCS branch from ed83b73 to 8db267c on May 21, 2021 15:41

jacek-lewandowski force-pushed the ds-trunk branch from ccf9d94 to 758aff8 on May 24, 2021 04:12

Gerrrr force-pushed the STAR-13-UCS branch 4 times, most recently from ee2c9c5 to eb984f9, on May 31, 2021 15:50

Gerrrr force-pushed the STAR-13-UCS branch 6 times, most recently from 2f710d7 to 78ddd39, on June 9, 2021 08:59

Gerrrr mentioned this pull request Jun 9, 2021

STAR-13 Run tests for UnifiedCompactionStrategy datastax/cassandra-dtest#22

Merged

Gerrrr requested a review from blambov June 9, 2021 10:54

blambov reviewed Jun 14, 2021

jacek-lewandowski force-pushed the ds-trunk branch from ae757d9 to c097960 on June 14, 2021 10:33

blambov changed the base branch from ds-trunk to STAR-13-base June 16, 2021 08:27

blambov reviewed Jun 16, 2021

blambov reviewed Jun 17, 2021

blambov reviewed Jun 18, 2021

blambov reviewed Jul 8, 2021

src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.java (outdated, resolved)


		#### UCS-tiered vs STCS

		SizeTieredCompactionStrategy is pretty close to UCS. However, it defines buckets/levels by looking for sstables of similar size. This can result in some odd selections of buckets, possibly spanning sstables of wildly different sizes, while UCS's selection is more stable and predictable.
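
The contrast in the paragraph above can be illustrated with a small sketch. It assumes a simplified UCS-style rule in which each level is a fixed geometric size range, level = floor(log_F(size / m)), where m is a base sstable size and F a fan factor. The class name, method name, and parameters are hypothetical and not taken from the PR. Because the boundaries do not depend on which other sstables happen to exist, the same sstable always maps to the same level, unlike a grouping based on similarity to neighbouring sstables.

```java
// Hypothetical illustration of fixed level boundaries; not the code from this PR.
final class FixedLevels
{
    /** Level index for an sstable: floor(log_F(size / m)), clamped to 0 for sizes below m. */
    static int levelOf(long sizeBytes, long baseSizeBytes, int fanFactor)
    {
        if (sizeBytes < 0 || baseSizeBytes <= 0 || fanFactor < 2)
            throw new IllegalArgumentException("need size >= 0, base size > 0, fan factor >= 2");

        long ratio = sizeBytes / baseSizeBytes;   // integer division floors size / m
        int level = 0;
        while (ratio >= fanFactor)                // count how many times F divides into the ratio
        {
            ratio /= fanFactor;
            level++;
        }
        return level;
    }
}
```

With a base size of 100 MiB and a fan factor of 4, sstables of 100 MiB and 399 MiB both fall in level 0 while a 400 MiB sstable falls in level 1, whatever other sstables are present.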
