CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUtils.writeMostSignificantBytes #1730

djatnieks · 2025-05-09T21:42:07Z

What is the issue

...

What does this PR fix and why was it fixed

...

- use `cassandra.remote_storage_handler_factory` to replace `cassandra.remote_storage_handler` - add `TableMetadata` to `StorageProvider#createDataDirectories` signature - use `enableAutoCompaction` to replace `StorageHandler#isRemote` - emit SSTaleAddingNotification in LifecycleTransaction#checkpoint before new sstables are made visible. Note that early open is not supported because it opens sstables before `trackNewWritten`

Make `RowFilter` non-iterable because it’s confusing in a tree of expressions. Also: * Change the meaning of `RowFilter#expressions()` to traverse the entire tree of expressions. * Add a `#withoutDisjunctions()` utility method for legacy indexes. * Inspect and update if needed all the uses of RowFilter iteration to ensure that they are valid. * Remove the `warnIfFilterIsATree` warning since it should be clear what each method does.

Closes riptano/cndb#11932 The anti-join strategy for the inequality operator, `!=`, can be replaced with a union strategy of two semi-ranges, which make the plan explicit and can be more efficient. This commit plans the inequality operator with the union of two semi-ranges, thus no anti-join iterator is used for it. In the case of truncatable types as Big Decimal and Big Integer it implements full index scan as the range index scan plan node without bounds. This PR also adds tests for `not contains key` to discard the usage the union strategy by testing the false negative case of an empty map.

This performance regression was introduced by 9cabec5. I haven't been able to find the exact commit responsible for this regression, but essentially, we switched from `DirectReaders` to the lucene `DirectReader` around 0774345 and 6bd00d1, and that led to a lot of unnecessary object creation. In my initial testing, this change appeared to improve numeric query throughput from `282.03it/s` for 100k queries to `333.15it/s` for 100k queries. The memory profile also showed far fewer objects created. Note: the above numbers might have been from variability in my testing. It would be helpful to test in a more controlled environment. Either way, based on my understanding of the objects involved, this should generally produce better results because we'll have fewer objects. Finally, this change is especially helpful because the `sort then filter` logic has to initialize the the iterators for search, so it helps in both execution paths.

This splits compactions that are to produce more than one output sstable into tasks that can execute in parallel. Such tasks share a transaction and have combined progress and observer. Because we cannot mark parts of an sstable as unneeded, the transaction is only applied when all tasks have succeeded. This also means that early open is not supported for such tasks. The parallelization also takes into account thread reservations, reducing the parallelism to the number of available threads for its level. The new functionality is turned on by default. Major compactions will apply the same mechanism to parallelize the operation. They will only split on pre- existing boundary points if they are also boundary points for the current UCS configuration. This is done to ensure that major compactions can re-shard data when the configuration is changed. If pre-existing boundaries match the current state, a major compaction will still be broken into multiple operations to reduce the space overhead of the operation. Also: - Introduces a parallelism parameter to major compactions (`nodetool compact -j <threads>`, defaulting to half the compaction threads) to avoid stopping all other compaction for the duration. - Changes SSTable expiration to be done in a separate `getNextBackgroundCompactions` round to improve the efficiency of expiration (separate task can run quickly and remove the relevant sstables without waiting for a compaction to end). - Applies small-partition-count correction in `ShardManager.calculateCombinedDensity`.

…CS (#1464) - add configuration to skip mutating STATS after receiving ZCS sstable

Expose some methods needed for riptano/cndb#12128

Unbounded queue length at the native transport stage can caused large backlogs of requests that the system processes, even though clients may no longer expect a response. This PR implements a limited backport of CNDB-11070, introducing the notion of a native request timeout that can shed messages with excessive queue times at the NTR stage as well as async read/write stages, if enabled. Cross-node message timeouts are also now respected earlier in the mutation verb handler. This is a fairly straightforward cherry-pick of #1393 targeting main instead of cc-main-migration-release.

This reverts commit c06c94c. It seems the removal of `Index#postProcessor` by CNDB-11762 broke some tests in CNDB's `MultiNodeBillingTest`. Unfortunately that patch [was merged before creating the PR bumping the CC version used by CNDB](#1422 (comment)). [The CNDB PR](riptano/cndb#12076) was created after that merging but it was superseded by other CC version bumps. So I'm adding this reversal so we can investigate how the removal of `Index#postProcessor` affects those tests.

This patch replaces null values of `deterministic`, `monotonic` and `monotonic_on` columns in `system_schema.functions` and `system_schema.aggregates` with negative defaults. These defaults will be addressed if/once DB-672 gets ported to CC.

There are two mechanisms of detecting that the cluster is in the upgrade state and the minimum version. Both are slightly different, and both are not pluggable which means that CNDB doesn't work properly with them. Those mechanisms are implemented in `Gossiper`. Although we do not use `Gossiper` in CNDB, there are classes like `ColumnFilter` which go to `Gossiper` to check the upgrade state. So far, using that stuff in CDNB was a bit unpredictable, some of them reported the cluster is upgraded and in the current version, the other did not. This turned out to be a problem, especially for the `ColumnFilter` because when we upgrade DSE --> CC, CC assumes that the newest filter version should be used, which is not correctly deserialized and interpreted by DSE. The fix is not small, but it probably simplifies stuff a bit. First of all, two mechanism are merged into one. Moreover, we added pluggability of it so that we can provide the appropriate implementation in CNDB coordinators and writers, which is based on ETCD.

Part of riptano/cndb#12139 Moves constant shard count outside looping shards to reduce confusion.

…with DurationSpec type and 'native_transport_timeout_in_ms' as convertible old name with Long type; add some tests.

…MemtableIndexTest and TrieMemtableIndexAllocationsHeapBuffersTest from main branch.

…strictions (#1449) Closes riptano/cndb#12139 This PR adds a test of row count of a SAI plan in the presence of restrictions. Currently it tests queries with inequality, equality and half-ranges on different SAI column types and with or without histograms.

…g VIntOutOfRangeException to the catch block of SSTableIdentityIterator.exhaust method.

…pactionProgress to return the operation type from the first entry in the shared progress; needed in cases that a CompactionTask type is changed after creation.

…opriate (#1469) Fixes riptano/cndb#12239 We found the System.nanoTime was using significant cpu cost, but because the timeout is high enough, we can accept the inaccuracy. - [ ] Make sure there is a PR in the CNDB project updating the Converged Cassandra version - [ ] Use `NoSpamLogger` for log lines that may appear frequently in the logs - [ ] Verify test results on Butler - [ ] Test coverage for new/modified code is > 80% - [ ] Proper code formatting - [ ] Proper title for each commit staring with the project-issue number, like CNDB-1234 - [ ] Each commit has a meaningful description - [ ] Each commit is not very long and contains related changes - [ ] Renames, moves and reformatting are in distinct commits

…miss) (#1475)

Fixes regression in jvector 3.0.4 when compacting PQVectors larger than 2GB

SimpleClientPerfTest has been failing in CI since changes from CNDB-10759 This change in `SimpleClientPerfTest`, updates the anonymous class `Message.Codec<QueryMessage>` to override the correct method, `public CompletableFuture<Response> maybeExecuteAsync` from `QueryMessage`, whose signature was changed as part of CNDB-10759. - [ ] Make sure there is a PR in the CNDB project updating the Converged Cassandra version - [ ] Use `NoSpamLogger` for log lines that may appear frequently in the logs - [ ] Verify test results on Butler - [ ] Test coverage for new/modified code is > 80% - [ ] Proper code formatting - [ ] Proper title for each commit staring with the project-issue number, like CNDB-1234 - [ ] Each commit has a meaningful description - [ ] Each commit is not very long and contains related changes - [ ] Renames, moves and reformatting are in distinct commits

…ing for async batchlog removal (#1485) The test asserts that the batchlog is removed immediately after the write completes, but removal of the batchlog is async and can be delayed, particularly in resource-limited environments like CI.

The test generates partition index accesses by reusing the same key, and if the key cache is enabled, the test will fail for bigtable profiles because the key will be in the key cache.

…by filtering queries (#1484) Queries creating fake index contexts each create their own context, which can then race on metric registration (as the metrics have the same patterns). This can cause a query to fail. These metrics are superfluous, we can skip creating them entirely.

… index format version 'dx', 'cx', or older.

…ate.accumulatedDataSize; it only worked to fix SensorsWriteTest.testMultipleRowsMutationWithClusteringKey for SkipListMemtable and may not be necessary if the default memtable becomes TrieMemtable. Revisit SensorsWriteTest later if necessary.

Move static class TrieMemtable.Factory to TrieMemtableFactory class; Use suggested TriePartitionUpdate.unsharedHeapSize implementation; Use InMemoryTrie.shortLived in TrieToDotTest and TrieToMermaidTest; Add specific versions aa, ca, and da to RowIndexTest;

Add addMemoryUsageTo in SkipListMemtable and TrieMemtable Add TrieMemtable.switchOut

Resolves: riptano/cndb#13372 Currently min paxos backoff is to `0` which doesn't help there is already contention. Introduced a new property `cassandra.lwt_min_backoff_ms` which defautls to 5ms (10x smaller that the default 50ms to give enough range and be conservative relative to the previous 0 value, but still a magic number) and fix the `contentionBackoffLatency` that accepts nanos

Fixes riptano/cndb#13618 CNDB-13299 prevents creating indexes with too long index names and adds corresponding tests. Thus this test in another test class is not needed.

…n parameters to index config (#1676) Fixes: riptano/cndb#12937 This pull request upgrades jvector from **4.0.0-beta.1** to **4.0.0-beta.2** and introduces three new configuration options to influence graph construction: 1. `neighborhood_overflow` 2. `alpha` 3. `enable_hierarchy` The defaults for these hyperparameters vary between in memory and on disk, but when these are configured, they will be uniformly applied to graphs built by a memtable and by compaction. **Details** - **jvector 4.0.0-beta.2** - Minor changes to the graph index architecture, including some code now under `...disk.feature...` packages. - Removed the old `CachingGraphIndex` code in Cassandra, which is no longer used. - New constructor arguments for controlling `neighborhood_overflow`, `alpha`, and hierarchical levels in the HNSW graph. - **New configuration options** 1. **`neighborhood_overflow`**: A `float` >= 1.0 controlling how aggressively the graph tries to insert extra neighbors on each HNSW layer. 2. **`alpha`**: A `float` > 0 used in the neighbor selection phase for HNSW. 3. **`enable_hierarchy`**: A `boolean` indicating whether HNSW should allow multiple layers (true) or a single-layer approximate graph (false). Defaults to false. I manually verified that the jvector upgrade is backwards compatible, meaning that when we build using the new jvector version, we can read with an old jvector version, so I did not create a new SAI on disk file format version.

…ate cdc_enabled key in base yaml (#1680) ### What is the issue #1563 upgraded our version of snakeyaml. This new version logs when duplicate keys are found. The test-cdc target concatenates a CDC-specific yaml to the base yaml. Both contain entries for the cdc_enabled key, causing tests that load this yaml to log the warning. This causes CDC CI failures. ### What does this PR fix and why was it fixed This PR removes the cdc_enabled entry in the base yaml, as it's the same as the default. This is consistent with how we handle other default keys that might be duplicated across multiple yaml fragments in the test commands.

### What is the issue riptano/cndb#13583 ### What does this PR fix and why was it fixed This PR adds comprehensive metrics for Storage Attached Indexes (SAI) vector search operations, providing crucial insights into both ANN (Approximate Nearest Neighbor) graph searches and brute force operations. New Vector Search Metrics: Search Operation Counters: - `ANNNodesVisited`: Total number of nodes visited during ANN searches (this is equivalent to approximate similarity score computations) - `ANNNodesReranked`: Number of nodes that underwent exact distance computation for reranking (this is equivalent to exact similarity score computations) - `ANNNodesExpanded`: Total number of nodes whose edges were explored - `ANNNodesExpandedBaseLayer`: Number of nodes expanded in the base layer of the graph - `ANNGraphSearches`: Count of new graph searches initiated - `ANNGraphResumes`: Count of resumed graph searches (when a search continues from previous results) - `ANNGraphSearchLatency`: Timer measuring individual graph search latency (Note: This measures per-graph search time, not total query time which may involve multiple graphs) Brute Force Operation Counters: - `BruteForceNodesVisited`: Number of nodes visited during brute force searches (approximate similarity comparisons) - `BruteForceNodesReranked`: Number of nodes that underwent exact similarity computation during brute force searches Memory Usage Tracking: - `quantizationMemoryBytes`: Current memory usage by the quantization (PQ or BQ) data structures - `ordinalsMapMemoryBytes`: Current memory usage by ordinals mapping structures (only matters in some cases) - `onDiskGraphsCount`: Number of currently loaded graph segments - `onDiskGraphVectorsCount`: Total number of vectors in currently loaded graphs These metrics will help us: 1. Understand if we are performing more comparisons than expected 2. Get insight into number of graphs queried 3. Get insight into the brute force vs graph query path 4. Understand current memory utilization The counters provide operations/second metrics, allowing calculation of per-query averages by dividing by the number of queries. The memory tracking metrics help monitor resource usage across graph segments as they are loaded and unloaded.

riptano/cndb#12937. This is a follow up on #1676, which updated the jvector dependency but didn't update the jvector version in the actual graph builder. * Adds jvector version configuration, defaulted to 4 * Tests jvector versions 2 and 4 to ensure we can query sstables with both kinds of graphs * Makes the `enable_hierarchy` config backwards compatible by ignoring it if the jvector version is too low * Removes the ability to write FusedADC since it does not work with the latest jvector version (discovered when writing compatibility tests) * Does not remove FusedADC read path since there could be indexes based on it in production * Replaces call to deprecated `graph.size()` method with `graph.size(0)`

…om DSE

…-12503

) ### What is the issue ChunkCache needs to invalidate the test file with invalidateFileNow

### What is the issue The `main-5.0` branch uses `TrieMemtable` by default and the `test/conf/cassandra.yaml` configures `ByteOrderedPartitioner`. In this environment, `CQLSSTableWriterDaemonTest` tests `testSkipBuildingIndexesWithSAI` and `testWriteWithSAI` are failing. ### What does this PR fix and why was it fixed This fix moves the SAI specific tests from `CQLSSTableWriterTest` to a new `CQLSSTableWriterSAITest` so that `Murmur3Partitioner` can be explicitly set for the SAI specific tests.

…ngWIthAggregateTest to use paging in bytes for CC. (#1702) ### What is the issue With the rebase of CC on C* 5.0.2, there is a new test CoordinatorReadLatencyMetricTest and CoordinatorReadLatencyMetricTest.internalPagingWithAggregateTest is failing. ### What does this PR fix and why was it fixed Uses paging in bytes instead of paging in rows in internalPagingWithAggregateTest.

Removes unnecessary tiebreaker column, since the word counts give different scores for text with the same term count. One value is improved to remove possibility for flakiness in the order. Adds comments with the term and word counts to help verifying the expected results. A helper function to count them is added. Also adds a small tests for analyzer with lowercase.

#1691) Executing a query with BM25 search and a condition on partial SSTable results in empty iterator access error. And there was no test with storing data in segments. The PR implements BM25 search tests with splitting data into two tables. This reproduced this bug, CNDB-13696, and demonstrates current confusion on the BM25 ordering result to be fixed by CNDB-13553. This PR adds a check for empty iterator created for a PK belonging to another segment. This fixes the bug of trying to get the first element of an empty iterator.

…d of bytes (#1699)

…nTest (#1664) ### What is the issue `PendingAntiCompactionTest.testRetriesTimeout` is flaky on `main` failing with a timeout reported in the junit test task. riptano/cndb#13524 ### What does this PR fix and why was it fixed `PendingAntiCompactionTest.testRetriesTimeout`, and a couple other tests, are calling `Future.get` without any timeout. This change adds a 30 seconds timeout to prevent an indefinite wait for the future and a log in the failing test rather than in the junit test task.

…tricsTest

### What is the issue This is a follow up test for riptano/cndb#13696. ### What does this PR fix and why was it fixed The fix was already merged, but this confirms that without the fix, collections are affected by the NoSuchElementException bug.

Relates to riptano/cndb#13766 Brings in several improvements. CNDB PR: riptano/cndb#13789

### What is the issue Didn't create an issue as this is just a new test. ### What does this PR fix and why was it fixed We implicitly rely on this and are exposing this as a feature to our end users, so we want to confirm that the stop words are the expected stop words.

Fixes: riptano/cndb#13718 We had adopted this version of lucene datastax/lucene@5ea8bb4 in order to support our custom modifications of HNSW on top of lucene. We now use https://github.com/datastax/jvector for vector search and no longer need a custom build. I propose we use 9.8.0 since that is the closest release to the one we have been using. CNDB test pr: riptano/cndb#13757

…ils.writeMostSignificantBytes

github-actions · 2025-05-09T21:42:26Z

pkolaczk · 2025-05-15T10:35:30Z

src/java/org/apache/cassandra/index/sai/disk/io/IndexFileUtils.java

+            boolean shouldUpdateChecksum = true;
+            if (buffer.remaining() < Long.BYTES)
+            {
+                // Avoid double-counting bytes in the checksum - in this path, super.writeMostSignificantBytes will call
+                // it's super, DataOutputPlus.writeMostSignificantBytes, and that laeds to the writeXXX methods which
+                // are overridden in this class will update the checksum.
+                shouldUpdateChecksum = false;
+            }


minor suggestion: you can simplify this to:

boolean shouldUpdateChecksum = (buffer.remaining() >= Long.BYTES);

pkolaczk · 2025-05-15T10:48:34Z

src/java/org/apache/cassandra/index/sai/disk/io/IndexFileUtils.java

+                // Avoid double-counting bytes in the checksum - in this path, super.writeMostSignificantBytes will call
+                // it's super, DataOutputPlus.writeMostSignificantBytes, and that laeds to the writeXXX methods which
+                // are overridden in this class will update the checksum.


So if the remaining buffer is too small to hold a single long, then super.writeMostSignificantBytes updates the checksum but otherwise it does not? If it calls to writeXXX shouldn't it always update the checksum? This existence of two paths looks very suspicious to me ;)

jasonstack and others added 30 commits January 29, 2025 17:07

rebase fix CNDB-10718: Make RowFilter non-iterable

5bfb179

CNDB-12130: read source file sequentially when sending file through Z…

cfd4707

…CS (#1464) - add configuration to skip mutating STATS after receiving ZCS sstable

CNDB-12128 Expose to CNDB some methods around ExpirationTask (#1465)

8c07be0

Expose some methods needed for riptano/cndb#12128

CNDB-12139 minor refactor row count estimation code (#1463)

3382e6c

Part of riptano/cndb#12139 Moves constant shard count outside looping shards to reduce confusion.

CC5 Fix for CNDB-11070 to use 'native_transport_timeout' as 5.0 name …

438e810

…with DurationSpec type and 'native_transport_timeout_in_ms' as convertible old name with Long type; add some tests.

CNDB-12215 Replace [On|Off]HeapTrieMemtableIndexTest classes withTrie…

d7e7c1c

…MemtableIndexTest and TrieMemtableIndexAllocationsHeapBuffersTest from main branch.

CNDB-12222 Fix flaky tests in SSTableCorruptionDetectionTest by addin…

58613e8

…g VIntOutOfRangeException to the catch block of SSTableIdentityIterator.exhaust method.

CNDB-12223 Fix CorruptedSSTablesCompactionTest by modifying SharedCom…

4da94f5

…pactionProgress to return the operation type from the first entry in the shared progress; needed in cases that a CompactionTask type is changed after creation.

CNDB-12257: do not call System.getProperty on an hotpath (ChunkCache …

5f30fdf

…miss) (#1475)

CNDB-12277: Upgrade to jvector 3.0.6 (#1478)

42b8328

Fixes regression in jvector 3.0.4 when compacting PQVectors larger than 2GB

Improve typed reads in RandomAccessReader

a898435

CNDB-11437: Disable key cache in LazyBloomFilterTest

5fb48a0

The test generates partition index accesses by reusing the same key, and if the key cache is enabled, the test will fail for bigtable profiles because the key will be in the key cache.

CNDB-9850 followup: BtiFormat set byteComparableVersion based on trie…

ec55992

… index format version 'dx', 'cx', or older.

CNDB-9850 followup: more alignment with main branch

d884ce7

Add addMemoryUsageTo in SkipListMemtable and TrieMemtable Add TrieMemtable.switchOut

aymkhalil and others added 27 commits April 11, 2025 16:22

CNDB-13618 remove long index name test after fix (#1681)

6bb8fd0

Fixes riptano/cndb#13618 CNDB-13299 prevents creating indexes with too long index names and adds corresponding tests. Thus this test in another test class is not needed.

CNDB-13582: Ignore TOC when discovering SAI components of sstables fr…

c9c57f1

…om DSE

CC5 Fix CreateTableStatementCompactionStrategiesTest

dcb8922

CC5 Fix CQLSSTableWriterClientTest.testWriteWithSAI

84f810d

CC5 Fix IndexNameTest.testQuotedAndLongColumnNames after port of CNDB…

c15a353

…-12503

CNDB 13650: fix SSTableCorruptionDetectionTest.testSSTableScanner (#1690

6d68a5b

) ### What is the issue ChunkCache needs to invalidate the test file with invalidateFileNow

CC5: IndexStatusManagerTest accomodates INDEX_BUILD_IN_PROGRESS

e9a175a

CNDB-13785: fix IndexComponentDiscover#StateBuilder to pass MB instea…

568e72c

…d of bytes (#1699)

CC5 cleanup merge conflict markers after porting CNDB-13457

124be37

CC5: move first delete to deleteWithConfirm, add debugging

e4fd01d

CC5: try a longer sleep to generate more contention in TrieMemtableMe…

af0921a

…tricsTest

CC5: don't confirm second delete in CommitLogSegmentManagerCDCTest

a55f060

CNDB-13766: Upgrade to jvector 4.0.0-beta.4 (#1698)

825ecaf

Relates to riptano/cndb#13766 Brings in several improvements. CNDB PR: riptano/cndb#13789

CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUt…

95fbdbb

…ils.writeMostSignificantBytes

pkolaczk reviewed May 15, 2025

View reviewed changes

djatnieks force-pushed the main-5.0 branch from 01bd2fc to 6e759fa Compare May 28, 2025 20:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUtils.writeMostSignificantBytes #1730

CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUtils.writeMostSignificantBytes #1730

Uh oh!

djatnieks commented May 9, 2025

Uh oh!

github-actions bot commented May 9, 2025

Uh oh!

pkolaczk May 15, 2025

Uh oh!

pkolaczk May 15, 2025

Uh oh!

Uh oh!

CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUtils.writeMostSignificantBytes #1730

Are you sure you want to change the base?

CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUtils.writeMostSignificantBytes #1730

Uh oh!

Conversation

djatnieks commented May 9, 2025

What is the issue

What does this PR fix and why was it fixed

Uh oh!

github-actions bot commented May 9, 2025

Checklist before you submit for review

Uh oh!

pkolaczk May 15, 2025

Choose a reason for hiding this comment

Uh oh!

pkolaczk May 15, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!