Skip to content

CNDB-13745 Avoid double counting bytes in the checksum in IndexFileUtils.writeMostSignificantBytes #1730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1,465 commits into
base: main-5.0
Choose a base branch
from

Conversation

djatnieks
Copy link

What is the issue

...

What does this PR fix and why was it fixed

...

jasonstack and others added 30 commits January 29, 2025 17:07
- use `cassandra.remote_storage_handler_factory` to replace
`cassandra.remote_storage_handler`

- add `TableMetadata` to `StorageProvider#createDataDirectories`
signature

- use `enableAutoCompaction` to replace `StorageHandler#isRemote`

- emit SSTaleAddingNotification in LifecycleTransaction#checkpoint
before new sstables are made visible. Note that early open is not
supported because it opens sstables before `trackNewWritten`
Make `RowFilter` non-iterable because it’s confusing in a tree of expressions.

Also:
* Change the meaning of `RowFilter#expressions()` to traverse the entire tree of expressions.
* Add a `#withoutDisjunctions()` utility method for legacy indexes.
* Inspect and update if needed all the uses of RowFilter iteration to ensure that they are valid.
* Remove the `warnIfFilterIsATree` warning since it should be clear what each method does.
Closes riptano/cndb#11932

The anti-join strategy for the inequality operator, `!=`, can be
replaced with a union strategy of two semi-ranges, which make the plan
explicit and can be more efficient.

This commit plans the inequality operator with the union of two semi-ranges,
thus no anti-join iterator is used for it. In the case of truncatable
types as Big Decimal and Big Integer it implements full index scan as
the range index scan plan node without bounds.

This PR also adds tests for `not contains key` to discard the usage the
union strategy by testing the false negative case of an empty map.
This performance regression was introduced by
9cabec5. I haven't been able to find
the exact commit responsible for this regression, but essentially, we
switched from `DirectReaders` to the lucene `DirectReader` around
0774345 and
6bd00d1, and that led to a lot of
unnecessary object creation.

In my initial testing, this change appeared to improve numeric query
throughput from `282.03it/s` for 100k queries to `333.15it/s` for 100k
queries. The memory profile also showed far fewer objects created.

Note: the above numbers might have been from variability in my testing.
It would be helpful to test in a more controlled environment. Either
way, based on my understanding of the objects involved, this should
generally produce better results because we'll have fewer objects.

Finally, this change is especially helpful because the `sort then
filter` logic has to initialize the the iterators for search, so it
helps in both execution paths.
This splits compactions that are to produce more than one
output sstable into tasks that can execute in parallel.
Such tasks share a transaction and have combined progress
and observer. Because we cannot mark parts of an sstable
as unneeded, the transaction is only applied when all
tasks have succeeded. This also means that early open
is not supported for such tasks.

The parallelization also takes into account thread reservations,
reducing the parallelism to the number of available threads
for its level. The new functionality is turned on by default.

Major compactions will apply the same mechanism to
parallelize the operation. They will only split on pre-
existing boundary points if they are also boundary
points for the current UCS configuration. This is done
to ensure that major compactions can re-shard data when
the configuration is changed. If pre-existing boundaries
match the current state, a major compaction will still be
broken into multiple operations to reduce the space
overhead of the operation.

Also:
- Introduces a parallelism parameter to major compactions
  (`nodetool compact -j <threads>`, defaulting to half the
  compaction threads) to avoid stopping all other compaction
  for the duration.

- Changes SSTable expiration to be done in a separate
  `getNextBackgroundCompactions` round to improve the
  efficiency of expiration (separate task can run quickly
  and remove the relevant sstables without waiting for
  a compaction to end).

- Applies small-partition-count correction in
  `ShardManager.calculateCombinedDensity`.
…CS (#1464)

- add configuration to skip mutating STATS after receiving ZCS sstable
Unbounded queue length at the native transport stage can caused large
backlogs of requests that the system processes, even though clients may
no longer expect a response.

This PR implements a limited backport of CNDB-11070, introducing the
notion of a native request timeout that can shed messages with excessive
queue times at the NTR stage as well as async read/write stages, if
enabled. Cross-node message timeouts are also now respected earlier in
the mutation verb handler.

This is a fairly straightforward cherry-pick of
#1393 targeting main instead
of cc-main-migration-release.
This reverts commit c06c94c.

It seems the removal of `Index#postProcessor` by CNDB-11762 broke some
tests in CNDB's `MultiNodeBillingTest`. Unfortunately that patch [was
merged before creating the PR bumping the CC version used by
CNDB](#1422 (comment)).
[The CNDB PR](riptano/cndb#12076) was created
after that merging but it was superseded by other CC version bumps.

So I'm adding this reversal so we can investigate how the removal of
`Index#postProcessor` affects those tests.
This patch replaces null values of `deterministic`, `monotonic` and
`monotonic_on` columns in `system_schema.functions` and
`system_schema.aggregates` with negative defaults. These defaults will
be addressed if/once DB-672 gets ported to CC.
There are two mechanisms of detecting that the cluster is in the upgrade
state and the minimum version. Both are slightly different, and both are
not pluggable which means that CNDB doesn't work properly with them.

Those mechanisms are implemented in `Gossiper`. Although we do not use
`Gossiper` in CNDB, there are classes like `ColumnFilter` which go to
`Gossiper` to check the upgrade state.

So far, using that stuff in CDNB was a bit unpredictable, some of them
reported the cluster is upgraded and in the current version, the other
did not.

This turned out to be a problem, especially for the `ColumnFilter`
because when we upgrade DSE --> CC, CC assumes that the newest filter
version should be used, which is not correctly deserialized and
interpreted by DSE.

The fix is not small, but it probably simplifies stuff a bit.

First of all, two mechanism are merged into one. Moreover, we added
pluggability of it so that we can provide the appropriate implementation
in CNDB coordinators and writers, which is based on ETCD.
Part of riptano/cndb#12139

Moves constant shard count outside looping shards to reduce confusion.
…with DurationSpec type and 'native_transport_timeout_in_ms' as convertible old name with Long type; add some tests.
…MemtableIndexTest and TrieMemtableIndexAllocationsHeapBuffersTest from main branch.
…strictions (#1449)

Closes riptano/cndb#12139

This PR adds a test of row count of a SAI plan in the presence of
restrictions. Currently it tests queries with inequality, equality and
half-ranges on different SAI column types and with or without
histograms.
…g VIntOutOfRangeException to the catch block of SSTableIdentityIterator.exhaust method.
…pactionProgress to return the operation type from the first entry in the shared progress; needed in cases that a CompactionTask type is changed after creation.
…opriate (#1469)

Fixes riptano/cndb#12239

We found the System.nanoTime was using significant cpu cost, but because
the timeout is high enough, we can accept the inaccuracy.

- [ ] Make sure there is a PR in the CNDB project updating the Converged
Cassandra version
- [ ] Use `NoSpamLogger` for log lines that may appear frequently in the
logs
- [ ] Verify test results on Butler
- [ ] Test coverage for new/modified code is > 80%
- [ ] Proper code formatting
- [ ] Proper title for each commit staring with the project-issue
number, like CNDB-1234
- [ ] Each commit has a meaningful description
- [ ] Each commit is not very long and contains related changes
- [ ] Renames, moves and reformatting are in distinct commits
Fixes regression in jvector 3.0.4 when compacting PQVectors larger than 2GB
SimpleClientPerfTest has been failing in CI since changes from
CNDB-10759

This change in `SimpleClientPerfTest`, updates the anonymous class
`Message.Codec<QueryMessage>` to override the correct method, `public
CompletableFuture<Response> maybeExecuteAsync` from `QueryMessage`,
whose signature was changed as part of CNDB-10759.

- [ ] Make sure there is a PR in the CNDB project updating the Converged
Cassandra version
- [ ] Use `NoSpamLogger` for log lines that may appear frequently in the
logs
- [ ] Verify test results on Butler
- [ ] Test coverage for new/modified code is > 80%
- [ ] Proper code formatting
- [ ] Proper title for each commit staring with the project-issue
number, like CNDB-1234
- [ ] Each commit has a meaningful description
- [ ] Each commit is not very long and contains related changes
- [ ] Renames, moves and reformatting are in distinct commits
…ing for async batchlog removal (#1485)

The test asserts that the batchlog is removed immediately after the
write completes, but removal of the batchlog is async and can be
delayed, particularly in resource-limited environments like CI.
The test generates partition index accesses by reusing the same key, and if the key cache is enabled, the test will fail for bigtable profiles because the key will be in the key cache.
…by filtering queries (#1484)

Queries creating fake index contexts each create their own context,
which can then race on metric registration (as the metrics have the same
patterns). This can cause a query to fail. These metrics are superfluous, 
we can skip creating them entirely.
…ate.accumulatedDataSize; it only worked to fix SensorsWriteTest.testMultipleRowsMutationWithClusteringKey for SkipListMemtable and may not be necessary if the default memtable becomes TrieMemtable. Revisit SensorsWriteTest later if necessary.
Move static class TrieMemtable.Factory to TrieMemtableFactory class;
Use suggested TriePartitionUpdate.unsharedHeapSize implementation;
Use InMemoryTrie.shortLived in TrieToDotTest and TrieToMermaidTest;
Add specific versions aa, ca, and da to RowIndexTest;
Add addMemoryUsageTo in SkipListMemtable and TrieMemtable
Add TrieMemtable.switchOut
aymkhalil and others added 27 commits April 11, 2025 16:22
Resolves: riptano/cndb#13372

Currently min paxos backoff is to `0` which doesn't help there is
already contention.

Introduced a new property `cassandra.lwt_min_backoff_ms` which defautls
to 5ms (10x smaller that the default 50ms to give enough range and be
conservative relative to the previous 0 value, but still a magic number)
and fix the `contentionBackoffLatency` that accepts nanos
Fixes riptano/cndb#13618

CNDB-13299 prevents creating indexes with too long index names and adds
corresponding tests. Thus this test in another test class is not
needed.
…n parameters to index config (#1676)

Fixes: riptano/cndb#12937

This pull request upgrades jvector from **4.0.0-beta.1** to
**4.0.0-beta.2** and introduces three new configuration options to
influence graph construction:

1. `neighborhood_overflow`
2. `alpha`
3. `enable_hierarchy`

The defaults for these hyperparameters vary between in memory and on
disk, but when these are configured, they will be uniformly applied to
graphs built by a memtable and by compaction.

**Details**

- **jvector 4.0.0-beta.2**
- Minor changes to the graph index architecture, including some code now
under `...disk.feature...` packages.
- Removed the old `CachingGraphIndex` code in Cassandra, which is no
longer used.
- New constructor arguments for controlling `neighborhood_overflow`,
`alpha`, and hierarchical levels in the HNSW graph.

- **New configuration options**
1. **`neighborhood_overflow`**: A `float` >= 1.0 controlling how
aggressively the graph tries to insert extra neighbors on each HNSW
layer.
2. **`alpha`**: A `float` > 0 used in the neighbor selection phase for
HNSW.
3. **`enable_hierarchy`**: A `boolean` indicating whether HNSW should
allow multiple layers (true) or a single-layer approximate graph
(false). Defaults to false.

I manually verified that the jvector upgrade is backwards compatible,
meaning that when we build using the new jvector version, we can read
with an old jvector version, so I did not create a new SAI on disk file
format version.
…ate cdc_enabled key in base yaml (#1680)

### What is the issue
#1563 upgraded our version of snakeyaml. This new version logs when
duplicate keys are found. The test-cdc target concatenates a
CDC-specific yaml to the base yaml. Both contain entries for the
cdc_enabled key, causing tests that load this yaml to log the warning.
This causes CDC CI failures.

### What does this PR fix and why was it fixed
This PR removes the cdc_enabled entry in the base yaml, as it's the same
as the default. This is consistent with how we handle other default keys
that might be duplicated across multiple yaml fragments in the test
commands.
### What is the issue
riptano/cndb#13583

### What does this PR fix and why was it fixed
This PR adds comprehensive metrics for Storage Attached Indexes (SAI)
vector search operations, providing crucial insights into both ANN
(Approximate Nearest Neighbor) graph searches and brute force
operations.

New Vector Search Metrics:

Search Operation Counters:
- `ANNNodesVisited`: Total number of nodes visited during ANN searches
(this is equivalent to approximate similarity score computations)
- `ANNNodesReranked`: Number of nodes that underwent exact distance
computation for reranking (this is equivalent to exact similarity score
computations)
- `ANNNodesExpanded`: Total number of nodes whose edges were explored
- `ANNNodesExpandedBaseLayer`: Number of nodes expanded in the base
layer of the graph
- `ANNGraphSearches`: Count of new graph searches initiated
- `ANNGraphResumes`: Count of resumed graph searches (when a search
continues from previous results)
- `ANNGraphSearchLatency`: Timer measuring individual graph search
latency (Note: This measures per-graph search time, not total query time
which may involve multiple graphs)

Brute Force Operation Counters:
- `BruteForceNodesVisited`: Number of nodes visited during brute force
searches (approximate similarity comparisons)
- `BruteForceNodesReranked`: Number of nodes that underwent exact
similarity computation during brute force searches

Memory Usage Tracking:
- `quantizationMemoryBytes`: Current memory usage by the quantization
(PQ or BQ) data structures
- `ordinalsMapMemoryBytes`: Current memory usage by ordinals mapping
structures (only matters in some cases)
- `onDiskGraphsCount`: Number of currently loaded graph segments
- `onDiskGraphVectorsCount`: Total number of vectors in currently loaded
graphs

These metrics will help us:
1. Understand if we are performing more comparisons than expected
2. Get insight into number of graphs queried
3. Get insight into the brute force vs graph query path 
4. Understand current memory utilization

The counters provide operations/second metrics, allowing calculation of
per-query averages by dividing by the number of queries. The memory
tracking metrics help monitor resource usage across graph segments as
they are loaded and unloaded.
riptano/cndb#12937. This is a follow up on
#1676, which updated the
jvector dependency but didn't update the jvector version in the actual
graph builder.

* Adds jvector version configuration, defaulted to 4
* Tests jvector versions 2 and 4 to ensure we can query sstables with
both kinds of graphs
* Makes the `enable_hierarchy` config backwards compatible by ignoring
it if the jvector version is too low
* Removes the ability to write FusedADC since it does not work with the
latest jvector version (discovered when writing compatibility tests)
* Does not remove FusedADC read path since there could be indexes based
on it in production
* Replaces call to deprecated `graph.size()` method with `graph.size(0)`
)

### What is the issue
ChunkCache needs to invalidate the test file with invalidateFileNow
### What is the issue
The `main-5.0` branch uses `TrieMemtable` by default and the
`test/conf/cassandra.yaml` configures `ByteOrderedPartitioner`.

In this environment, `CQLSSTableWriterDaemonTest` tests
`testSkipBuildingIndexesWithSAI` and `testWriteWithSAI` are failing.

### What does this PR fix and why was it fixed
This fix moves the SAI specific tests from `CQLSSTableWriterTest` to a
new `CQLSSTableWriterSAITest` so that `Murmur3Partitioner` can be
explicitly set for the SAI specific tests.
…ngWIthAggregateTest to use paging in bytes for CC. (#1702)

### What is the issue
With the rebase of CC on C* 5.0.2, there is a new test
CoordinatorReadLatencyMetricTest and
CoordinatorReadLatencyMetricTest.internalPagingWithAggregateTest is
failing.

### What does this PR fix and why was it fixed
Uses paging in bytes instead of paging in rows in
internalPagingWithAggregateTest.
Removes unnecessary tiebreaker column, since the word counts give
different scores for text with the same term count. One value is
improved to remove possibility for flakiness in the order.

Adds comments with the term and word counts to help verifying the
expected results. A helper function to count them is added.

Also adds a small tests for analyzer with lowercase.
#1691)

Executing a query with BM25 search and a condition on partial SSTable
results in empty iterator access error. And there was no test with
storing data in segments.

The PR implements BM25 search tests with splitting data into two tables.
This reproduced this bug, CNDB-13696, and demonstrates current confusion
on the BM25 ordering result to be fixed by CNDB-13553.

This PR adds a check for empty iterator created for a PK belonging to
another segment. This fixes the bug of trying to get the first element
of an empty iterator.
…nTest (#1664)

### What is the issue
`PendingAntiCompactionTest.testRetriesTimeout` is flaky on `main`
failing with a timeout reported in the junit test task.
riptano/cndb#13524

### What does this PR fix and why was it fixed
`PendingAntiCompactionTest.testRetriesTimeout`, and a couple other
tests, are calling `Future.get` without any timeout. This change adds a
30 seconds timeout to prevent an indefinite wait for the future and a
log in the failing test rather than in the junit test task.
### What is the issue
This is a follow up test for
riptano/cndb#13696.

### What does this PR fix and why was it fixed
The fix was already merged, but this confirms that without the fix,
collections are affected by the NoSuchElementException bug.
### What is the issue
Didn't create an issue as this is just a new test.

### What does this PR fix and why was it fixed
We implicitly rely on this and
are exposing this as a feature
to our end users, so we want
to confirm that the stop words
are the expected stop words.
Fixes: riptano/cndb#13718

We had adopted this version of lucene
datastax/lucene@5ea8bb4
in order to support our custom modifications of HNSW on top of lucene.
We now use https://github.com/datastax/jvector for vector search and no
longer need a custom build.

I propose we use 9.8.0 since that is the closest release to the one we
have been using.

CNDB test pr: riptano/cndb#13757
Copy link

github-actions bot commented May 9, 2025

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

Comment on lines +345 to +352
boolean shouldUpdateChecksum = true;
if (buffer.remaining() < Long.BYTES)
{
// Avoid double-counting bytes in the checksum - in this path, super.writeMostSignificantBytes will call
// it's super, DataOutputPlus.writeMostSignificantBytes, and that laeds to the writeXXX methods which
// are overridden in this class will update the checksum.
shouldUpdateChecksum = false;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor suggestion: you can simplify this to:

boolean shouldUpdateChecksum = (buffer.remaining() >= Long.BYTES);

Comment on lines +348 to +350
// Avoid double-counting bytes in the checksum - in this path, super.writeMostSignificantBytes will call
// it's super, DataOutputPlus.writeMostSignificantBytes, and that laeds to the writeXXX methods which
// are overridden in this class will update the checksum.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the remaining buffer is too small to hold a single long, then super.writeMostSignificantBytes updates the checksum but otherwise it does not? If it calls to writeXXX shouldn't it always update the checksum? This existence of two paths looks very suspicious to me ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.