[Tiered Cache] Using a single cache manager for all ehcache disk caches #17513

Merged
merged 13 commits into opensearch-project:main on Apr 9, 2025

Conversation

Contributor

@sgup432 sgup432 commented Mar 4, 2025

Description

Previously, when creating N ehcache disk caches, we created them via N cache managers, each with its own disk write thread pool (so N pools in total). We create N disk caches based on a tiered cache setting, where N is derived from the number of CPU cores. So effectively we were creating (CPU_CORES * 4) disk write threads, which is a lot and can cause CPU spikes with the tiered cache enabled.

This change creates a single cache manager, and all subsequent caches are created via that one manager. As a result there is only one disk write thread pool, configured to scale between 2 and (CPU_CORES * 1.5) threads.
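As an illustration of the pattern, here is a minimal sketch using the public Ehcache 3 builder API (not the PR's actual code; the pool alias, path, sizes, and key/value types are made up): one PersistentCacheManager is built with a shared pooled execution service, and every disk cache is created through it, routing disk writes to the shared pool via its thread pool alias.

```java
import java.io.File;

import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.PooledExecutionServiceConfigurationBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;
import org.ehcache.impl.config.store.disk.OffHeapDiskStoreConfiguration;

public class SharedDiskCacheManagerSketch {

    private static final String DISK_POOL = "shared-disk-write-pool"; // illustrative alias

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // One shared writer pool for ALL disk caches: 2 .. CPU * 1.5 threads,
        // instead of a separate 4-thread pool per disk cache as before.
        PersistentCacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
            .with(CacheManagerBuilder.persistence(new File("/tmp/ehcache-shared"))) // illustrative path
            .using(PooledExecutionServiceConfigurationBuilder
                .newPooledExecutionServiceConfigurationBuilder()
                .defaultPool(DISK_POOL, 2, Math.max(2, (cores * 3) / 2))
                .build())
            .build(true);

        // All N disk caches come from the single manager; each routes its
        // disk writes to the shared pool via the thread pool alias.
        for (int i = 0; i < cores; i++) {
            manager.createCache("disk-cache-" + i,
                CacheConfigurationBuilder
                    .newCacheConfigurationBuilder(String.class, byte[].class,
                        ResourcePoolsBuilder.newResourcePoolsBuilder().disk(64, MemoryUnit.MB, true))
                    .withService(new OffHeapDiskStoreConfiguration(DISK_POOL, 1)));
        }

        manager.close();
    }
}
```

The writer-thread bound now lives on the single shared pool rather than on each of the N caches, so it can be tuned without changing the segment count.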

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.
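(Sign-off is typically added with git commit -s, which appends the Signed-off-by trailer that appears on the commits in this PR.)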

@sgup432 sgup432 changed the title Using a single cache manager for all ehcache disk caches [Tiered Cache] Using a single cache manager for all ehcache disk caches Mar 4, 2025
sgup432 and others added 2 commits March 4, 2025 13:54
Contributor

github-actions bot commented Mar 4, 2025

❌ Gradle check result for 69bbc69: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jainankitk
Contributor

So essentially we were creating (CPU_CORE * 4) disk write threads which is a lot and can cause CPU spikes with tiered cache enabled.

I am really curious whether we have observed any cases via hot_threads / flamegraph that confirm disk write threads being responsible for CPU spikes. These threads should be I/O bound, and I wouldn't really expect them to cause an observable CPU spike.

This change essentially creates a single cache manager, and all subsequent caches are created via this single manager. Through this we only have one disk write thread pool and is configured to have 2 threads by default and can go up until max(2, CPU_CORES / 8)

The default of 2 looks really low to me. Assuming the TieredCache is written to upon successful completion of every SearchRequest, and these disk write threads are blocking (they don't pick up the next disk write until the previous one is finished), shouldn't the number of threads be at least the number of search threads?
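For scale: OpenSearch's search thread pool defaults to ((3 × allocated_processors) / 2) + 1 threads, e.g. 25 on a 16-core node, while the quoted default would allow only max(2, 16 / 8) = 2 disk write threads.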

@sgup432
Contributor Author

sgup432 commented Mar 20, 2025

I am really curious if we have observed any cases via hot_threads / flamegraph that confirms disk write threads being responsible for CPU spikes. These threads should be I/O bound, and I won't really expect them to cause observable CPU spike.

Not yet. We don't have a performance test that can reproduce this scenario. We ran our OSB benchmark with and without the change, and both were pretty similar in terms of performance (latency p50, p90, etc.).

The default of 2 looks really low to me. Assuming TieredCache is being written to upon successful completion of every SearchRequest and these disk write threads are blocking (don't pickup next disk write until previous is finished), shouldn't the number of threads be atleast the number of search threads?

We can discuss the default and increase it further. But the main objective of this change is to have a way to increase/decrease the number of disk write threads when needed, irrespective of the number of N partitions we create within the tiered cache. Right now, each disk cache object has its own write thread pool, so when we create N (CPU * 1.5) segments/disk cache objects, we are essentially creating (N * 4) disk write threads, which seems unnecessary and can cause unknown problems; it is also not possible to configure this down to <= (CPU * 1.5).
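As a concrete example, on a 16-core host the old scheme gives N = 16 × 1.5 = 24 disk cache segments × 4 writer threads each = 96 disk write threads; with the single shared manager, the one pool is bounded at 16 × 1.5 = 24 threads, and that bound can now be tuned independently of the segment count.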

sgup432 added 4 commits April 3, 2025 14:17
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Contributor

github-actions bot commented Apr 3, 2025

❌ Gradle check result for 112a96b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Apr 4, 2025

❌ Gradle check result for cc0fea8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Apr 4, 2025

✅ Gradle check result for fafae93: SUCCESS


codecov bot commented Apr 4, 2025

Codecov Report

Attention: Patch coverage is 85.84906% with 15 lines in your changes missing coverage. Please review.

Project coverage is 72.41%. Comparing base (2d42bd0) to head (13b2e85).
Report is 35 commits behind head on main.

Files with missing lines                                    Patch %   Lines
...arch/cache/store/disk/EhcacheDiskCacheManager.java       82.66%    10 Missing and 3 partials ⚠️
.../opensearch/cache/store/disk/EhcacheDiskCache.java      92.30%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17513      +/-   ##
============================================
+ Coverage     72.39%   72.41%   +0.01%     
- Complexity    66066    66235     +169     
============================================
  Files          5358     5385      +27     
  Lines        306500   307200     +700     
  Branches      44409    44560     +151     
============================================
+ Hits         221888   222447     +559     
- Misses        66474    66543      +69     
- Partials      18138    18210      +72     

☔ View full report in Codecov by Sentry.

Signed-off-by: Sagar Upadhyaya <[email protected]>
Contributor

github-actions bot commented Apr 8, 2025

❕ Gradle check result for 13b2e85: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@jainankitk jainankitk added the backport 2.x Backport to 2.x branch label Apr 9, 2025
@jainankitk jainankitk merged commit 58eb44e into opensearch-project:main Apr 9, 2025
35 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-17513-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 58eb44e7ece913aca6de34d32f6b837a512541ae
# Push it to GitHub
git push --set-upstream origin backport/backport-17513-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-17513-to-2.x.

rgsriram pushed a commit to rgsriram/OpenSearch that referenced this pull request Apr 15, 2025
[Tiered Cache] Using a single cache manager for all ehcache disk caches (opensearch-project#17513)

* Using a single cache manager for all ehcache disk caches

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Added changelog

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Fixing cache manager UT

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Addressing comments

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Removing commented out code

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Adding changelog

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Changes to perform mutable changes for cache manager under a lock

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Changes to fix UT

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Addressing minor comments

Signed-off-by: Sagar Upadhyaya <[email protected]>

---------

Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Signed-off-by: Sriram Ganesh <[email protected]>
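One of the commits above, "Changes to perform mutable changes for cache manager under a lock", points at serializing mutations on the now-shared manager. A minimal sketch of that idea, with a hypothetical wrapper and names (not the PR's code):

```java
import java.util.concurrent.locks.ReentrantLock;

import org.ehcache.Cache;
import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;

// Hypothetical wrapper: all mutations of the shared manager go through one lock,
// since many tiered-cache segments may create/remove disk caches concurrently.
public final class GuardedCacheManager {

    private final PersistentCacheManager manager;
    private final ReentrantLock lock = new ReentrantLock();

    public GuardedCacheManager(PersistentCacheManager manager) {
        this.manager = manager;
    }

    public <K, V> Cache<K, V> createCache(String alias, CacheConfigurationBuilder<K, V> config) {
        lock.lock();
        try {
            // Mutating the shared manager is serialized under the lock.
            return manager.createCache(alias, config);
        } finally {
            lock.unlock();
        }
    }

    public void removeCache(String alias) {
        lock.lock();
        try {
            manager.removeCache(alias);
        } finally {
            lock.unlock();
        }
    }
}
```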
Harsh-87 pushed a commit to Harsh-87/OpenSearch that referenced this pull request May 7, 2025
[Tiered Cache] Using a single cache manager for all ehcache disk caches (opensearch-project#17513)

* Using a single cache manager for all ehcache disk caches

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Added changelog

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Fixing cache manager UT

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Addressing comments

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Removing commented out code

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Adding changelog

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Changes to perform mutable changes for cache manager under a lock

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Changes to fix UT

Signed-off-by: Sagar Upadhyaya <[email protected]>

* Addressing minor comments

Signed-off-by: Sagar Upadhyaya <[email protected]>

---------

Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Signed-off-by: Harsh Kothari <[email protected]>
Labels
backport 2.x (Backport to 2.x branch), backport-failed