Fix checkpoint handling to prevent segment replication infinite loop #18636

Open. ashking94 wants to merge 2 commits into base: main

ashking94 (Member) commented Jun 27, 2025

Description

This fixes an issue in the segment replication (segrep) flow that causes an infinite segrep loop. The analysis is in issue #18605.

Related Issues

Resolves #18605

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

✅ Gradle check result for ff5a898: SUCCESS

codecov bot commented Jun 27, 2025

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.84%. Comparing base (b1d6f55) to head (ed9e60a).
Report is 15 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/replication/SegmentReplicationTargetService.java | 33.33% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18636      +/-   ##
============================================
+ Coverage     72.79%   72.84%   +0.05%     
- Complexity    68362    68466     +104     
============================================
  Files          5563     5563              
  Lines        314174   314176       +2     
  Branches      45554    45555       +1     
============================================
+ Hits         228703   228871     +168     
+ Misses        66887    66764     -123     
+ Partials      18584    18541      -43     

Contributor

✅ Gradle check result for 27e3218: SUCCESS

ashking94 (Member, Author) commented Jul 2, 2025

I have not been able to reproduce the issue in integration tests (ITs). I have tried, and I will try to understand why it does not happen locally.

@ashking94 ashking94 marked this pull request as ready for review July 2, 2025 05:12
@ashking94 ashking94 requested a review from a team as a code owner July 2, 2025 05:12
ashking94 (Member, Author) commented

> I have not been able to reproduce the issue in integration tests (ITs). I have tried, and I will try to understand why it does not happen locally.

I was able to modify a unit test so that it consistently shows the cyclic segrep loop.

REPRODUCE WITH: ./gradlew 'null' --tests 'org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testStartReplicationListenerSuccess' -Dtests.seed=B17175E6A29DDE01 -Dtests.locale=bg-Cyrl-BG -Dtests.timezone=Europe/Vatican -Druntime.java=21

org.mockito.exceptions.verification.TooManyActualInvocations: 
segmentReplicationTargetService.processLatestReceivedCheckpoint(
    <any>,
    <any>
);
Wanted 1 time:
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testStartReplicationListenerSuccess(SegmentReplicationTargetServiceTests.java:628)
But was 27 times:
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.afterIndexShardStarted(SegmentReplicationTargetService.java:235)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
-> at org.opensearch.indices.replication.SegmentReplicationTargetService$1.onReplicationDone(SegmentReplicationTargetService.java:351)
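
The trace above shows `processLatestReceivedCheckpoint` being reached again and again from `onReplicationDone`. As a rough illustration of how such a cycle can form, here is a minimal sketch. It is not the OpenSearch code and not necessarily the fix in this PR: the class, the fields, the version comparison, and the demo-only call cap are all hypothetical, and only the two method names are borrowed from the trace.

```java
/** Hypothetical model of a checkpoint-driven replication loop; these are not OpenSearch classes. */
class CheckpointLoopSketch {
    // Demo-only cap so the unguarded variant terminates instead of recursing forever.
    static final int MAX_CALLS = 27;

    private final boolean guardEnabled;       // assumed guard: skip when nothing new was received
    private final long latestReceivedVersion; // simplified stand-in for the latest received checkpoint
    private long localVersion = 1;            // simplified stand-in for the replica's own checkpoint
    int processCalls = 0;                     // how often the processing method was entered

    CheckpointLoopSketch(boolean guardEnabled, long latestReceivedVersion) {
        this.guardEnabled = guardEnabled;
        this.latestReceivedVersion = latestReceivedVersion;
    }

    void processLatestReceivedCheckpoint() {
        processCalls++;
        if (processCalls >= MAX_CALLS) {
            return; // stop the demo; without this cap the unguarded variant never ends
        }
        if (guardEnabled && latestReceivedVersion <= localVersion) {
            return; // the replica has already caught up, so there is nothing to replicate
        }
        startReplication();
    }

    private void startReplication() {
        // Pretend the segment copy succeeded and the replica caught up.
        localVersion = Math.max(localVersion, latestReceivedVersion);
        onReplicationDone();
    }

    private void onReplicationDone() {
        // The completion callback re-checks the latest received checkpoint,
        // which is what closes the cycle when no guard is in place.
        processLatestReceivedCheckpoint();
    }

    public static void main(String[] args) {
        CheckpointLoopSketch unguarded = new CheckpointLoopSketch(false, 2L);
        unguarded.processLatestReceivedCheckpoint();
        System.out.println("unguarded calls: " + unguarded.processCalls);

        CheckpointLoopSketch guarded = new CheckpointLoopSketch(true, 2L);
        guarded.processLatestReceivedCheckpoint();
        System.out.println("guarded calls: " + guarded.processCalls);
    }
}
```

In this model the unguarded variant keeps re-entering until the artificial cap is hit, while the guarded variant stops as soon as the local version has caught up with the received one. Whether the real service loops for this reason is covered by the analysis in #18605, not by this sketch.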

github-actions bot (Contributor) commented Jul 2, 2025

✅ Gradle check result for ed9e60a: SUCCESS

@github-project-automation github-project-automation bot moved this to 👀 In review in Storage Project Board Jul 2, 2025
ashking94 (Member, Author) commented

@mch2 I have made the above change, which should fix the recursion loop. Please review when you can.

Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

[BUG] Infinite loop of S3 GET/LIST requests during segment replication recovery after node loss/restart
4 participants