Handling of broker failure anomalies ignored due to temporary self heal disable #2269

Open
bsandeep23 opened this issue Apr 22, 2025 · 0 comments

Contributor

bsandeep23 commented Apr 22, 2025

Issue

When self-healing is disabled and a broker failure anomaly reaches the auto-fix threshold, the anomaly is currently ignored.
If self-healing is permanently disabled, this is acceptable. If it is only temporarily disabled, however, this behavior is not ideal: the under-replicated partitions caused by the broker failure are fixed only through human intervention or by a subsequent self-heal triggered for another anomaly.

Proposed Approach

When a broker failure anomaly reaches the auto-fix threshold while self-healing is disabled, do not ignore it. Instead, recheck the anomaly after a certain time period, in the hope that self-healing has been re-enabled by then and a fix can be triggered.

  • Introduce a configurable max retry count so that the anomaly is not rechecked forever when self-healing is permanently disabled. Once the number of rechecks exceeds this maximum, the anomaly can be ignored.
  • Introduce a recheck interval config.
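
The proposed decision flow could be sketched roughly as follows. This is only an illustration of the idea, not the project's actual code: the class `BrokerFailureRecheckSketch`, the `Action` enum, and the `maxRechecks` and `recheckInterval` parameters are all hypothetical names standing in for the two proposed configs, and one instance is assumed to track a single anomaly.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of the proposed recheck logic for one broker failure anomaly.
 * None of these names come from the project; they only illustrate the proposal.
 */
final class BrokerFailureRecheckSketch {

    /** What to do with an anomaly that has reached the auto-fix threshold. */
    enum Action { FIX, RECHECK, IGNORE }

    /** The chosen action plus the delay before the next recheck (zero unless RECHECK). */
    static final class Decision {
        final Action action;
        final Duration delay;

        Decision(Action action, Duration delay) {
            this.action = action;
            this.delay = delay;
        }

        @Override
        public String toString() {
            return action + " (delay=" + delay + ")";
        }
    }

    private final int maxRechecks;          // assumed config: max retry count
    private final Duration recheckInterval; // assumed config: recheck interval
    private int rechecksDone = 0;           // rechecks already scheduled for this anomaly

    BrokerFailureRecheckSketch(int maxRechecks, Duration recheckInterval) {
        this.maxRechecks = maxRechecks;
        this.recheckInterval = recheckInterval;
    }

    /** Called each time the anomaly is evaluated at or past the auto-fix threshold. */
    Decision onAutoFixThresholdReached(boolean selfHealingEnabled) {
        if (selfHealingEnabled) {
            // Self-healing is (back) on, so the fix can be triggered right away.
            return new Decision(Action.FIX, Duration.ZERO);
        }
        if (rechecksDone < maxRechecks) {
            // Self-healing is off: schedule another look instead of dropping the anomaly.
            rechecksDone++;
            return new Decision(Action.RECHECK, recheckInterval);
        }
        // Retry budget exhausted: fall back to today's behavior and ignore the anomaly.
        return new Decision(Action.IGNORE, Duration.ZERO);
    }

    public static void main(String[] args) {
        BrokerFailureRecheckSketch sketch =
            new BrokerFailureRecheckSketch(2, Duration.ofMinutes(5));
        System.out.println(sketch.onAutoFixThresholdReached(false));
        System.out.println(sketch.onAutoFixThresholdReached(true));
    }
}
```

With a max retry count of 2, an anomaly seen while self-healing stays disabled would be rechecked twice and ignored on the third evaluation. If self-healing is re-enabled before the budget runs out, the next evaluation triggers the fix.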