-
Notifications
You must be signed in to change notification settings - Fork 611
Fix ignored broker failure anomalies when self healing is disabled #2270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ignored broker failure anomalies when self healing is disabled #2270
Conversation
LIKAFKA-61583 Handle broker failures during self healing disabled state Intermediate debugging add new logs Working fix Remove debug logging Remove temporary debug logs Rename variables and fix CI checks
Is this some output of existing endpoint or in the logs? This looks very helpful and I don't recall I've seen such information before |
@CCisGG
|
@bsandeep23 note that ccdo is an internal command. So I'm guessing this is coming from the state endpoint output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. I left some questions.
...-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java
Show resolved
Hide resolved
...-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java
Show resolved
Hide resolved
...-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java
Show resolved
Hide resolved
if (brokerFailures.brokerFailureCheckWithDelayRetryCount() <= _brokerFailureCheckWithDelayMaxRetryCount) { | ||
// This means that we can retry for checking with delay | ||
if (hasNewFailureToAlert(brokerFailures, autoFixTriggered)) { | ||
alert(brokerFailures, autoFixTriggered, selfHealingTimeMs, KafkaAnomalyType.BROKER_FAILURE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we are alerting on each retries. Would this cause noisy alerts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasNewFailureToAlert
It is doing this check before alerting, so I am assuming this will alert only if a new broker fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Let's confirm it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed from cc logs.
Disabled self healing, induced a broker failure and see that it is getting checked multiple times but the log in the method alert
is being printed only once. code #link
LOG.warn("{} detected {}. Self healing {}.", anomalyType, anomaly,
_selfHealingEnabled.get(anomalyType) ? String.format("start time %s", utcDateFor(selfHealingStartTime)) : "is disabled");
$ grep "BROKER_FAILURE detected" logs/likafka-cruise-control.log
2025/04/25 13:28:10.135 WARN [SelfHealingNotifier] [AnomalyDetector-2] [kafka-cruise-control] [] BROKER_FAILURE detected {Fixable broker failures detected: {Broker 6228 failed at 2025-04-25T13:13:10Z}}. Self healing is disabled.
app@ltx1-app10469 [ /export/content/lid/apps/likafka-cruise-control/204bdc6796e7b6ce9b55e40ee13d8772a799cf2b ]$
$ ./parse.sh
anomalyId brokerId status detectionTime statusUpdateTime
b4494501-950d-4e16-87dd-fc432a69cacd 6228 CHECK_WITH_DELAY 2025-04-25 13:13:10 2025-04-25 13:13:10
7a511679-7112-478d-bebd-83f8fc818e7e 6228 CHECK_WITH_DELAY 2025-04-25 13:28:10 2025-04-25 13:28:10
b2c34eef-8f4a-42a6-a485-d8427bc98a5a 6228 CHECK_WITH_DELAY 2025-04-25 13:43:10 2025-04-25 13:43:10
cb033f75-af05-471a-bf97-ebcf1ea3b3db 6228 CHECK_WITH_DELAY 2025-04-25 13:53:10 2025-04-25 13:53:10
sboddu-mn1:~ sboddu$
Yes it is coming from the state endpoint output. |
Summary
Why:
When self healing is disabled and a broker failure anomaly reaches the auto fix threshold, today's behavior is the anomaly gets ignored.
In cases where self healing is permanently disabled this is accepted. However in cases where self healing is disabled temporarily, this behavior is not ideal as the under replicated partitions from this broker failure gets fixed only with a human intervention or subsequent self heal for another anomaly.
What:
When the broker failure anomaly reaches the auto fix threshold and self healing is disabled, instead of ignoring the anomaly, recheck the anomaly after certain delay in the hope that self healing gets re-enabled back and a fix can be triggered. Introduce a configurable max retry count so that the anomaly recheck is not done forever in cases where self healing is permanently disabled. After the tries exceed the max allowed, the anomaly can be ignored. In addition introduce a config which gives the delay after which subsequent check is executed.
Expected Behavior
Anomaly should be checked after certain time when self healing is disabled.
Actual Behavior
Anomaly is getting ignored
Steps to Reproduce
Additional evidence
Testing Logs:
Scenario 1:
Scenario 2:
Categorization
This PR resolves #2269.