Fix ignored broker failure anomalies when self healing is disabled #2270


Merged
merged 8 commits into linkedin:migrate_to_kafka_3_0 on Apr 28, 2025

Conversation

bsandeep23
Contributor

@bsandeep23 bsandeep23 commented Apr 22, 2025

Summary

Why:

When self healing is disabled and a broker failure anomaly reaches the auto-fix threshold, the current behavior is to ignore the anomaly.
This is acceptable when self healing is permanently disabled. However, when self healing is only temporarily disabled, this behavior is not ideal: the under-replicated partitions caused by the broker failure get fixed only through human intervention or through a subsequent self-heal triggered by another anomaly.

What:

When a broker failure anomaly reaches the auto-fix threshold while self healing is disabled, instead of ignoring the anomaly, recheck it after a certain delay, in the hope that self healing gets re-enabled and a fix can be triggered. Introduce a configurable max retry count so that the recheck does not repeat forever when self healing is permanently disabled; once the retries exceed the maximum, the anomaly can be ignored. In addition, introduce a config that controls the delay between successive checks.
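The recheck-with-delay decision described above can be sketched roughly as follows. This is a minimal illustration only; the class, method, and config names here are hypothetical, not the actual Cruise Control identifiers.

```java
// Illustrative sketch of the recheck-with-delay decision, NOT the actual
// Cruise Control code. All names here are hypothetical.
final class BrokerFailureRecheckPolicy {
  private final int _maxRetryCount;      // hypothetical stand-in for the new max-retry config
  private final long _recheckIntervalMs; // hypothetical stand-in for the new recheck-delay config

  BrokerFailureRecheckPolicy(int maxRetryCount, long recheckIntervalMs) {
    _maxRetryCount = maxRetryCount;
    _recheckIntervalMs = recheckIntervalMs;
  }

  enum Action { FIX, CHECK_WITH_DELAY, IGNORE }

  /**
   * Decide what to do once the anomaly has reached the auto-fix threshold.
   * If self healing is disabled, re-check later in case it gets re-enabled,
   * but give up (IGNORE) after the configured number of retries.
   */
  Action decide(boolean selfHealingEnabled, int retryCount) {
    if (selfHealingEnabled) {
      return Action.FIX;
    }
    return retryCount <= _maxRetryCount ? Action.CHECK_WITH_DELAY : Action.IGNORE;
  }

  long recheckIntervalMs() {
    return _recheckIntervalMs;
  }
}
```

With a max retry count of 3 and a 2-minute interval (the values used in the testing below), this yields CHECK_WITH_DELAY until the retries are exhausted, then IGNORE, matching the anomaly timelines in the evidence.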

Expected Behavior

The anomaly should be rechecked after a certain delay when self healing is disabled.

Actual Behavior

The anomaly is ignored.

Steps to Reproduce

  1. Disable self healing.
  2. Trigger a broker failure.
  3. Wait until the broker failure anomaly reaches the auto fix threshold.
  4. Re-enable self healing.
  5. The broker failure anomaly is not fixed.

Additional evidence

Testing Logs:

Scenario 1:

  1. Self healing is disabled.
  2. Broker 1468 fails.
  3. The broker failure anomaly reaches the auto-fix threshold while self healing is still disabled; after the retries are exhausted, the anomaly should be ignored.
  4. Note: tested with a max retry count of 3 and a recheck interval of 2 minutes. The first CHECK_WITH_DELAY entry covers the case where the auto-fix threshold has not yet been reached.
# Self heal disabled

USER TASK ID                          CLIENT ADDRESS                   START TIME            STATUS      REQUEST URL
9cf37b50-4dc9-4407-affb-dfd91aa95a0a  <client_ip> 2025-04-21T19:14:04Z  Completed   POST /kafkacruisecontrol/admin?disable_self_healing_for=broker_failure

# Broker anomaly and status timeline

anomalyId       brokerId        status  detectionTime   statusUpdateTime
017e6f7e-13d5-407c-aaba-dfcae46bbe85    1468    CHECK_WITH_DELAY        2025-04-21 19:17:20     2025-04-21 19:17:20
94d79b77-993b-4ccb-b67c-eedf6073a47b    1468    CHECK_WITH_DELAY        2025-04-21 19:19:20     2025-04-21 19:19:20
2c7fe743-1792-4c0e-8b75-04e2acecad5b    1468    CHECK_WITH_DELAY        2025-04-21 19:21:20     2025-04-21 19:21:20
4f4b3cc7-2ee4-48d8-bee5-f32a3a19cc05    1468    CHECK_WITH_DELAY        2025-04-21 19:23:20     2025-04-21 19:23:20
5aab73a2-ae6d-4865-8f34-73a8ec1ba75c    1468    IGNORED 2025-04-21 19:25:20     2025-04-21 19:25:20

Scenario 2:

  1. Self healing is disabled.
  2. Broker 4600 fails.
  3. The broker failure anomaly reaches the auto-fix threshold.
  4. After 2 recheck retries, self healing is re-enabled. A fix should be started.
  5. Note: tested with a max retry count of 3 and a recheck interval of 2 minutes.
# Self heal timeline

1b58c1f3-5dfb-45b0-9958-529d3418ea60 <client_ip>                  2025-04-21T20:00:44Z  Completed   POST /kafkacruisecontrol/admin?disable_self_healing_for=broker_failure
9a9be753-7f13-431e-9093-e04d719e7979  <client_ip>  2025-04-21T20:13:07Z  Completed   POST /kafkacruisecontrol/admin?enable_self_healing_for=broker_failure

# Broker anomaly and status timeline

10331511-3ae1-4028-a487-20fce5175458    4600    CHECK_WITH_DELAY        2025-04-21 20:08:43     2025-04-21 20:08:43
e6ead586-cab6-4362-883b-fc33664470a7    4600    CHECK_WITH_DELAY        2025-04-21 20:10:43     2025-04-21 20:10:43
7283326e-b7b2-486d-85b4-b55decc4e228    4600    CHECK_WITH_DELAY        2025-04-21 20:12:43     2025-04-21 20:12:43
ab4e40b2-6df1-4cf0-a6a8-444d340d75d7    4600    FIX_STARTED     2025-04-21 20:14:43     2025-04-21 20:14:45

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

This PR resolves #2269.

Sandeep Boddu and others added 6 commits April 22, 2025 09:02
LIKAFKA-61583 Handle broker failures during self healing disabled state

Intermediate debugging

add new logs

Working fix

Remove debug logging

Remove temporary debug logs

Rename variables and fix CI checks
@bsandeep23 bsandeep23 changed the title Fix lost anomalies Fix ignored broker failure anomalies when self healing is disabled Apr 22, 2025
@bsandeep23 bsandeep23 marked this pull request as ready for review April 22, 2025 19:04
@CCisGG
Contributor

CCisGG commented Apr 22, 2025

10331511-3ae1-4028-a487-20fce5175458 4600 CHECK_WITH_DELAY 2025-04-21 20:08:43 2025-04-21 20:08:43
e6ead586-cab6-4362-883b-fc33664470a7 4600 CHECK_WITH_DELAY 2025-04-21 20:10:43 2025-04-21 20:10:43
7283326e-b7b2-486d-85b4-b55decc4e228 4600 CHECK_WITH_DELAY 2025-04-21 20:12:43 2025-04-21 20:12:43
ab4e40b2-6df1-4cf0-a6a8-444d340d75d7 4600 FIX_STARTED 2025-04-21 20:14:43 2025-04-21 20:14:45

Is this output from an existing endpoint, or from the logs? This looks very helpful, and I don't recall having seen such information before.

@bsandeep23
Contributor Author

10331511-3ae1-4028-a487-20fce5175458 4600 CHECK_WITH_DELAY 2025-04-21 20:08:43 2025-04-21 20:08:43
e6ead586-cab6-4362-883b-fc33664470a7 4600 CHECK_WITH_DELAY 2025-04-21 20:10:43 2025-04-21 20:10:43
7283326e-b7b2-486d-85b4-b55decc4e228 4600 CHECK_WITH_DELAY 2025-04-21 20:12:43 2025-04-21 20:12:43
ab4e40b2-6df1-4cf0-a6a8-444d340d75d7 4600 FIX_STARTED 2025-04-21 20:14:43 2025-04-21 20:14:45

Is this output from an existing endpoint, or from the logs? This looks very helpful, and I don't recall having seen such information before.

@CCisGG
Yeah, I used the ccdo state command with the anomaly_detector substate and parsed the JSON to present it this way.

#!/bin/bash
fabric=$1
cluster_tag=$2
(echo -e "anomalyId\tbrokerId\tstatus\tdetectionTime\tstatusUpdateTime"; \
 ccdo -f "$fabric" -t "$cluster_tag" state --substates anomaly_detector --json True | jq -r '
  .AnomalyDetectorState.recentBrokerFailures[]
  | {
      anomalyId,
      brokerId: (.failedBrokersByTimeMs | keys_unsorted[0]),
      status,
      detectionTime: (.detectionMs / 1000 | strftime("%Y-%m-%d %H:%M:%S")),
      statusUpdateTime: (.statusUpdateMs / 1000 | strftime("%Y-%m-%d %H:%M:%S")),
      sortKey: .statusUpdateMs
    }
  ' | jq -s -r '
    sort_by(.sortKey)[]
    | [.anomalyId, .brokerId, .status, .detectionTime, .statusUpdateTime] | @tsv
')

@CCisGG
Contributor

CCisGG commented Apr 23, 2025

@bsandeep23 note that ccdo is an internal command. So I'm guessing this is coming from the state endpoint output.

Contributor

@CCisGG CCisGG left a comment


Overall looks good. I left some questions.

if (brokerFailures.brokerFailureCheckWithDelayRetryCount() <= _brokerFailureCheckWithDelayMaxRetryCount) {
  // This means that we can retry for checking with delay
  if (hasNewFailureToAlert(brokerFailures, autoFixTriggered)) {
    alert(brokerFailures, autoFixTriggered, selfHealingTimeMs, KafkaAnomalyType.BROKER_FAILURE);
Contributor


Seems like we are alerting on each retry. Would this cause noisy alerts?

Contributor Author

@bsandeep23 bsandeep23 Apr 24, 2025


hasNewFailureToAlert

It does this check before alerting, so I am assuming this will alert only if a new broker fails.
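For context, a guard of this shape can be sketched as below. This is a simplified, hypothetical illustration of a "new failure" check, not the real hasNewFailureToAlert (which works off the brokerFailures anomaly and its failure timestamps); the point is only that the guard suppresses alerts for failures already alerted on.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an alert-deduplication guard: alert only when a
// broker failure is seen that has not been alerted on before.
final class AlertDeduper {
  private final Set<Integer> _alertedBrokers = new HashSet<>();

  /**
   * Returns true only if failedBrokersByTimeMs contains a broker we have
   * not alerted on yet; records all current failures either way.
   */
  boolean hasNewFailureToAlert(Map<Integer, Long> failedBrokersByTimeMs) {
    boolean hasNew = false;
    for (Integer brokerId : failedBrokersByTimeMs.keySet()) {
      if (_alertedBrokers.add(brokerId)) {
        hasNew = true;
      }
    }
    return hasNew;
  }
}
```

Under this shape, repeated CHECK_WITH_DELAY retries for the same failed broker would not re-alert, which matches the single log line observed in the confirmation below.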

Contributor


Makes sense. Let's confirm it.

Contributor Author


Confirmed from the cc logs.
I disabled self healing, induced a broker failure, and saw that the anomaly gets checked multiple times, but the log line in the alert method is printed only once. code #link

LOG.warn("{} detected {}. Self healing {}.", anomalyType, anomaly,
             _selfHealingEnabled.get(anomalyType) ? String.format("start time %s", utcDateFor(selfHealingStartTime)) : "is disabled");
$ grep "BROKER_FAILURE detected" logs/likafka-cruise-control.log
2025/04/25 13:28:10.135 WARN [SelfHealingNotifier] [AnomalyDetector-2] [kafka-cruise-control] [] BROKER_FAILURE detected {Fixable broker failures detected: {Broker 6228 failed at 2025-04-25T13:13:10Z}}. Self healing is disabled.
$ ./parse.sh
anomalyId	brokerId	status	detectionTime	statusUpdateTime
b4494501-950d-4e16-87dd-fc432a69cacd	6228	CHECK_WITH_DELAY	2025-04-25 13:13:10	2025-04-25 13:13:10
7a511679-7112-478d-bebd-83f8fc818e7e	6228	CHECK_WITH_DELAY	2025-04-25 13:28:10	2025-04-25 13:28:10
b2c34eef-8f4a-42a6-a485-d8427bc98a5a	6228	CHECK_WITH_DELAY	2025-04-25 13:43:10	2025-04-25 13:43:10
cb033f75-af05-471a-bf97-ebcf1ea3b3db	6228	CHECK_WITH_DELAY	2025-04-25 13:53:10	2025-04-25 13:53:10

@bsandeep23
Contributor Author

@bsandeep23 note that ccdo is an internal command. So I'm guessing this is coming from the state endpoint output.

Yes it is coming from the state endpoint output.

@CCisGG CCisGG merged commit 70e51ec into linkedin:migrate_to_kafka_3_0 Apr 28, 2025
6 checks passed