Fix ignored broker failure anomalies when self healing is disabled #2270


Merged
merged 8 commits into linkedin:migrate_to_kafka_3_0 on Apr 28, 2025

Conversation

bsandeep23
Contributor

@bsandeep23 bsandeep23 commented Apr 22, 2025

Summary

Why:

When self healing is disabled and a broker failure anomaly reaches the auto-fix threshold, the current behavior is to ignore the anomaly.
This is acceptable when self healing is permanently disabled. However, when self healing is only temporarily disabled, this behavior is not ideal: the under-replicated partitions caused by the broker failure get fixed only through human intervention or through a subsequent self-heal triggered by another anomaly.

What:

When a broker failure anomaly reaches the auto-fix threshold while self healing is disabled, instead of ignoring the anomaly, recheck it after a certain delay, in the hope that self healing gets re-enabled and a fix can be triggered. Introduce a configurable max retry count so that the recheck does not repeat forever when self healing is permanently disabled; once the retries exceed the maximum, the anomaly can be ignored. In addition, introduce a config that controls the delay between successive checks.
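The recheck-with-delay decision described above can be sketched roughly as follows. This is a minimal illustration only; the class, method, and config names here are hypothetical, not the actual Cruise Control identifiers.

```java
// Illustrative sketch of the recheck-with-delay decision, NOT the actual
// Cruise Control code. All names here are hypothetical.
final class BrokerFailureRecheckPolicy {
  private final int _maxRetryCount;      // hypothetical stand-in for the new max-retry config
  private final long _recheckIntervalMs; // hypothetical stand-in for the new recheck-delay config

  BrokerFailureRecheckPolicy(int maxRetryCount, long recheckIntervalMs) {
    _maxRetryCount = maxRetryCount;
    _recheckIntervalMs = recheckIntervalMs;
  }

  enum Action { FIX, CHECK_WITH_DELAY, IGNORE }

  /**
   * Decide what to do once the anomaly has reached the auto-fix threshold.
   * If self healing is disabled, re-check later in case it gets re-enabled,
   * but give up (IGNORE) after the configured number of retries.
   */
  Action decide(boolean selfHealingEnabled, int retryCount) {
    if (selfHealingEnabled) {
      return Action.FIX;
    }
    return retryCount <= _maxRetryCount ? Action.CHECK_WITH_DELAY : Action.IGNORE;
  }

  long recheckIntervalMs() {
    return _recheckIntervalMs;
  }
}
```

With a max retry count of 3 and a 2-minute interval (the values used in the testing below), this yields CHECK_WITH_DELAY until the retries are exhausted, then IGNORE, matching the anomaly timelines in the evidence.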

Expected Behavior

The anomaly should be rechecked after a certain delay when self healing is disabled.

Actual Behavior

The anomaly is ignored.

Steps to Reproduce

  1. Disable self healing.
  2. Trigger a broker failure.
  3. Wait until the broker failure anomaly reaches the auto fix threshold.
  4. Re-enable self healing.
  5. The broker failure anomaly is not fixed.

Additional evidence

Testing Logs:

Scenario 1:

  1. Self healing is disabled.
  2. Broker 1468 fails.
  3. The broker failure anomaly reaches the auto-fix threshold while self healing is still disabled; after the retries are exhausted, the anomaly should be ignored.
  4. Note: tested with a max retry count of 3 and a recheck interval of 2 minutes. The first CHECK_WITH_DELAY entry covers the case where the auto-fix threshold has not yet been reached.
# Self heal disabled

USER TASK ID                          CLIENT ADDRESS                   START TIME            STATUS      REQUEST URL
9cf37b50-4dc9-4407-affb-dfd91aa95a0a  <client_ip> 2025-04-21T19:14:04Z  Completed   POST /kafkacruisecontrol/admin?disable_self_healing_for=broker_failure

# Broker anomaly and status timeline

anomalyId       brokerId        status  detectionTime   statusUpdateTime
017e6f7e-13d5-407c-aaba-dfcae46bbe85    1468    CHECK_WITH_DELAY        2025-04-21 19:17:20     2025-04-21 19:17:20
94d79b77-993b-4ccb-b67c-eedf6073a47b    1468    CHECK_WITH_DELAY        2025-04-21 19:19:20     2025-04-21 19:19:20
2c7fe743-1792-4c0e-8b75-04e2acecad5b    1468    CHECK_WITH_DELAY        2025-04-21 19:21:20     2025-04-21 19:21:20
4f4b3cc7-2ee4-48d8-bee5-f32a3a19cc05    1468    CHECK_WITH_DELAY        2025-04-21 19:23:20     2025-04-21 19:23:20
5aab73a2-ae6d-4865-8f34-73a8ec1ba75c    1468    IGNORED 2025-04-21 19:25:20     2025-04-21 19:25:20

Scenario 2:

  1. Self healing is disabled.
  2. Broker 4600 fails.
  3. The broker failure anomaly reaches the auto-fix threshold.
  4. After 2 recheck retries, self healing is re-enabled. A fix should be started.
  5. Note: tested with a max retry count of 3 and a recheck interval of 2 minutes.
# Self heal timeline

1b58c1f3-5dfb-45b0-9958-529d3418ea60 <client_ip>                  2025-04-21T20:00:44Z  Completed   POST /kafkacruisecontrol/admin?disable_self_healing_for=broker_failure
9a9be753-7f13-431e-9093-e04d719e7979  <client_ip>  2025-04-21T20:13:07Z  Completed   POST /kafkacruisecontrol/admin?enable_self_healing_for=broker_failure

# Broker anomaly and status timeline

10331511-3ae1-4028-a487-20fce5175458    4600    CHECK_WITH_DELAY        2025-04-21 20:08:43     2025-04-21 20:08:43
e6ead586-cab6-4362-883b-fc33664470a7    4600    CHECK_WITH_DELAY        2025-04-21 20:10:43     2025-04-21 20:10:43
7283326e-b7b2-486d-85b4-b55decc4e228    4600    CHECK_WITH_DELAY        2025-04-21 20:12:43     2025-04-21 20:12:43
ab4e40b2-6df1-4cf0-a6a8-444d340d75d7    4600    FIX_STARTED     2025-04-21 20:14:43     2025-04-21 20:14:45

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

This PR resolves #2269.

Sandeep Boddu and others added 6 commits April 22, 2025 09:02
LIKAFKA-61583 Handle broker failures during self healing disabled state

Intermediate debugging

add new logs

Working fix

Remove debug logging

Remove temporary debug logs

Rename variables and fix CI checks
@bsandeep23 bsandeep23 changed the title Fix lost anomalies Fix ignored broker failure anomalies when self healing is disabled Apr 22, 2025
@bsandeep23 bsandeep23 marked this pull request as ready for review April 22, 2025 19:04
@CCisGG
Contributor

CCisGG commented Apr 22, 2025

10331511-3ae1-4028-a487-20fce5175458 4600 CHECK_WITH_DELAY 2025-04-21 20:08:43 2025-04-21 20:08:43
e6ead586-cab6-4362-883b-fc33664470a7 4600 CHECK_WITH_DELAY 2025-04-21 20:10:43 2025-04-21 20:10:43
7283326e-b7b2-486d-85b4-b55decc4e228 4600 CHECK_WITH_DELAY 2025-04-21 20:12:43 2025-04-21 20:12:43
ab4e40b2-6df1-4cf0-a6a8-444d340d75d7 4600 FIX_STARTED 2025-04-21 20:14:43 2025-04-21 20:14:45

Is this output from an existing endpoint, or from the logs? This looks very helpful, and I don't recall having seen such information before.

@bsandeep23
Contributor Author

10331511-3ae1-4028-a487-20fce5175458 4600 CHECK_WITH_DELAY 2025-04-21 20:08:43 2025-04-21 20:08:43
e6ead586-cab6-4362-883b-fc33664470a7 4600 CHECK_WITH_DELAY 2025-04-21 20:10:43 2025-04-21 20:10:43
7283326e-b7b2-486d-85b4-b55decc4e228 4600 CHECK_WITH_DELAY 2025-04-21 20:12:43 2025-04-21 20:12:43
ab4e40b2-6df1-4cf0-a6a8-444d340d75d7 4600 FIX_STARTED 2025-04-21 20:14:43 2025-04-21 20:14:45

Is this output from an existing endpoint, or from the logs? This looks very helpful, and I don't recall having seen such information before.

@CCisGG
Yeah, I used the ccdo state command with the anomaly_detector substate and parsed the JSON to present it this way.

#!/bin/bash
fabric=$1
cluster_tag=$2
(echo -e "anomalyId\tbrokerId\tstatus\tdetectionTime\tstatusUpdateTime"; \
 ccdo -f "$fabric" -t "$cluster_tag" state --substates anomaly_detector --json True | jq -r '
  .AnomalyDetectorState.recentBrokerFailures[]
  | {
      anomalyId,
      brokerId: (.failedBrokersByTimeMs | keys_unsorted[0]),
      status,
      detectionTime: (.detectionMs / 1000 | strftime("%Y-%m-%d %H:%M:%S")),
      statusUpdateTime: (.statusUpdateMs / 1000 | strftime("%Y-%m-%d %H:%M:%S")),
      sortKey: .statusUpdateMs
    }
  ' | jq -s -r '
    sort_by(.sortKey)[]
    | [.anomalyId, .brokerId, .status, .detectionTime, .statusUpdateTime] | @tsv
')

@CCisGG
Contributor

CCisGG commented Apr 23, 2025

@bsandeep23 note that ccdo is an internal command. So I'm guessing this is coming from the state endpoint output.

Contributor

@CCisGG CCisGG left a comment


Overall looks good. I left some questions.

if (brokerFailures.brokerFailureCheckWithDelayRetryCount() <= _brokerFailureCheckWithDelayMaxRetryCount) {
  // This means that we can retry for checking with delay
  if (hasNewFailureToAlert(brokerFailures, autoFixTriggered)) {
    alert(brokerFailures, autoFixTriggered, selfHealingTimeMs, KafkaAnomalyType.BROKER_FAILURE);
Contributor


Seems like we are alerting on each retry. Would this cause noisy alerts?

Contributor Author

@bsandeep23 bsandeep23 Apr 24, 2025


hasNewFailureToAlert

It does this check before alerting, so I am assuming this will alert only if a new broker fails.
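For context, a guard of this shape can be sketched as below. This is a simplified, hypothetical illustration of a "new failure" check, not the real hasNewFailureToAlert (which works off the brokerFailures anomaly and its failure timestamps); the point is only that the guard suppresses alerts for failures already alerted on.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an alert-deduplication guard: alert only when a
// broker failure is seen that has not been alerted on before.
final class AlertDeduper {
  private final Set<Integer> _alertedBrokers = new HashSet<>();

  /**
   * Returns true only if failedBrokersByTimeMs contains a broker we have
   * not alerted on yet; records all current failures either way.
   */
  boolean hasNewFailureToAlert(Map<Integer, Long> failedBrokersByTimeMs) {
    boolean hasNew = false;
    for (Integer brokerId : failedBrokersByTimeMs.keySet()) {
      if (_alertedBrokers.add(brokerId)) {
        hasNew = true;
      }
    }
    return hasNew;
  }
}
```

Under this shape, repeated CHECK_WITH_DELAY retries for the same failed broker would not re-alert, which matches the single log line observed in the confirmation below.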

Contributor


Makes sense. Let's confirm it.

Contributor Author


Confirmed from the cc logs.
I disabled self healing, induced a broker failure, and saw that the anomaly gets checked multiple times, but the log line in the alert method is printed only once. code #link

LOG.warn("{} detected {}. Self healing {}.", anomalyType, anomaly,
             _selfHealingEnabled.get(anomalyType) ? String.format("start time %s", utcDateFor(selfHealingStartTime)) : "is disabled");
$ grep "BROKER_FAILURE detected" logs/likafka-cruise-control.log
2025/04/25 13:28:10.135 WARN [SelfHealingNotifier] [AnomalyDetector-2] [kafka-cruise-control] [] BROKER_FAILURE detected {Fixable broker failures detected: {Broker 6228 failed at 2025-04-25T13:13:10Z}}. Self healing is disabled.
$ ./parse.sh
anomalyId	brokerId	status	detectionTime	statusUpdateTime
b4494501-950d-4e16-87dd-fc432a69cacd	6228	CHECK_WITH_DELAY	2025-04-25 13:13:10	2025-04-25 13:13:10
7a511679-7112-478d-bebd-83f8fc818e7e	6228	CHECK_WITH_DELAY	2025-04-25 13:28:10	2025-04-25 13:28:10
b2c34eef-8f4a-42a6-a485-d8427bc98a5a	6228	CHECK_WITH_DELAY	2025-04-25 13:43:10	2025-04-25 13:43:10
cb033f75-af05-471a-bf97-ebcf1ea3b3db	6228	CHECK_WITH_DELAY	2025-04-25 13:53:10	2025-04-25 13:53:10

@bsandeep23
Contributor Author

@bsandeep23 note that ccdo is an internal command. So I'm guessing this is coming from the state endpoint output.

Yes it is coming from the state endpoint output.

@CCisGG CCisGG merged commit 70e51ec into linkedin:migrate_to_kafka_3_0 Apr 28, 2025
6 checks passed