Fix the KafkaRoller issue talking to controllers #10941

tinaselenge · 2024-12-11T16:07:03Z

Type of change

Select the type of your PR

Refactor

Description

Use describeCluster to discover the active controller and get quorum information from it. This is necessary because describeQuorum can only be sent directly to the active controller; non-active controllers return an error for the request.
Remove version checks to simplify the changes, as this PR targets a Strimzi release that supports only Kafka 3.9 and later.
Refactor unit tests to remove ZooKeeper-based test cases and add more coverage for KRaft-based clusters.
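
The two-step lookup described above can be sketched roughly as follows. This is a dependency-free illustration under stated assumptions, not the code from this PR: `ActiveControllerLookup` is a hypothetical name, the `IntSupplier` stands in for a describeCluster call that yields the active controller ID (or -1 when none is known), and the `IntFunction` stands in for a describeQuorum request sent to one specific node.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.IntFunction;
import java.util.function.IntSupplier;

/** Hypothetical helper: find the active controller first, then query only that node. */
final class ActiveControllerLookup {
    private ActiveControllerLookup() { }

    /**
     * @param describeClusterControllerId assumed stand-in for describeCluster;
     *                                    returns the active controller ID, or -1 if unknown
     * @param describeQuorumOnNode        assumed stand-in for describeQuorum sent to one node;
     *                                    returns voter ID -> last caught-up timestamp (ms)
     * @return quorum information from the active controller, or empty if none is known
     */
    static Optional<Map<Integer, Long>> quorumFromActiveController(
            IntSupplier describeClusterControllerId,
            IntFunction<Map<Integer, Long>> describeQuorumOnNode) {
        int activeId = describeClusterControllerId.getAsInt();
        if (activeId < 0) {
            // No active controller known (for example, an election in progress):
            // do not send the quorum request anywhere.
            return Optional.empty();
        }
        // Only the active controller is asked for quorum information.
        return Optional.of(describeQuorumOnNode.apply(activeId));
    }
}
```

Returning an empty `Optional` leaves the decision about what to do without an active controller to the caller, which is the question raised in the review thread.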

Checklist

Please go through this checklist and make sure all applicable tasks have been done

Write tests
Make sure all tests pass
Update documentation
Check RBAC rights for Kubernetes / OpenShift roles
Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
Reference relevant issue(s) and close them after merging
Update CHANGELOG.md
Supply screenshots for visual changes, such as Grafana dashboards

scholzj · 2024-12-11T17:19:11Z

/azp run regression

azure-pipelines · 2024-12-11T17:19:23Z

Azure Pipelines successfully started running 1 pipeline(s).

showuon · 2024-12-12T10:01:39Z

As discussed in the Kafka call, we might need to work around it by sending DESCRIBE_CLUSTER first to find the leader node, and then sending DESCRIBE_QUORUM to the active controller, as the metadata-quorum script does.

showuon · 2024-12-13T07:38:36Z

On second thought, I think the retry should ideally be done on the admin client side. I opened KAFKA-18230 for the improvement.
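
As a rough illustration of what a client-side retry could look like, the sketch below re-discovers the leader before each attempt and retries when the contacted node reports that it is not the leader. It does not reflect how KAFKA-18230 is or will be implemented in the Kafka admin client; `NotLeaderRetry`, `NotLeaderException`, and both functional parameters are hypothetical stand-ins.

```java
import java.util.function.IntFunction;
import java.util.function.IntSupplier;

/** Hypothetical sketch of retrying a request after re-discovering the leader. */
final class NotLeaderRetry {
    private NotLeaderRetry() { }

    /** Assumed stand-in for a NOT_LEADER_OR_FOLLOWER error response. */
    static final class NotLeaderException extends RuntimeException {
        NotLeaderException(String message) {
            super(message);
        }
    }

    /**
     * @param findLeader  assumed stand-in for a leader lookup (e.g. via describeCluster)
     * @param sendToNode  assumed stand-in for sending the request to one node
     * @param maxAttempts upper bound on attempts; must be at least 1
     */
    static <T> T callWithLeaderRediscovery(IntSupplier findLeader,
                                           IntFunction<T> sendToNode,
                                           int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NotLeaderException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int leader = findLeader.getAsInt(); // look the leader up again on every attempt
            try {
                return sendToNode.apply(leader);
            } catch (NotLeaderException e) {
                last = e; // leadership moved; loop and re-discover
            }
        }
        throw last; // every attempt hit a non-leader node
    }
}
```

A real client would normally add a backoff between attempts; that is omitted here to keep the sketch short.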

tinaselenge · 2024-12-13T10:07:46Z

@showuon that's great. Thank you!

tinaselenge · 2024-12-13T10:08:39Z

> As discussed in the kafka call, we might need to workaround it by sending DESCRIBE_CLUSTER first to find the leader node, and then send the DESCRIBE_QUORUM to the active controller like the metadata-quorum script did.

Yes, I agree. I switched the PR to draft to implement describe cluster :)

Remove version checks to simplify the changes as this change will be targeting Strimzi that only supports Kafka version 3.9+
Refactor unit tests to remove ZK based testcases and add more coverage for KRaft based clusters

Signed-off-by: Gantigmaa Selenge <[email protected]>

tinaselenge · 2024-12-17T12:43:30Z

I have updated the PR so that KafkaRoller uses describeCluster when retrieving the active controller ID. When getting quorum information, it creates an admin client with only the active controller's bootstrap address.
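
A minimal sketch of building such a single-controller configuration is shown below. It relies on the Kafka admin client setting `bootstrap.controllers`; the class name `ControllerAdminConfig`, the method name, and the host and port passed in are hypothetical, and the TLS and authentication settings a real client needs are left out.

```java
import java.util.Properties;

/** Hypothetical helper that builds admin client properties for one controller. */
final class ControllerAdminConfig {
    /** Kafka admin client setting for bootstrapping through KRaft controllers. */
    static final String BOOTSTRAP_CONTROLLERS = "bootstrap.controllers";

    private ControllerAdminConfig() { }

    /**
     * @param host address of the active controller (placeholder value supplied by the caller)
     * @param port controller listener port (placeholder value supplied by the caller)
     */
    static Properties forSingleController(String host, int port) {
        if (host == null || host.isBlank()) {
            throw new IllegalArgumentException("host must not be blank");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        Properties props = new Properties();
        // Only the active controller is listed, so requests cannot land on another controller.
        props.setProperty(BOOTSTRAP_CONTROLLERS, host + ":" + port);
        return props;
    }
}
```

The resulting `Properties` could then be merged with the security settings and passed to the admin client factory.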

I also removed the version check for Kafka 3.9.0, since this change targets a release that does not support versions older than 3.9.0. This simplifies the code considerably.

There are a couple of TODOs I would like to discuss with others, particularly the one about the tcpProbe for an individual node, since this check no longer makes sense where it was. @scholzj @ppatierno @katheris, maybe you have thoughts on those? Of course, thoughts from anyone else would also be appreciated.

tinaselenge · 2024-12-17T12:46:13Z

cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java

+            // TODO: if the active controller is returned as -1 (maybe election was still in progress),
+            //      use the admin client bootstrapped with all controller nodes to see if it can hit the active controller by luck.
+            //      If it does not hit the active controller, it will return an error and this node will be retried later anyway.
+            //      Otherwise we can just return false here, and the node will be then retried later.


TODO to discuss:
This could cause a warning message with NOT_LEADER_OR_FOLLOWER, but both options would result in retrying the node if the active controller is not found.

scholzj · 2024-12-17

What exactly are the situations when we expect to run into this? When there are no running brokers and only controllers?

tinaselenge · 2024-12-17

We are querying the active controller from the controllers, not the brokers, so having no running brokers should not be a problem. I think they could return -1 if no leader has been elected yet or an election is still in progress. It could also be an indication of the quorum being in a bad state.

ppatierno · 2024-12-20

If we cannot get the active controller (for any reason), how dangerous is it to return true and roll the controller we are checking, versus returning false, not rolling, and waiting for a better time to get the active controller? Could that be "forever" because of a bad state in the quorum?

tinaselenge · 2025-01-12

In this case, if we cannot get the active controller, we cannot perform the quorum check. Currently, if we cannot perform the quorum check, we do not roll the pod; it throws UnforceableProblem. So the question is: do we follow the same logic, or use the admin client bootstrapped with all controller nodes?
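
The more conservative of the two options discussed in this thread (do not roll when the active controller is unknown) might look like the sketch below. It is an illustration only, not KafkaRoller's code: `ControllerRollDecision` and its parameters are hypothetical, and the "caught up" rule (a voter is within `fetchTimeoutMs` of the leader's last caught-up timestamp, and a majority of voters other than the node being rolled must be caught up) is an assumption made for the example.

```java
import java.util.Map;

/** Hypothetical decision helper for rolling a single controller node. */
final class ControllerRollDecision {
    enum Decision { ROLL, RETRY_LATER }

    private ControllerRollDecision() { }

    /**
     * @param nodeToRoll         ID of the controller we would like to restart
     * @param activeControllerId active controller ID, or -1 if none is known
     * @param lastCaughtUpMs     voter ID -> last caught-up timestamp in ms (assumed input shape)
     * @param fetchTimeoutMs     maximum lag behind the leader to still count as caught up (assumed rule)
     */
    static Decision decide(int nodeToRoll,
                           int activeControllerId,
                           Map<Integer, Long> lastCaughtUpMs,
                           long fetchTimeoutMs) {
        // Without a known active controller there is no trustworthy quorum view: do not roll.
        if (activeControllerId < 0 || !lastCaughtUpMs.containsKey(activeControllerId)) {
            return Decision.RETRY_LATER;
        }
        long leaderTimestamp = lastCaughtUpMs.get(activeControllerId);
        // Count voters, excluding the node to roll, that are close enough to the leader.
        long caughtUpOthers = lastCaughtUpMs.entrySet().stream()
                .filter(e -> e.getKey() != nodeToRoll)
                .filter(e -> leaderTimestamp - e.getValue() < fetchTimeoutMs)
                .count();
        int majority = lastCaughtUpMs.size() / 2 + 1;
        return caughtUpOthers >= majority ? Decision.ROLL : Decision.RETRY_LATER;
    }
}
```

Under this rule a single-controller quorum would never be rolled, so a real implementation would need to treat that case separately. The sketch also does not address the concern raised above that waiting could last indefinitely if the quorum stays in a bad state.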

tinaselenge · 2025-02-12T10:05:41Z

Closing this PR: instead of trying to fix this issue here, we will wait until Kafka 4.0 and later versions, which include the fix (KAFKA-18230), are supported. See #10941 (comment).

tinaselenge marked this pull request as ready for review December 11, 2024 16:37

tinaselenge marked this pull request as draft December 11, 2024 17:05

scholzj added this to the 0.46.0 milestone Dec 11, 2024

tinaselenge force-pushed the fix-roller branch 2 times, most recently from 09fa158 to 7f2c57a December 16, 2024 14:54

tinaselenge force-pushed the fix-roller branch from 7f2c57a to eea6559 December 16, 2024 23:12

scholzj reviewed Dec 17, 2024

tinaselenge closed this Feb 12, 2025
