Fix the KafkaRoller issue talking to controllers #10941
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
As discussed in the Kafka call, we might need to work around it by sending
On second thought, I think the retry should ideally be done on the admin client side. Opened KAFKA-18230 for the improvement.
@showuon that's great. Thank you!
Yes, I agree. I switched the PR to draft to implement describe cluster :)
09fa158 to 7f2c57a
Remove version checks to simplify the changes, as this change will target a Strimzi release that only supports Kafka version 3.9+.
Refactor unit tests to remove ZK-based test cases and add more coverage for KRaft-based clusters.
Signed-off-by: Gantigmaa Selenge <[email protected]>
7f2c57a to eea6559
I have updated the PR so that KafkaRoller uses describeCluster when retrieving the active controller ID. When getting quorum information, it creates an admin client with just the active controller's bootstrap address. I also removed the version check for Kafka 3.9.0, since this change targets a release that does not support versions older than 3.9.0. This simplifies the code considerably. There are a couple of TODOs I would like to discuss with others, particularly the one about the tcpProbe for an individual node, since this check no longer makes sense where it was. @scholzj @ppatierno @katheris, maybe you have thoughts on those? Of course, thoughts from anyone else would also be appreciated.
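To illustrate the step of pointing an admin client at only the active controller, here is a minimal sketch and not the PR's actual code. It assumes the client is configured through the Kafka Admin client's `bootstrap.controllers` setting; the class name, the `forActiveController` helper, and the controller addresses are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper, not part of KafkaRoller.
class ActiveControllerBootstrap {
    // Kafka Admin client setting for connecting directly to KRaft controllers.
    static final String BOOTSTRAP_CONTROLLERS = "bootstrap.controllers";

    /**
     * Builds Admin client properties that point only at the active controller.
     * Returns empty when the controller ID is unknown (-1) or has no known address,
     * so the caller can decide whether to retry later.
     */
    static Optional<Properties> forActiveController(int activeControllerId,
                                                    Map<Integer, String> controllerAddresses) {
        if (activeControllerId < 0) {
            return Optional.empty();
        }
        String address = controllerAddresses.get(activeControllerId);
        if (address == null) {
            return Optional.empty();
        }
        Properties props = new Properties();
        props.setProperty(BOOTSTRAP_CONTROLLERS, address);
        return Optional.of(props);
    }

    public static void main(String[] args) {
        // Addresses are made up for illustration only.
        Map<Integer, String> addresses = Map.of(
                0, "controller-0.example:9090",
                1, "controller-1.example:9090",
                2, "controller-2.example:9090");
        System.out.println(forActiveController(1, addresses));
        System.out.println(forActiveController(-1, addresses));
    }
}
```

The resulting properties would then be passed to the admin client factory; returning an empty `Optional` keeps the "active controller unknown" case explicit instead of silently falling back to all controllers.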
// TODO: if the active controller is returned as -1 (maybe election was still in progress),
// use the admin client bootstrapped with all controller nodes to see if it can hit the active controller by luck.
// If it does not hit the active controller, it will return an error and this node will be retried later anyway.
// Otherwise we can just return false here, and the node will be then retried later.
TODO to discuss:
This could cause a warning message with NOT_LEADER_OR_FOLLOWER, but both options would result in retrying the node if the active controller is not found.
What exactly are the situations when we expect to run into this? When there are no running brokers and only controllers?
We are querying the active controller from the controllers, not the brokers, so having no running brokers should not be a problem. I think they could return -1 if no leader has been elected yet or an election is still in progress. It could also be an indication of the quorum being in a bad state.
If we cannot get the active controller (for any reason), how dangerous is returning true and rolling the controller we are checking, versus returning false, not rolling, and waiting for a better time to get the active controller? Could that wait last "forever" because of a bad state in the quorum?
In this case, if we cannot get the active controller, we cannot perform the quorum check. Currently, if we cannot perform the quorum check, we do not roll the pod; it throws UnforceableProblem. So the question is: do we follow the same logic, or use the admin client bootstrapped with all controller nodes?
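The two options under discussion can be sketched as a small decision function. This is an illustrative sketch only: the class, enum, and flag names are hypothetical and do not appear in KafkaRoller, and -1 is taken to mean "no active controller known", as described in the comments above.

```java
// Hypothetical sketch of the choice being discussed, not KafkaRoller code.
class QuorumCheckDecision {
    enum Action {
        // Active controller is known: run the quorum check against it.
        CHECK_QUORUM_VIA_ACTIVE_CONTROLLER,
        // Follow the existing logic: do not roll now, retry the node later.
        RETRY_LATER,
        // Alternative: bootstrap with all controllers and hope to reach the leader.
        TRY_ALL_CONTROLLERS
    }

    static Action decide(int activeControllerId, boolean fallBackToAllControllers) {
        if (activeControllerId >= 0) {
            return Action.CHECK_QUORUM_VIA_ACTIVE_CONTROLLER;
        }
        // -1: election may still be in progress, or the quorum is in a bad state.
        return fallBackToAllControllers ? Action.TRY_ALL_CONTROLLERS : Action.RETRY_LATER;
    }

    public static void main(String[] args) {
        System.out.println(decide(2, false));
        System.out.println(decide(-1, false));
        System.out.println(decide(-1, true));
    }
}
```

Either branch taken for -1 ends with the node being retried if the leader is not reached; the difference is whether an extra request (and a possible NOT_LEADER_OR_FOLLOWER warning) happens first.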
cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java
Closing this PR because, instead of trying to fix this issue, we will wait until Kafka 4.0 and later versions, which have the fix (KAFKA-18230), are supported. #10941 (comment)