CLI: new health check that detects QQs without an elected leader #13433
base: main
Conversation
reuse and extend formatting API, with amqqueue:to_printable/2
silent mode in rabbitmq-queues leader_health_check command
all queues in all vhosts on local node
It is easy to see from

rabbitmq-diagnostics help | grep chec

that all existing health check commands except for the deprecated no-op one begin with check_, so this one should be named check_for_quorum_queues_without_an_elected_leader.

I'd rename --global to --across-all-vhosts.
@@ -144,6 +147,8 @@
-define(SNAPSHOT_INTERVAL, 8192). %% the ra default is 4096
% -define(UNLIMITED_PREFETCH_COUNT, 2000). %% something large for ra
-define(MIN_CHECKPOINT_INTERVAL, 8192). %% the ra default is 16384
-define(LEADER_HEALTH_CHECK_TIMEOUT, 1_000).
Timeouts lower than 5s are guaranteed to result in false positives.
leader_health_check(QueueNameOrRegEx, VHost) ->
    %% Set a process limit threshold to 40% of ErlangVM process limit, beyond which
    %% we cannot spawn any new processes for executing QQ leader health checks.
    ProcessLimitThreshold = round(0.4 * erlang:system_info(process_limit)),
40% sounds like a lot for a health check. I'd use 20% at most.
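The suggested cap can be expressed as a simple guard. Below is an illustrative sketch in Python (the actual project code is Erlang, and all function names here are hypothetical): compute a threshold as a fraction of the VM's process limit and refuse to spawn further check workers once the current process count reaches it.

```python
# Hypothetical sketch of the process-limit guard discussed above.
# The real implementation lives in Erlang; names here are invented.

def process_limit_threshold(process_limit: int, fraction: float = 0.2) -> int:
    """Max number of processes the health check may consume (20% cap)."""
    return round(fraction * process_limit)

def can_spawn_checks(current_process_count: int, process_limit: int) -> bool:
    """Refuse to start new health-check workers once the cap is reached."""
    return current_process_count < process_limit_threshold(process_limit)

# Example: with Erlang's default process_limit of 262144,
# a 20% cap allows up to 52429 concurrent check processes.
print(process_limit_threshold(262144))  # 52429
```

The point of the guard is that a burst of per-queue check processes can never push the node toward its hard `process_limit`, which would halt the VM.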
Qs =
    case VHost of
        global ->
            rabbit_amqqueue:list();
The modern module for working with the schema data store is rabbit_db_queue. It is aware of the metadata store used.
It also provides functions such as get_all_by_type/1.
@@ -18,6 +18,10 @@ defmodule RabbitMQ.CLI.Core.Output do
    :ok
  end

  def format_output({:ok, :check_passed, output}, formatter, options) do
Apparently some existing health checks return {:ok, value} and others use {:ok, :check_passed, value}. Fair enough.
great initiative!
What do you think about having a function check_local_leader_health() that would only check local leaders, avoiding a lot of potential inter-node communication (a ping for each remote queue), plus a wrapper function which calls check_local_leader_health() via RPC on every node? Similar to a common pattern in RabbitMQ.
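The local-check-plus-wrapper pattern described here can be sketched as follows. This is a Python stand-in (RabbitMQ itself would use Erlang RPC between nodes), and every name in it is hypothetical: each node checks only the queue leaders it hosts, and a cluster-wide wrapper fans out to every node and merges the results, so no node pings queues hosted elsewhere.

```python
# Hypothetical sketch of a per-node local check plus a cluster-wide wrapper.
# Threads stand in for RPC calls to remote nodes.

from concurrent.futures import ThreadPoolExecutor

def check_local_leader_health(node: str, queues_by_node: dict) -> list:
    """Names of leaderless queues whose leader should live on `node`."""
    return [q["name"] for q in queues_by_node.get(node, []) if q["leader"] is None]

def check_cluster_leader_health(nodes: list, queues_by_node: dict) -> list:
    """Wrapper: run the local check on every node and concatenate results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = pool.map(
            lambda n: check_local_leader_health(n, queues_by_node), nodes)
    return [name for local in results for name in local]

# Example cluster state: qq.2 has lost its leader.
cluster = {
    "rabbit@node1": [{"name": "qq.1", "leader": "rabbit@node1"}],
    "rabbit@node2": [{"name": "qq.2", "leader": None}],
}
print(check_cluster_leader_health(list(cluster), cluster))  # ['qq.2']
```

The design benefit is that inter-node traffic scales with the number of nodes (one RPC each) rather than with the number of queues.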
@gomoripeti we even have separate health checks, e.g. for local vs. cluster-wide alarms. I don't know if we need two separate health checks in this case. I'd try to get this command right first. FTR,
and use across_all_vhosts option for global checks
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit.
@gomoripeti @michaelklishin thanks for the feedback. For local or remote call optimization,
Proposed Changes
Hi Team 👋
The following changes are for a CLI diagnostics tool used for checking the health of Quorum Queue leaders. The main use is to detect the presence of quorum queues that are in a bad state (in particular, leaderless queues, where no messaging can be done), as reported and discussed here: #13101. There are some edge cases where quorum queues are observed to "lose leaders" and re-election is not triggered, leaving such queues in a bad state, where any node restart could result in complete message loss. This diagnostics tool allows us to carry out quick QQ leader health checks per vhost, or globally across all vhosts on a node/cluster, for a specific queue match-spec. The command signature is as follows:
Output on a healthy node would be as follows:
Output on a node with unhealthy quorum queue leaders would be as follows (json formatted):
In addition to the command implementation, a few items to note:

- ping is used as an aliveness and health check per leader.
- ProcessLimitThreshold is set to 40% of the node's process_limit to ensure process_count when executing these checks will never reach the node's process_limit (and halt the node).
- :check_passed results carry a payload. This is simply to help with clarity for commands such as these, in which the command check is a successful operation while the result itself is an error (a list of unhealthy QQs).

This diagnostics tool would help catch leaderless queues, which on the Management UI are listed as follows:
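The :check_passed-with-payload convention above can be illustrated with a small sketch. This is Python, not the actual Elixir/Erlang code, and the function name is hypothetical: the check operation itself succeeds, while the payload carries the (possibly empty) list of unhealthy queues.

```python
# Hypothetical sketch of the result convention: the check succeeds as an
# operation ("check_passed"), and the payload lists leaderless queues.

def leader_health_check(queues: list) -> tuple:
    """Return ('check_passed', leaderless) where leaderless names the
    queues that currently have no elected leader."""
    leaderless = [q["name"] for q in queues if q.get("leader") is None]
    return ("check_passed", leaderless)

status, unhealthy = leader_health_check([
    {"name": "qq.ok", "leader": "rabbit@node1"},
    {"name": "qq.bad", "leader": None},
])
print(status, unhealthy)  # check_passed ['qq.bad']
```

Separating "the check ran" from "what the check found" lets formatters render an empty payload as healthy and a non-empty one as the error report.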
In some environments it is quite risky to carry out maintenance operations such as a complete node shutdown. The main goal is to help avoid situations where nodes/queues are assumed to be in a good state (where other checks, such as quorum-critical checks that rely on quorum node count, pass) while QQ leaders may be in an unhealthy state. If nodes pass both of these checks, then we can deem them healthy and OK for various maintenance procedures. Following on from this detection tool, there are investigations underway to also fix the underlying problem reported in #13101.
Please take a look/review. We're keen on having these capabilities available upstream to catch and monitor this recurring issue^ 👍
Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply.

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.
CONTRIBUTING.md document

Further Comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.