
CLI: new health check that detects QQs without an elected leader #13433

Open · wants to merge 23 commits into main
Conversation

@Ayanda-D (Contributor) commented on Feb 26, 2025

Proposed Changes

Hi Team 👋

The following changes add a CLI diagnostics tool for checking the health of quorum queue leaders. The main use is to detect the presence of quorum queues that are in a bad state (in particular, leaderless queues on which no messaging can be done), as reported and discussed in #13101. There are some edge cases where quorum queues are observed to "lose" their leaders without re-election being triggered, leaving such queues in a bad state where any node restart could result in complete message loss. This diagnostics tool allows us to carry out quick QQ leader health checks per vhost, or globally across all vhosts on a node/cluster, for a given queue name pattern. The command signature is as follows:

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader [--vhost <vhost>] [--across-all-vhosts] <pattern>

Output on a healthy node would be as follows:

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"
Checking availability and health status of leaders for quorum queues matching .* in all vhosts ...
┌────────────────────────────────────────────────────────────────────────────────────────┐
│ Output                                                                                 │
├────────────────────────────────────────────────────────────────────────────────────────┤
│ Node rabbit@host-1 reported all quorum queue leaders as healthy                        │
└────────────────────────────────────────────────────────────────────────────────────────┘

Output on a node with unhealthy quorum queue leaders would be as follows (JSON-formatted):

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"  --formatter json
{"message":"Node rabbit@host-2 reported unhealthy quorum queue leaders","queues":[{"name":"TEST-QQ-1","readable_name":"queue 'TEST-QQ-1' in vhost 'VHOST-A'","type":"rabbit_quorum_queue","virtual_host":"VHOST-A"},{"name":"TEST-QQ-2","readable_name":"queue 'TEST-QQ-2' in vhost 'VHOST-B'","type":"rabbit_quorum_queue","virtual_host":"VHOST-B"}],"result":"error"}

In addition to the command implementation, a few items to note:

  • The command spawns leader health check procedures concurrently, one per QQ, and gathers the results, so the check executes quickly (see the sketch after this list)
  • Each leader health check procedure queries the leaderboard and pings the leader as an aliveness and health check
  • A ProcessLimitThreshold is set to 40% of the node's process_limit so that the process_count while executing these checks can never reach the node's process_limit (which would halt the node)
  • The CLI's output.ex is extended to support :check_passed results carrying a payload. This is simply for clarity in commands such as this one, where running the check is itself a successful operation while the result is an error (a list of unhealthy QQs)
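
Roughly, a single per-queue check can be pictured along these lines (a simplified sketch, not the exact code in the diff; ra_leaderboard:lookup_leader/1 and ra:ping/2 are existing ra APIs that match this description, while check_one_leader/2 and the return shapes here are illustrative only):

%% Hypothetical per-queue check: look up the queue's leader in ra's leaderboard,
%% then ping it; 'undefined' or a failed ping marks the queue as unhealthy.
check_one_leader(Q, Timeout) ->
    %% for a quorum queue, the pid field holds the ra server id {ClusterName, Node}
    {ClusterName, _Node} = amqqueue:get_pid(Q),
    case ra_leaderboard:lookup_leader(ClusterName) of
        undefined ->
            {unhealthy, amqqueue:get_name(Q)};
        LeaderId ->
            case ra:ping(LeaderId, Timeout) of
                {pong, leader} -> healthy;
                _Other         -> {unhealthy, amqqueue:get_name(Q)}
            end
    end.

One such check is spawned per quorum queue (subject to the ProcessLimitThreshold above) and the results are gathered before reporting.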

This diagnostics tool would help catch leaderless queues, which the Management UI lists as follows:

LEADERLESS-QUORUM-QUEUE-1

In some environments it is quite risky to carry out maintenance operations such as a complete node shutdown. The main goal is to avoid situations where nodes/queues are assumed to be in a good state (because other checks, such as the quorum-critical check that relies on quorum node counts, pass) while QQ leaders are actually unhealthy. If nodes pass both of these checks, we can deem them healthy and OK for various maintenance procedures. Following on from this detection tool, we are also investigating a fix for the underlying problem reported in #13101.
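
For example, a pre-maintenance gate could chain the existing quorum-critical check with this new one (an illustrative invocation, relying on the usual health check convention of a non-zero exit code on failure):

rabbitmq-diagnostics check_if_node_is_quorum_critical && \
  rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"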

Please take a look and review. We're keen on having these capabilities available upstream to catch and monitor this recurring issue 👍

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

@michaelklishin (Member) left a comment

It is easy to see from

rabbitmq-diagnostics help | grep chec

that all existing health check commands except for the deprecated no-op one begin with check_, so this one should be named check_for_quorum_queues_without_an_elected_leader.

I'd rename --global to --across-all-vhosts.

@@ -144,6 +147,8 @@
-define(SNAPSHOT_INTERVAL, 8192). %% the ra default is 4096
% -define(UNLIMITED_PREFETCH_COUNT, 2000). %% something large for ra
-define(MIN_CHECKPOINT_INTERVAL, 8192). %% the ra default is 16384
-define(LEADER_HEALTH_CHECK_TIMEOUT, 1_000).
Member

Timeouts lower than 5s are guaranteed to result in false positives.

leader_health_check(QueueNameOrRegEx, VHost) ->
%% Set a process limit threshold to 40% of ErlangVM process limit, beyond which
%% we cannot spawn any new processes for executing QQ leader health checks.
ProcessLimitThreshold = round(0.4 * erlang:system_info(process_limit)),
Member

40% sounds like a lot for a health check. I'd use 20% at most.

Qs =
case VHost of
global ->
rabbit_amqqueue:list();
Member

The modern module for working with schema data stores is rabbit_db_queue. It is aware of the metadata store used.

Member

It also provides functions such as get_all_by_type/1.
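
(For illustration, the listing could then look something like the following; the across_all_vhosts marker atom and the vhost filtering shown here are assumptions, not the PR's code:)

%% Illustrative only: list all QQs via rabbit_db_queue, then narrow to one vhost
Qs0 = rabbit_db_queue:get_all_by_type(rabbit_quorum_queue),
Qs = case VHost of
         across_all_vhosts -> Qs0;
         _ -> [Q || Q <- Qs0, amqqueue:get_vhost(Q) =:= VHost]
     end,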

@@ -18,6 +18,10 @@ defmodule RabbitMQ.CLI.Core.Output do
:ok
end

def format_output({:ok, :check_passed, output}, formatter, options) do
Member

Apparently some existing health checks return {:ok, value} and others use {:ok, :check_passed, value}. Fair enough.

@michaelklishin michaelklishin changed the title Diagnostics tool for QQ leader health checks CLI: new health check that detects QQs without an elected leader Feb 26, 2025
@gomoripeti (Contributor) left a comment

great initiative!

What do you think about having a function check_local_leader_health() that only checks local leaders, thereby avoiding a lot of potential inter-node communication (a ping for each remote queue), plus a wrapper function that calls check_local_leader_health() via RPC on every node? This is similar to a common pattern in RabbitMQ.
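
(A minimal sketch of that pattern, with hypothetical function names; erpc:call/5 and rabbit_nodes:list_running/0 are existing APIs, and real code would also handle RPC errors/timeouts:)

%% Hypothetical wrapper: run the node-local check on every running node via RPC
check_leader_health_on_all_nodes(Timeout) ->
    [{Node, erpc:call(Node, ?MODULE, check_local_leader_health, [], Timeout)}
     || Node <- rabbit_nodes:list_running()].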

@michaelklishin (Member)

@gomoripeti we even have separate health checks, e.g. for local vs. cluster-wide alarms.

I don't know if we need two separate health checks in this case. I'd try to get this command right first.

FTR, rabbit_quorum_queue:list_with_local_promotable/0 demonstrates how to filter only the local QQs (that have a replica on node()).

Subsequent commits: use the across_all_vhosts option for global checks, introduce rabbit_db_queue:get_all_by_type_and_vhost/2, and update the leader health check timeout to 5s and the process limit threshold to 20% of the node's process_limit.
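
(Presumably the revised constants end up roughly like this; a sketch inferred from the commit messages above, not copied from the diff:)

-define(LEADER_HEALTH_CHECK_TIMEOUT, 5_000).
ProcessLimitThreshold = round(0.2 * erlang:system_info(process_limit)),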
@Ayanda-D (Contributor, Author)

@gomoripeti @michaelklishin thanks for the feedback. For the local vs. remote call optimisation, gen_statem should already do that under the hood (if I'm understanding the question correctly). In this case though, even with large cluster sizes, the purpose of this command is to list leaderless queues without any awareness of whether they are local or remote, and it is only executed at specific times. I didn't see the need to filter by local or remote, unless you see the need for this extension.
