
CLI: new health check that detects QQs without an elected leader #13433

Open · wants to merge 23 commits into main
Conversation

@Ayanda-D (Contributor) commented on Feb 26, 2025

Proposed Changes

Hi Team 👋

The following changes add a CLI diagnostics tool for checking the health of quorum queue leaders. The main use is to detect the presence of quorum queues that are in a bad state (in particular, leaderless queues on which no messaging can be done), as reported and discussed in #13101. There are some edge cases where quorum queues are observed to "lose" their leaders without re-election being triggered, leaving such queues in a bad state where any node restart could result in complete message loss. This diagnostics tool allows us to carry out quick QQ leader health checks per vhost, or globally across all vhosts on a node/cluster, for a given queue name pattern. The command signature is as follows:

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader [--vhost <vhost>] [--across-all-vhosts] <pattern>

Output on a healthy node would be as follows:

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"
Checking availability and health status of leaders for quorum queues matching .* in all vhosts ...
┌────────────────────────────────────────────────────────────────────────────────────────┐
│ Output                                                                                 │
├────────────────────────────────────────────────────────────────────────────────────────┤
│ Node rabbit@host-1 reported all quorum queue leaders as healthy                        │
└────────────────────────────────────────────────────────────────────────────────────────┘

Output on a node with unhealthy quorum queue leaders would be as follows (JSON-formatted):

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"  --formatter json
{"message":"Node rabbit@host-2 reported unhealthy quorum queue leaders","queues":[{"name":"TEST-QQ-1","readable_name":"queue 'TEST-QQ-1' in vhost 'VHOST-A'","type":"rabbit_quorum_queue","virtual_host":"VHOST-A"},{"name":"TEST-QQ-2","readable_name":"queue 'TEST-QQ-2' in vhost 'VHOST-B'","type":"rabbit_quorum_queue","virtual_host":"VHOST-B"}],"result":"error"}

In addition to the command implementation, a few items to note:

  • The command spawns leader health check procedures concurrently, one per QQ, and gathers the results, so the check executes quickly (see the sketch after this list)
  • Each leader health check procedure queries the leaderboard and pings the leader as an aliveness and health check
  • A ProcessLimitThreshold is set to 40% of the node's process_limit so that the process_count while executing these checks can never reach the node's process_limit (which would halt the node)
  • The CLI's output.ex is extended to support :check_passed results carrying a payload. This is simply for clarity in commands such as this one, where running the check is itself a successful operation while the result is an error (a list of unhealthy QQs)
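
Roughly, a single per-queue check can be pictured along these lines (a simplified sketch, not the exact code in the diff; ra_leaderboard:lookup_leader/1 and ra:ping/2 are existing ra APIs that match this description, while check_one_leader/2 and the return shapes here are illustrative only):

%% Hypothetical per-queue check: look up the queue's leader in ra's leaderboard,
%% then ping it; 'undefined' or a failed ping marks the queue as unhealthy.
check_one_leader(Q, Timeout) ->
    %% for a quorum queue, the pid field holds the ra server id {ClusterName, Node}
    {ClusterName, _Node} = amqqueue:get_pid(Q),
    case ra_leaderboard:lookup_leader(ClusterName) of
        undefined ->
            {unhealthy, amqqueue:get_name(Q)};
        LeaderId ->
            case ra:ping(LeaderId, Timeout) of
                {pong, leader} -> healthy;
                _Other         -> {unhealthy, amqqueue:get_name(Q)}
            end
    end.

One such check is spawned per quorum queue (subject to the ProcessLimitThreshold above) and the results are gathered before reporting.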

This diagnostics tool would help catch leaderless queues, which the Management UI lists as follows:

LEADERLESS-QUORUM-QUEUE-1

In some environments it is quite risky to carry out maintenance operations such as a complete node shutdown. The main goal is to avoid situations where nodes/queues are assumed to be in a good state (because other checks, such as the quorum-critical check that relies on quorum node counts, pass) while QQ leaders are actually unhealthy. If nodes pass both of these checks, we can deem them healthy and OK for various maintenance procedures. Following on from this detection tool, we are also investigating a fix for the underlying problem reported in #13101.
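
For example, a pre-maintenance gate could chain the existing quorum-critical check with this new one (an illustrative invocation, relying on the usual health check convention of a non-zero exit code on failure):

rabbitmq-diagnostics check_if_node_is_quorum_critical && \
  rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"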

Please take a look and review. We're keen on having these capabilities available upstream to catch and monitor this recurring issue 👍

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

@michaelklishin (Member) left a comment

It is easy to see from

rabbitmq-diagnostics help | grep chec

that all existing health check commands except for the deprecated no-op one begin with check_, so this one should be named check_for_quorum_queues_without_an_elected_leader.

I'd rename --global to --across-all-vhosts.

@@ -144,6 +147,8 @@
-define(SNAPSHOT_INTERVAL, 8192). %% the ra default is 4096
% -define(UNLIMITED_PREFETCH_COUNT, 2000). %% something large for ra
-define(MIN_CHECKPOINT_INTERVAL, 8192). %% the ra default is 16384
-define(LEADER_HEALTH_CHECK_TIMEOUT, 1_000).
Member

Timeouts lower than 5s are guaranteed to result in false positives.

leader_health_check(QueueNameOrRegEx, VHost) ->
%% Set a process limit threshold to 40% of ErlangVM process limit, beyond which
%% we cannot spawn any new processes for executing QQ leader health checks.
ProcessLimitThreshold = round(0.4 * erlang:system_info(process_limit)),
Member

40% sounds like a lot for a health check. I'd use 20% at most.

Qs =
case VHost of
global ->
rabbit_amqqueue:list();
Member

The modern module for working with schema data stores is rabbit_db_queue. It is aware of the metadata store used.

Member

It also provides functions such as get_all_by_type/1.
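
(For illustration, the listing could then look something like the following; the across_all_vhosts marker atom and the vhost filtering shown here are assumptions, not the PR's code:)

%% Illustrative only: list all QQs via rabbit_db_queue, then narrow to one vhost
Qs0 = rabbit_db_queue:get_all_by_type(rabbit_quorum_queue),
Qs = case VHost of
         across_all_vhosts -> Qs0;
         _ -> [Q || Q <- Qs0, amqqueue:get_vhost(Q) =:= VHost]
     end,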

@@ -18,6 +18,10 @@ defmodule RabbitMQ.CLI.Core.Output do
:ok
end

def format_output({:ok, :check_passed, output}, formatter, options) do
Member

Apparently some existing health checks return {:ok, value} and others use {:ok, :check_passed, value}. Fair enough.

@michaelklishin michaelklishin changed the title Diagnostics tool for QQ leader health checks CLI: new health check that detects QQs without an elected leader Feb 26, 2025
@gomoripeti (Contributor) left a comment

great initiative!

What do you think about having a function check_local_leader_health() that only checks local leaders, thereby avoiding a lot of potential inter-node communication (a ping for each remote queue), plus a wrapper function that calls check_local_leader_health() via RPC on every node? This is similar to a common pattern in RabbitMQ.
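
(A minimal sketch of that pattern, with hypothetical function names; erpc:call/5 and rabbit_nodes:list_running/0 are existing APIs, and real code would also handle RPC errors/timeouts:)

%% Hypothetical wrapper: run the node-local check on every running node via RPC
check_leader_health_on_all_nodes(Timeout) ->
    [{Node, erpc:call(Node, ?MODULE, check_local_leader_health, [], Timeout)}
     || Node <- rabbit_nodes:list_running()].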

@michaelklishin (Member)

@gomoripeti we even have separate health checks, e.g. for local vs. cluster-wide alarms.

I don't know if we need two separate health checks in this case. I'd try to get this command right first.

FTR, rabbit_quorum_queue:list_with_local_promotable/0 demonstrates how to filter only the local QQs (that have a replica on node()).

Subsequent commits: use the across_all_vhosts option for global checks, introduce rabbit_db_queue:get_all_by_type_and_vhost/2, and update the leader health check timeout to 5s and the process limit threshold to 20% of the node's process_limit.
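
(Presumably the revised constants end up roughly like this; a sketch inferred from the commit messages above, not copied from the diff:)

-define(LEADER_HEALTH_CHECK_TIMEOUT, 5_000).
ProcessLimitThreshold = round(0.2 * erlang:system_info(process_limit)),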
@Ayanda-D (Contributor, Author)

@gomoripeti @michaelklishin thanks for the feedback. For the local vs. remote call optimisation, gen_statem should already do that under the hood (if I'm understanding the question correctly). In this case though, even with large cluster sizes, the purpose of this command is to list leaderless queues without any awareness of whether they are local or remote, and it is only executed at specific times. I didn't see the need to filter by local or remote, unless you see the need for this extension.
