How does alertmanager operate on high availability? is it supposed to send only one instance of a fired alert? how to properly configure it? #4347

Jonatake · 2025-04-09T19:07:38Z

I set up Alertmanager in high-availability mode on OpenShift as a StatefulSet, where each pod is one instance of the cluster. However, alerts seem to be sent from all instances even though gossip has settled and the cluster status shown on the status route is ready. Also, alerts seem to fire as soon as one pod comes up, which is also undesirable for me.
I would like two things to happen:

1. Alerts that are sent to all of the instances will be sent to the notifier exactly once (it's OK if once in a while an alert is sent twice, but the majority of the alerts must be sent exactly once).
2. Alerts will not be sent until the gossip is settled among all cluster members.

Does a properly configured high-availability Alertmanager cluster satisfy these criteria? If so, I would like an explanation of how to set it up.

The flags that I am currently using are:

--cluster.peer: each instance has its own route and the other pods' routes as cluster peers
--cluster.listen-address
--cluster.advertise-address
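
For readers following along, here is a minimal sketch of how the three flags above might be generated for each pod of a StatefulSet. It assumes peers are addressed through the stable per-pod DNS names of a headless Service, which is not necessarily how the setup described in this issue is wired. The names `alertmanager`, `alertmanager-headless`, `monitoring`, and the `POD_IP` environment variable are hypothetical; 9094 is the default cluster port.

```python
# Sketch: build the cluster flags for every pod of a StatefulSet.
# Assumption: pods are reachable at <statefulset>-<ordinal>.<service>.<namespace>.svc
# through a headless Service. All names passed in below are hypothetical.

def cluster_args(statefulset, service, namespace, replicas, port=9094):
    """Return the list of cluster flags shared by all pods."""
    args = [
        f"--cluster.listen-address=0.0.0.0:{port}",
        # POD_IP is a hypothetical env var, e.g. injected from status.podIP.
        f"--cluster.advertise-address=$(POD_IP):{port}",
    ]
    for ordinal in range(replicas):
        host = f"{statefulset}-{ordinal}.{service}.{namespace}.svc"
        # One --cluster.peer flag per member, including the pod itself.
        args.append(f"--cluster.peer={host}:{port}")
    return args


if __name__ == "__main__":
    for arg in cluster_args("alertmanager", "alertmanager-headless", "monitoring", 3):
        print(arg)
```

Keeping the number of `--cluster.peer` flags equal to the number of replicas avoids pointing at members that never come up.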

As far as the configuration goes, I tried defining group_wait and increasing group_interval, but these don't do the trick. My rules are evaluated once every 30 seconds and the default gossip interval is 0.2s, so I don't think that's the problem. I would love some help.

grobinson-grafana · 2025-04-09T20:08:28Z

Hi! 👋

> alerts that are sent to all of the instances will be sent to the notifier exactly once (it's ok if once in a while an alert might be sent twice but the majority of the alerts must be sent exactly once)

This is how it works. The guaranteed behavior is at-least once, but in most cases you will see exactly-once.

> alerts will not be sent until the gossip is settled among all cluster members.

Yes this is how it works. There is a stage in the notification pipeline that waits for gossip to be settled before proceeding.
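
To illustrate why the guarantee is at-least-once rather than exactly-once, here is a simplified toy model, not the actual Alertmanager code. It assumes each instance delays its notification by its position in the cluster multiplied by the peer timeout (15 seconds by default, set with `--cluster.peer-timeout`), then skips sending if the gossiped notification log already contains an entry. The real log is keyed per group and receiver and tracks more state; `simulate` and the group key are hypothetical names.

```python
# Toy model of staggered sending plus a gossiped notification log.
# This is an illustration under stated assumptions, not Alertmanager's code.
from dataclasses import dataclass, field

PEER_TIMEOUT_S = 15  # assumed default of --cluster.peer-timeout


@dataclass
class Peer:
    position: int                          # index of the peer in the cluster
    nflog: set = field(default_factory=set)  # notification log seen by this peer


def simulate(num_peers, gossip_works, group_key="alert-group-1"):
    """Return (position, wait_seconds) for every peer that actually notifies."""
    shared_log = set()
    peers = [
        # With working gossip all peers see the same log; otherwise each
        # peer only ever sees its own entries.
        Peer(i, shared_log if gossip_works else set())
        for i in range(num_peers)
    ]
    sent = []
    # Peers act in order of their wait time: position * peer timeout.
    for peer in sorted(peers, key=lambda p: p.position):
        wait = peer.position * PEER_TIMEOUT_S
        if group_key in peer.nflog:
            continue  # another instance already notified: deduplicated
        peer.nflog.add(group_key)
        sent.append((peer.position, wait))
    return sent


if __name__ == "__main__":
    print(simulate(3, gossip_works=True))
    print(simulate(3, gossip_works=False))
```

In this model, working gossip leaves only the first instance sending, while broken gossip makes every instance send after its own delay, which matches the duplicate notifications described in this issue.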

> each instance has its own route and the other pods routes as cluster peers

Not quite sure I understand this?

You can check metrics like alertmanager_cluster_peer_info to make sure the Alertmanagers have formed a cluster and can see each other, and that alertmanager_cluster_failed_peers is 0.

If alertmanager_cluster_messages_received_total is also 0 then Alertmanagers are unable to gossip between each other, which is required to deduplicate notifications.
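
A rough sketch of how these checks could be automated against the text served by each instance's `/metrics` endpoint. The parser is deliberately simplistic, and the `SAMPLE` values and label names are made up for illustration; `sum_metric` and `gossip_problems` are hypothetical helper names.

```python
# Sketch: flag gossip problems from a Prometheus text-format metrics dump.
import re

# Made-up sample; real output would come from each pod's /metrics endpoint.
SAMPLE = """\
# TYPE alertmanager_cluster_failed_peers gauge
alertmanager_cluster_failed_peers 0
# TYPE alertmanager_cluster_messages_received_total counter
alertmanager_cluster_messages_received_total{msg_type="full_state"} 4
alertmanager_cluster_messages_received_total{msg_type="update"} 12
"""

# metric name, optional {labels}, then the value (simplified: no '}' in labels)
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def sum_metric(text, name):
    """Sum all samples of a metric across label sets; None if it is absent."""
    total, found = 0.0, False
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comments
        match = LINE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
            found = True
    return total if found else None


def gossip_problems(text):
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    failed = sum_metric(text, "alertmanager_cluster_failed_peers")
    received = sum_metric(text, "alertmanager_cluster_messages_received_total")
    if failed is None or failed > 0:
        problems.append("failed peers reported (or metric missing)")
    if not received:
        problems.append("no gossip messages received")
    return problems


if __name__ == "__main__":
    print(gossip_problems(SAMPLE) or "cluster metrics look healthy")
```

Running this against every pod, rather than just one, helps catch the case where only some instances are cut off from gossip.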

Jonatake · 2025-05-03T19:59:32Z

Thanks for the answer.
I did some research, and after playing with it a bit I still run into some problems. As I mentioned, I set it up as a StatefulSet on OpenShift and experimented with a few things. Here are the problems I encountered; if you have any helpful insight, I would love to hear it:

1. First, I passed 3 cluster peer flags while running the StatefulSet with only 2 pods. In this setup, gossip settled even though not all cluster peers were up, and alerts still fired. Why does gossip settle when not all cluster instances are up? If this is intended, I would like to avoid it, so if that is possible I would love an explanation.
2. After that, I checked whether alerts were sent as duplicates, and it appears that even when gossip had settled between the two instances, all alerts were still sent from each instance. Is this how it is supposed to behave? Am I supposed to set up something other than the flags I mentioned for deduplication?
3. I then tried running only one pod (still with 3 peer flags, one of which points to the pod itself), and alerts were still sent. From my understanding, gossip will not settle and alerts will not be sent until all cluster members are up. Am I wrong about this? In what cases will Alertmanager send alerts regardless of the status of the other cluster instances? Is there a default timeout? If so, can I disable it?
4. Finally, I ran the StatefulSet with 3 pods and 3 peer flags and checked all the metrics you mentioned. They seemed OK, but alerts were still duplicated. Is there anything I missed in setting up Alertmanager? Is there anything I haven't accounted for when it comes to synchronizing the instances?

Thanks in advance!

grobinson-grafana added the kind/support label Apr 12, 2025
