Description
What did you do?
I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.
You can reproduce this without Kthxbye:
1/ Generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.
2/ With Alertmanager, generate silences like this:
while true; do
  date  # just for tracing
  amtool --alertmanager.url http://localhost:9093 silence add alertname=PrometheusNotIngestingSamples -a "MemoryLeakLover" -c "Test memory leak in Alertmanager" -d "1m"
  sleep 50
done
Note: Kthxbye behaves similarly, but its default interval is 15 min instead of 1 min. The amtool loop above shows that Kthxbye itself has nothing to do with this bug.
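To confirm the loop behaves as intended (silences are created and then expire), you can query them back with amtool. A quick sanity check, assuming the flag names of amtool 0.21; adjust --alertmanager.url to your setup:

# List active silences matching the test alert
amtool --alertmanager.url http://localhost:9093 silence query alertname=PrometheusNotIngestingSamples
# Expired silences are kept until --data.retention and can be listed too
amtool --alertmanager.url http://localhost:9093 silence query --expired alertname=PrometheusNotIngestingSamples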
What did you expect to see?
Nothing interesting (no abnormal memory increase)
What did you see instead? Under which circumstances?
Follow the metric container_memory_working_set_bytes for the Alertmanager container. After a few hours you can see it slowly but steadily increase.
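If you prefer to watch the growth from the pod itself rather than via cAdvisor, a rough sketch is to poll Alertmanager's own /metrics endpoint; the metric names below (alertmanager_silences per state plus the standard Go and process collectors) are what my v0.21.0 build exposes, but double-check them on yours:

while true; do
  date
  curl -s http://localhost:9093/metrics | grep -E '^(alertmanager_silences|go_memstats_heap_inuse_bytes|process_resident_memory_bytes)'
  sleep 300
done

With the silence-adding loop above, alertmanager_silences{state="expired"} should grow by roughly one per minute, which makes it easy to correlate with the memory curve.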
Here is a screenshot of the amtool test above: it started at 12:20 and finished at 09:00 the next day.
My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days the memory would stop increasing. Wrong guess: it only stops increasing when the container is OOM-killed.
The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).
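To see whether the retained silences themselves are what keeps growing, it can also help to look at the silence snapshot Alertmanager writes in its data directory and at the number of silences the API returns. A minimal sketch, assuming the pod name from the logs below and /alertmanager as the storage path (adjust both to your deployment, and run the curl wherever :9093 is reachable, e.g. via kubectl port-forward):

# Size of the on-disk silence snapshot (a file named "silences" in the data dir)
kubectl -n monitoring exec caascad-alertmanager-0 -- ls -lh /alertmanager/silences
# Rough count of silences currently held (each silence object has one "id" field)
curl -s http://localhost:9093/api/v2/silences | grep -o '"id"' | wc -l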
Environment
- System information:
Kubernetes (deployed with https://github.com/prometheus-community/helm-charts/tree/main/charts/alertmanager)
- Alertmanager version:
/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4
- Alertmanager configuration file:
/alertmanager $ cat /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
receivers:
- name: rocketchat
  webhook_configs:
  - send_resolved: true
    url: https://xxxx.rocketchat.xxxx/hooks/xxxxxx/xxxxxxxxx
route:
  group_by:
  - xxxxxxx
  - yyyyyyy
  - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: rocketchat
  repeat_interval: 5m
  routes:
  - continue: true
    receiver: rocketchat
templates:
- /etc/alertmanager/*.tmpl
- Logs:
➜ k -n monitoring logs caascad-alertmanager-0
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13