
Memory leak when repeating silences #2659

Open
@ngc104

Description


What did you do?

I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.

You can reproduce this without Kthxbye:

1/ Generate an alert (or use any alert already sent by Prometheus), for example PrometheusNotIngestingSamples.

2/ With Alertmanager, generate silences like this:

while true; do
  date # useless; just for tracing...
  amtool --alertmanager.url http://localhost:9093 silence add alertname=PrometheusNotIngestingSamples -a "MemoryLeakLover" -c "Test memory leak in Alertmanager" -d "1m"
  sleep 50
done

Note: the behaviour of Kthxbye is similar, but its default interval is 15 minutes instead of 1 minute. Reproducing with amtool alone shows that Kthxbye itself has nothing to do with this bug.
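
In case it helps reproduce this from scratch, here is a rough sketch of how the growth can be watched while the loop above runs. It assumes your amtool version has the alert add subcommand (so no real Prometheus alert is needed) and that Alertmanager exposes the standard Go process metrics on /metrics; the alert labels are just examples.

# Optional: create a synthetic alert instead of waiting for Prometheus
# (assumes your amtool version has the "alert add" subcommand).
amtool --alertmanager.url http://localhost:9093 alert add \
  alertname=PrometheusNotIngestingSamples severity=warning

# Watch Alertmanager's own memory and silence-related metrics while the
# silence loop runs (metric names may vary slightly between versions).
while true; do
  date
  curl -s http://localhost:9093/metrics \
    | grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes|alertmanager_silences)'
  sleep 300
done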

What did you expect to see?

Nothing interesting (no abnormal memory increase)

What did you see instead? Under which circumstances?

Follow the metric container_memory_working_set_bytes for Alertmanager. After a few hours you can see it slowly but steadily grow.

Here is a screenshot of the above test, covering a little more than 12 hours: the test started at 12:20 and finished at 9:00 the next day.

[screenshot: container_memory_working_set_bytes for Alertmanager, growing steadily over the test period]
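
For anyone wanting to reproduce the graph, a query along these lines against the Prometheus that scrapes cAdvisor should show the same trend (the URL and label selectors below are placeholders for my setup):

# Query the working set of the Alertmanager container from Prometheus
# (URL and label selectors are placeholders; adjust for your cluster).
curl -s 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{container="alertmanager"}'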

My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days the memory usage would stop increasing. Wrong guess: it only stops increasing when the container is OOM-killed.

[screenshot: Alertmanager memory usage over several days; it only drops when the pod restarts]
The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).
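
To see what is actually accumulating, counting the silences that Alertmanager still keeps after they expire is a quick check. The sketch below assumes the --expired flag of amtool silence query and the v2 API; jq is only used for convenience.

# Expired silences still known to Alertmanager (output includes a header line).
amtool --alertmanager.url http://localhost:9093 silence query --expired | wc -l

# Same check through the HTTP API, counting silences by state.
curl -s http://localhost:9093/api/v2/silences \
  | jq '[.[] | select(.status.state == "expired")] | length'

If that count keeps growing well past the 120h retention, the expired silences themselves are a good suspect.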

Environment

/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4
• Alertmanager configuration file:
/alertmanager $ cat /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
receivers:
- name: rocketchat
  webhook_configs:
  - send_resolved: true
    url: https://xxxx.rocketchat.xxxx/hooks/xxxxxx/xxxxxxxxx
route:
  group_by:
  - xxxxxxx
  - yyyyyyy
  - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: rocketchat
  repeat_interval: 5m
  routes:
  - continue: true
    receiver: rocketchat
templates:
- /etc/alertmanager/*.tmpl

  • Logs:
➜ k -n monitoring logs caascad-alertmanager-0 
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13
