alertmanager: Memory leak when repeating silences

What did you do?

I’m using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.

You can reproduce this without Kthxbye:

1/ generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.

2/ With Alertmanager, generate silences in a loop like this:

while true; do
  # Add a fresh 1-minute silence every 50 seconds, so the alert stays silenced continuously
  date # useless; just for tracing...
  amtool --alertmanager.url http://localhost:9093 silence add alertname=PrometheusNotIngestingSamples -a "MemoryLeakLover" -c "Test memory leak in Alertmanager" -d "1m"
  sleep 50
done

Note: Kthxbye behaves the same way, but its default silence duration is 15 min instead of 1 min. The amtool reproduction above shows that Kthxbye itself has nothing to do with this bug.
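To correlate the memory growth with silence accumulation, it can help to watch how many silences Alertmanager is holding while the loop runs. A minimal sketch, assuming Alertmanager listens on http://localhost:9093 and jq is installed (the URL and the 5-minute sampling interval are just illustrative):

while true; do
  # List all silences via the v2 API and count them per state
  # (expired silences are kept around until --data.retention elapses).
  curl -s http://localhost:9093/api/v2/silences \
    | jq 'group_by(.status.state) | map({state: .[0].status.state, count: length})'
  sleep 300
done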

What did you expect to see?

Nothing interesting (no abnormal memory increase)

What did you see instead? Under which circumstances?

Watch the container_memory_working_set_bytes metric for the Alertmanager container. After a few hours you can see it slowly but steadily grow.
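If you prefer the command line to a dashboard, here is a rough way to sample the same metric; a sketch assuming a Prometheus server at http://prometheus:9090 that scrapes cAdvisor metrics from the kubelet (the URL and the label values are specific to my setup and may differ in yours):

# Current working-set memory of the Alertmanager container
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="monitoring",container="alertmanager"}' \
  | jq '.data.result[] | {pod: .metric.pod, bytes: .value[1]}'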

Here is a screenshot of the above test, covering a little more than 12 hours: the test started at 12:20 and finished at 09:00 the day after.

[screenshot: Alertmanager memory usage during the test]

My Alertmanager is running with the default --data.retention=120h, so I guessed that after 5 days the memory would stop increasing. Wrong guess: it only stops increasing when the container is OOM-killed and restarted.
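To check how many silences Alertmanager is still retaining (expired silences are kept until --data.retention), you can count them with amtool; a sketch, again pointed at the default local URL (note that the output includes a header line):

# Active silences
amtool --alertmanager.url http://localhost:9093 silence query | wc -l
# Expired silences that Alertmanager still retains
amtool --alertmanager.url http://localhost:9093 silence query --expired | wc -l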

[screenshot: Alertmanager memory usage across pod restarts]
The graph above was made with Kthxbye running. The pod restarts after an OOM kill (left side) or after a kubectl delete pod (right side).
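For completeness, the OOM kill can be confirmed directly on the pod; a sketch using the same namespace and pod name as in the logs below (the expected reason is OOMKilled):

# Why was the previous container instance terminated?
kubectl -n monitoring get pod caascad-alertmanager-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'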

Environment

/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4

  • Alertmanager configuration file:
/alertmanager $ cat /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
receivers:
- name: rocketchat
  webhook_configs:
  - send_resolved: true
    url: https://xxxx.rocketchat.xxxx/hooks/xxxxxx/xxxxxxxxx
route:
  group_by:
  - xxxxxxx
  - yyyyyyy
  - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: rocketchat
  repeat_interval: 5m
  routes:
  - continue: true
    receiver: rocketchat
templates:
- /etc/alertmanager/*.tmpl

  • Logs:
➜ k -n monitoring logs caascad-alertmanager-0 
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 31 (9 by maintainers)

Most upvoted comments

All tests were done with the latest version available at the time.

When I created the issue in 2021, it was tested with 0.21.0. The latest test, on June 13th 2023, was with 0.25.0.

Every time a new version was released, I tested again in the hope that the problem had been solved. With 0.25 it is still not solved.