prometheus: Fail to send alerts to alertmanager due to EOF

What did you do?

We configured multiple (1,000+) Prometheus instances to send alerts to a single Alertmanager. At first everything worked fine, but after a few minutes some Prometheus instances failed to send alerts with the following error:

level=error ts=2021-07-01T12:33:16.623Z caller=notifier.go:527 component=notifier alertmanager=https://alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"https://alertmanager/api/v2/alerts\": EOF"

What did you expect to see?

All Prometheus instances send alerts to Alertmanager successfully.

What did you see instead? Under which circumstances?

Some Prometheus instances failed to send alerts with the following error:

level=error ts=2021-07-01T12:33:16.623Z caller=notifier.go:527 component=notifier alertmanager=https://alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"https://alertmanager/api/v2/alerts\": EOF"

Environment

Prometheus v2.4.2, Alertmanager 0.21.0

Checking the source code for the client HTTP transport settings, we found that each Prometheus keeps a large number of idle connections (20,000 in total and 1,000 per host) for sending alerts to Alertmanager.

See: https://github.com/prometheus/common/blob/a1b6ede20323252d2b99a0f57178a4b7d364d0ca/config/http_config.go#L370-L380
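For context, the transport built there looks roughly like the sketch below. This is a simplified approximation written from the values quoted above, not the exact code from prometheus/common:

```go
// Simplified approximation of the HTTP transport built in
// prometheus/common config/http_config.go (see the link above).
// The idle-connection limits are hard-coded rather than configurable.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func newNotifierTransport() *http.Transport {
	return &http.Transport{
		MaxIdleConns:        20000, // total idle connections kept open by one Prometheus
		MaxIdleConnsPerHost: 1000,  // idle connections kept per Alertmanager host
		DisableKeepAlives:   false,
		TLSHandshakeTimeout: 10 * time.Second,
	}
}

func main() {
	t := newNotifierTransport()
	fmt.Println("MaxIdleConns:", t.MaxIdleConns, "MaxIdleConnsPerHost:", t.MaxIdleConnsPerHost)
}
```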

I wonder why these settings are hard-coded; they should be customizable by end users. In our case, with 1,000+ Prometheus instances sending alerts to one Alertmanager, the number of client connections seen by the upstream Alertmanager is multiplied by 1,000.

If the upstream applies any rate limiting, the connections will be closed. In any case, the number of connections opened by each Prometheus instance should be configurable.
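As an illustration only: a user-settable override might look something like the sketch below. The AlertmanagerClientConfig type and its option names are hypothetical and do not exist in Prometheus or prometheus/common today.

```go
// Hypothetical sketch: exposing the idle-connection limits as user-settable
// options instead of hard-coding them. The type and field names below are
// invented for illustration.
package main

import (
	"fmt"
	"net/http"
)

// AlertmanagerClientConfig is a hypothetical per-Alertmanager client config.
type AlertmanagerClientConfig struct {
	MaxIdleConns        int `yaml:"max_idle_conns,omitempty"`
	MaxIdleConnsPerHost int `yaml:"max_idle_conns_per_host,omitempty"`
}

func newTransport(cfg AlertmanagerClientConfig) *http.Transport {
	t := &http.Transport{
		MaxIdleConns:        20000, // current hard-coded defaults
		MaxIdleConnsPerHost: 1000,
	}
	if cfg.MaxIdleConns > 0 {
		t.MaxIdleConns = cfg.MaxIdleConns
	}
	if cfg.MaxIdleConnsPerHost > 0 {
		t.MaxIdleConnsPerHost = cfg.MaxIdleConnsPerHost
	}
	return t
}

func main() {
	// With 1,000+ Prometheus instances, a much smaller per-host limit keeps
	// the total connection count on the Alertmanager side manageable.
	t := newTransport(AlertmanagerClientConfig{MaxIdleConns: 10, MaxIdleConnsPerHost: 2})
	fmt.Println(t.MaxIdleConns, t.MaxIdleConnsPerHost)
}
```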

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 19 (7 by maintainers)

Most upvoted comments

I have the same problem. Is the configuration file or startup command incorrect? See https://github.com/prometheus/prometheus/issues/9176 for the details. I set up a Prometheus and Alertmanager environment without any code changes. At the beginning both Prometheus and Alertmanager run normally and alerting works, but the error appears after a short while. Please suggest some solutions.

Prometheus should not use that many connections. I suggest you update Prometheus and Alertmanager to the latest releases to benefit from all the bug fixes, including the Go bug fixes. Please also check the file descriptor limit of each process.
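As a quick standalone check (not part of Prometheus itself), the open-file-descriptor limit of a process can be inspected with getrlimit; on Linux the same information is also in /proc/<pid>/limits:

```go
// Standalone sketch (Linux/Unix): print the open file descriptor limit,
// which caps how many simultaneous connections a process can hold.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("file descriptors: soft limit=%d, hard limit=%d\n", rl.Cur, rl.Max)
}
```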
