prometheus: remote_write doesn't log error if using proxy_url
Proposal
Use case. Why is this important?
It’s important to have log output when something goes wrong. For example, if metrics are not sent to persistent storage, we need to have a log event.
Bug Report
What did you do?
I created a remote_write section in my config with an HTTPS URL, tested it, then added a proxy_url section containing 'http://squid.proxy.that.blocks.requests:3128/'. The proxy is configured to block access to the URL we are using.
What did you expect to see?
I expected to see at least one error log line, saying the remote write failed.
What did you see instead? Under which circumstances?
Metrics did not show up, and no log lines were logged. Circumstances: proxy is blocking requests, as noted above.
Environment
- System information:
  Linux 2.6.32-696.el6.x86_64 x86_64
- Prometheus version:
  prometheus, version 2.10.0 (branch: HEAD, revision: d20e84d0fb64aff2f62a977adc8cfb656da4e286)
  build user: root@a49185acd9b0
  build date: 20190525-12:28:13
  go version: go1.12.5
- Alertmanager version:
  N/A
- Prometheus configuration file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 2s

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'some-job-name'
    static_configs:
    - targets:
      - 'localhost:10002'
      - 'localhost:10000'
      - 'localhost:10001'
      labels:
        environment: 'someenvname'
        host: 'internal-host-fqdn'

remote_write:
- url: 'https://some.cortex.instance/api/prom/push'
  basic_auth:
    username: 'some_user'
    password: 'some_password'
  proxy_url: 'http://squid.proxy.that.blocks.requests:3128/'
- Alertmanager configuration file:
  N/A
- Logs:
  N/A
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (7 by maintainers)
I agree that this is worth reopening. Dedupe logging is already used in remote storage, so spamming the log should not be an issue.
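For context, here is a minimal sketch in Go of what dedupe logging means in practice: identical messages are suppressed for a fixed window, so a persistent failure produces one line per window instead of one per retry. This is an illustration of the technique, not Prometheus's actual implementation; all names are hypothetical.

package main

import (
	"log"
	"sync"
	"time"
)

// dedupeLogger suppresses repeats of the same message within a window.
type dedupeLogger struct {
	mu     sync.Mutex
	window time.Duration
	seen   map[string]time.Time // message -> last time it was actually emitted
}

func newDedupeLogger(window time.Duration) *dedupeLogger {
	return &dedupeLogger{window: window, seen: make(map[string]time.Time)}
}

func (d *dedupeLogger) Log(msg string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if last, ok := d.seen[msg]; ok && time.Since(last) < d.window {
		return // duplicate within the window: suppress it
	}
	d.seen[msg] = time.Now()
	log.Println(msg)
}

func main() {
	logger := newDedupeLogger(time.Minute)
	for i := 0; i < 10; i++ {
		// Only the first call writes a line; the rest are deduped.
		logger.Log("remote write failed: proxy refused connection")
	}
}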
This has been a problem for us when deploying Prometheus into production in a few environments due to network ACLs and whitelisting. In these cases, it is not a transient issue.
We have centralized logging as existing infrastructure, but even without that I believe the idea that metrics are a full substitute for logs is problematic. Logging is the standard way to report the details of a problem event, and the existing metrics are not useful for tracking down root cause (was it a timeout, a failed connection, an invalid cert, …?). Using the Alertmanager pushes a simple problem onto the customer and requires them to integrate even more components with a system that is known to be malfunctioning because fundamental assumptions/requirements are broken; this is exactly when the system should drop back to a minimal set of assumptions to communicate its state and what the operator needs to fix.
journald provides support for regulating log disk usage; if that is unavailable, I believe it's pretty reasonable to assume that the operator already has to deal with the same problem (log spam) for other systems and might be using logrotate. Prometheus could also decide not to log more than at a certain rate for these things, as sketched below.
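A minimal sketch of that rate-limiting idea in Go, using golang.org/x/time/rate. This is illustrative only, not Prometheus code; the message and interval are made up:

package main

import (
	"log"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// At most one log line every 30 seconds, no matter how often the
	// failing send path reaches this code.
	limited := rate.NewLimiter(rate.Every(30*time.Second), 1)

	for i := 0; i < 120; i++ {
		if limited.Allow() {
			log.Println("remote write failed: proxy refused connection")
		}
		time.Sleep(time.Second)
	}
}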
Is this issue still relevant? I would like to work on this. @brian-brazil
As someone who's running Prometheus via systemd, logging to the journal and then mirroring it off to centralized logging (fluentd/elasticsearch), this is kind of a big deal for us as well.
If there is a misconfiguration and Prometheus cannot send metrics to a configured destination, then it really should log an error, not just increment an internal metric.
Without the proxy_url setting, if Prometheus cannot write metrics to the remote storage, it will log an error. So why wouldn't you want it to log an error when a misconfiguration effectively ends up with the same result? I would love to see this issue re-opened and actually fixed.
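To underline the point: in Go, a blocked proxy surfaces as an ordinary error on the HTTP client, so the remote-write path has a concrete error it could log. A self-contained sketch under that assumption (not Prometheus source; the request body and the logging call are placeholders, and the URLs are the ones from the report):

package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Route the client through the blocking proxy from the report.
	proxy, err := url.Parse("http://squid.proxy.that.blocks.requests:3128/")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
	}

	// Placeholder body; a real sender would post a snappy-compressed
	// protobuf write request.
	resp, err := client.Post(
		"https://some.cortex.instance/api/prom/push",
		"application/x-protobuf",
		strings.NewReader("placeholder"),
	)
	if err != nil {
		// This is the error the reporter expected to see in the logs:
		// the proxy failure comes back to the caller like any other
		// transport error, so it is available to log.
		log.Printf("remote write failed: %v", err)
		return
	}
	resp.Body.Close()
}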