prometheus: remote_write doesn't log error if using proxy_url
Proposal
Use case. Why is this important?
It’s important to have log output when something goes wrong. For example, if metrics are not sent to persistent storage, we need to have a log event.
Bug Report
What did you do?
I created a remote_write section in my config with an HTTPS URL, tested it, then added a proxy_url section containing 'http://squid.proxy.that.blocks.requests:3128/'. The proxy is configured to block access to the URL we are using.
What did you expect to see?
I expected to see at least one error log line, saying the remote write failed.
What did you see instead? Under which circumstances?
Metrics did not show up, and no log lines were logged. Circumstances: proxy is blocking requests, as noted above.
Environment
- System information:
  Linux 2.6.32-696.el6.x86_64 x86_64
- Prometheus version:
  prometheus, version 2.10.0 (branch: HEAD, revision: d20e84d0fb64aff2f62a977adc8cfb656da4e286)
  build user: root@a49185acd9b0
  build date: 20190525-12:28:13
  go version: go1.12.5
- Alertmanager version:
  N/A
- Prometheus configuration file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 2s

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'some-job-name'
    static_configs:
    - targets:
      - 'localhost:10002'
      - 'localhost:10000'
      - 'localhost:10001'
      labels:
        environment: 'someenvname'
        host: 'internal-host-fqdn'

remote_write:
- url: 'https://some.cortex.instance/api/prom/push'
  basic_auth:
    username: 'some_user'
    password: 'some_password'
  proxy_url: 'http://squid.proxy.that.blocks.requests:3128/'
- Alertmanager configuration file:
  N/A
- Logs:
  N/A
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (7 by maintainers)
I agree that this is worth reopening. Dedupe logging is already used in remote storage, so spamming the log should not be an issue.
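For context, here is a minimal sketch in Go of what dedupe logging means in practice: identical messages are suppressed for a fixed window, so a persistent failure produces one line per window instead of one per retry. This is an illustration of the technique, not Prometheus's actual implementation; all names are hypothetical.

package main

import (
	"log"
	"sync"
	"time"
)

// dedupeLogger suppresses repeats of the same message within a window.
type dedupeLogger struct {
	mu     sync.Mutex
	window time.Duration
	seen   map[string]time.Time // message -> last time it was actually emitted
}

func newDedupeLogger(window time.Duration) *dedupeLogger {
	return &dedupeLogger{window: window, seen: make(map[string]time.Time)}
}

func (d *dedupeLogger) Log(msg string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if last, ok := d.seen[msg]; ok && time.Since(last) < d.window {
		return // duplicate within the window: suppress it
	}
	d.seen[msg] = time.Now()
	log.Println(msg)
}

func main() {
	logger := newDedupeLogger(time.Minute)
	for i := 0; i < 10; i++ {
		// Only the first call writes a line; the rest are deduped.
		logger.Log("remote write failed: proxy refused connection")
	}
}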
This has been a problem for us when deploying Prometheus into production in a few environments due to network ACLs and whitelisting. In these cases, it is not a transient issue.
We have centralized logging as existing infrastructure, but even without that I believe the idea that metrics are a full substitute for logs is problematic. Logging is the standard way to report the details of a problem event, and the existing metrics are not useful for tracking down root cause (was it a timeout, a failed connection, an invalid cert, …?). Using the Alertmanager pushes a simple problem onto the customer and requires them to integrate even more components with a system that is known to be malfunctioning because fundamental assumptions/requirements are broken; this is exactly when the system should drop back to a minimal set of assumptions to communicate its state and what the operator needs to fix.
journald provides support for regulating log disk usage; if that is unavailable, I believe it's pretty reasonable to assume that the operator already has to deal with the same problem (log spam) for other systems and might be using logrotate. Prometheus could also decide not to log more than at a certain rate for these things, as sketched below.
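A minimal sketch of that rate-limiting idea in Go, using golang.org/x/time/rate. This is illustrative only, not Prometheus code; the message and interval are made up:

package main

import (
	"log"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// At most one log line every 30 seconds, no matter how often the
	// failing send path reaches this code.
	limited := rate.NewLimiter(rate.Every(30*time.Second), 1)

	for i := 0; i < 120; i++ {
		if limited.Allow() {
			log.Println("remote write failed: proxy refused connection")
		}
		time.Sleep(time.Second)
	}
}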
Is this issue still relevant? I would like to work on this. @brian-brazil
As someone who's running Prometheus via systemd, logging to the journal and then mirroring it off to centralized logging (fluentd/elasticsearch), this is kind of a big deal for us as well.
If there is a misconfiguration and Prometheus cannot send metrics to a configured destination, then it really should log an error, not just increment an internal metric.
Without the proxy_url setting, if Prometheus cannot write metrics to the remote storage, it will log an error. So why wouldn't you want it to log an error when a misconfiguration effectively ends up with the same result? I would love to see this issue re-opened and actually fixed.
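To underline the point: in Go, a blocked proxy surfaces as an ordinary error on the HTTP client, so the remote-write path has a concrete error it could log. A self-contained sketch under that assumption (not Prometheus source; the request body and the logging call are placeholders, and the URLs are the ones from the report):

package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Route the client through the blocking proxy from the report.
	proxy, err := url.Parse("http://squid.proxy.that.blocks.requests:3128/")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
	}

	// Placeholder body; a real sender would post a snappy-compressed
	// protobuf write request.
	resp, err := client.Post(
		"https://some.cortex.instance/api/prom/push",
		"application/x-protobuf",
		strings.NewReader("placeholder"),
	)
	if err != nil {
		// This is the error the reporter expected to see in the logs:
		// the proxy failure comes back to the caller like any other
		// transport error, so it is available to log.
		log.Printf("remote write failed: %v", err)
		return
	}
	resp.Body.Close()
}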