blackbox_exporter: ICMP probes fail continuously after short DNS outages, until manual restart of the blackbox-exporter container
Host operating system: output of uname -a
Linux a382643a1270 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 GNU/Linux
Docker version:
Docker version 18.09.7, build 2d0083d
OS running docker:
Linux prometheus1 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
blackbox_exporter version: output of blackbox_exporter -version
blackbox_exporter, version 0.16.0 (branch: HEAD, revision: 991f89846ae10db22a3933356a7d196642fcb9a9)
build user: root@64f600555645
build date: 20191111-16:27:24
go version: go1.13.4
Docker image:
prom/blackbox-exporter:v0.16.0
What is the blackbox.yml module config.
modules:
  icmp:
    prober: icmp
  icmp-ip4:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
What is the prometheus.yml scrape config.
- job_name: 'blackbox-ping'
  scrape_interval: 1s
  params:
    module: [icmp-ip4]
  static_configs:
    - targets:
        - 8.8.8.8
      labels:
        blackbox_job: 'ping'
  metrics_path: /probe
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - source_labels: [instance]
      regex: "[192.168|8.8].+"
      target_label: ping_type
      replacement: 'ip'
    - target_label: __address__
      replacement: prometheus1:9115
What logging output did you get from adding &debug=true to the probe URL?
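For reference, the debug output below can be reproduced by requesting the probe endpoint directly; a minimal sketch, assuming the exporter is reachable at prometheus1:9115 as in the scrape config above:

curl 'http://prometheus1:9115/probe?module=icmp-ip4&target=8.8.8.8&debug=true'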
Logs for the probe:
ts=2020-04-01T15:59:41.659129069Z caller=main.go:304 module=icmp target=8.8.8.8 level=info msg="Beginning probe" probe=icmp timeout_seconds=119.5
ts=2020-04-01T15:59:41.659349737Z caller=icmp.go:82 module=icmp target=8.8.8.8 level=info msg="Resolving target address" ip_protocol=ip6
ts=2020-04-01T15:59:41.6594102Z caller=icmp.go:82 module=icmp target=8.8.8.8 level=info msg="Resolved target address" ip=8.8.8.8
ts=2020-04-01T15:59:41.659430967Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Creating socket"
ts=2020-04-01T15:59:41.660190595Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Creating ICMP packet" seq=62367 id=33313
ts=2020-04-01T15:59:41.660223508Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Writing out packet"
ts=2020-04-01T15:59:41.660358188Z caller=main.go:119 module=icmp target=8.8.8.8 level=info msg="Waiting for reply packets"
ts=2020-04-01T16:01:41.159246759Z caller=main.go:119 module=icmp target=8.8.8.8 level=warn msg="Timeout reading from socket" err="read ip4 0.0.0.0: i/o timeout"
ts=2020-04-01T16:01:41.159308704Z caller=main.go:304 module=icmp target=8.8.8.8 level=error msg="Probe failed" duration_seconds=119.500030153
Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.2626e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 119.500030153
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 2.2626e-05
probe_icmp_duration_seconds{phase="rtt"} 0
probe_icmp_duration_seconds{phase="setup"} 0.000792348
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0
Module configuration:
prober: icmp
http:
  ip_protocol_fallback: true
tcp:
  ip_protocol_fallback: true
icmp:
  ip_protocol_fallback: true
dns:
  ip_protocol_fallback: true
What did you do that produced an error?
We run blackbox-exporter inside a Docker container. Suddenly, without any changes to the host machine or the container, the ping probe starts failing for one or more of the targets we are monitoring, while other targets remain OK. When I run the ping tool manually, both inside the Docker container and on the host OS outside the container, it succeeds.
So far we have experienced this behavior for two of our internal IP targets simultaneously (both in the same datacenter) and later just for the 8.8.8.8 target.
I examined the problem with tcpdump, and it shows only request packets (no reply packets):
tcpdump -i eth0 -nn -s0 -X icmp and host 8.8.8.8
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:28:48.734661 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41979, length 36
0x0000: 4500 0038 f40e 4000 4001 8a90 ac11 0005 E..8..@.@.......
0x0010: 0808 0808 0800 7648 8221 a3fb 5072 6f6d ......vH.!..Prom
0x0020: 6574 6865 7573 2042 6c61 636b 626f 7820 etheus.Blackbox.
0x0030: 4578 706f 7274 6572 Exporter
15:28:48.977456 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41982, length 36
0x0000: 4500 0038 f41d 4000 4001 8a81 ac11 0005 E..8..@.@.......
0x0010: 0808 0808 0800 7645 8221 a3fe 5072 6f6d ......vE.!..Prom
0x0020: 6574 6865 7573 2042 6c61 636b 626f 7820 etheus.Blackbox.
0x0030: 4578 706f 7274 6572 Exporter
15:28:49.735066 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41990, length 36
0x0000: 4500 0038 f475 4000 4001 8a29 ac11 0005 E..8.u@.@..)....
0x0010: 0808 0808 0800 763d 8221 a406 5072 6f6d ......v=.!..Prom
0x0020: 6574 6865 7573 2042 6c61 636b 626f 7820 etheus.Blackbox.
0x0030: 4578 706f 7274 6572 Exporter
15:28:50.735053 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 42001, length 36
0x0000: 4500 0038 f497 4000 4001 8a07 ac11 0005 E..8..@.@.......
0x0010: 0808 0808 0800 7632 8221 a411 5072 6f6d ......v2.!..Prom
0x0020: 6574 6865 7573 2042 6c61 636b 626f 7820 etheus.Blackbox.
0x0030: 4578 706f 7274 6572 Exporter
I also checked whether the ID field in the IP header is zero-filled, as discussed in a very similar issue here: https://github.com/prometheus/blackbox_exporter/issues/360, but that is not our case.
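As an aside, the ICMP echo id (distinct from the IP header ID mentioned above) is identical, 33313, in every request in the capture. A quick sketch to confirm this over a longer window, reusing the interface and target from the capture above:

# Count distinct ICMP echo ids seen during one minute of probe traffic.
timeout 60 tcpdump -i eth0 -nn icmp and host 8.8.8.8 2>/dev/null \
  | grep -o 'id [0-9]*' | sort | uniq -c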
The only correlation we found in Grafana is a series of very short connection outages from the blackbox-exporter machine to some of our internal DNS servers (the spikes occur at the same time the probes start failing), monitored with the same blackbox-exporter …
What did you expect to see?
Maybe some failed probes during a potential outage, but then successful probes again.
What did you see instead?
Probes fail continuously, for hours, until I manually restart the Docker container.
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (8 by maintainers)
Thanks for the update. I presume that it's similar for other users, so I'm going to close this.
Update for people who hit the same issue: for us this was caused by a firewall which started blocking ICMP packets with the same ID. This explains why restarting blackbox temporarily fixed the issue.
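This matches the data above: every echo request in the capture carries the same id (33313), and the same id shows up in the debug log half an hour later, so the exporter appears to pick its echo id once at startup and a restart gives it a new one. A rough way to verify after the fix, assuming the container is named blackbox-exporter (the name is an assumption) and reusing the capture command from above:

# The echo id before the restart was 33313 (see the capture above).
docker restart blackbox-exporter
# Capture a single probe packet after the restart; the echo id should now differ.
tcpdump -i eth0 -nn -c 1 icmp and host 8.8.8.8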