opentelemetry-collector: Prometheus receiver stops scraping all targets when Kubernetes SD targets change or become unreachable
Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service discovery endpoints change or become unreachable (which naturally happens during every deployment and subsequent rolling restart). The receiver appears to deadlock somewhere while updating the SD target groups.
Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
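For context, a minimal sketch of this kind of setup (the job name matches the scrape_pool in the logs below; the SD role, relabel rule, and exporter endpoint are assumptions, the full config is in the gist above):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: oap                # matches the scrape_pool seen in the logs
          scrape_interval: 15s         # interval is an assumption
          kubernetes_sd_configs:
            - role: pod                # discover targets via the Kubernetes API
          relabel_configs:
            # keep only pods that opt in via a hypothetical scrape annotation
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"           # re-exposes the scraped metrics locally

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]
```

A rolling restart of a scraped deployment replaces its pods, so the discovered target set changes and some targets briefly refuse connections, which is the condition that triggers the behavior below.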
To trigger the issue, it’s enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:
{"level":"info","ts":1601986494.9710436,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}
{"level":"debug","ts":1601995775.1718767,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.37.173:1234/","err":"Get \"http://10.1.37.173:1234/\": dial tcp 10.1.37.173:1234: connect: connection refused"}
{"level":"warn","ts":1601995775.1720421,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995775171,"target_labels":"map[component:oap instance:10.1.37.173:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995776.6160927,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.7.143:1234/","err":"Get \"http://10.1.7.143:1234/\": dial tcp 10.1.7.143:1234: connect: connection refused"}
{"level":"warn","ts":1601995776.6162364,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995776615,"target_labels":"map[component:oap instance:10.1.7.143:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995798.0816824,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.49.45:1234/","err":"Get \"http://10.1.49.45:1234/\": context deadline exceeded"}
{"level":"debug","ts":1601995824.7997108,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
{"level":"debug","ts":1601995829.799763,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
(ad infinitum)
After this, all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint stops updating).
What did you expect to see? The Prometheus receiver gracefully handling some targets becoming unavailable, as well as changes in the service discovery targets.
What did you see instead? Prometheus receiver scraping stops functioning completely.
What version did you use?
From /debug/servicez:
GitHash: c8aac9e3
BuildType: release
Goversion: go1.14.7
OS: linux
Architecture: amd64
What config did you use?
https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml

Environment
Goversion: go1.14.7
OS: linux
Architecture: amd64
Kubernetes: 1.17 on EKS
Additional context
The issue exists at least in 0.2.7, 0.8.0, 0.10.0, and the latest master.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 32 (20 by maintainers)
Commits related to this issue
- Fix the scraper/discover manager coordination on the Prometheus receiver (#2089) * Fix the scraper/discover manager coordination on the Prometheus receiver The receiver contains various unnecessar... — committed to open-telemetry/opentelemetry-collector by rakyll 4 years ago
- Bump go.uber.org/zap in /examples/prometheus-federation/prom-counter (#1909) Bumps [go.uber.org/zap](https://github.com/uber-go/zap) from 1.21.0 to 1.23.0. - [Release notes](https://github.com/uber-... — committed to hughesjj/opentelemetry-collector by dependabot[bot] 2 years ago
FWIW, we’re not observing the deadlocks with otelcol built from master including https://github.com/open-telemetry/opentelemetry-collector/pull/2121
The fix PR has been merged, and it works in our EKS test environment. @oktocat can probably verify whether the issue has been resolved.
I am still facing this issue, even after adding nodes/metrics in the ClusterRole. Can we reopen the issue?
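For reference, a minimal ClusterRole sketch of the kind the comment above describes (the role name and the full resource list are assumptions; nodes/metrics is the subresource mentioned):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector               # name is an assumption
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints", "services"]
    verbs: ["get", "list", "watch"]  # needed for kubernetes_sd_configs discovery
  - apiGroups: [""]
    resources: ["nodes/metrics"]     # the subresource referenced in the comment above
    verbs: ["get"]
```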
👍 for the issue, we are trying to use this but we are getting bit by the same error.