opentelemetry-collector: Prometheus receiver stops scraping all targets when Kubernetes SD targets change or become unreachable
Describe the bug
otel-collector running with the Prometheus receiver configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs stops scraping when some service discovery endpoints change or become unreachable (which naturally happens during every deployment and subsequent rolling restart). The receiver appears to deadlock somewhere while updating the SD target groups.
Steps to reproduce
otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml
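For context, a minimal sketch of this kind of setup (the job name matches the scrape_pool in the logs below; the SD role, relabel rule, and exporter endpoint are assumptions, the full config is in the gist above):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: oap                # matches the scrape_pool seen in the logs
          scrape_interval: 15s         # interval is an assumption
          kubernetes_sd_configs:
            - role: pod                # discover targets via the Kubernetes API
          relabel_configs:
            # keep only pods that opt in via a hypothetical scrape annotation
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"           # re-exposes the scraped metrics locally

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]
```

A rolling restart of a scraped deployment replaces its pods, so the discovered target set changes and some targets briefly refuse connections, which is the condition that triggers the behavior below.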
To trigger the issue, it’s enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:
{"level":"info","ts":1601986494.9710436,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}
{"level":"debug","ts":1601995775.1718767,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.37.173:1234/","err":"Get \"http://10.1.37.173:1234/\": dial tcp 10.1.37.173:1234: connect: connection refused"}
{"level":"warn","ts":1601995775.1720421,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995775171,"target_labels":"map[component:oap instance:10.1.37.173:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995776.6160927,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.7.143:1234/","err":"Get \"http://10.1.7.143:1234/\": dial tcp 10.1.7.143:1234: connect: connection refused"}
{"level":"warn","ts":1601995776.6162364,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995776615,"target_labels":"map[component:oap instance:10.1.7.143:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995798.0816824,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.49.45:1234/","err":"Get \"http://10.1.49.45:1234/\": context deadline exceeded"}
{"level":"debug","ts":1601995824.7997108,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
{"level":"debug","ts":1601995829.799763,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
(ad infinitum)
After this, all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint stops updating).
What did you expect to see? The Prometheus receiver gracefully handling some targets becoming unavailable, as well as changes in the service discovery targets.
What did you see instead? Prometheus receiver scraping stops functioning completely.
What version did you use?
From /debug/servicez:
GitHash: c8aac9e3
BuildType: release
Goversion: go1.14.7
OS: linux
Architecture: amd64
What config did you use?
https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml

Environment
Goversion: go1.14.7
OS: linux
Architecture: amd64
Kubernetes: 1.17 on EKS
Additional context
The issue exists at least in 0.2.7, 0.8.0, 0.10.0, and the latest master.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 32 (20 by maintainers)
Commits related to this issue
- Fix the scraper/discover manager coordination on the Prometheus receiver (#2089) * Fix the scraper/discover manager coordination on the Prometheus receiver The receiver contains various unnecessar... — committed to open-telemetry/opentelemetry-collector by rakyll 4 years ago
- Bump go.uber.org/zap in /examples/prometheus-federation/prom-counter (#1909) Bumps [go.uber.org/zap](https://github.com/uber-go/zap) from 1.21.0 to 1.23.0. - [Release notes](https://github.com/uber-... — committed to hughesjj/opentelemetry-collector by dependabot[bot] 2 years ago
FWIW, we’re not observing the deadlocks with otelcol built from master including https://github.com/open-telemetry/opentelemetry-collector/pull/2121
The fix PR has been merged, and it works in our EKS test environment. @oktocat can probably verify whether the issue has been resolved.
I am still facing this issue, even after adding nodes/metrics in the ClusterRole. Can we reopen the issue?
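For reference, a minimal ClusterRole sketch of the kind the comment above describes (the role name and the full resource list are assumptions; nodes/metrics is the subresource mentioned):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector               # name is an assumption
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints", "services"]
    verbs: ["get", "list", "watch"]  # needed for kubernetes_sd_configs discovery
  - apiGroups: [""]
    resources: ["nodes/metrics"]     # the subresource referenced in the comment above
    verbs: ["get"]
```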
👍 for the issue, we are trying to use this but we are getting bit by the same error.