opentelemetry-collector: Prometheus receiver stops scraping all targets when Kubernetes SD targets change or become unreachable

Describe the bug An otel-collector instance running the Prometheus receiver, configured to scrape Prometheus-compatible endpoints discovered via kubernetes_sd_configs, stops scraping when some service-discovery endpoints change or become unreachable (which naturally happens during every deployment and the subsequent rolling restart). The receiver appears to hit a deadlock somewhere while updating the SD target groups.
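
For context, the receiver uses a standard Prometheus scrape config that discovers targets via kubernetes_sd_configs. A minimal sketch of that shape is shown below; the `oap` job name matches the scrape pool in the logs, but the SD role and relabeling rules are illustrative assumptions, and the actual config used to reproduce the issue is the gist linked under Steps to reproduce.

```yaml
# Minimal sketch of a prometheus receiver scraping targets discovered via
# kubernetes_sd_configs. The SD role and relabel rules are illustrative
# assumptions; the real config is in the linked gist.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: oap
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods annotated for scraping (illustrative).
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]
```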

Steps to reproduce otel-collector config: https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml

To trigger the issue, it’s enough to initiate a rolling restart of one of the target deployments. When this happens, the collector debug logs show the following:

{"level":"info","ts":1601986494.9710436,"caller":"service/service.go:252","msg":"Everything is ready. Begin running and processing data."}


{"level":"debug","ts":1601995775.1718767,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.37.173:1234/","err":"Get \"http://10.1.37.173:1234/\": dial tcp 10.1.37.173:1234: connect: connection refused"}
{"level":"warn","ts":1601995775.1720421,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995775171,"target_labels":"map[component:oap instance:10.1.37.173:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995776.6160927,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.7.143:1234/","err":"Get \"http://10.1.7.143:1234/\": dial tcp 10.1.7.143:1234: connect: connection refused"}
{"level":"warn","ts":1601995776.6162364,"caller":"internal/metricsbuilder.go:106","msg":"Failed to scrape Prometheus endpoint","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_timestamp":1601995776615,"target_labels":"map[component:oap instance:10.1.7.143:1234 job:oap plane:management]"}
{"level":"debug","ts":1601995798.0816824,"caller":"scrape/scrape.go:1091","msg":"Scrape failed","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus","scrape_pool":"oap","target":"http://10.1.49.45:1234/","err":"Get \"http://10.1.49.45:1234/\": context deadline exceeded"}
{"level":"debug","ts":1601995824.7997108,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}
{"level":"debug","ts":1601995829.799763,"caller":"discovery/manager.go:245","msg":"Discovery receiver's channel was full so will retry the next cycle","component_kind":"receiver","component_type":"prometheus","component_name":"prometheus"}


(ad infinitum)

After this, all Prometheus receiver scraping stops (or at least the Prometheus exporter endpoint stops updating).

What did you expect to see? The Prometheus receiver gracefully handling some targets becoming unavailable, as well as changes in the service discovery targets.

What did you see instead? Prometheus receiver scraping stops functioning completely.

What version did you use? From /debug/servicez:

GitHash  c8aac9e3
BuildType release
Goversion  go1.14.7
OS  linux
Architecture amd64

What config did you use? Config: (e.g. the yaml config file) https://gist.githubusercontent.com/oktocat/545e12bb8286cd676ccba8318a4095ef/raw/f298a32e235b55af122e92b12ff8ffdb459f6e9c/config.yaml

Environment

Goversion go1.14.7
OS linux
Architecture amd64
Kubernetes 1.17 on EKS

Additional context The issue exists at least in 0.2.7, 0.8.0, 0.10.0 and the latest master.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 32 (20 by maintainers)

Most upvoted comments

FWIW, we’re not observing the deadlocks with otelcol built from master including https://github.com/open-telemetry/opentelemetry-collector/pull/2121

The fix PR has been merged, and it works in our EKS test environment. @oktocat can probably verify whether the issue has been resolved.

I am still facing this issue, even after adding nodes/metrics to the ClusterRole.


2021-03-07T21:31:01.625Z	WARN	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1615152661620, "target_labels": "map[alpha_eksctl_io_cluster_name:c _nodegroup_name:t3-s beta_kubernetes_io_arch:amd64 beta_kubernetes_io_instance_type:t3.small beta_kubernetes_io_os:linux eks_amazonaws_com_capacityType:ON_DEMAND eks_amazonaws_com_nodegroup:t3-small-nodegroup eks_amazonaws_com_nodegroup_image:ami-xx eks_amazonaws_com_sourceLaunchTemplateId:lt-x eks_amazonaws_com_sourceLaunchTemplateVersion:1 failure_domain_beta_kubernetes_io_region:us-east-2 failure_domain_beta_kubernetes_io_zone:us-east-2b instance:ip-xx-yy-zz.us-east-2.compute.internal job:kubernetes-nodes kubernetes_io_arch:amd64 kubernetes_io_hostname:ip-xx-yy-zz-235.us-east-2.compute.internal kubernetes_io_os:linux node_kubernetes_io_instance_type:t3.small topology_kubernetes_io_region:us-east-2 topology_kubernetes_io_zone:us-east-2b]"}

2021-03-07T21:31:03.364Z	WARN	internal/metricsbuilder.go:104	Failed to scrape Prometheus endpoint	{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_timestamp": 1615152663358, "target_labels": "map[instance:adot-collector.adot-col.svc:8888 job:kubernetes-service]"}

Can we reopen the issue?
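
Regarding the ClusterRole mentioned in the comment above: for a kubernetes-nodes job to scrape node metrics, the collector's service account typically needs read access to nodes and the nodes/metrics subresource. The sketch below is illustrative only; the role and service account names are assumptions, and whether RBAC is actually the cause of the failures above is not established in this thread.

```yaml
# Illustrative RBAC sketch for a collector scraping Kubernetes SD targets.
# All names here are assumptions for the example, not taken from this issue.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-prometheus-receiver
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-prometheus-receiver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-prometheus-receiver
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: adot-col
```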

👍 for the issue, we are trying to use this but are getting bitten by the same error.