linkerd2: Thanos Query cannot connect to Prometheus with Linkerd

Bug Report

What is the issue?

I have deployed Prometheus (https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) and Thanos in my cluster. Without linkerd, everything works fine.

As soon as I enable linkerd for both Prometheus and Thanos, the Query pod stops communicating with Prometheus. linkerd-proxy reports connection errors and stops collecting metrics.

How can it be reproduced?

  • install linkerd
  • install kube-prometheus-stack with Thanos sidecar and linkerd injection enabled at namespace level
  • install Thanos. I’ve used the configuration reported here
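
The second step relies on namespace-level injection, which Linkerd turns on through the `linkerd.io/inject: enabled` annotation on the Namespace object. As a minimal sketch, the script below builds such a manifest. The namespace name `monitoring` and the helper `namespace_manifest` are hypothetical and not taken from this report; the actual reproducer may set the annotation differently.

```python
import json


def namespace_manifest(name: str, inject: bool = True) -> dict:
    """Build a Namespace manifest, optionally annotated for Linkerd injection."""
    # Linkerd's proxy injector looks for this annotation on the namespace
    # (or on individual workloads) when deciding whether to add the sidecar.
    annotations = {"linkerd.io/inject": "enabled"} if inject else {}
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "annotations": annotations},
    }


if __name__ == "__main__":
    # "monitoring" is an assumed namespace name used only for illustration.
    print(json.dumps(namespace_manifest("monitoring"), indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -` before installing kube-prometheus-stack and Thanos into that namespace.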

Logs, error output, etc

Thanos Query’s linkerd-proxy output: https://gist.github.com/irizzant/bfe93d82b293915fc4ffa7b5edb77773

Screenshot taken from linkerd-viz showing the path Thanos Query is trying to reach: [image: screenshot-localhost_50750-2021 09 03-13_27_58]

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
W0903 14:08:03.083583   23673 warnings.go:67] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: k3d v1.21.1+k3s1
  • Cluster Environment: k3d
  • Host OS:
  • Linkerd version:
Client version: stable-2.10.2
Server version: stable-2.10.2

Possible solution

Additional context

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Hey @Pothulapati, you’re right: the error occurs because I overlooked the Thanos chart version in the reproducer repo.

I updated the reproducer repo with the fix; storegateway should now start fine.

@irizzant thanks so much! this should make it much easier to track it down. we’ll take a look