linkerd2: All linkerd-viz pods in CrashLoopBackOff

Bug Report

What is the issue?

All of the linkerd-viz pods are stuck in a restart loop, flipping between CrashLoopBackOff and Running.

root@hel1-k1:~/linkerd-demo# kubectl get pods -n linkerd-viz
NAME                            READY   STATUS             RESTARTS         AGE
grafana-8d54d5f6d-wwfhm         0/2     CrashLoopBackOff   26 (42s ago)     32m
metrics-api-dd848c7c4-dpdvj     0/2     CrashLoopBackOff   33 (4m42s ago)   48m
prometheus-7bbc4d8c5b-4c4ts     0/2     Running            34 (19s ago)     48m
tap-f77b59d5b-vvglz             0/2     CrashLoopBackOff   34 (4m33s ago)   48m
tap-injector-68f5c5bc46-c4xqg   0/2     CrashLoopBackOff   34 (2m37s ago)   48m
web-85bb987c55-wps72            0/2     CrashLoopBackOff   35 (5m7s ago)    48m

How can it be reproduced?

The cluster was initialised with:

kubeadm init --control-plane-endpoint=api.helsinki-cluster.foo-domain.com --pod-network-cidr=10.210.0.0/16 --apiserver-advertise-address=10.10.0.2 --service-cidr=10.211.0.0/18 --service-dns-domain="helsinki-cluster.foo-domain.com"
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

Linkerd with:

linkerd install --cluster-domain=helsinki-cluster.foo-domain.com --identity-trust-domain=helsinki-cluster.foo-domain.com | kubectl apply -f -

And linkerd-viz with:

linkerd viz install --set clusterDomain="helsinki-cluster.foo-domain.com" | k apply -f -

Logs, error output, etc

From describe pod

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 07 Nov 2021 17:50:13 +0000
      Finished:     Sun, 07 Nov 2021 17:50:46 +0000
    Ready:          False
    Restart Count:  24
    Liveness:       http-get http://:9995/ping delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:9995/ready delay=0s timeout=1s period=10s #success=1 #failure=7

... snip

Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  46m (x17 over 76m)    kubelet  Readiness probe failed: Get "http://10.210.165.97:9995/ready": dial tcp 10.210.165.97:9995: connect: connection refused
  Warning  Unhealthy  11m (x40 over 76m)    kubelet  Readiness probe failed: Get "http://10.210.165.97:4191/ready": dial tcp 10.210.165.97:4191: connect: connection refused
  Warning  BackOff    104s (x279 over 71m)  kubelet  Back-off restarting failed container

From the logs of linkerd-proxy

root@hel1-k1:~/linkerd-demo# k logs -n linkerd-viz metrics-api-dd848c7c4-dpdvj linkerd-proxy
time="2021-11-07T17:21:53Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2021-11-07T17:21:53Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[     0.000775s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.002281s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.002328s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.002343s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.002356s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.002460s]  INFO ThreadId(01) linkerd2_proxy: Local identity is metrics-api.linkerd-viz.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com
[     0.002484s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.helsinki-cluster.foo-domain.com:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com)
[     0.002498s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.helsinki-cluster.foo-domain.com:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com)
[     0.032970s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: metrics-api.linkerd-viz.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com
[     1.951912s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37016
[     1.952035s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37016}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[     1.952805s]  INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43652
[     1.952891s]  INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43652}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[     2.970393s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37062
[     2.970459s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37062}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[     2.977567s]  INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43694
[     2.977810s]  INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43694}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[     3.971149s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37082
[     3.971324s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37082}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[     3.971814s]  INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43718
[     3.971857s]  INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43718}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[     9.610323s]  INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43908
[     9.610365s]  INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43908}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[     9.610662s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37272
[     9.610769s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37272}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[     9.613311s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37276
[     9.613567s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37276}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin

linkerd check output

Please note that it hangs on the "Running viz extension check" step.

root@hel1-k1:~/linkerd-demo# linkerd check
Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Linkerd extensions checks
=========================
/ Running viz extension check

Environment

  • Kubernetes Version: v1.22.3
  • Cluster Environment: kubeadm-built; the host has both a public and a private interface; Calico is used for the network overlay.
  • Host OS: Ubuntu 20.04.3
  • Linkerd version: stable-2.11.1

Possible solution

The linkerd-proxy containers are behaving correctly in other namespaces, so this issue seems specific to linkerd-viz at the moment. I am curious why the server's public IP address appears in the logs, e.g. daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37082}. I would have expected to see the private address, or the host's address within the node's IP block, which in this case is 10.210.165.64/26:

root@hel1-k1:~/linkerd-demo# calicoctl ipam show --show-blocks
+----------+------------------+-----------+------------+--------------+
| GROUPING |       CIDR       | IPS TOTAL | IPS IN USE |   IPS FREE   |
+----------+------------------+-----------+------------+--------------+
| IP Pool  | 10.210.0.0/16    |     65536 | 18 (0%)    | 65518 (100%) |
| Block    | 10.210.165.64/26 |        64 | 18 (28%)   | 46 (72%)     |
+----------+------------------+-----------+------------+--------------+

I’m not sure of the real cause of the problem though, and would appreciate any help in figuring it out 😃

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 16 (6 by maintainers)

Most upvoted comments

Just adding a summary of the patch mentioned before, in case you are a slow learner like me.

Step 0

Check if there’s a proper fix mentioned in other comments below. That way you don’t even need to run this.

Step 1

Run kubectl get ServerAuthorization -n linkerd-viz to get a list of the ServerAuthorization resources.

Step 2

Then edit each one with kubectl edit ServerAuthorization -n linkerd-viz <name> (or see the non-interactive patch sketch after step 3).

Add the following below the client: key, as shown in the previous comments:

networks:
- cidr: 0.0.0.0/0

Btw not all the resources have a client section. In that case, don’t modify them.

Step 3

Restart your pods and it should be working. At least that worked for me.

kubectl -n linkerd-viz rollout restart deploy
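
If you'd rather not edit each resource interactively in step 2, the same change can be applied with kubectl patch. This is only a sketch: metrics-api below is an example resource name, and the actual names come from step 1.

# For each ServerAuthorization that has a spec.client section, merge in a
# permissive networks list. "metrics-api" is an example resource name only.
kubectl -n linkerd-viz patch serverauthorization metrics-api \
  --type merge \
  -p '{"spec":{"client":{"networks":[{"cidr":"0.0.0.0/0"}]}}}'

Then restart the deployments as in step 3 so the proxies pick up the new policy.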

For the lazy ones: apply this policy to the linkerd-viz namespace (see the apply/restart commands after the manifest).

apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: yolo
  namespace: linkerd-viz
spec:
  client:
    unauthenticated: true
    networks:
    - cidr: 0.0.0.0/0
  server:
    selector:
      matchLabels: {}

I’m going to close this for now, since workarounds have been documented and we have a plan to provide proper per-route authorizations in 2.12.

I have a guess at what’s going on here:

By default, the linkerd-viz namespace ships with a policy like:

apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  labels:
    linkerd.io/extension: viz
  name: proxy-admin
  namespace: linkerd-viz
spec:
  client:
    unauthenticated: true
  server:
    name: proxy-admin

This authorizes unauthenticated requests from IPs in the clusterNetworks configuration. If the source IP (<public-ip-address-of-hel-k1>) is not in that list, these connections will be denied. To fix this, the authorization policy could be updated with the following:

apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  labels:
    linkerd.io/extension: viz
  name: proxy-admin
  namespace: linkerd-viz
spec:
  client:
    unauthenticated: true
    networks:
      - cidr: 0.0.0.0/0
  server:
    name: proxy-admin

Can you update this resource and share whether this resolves the issue?
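
Related to the clusterNetworks point above, a quick way to confirm which networks the proxies treat as cluster-internal is to look at the linkerd-config ConfigMap. This is a sketch; the exact layout of the values key can vary between Linkerd versions:

# Show the clusterNetworks the control plane was installed with. The stock
# value only covers private ranges (RFC 1918 plus 100.64.0.0/10), so a node's
# public address would fall outside it and unauthenticated probe traffic from
# that address gets denied.
kubectl -n linkerd get configmap linkerd-config -o yaml | grep clusterNetworks

If the node addresses really are outside those CIDRs, another option is to widen clusterNetworks at install/upgrade time (e.g. --set clusterNetworks=...) rather than loosening the individual ServerAuthorization resources.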