linkerd2: All linkerd-viz pods in CrashLoopBackOff
Bug Report
What is the issue?
All the pods for linkerd-viz are stuck in a restart loop, flipping between CrashLoopBackOff and Running.
root@hel1-k1:~/linkerd-demo# kubectl get pods -n linkerd-viz
NAME READY STATUS RESTARTS AGE
grafana-8d54d5f6d-wwfhm 0/2 CrashLoopBackOff 26 (42s ago) 32m
metrics-api-dd848c7c4-dpdvj 0/2 CrashLoopBackOff 33 (4m42s ago) 48m
prometheus-7bbc4d8c5b-4c4ts 0/2 Running 34 (19s ago) 48m
tap-f77b59d5b-vvglz 0/2 CrashLoopBackOff 34 (4m33s ago) 48m
tap-injector-68f5c5bc46-c4xqg 0/2 CrashLoopBackOff 34 (2m37s ago) 48m
web-85bb987c55-wps72 0/2 CrashLoopBackOff 35 (5m7s ago) 48m
How can it be reproduced?
The cluster was initialised with:
kubeadm init --control-plane-endpoint=api.helsinki-cluster.foo-domain.com --pod-network-cidr=10.210.0.0/16 --apiserver-advertise-address=10.10.0.2 --service-cidr=10.211.0.0/18 --service-dns-domain="helsinki-cluster.foo-domain.com"
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Linkerd with:
linkerd install --cluster-domain=helsinki-cluster.foo-domain.com --identity-trust-domain=helsinki-cluster.foo-domain.com | kubectl apply -f -
And linkerd-viz with:
linkerd viz install --set clusterDomain="helsinki-cluster.foo-domain.com" | k apply -f -
Logs, error output, etc.
From kubectl describe pod:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 07 Nov 2021 17:50:13 +0000
Finished: Sun, 07 Nov 2021 17:50:46 +0000
Ready: False
Restart Count: 24
Liveness: http-get http://:9995/ping delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:9995/ready delay=0s timeout=1s period=10s #success=1 #failure=7
... snip
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 46m (x17 over 76m) kubelet Readiness probe failed: Get "http://10.210.165.97:9995/ready": dial tcp 10.210.165.97:9995: connect: connection refused
Warning Unhealthy 11m (x40 over 76m) kubelet Readiness probe failed: Get "http://10.210.165.97:4191/ready": dial tcp 10.210.165.97:4191: connect: connection refused
Warning BackOff 104s (x279 over 71m) kubelet Back-off restarting failed container
From the logs of linkerd-proxy
root@hel1-k1:~/linkerd-demo# k logs -n linkerd-viz metrics-api-dd848c7c4-dpdvj linkerd-proxy
time="2021-11-07T17:21:53Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2021-11-07T17:21:53Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.000775s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.002281s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.002328s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.002343s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.002356s] INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[ 0.002460s] INFO ThreadId(01) linkerd2_proxy: Local identity is metrics-api.linkerd-viz.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com
[ 0.002484s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.helsinki-cluster.foo-domain.com:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com)
[ 0.002498s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.helsinki-cluster.foo-domain.com:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com)
[ 0.032970s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: metrics-api.linkerd-viz.serviceaccount.identity.linkerd.helsinki-cluster.foo-domain.com
[ 1.951912s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37016
[ 1.952035s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37016}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[ 1.952805s] INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43652
[ 1.952891s] INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43652}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[ 2.970393s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37062
[ 2.970459s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37062}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[ 2.977567s] INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43694
[ 2.977810s] INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43694}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[ 3.971149s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37082
[ 3.971324s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37082}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[ 3.971814s] INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43718
[ 3.971857s] INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43718}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[ 9.610323s] INFO ThreadId(01) inbound:server{port=9995}: linkerd_app_inbound::policy::authorize::http: Request denied server=admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:43908
[ 9.610365s] INFO ThreadId(01) inbound:server{port=9995}:rescue{client.addr=<public-ip-address-of-hel-k1>:43908}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server admin
[ 9.610662s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37272
[ 9.610769s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37272}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
[ 9.613311s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::authorize::http: Request denied server=proxy-admin tls=None(NoClientHello) client=<public-ip-address-of-hel-k1>:37276
[ 9.613567s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37276}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server proxy-admin
linkerd check output
Please note it hangs on the "Running viz extension check" step.
root@hel1-k1:~/linkerd-demo# linkerd check
Linkerd core checks
===================
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match
Status check results are √
Linkerd extensions checks
=========================
/ Running viz extension check
Environment
- Kubernetes Version: v1.22.3
- Cluster Environment: kubeadm built, host has public and private interface. Calico used for network overlay.
- Host OS: Ubuntu 20.04.3
- Linkerd version: stable-2.11.1
Possible solution
The linkerd-proxy containers are behaving correctly in other namespaces, so this issue seems specific to linkerd-viz at the moment.
I am curious why the server's public IP address appears in the logs: daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=<public-ip-address-of-hel-k1>:37082}
I would have expected to see the private address, or the host's IP address within the node IP range, which in this case is 10.210.165.64/26:
root@hel1-k1:~/linkerd-demo# calicoctl ipam show --show-blocks
+----------+------------------+-----------+------------+--------------+
| GROUPING | CIDR | IPS TOTAL | IPS IN USE | IPS FREE |
+----------+------------------+-----------+------------+--------------+
| IP Pool | 10.210.0.0/16 | 65536 | 18 (0%) | 65518 (100%) |
| Block | 10.210.165.64/26 | 64 | 18 (28%) | 46 (72%) |
+----------+------------------+-----------+------------+--------------+
I’m not sure of the real cause of the problem though, and would appreciate any help in figuring it out 😃
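For reference, a quick way to compare that client address against what Linkerd treats as in-cluster could look like the following (a sketch only; the resource names assume a default stable-2.11 install, so adjust if yours differ):
# Compare the node's InternalIP/ExternalIP with the client.addr seen in the denial logs
kubectl get nodes -o wide
# clusterNetworks lists the source networks the proxies treat as part of the cluster
kubectl get configmap linkerd-config -n linkerd -o yaml | grep clusterNetworks
# Show which source networks the viz ServerAuthorizations currently accept
kubectl get serverauthorizations.policy.linkerd.io -n linkerd-viz -o yaml | grep -B2 -A4 networks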
Commits related to this issue
- Loosen policies for linkerd viz See the github issues https://github.com/linkerd/linkerd2/issues/7233 for more information — committed to mattthias/stuff by deleted user 3 years ago
Just adding a summary of the patch mentioned before, in case you are a slow learner like me.
Step 0
Check if there’s a proper fix mentioned in other comments below. That way you don’t even need to run this.
Step 1
run
kubectl get ServerAuthorization -n linkerd-viz
to get a list of the ServerAuthorization resources.
Step 2
Then edit each one with
kubectl edit ServerAuthorization -n linkerd-viz <name>
Add the following below client: as shown in previous comments (an illustrative sketch appears after Step 3). Btw, not all the resources have a client section; in that case, don't modify them.
Step 3
Restart your pods and it should be working. At least that worked for me.
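For illustration only (the exact snippet from the earlier comments is not reproduced here), a loosened client section might look like the fragment below; 0.0.0.0/0 is the most permissive choice and should be narrowed for anything beyond a demo cluster:
client:
  # illustration only: accept connections from any source network
  networks:
  - cidr: 0.0.0.0/0
  # allow requests that carry no client identity (e.g. kubelet probes)
  unauthenticated: true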
For the lazy ones: apply this policy to the linkerd-viz namespace.
I’m going to close this for now, since workarounds have been documented and we have a plan to provide proper per-route authorizations in 2.12.
I have a guess at what’s going on here:
By default, the linkerd-viz namespace ships with a policy like:
This authorizes unauthenticated requests from IPs in the clusterNetworks configuration. If the source IP (<public-ip-address-of-hel-k1>) is not in that list, these connections will be denied. To fix this, the authorization policy could be updated with the following (see the sketch below):
Can you update this resource and share whether this resolves the issue?
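A sketch of what such an update could look like (the resource name admin and the CIDRs are assumptions; use the names from kubectl get serverauthorization -n linkerd-viz and the networks that actually apply to your nodes):
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: linkerd-viz
  name: admin   # assumption: whichever authorization guards the failing port (server=admin in the logs)
spec:
  server:
    name: admin
  client:
    unauthenticated: true
    networks:
    - cidr: 10.210.0.0/16                      # the existing pod network
    - cidr: <public-ip-address-of-hel-k1>/32   # assumption: the node address seen as client.addr in the denial logs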