istio: Pods are entering CrashLoopBackOff after upgrade to Istio v1.6.6
Bug description
After upgrading from Istio v1.6.5 to Istio v1.6.6, some (not all) pods with an automatically injected Istio sidecar enter a CrashLoopBackOff state after being restarted. In particular, the following pods show this behavior:
- kubernetes-dashboard, see log excerpt below:
2020/07/30 11:54:21 Starting overwatch
2020/07/30 11:54:21 Using namespace: fdlgate-system
2020/07/30 11:54:21 Using in-cluster config to connect to apiserver
2020/07/30 11:54:21 Using secret token for csrf signing
2020/07/30 11:54:21 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: an error on the server ("") has prevented the request from succeeding (get secrets kubernetes-dashboard-csrf)
goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0xc0001810a0)
/home/runner/work/dashboard/dashboard/src/app/backend/client/csrf/manager.go:41 +0x446
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
/home/runner/work/dashboard/dashboard/src/app/backend/client/csrf/manager.go:66
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0xc000152080)
/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:501 +0xc6
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0xc000152080)
/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:469 +0x47
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:550
main.main()
/home/runner/work/dashboard/dashboard/src/app/backend/dashboard.go:105 +0x20d
- nginx-ingress-controller, see log excerpt below:
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v0.34.1
Build: v20200715-ingress-nginx-2.11.0-8-gda5fa45e2
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.1
-------------------------------------------------------------------------------
I0730 12:43:51.917976 6 flags.go:205] Watching for Ingress class: nginx
W0730 12:43:51.918263 6 flags.go:250] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W0730 12:43:51.918308 6 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0730 12:43:51.918476 6 main.go:231] Creating API client for https://172.20.0.1:443
I0730 12:43:51.919187 6 main.go:251] Trying to discover Kubernetes version
I0730 12:43:51.919448 6 main.go:260] Unexpected error discovering Kubernetes version (attempt 0): Get "https://172.20.0.1:443/version?timeout=32s": dial tcp 172.20.0.1:443: connect: connection refused
I0730 12:43:52.956326 6 main.go:260] Unexpected error discovering Kubernetes version (attempt 1): Get "https://172.20.0.1:443/version?timeout=32s": dial tcp 172.20.0.1:443: connect: connection refused
I0730 12:43:54.548717 6 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:55.549380 6 request.go:907] Got a Retry-After 1s response for attempt 2 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:56.550125 6 request.go:907] Got a Retry-After 1s response for attempt 3 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:57.550655 6 request.go:907] Got a Retry-After 1s response for attempt 4 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:58.551391 6 request.go:907] Got a Retry-After 1s response for attempt 5 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:59.551990 6 request.go:907] Got a Retry-After 1s response for attempt 6 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:00.552540 6 request.go:907] Got a Retry-After 1s response for attempt 7 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:01.553339 6 request.go:907] Got a Retry-After 1s response for attempt 8 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:02.553863 6 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:03.554563 6 main.go:260] Unexpected error discovering Kubernetes version (attempt 2): an error on the server ("") has prevented the request from succeeding
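For reference, excerpts like the two above can be collected from the affected pods with standard kubectl commands; the namespace, pod, and container names below are placeholders, not taken from the report:
$ kubectl logs -n <namespace> <dashboard-pod> -c <dashboard-container> --previous   # last crashed run of the app container
$ kubectl logs -n <namespace> <ingress-pod> -c <controller-container> --previous
$ kubectl logs -n <namespace> <pod> -c istio-proxy                                  # sidecar logs for the same pod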
Downgrading to Istio v1.6.5 fixes the issue, and all pods run normally again as expected. The behavior is also reproducible: upgrading to Istio v1.6.6 again causes the same pods to enter CrashLoopBackOff after restart.
Expected behavior
Upgrading to Istio v1.6.6 should not break existing pods.
Steps to reproduce the bug
- Set up a Kubernetes cluster (v1.17.7)
- Install Istio v1.6.5
- Install Kubernetes-Dashboard (I’m using this Helm chart, v2.3.0) with the Istio sidecar enabled
- Access Kubernetes-Dashboard; it should be working
- Upgrade Istio to v1.6.6
- Restart the Kubernetes-Dashboard pod
- The Kubernetes-Dashboard pod enters CrashLoopBackOff (a command-level sketch of these steps is given below)
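A command-level sketch of the steps above, assuming Helm 3 and both istioctl binaries are available; the namespace, Helm release name, binary names, and chart repository URL are assumptions rather than the reporter's actual setup, while myvalues.yaml is the values file mentioned later in this report:
$ kubectl create namespace kubernetes-dashboard
$ kubectl label namespace kubernetes-dashboard istio-injection=enabled       # enable automatic sidecar injection
$ istioctl-1.6.5 manifest apply -f myvalues.yaml                             # install Istio v1.6.5
$ helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
$ helm install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard \
    --version 2.3.0 -n kubernetes-dashboard                                  # dashboard pod comes up with a sidecar
$ istioctl-1.6.6 upgrade -f myvalues.yaml                                    # upgrade the control plane to v1.6.6
$ kubectl rollout restart deployment kubernetes-dashboard -n kubernetes-dashboard
$ kubectl get pods -n kubernetes-dashboard -w                                # pod ends up in CrashLoopBackOff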
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
$ istioctl version
client version: 1.6.6
control plane version: 1.6.6
data plane version: 1.6.6 (4 proxies), 1.6.5 (8 proxies)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.6-eks-4e7f64", GitCommit:"4e7f642f9f4cbb3c39a4fc6ee84fe341a8ade94c", GitTreeState:"clean", BuildDate:"2020-06-11T13:55:35Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ helm version
version.BuildInfo{Version:"v3.2.4", GitCommit:"0ad800ef43d3b826f31a5ad8dfbb4fe05d143688", GitTreeState:"clean", GoVersion:"go1.13.12"}
How was Istio installed?
istioctl manifest apply -f myvalues.yaml
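The contents of myvalues.yaml are not included in the issue. Purely for illustration, a minimal IstioOperator overlay of the kind accepted by istioctl manifest apply -f could look like the following (not the reporter's actual file):
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: default              # illustrative only; the real overlay is not shown in the issue
  meshConfig:
    accessLogFile: /dev/stdout  # example customization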
Environment where bug was observed (cloud vendor, OS, etc)
AWS EKS (1.17)
About this issue
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 19 (13 by maintainers)
View impacted endpoints:
kubectl get endpoints -A -ojson | jq -r '.items[] | select(.subsets[]?.addresses[]?.targetRef == null) | .metadata.namespace + "/" + .metadata.name'
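On a typical cluster this command should list at least default/kubernetes, because the API server's Endpoints addresses are not backed by pods and therefore carry no targetRef; a quick way to confirm that (assuming jq is available) is:
$ kubectl get endpoints kubernetes -n default -o json | jq '.subsets[].addresses'   # no targetRef field on these addresses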
Confirmed broken in master and 1.6.6 by bae28dde42
OK, actually I think I reproduced it on 1.6.6. Will update with my findings.
@howardjohn we are seeing failures only on calls to the api-server.
What I spotted is that calling the public IP of the API server (api-master) directly always works fine, but requests that go through kubernetes.default.svc.cluster.local or kubernetes.default.svc may or may not fail. The public IP works every time; even a simple curl command can be used to check that, something like the sketch below:
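A hypothetical version of that curl check, run from inside a pod with the sidecar; the <api-server-public-endpoint> placeholder is an assumption and not taken from the original comment:
$ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
$ curl -sk -H "Authorization: Bearer $TOKEN" https://<api-server-public-endpoint>/version   # direct public endpoint: works every time
$ curl -sk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/version         # via the cluster service: may fail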
For example, kube-state-metrics is crashing constantly, while a completely different pod stays up and running but all of its calls to the api-server fail. What is weird is that the issue does not happen on another cluster running the same apps and the same Istio version.
We do have some PeerAuthentication policies set to disable mTLS, but that didn’t help in this case.
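For reference, a minimal sketch of such an mTLS-disabling policy; the name and the mesh-wide istio-system placement are assumptions, since the actual policies are not shown in the issue:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: disable-mtls        # placeholder name
  namespace: istio-system   # placing it in the root namespace makes the policy mesh-wide
spec:
  mtls:
    mode: DISABLE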