istio: Pods are entering CrashLoopBackOff after upgrade to Istio v1.6.6

Bug description

After upgrading from Istio v1.6.5 to Istio v1.6.6, some (not all) pods with an automatically injected Istio sidecar enter a CrashLoopBackOff state after a restart. In particular, the following pods show this behavior:

  • kubernetes-dashboard, see log excerpt below:
2020/07/30 11:54:21 Starting overwatch
2020/07/30 11:54:21 Using namespace: fdlgate-system
2020/07/30 11:54:21 Using in-cluster config to connect to apiserver
2020/07/30 11:54:21 Using secret token for csrf signing
2020/07/30 11:54:21 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: an error on the server ("") has prevented the request from succeeding (get secrets kubernetes-dashboard-csrf)

goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0xc0001810a0)
	/home/runner/work/dashboard/dashboard/src/app/backend/client/csrf/manager.go:41 +0x446
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
	/home/runner/work/dashboard/dashboard/src/app/backend/client/csrf/manager.go:66
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0xc000152080)
	/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:501 +0xc6
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0xc000152080)
	/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:469 +0x47
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
	/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:550
main.main()
	/home/runner/work/dashboard/dashboard/src/app/backend/dashboard.go:105 +0x20d
  • nginx-ingress-controller, see log excerpt below:
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v0.34.1
  Build:         v20200715-ingress-nginx-2.11.0-8-gda5fa45e2
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.1

-------------------------------------------------------------------------------

I0730 12:43:51.917976       6 flags.go:205] Watching for Ingress class: nginx
W0730 12:43:51.918263       6 flags.go:250] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W0730 12:43:51.918308       6 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0730 12:43:51.918476       6 main.go:231] Creating API client for https://172.20.0.1:443
I0730 12:43:51.919187       6 main.go:251] Trying to discover Kubernetes version
I0730 12:43:51.919448       6 main.go:260] Unexpected error discovering Kubernetes version (attempt 0): Get "https://172.20.0.1:443/version?timeout=32s": dial tcp 172.20.0.1:443: connect: connection refused
I0730 12:43:52.956326       6 main.go:260] Unexpected error discovering Kubernetes version (attempt 1): Get "https://172.20.0.1:443/version?timeout=32s": dial tcp 172.20.0.1:443: connect: connection refused
I0730 12:43:54.548717       6 request.go:907] Got a Retry-After 1s response for attempt 1 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:55.549380       6 request.go:907] Got a Retry-After 1s response for attempt 2 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:56.550125       6 request.go:907] Got a Retry-After 1s response for attempt 3 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:57.550655       6 request.go:907] Got a Retry-After 1s response for attempt 4 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:58.551391       6 request.go:907] Got a Retry-After 1s response for attempt 5 to https://172.20.0.1:443/version?timeout=32s
I0730 12:43:59.551990       6 request.go:907] Got a Retry-After 1s response for attempt 6 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:00.552540       6 request.go:907] Got a Retry-After 1s response for attempt 7 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:01.553339       6 request.go:907] Got a Retry-After 1s response for attempt 8 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:02.553863       6 request.go:907] Got a Retry-After 1s response for attempt 9 to https://172.20.0.1:443/version?timeout=32s
I0730 12:44:03.554563       6 main.go:260] Unexpected error discovering Kubernetes version (attempt 2): an error on the server ("") has prevented the request from succeeding

Downgrading to Istio v1.6.5 fixes the issue; all pods run normally again as expected. The behavior is also reproducible: upgrading to Istio v1.6.6 again causes the same pods to enter CrashLoopBackOff after a restart.
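For reference, a quick way to see which pods are affected and which proxy version each sidecar is running (a hedged sketch; the istio-proxy container name assumes default injection settings):

# List pods currently stuck in CrashLoopBackOff across all namespaces.
kubectl get pods -A | grep CrashLoopBackOff

# Print the sidecar image (and hence proxy version) of a given pod.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[?(@.name=="istio-proxy")].image}'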

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

Upgrading to Istio v1.6.6 should not break existing pods.

Steps to reproduce the bug

  1. Set up a Kubernetes cluster (v1.17.7)
  2. Install Istio v1.6.5
  3. Install Kubernetes-Dashboard (I’m using this Helm Chart, v2.3.0) with Istio sidecar enabled
  4. Access Kubernetes-Dashboard; it should be working
  5. Upgrade Istio to v1.6.6 (see the command sketch after this list)
  6. Restart Kubernetes-Dashboard pod
  7. The Kubernetes-Dashboard pod enters CrashLoopBackOff
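Roughly, steps 5 and 6 look like this (a sketch assuming the same myvalues.yaml is reused with the istioctl 1.6.6 binary, and that the dashboard Deployment is named kubernetes-dashboard in the fdlgate-system namespace seen in the log above):

# Upgrade the control plane in place using the existing overlay values (run with istioctl 1.6.6).
istioctl manifest apply -f myvalues.yaml

# Restart the dashboard so the pod comes back with the 1.6.6 sidecar.
kubectl -n fdlgate-system rollout restart deployment kubernetes-dashboard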

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version

client version: 1.6.6
control plane version: 1.6.6
data plane version: 1.6.6 (4 proxies), 1.6.5 (8 proxies)
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.6-eks-4e7f64", GitCommit:"4e7f642f9f4cbb3c39a4fc6ee84fe341a8ade94c", GitTreeState:"clean", BuildDate:"2020-06-11T13:55:35Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ helm version

version.BuildInfo{Version:"v3.2.4", GitCommit:"0ad800ef43d3b826f31a5ad8dfbb4fe05d143688", GitTreeState:"clean", GoVersion:"go1.13.12"}

How was Istio installed?

istioctl manifest apply -f myvalues.yaml

Environment where bug was observed (cloud vendor, OS, etc)

AWS EKS (1.17)

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 19 (13 by maintainers)

Most upvoted comments

View impacted endpoints:

kubectl get endpoints -A -ojson | jq -r '.items[] | select(.subsets[]?.addresses[]?.targetRef == null) | .metadata.namespace + "/" + .metadata.name'
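A narrower check for just the API server's own Endpoints object, whose addresses normally carry no targetRef (a hedged variant of the command above):

kubectl get endpoints kubernetes -n default -o json | jq '.subsets[].addresses[] | has("targetRef")'

If this prints false, the kubernetes.default service falls into the set matched by the command above.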

Confirmed broken in master and 1.6.6 by bae28dde42

Ok, actually I think I reproduced it on 1.6.6. Will update with my findings.

@howardjohn we are seeing failures only to the API server.

What I spotted is that if you call the public IP (api-master) directly, everything works fine, but if the call goes through kubernetes.default.svc.cluster.local or kubernetes.default.svc, it may or may not fail. The public IP works every time; even a simple curl command can be used to check that, something like this:

# Read the service account credentials mounted into the pod.
export CA_CERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
export TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
export NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)

# Call the API server through the cluster-internal service name.
curl --cacert $CA_CERT -H "Authorization: Bearer $TOKEN" "https://kubernetes.default.svc/api/v1/namespaces/$NAMESPACE/services/"
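For comparison, the same request sent straight to the cluster's public API endpoint (the endpoint value below is a hypothetical placeholder; on EKS it can be looked up with aws eks describe-cluster --query cluster.endpoint):

# Hypothetical placeholder for the cluster's public API server endpoint.
export APISERVER=https://<your-eks-api-endpoint>
curl --cacert $CA_CERT -H "Authorization: Bearer $TOKEN" "$APISERVER/api/v1/namespaces/$NAMESPACE/services/"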

kube-state-metrics, for example, is crashing constantly, while a totally different pod is up and running but all of its calls to the API server are failing. What is weird is that the issue is not happening on another cluster running the same apps and the same Istio version.

k logs metrics-kube-state-metrics-699cdbfd79-9ztcg  kube-state-metrics
I0730 16:50:26.978169       1 main.go:89] Using collectors certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,namespaces,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses
I0730 16:50:26.978222       1 main.go:98] Using all namespace
I0730 16:50:26.978238       1 main.go:139] metric white-blacklisting: blacklisting the following items:
W0730 16:50:26.978269       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0730 16:50:26.980061       1 main.go:184] Testing communication with server
F0730 16:50:26.980971       1 main.go:147] Failed to create client: error while trying to communicate with apiserver: Get https://10.51.240.1:443/version?timeout=32s: EOF
Exec Failure
javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
    at java.base/sun.security.ssl.SSLSocketImpl.handleEOF(SSLSocketImpl.java:1320)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1159)
    at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1062)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:402)
    at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:320)
    at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:284)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:169)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:134)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:112)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.EOFException: SSL peer shut down incorrectly

We do have some PeerAuthentication policies set in order to disable mTLS, but they didn’t help in this case.

kubectl get peerauthentication -A
NAMESPACE                 NAME           AGE
default                   mtls-disable   168m
kafka                     mtls-disable   58d
production-task-manager   mtls-disable   5h26m
redis                     mtls-disable   58d
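To see what the sidecar of one of the failing pods has actually programmed for the API server service, something like the following can help (a sketch; the namespace is a placeholder and the --cluster filter assumes istioctl 1.6 behavior):

# Endpoints Envoy holds for the kubernetes.default.svc cluster, as seen from a failing pod's sidecar.
istioctl proxy-config endpoints metrics-kube-state-metrics-699cdbfd79-9ztcg.<namespace> --cluster "outbound|443||kubernetes.default.svc.cluster.local"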