istio: Ingress SDS missing event to push key/cert pair to proxy

Bug description In rare cases we’re seeing that SDS is missing an event entirely that’s causing only the root cert to get pushed but not the key/cert pair. Looking through the logs we can see that the root cert gets pushed out once the secret is eventually found:

2019-11-12T22:05:53.335991Z	warn	secretFetcherLog	Cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, searching for fallback secret gateway-fallback
2019-11-12T22:05:53.336191Z	error	secretFetcherLog	cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert and cannot find fallback secret gateway-fallback
2019-11-12T22:05:53.336244Z	warn	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: SecretFetcher cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert from cache
2019-11-12T22:05:53.336308Z	warn	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: waiting for ingress gateway secret for proxy "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local"

2019-11-12T22:05:53.336165Z	warn	secretFetcherLog	Cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, searching for fallback secret gateway-fallback
2019-11-12T22:05:53.336362Z	error	secretFetcherLog	cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b and cannot find fallback secret gateway-fallback
2019-11-12T22:05:53.336382Z	warn	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-113, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: SecretFetcher cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b from cache
2019-11-12T22:05:53.336391Z	warn	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-113, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: waiting for ingress gateway secret for proxy "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local"

2019-11-12T22:05:59.991770Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T22:05:59.991797Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T22:05:59.991894Z	info	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: pushed root cert to proxy

You’ll see that no key/cert pair gets pushed. This causes TLS/mTLS gateways to break completely.

The way to recover from this is to delete the secret and add it back again. This results in the following logs:

2019-11-12T23:09:37.570981Z	info	secretFetcherLog	secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b is deleted
2019-11-12T23:09:37.571063Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T23:09:37.571075Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T23:09:37.571085Z	info	secretFetcherLog	secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert is deleted
2019-11-12T23:09:37.571104Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110" can be found
2019-11-12T23:09:37.571115Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110" can be found
2019-11-12T23:10:14.200253Z	info	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: pushed root cert to proxy

2019-11-12T23:10:14.200251Z	info	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-113, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: pushed key/cert pair to proxy

Expected behavior SDS always pushing the required crypto to the proxy.

Steps to reproduce the bug Unknown, I’ve found this very hard to reproduce. It has only affected one user of ours AFAIK.

Version (include the output of istioctl version --remote and kubectl version)

client version: unknown
control plane version: 1.3.0
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-eks-5876d6", GitCommit:"5876d6b7429820450950ade17fe7b4bf5ccada7f", GitTreeState:"clean", BuildDate:"2019-09-24T20:54:25Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed? Imagine having a Makefile that does this where $ISTIO_VERSION is 1.3.0:

@curl -s -L https://git.io/getLatestIstio | ISTIO_VERSION=$(ISTIO_VERSION) sh -
@cat istio/istio-namespace.yaml > istio/istio.yaml
@cat istio/istio-namespace.yaml > istio/istio-crds.yaml
@helm template istio-$(ISTIO_VERSION)/install/kubernetes/helm/istio-init --name istio-init --namespace istio-system >> istio/istio-crds.yaml
@helm template istio-$(ISTIO_VERSION)/install/kubernetes/helm/istio --name istio --namespace istio-system \
    --set sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
    --set servicegraph.enabled=true \
    --set gateways.istio-ingressgateway.sds.enabled=true \
    --set tracing.enabled=true \
    --set kiali.enabled=true \
    --set "kiali.dashboard.jaegerURL=http://jaeger-query:16686" \
    --set "kiali.dashboard.grafanaURL=http://grafana:3000" \
    --set grafana.enabled=true >> istio/istio.yaml
@cat istio-$(ISTIO_VERSION)/install/kubernetes/helm/istio-init/files/crd-1* > istio/istio-delete-crds.yaml
@rm -rf istio-$(ISTIO_VERSION)

A kubectl apply is later done on the istio-crds.yaml and istio.yaml files.

Environment where bug was observed (cloud vendor, OS, etc) EKS

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 23 (17 by maintainers)

Most upvoted comments

I still see this issue with 1.3.5. Can we reopen this? @incfly I don’t have perms to reopen the issue. Here’s how I’m able to repro:

Create a gateway that uses a secret like so:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: apa0401f02e6d286f2f0678eff4046e11571e58
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*.apa0401f02e6d286f2f0678eff4046e11571e58.myurl.com'
    port:
      name: https-mutual
      number: 443
      protocol: HTTPS
    tls:
      credentialName: apa0401f02e6d286f2f0678eff4046e11571e58
      mode: MUTUAL

Create the secret that the gateway uses:

apiVersion: v1
data:
  cacert: cacertdatahere
  cert: certdatahere
  key: secretdatahere
kind: Secret
metadata:
  name: apa0401f02e6d286f2f0678eff4046e11571e58
  namespace: istio-system
type: Generic

Now delete the secret and gateway and create them again using the exact same values. This causes SDS to fail to push things over to Envoy. In 1.3.5 I see that even the root cert doesn’t make it to Envoy so the TLS endpoint is completely broken:

2019-11-19T05:45:20.765756Z	warn	secretFetcherLog	Cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58-cacert, searching for fallback secret gateway-fallback
2019-11-19T05:45:20.765784Z	error	secretFetcherLog	cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58-cacert and cannot find fallback secret gateway-fallback
2019-11-19T05:45:20.765791Z	warn	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-18, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: SecretFetcher cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58-cacert from cache
2019-11-19T05:45:20.765800Z	warn	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-18, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: waiting for ingress gateway secret for proxy "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local"

2019-11-19T05:45:20.769565Z	warn	secretFetcherLog	Cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58, searching for fallback secret gateway-fallback
2019-11-19T05:45:20.769693Z	error	secretFetcherLog	cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58 and cannot find fallback secret gateway-fallback
2019-11-19T05:45:20.769723Z	warn	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-17, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: SecretFetcher cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58 from cache
2019-11-19T05:45:20.769739Z	warn	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-17, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: waiting for ingress gateway secret for proxy "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local"

2019-11-19T05:45:41.187776Z	error	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: NotifyProxy failed. No connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16" can be found
2019-11-19T05:45:41.187808Z	error	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: failed to notify secret change for proxy: no connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16" can be found
2019-11-19T05:45:41.187980Z	error	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: NotifyProxy failed. No connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15" can be found
2019-11-19T05:45:41.187996Z	error	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: failed to notify secret change for proxy: no connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15" can be found

what release will this be targeted for? Will it be backported anywhere?

It will be shipped with release-1.7. Which will be announced on August 11th (https://github.com/istio/istio/wiki/Istio-Release-1.7) @williamaronli could you cherrypick this fix into 1.5 and 1.6 as well?

I vaguely recall this being a known bug that has been fixed in one of the 1.3 patch releases. I highly recommend you upgrade to 1.3.5 which has a number of high severity fixed.

@JimmyCYJ can verify if this was fixed