istio: istio-cni does not return an error when it cannot get pod information from k8s

Bug description

If istio-cni is unable to get pod information from the Kubernetes API, it logs a warning, treats the pod as having at most one container, skips sidecar injection, and reports success.

https://github.com/istio/cni/blob/c1b9ddf605584b098a9d8ff31622410a81afcaeb/cmd/istio-cni/main.go#L187-L189

It needs to fail fast in this case.
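A minimal sketch of the fail-fast behavior being requested, in Go. The function and type names here (`getPodInfo`, `cmdAdd`, `podInfo`) are hypothetical stand-ins for illustration, not the actual istio-cni code; the point is only that the plugin should propagate the API error instead of logging a warning and continuing:

```go
package main

import (
	"errors"
	"fmt"
)

// podInfo is a stand-in for the pod data the CNI plugin fetches
// from the Kubernetes API (hypothetical, for illustration).
type podInfo struct {
	containers []string
}

// getPodInfo simulates the K8s API call; here it fails the way it
// would when the service account lacks RBAC permissions.
func getPodInfo(namespace, name string) (*podInfo, error) {
	return nil, errors.New("pods is forbidden: cannot get resource")
}

// cmdAdd sketches the requested behavior: instead of logging a
// warning and skipping injection, return the error so the CNI
// runtime blocks pod creation.
func cmdAdd(namespace, name string) error {
	pod, err := getPodInfo(namespace, name)
	if err != nil {
		// Fail fast: propagate the error rather than treating the
		// pod as single-container and silently skipping injection.
		return fmt.Errorf("failed to get pod info for %s/%s: %w",
			namespace, name, err)
	}
	if len(pod.containers) > 1 {
		// ... set up iptables redirection to the sidecar ...
	}
	return nil
}

func main() {
	if err := cmdAdd("default", "my-pod"); err != nil {
		fmt.Println("CNI ADD failed:", err)
	}
}
```

With this shape, a pod scheduled while the plugin cannot reach the API stays in a failed/retrying state, which is noisy but debuggable, instead of starting without redirection.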

What happened in our case was that we had the cni plugin run as a daemonset in the istio-system namespace. Due to races on new node creation, we had to move it to kube-system so that we could set its priority class to system-node-critical.

In this move we forgot to update the cluster role binding to point to the new service account in the kube-system namespace. As a result, the CNI pod did not have privileges to get pod info from the k8s API.
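For reference, the misconfiguration was the `subjects` entry of a ClusterRoleBinding still naming the service account in the old namespace. A sketch of the corrected binding is below; the `istio-cni` role and service account names are assumptions for illustration, as the actual names depend on the install manifests:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: istio-cni
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: istio-cni
subjects:
- kind: ServiceAccount
  name: istio-cni
  namespace: kube-system  # must match the namespace the daemonset was moved to
```

If the `namespace` here still says `istio-system` after the move, the binding silently grants nothing to the running pod, which is exactly the failure mode described above.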

In such a situation I would expect the CNI plugin to fail fast and block pod creation. Instead, a bunch of pods came up with no iptables rules injected, causing a massive, hard-to-debug, cluster-wide outage because no traffic was being redirected to the Envoy proxies.

The fact that we were rolling nodes, which caused a bunch of new pods to be created at the same time, made the outage worse. OTOH, it was a useful signal for figuring out the root cause of a problem that would otherwise have been hard to debug.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (13 by maintainers)

Most upvoted comments

@ruigulala Istio 1.4 only supports Kubernetes 1.13+, so I think it's reasonable to assume it's enabled

Using the new istioctl manifest install mechanism, as described on the CNI doc page, the CNI daemonset/pods will be created in the istio-system namespace. Is this an issue?

I have no idea why the namespace was changed. In 1.3, the daemon must be created in kube-system due to its priorityClass. @ruigulala changed the priority class, and it should not require being created in kube-system anymore. We’re lucky this was done before 1.4 release then.