istio: Istio CNI race condition on cluster restart

Bug description When restarting a cluster, Istio CNI doesn't set up correct routing, depending on pod start time. Traffic is not proxied through the sidecars. Restarting the affected pods fixes this. This looks like case 1 of https://github.com/istio/cni/issues/82 and might be related to the changes in https://github.com/containernetworking/plugins/pull/269

Expected behavior Traffic should be routed through sidecars after cluster restart even without restarting the pods.

Steps to reproduce the bug

  1. Set up a cluster with Calico 3.4.4.

  2. Set up Istio with istio-cni.

  3. Add a Pod with sidecar injection.

  4. Restart all cluster hosts.

Result: Traffic is not routed through the sidecar.

Version (include the output of istioctl version --remote and kubectl version)

  • kubernetes: 1.13.5
  • istio: 1.1.7
  • calico: 3.4.4

How was Istio installed? helm chart istio-init, istio-cni and istio with mTLS enabled

Environment where bug was observed (cloud vendor, OS, etc) Bare Metal Ubuntu 18.04, Cluster Set Up with kubespray, ipvs proxy mode

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 42 (29 by maintainers)

Most upvoted comments

I have used the phrase “software development should not be so hard” multiple times this week in various contexts 😃

software development is hard, let’s go shopping!

@rlenglet Went through all the above suggestions and did some digging. How about the following two mitigations?

  1. Change the pod priority of istio-cni from system-node-critical to system-cluster-critical. Since istio-cni depends on the “real” CNI, changing to a relatively lower priority level makes sure it’s scheduled after calico and also makes sure it’s scheduled before “normal” pods. https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html

  2. Reduce the polling period from 30 seconds to 5 seconds as suggested.
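To illustrate why the polling period matters, here is a minimal Python sketch of a generic polling loop. It is not the istio-cni implementation: the thread does not say exactly what is polled, so `check` is a hypothetical stand-in for whatever readiness condition is being tested, and `wait_until`, `period_s`, and `timeout_s` are names made up for this example. The point is only that a condition that becomes true just after a check is not noticed until the next one, so the worst-case detection delay is roughly one period.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               period_s: float,
               timeout_s: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll check() every period_s seconds.

    Returns the elapsed seconds at the moment check() first passes.
    Raises TimeoutError if it has not passed within timeout_s.
    `check` is a hypothetical readiness test, not an Istio API.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError("condition not met within timeout")
        # The condition may become true during this sleep; it is only
        # noticed at the next check, up to period_s seconds later.
        sleep(period_s)
```

With a condition that becomes true one second after the first check, a 30-second period detects it after 30 seconds, while a 5-second period detects it after 5 seconds. The `clock` and `sleep` parameters are injectable so the loop can be exercised with a fake clock instead of real waiting.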

@jammerful please file a separate GitHub issue. In that new issue, please give your Istio configuration, esp. the flags you passed to enable Istio CNI. Also, please give the pod spec of a pod or replica set that is failing.