istio: Pod fails to start: Application container unable to access network before sidecar ready

When a pod starts, the sidecar and the application containers all start together.

If an application container attempts to access a network service before the sidecar is ready, the connection fails.

  1. Access fails completely if no listener is present on the sidecar.
  2. Access fails with 404 / 503 if a listener is present but no routes are available.

If the application is resilient to its dependencies being unavailable, then this is not an issue: the application will keep retrying until the connection can be established. However, if the application uses a network endpoint during startup and treats it as a fatal error when the endpoint cannot be reached, the application container will die.
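As a rough illustration of the "resilient" case, here is a minimal Python sketch of a startup check that retries instead of dying on the first failure. The names retry_until_ready and http_probe, the attempt count, and the delay are all hypothetical choices for this sketch and do not come from Istio or from the issue. Treating any non-2xx response as "not ready" is also an assumption, chosen to cover both failure modes listed above (refused connection, 404 / 503).

```python
import time
import urllib.request


def retry_until_ready(probe, attempts=10, delay=0.5, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out.

    Returns True on the first successful probe, False otherwise.
    """
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused/reset or an HTTP error status: the sidecar
            # (or the dependency behind it) is presumably not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False


def http_probe(url, timeout=2.0):
    """Build a probe that succeeds only on a 2xx response from url."""
    def probe():
        # urlopen raises HTTPError (an OSError subclass) on 4xx/5xx,
        # which retry_until_ready treats as "not ready yet".
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    return probe
```

A service would call something like retry_until_ready(http_probe(url)) with the URL of its own startup dependency before doing any real work, and exit only if that returns False.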

As long as restartPolicy is OnFailure (or Always), k8s will restart the container while the sidecar gets ready.

  • Test that this really works
  • Document mitigation

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

I wrote our sleep to be literally the first thing that the application does at startup. It’s effectively the first line of code that executes in a shared main method that all of our services use - that lets us make sure there’s standard flags for configuring the startup delay, etc. Making it absolutely the first thing that happens prevents developers from accidentally attempting to do stuff that could fail without a sidecar (like opening up connections to the database or reading some online config store, etc).
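To make the idea concrete, here is a minimal Python sketch of such a shared entry point. It is not the actual framework code: the names shared_main and parse_startup_delay, the --startup-delay flag, and the STARTUP_DELAY_SECONDS environment variable are all hypothetical, and the 5 second default is simply the value we happen to use.

```python
import argparse
import os
import time


def parse_startup_delay(argv, env=None):
    """Read the startup delay (seconds) from a flag, falling back to an env var."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--startup-delay",
        type=float,
        default=float(env.get("STARTUP_DELAY_SECONDS", "5")),
    )
    # parse_known_args leaves the service's own flags untouched.
    args, _unknown = parser.parse_known_args(argv)
    return max(args.startup_delay, 0.0)


def shared_main(service_main, argv, env=None, sleep=time.sleep):
    """Shared entry point: sleep first, then hand control to the service."""
    # The delay runs before any service code, so nothing can open a
    # database connection or read remote config ahead of the sidecar.
    sleep(parse_startup_delay(argv, env))
    return service_main()
```

Each service would then start with something like shared_main(run_service, sys.argv[1:]), where run_service is that service's own startup function.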

Anecdotally, with a 5 second delay at startup we’ve not seen any startup failures due to waiting on the sidecar in our continuous testing environments. How long the typical startup delay is in your system is mainly a function of Pilot load (number of services in the system, rate of change of pods and services, number of sidecars connected, etc.).

The full solution to this in Kubernetes is for k8s to support sidecar containers as a first-class concept, starting them up entirely before starting up the application container. We’d been hopeful this would land in the latest k8s release, but it’s since been put on indefinite hold by the k8s community and will not ship with k8s 1.19 (at this point we can hope for 1.20, but I haven’t been following k8s closely enough to see if that’s realistic).

Other organizations I’ve worked with have solved this problem by adding a sleep to the app container. The base framework for services we use at Tetrate incorporates a sleep at startup to paper over this pain too, for example. It’s not clean, and it violates the design goal of the mesh being transparent, but until there’s better support for container lifecycles in the underlying platforms that Istio runs on, there’s not too much we can do here.