istio: Envoy ready before service, causes momentary failures with "upstream connect error or disconnect/reset before headers"
@mikedoug commented on Tue Apr 10 2018
Is this a BUG or FEATURE REQUEST?: Bug
Did you review https://istio.io/help/ and existing issues to identify if this is already solved or being worked on?: Yes
Bug: Yes?
What Version of Istio and Kubernetes are you using, where did you get Istio from, Installation details
istioctl version
Version: 0.7.1
GitRevision: 62110d4f0373a7613e57b8a4d559ded9cb6a1cc8
User: root@c5207293dc14
Hub: docker.io/istio
GolangVersion: go1.9
BuildStatus: Clean
kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Is Istio Auth enabled or not ? Did you install the stable istio.yaml, istio-auth.yaml… or if using the Helm chart please provide full command line input.
I installed using istio-auth.yaml. It worked wonderfully!
What happened: Everything works perfectly. I have a simple “hello world” service which I have an ingress configured to access, and it works perfectly until I change the deployment, i.e. when I upgrade to a new version or deploy new pods. Kubernetes does the right thing and tries to only bring them into service when the listening port is available, but it seems like the Envoy in the sidecar is up and running BEFORE my simple “hello world” service, so Kubernetes adds the pod into the service, and requests return “upstream connect error or disconnect/reset before headers” for a few seconds until my service is operational.
What you expected to happen: I expect my new pod to NOT handle any connections until the service itself is actually listening and ready to go. It would seem that either the healthcheck system needs to be extended to know about Envoy, or Envoy needs to not listen for TCP connections until it can reach the internal service.
In a live system, this will cause sporadic errors from simple and common acts of performing continuous deployment and scaling changes.
How to reproduce it: Create a deployment with an Istio sidecar, a service, and an ingress pointing to that service. Set up a constant curl of the service address, then either change the deployment to use a new image or change the number of replicas. You will see the “upstream connect error or disconnect/reset before headers” issue.
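Roughly, a minimal setup along these lines is enough to trigger it (the names, image, and port here are placeholders, not exact manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helloworld
  template:
    metadata:
      labels:
        app: helloworld
    spec:
      # assumes the Istio sidecar is injected (istioctl kube-inject or automatic injection)
      containers:
      - name: helloworld
        image: helloworld:latest        # placeholder image
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: helloworld
spec:
  selector:
    app: helloworld
  ports:
  - name: http                          # named port so Istio treats this as HTTP
    port: 5000
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: helloworld
  annotations:
    kubernetes.io/ingress.class: istio  # route through the Istio ingress
spec:
  rules:
  - http:
      paths:
      - path: /
        backend:
          serviceName: helloworld
          servicePort: 5000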
@linsun commented on Tue Apr 10 2018
Thanks for reporting this; it is an interesting problem. Agreed that it is not helpful to have the Envoy sidecar declared running before the actual service container is running.
Do you have a health check for your service? It could help k8s know when your service is running. cc @costinm @rshriram @louiscryan @frankbu
@mikedoug commented on Tue Apr 10 2018
I do not have a health check setup. I’ll give full disclaimer that I am new to Kubernetes and am experimenting with all of this now. I will look into setting one up to see if it fixes the problem – it sounds like it will.
However, it seems like bad form for the Envoy sidecar to require a manual health check configured for every service simply because this race condition exists. Long term, it would be better for Istio to ensure that Envoy doesn’t exhibit this behavior.
@mikedoug commented on Tue Apr 10 2018
Using a liveness or readiness probe, either as an httpGet or as a plain TCP probe (using https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-command as a reference), always gets me a broken service. In all cases I get a bunch of the following error line in my istio-proxy logs for that pod. The service is completely broken at that point.
[2018-04-10 21:19:07.666][14][warning][upstream] external/envoy/source/server/lds_subscription.cc:68] lds: fetch failure: error adding listener: 'http_10.244.1.78_5000' has duplicate address '10.244.1.78:5000' as existing listener
So using liveness or readiness checks is a bust/no-go for this. Possibly another bug report… Actually, I found one already documenting this issue: https://github.com/istio/istio/issues/2628. I’m reviewing that now to see if I can do something here.
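For reference, the probes were along these lines, targeting the service port directly (a reconstruction, not the exact manifest; port 5000 as in the log above):

readinessProbe:
  httpGet:
    path: /
    port: 5000          # probing the app port directly; this is the setup that produced the duplicate-listener error above
  initialDelaySeconds: 1
  periodSeconds: 2

(and the same result with tcpSocket in place of httpGet)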
@mikedoug commented on Tue Apr 10 2018
This configuration resolves my issue (where my internal service is on port 5000):
readinessProbe:
  exec:
    command:
    - curl
    - http://localhost:5000/
  initialDelaySeconds: 1
  periodSeconds: 2
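One caveat with this form: curl exits 0 for any HTTP response unless -f is passed, so the probe above really only verifies that the port accepts connections. A slightly stricter variant (still assuming the app serves plain HTTP on port 5000) would be:

readinessProbe:
  exec:
    command:
    - curl
    - -f                       # fail on HTTP 4xx/5xx so the probe checks more than a TCP connect
    - http://localhost:5000/
  initialDelaySeconds: 1
  periodSeconds: 2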
However, this should be the default behavior of the Istio sidecar: it should ensure that the internal services are reachable before making itself reachable.
@mikedoug commented on Tue Apr 10 2018
I take that back. I am still having the same issue where, even with the readinessProbe above, my service is returning the “upstream connect error or disconnect/reset before headers” briefly before beginning to function properly.
It would seem all this is doing is preventing my container from being marked ready, but the istio-proxy container goes ready ahead of it. I am using the Istio Ingress service in front of the service that I’m attempting to get working properly.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 15 (2 by maintainers)
@facundomedica this is unrelated to the original issue but does indicate we need to do a better job identifying what is causing the 503s.
I solved it by adding this policy:
We are also having the same problem with istio-1.0: no matter what readiness/liveness probe we try, the requests still get connection errors when applying changes to a deployment or scaling down the pods.
Does anyone have a workaround for this issue?
We’re suffering the same issue on the outbound side, where workers start trying to send outgoing connections before the proxy is configured, causing errors.
@FuzzOli87 Interesting. It sounds like what needs to happen is some lock-stepping:
1. istio-proxy starts and becomes functional.
2. The service starts.
3. istio-proxy opens its own listening ports.
Your problem is solved by the gating of steps 1 and 2, because istio-proxy is spun up and functional before your service starts. I’m not sure of an easy way to gate that, however, unless there is a way for the service to optionally “test” istio-proxy’s availability. Some sort of “ping” service on your network that must be available before your service does its startup process could be a hacky way of making that happen.
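A rough sketch of that hack as a pod spec fragment (the container name, image, entrypoint, and the assumption that the sidecar’s Envoy admin endpoint answers on localhost:15000 are all placeholders and may need adjusting per Istio version; it also assumes curl is present in the app image):

containers:
- name: my-service               # hypothetical app container
  image: my-service:latest       # placeholder image
  command: ["/bin/sh", "-c"]
  args:
  - |
    # Block until the sidecar's Envoy admin endpoint responds, then start the app.
    # Note: this only proves Envoy's admin interface is up, not that it has
    # received its listener/route config from Pilot, so it narrows the race
    # rather than closing it.
    until curl -s -o /dev/null http://localhost:15000/server_info; do
      echo "waiting for istio-proxy..."
      sleep 1
    done
    exec /app/my-service         # hypothetical real entrypoint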
The 3rd step is somewhat easy to gate inside istio-proxy itself: istio-proxy can wait until the local ports are available in LISTEN mode before it opens its own listening ports. That would solve my problem, where the service is not truly ready but istio-proxy makes it appear ready to the cluster because it is listening.