istio: Almost every app gets UC errors, 0.012% of all requests in 24h period
I’ve noticed pretty much every application periodically gets this: a 503 with the Envoy response flag UC.

It’s retried, as you can see, but it’s bothering me that I cannot get to the bottom of why it is happening. My DestinationRule should deal with any connection timeouts, etc.:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: sauron-graphql-server
spec:
  host: app
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1024
        http2MaxRequests: 10240
        idleTimeout: 5m
        maxRequestsPerConnection: 1024
      tcp:
        maxConnections: 10240
        tcpKeepalive:
          interval: 1m
          time: 3m
    loadBalancer:
      simple: LEAST_CONN
    outlierDetection:
      baseEjectionTime: 5s
      consecutiveErrors: 5
      interval: 5s
      maxEjectionPercent: 50
      minHealthPercent: 50
    tls:
      mode: ISTIO_MUTUAL
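For reference, the retry behaviour mentioned above comes from Istio’s HTTP retry policy on the VirtualService rather than from the DestinationRule. A minimal sketch of an explicit retry policy, assuming a VirtualService for the same app host; the resource name and values are illustrative, not the actual config from this issue:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: app
spec:
  hosts:
  - app
  http:
  - route:
    - destination:
        host: app
    retries:
      attempts: 3
      perTryTimeout: 2s
      # gateway-error retries 502/503/504 responses; connect-failure and
      # refused-stream cover requests that never reached the upstream.
      retryOn: gateway-error,connect-failure,refused-stream

Whether a given UC is actually retried depends on which retryOn conditions are configured.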
The frequency of these bothers me, because across a distributed service graph, we can get spikes in latency on client requests as a result.
Along with @silentdai’s PR, we also discovered that applications with a low socket timeout value terminate connections to Envoy more frequently, exacerbating the issue. I’ve written about it here: https://karlstoney.com/2019/05/31/istio-503s-ucs-and-tcp-fun-times
The TL;DR here is that @silentdai’s update (hopefully in 1.1.8), combined with an awareness and adjustment of applications with low TCP socket timeouts, resolves 99.99% of our 503s.
I’m going to close this issue now as it’s become a bit of a novel. I’m thankful to the whole Istio team for their efforts in getting to the bottom of this one, and I know there are more improvements to come from them which should make this less of a problem.
Do you have a Node.js application behind Envoy? Node.js has a 5-second default connection idle timeout, and Envoy has the same one. Occasionally this produces the error: Envoy sends a request at the exact moment Node.js is closing the connection. We resolved this with a DestinationRule:
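A minimal sketch of this kind of DestinationRule, assuming the same app host as above; the resource name and the 3s value are illustrative rather than the commenter’s actual config, and in practice this would be merged into the existing DestinationRule for the host. The idea is to set Envoy’s upstream idle timeout below the application’s keep-alive timeout, so Envoy rather than Node.js closes idle connections first:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: app-idle-timeout
spec:
  host: app
  trafficPolicy:
    connectionPool:
      http:
        # Close idle upstream connections before the application's 5s
        # keep-alive timeout fires, so Envoy never writes a request to a
        # connection the application is about to tear down.
        idleTimeout: 3s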
I’ve sent you the config dumps on Slack.
As an FYI: as I mentioned earlier on, we had issues with clients that were doing their own connection pooling; they would reach their connection pool limit and then die. My theory was that those clients were holding onto connections that had otherwise been UC’d, and those connections were never being released.
In one of those clients we implemented https://hc.apache.org/httpcomponents-core-ga/httpcore/apidocs/org/apache/http/impl/NoConnectionReuseStrategy.html, and we haven’t had any problems with that application since.
This was an application which had been migrated to Istio, and it works fine on the old infra.
I’m not sure what this tells us, but it does all feel related.
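On the mesh side, a rough analogue of the client-side NoConnectionReuseStrategy above is to cap connection reuse in a DestinationRule. A minimal sketch, again assuming the app host; the resource name is illustrative, and disabling reuse trades the connection race for extra connection-setup cost:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: app-no-reuse
spec:
  host: app
  trafficPolicy:
    connectionPool:
      http:
        # One request per upstream connection: connections are never
        # reused, so a pooled connection cannot be half-closed by the
        # other side while the next request is being written to it.
        maxRequestsPerConnection: 1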