istio: Intermittent broken pipe

Bug description

We are seeing the following in istio-proxy logs for TCP connections:

[Envoy (Epoch 0)] [2020-05-27 20:35:08.309][32][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:452] [C1340] idle timeout

Affected product area:

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

As these are TCP connections, we shouldn't be seeing these 60-minute timeouts according to the documentation; it looks as though the connection is being treated as an HTTP connection. We recently upgraded from Istio 1.3.6, where we weren't seeing these issues.

Steps to reproduce the bug

Here is the service we're testing against:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  creationTimestamp: "2020-05-26T07:08:07Z"
  labels:
    app: abc-opensource
    chart: abc-opensource-0.29.0
    heritage: Tiller
    product: redis
    release: abc-redis
  name: abc-redis-opensource-announce-0
  namespace: abc
  resourceVersion: "29305795"
  selfLink: /api/v1/namespaces/abc/services/abc-redis-opensource-announce-0
  uid: e3629883-79fe-4304-8cb1-9b544ed152ad
spec:
  clusterIP: 1.2.3.4
  ports:
  - name: tcp-server
    port: 6379
    protocol: TCP
    targetPort: redis
  - name: tcp-sentinel
    port: 26379
    protocol: TCP
    targetPort: sentinel
  publishNotReadyAddresses: true
  selector:
    app: abc-opensource
    release: abc-redis
    statefulset.kubernetes.io/pod-name: abc-redis-opensource-server-0
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

client version: 1.5.4
egressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway-public version:
pilot version: 1.5.4
data plane version: 1.5.1 (32 proxies), 1.5.4 (173 proxies)
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.9", GitCommit:"a17149e1a189050796ced469dbd78d380f2ed5ef", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:51Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Client: &version.Version{SemVer:"v2.16.7", GitCommit:"5f2584fd3d35552c4af26036f0c464191287986b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.6", GitCommit:"dd2e5695da88625b190e6b22e9542550ab503a47", GitTreeState:"clean"}

The client uses Jedis. We see the issue where the client is disconnected from Sentinel: Lost connection to Sentinel at abc-redis-opensource-announce-0:26379. Sleeping 5000ms and retrying. The issue causing us the most pain, though, is when our apps make a new request through the Redis client. Jedis creates a connection to Redis at startup, and that connection is then timed out by Istio. This means that when any of our apps makes a new request, Jedis no longer has an established connection to Redis and needs to create a new one. We see:

redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Broken pipe (Write failed)","rootCause":"Broken pipe (Write failed)"}]}}

We are seeing similar issues between other apps but are finding those harder to replicate. We weren't seeing these issues on 1.3.6, and we're still not seeing them on other clusters running that version.

How was Istio installed?

Using the operator.

Environment where bug was observed (cloud vendor, OS, etc)

EKS

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 23 (11 by maintainers)

Most upvoted comments

I’m also having this 1-hour timeout issue with a Postgres client, using Istio 1.6.7.

We ran into the same problem on 1.7, but we noticed that the ISTIO_META_IDLE_TIMEOUT setting was only getting picked up on the OUTBOUND side of things, not the INBOUND. By adding an additional filter applied to the INBOUND side of the request, we were able to successfully increase the timeout (we used 24 hours):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: listener-timeout-tcp
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          idle_timeout: 24h
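For context, the OUTBOUND-side ISTIO_META_IDLE_TIMEOUT mentioned above is usually injected as proxy metadata on the workload. A minimal sketch, assuming your Istio version honors proxyMetadata via the proxy.istio.io/config pod annotation (the 24h value is illustrative):

```yaml
# Sketch: place on the workload's pod template to raise the sidecar's
# idle timeout. Annotation support varies by Istio version.
metadata:
  annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        ISTIO_META_IDLE_TIMEOUT: 24h
```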

We also created a similar filter to apply to the passthrough cluster (so that timeouts still apply to external traffic that we don’t have service entries for), since the config wasn’t being picked up there either.
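The exact passthrough filter wasn't pasted in the thread; a hedged sketch of what such a filter could look like follows. Note that this broad match applies the merge to every outbound tcp_proxy filter (including the passthrough chain); narrowing it to only the passthrough filter chain requires version-specific match fields, so treat the match criteria as illustrative:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-timeout-tcp
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          idle_timeout: 24h
```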

Here is my topology:

user -> gateway -> service (sidecar istio-proxy) -> redis

I am reducing the TCP proxy idle timeout to simulate the problem. It doesn’t matter whether it’s 3s, 1h, 24h, or 7d: if the client is idle for longer than the threshold we set, it causes a broken pipe:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: idle-timeout-1.9-ef
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: ANY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
      proxy:
        proxyVersion: ^1\.9.*
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          idle_timeout: 3s

The first request succeeds and creates the TCP proxy session:

2021-11-02T05:14:13.260048Z	debug	envoy pool	[C47163] using existing connection
2021-11-02T05:14:13.260053Z	debug	envoy pool	[C47163] creating stream
2021-11-02T05:14:13.260066Z	debug	envoy router	[C47317][S9607723458839045519] pool ready
2021-11-02T05:14:13.260487Z	debug	envoy filter	original_dst: New connection accepted
2021-11-02T05:14:13.260556Z	debug	envoy filter	[C47318] new tcp proxy session
2021-11-02T05:14:13.260577Z	debug	envoy filter	[C47318] Creating connection to cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.260608Z	debug	envoy upstream	transport socket match, socket default selected for host with address 10.121.211.209:6379
2021-11-02T05:14:13.260621Z	debug	envoy upstream	Created host 10.121.211.209:6379.
2021-11-02T05:14:13.260661Z	debug	envoy upstream	addHost() adding 10.121.211.209:6379
2021-11-02T05:14:13.260667Z	debug	envoy pool	creating a new connection
2021-11-02T05:14:13.260759Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:13.260772Z	debug	envoy pool	[C47319] connecting
2021-11-02T05:14:13.260777Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.260762Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:13.260778Z	debug	envoy connection	[C47319] connecting to 10.121.211.209:6379
2021-11-02T05:14:13.260800Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.260857Z	debug	envoy connection	[C47319] connection in progress
2021-11-02T05:14:13.260871Z	debug	envoy pool	queueing request due to no available connections
2021-11-02T05:14:13.260876Z	debug	envoy conn_handler	[C47318] new connection
2021-11-02T05:14:13.260896Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:13.260907Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.262126Z	debug	envoy connection	[C47319] connected

After the 3-second idle timeout, Envoy closes the connection, but only toward the upstream Redis:

2021-11-02T05:14:16.262443Z	debug	envoy filter	[C47318] Session timed out
2021-11-02T05:14:16.262483Z	debug	envoy connection	[C47318] closing data_to_write=0 type=1
2021-11-02T05:14:16.262486Z	debug	envoy connection	[C47318] closing socket: 1
2021-11-02T05:14:16.262648Z	debug	envoy connection	[C47319] closing data_to_write=0 type=1
2021-11-02T05:14:16.262660Z	debug	envoy connection	[C47319] closing socket: 1
2021-11-02T05:14:16.262705Z	debug	envoy pool	[C47319] client disconnected
2021-11-02T05:14:16.262717Z	debug	envoy conn_handler	[C47318] adding to cleanup list
2021-11-02T05:14:16.262737Z	debug	envoy pool	[C47319] connection destroyed

The second request gets a 502:

2021-11-02T05:14:20.771115Z	debug	envoy pool	[C47171] using existing connection
2021-11-02T05:14:20.771120Z	debug	envoy pool	[C47171] creating stream
2021-11-02T05:14:20.771131Z	debug	envoy router	[C47187][S14376139327146836108] pool ready
2021-11-02T05:14:20.771559Z	debug	envoy router	[C47187][S14376139327146836108] upstream headers complete: end_stream=false
2021-11-02T05:14:20.771644Z	debug	envoy http	[C47187][S14376139327146836108] encoding headers via codec (end_stream=false):
':status', '502'
'date', 'Tue, 02 Nov 2021 05:14:20 GMT'
'content-length', '3'
'content-type', 'text/plain; charset=utf-8'
'x-envoy-upstream-service-time', '0'
'x-envoy-peer-metadata', '[REDACTED]'
'x-envoy-peer-metadata-id', 'sidecar~10.122.192.149~echo-redis-stable-5987d8dd44-kwxw8.testing~testing.svc.cluster.local'
'server', 'istio-envoy'

2021-11-02T05:14:20.771675Z	debug	envoy client	[C47171] response complete
2021-11-02T05:14:20.772082Z	debug	envoy filter	Called AuthenticationFilter : onDestroy
2021-11-02T05:14:20.772110Z	debug	envoy pool	[C47171] response complete
2021-11-02T05:14:20.772116Z	debug	envoy pool	[C47171] destroying stream: 0 remaining

And somehow, after the second request fails, the third request succeeds in creating the TCP proxy session:

2021-11-02T05:14:22.059301Z	debug	envoy pool	[C47171] using existing connection
2021-11-02T05:14:22.059305Z	debug	envoy pool	[C47171] creating stream
2021-11-02T05:14:22.059314Z	debug	envoy router	[C47170][S952113864182611469] pool ready
2021-11-02T05:14:22.059671Z	debug	envoy filter	original_dst: New connection accepted
2021-11-02T05:14:22.059743Z	debug	envoy filter	[C47327] new tcp proxy session
2021-11-02T05:14:22.059799Z	debug	envoy filter	[C47327] Creating connection to cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.059828Z	debug	envoy upstream	transport socket match, socket default selected for host with address 10.121.211.209:6379
2021-11-02T05:14:22.059835Z	debug	envoy upstream	Created host 10.121.211.209:6379.
2021-11-02T05:14:22.059858Z	debug	envoy pool	creating a new connection
2021-11-02T05:14:22.059886Z	debug	envoy upstream	addHost() adding 10.121.211.209:6379
2021-11-02T05:14:22.059905Z	debug	envoy pool	[C47328] connecting
2021-11-02T05:14:22.059914Z	debug	envoy connection	[C47328] connecting to 10.121.211.209:6379
2021-11-02T05:14:22.059973Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:22.059986Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:22.060000Z	debug	envoy connection	[C47328] connection in progress
2021-11-02T05:14:22.060012Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.059988Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.060014Z	debug	envoy pool	queueing request due to no available connections
2021-11-02T05:14:22.060059Z	debug	envoy conn_handler	[C47327] new connection
2021-11-02T05:14:22.060089Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:22.060104Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.061287Z	debug	envoy connection	[C47328] connected
2021-11-02T05:14:22.061311Z	debug	envoy pool	[C47328] assigning connection
2021-11-02T05:14:22.061328Z	debug	envoy filter	[C47327] TCP:onUpstreamEvent(), requestedServerName: 
2021-11-02T05:14:22.062214Z	debug	envoy router	[C47170][S952113864182611469] upstream headers complete: end_stream=false
2021-11-02T05:14:22.062284Z	debug	envoy http	[C47170][S952113864182611469] encoding headers via codec (end_stream=false):
':status', '200'

you can see the full log (service sidecar) here: echo-redis-03.log

My concern is why the Envoy sidecar only terminates the connection to the upstream (in this case Redis), but not to the downstream itself (the application the sidecar runs alongside).

This causes us a problem, since we need to configure all our applications to use a TCP idle timeout below the Envoy one. For example, with the default 1-hour Envoy TCP proxy idle timeout, we must make sure every application that connects to Redis, PostgreSQL, or anything else over a TCP proxy uses an idle timeout below 1 hour, so that the application initiates the TCP closing handshake. We receive lots of complaints from devs.

FYI:

  • This issue occurs not only with Redis, but also with PostgreSQL, or any connection that uses TCP
  • If your client keeps sending keepalives upstream, this issue will not occur, since the connection is never idle (our MongoDB client keeps sending traffic to the upstream MongoDB and we don’t see this problem)
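As an illustration of the keepalive point above, enabling TCP keepalive on the client socket is the usual way to keep a connection non-idle at the proxy. This is a minimal sketch, not the poster's actual client code; note that the kernel's default probe interval (often 2 hours on Linux) can still exceed Envoy's 1-hour idle timeout, so the OS setting or application-level pings also matter:

```java
import java.net.Socket;
import java.net.SocketException;

public class KeepAliveDemo {
    public static void main(String[] args) throws SocketException {
        Socket socket = new Socket();
        // Ask the kernel to send TCP keepalive probes on this connection.
        // The probe schedule is governed by OS settings (e.g. Linux
        // net.ipv4.tcp_keepalive_time, default 7200s), which must be below
        // the proxy's idle timeout for this to help.
        socket.setKeepAlive(true);
        System.out.println("keepAlive=" + socket.getKeepAlive());
    }
}
```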

Istio version: v1.9.9

@howardjohn do you know about this? Did we miss something?

@Rayzhangtian We’re having issues with this and with connections to Postgres (similar to @jsabalos). Could you please share the DestinationRule + keepalive configs that solved the issue for you?
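For anyone else following along, a DestinationRule-based TCP keepalive generally takes the shape sketched below. The name, host, and durations are illustrative, not the configuration being asked about:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: redis-tcp-keepalive
  namespace: abc
spec:
  host: abc-redis-opensource-announce-0.abc.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 600s     # idle time before the first probe; keep well below the proxy idle_timeout
          interval: 75s  # gap between unacknowledged probes
          probes: 9      # failed probes before the connection is declared dead
```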