istio: Intermittent broken pipe
Bug description
We are seeing
[Envoy (Epoch 0)] [2020-05-27 20:35:08.309][32][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:452] [C1340] idle timeout
in istio-proxy logs for TCP connections.
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
As this is a TCP connection, we shouldn’t be seeing these 60m timeouts from what I can see in the documentation - it looks as though the connection is being treated as an HTTP connection.
We upgraded recently from Istio 1.3.6 and we weren’t seeing these issues.
Steps to reproduce the bug
Here is the service we’re testing against:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  creationTimestamp: "2020-05-26T07:08:07Z"
  labels:
    app: abc-opensource
    chart: abc-opensource-0.29.0
    heritage: Tiller
    product: redis
    release: abc-redis
  name: abc-redis-opensource-announce-0
  namespace: abc
  resourceVersion: "29305795"
  selfLink: /api/v1/namespaces/abc/services/abc-redis-opensource-announce-0
  uid: e3629883-79fe-4304-8cb1-9b544ed152ad
spec:
  clusterIP: 1.2.3.4
  ports:
  - name: tcp-server
    port: 6379
    protocol: TCP
    targetPort: redis
  - name: tcp-sentinel
    port: 26379
    protocol: TCP
    targetPort: sentinel
  publishNotReadyAddresses: true
  selector:
    app: abc-opensource
    release: abc-redis
    statefulset.kubernetes.io/pod-name: abc-redis-opensource-server-0
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
Version (include the output of istioctl version --remote, kubectl version, and helm version if you used Helm)
client version: 1.5.4
egressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway-public version:
pilot version: 1.5.4
data plane version: 1.5.1 (32 proxies), 1.5.4 (173 proxies)
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.9", GitCommit:"a17149e1a189050796ced469dbd78d380f2ed5ef", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:51Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Client: &version.Version{SemVer:"v2.16.7", GitCommit:"5f2584fd3d35552c4af26036f0c464191287986b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.6", GitCommit:"dd2e5695da88625b190e6b22e9542550ab503a47", GitTreeState:"clean"}
The client uses Jedis. We see the issue when the client is disconnected from Sentinel:
Lost connection to Sentinel at abc-redis-opensource-announce-0:26379. Sleeping 5000ms and retrying.
The issue causing us the most pain, though, is when we ask the Redis client to make a new request. It seems Jedis creates a connection to Redis when it starts, and this connection is then timed out by Istio. This means that when any of our apps makes a new request through Jedis, it no longer has an established connection to Redis and needs to create a new one. We see
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Broken pipe (Write failed)
We are seeing similar issues between other apps but are finding them harder to replicate. We weren’t seeing these issues on 1.3.6, and we’re still not seeing them on other clusters running that version.
How was Istio installed?
Using the operator.
Environment where the bug was observed (cloud vendor, OS, etc)
EKS
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 23 (11 by maintainers)
I’m also having this 1 hr timeout issue with a Postgres client, using Istio 1.6.7.
We ran into the same problem on 1.7, but we noticed that the ISTIO_META_IDLE_TIMEOUT setting was only getting picked up on the OUTBOUND side of things, not the INBOUND side. By adding an additional filter that applied to the INBOUND side of the request, we were able to successfully increase the timeout (we used 24 hours). We also created a similar filter to apply to the passthrough cluster (so that timeouts still apply to external traffic that we don’t have service entries for), since the config wasn’t being picked up there either.
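An inbound filter of the kind described above might look roughly like the following sketch. This is not the commenter’s exact config: the name, namespace, workload selector labels, and the 24h value are assumptions; the pattern is an EnvoyFilter that merges an idle_timeout into the sidecar’s inbound TCP proxy.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: tcp-idle-timeout-inbound   # hypothetical name
  namespace: abc
spec:
  workloadSelector:
    labels:
      app: abc-opensource          # assumed selector for the Redis workload
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_INBOUND     # inbound side, where ISTIO_META_IDLE_TIMEOUT was not applied
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          idle_timeout: 24h        # raised from the 1h default, per the comment
```

A second copy of the same patch with the passthrough cluster’s filter chain matched would cover the external-traffic case the comment mentions.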
Here is my topology:
I am reducing the TCP proxy idle timeout to simulate the problem. It doesn’t matter whether it’s 3s, 1h, 24h, or 7d: if the client is idle for longer than the threshold we set, we get a broken pipe:
the first request succeeds, creating the TCP proxy
after the 3-second idle timeout, Envoy appears to close the connection, but only towards upstream Redis
the second request gets a 502
and somehow, after the second request fails, the third request succeeds in creating the TCP proxy
You can see the full log (service sidecar) here: echo-redis-03.log
My concern is that the Envoy sidecar only terminates the upstream connection (in this case, to Redis), but not the downstream connection to the application the sidecar is running alongside.
This causes a problem for us, since we need to configure all our applications to use a TCP idle timeout below Envoy’s. For example, with the default 1-hour Envoy TCP proxy idle timeout, we must make sure every application that connects to Redis/PostgreSQL (or anything else going through the TCP proxy) uses an idle timeout below 1 hour, so that the application initiates the TCP closing handshake. We receive lots of complaints from devs.
FYI:
Istio version: v1.9.9
@howardjohn do you know about this? did we miss something?
@Rayzhangtian We’re having issues with this for connections to Postgres (similar to @jsabalos). Could you please share the DestinationRule and keepalive (DR + KA) configs you used that solved the issue?
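For reference, a DestinationRule with TCP keepalive, the kind of “DR + KA” config being asked about here, might look like the following sketch. The host, time, interval, and probe count are assumptions (the referenced config was never posted); the idea is to have the kernel probe the connection well inside Envoy’s 1-hour idle timeout so the client notices a dead connection quickly.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: redis-keepalive              # hypothetical name
  namespace: abc
spec:
  host: abc-redis-opensource-announce-0.abc.svc.cluster.local  # assumed host
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 600s     # start probing after 10 minutes idle (assumed)
          interval: 60s  # probe every minute thereafter (assumed)
          probes: 3      # declare the connection dead after 3 failed probes
```

Note that kernel-level keepalive probes are not seen by Envoy as traffic, so this helps clients detect closed connections rather than preventing the idle timeout itself.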