gvisor: External HTTP Request Failure Rate 0.7%
Description
We have been experiencing an issue where HTTP connections that reuse a TCP connection (keep-alive connections) fail when talking to external services. We have seen sporadic timeout, connection reset, and unexpected end-of-file (EOF) errors when deployments running on gVisor with istio-proxy sidecars talk to external endpoints such as Shopify, Cloudflare, PyPI, and others.
To investigate further, a Cloudflare worker was set up as a dummy API, which we then called from both k6 (Go) and Node.js test scripts. These tests showed that 0.7-0.9% of requests fail, mostly with timeout errors and the occasional unexpected EOF or connection reset.
While breakpointing runsc with Delve, we noticed that k6 reuses the same TCP connection for successive requests by default, whereas Node.js does not. When we configured both tests to reuse connections, we started seeing the timeout issues mentioned above; with HTTP keep-alive disabled, the tests succeeded with no failures. With connection reuse enabled, failures would start after a short period, roughly two minutes.
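For reference, this is roughly how connection reuse is toggled on the Node.js side; the real test lives in `app/nodejs-test`, so treat the snippet below as a minimal sketch of the idea rather than the actual script (the `REUSE_CONNECTIONS` variable name is hypothetical).

```js
// Minimal sketch (not the actual app/nodejs-test script): toggling HTTP
// keep-alive decides whether successive requests reuse one TCP connection.
const https = require('https');

// keepAlive: true  -> reuse the same TCP connection (failures appear after ~2 min)
// keepAlive: false -> a fresh connection per request (no failures observed)
const reuseConnections = process.env.REUSE_CONNECTIONS === 'true'; // hypothetical env var
const agent = new https.Agent({ keepAlive: reuseConnections });

function callWorker(url) {
  return new Promise((resolve, reject) => {
    https.get(url, { agent }, (res) => {
      res.resume(); // drain the body so the socket can be reused
      res.on('end', () => resolve(res.statusCode));
    }).on('error', reject);
  });
}
```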
The problem above sounds somewhat similar to an issue we previously filed (https://github.com/google/gvisor/issues/6317), where raw TCP connections to a database would fail, although after a longer period of time (20+ minutes vs. roughly 2 minutes for this issue).
The behaviour is consistent across multiple Kubernetes versions, Istio versions, containerd versions, cloud providers, k3s, and bare metal. We also tried the same setup on Cloud Run and saw the same failure rate.
We are happy to demonstrate the issue live over Google Meet/Zoom if need be.
Steps to reproduce
- Provision a new cluster with gVisor installed
- Deploy Istio with `istio/deploy.sh`
- Label a namespace for testing with `istio-injection: enabled`
- [Optional] Build the two apps with `app/k6-test/build.sh <tag>` and `app/nodejs-test/build.sh <tag>`, providing the argument `<name>:<tag>` to tag the image with (some prebuilt images are already provided)
- Deploy the apps with `app/deploy-k6.sh <url> <reuse-connections>` and `app/deploy-nodejs.sh <url> <reuse-connections>`
  - `<url>` is the Cloudflare worker we have deployed for testing; we will send the URL over private message
  - `<reuse-connections>` sets whether the script will reuse connections or not (`true`/`false`); see the k6 sketch after this list
  - An optional third argument can be used to specify a different image tag; unless one is given, the scripts use two public images built beforehand and published to Docker Hub, `champgoblem/gvisor-k6-test:latest` and `champgoblem/gvisor-nodejs-test:latest`
  - Each deployment also has a set of configurable environment variables that are detailed in each of the `deploy.sh` scripts
- Wait for the connection-reuse pods to fail
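As mentioned above, the `<reuse-connections>` flag maps onto k6's connection handling; k6 exposes this through its `noConnectionReuse` option. The sketch below is a hedged illustration of how such a flag could be wired up, not the actual `app/k6-test` script (the `REUSE_CONNECTIONS` and `WORKER_URL` environment variable names are hypothetical).

```js
// Minimal k6 sketch (not the actual app/k6-test script).
import http from 'k6/http';
import { check, sleep } from 'k6';

// k6 reuses TCP connections by default; noConnectionReuse: true forces a
// fresh connection per request, which is the configuration that did not fail.
export const options = {
  vus: 1,
  duration: '10m',
  noConnectionReuse: __ENV.REUSE_CONNECTIONS !== 'true', // hypothetical env var
};

export default function () {
  // __ENV.WORKER_URL stands in for the private Cloudflare worker endpoint.
  const res = http.get(__ENV.WORKER_URL);
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```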
The Cloudflare worker code is included in the setup scripts as `worker/cloudflare-worker.js` for analysis/recreation.
The Cloudflare worker endpoint has been omitted from the setup scripts so as not to advertise it publicly; we will happily send it over when needed.
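For readers without the attachment, a dummy-API worker of this kind typically looks something like the following; this is a hypothetical sketch, not the attached `worker/cloudflare-worker.js`.

```js
// Hypothetical sketch of a dummy-API Cloudflare worker; the real code is in
// worker/cloudflare-worker.js from the attached setup scripts.
export default {
  async fetch(request) {
    // Echo a small JSON payload so the client can verify the response body.
    const body = JSON.stringify({
      ok: true,
      method: request.method,
      url: request.url,
      time: Date.now(),
    });
    return new Response(body, {
      headers: { 'content-type': 'application/json' },
    });
  },
};
```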
The debug scripts are publicly available at https://github.com/northflank/http-connection-failure-gvisor or in the attached `submission.zip`.
runsc version
- Latest on the `go` and `main` branches
- `google-380056102`
- And potentially more
docker version (if using docker)
Containerd versions:
- `1.4.3`
- `1.4.8`
- And potentially more
uname
Linux 5.4.104+ #1 SMP Mon Jun 7 21:53:49 PDT 2021 x86_64 Intel® Xeon® CPU @ 2.00GHz GenuineIntel GNU/Linux
kubectl (if using Kubernetes)
K8s versions:
- `v1.20.8-gke.900`
- `v1.21.3+k3s1`
Istio Versions:
- `1.9.5`
- `1.9.6`
- `1.9.7`
- `1.10.3`
repo state (if built from source)
No response
runsc debug logs (if available)
No response
About this issue
- State: closed
- Created 3 years ago
- Comments: 18 (10 by maintainers)
Commits related to this issue
- Add an integration test for istio like redirect. Updates #6441,#6317 PiperOrigin-RevId: 403217628 — committed to google/gvisor by hbhasker 3 years ago
- Add an integration test for istio like redirect. Updates #6441,#6317 PiperOrigin-RevId: 404872327 — committed to google/gvisor by hbhasker 3 years ago
- POC: Handle NAT entry clashes. A host can reuse a port within the timeout period of a NAT entry which is closed state in the IPTable. In such cases the new SYN will get dropped because it the previou... — committed to google/gvisor by hbhasker 3 years ago
NOTE: I opened an issue for a bug I observed in our tcp_conntrack which may also cause some of these odd behaviours.
See: https://github.com/google/gvisor/issues/6734
@DeciderWill Would you be able to provide a pcap of the traffic where this problem reproduces? It would be great if we could get both a client and a server pcap.
I have asked Kevin to take a look at this. In the meantime, I have started investigating https://github.com/google/gvisor/issues/6317 and will post an update on that; it seems to be unrelated to iptables and is probably a bug in our keep-alive implementation.
@DeciderWill Sorry this fell off my radar. I will take a look today and see if I can repro the issue.
I will also sync with @kevinGC and @nybidari and post updates once I have something concrete to share.