gvisor: External HTTP Request Failure Rate 0.7%

Description

We have been experiencing an issue where HTTP connections that reuse a TCP connection (keep-alive connections) fail when talking to external services. We see sporadic timeout, connection reset, and unexpected end-of-file (EOF) errors when deployments running on gVisor with istio-proxy sidecars talk to external endpoints such as Shopify, Cloudflare, PyPI, and others.

To investigate further, we set up a Cloudflare worker acting as a dummy API, which we then called from both k6 (Golang) and Node.js test scripts. From these tests we determined that 0.7-0.9% of requests fail, mostly with timeout errors and the occasional unexpected EOF or connection reset.
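For context, the k6 side of the test is essentially a tight request loop against the worker endpoint. The following is only a rough sketch, not the exact script from the repo; TARGET_URL and REUSE_CONNECTIONS are placeholder names we are using here for the values that the deploy scripts configure.

```typescript
// Rough sketch of the k6 test loop (not the exact script from the repo).
// __ENV.TARGET_URL and __ENV.REUSE_CONNECTIONS are placeholder names.
import http from "k6/http";
import { check } from "k6";

export const options = {
  vus: 1,
  duration: "10m",
  // k6 reuses TCP connections (keep-alive) by default; setting
  // noConnectionReuse: true forces a fresh connection per request.
  noConnectionReuse: __ENV.REUSE_CONNECTIONS === "false",
};

export default function () {
  const res = http.get(__ENV.TARGET_URL, { timeout: "10s" });
  check(res, { "status is 200": (r) => r.status === 200 });
}
```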

Breakpointing runsc with Delve, we noticed that k6 reuses the same TCP connection for successive requests by default, whereas Node.js does not. When we configured both tests to reuse connections, we started to see the timeout issues mentioned above; when we disabled HTTP keep-alive, the tests completed with no failures. With connection reuse enabled, failures began after a short period, roughly two minutes.
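To make the reuse toggle concrete on the Node.js side, connection reuse comes down to whether the HTTP agent keeps sockets alive. Below is a minimal sketch under the same placeholder names as above (TARGET_URL, REUSE_CONNECTIONS); it is not the repo's exact script.

```typescript
// Minimal sketch of the Node.js test loop (not the repo's exact script).
// TARGET_URL and REUSE_CONNECTIONS are placeholder environment variables.
import https from "node:https";

const target = process.env.TARGET_URL ?? "https://worker.example/";
const reuse = process.env.REUSE_CONNECTIONS === "true";

// keepAlive: true makes the agent hold TCP connections open and reuse them
// for successive requests, matching k6's default behaviour.
const agent = new https.Agent({ keepAlive: reuse, maxSockets: 1 });

let total = 0;
let failures = 0;

function hit(): Promise<void> {
  return new Promise((resolve) => {
    const req = https.get(target, { agent, timeout: 10_000 }, (res) => {
      res.resume(); // drain the body so the socket can be reused
      res.on("end", resolve);
    });
    req.on("timeout", () => req.destroy(new Error("timeout")));
    req.on("error", () => { failures++; resolve(); });
  });
}

async function main(): Promise<void> {
  for (;;) {
    total++;
    await hit();
    if (total % 100 === 0) {
      console.log(`failed ${failures}/${total} (${((failures / total) * 100).toFixed(2)}%)`);
    }
  }
}

main();
```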

The problem described above sounds somewhat similar to an issue we previously filed (https://github.com/google/gvisor/issues/6317), where raw TCP connections to a database would fail, but after a longer period of time (20+ minutes vs. roughly 2 minutes for this issue).

The behaviour is consistent across multiple Kubernetes versions, Istio versions, containerd versions, cloud providers, k3s, and bare metal. We also tried the same setup on Cloud Run and observed the same failure rate.

We are happy to demonstrate the issue live over Google Meet/Zoom if needed.

Steps to reproduce

  • Provision a new cluster with gVisor installed
  • Deploy istio with istio/deploy.sh
  • Label a namespace for testing with istio-injection: enabled
  • [Optional] Build the two apps with app/k6-test/build.sh <tag> and app/nodejs-test/build.sh <tag>, passing <name>:<tag> as the argument to tag the image (some prebuilt images are already provided)
  • Deploy the apps with app/deploy-k6.sh <url> <reuse-connections> and app/deploy-nodejs.sh <url> <reuse-connections>
    • <url> is the Cloudflare worker we have deployed for testing; we will send the URL over private message
    • <reuse-connections> sets whether the script reuses connections (true/false)
    • An optional third argument can be used to specify a different tag
    • The deploy scripts use two prebuilt public images published to Docker Hub, champgoblem/gvisor-k6-test:latest and champgoblem/gvisor-nodejs-test:latest, unless a different image is specified
    • Each deployment also has a set of configurable environment variables that are detailed in each deploy.sh script
  • Wait for the reuse connection pods to fail

The Cloudflare worker code is included in the setup scripts at worker/cloudflare-worker.js for analysis/recreation.
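For readers without access to the repo, a dummy API of this kind can be very small. The following is an illustration only, assuming a plain echo-style endpoint; the actual worker/cloudflare-worker.js may differ.

```typescript
// Illustrative minimal dummy API in the style of worker/cloudflare-worker.js;
// the actual worker in the repo may differ.
export default {
  async fetch(_request: Request): Promise<Response> {
    // Return a small JSON body so the client has something to read and the
    // connection remains eligible for keep-alive reuse.
    return new Response(JSON.stringify({ ok: true, ts: Date.now() }), {
      headers: { "content-type": "application/json" },
    });
  },
};
```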

The Cloudflare worker endpoint has been omitted from the setup scripts so as not to advertise it publicly; we will happily send it over when needed.

The debug scripts are available publicly at https://github.com/northflank/http-connection-failure-gvisor or in the attached zip (submission.zip).

runsc version

- Latest on go and main branches
- `google-380056102`
- And potentially more

docker version (if using docker)

Containerd versions:
- `1.4.3`
- `1.4.8`
- And potentially more

uname

Linux 5.4.104+ #1 SMP Mon Jun 7 21:53:49 PDT 2021 x86_64 Intel® Xeon® CPU @ 2.00GHz GenuineIntel GNU/Linux

kubectl (if using Kubernetes)

K8s versions:
- `v1.20.8-gke.900`
- `v1.21.3+k3s1`

Istio Versions:
- 1.9.5
- 1.9.6
- 1.9.7
- 1.10.3

repo state (if built from source)

No response

runsc debug logs (if available)

No response

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

NOTE: I opened an issue for a bug I observed in our tcp_conntrack which may also cause some of these odd behaviours.

See: https://github.com/google/gvisor/issues/6734

@DeciderWill Would you be able to provide a pcap of the traffic where this problem reproduces? It would be great if we could get both a client and a server pcap.

I have asked Kevin to take a look at this. In the meantime I have started investigating https://github.com/google/gvisor/issues/6317 and will post an update on that; it seems to be unrelated to iptables and is probably a bug in our keep-alive implementation.

@DeciderWill Sorry this fell off my radar. I will take a look today and see if I can repro the issue.

I will also sync with @kevinGC and @nybidari and post updates once I have something concrete to share.