test-infra: Test failures starting 22 September due to network timeouts

What happened:

On or around 22 September we started seeing CI failures in prow:

Example log from a failing run:

Step 4/8 : RUN apk add --no-cache curl &&     curl -LO https://storage.googleapis.com/kubernetes-release/release/${KUBE_VERSION}/bin/linux/${ARCH}/kubectl &&     chmod +x kubectl
 ---> Running in d7e388707d87
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 25m0s timeout","severity":"error","time":"2021-09-23T21:50:51Z"}

Ref run: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_secrets-s[…]e-csi-driver-image-scan/1441072726373568512/build-log.txt

What you expected to happen:

Successful builds.

How to reproduce it (as minimally and precisely as possible):

This happens consistently on all of our builds. We’ve tried reverting commits in PRs and cannot find anything related to the test case that would cause this. The same tests running on the k8s-infra-prow-build cluster succeed.

We have also seen it fail on apt:

  Connection timed out [IP: 151.101.194.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/u/util-linux/libblkid1_2.36.1-8_amd64.deb  Connection timed out [IP: 199.232.126.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/libssl1.1_1.1.1k-1%2bdeb11u1_amd64.deb  Connection timed out [IP: 151.101.2.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/openssl_1.1.1k-1%2bdeb11u1_amd64.deb  Connection timed out [IP: 151.101.194.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

We also tried setting and increasing CPU/memory requests in https://github.com/kubernetes/test-infra/pull/23723 and https://github.com/kubernetes/test-infra/pull/23725, but that did not help.

Please provide links to example occurrences, if any:

Anything else we need to know?:

Discussion thread in slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1632414457031600

Most upvoted comments

Google has confirmed the issue in the support case: "After reviewing the information you provided, we believe that you may be affected by a known issue: we have identified a networking connectivity issue impacting GKE Docker workloads. This is a high-priority issue that we're working to resolve as soon as possible. Some customers may be experiencing connection failures in Docker workflows to Fastly destinations and may receive a timeout error."

I would like for us to roll this change back, but I’m wary that maybe we’ll need it again someday.

To avoid a protracted rollout cycle, I'm planning on doing the following:

  • gate the workaround behind an env var BOOTSTRAP_MTU_WORKAROUND, defaulted in-image to true… this will let us roll out changes selectively or globally via job config changes (1 PR) instead of propagating through kubekins (bootstrap change PR + bootstrap bump PR + kubekins bump PR)
  • selectively disable the workaround on jobs that were broken before the workaround was in place, via job config changes (add env stanzas setting BOOTSTRAP_MTU_WORKAROUND=false; see the sketch after this list)
  • disable the workaround for all jobs by setting BOOTSTRAP_MTU_WORKAROUND=false in the preset-dind-enabled preset
  • default the workaround to false in-image
  • …leaving us the option to enable the workaround by default using the same “env var in preset” approach if this problem surfaces again
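
For illustration, a rough, hypothetical sketch of the second and third bullets above; the job name, image tag, and surrounding fields are placeholders, not real test-infra entries:

# Hypothetical per-job override: keep the workaround off for one job.
periodics:
- name: example-dind-job                                        # placeholder job name
  labels:
    preset-dind-enabled: "true"
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # placeholder tag
      env:
      - name: BOOTSTRAP_MTU_WORKAROUND
        value: "false"

# Hypothetical preset change: turn the workaround off for every job that
# opts into the dind preset.
presets:
- labels:
    preset-dind-enabled: "true"
  env:
  - name: BOOTSTRAP_MTU_WORKAROUND
    value: "false"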

I’m told there is a fix scheduled for Thursday.

The current suggested workaround is to add an initContainer to the Docker-in-Docker workload, like this:

initContainers:
  - name: workaround
    image: k8s.gcr.io/build-image/debian-iptables-amd64:buster-v1.6.7
    command:
    - sh
    - -c
    # Clamp the TCP MSS advertised on outgoing SYN packets to the path MTU.
    - "iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
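
For a Prow-managed job, that stanza goes in the job's pod spec. A minimal, hypothetical sketch of the placement (the job name and images are placeholders, not real entries):

periodics:
- name: example-dind-job          # placeholder job name
  labels:
    preset-dind-enabled: "true"
  spec:
    initContainers:
    - name: workaround
      image: k8s.gcr.io/build-image/debian-iptables-amd64:buster-v1.6.7
      command:
      - sh
      - -c
      # Same MSS-clamping rule as shown above.
      - "iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
      securityContext:
        privileged: true
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # placeholder; the job's usual container and command go here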

@alvaroaleman that was my question too; we're on GKE. We observed it in one zone on Friday (europe-west4-b) and excluded that zone from our egress.

Over the weekend it started in west4-a and west4-c as well, so it certainly feels like a change was being rolled out per-zone.

We have an open ticket with Google but haven't had any confirmation yet (not convinced we will get any either, to be honest). Like you, whilst we have a workaround, I'd really like to understand why now.

@alvaroaleman reports that setting TCP MSS clamping works around the issue, i.e. executing the clamp once in the pod (the iptables rule shown in the initContainer above).

That is an educated but unverified guess based on prior failures with similar symptoms, fwiw.