test-infra: Test failures starting 22 September due to network timeouts

What happened:

On or around 22 September we started seeing CI failures in prow:

Example log from a failing run:

Step 4/8 : RUN apk add --no-cache curl &&     curl -LO https://storage.googleapis.com/kubernetes-release/release/${KUBE_VERSION}/bin/linux/${ARCH}/kubectl &&     chmod +x kubectl
 ---> Running in d7e388707d87
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 25m0s timeout","severity":"error","time":"2021-09-23T21:50:51Z"}

Ref run: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_secrets-s[…]e-csi-driver-image-scan/1441072726373568512/build-log.txt

What you expected to happen:

Successful builds.

How to reproduce it (as minimally and precisely as possible):

This happens consistently on all of our builds. We’ve tried reverting commits in PRs and cannot find anything related to the test case that would cause this. The same tests running on the k8s-infra-prow-build cluster succeed.

We have also seen it fail on apt:

  Connection timed out [IP: 151.101.194.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/u/util-linux/libblkid1_2.36.1-8_amd64.deb  Connection timed out [IP: 199.232.126.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/libssl1.1_1.1.1k-1%2bdeb11u1_amd64.deb  Connection timed out [IP: 151.101.2.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/openssl_1.1.1k-1%2bdeb11u1_amd64.deb  Connection timed out [IP: 151.101.194.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

We also tried setting and increasing CPU/memory requests in https://github.com/kubernetes/test-infra/pull/23723 and https://github.com/kubernetes/test-infra/pull/23725, but that did not help.

Please provide links to example occurrences, if any:

Anything else we need to know?:

Discussion thread in slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1632414457031600

Most upvoted comments

Google has confirmed the issue in the support case: "After reviewing the information you provided, we believe that you may be affected by a known issue: we have identified a networking connectivity issue impacting GKE Docker workloads. This is a high-priority issue that we're working to resolve as soon as possible. Some customers may be experiencing connection failures in Docker workflows to Fastly destinations and may receive a timeout error."

I would like for us to roll this change back, but I’m wary that maybe we’ll need it again someday.

To avoid a protracted rollout cycle, I'm planning on doing the following:

  • gate the workaround behind an env var BOOTSTRAP_MTU_WORKAROUND, defaulted in-image to true… this will let us roll out changes selectively or globally via job config changes (1 PR) instead of propagating through kubekins (bootstrap change PR + bootstrap bump PR + kubekins bump PR)
  • selectively disable the workaround on jobs that were broken before the workaround was in place, via job config changes (add env stanzas setting BOOTSTRAP_MTU_WORKAROUND=false; see the sketch after this list)
  • disable the workaround for all jobs by setting BOOTSTRAP_MTU_WORKAROUND=false in the preset-dind-enabled preset
  • default the workaround to false in-image
  • …leaving us the option to enable the workaround by default using the same “env var in preset” approach if this problem surfaces again
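
For illustration, a rough, hypothetical sketch of the second and third bullets above; the job name, image tag, and surrounding fields are placeholders, not real test-infra entries:

# Hypothetical per-job override: keep the workaround off for one job.
periodics:
- name: example-dind-job                                        # placeholder job name
  labels:
    preset-dind-enabled: "true"
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # placeholder tag
      env:
      - name: BOOTSTRAP_MTU_WORKAROUND
        value: "false"

# Hypothetical preset change: turn the workaround off for every job that
# opts into the dind preset.
presets:
- labels:
    preset-dind-enabled: "true"
  env:
  - name: BOOTSTRAP_MTU_WORKAROUND
    value: "false"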

I’m told there is a fix scheduled for Thursday.

The current suggested workaround is to add an initContainer to the Docker-in-Docker workload, like this:

initContainers:
  - name: workaround
    image: k8s.gcr.io/build-image/debian-iptables-amd64:buster-v1.6.7
    command:
    - sh
    - -c
    # Clamp the TCP MSS advertised on outgoing SYN packets to the path MTU.
    - "iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
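
For a Prow-managed job, that stanza goes in the job's pod spec. A minimal, hypothetical sketch of the placement (the job name and images are placeholders, not real entries):

periodics:
- name: example-dind-job          # placeholder job name
  labels:
    preset-dind-enabled: "true"
  spec:
    initContainers:
    - name: workaround
      image: k8s.gcr.io/build-image/debian-iptables-amd64:buster-v1.6.7
      command:
      - sh
      - -c
      # Same MSS-clamping rule as shown above.
      - "iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu"
      securityContext:
        privileged: true
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # placeholder; the job's usual container and command go here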

@alvaroaleman that was my question too; we're on GKE. We observed it in one zone on Friday (europe-west4-b) and excluded that zone from our egress.

Over the weekend it started in west4-a and west4-c as well, so it certainly feels like a change was being rolled out per-zone.

We have an open ticket with Google but haven't had any confirmation yet (not convinced we will get any either, to be honest). Like you, whilst we have a workaround, I'd really like to understand why now.

@alvaroaleman reports that setting TCP MSS clamping works around the issue, i.e. executing the clamp once in the pod (the iptables rule shown in the initContainer above).

That is an educated but unverified guess based on prior failures with similar symptoms, fwiw.