test-infra: Test failures starting 22 September due to network timeouts
What happened:
On or around 22 September we started seeing CI failures in prow.
Example log from the error:
Step 4/8 : RUN apk add --no-cache curl && curl -LO https://storage.googleapis.com/kubernetes-release/release/${KUBE_VERSION}/bin/linux/${ARCH}/kubectl && chmod +x kubectl
---> Running in d7e388707d87
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 25m0s timeout","severity":"error","time":"2021-09-23T21:50:51Z"}
What you expected to happen:
Successful builds.
How to reproduce it (as minimally and precisely as possible):
This happens consistently on all of our builds. We’ve tried reverting commits in PRs and cannot find anything related to the test case that would cause this. The same tests running on the k8s-infra-prow-build cluster succeed.
We have also seen it fail on apt:
Connection timed out [IP: 151.101.194.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/u/util-linux/libblkid1_2.36.1-8_amd64.deb Connection timed out [IP: 199.232.126.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/libssl1.1_1.1.1k-1%2bdeb11u1_amd64.deb Connection timed out [IP: 151.101.2.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openssl/openssl_1.1.1k-1%2bdeb11u1_amd64.deb Connection timed out [IP: 151.101.194.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
We also tried setting and increasing CPU/memory requests in https://github.com/kubernetes/test-infra/pull/23723 and https://github.com/kubernetes/test-infra/pull/23725, but were unsuccessful.
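For context, those requests live on the test container in the prow job's pod spec. A minimal sketch of that kind of change, with the job name, image tag, and resource values chosen purely for illustration (they are not the values from the PRs above):

presubmits:
  kubernetes-sigs/secrets-store-csi-driver:
  - name: pull-secrets-store-csi-driver-e2e        # illustrative job name
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master   # illustrative tag
        command: ["runner.sh", "make", "e2e"]
        resources:
          requests:
            cpu: "4"          # illustrative values only
            memory: 8Gi
          limits:
            cpu: "4"
            memory: 8Gi

As noted above, bumping requests like this did not resolve the timeouts, which is consistent with the root cause being network-level rather than resource contention.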
Please provide links to example occurrences, if any:
- https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/744
- https://testgrid.k8s.io/sig-auth-secrets-store-csi-driver-periodic#secrets-store-csi-driver-image-scan
- https://github.com/kubernetes/ingress-nginx/pull/7689
- https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/ingress-nginx/7689/pull-ingress-nginx-e2e-helm-chart/1440950656234950656#1:build-log.txt%3A207
- https://testgrid.k8s.io/provider-gcp-compute-persistent-disk-csi-driver#ci-win2004-provider-gcp-compute-persistent-disk-csi-driver-migration
- https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#pr-e2e-eks-main
Anything else we need to know?:
Discussion thread in slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1632414457031600
About this issue
- State: closed
- Created 3 years ago
- Reactions: 8
- Comments: 25 (20 by maintainers)
Commits related to this issue
- fix: prepend iptables fixes for #23741 — committed to tam7t/test-infra by tam7t 3 years ago
- Merge pull request #23743 from tam7t/tam7t/ip-fix secret-store-csi-driver: prepend iptables fixes for #23741 — committed to kubernetes/test-infra by k8s-ci-robot 3 years ago
- Workaround prow CI failures with iptables change Add workaround for connection timeout to maven central. - ref https://github.com/kubernetes/test-infra/issues/23741 — committed to pierDipi/eventing-kafka-broker by pierDipi 3 years ago
- Workaround prow CI failures with iptables change (#1261) Add workaround for connection timeout to maven central. - ref https://github.com/kubernetes/test-infra/issues/23741 — committed to knative-extensions/eventing-kafka-broker by pierDipi 3 years ago
- Workaround prow CI failures with iptables change Add workaround for connection timeout to maven central. - ref https://github.com/kubernetes/test-infra/issues/23741 — committed to knative-prow-robot/eventing-kafka-broker by pierDipi 3 years ago
- Workaround prow CI failures with iptables change (#1263) Add workaround for connection timeout to maven central. - ref https://github.com/kubernetes/test-infra/issues/23741 Co-authored-by: Pierange... — committed to knative-extensions/eventing-kafka-broker by knative-prow-robot 3 years ago
- Workaround prow CI failures with iptables change (#1264) Add workaround for connection timeout to maven central. - ref https://github.com/kubernetes/test-infra/issues/23741 Co-authored-by: Pierange... — committed to knative-extensions/eventing-kafka-broker by knative-prow-robot 3 years ago
- Workaround prow CI failures with iptables change (#1265) Add workaround for connection timeout to maven central. - ref https://github.com/kubernetes/test-infra/issues/23741 Co-authored-by: Pierange... — committed to knative-extensions/eventing-kafka-broker by knative-prow-robot 3 years ago
- Add fix for `https://github.com/kubernetes/test-infra/issues/23741` — committed to acumino/etcd-druid by acumino a year ago
- Add `KIND` as a dependency for e2e test (#615) * Add `KIND` as a dependency for e2e test * Update kind node version * Add fix for `https://github.com/kubernetes/test-infra/issues/23741` * Dr... — committed to gardener/etcd-druid by acumino a year ago
- Add `KIND` as a dependency for e2e test (#615) * Add `KIND` as a dependency for e2e test * Update kind node version * Add fix for `https://github.com/kubernetes/test-infra/issues/23741` * Drop dep... — committed to abdasgupta/etcd-druid by acumino a year ago
Google has confirmed the issue in the support case:
After reviewing the information you provided, we believe that you may be affected by a known issue: we have identified a networking connectivity issue impacting GKE Docker workloads. This is a high-priority issue that we are working to resolve as soon as possible. Some customers may be experiencing connection failures from Docker workflows to Fastly destinations and may receive a timeout error.

I would like for us to roll this change back, but I'm wary that maybe we'll need it again someday.
So as to avoid a protracted rollout cycle, I'm planning on doing the following: add a BOOTSTRAP_MTU_WORKAROUND flag, defaulted in-image to true. This will let us roll changes out selectively or globally via job config changes (one PR) instead of propagating them through kubekins (bootstrap change PR + bootstrap bump PR + kubekins bump PR). The workaround can then be disabled where needed with env stanzas setting BOOTSTRAP_MTU_WORKAROUND=false, for example on the preset-dind-enabled preset.

I'm told there is a fix scheduled for Thursday.
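A minimal sketch of what that config-level switch could look like, assuming the standard prow presets format (the placement on the preset and the surrounding preset contents are assumptions; only the env entry is the point):

presets:
- labels:
    preset-dind-enabled: "true"
  env:
  # In-image default is true; flipping it here turns the workaround off
  # (or back on) for all opted-in jobs with a single config PR instead of
  # a bootstrap change + bootstrap bump + kubekins bump.
  - name: BOOTSTRAP_MTU_WORKAROUND
    value: "false"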
The current suggested workaround is to add an initContainer to the Docker-in-Docker workload like this:
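The original snippet is not preserved here; the following is a sketch of the kind of privileged initContainer being described, reconstructed from the iptables/MTU workaround referenced in the commits above (the container name, image, and exact MSS-clamping rule are assumptions):

initContainers:
- name: mtu-workaround              # illustrative name
  image: alpine:3.14                # assumption: any image that has, or can install, iptables
  securityContext:
    privileged: true                # needed to modify iptables rules in the pod's network namespace
  command:
  - sh
  - -c
  # Clamp TCP MSS to the path MTU so large packets to Fastly-backed mirrors are not dropped.
  - >-
    apk add --no-cache iptables &&
    iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN
    -j TCPMSS --clamp-mss-to-pmtu

Because all containers in a pod share one network namespace, a rule added by the init container also applies to the Docker-in-Docker containers that start afterwards.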
@alvaroaleman that was my question too; we're on GKE. We observed it in one zone on Friday (europe-west4-b) and excluded that zone from our egress.
Over the weekend it started in europe-west4-a and europe-west4-c as well, so it certainly feels like a change was being rolled out per zone.
We have an open ticket with Google but haven't had any confirmation yet (not convinced we will get any either, to be honest). Like you, whilst we have a workaround, I'd really like to understand why now.
https://status.cloud.google.com/incidents/QSirAFiyN5yMeeE6GNxq is listed as resolved
That is an educated but unverified guess based on prior failures with similar symptoms, fwiw.