kubernetes: [Flaky test][sig-node] Probing container should not be restarted with a exec "cat /tmp/health"

Which jobs are flaking?

master-informing gce-ubuntu-master-default

Which tests are flaking?

  • kubetest.Test
  • Kubernetes e2e suite.[sig-node] Probing container should not be restarted with a exec "cat /tmp/health" liveness probe [NodeConformance] [Conformance]

Since when has it been flaking?

  • probing test: 11/18/2021
  • kubetest: 11/16/2021

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-ubuntu-master-default

Reason for failure (if possible)

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:630
Nov 19 20:09:51.121: pod container-probe-6479/busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da - expected number of restarts: 0, found restarts: 1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/node/container_probe.go:150
I1119 20:09:39.290] Command stdout:
I1119 20:09:39.290] 
I1119 20:09:39.291] stderr:
I1119 20:09:39.291] + echo hostName
I1119 20:09:39.291] + nc -v -t -w 2 endpoint-test2 80
I1119 20:09:39.291] nc: connect to endpoint-test2 port 80 (tcp) timed out: Operation in progress
I1119 20:09:39.291] command terminated with exit code 1
I1119 20:09:39.291] 
I1119 20:09:39.291] error:
I1119 20:09:39.291] exit status 1
I1119 20:09:39.291] Retrying...
[... 6 lines skipped ...]
I1119 20:09:39.293] Command stdout:
I1119 20:09:39.293] 
I1119 20:09:39.293] stderr:
I1119 20:09:39.293] + nc -v -t -w 2 10.0.73.65 80
I1119 20:09:39.293] + echo hostName
I1119 20:09:39.294] nc: connect to 10.0.73.65 port 80 (tcp) timed out: Operation in progress
I1119 20:09:39.294] command terminated with exit code 1
I1119 20:09:39.294] 
I1119 20:09:39.294] error:
I1119 20:09:39.294] exit status 1
I1119 20:09:39.294] Retrying...
[... 3017 lines skipped ...]
I1119 20:10:00.292] STEP: Creating pod busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da in namespace container-probe-6479
I1119 20:10:00.293] Nov 19 20:05:50.749: INFO: Started pod busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da in namespace container-probe-6479
I1119 20:10:00.293] STEP: checking the pods current state and verifying that restartCount is present
I1119 20:10:00.293] Nov 19 20:05:50.789: INFO: Initial restart count of pod busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da is 0
I1119 20:10:00.293] Nov 19 20:08:15.133: INFO: Restart count of pod container-probe-6479/busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da is now 1 (2m24.343003695s elapsed)
I1119 20:10:00.294] Nov 19 20:09:51.121: FAIL: pod container-probe-6479/busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da - expected number of restarts: 0, found restarts: 1
I1119 20:10:00.294] 
I1119 20:10:00.294] Full Stack Trace
I1119 20:10:00.294] k8s.io/kubernetes/test/e2e/common/node.glob..func2.5()
I1119 20:10:00.294] 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/node/container_probe.go:150 +0x165
I1119 20:10:00.294] k8s.io/kubernetes/test/e2e.RunE2ETests(0x23f5d57)
[... 11 lines skipped ...]
I1119 20:10:00.296] STEP: Found 7 events.
I1119 20:10:00.296] Nov 19 20:09:52.039: INFO: At 2021-11-19 20:05:46 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {default-scheduler } Scheduled: Successfully assigned container-probe-6479/busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da to bootstrap-e2e-minion-group-ls45
I1119 20:10:00.297] Nov 19 20:09:52.039: INFO: At 2021-11-19 20:05:47 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {kubelet bootstrap-e2e-minion-group-ls45} Pulled: Container image "k8s.gcr.io/e2e-test-images/busybox:1.29-2" already present on machine
I1119 20:10:00.297] Nov 19 20:09:52.039: INFO: At 2021-11-19 20:05:47 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {kubelet bootstrap-e2e-minion-group-ls45} Created: Created container busybox
I1119 20:10:00.297] Nov 19 20:09:52.039: INFO: At 2021-11-19 20:05:48 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {kubelet bootstrap-e2e-minion-group-ls45} Started: Started container busybox
I1119 20:10:00.297] Nov 19 20:09:52.039: INFO: At 2021-11-19 20:07:41 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {kubelet bootstrap-e2e-minion-group-ls45} Unhealthy: Liveness probe failed: command "cat /tmp/health" timed out
I1119 20:10:00.298] Nov 19 20:09:52.039: INFO: At 2021-11-19 20:07:41 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {kubelet bootstrap-e2e-minion-group-ls45} Killing: Container busybox failed liveness probe, will be restarted
I1119 20:10:00.298] Nov 19 20:09:52.040: INFO: At 2021-11-19 20:09:51 +0000 UTC - event for busybox-e59f9c20-65fd-4879-900a-3e5cfcfb43da: {kubelet bootstrap-e2e-minion-group-ls45} Killing: Stopping container busybox
I1119 20:10:00.298] Nov 19 20:09:52.262: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
I1119 20:10:00.298] Nov 19 20:09:52.262: INFO: 
I1119 20:10:00.298] Nov 19 20:09:52.660: INFO: 
[... 191 lines skipped ...]
I1119 20:10:00.393] Latency metrics for node bootstrap-e2e-minion-group-m5sl
I1119 20:10:00.393] Nov 19 20:09:59.794: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
I1119 20:10:00.393] STEP: Destroying namespace "container-probe-6479" for this suite.
I1119 20:10:00.393] 
I1119 20:10:00.393] 
I1119 20:10:00.394] • Failure [253.939 seconds]
I1119 20:10:00.394] [sig-node] Probing container
I1119 20:10:00.394] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/node/framework.go:23
I1119 20:10:00.394]   should *not* be restarted with a exec "cat /tmp/health" liveness probe [NodeConformance] [Conformance] [It]
I1119 20:10:00.394]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:630
I1119 20:10:00.394] 
[... 12089 lines skipped ...]
I1119 20:15:30.791] 
I1119 20:15:30.791] [Fail] [sig-node] Probing container [It] should *not* be restarted with a exec "cat /tmp/health" liveness probe [NodeConformance] [Conformance] 
I1119 20:15:30.792] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/node/container_probe.go:150
I1119 20:15:30.792] 
I1119 20:15:30.792] Ran 817 of 7037 Specs in 1264.853 seconds
I1119 20:15:30.792] FAIL! -- 816 Passed | 1 Failed | 0 Pending | 6220 Skipped
I1119 20:15:30.797] 
I1119 20:15:30.798] 
I1119 20:15:30.798] Ginkgo ran 1 suite in 22m2.16716872s
I1119 20:15:30.798] Test Suite Failed
I1119 20:15:30.811] Checking for custom logdump instances, if any
[... 20 lines skipped ...]
W1119 20:16:22.543] Specify --start=12583384 in the next get-serial-port-output invocation to get only the new output starting from here.
W1119 20:16:37.322] scp: /var/log/cluster-autoscaler.log*: No such file or directory
W1119 20:16:37.633] scp: /var/log/fluentd.log*: No such file or directory
W1119 20:16:37.633] scp: /var/log/kubelet.cov*: No such file or directory
W1119 20:16:37.633] scp: /var/log/startupscript.log*: No such file or directory
W1119 20:16:37.638] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
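
For context, the failing test creates a busybox pod with an exec liveness probe that runs `cat /tmp/health`; when the probe command times out, as in the events above, the kubelet counts the probe as failed and restarts the container, which is exactly the extra restart the test flags. Below is a minimal sketch of that kind of spec; the container command, probe thresholds, and names are illustrative assumptions, and the authoritative definition lives in test/e2e/common/node/container_probe.go.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// livenessPod sketches the kind of pod this e2e test creates: a busybox container
// that writes /tmp/health and then sleeps, plus an exec liveness probe that cats
// the file. Command and probe values here are illustrative, not copied from the test.
func livenessPod(name string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:    "busybox",
				Image:   "k8s.gcr.io/e2e-test-images/busybox:1.29-2",
				Command: []string{"/bin/sh", "-c", "touch /tmp/health; sleep 600"},
				LivenessProbe: &v1.Probe{
					ProbeHandler: v1.ProbeHandler{ // named Handler in pre-1.23 k8s.io/api
						Exec: &v1.ExecAction{Command: []string{"cat", "/tmp/health"}},
					},
					InitialDelaySeconds: 15,
					FailureThreshold:    1, // a single timed-out "cat" is enough to trigger a restart
				},
			}},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", livenessPod("busybox-liveness-example"))
}
```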

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 53 (52 by maintainers)

Most upvoted comments

~@BenTheElder can we try to bump the containerd version to a 1.6 beta in one CI job just to be sure about the reason? If API conversion is the reason for the flakes, I think we should fix it.~

FWIW, yes, see e.g. https://github.com/kubernetes/test-infra/pull/21735/files

kube-up.sh has had logic for a while now to support dynamically installing a newer runtime from upstream builds when running Ubuntu nodes. Tagged versions should just work; e.g. we could do one of the v1.6.0 betas.

KIND jobs as currently set up can be switched, but not as readily per-job; we usually maintain one known-good version and test against all current k8s branches before merging default changes there. It is possible, though, with slightly higher effort at the moment.

I’ll take care of it, but it hasn’t flaked recently and I wanted to be more sure before increasing the timeout: https://testgrid.k8s.io/sig-release-master-informing#gce-ubuntu-master-default

is this related to the following change? #106544

that is a different test

We just updated the kinder jobs to address the creation of the release-1.23 branch, so the flake history is gone due to the job refresh. This unfortunately landed during debug time…

Not all kinder jobs use docker. Kinder-regular-1.23 uses containerd (not the latest containerd version).

docker still involves dockershim and not the CRI codepaths though?

duh, thanks Ben 🙃

It seems kinder jobs are using dockershim too, no?

Nov 24 21:37:31 kinder-xony-worker-2 kubelet[984]: I1124 21:37:31.706675     984 kubelet.go:314] "Using dockershim is deprecated, please consider using a full-fledged CRI implementation"
Nov 24 21:37:31 kinder-xony-worker-2 kubelet[984]: I1124 21:37:31.706713     984 client.go:78] "Connecting to docker on the dockerEndpoint" endpoint="unix:///var/run/docker.sock"
Nov 24 21:37:31 kinder-xony-worker-2 kubelet[984]: I1124 21:37:31.706735     984 client.go:97] "Start docker client with request timeout" timeout="2m0s"

is this affecting only jobs using dockershim? because the answer then is clear 😃

Sorry, I meant ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-22-on-latest; IIUC it tests the 1.22 kubelet on latest?

Yes, the kubelet there is at 1.22; the control plane, kubeadm, and test suite are at latest (1.23).

Any change in gRPC or CRI on 11/18?

yes, https://github.com/kubernetes/kubernetes/pull/106463

This test is flaking with kubelet 1.22 and kubelet 1.21 as well, so I doubt that the CRI API conversion is the reason: https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=Probing container

This quoted triage link covers a lot more tests than this one. If you confine your search to just this test, you can see that the flake rate has increased significantly, and this particular failure mode only started on 11/17 (see screenshot above): https://storage.googleapis.com/k8s-triage/index.html?test=Probing container should \*not\* be restarted with a exec "cat %2Ftmp%2Fhealth" liveness probe

I am also excluding PRs, since PRs often fail for all sorts of unrelated reasons; CI-only searches will give us better signal.

this kubeadm test job has it consistently failing for the last 8 runs: https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-kubelet-1-21-on-latest

If the infra is under a lot of pressure for a period of time (e.g. the tests in these jobs run concurrently), increasing the timeout would make sense.

Oddly, these tests use 2-second ticks: https://github.com/kubernetes/kubernetes/blob/f572e4d5b4a1496356eccf29c34adac02280beeb/test/e2e/common/node/container_probe.go#L751, so odd numbers are not a good pick for the cap… IMO 5 seconds seems low for a restart; also, the tick should be 1 second.

But before increasing timeouts and adjusting ticks, it is worth checking whether something actually changed that slowed down restarts…
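
To make the tick/cap discussion concrete, here is a sketch (not the actual RunLivenessTest code) of a restart-watch loop: poll the pod's restart count on a fixed tick for a window of time and fail as soon as a restart is observed. The package, helper names, and parameter choices are assumptions for illustration.

```go
package probecheck

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// restartCount sums the restart counts of all containers in the pod.
func restartCount(ctx context.Context, c kubernetes.Interface, ns, name string) (int32, error) {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	var restarts int32
	for _, s := range pod.Status.ContainerStatuses {
		restarts += s.RestartCount
	}
	return restarts, nil
}

// expectNoRestarts polls on a fixed tick for the given window and returns an error
// as soon as any restart is observed, mirroring the "expected number of restarts: 0,
// found restarts: 1" failure above. The tick and window are caller-chosen; the
// comment above argues for a 1-second tick rather than the current 2-second one.
func expectNoRestarts(ctx context.Context, c kubernetes.Interface, ns, pod string, tick, window time.Duration) error {
	deadline := time.Now().Add(window)
	for time.Now().Before(deadline) {
		n, err := restartCount(ctx, c, ns, pod)
		if err != nil {
			return err
		}
		if n > 0 {
			return fmt.Errorf("pod %s/%s - expected number of restarts: 0, found restarts: %d", ns, pod, n)
		}
		time.Sleep(tick)
	}
	return nil
}
```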

Since the liveness probe is timing out here, the container is restarted and we are not getting the expected timeout. Also, the feature gate is enabled by default: https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go#L553. The timeout for the probe is 5 secs; maybe we should increase it to 10?

I1122 04:17:39.739] Nov 22 04:17:31.292: INFO: At 2021-11-22 04:13:14 +0000 UTC - event for busybox-4897c144-9db4-40e9-ad9e-5cefb07c1cdc: {kubelet bootstrap-e2e-minion-group-tq3m} Created: Created container busybox
I1122 04:17:39.739] Nov 22 04:17:31.292: INFO: At 2021-11-22 04:13:14 +0000 UTC - event for busybox-4897c144-9db4-40e9-ad9e-5cefb07c1cdc: {kubelet bootstrap-e2e-minion-group-tq3m} Started: Started container busybox
I1122 04:17:39.740] Nov 22 04:17:31.292: INFO: At 2021-11-22 04:13:36 +0000 UTC - event for busybox-4897c144-9db4-40e9-ad9e-5cefb07c1cdc: {kubelet bootstrap-e2e-minion-group-tq3m} Unhealthy: Liveness probe failed: command "cat /tmp/health" timed out
I1122 04:17:39.740] Nov 22 04:17:31.292: INFO: At 2021-11-22 04:13:36 +0000 UTC - event for busybox-4897c144-9db4-40e9-ad9e-5cefb07c1cdc: {kubelet bootstrap-e2e-minion-group-tq3m} Killing: Container busybox failed liveness probe, will be restarted
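
Regarding the suggested bump of the probe timeout from 5 to 10 seconds: with the ExecProbeTimeout feature gate enabled (the default, per the link above), an exec probe that runs longer than TimeoutSeconds is counted as a failure instead of being ignored, which is what produces the `command "cat /tmp/health" timed out` events. A minimal sketch of where that knob sits on the probe follows; the values are illustrative, not the test's actual ones.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// The knob under discussion is TimeoutSeconds on the probe. With ExecProbeTimeout
	// enabled (the default), an exec probe running longer than this counts as a failed
	// probe, which yields the `command "cat /tmp/health" timed out` events quoted above.
	probe := v1.Probe{
		ProbeHandler: v1.ProbeHandler{ // named Handler in pre-1.23 k8s.io/api
			Exec: &v1.ExecAction{Command: []string{"cat", "/tmp/health"}},
		},
		TimeoutSeconds: 10, // illustrative: the proposed bump from 5s; the real value is set in container_probe.go
	}
	fmt.Printf("%+v\n", probe)
}
```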