kubernetes: Conformance test for DaemonSet RollingUpdate too rigid on timeout

When running conformance tests on a cluster with 8 nodes schedulable for jobs, the conformance test for rolling updates on daemon sets fails due to a timeout.

Test name: Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]

Test code is here: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/apps/daemon_set.go#L326-L369. It uses a hardcoded timeout of 5m.

What happened: The logs indicate nothing wrong other than the time it takes to complete the rolling restart. The rolling-restart timeout is hardcoded at 5m, which is not enough time for sufficiently large clusters.

What you expected to happen: I would expect the timeout either to scale with the cluster, or for the check to "kick the can": verify that the update is progressing by at least one node every minute or two, and only fail when progress stalls.

How to reproduce it (as minimally and precisely as possible): Requires a cluster with at least 8 schedulable nodes; more nodes make the timeout more certain to trigger.

Anything else we need to know?:

I was trying to see what k8s tests itself against, and according to this log the test is run on a cluster with 4 schedulable nodes and takes ~3m. That lines up with what I'm seeing: 8 nodes take ~6m, one minute more than the hardcoded deadline allows.

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-cri-containerd-e2e-gci-gce-serial/1555/build-log.txt

Seems like there are a few options:

  • use a “kick the can” strategy to ensure the timeout scales
  • use a timeout calculated from the number of schedulable nodes
  • (mitigation) increase the maximum number of unavailable nodes (maxUnavailable) so the update can proceed on more nodes in parallel and finish faster

Environment:

  • Kubernetes version (use kubectl version): From test logs (cluster no longer available):
INFO: e2e test version: v1.12.1
INFO: kube-apiserver version: v1.12.2-1+619f4e6f7f010f

/sig testing
/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

The sleep in there was added over a year ago. What changed RECENTLY?

Curiously, there is a serve-hostname v1.4 already tagged on k8s.gcr.io, but I can't say where it came from, since VERSION in k/k/test/images/serve-hostname says 1.2 …

That sleep was added to test graceful termination, I think; removing it may have side-effects. Can we make the test runs set terminationGracePeriodSeconds: 2? Or maybe we make this termination-sleep a flag?
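For concreteness, the suggestion above would amount to something like the following pod-template fragment. This is a sketch, not the actual e2e manifest; the container name and image tag are placeholders.

```yaml
# Hypothetical fragment of the e2e DaemonSet spec: a short grace period so the
# image's termination sleep doesn't dominate per-node rollout time.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 2  # down from the 30s default
      containers:
      - name: app                        # placeholder name
        image: k8s.gcr.io/serve-hostname:1.2
```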

If we change it at all, we should bump the version to 1.5 and supersede all others.

/assign @thockin @ixdy @spiffxp

We’re going to need an image push update for conformance regression.

WRT timeout scaling… yes, but let’s base it on some data and experiments.