autoscaler: Failed to drain node - pods remaining after timeout
Is it possible to increase the node drain timeout?
Seeing this in the logs:
Failed to scale down: Failed to delete ip-10-100-6-220.ec2.internal: Failed to drain node /ip-10-100-6-220.ec2.internal: pods remaining after timeout
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- Merge pull request #32 from frobware/smoke-test-for-scale-down: test/openshift/e2e: Smoke test for scale down — committed to frobware/autoscaler by openshift-merge-robot 5 years ago
- UPSTREAM: <carry>: test/openshift/e2e: don't modify replica count. We were adjusting the replica count when the cluster-autoscaler was still running, which meant that the test would occasionally flake... — committed to frobware/autoscaler by frobware 5 years ago
The problem with accepting a longer graceful termination period is that the CA stops all operations until the node is deleted. If graceful termination were 10 minutes, the CA would pause for 10 minutes, and during that time no scale-up operations would be executed. That is probably not what users want.
So if we want a graceful termination period significantly longer than 1 minute, we need to make the deletes asynchronous, which will make the whole thing even more complicated. So I guess we won't do anything about it for 1.7, maybe for 1.8, but we have other, probably more important pain points to fix.
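To make the tradeoff above concrete, here is a minimal Go sketch (not actual cluster-autoscaler code; `drainNode` and `scaleUpIfNeeded` are hypothetical stand-ins) contrasting a synchronous drain, which blocks the whole loop for the full graceful-termination window, with an asynchronous delete that lets the loop keep evaluating scale-ups at the cost of tracking in-flight deletions:

```go
// Hypothetical sketch contrasting synchronous and asynchronous node deletion.
// Not the cluster-autoscaler implementation; names and durations are illustrative.
package main

import (
	"fmt"
	"time"
)

// drainNode stands in for evicting pods and deleting the node; it can take up
// to the graceful-termination window to finish.
func drainNode(node string, gracefulTermination time.Duration) {
	fmt.Printf("draining %s (up to %s)\n", node, gracefulTermination)
	time.Sleep(gracefulTermination) // pods get this long to exit after SIGTERM
	fmt.Printf("deleted %s\n", node)
}

// scaleUpIfNeeded stands in for the rest of the autoscaler loop.
func scaleUpIfNeeded() { fmt.Println("checking for pending pods / scale-up") }

func main() {
	const gt = 2 * time.Second // imagine 10 minutes in the discussion above

	// Synchronous: nothing else runs until the drain completes.
	drainNode("ip-10-100-6-220.ec2.internal", gt)
	scaleUpIfNeeded() // only reached after the full graceful-termination wait

	// Asynchronous: the drain runs in the background and the loop keeps going,
	// but now in-flight deletions have to be tracked somewhere.
	done := make(chan struct{})
	go func() {
		drainNode("ip-10-100-6-221.ec2.internal", gt)
		close(done)
	}()
	for i := 0; i < 3; i++ {
		scaleUpIfNeeded()
		time.Sleep(time.Second)
	}
	<-done
}
```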
Speaking about the issue: it seems your app is probably ignoring SIGTERM. And after investigating the code around this, there seems to be a subtle timing/race-condition bug in the pod-checking loop. The fix is on the way.
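Since the comment above points at the application ignoring SIGTERM, here is a minimal sketch of handling it in a Go service so the pod can exit within its terminationGracePeriodSeconds during a drain; the HTTP server, port, and shutdown deadline are illustrative assumptions, not part of this issue:

```go
// Minimal sketch of honoring SIGTERM so that pod eviction during a node drain
// can complete before the kubelet resorts to SIGKILL.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	// Listen for the SIGTERM the kubelet sends when the pod is evicted.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	<-stop
	log.Println("SIGTERM received, shutting down gracefully")

	// Finish in-flight requests within the pod's terminationGracePeriodSeconds
	// (30s by default); the 25s deadline here is illustrative.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```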