autoscaler: Failed to drain node - pods remaining after timeout
Is it possible to increase the node drain timeout?
Seeing this in the logs:
Failed to scale down: Failed to delete ip-10-100-6-220.ec2.internal: Failed to drain node /ip-10-100-6-220.ec2.internal: pods remaining after timeout
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- Merge pull request #32 from frobware/smoke-test-for-scale-down: test/openshift/e2e: Smoke test for scale down — committed to frobware/autoscaler by openshift-merge-robot 5 years ago
- UPSTREAM: <carry>: test/openshift/e2e: don't modify replica count. We were adjusting the replica count when the cluster-autoscaler was still running, which meant that the test would occasionally flake... — committed to frobware/autoscaler by frobware 5 years ago
The problem with accepting a longer graceful termination period is that the CA stops all operations until the node is deleted. If graceful termination were 10 minutes, the CA would pause for 10 minutes, and during that time no scale-up operations would be executed. That is probably not what users want.
So if we want a graceful termination period significantly longer than 1 minute, we need to make the deletes asynchronous, which will make the whole thing even more complicated. So I guess we won't do anything about it for 1.7, maybe for 1.8, but we have other, probably more important pain points to fix.
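To make the tradeoff above concrete, here is a minimal Go sketch (not actual cluster-autoscaler code; `drainNode` and `scaleUpIfNeeded` are hypothetical stand-ins) contrasting a synchronous drain, which blocks the whole loop for the full graceful-termination window, with an asynchronous delete that lets the loop keep evaluating scale-ups at the cost of tracking in-flight deletions:

```go
// Hypothetical sketch contrasting synchronous and asynchronous node deletion.
// Not the cluster-autoscaler implementation; names and durations are illustrative.
package main

import (
	"fmt"
	"time"
)

// drainNode stands in for evicting pods and deleting the node; it can take up
// to the graceful-termination window to finish.
func drainNode(node string, gracefulTermination time.Duration) {
	fmt.Printf("draining %s (up to %s)\n", node, gracefulTermination)
	time.Sleep(gracefulTermination) // pods get this long to exit after SIGTERM
	fmt.Printf("deleted %s\n", node)
}

// scaleUpIfNeeded stands in for the rest of the autoscaler loop.
func scaleUpIfNeeded() { fmt.Println("checking for pending pods / scale-up") }

func main() {
	const gt = 2 * time.Second // imagine 10 minutes in the discussion above

	// Synchronous: nothing else runs until the drain completes.
	drainNode("ip-10-100-6-220.ec2.internal", gt)
	scaleUpIfNeeded() // only reached after the full graceful-termination wait

	// Asynchronous: the drain runs in the background and the loop keeps going,
	// but now in-flight deletions have to be tracked somewhere.
	done := make(chan struct{})
	go func() {
		drainNode("ip-10-100-6-221.ec2.internal", gt)
		close(done)
	}()
	for i := 0; i < 3; i++ {
		scaleUpIfNeeded()
		time.Sleep(time.Second)
	}
	<-done
}
```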
Speaking about the issue: it seems your app is probably ignoring SIGTERM. And after investigating the code around this, there seems to be a subtle timing/race-condition bug in the pod-checking loop. The fix is on the way.
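Since the comment above points at the application ignoring SIGTERM, here is a minimal sketch of handling it in a Go service so the pod can exit within its terminationGracePeriodSeconds during a drain; the HTTP server, port, and shutdown deadline are illustrative assumptions, not part of this issue:

```go
// Minimal sketch of honoring SIGTERM so that pod eviction during a node drain
// can complete before the kubelet resorts to SIGKILL.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	// Listen for the SIGTERM the kubelet sends when the pod is evicted.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	<-stop
	log.Println("SIGTERM received, shutting down gracefully")

	// Finish in-flight requests within the pod's terminationGracePeriodSeconds
	// (30s by default); the 25s deadline here is illustrative.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```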