autoscaler: Failed to drain node - pods remaining after timeout

Is it possible to increase the node drain timeout?

Seeing this in the logs: Failed to scale down: Failed to delete ip-10-100-6-220.ec2.internal: Failed to drain node /ip-10-100-6-220.ec2.internal: pods remaining after timeout

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

The problem with accepting a higher graceful termination period is that we stop all CA operations until the node is deleted. If graceful termination were 10 min, CA would stall for 10 min, and during that time no scale-up operations would be executed. This is probably not what users would want.

So if we want a graceful termination significantly longer than 1 min, we need to make the deletes asynchronous, which will make the whole thing even more complicated. So I guess we won’t do anything around it for 1.7, maybe for 1.8, but we have other, probably more important pain points to fix.
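For illustration, here is a rough Go sketch of the synchronous pattern described above. This is not the actual Cluster Autoscaler code; the function name, its parameters, and the polling interval are made up for this example. The point is that as long as the drain blocks in a loop like this, no other scaling work happens.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForPodsGone blocks until every pod has left the node or the timeout
// fires. While it waits, the caller can do no other scale-up/scale-down work,
// which is why a long graceful termination stalls the whole autoscaler loop.
func waitForPodsGone(listPods func() ([]string, error), timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pods, err := listPods()
		if err != nil {
			return err
		}
		if len(pods) == 0 {
			return nil // node is empty, safe to delete
		}
		time.Sleep(interval) // everything else waits with us
	}
	return errors.New("pods remaining after timeout")
}

func main() {
	// Hypothetical lister that always reports one stuck pod, to show the timeout path.
	stuck := func() ([]string, error) { return []string{"my-app-0"}, nil }
	fmt.Println(waitForPodsGone(stuck, 3*time.Second, time.Second))
}
```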

Speaking about the issue itself - it seems that your app is probably ignoring SIGTERM. And after investigating the code around this, there seems to be a subtle timing/race-condition bug in the pod-checking loop. The fix is on the way.
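For anyone hitting this because their app ignores SIGTERM, a minimal Go sketch of catching the signal and shutting down within the grace period could look like the following. The HTTP server and the 25-second shutdown budget are placeholder assumptions for this example, not something taken from this issue.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	// Run the server in the background.
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for SIGTERM, which the kubelet sends when the pod is evicted during a drain.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Finish in-flight requests, but give up well before the grace period expires.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```

If the process exits promptly on SIGTERM like this, the pod leaves the node within its grace period and the drain no longer runs into the timeout.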