aws-node-termination-handler: Failing to drain/cordon causes CrashLoopBackOff

[Queue processor specific]

Running into an issue here around the behavior of exiting with error code 1 when a message is pulled from SQS and the cordon or drain of that node then fails.

In our use case, we have multiple EKS clusters in one account in one region. When approaching how to handle this with NTH, I was going to run per-cluster SQS queues fed by CloudWatch Events rules tailored to the proper cluster/ASG combo. That works for ASG lifecycle events, but EC2 termination events do not offer the same filtering capability, so each cluster's queue ends up with events for instances that are not in that cluster. When NTH processes one of those messages it exits with code 1, because there is no node in the cluster to cordon or drain. After enough of these crashes, Kubernetes puts the pod into CrashLoopBackOff and you lose all NTH capabilities until the back-off period expires and Kubernetes restarts the pod.
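To make the asymmetry concrete, here is a rough sketch of the two rule types (rule and ASG names are hypothetical). The Auto Scaling lifecycle pattern can be narrowed to a specific cluster's ASGs, while the Spot interruption warning only carries the instance ID, so there is nothing cluster-specific to match on and every cluster's queue receives every interruption in the account/region:

    # ASG lifecycle rule: can be scoped to this cluster's ASG (names are examples)
    aws events put-rule --name nth-cluster1-asg-lifecycle \
      --event-pattern '{"source":["aws.autoscaling"],"detail-type":["EC2 Instance-terminate Lifecycle Action"],"detail":{"AutoScalingGroupName":["cluster1-workers"]}}'

    # Spot interruption rule: the event detail has no ASG or cluster field to filter on
    aws events put-rule --name nth-cluster1-spot-itn \
      --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'

(The per-cluster SQS queue would then be attached as a target with aws events put-targets.)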

For an NTH pod that I’ve been running for a week, it’s seen 600+ crashes:

aws-node-termination-handler-86d9c789f4-6c9pz   0/1     CrashLoopBackOff     607        6d2h
2020/10/29 17:29:57 ??? Trying to get token from IMDSv2
2020/10/29 17:29:57 ??? Got token from IMDSv2
2020/10/29 17:29:57 ??? Startup Metadata Retrieved metadata={"accountId":"0xxxxxxxxx","availabilityZone":"us-east-1b","instanceId":"i-0xxxxxxxx","instanceType":"c5.2xlarge","localHostname":"ip-10-xxx-xxx-xxx.ec2.internal","privateIp":"10.xxx.xxx.xxx","publicHostname":"","publicIp":"","region":"us-east-1"}
2020/10/29 17:29:57 ??? aws-node-termination-handler arguments: 
	dry-run: false,
	node-name: ip-10-xxx-xxx-xxx.ec2.internal,
	metadata-url: http://169.254.169.254,
	kubernetes-service-host: 172.20.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: false,
	enable-spot-interruption-draining: false,
	enable-sqs-termination-draining: true,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: false,
	json-logging: false,
	log-level: INFO,
	webhook-proxy: ,
	webhook-headers: <not-displayed>,
	webhook-url: ,
	webhook-template: <not-displayed>,
	uptime-from-file: ,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,
	aws-region: us-east-1,
	queue-url: https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler,
	check-asg-tag-before-draining: true,
	aws-endpoint: ,

2020/10/29 17:29:57 ??? Started watching for interruption events
2020/10/29 17:29:57 ??? Kubernetes AWS Node Termination Handler has started successfully!
2020/10/29 17:29:57 ??? Started watching for event cancellations
2020/10/29 17:29:57 ??? Started monitoring for events event_type=SQS_TERMINATE
2020/10/29 17:30:00 ??? Adding new event to the event store event={"Description":"Spot Interruption event received. Instance will be interrupted at 2020-10-29 17:28:15 +0000 UTC \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-event-12345","InstanceID":"i-0xxxxxxxxxx","Kind":"SQS_TERMINATE","NodeName":"ip-10-xxx-xxx-xxx.ec2.internal","StartTime":"2020-10-29T17:28:15Z","State":""}
2020/10/29 17:30:01 ??? Cordoning the node
2020/10/29 17:30:01 WRN Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2020/10/29 17:30:01 ??? There was a problem while trying to cordon and drain the node error="nodes \"ip-10-xxx-xxx-xxx.ec2.internal\" not found"
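For what it's worth, the failing lookup is easy to reproduce by hand; the node name comes from the SQS message and simply isn't registered in this cluster (output shown for illustration):

    kubectl get node ip-10-xxx-xxx-xxx.ec2.internal
    Error from server (NotFound): nodes "ip-10-xxx-xxx-xxx.ec2.internal" not found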

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 24 (18 by maintainers)

Most upvoted comments

Yes, that would work. Or even better, maybe:

aws-node-termination-handler/managed: cluster1
aws-node-termination-handler/managed: cluster2

👍 ah gotcha (both of you 🙂 ). I’m cool with the tag configuration approach then.

@paalkr Do you think that would work for your case as well (specifying different tags for each of your clusters)?

aws-node-termination-handler/managed/cluster1
aws-node-termination-handler/managed/cluster2
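For context, a sketch of what that per-cluster tagging could look like on the ASG side (ASG names and tag values here are hypothetical; PropagateAtLaunch also copies the tag onto the instances):

    aws autoscaling create-or-update-tags --tags \
      "ResourceId=cluster1-workers,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=cluster1,PropagateAtLaunch=true"
    aws autoscaling create-or-update-tags --tags \
      "ResourceId=cluster2-workers,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=cluster2,PropagateAtLaunch=true"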

The original problem is not solved yet. The filter is great, but it does not apply to all conditions, as mentioned in #307.

PR for the configurable tag change is open. I could change it to look for a value, too, but that's a more significant code change that I'm not sure is worth the complexity.

I put together a rev of the customizable tag in https://github.com/blakestoddard/aws-node-termination-handler/tree/managed-asg-tag and am running it in four of our clusters (none are prod). I'll check back in a few days to see if it's still being crash-happy.

Well, I'm still not sure it's a great idea. It doesn't delete the message, but depending on the visibility timeout setting on the queue (I think it's set at 20 sec), you could get unlucky and have the wrong NTHs pulling it every time until it hits the 300 sec deletion time.
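If it helps, the visibility timeout is an attribute on the SQS queue itself, so it can be inspected or adjusted directly (the queue URL is the one from the startup log above; the values are only examples):

    aws sqs get-queue-attributes \
      --queue-url https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler \
      --attribute-names VisibilityTimeout MessageRetentionPeriod
    aws sqs set-queue-attributes \
      --queue-url https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler \
      --attributes VisibilityTimeout=20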