autoscaler: cluster-autoscaler [AWS] isn't aware of LoadBalancer inflight requests causing 502s when external traffic policy is set to Cluster

I have two deployments, each behind a Service of type LoadBalancer with externalTrafficPolicy set to Cluster: deployment A behind LoadBalancer A and deployment B behind LoadBalancer B. I am also using cluster-autoscaler to scale my worker nodes. Deployment A is my app server; deployment B is my web server, which forwards all requests to Load Balancer A in front of deployment A's workloads. The RTT for each request is around 10-20 seconds. (To reproduce the issue, I wrote a sample app that includes a 20-second sleep.)

Whenever I add a new deployment workload (say C) to my cluster, cluster-autoscaler adds new nodes to satisfy the workload's requests. Whenever I delete deployment workload C, cluster-autoscaler scales the worker nodes back down (Drain -> Terminate).

Because my externalTrafficPolicy is set to Cluster, every new node that joins the cluster is also registered with the load balancers. However, when cluster-autoscaler deletes a node (say Node 10) because no workload is running on it, all requests currently flowing through Node 10 are cut off as the node is marked for termination. cluster-autoscaler is not aware of the active/in-flight requests on Node 10 for Service A and Service B, so those requests are interrupted by the node draining/termination and surface as 502s.

Workaround: change externalTrafficPolicy to Local.
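
For reference, the workaround is a single field on the Service spec. A minimal sketch against Service A (only the relevant fields are shown): with Local, a node that hosts no endpoint pod for this Service fails the ELB health check, so draining such a node no longer interrupts this Service's requests.
===
apiVersion: v1
kind: Service
metadata:
  name: sameple-go-app             # same Service as in the manifest below
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # default is Cluster
  selector:
    run: limits-nginx
  ports:
  - port: 80
    targetPort: 8080
===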

Ask: make cluster-autoscaler more resilient by making it aware of in-flight requests when externalTrafficPolicy is set to Cluster.

Deployment A manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: limits-nginx
spec:
  selector:
    matchLabels:
      run: limits-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: limits-nginx
    spec:
      containers:
      - name: limits-nginx
        image: nithmu/nithish:sample-golang-app
        env:
        - name: MSG_ENV
          value: "Hello from the environment"
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "264Mi"
            cpu: "250m"
          limits:
            memory: "300Mi"
            cpu: "300m"
===
Service A manifest file
===
{
   "kind":"Service",
   "apiVersion":"v1",
   "metadata":{
      "name":"sameple-go-app",
      "annotations":{
        "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
        "service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled": "true"
      },
      "labels":{
         "run": "limits-nginx"
      }
   },
   "spec":{
      "ports": [
         {
           "port":80,
           "targetPort":8080
         }
      ],
      "selector":{
         "run":"limits-nginx"
      },
      "type":"LoadBalancer"
   }
}
===
Deployment B manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nithmu/nithish:nginx_echo
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
===
Service B manifest file
===
{
   "kind":"Service",
   "apiVersion":"v1",
   "metadata":{
      "name":"my-nginx",
      "annotations":{
        "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
        "service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled": "true"
      },
      "labels":{
         "run":"my-nginx"
      }
   },
   "spec":{
      "ports": [
         {
           "port":80,
           "targetPort":8080
         }
      ],
      "selector":{
         "run":"my-nginx"
      },
      "type":"LoadBalancer"
   }
}
===
Deployment C manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: l-nginx
spec:
  selector:
    matchLabels:
      run: l-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: l-nginx
    spec:
      containers:
      - name: l-nginx
        image: nginx
        env:
        - name: MSG_ENV
          value: "Hello from the environment"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "1564Mi"
            cpu: "2000m"
          limits:
            memory: "1600Mi"
            cpu: "2500m"
===

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 34
  • Comments: 52 (18 by maintainers)

Most upvoted comments

@krzysztof-jastrzebski So I understand that the CA uses taints with the NoSchedule effect when deleting nodes instead of setting node.Spec.Unschedulable=true. This is done because in the case of a non-empty node we are unsure whether all evictions will succeed before we delete the node via the cloud provider: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scale_down.go#L990

Now consider that the k8s service controller computes the list of load balancer targets based on the node.Spec.Unschedulable flag and not based on node taints: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/service/service_controller.go#L554

There are thus two points to consider in the interaction of CA and the service controller during node deletes:

  a) It is good that the CA uses taints before evicting pods: we don’t want to kick the node out of ELB target lists before we know that we’re going to delete the node for sure.
  b) We have a race because we never set node.Spec.Unschedulable prior to deleting the node via the cloud provider: ELBs may still round-robin traffic to a deleted node before k8s realizes the node is gone and the service controller updates all ELBs.

I think this issue exists independent of whether we set externalTrafficPolicy to Local or Cluster. Setting the policy to Local just makes the race less probable: node deletion only affects services whose endpoint pods were on the dying node up until a few seconds ago. With Cluster, by contrast, deleting any node can affect every service.

To conclude, can the CA set node.Spec.Unschedulable and sleep right before deleting a node via the cloud provider? Alternatively perhaps we could use the ServiceNodeExclusion feature and label the dying node with “alpha.service-controller.kubernetes.io/exclude-balancer” and sleep before destroying the node?
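
For illustration, a minimal sketch of those two alternatives on the Node object (the node name is a placeholder; either field alone should be enough for the service controller to drop the node from ELB target lists before the instance is destroyed):

apiVersion: v1
kind: Node
metadata:
  name: node-10                          # placeholder node name
  labels:
    # the ServiceNodeExclusion alternative: excludes the node from load
    # balancer target lists while the feature gate is enabled
    alpha.service-controller.kubernetes.io/exclude-balancer: "true"
spec:
  # the node.Spec.Unschedulable alternative: the field the service
  # controller currently keys off when building ELB target lists
  unschedulable: true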

We were also facing a similar issue and came across a PR in autoscaler that cordons the node so that the ALB can remove it from its healthy list. We tried it with the following changes and were able to fully avoid 5xx errors due to autoscaling. (This is tested in our staging environment as of now; we will be pushing the change to our prod environment next week.) Steps:

  1. Cordon the node along with the taint in the autoscaler code (PR - https://github.com/kubernetes/autoscaler/pull/3649). I have also created a Docker image on top of the 1.18 branch (hosted at https://hub.docker.com/r/atulagrwl/autoscaler/, tag 1.18-c; cordon is enabled by default in this image).
  2. The default deregistration delay is 300 seconds. Increase the delay to whatever your in-flight requests need for your use case. (https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay)
  3. Enable lifecycle termination hooks on the ASG for the worker nodes (https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html). Create a new lifecycle hook with the Lifecycle transition set to Instance terminate, the Default Result set to Continue, and the required wait time in seconds in the Heartbeat timeout, so the node is kept around before termination (a CloudFormation sketch of such a hook is included below).

Using the above steps we were able to reduce the errors due to downscaling to zero. Please vote for the above PR so that we can get it merged into autoscaler and avoid maintaining it separately. I will also spend some time moving item 3 into autoscaler so it waits a configured time before sending termination to the node.
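
A minimal sketch of the Step 3 lifecycle hook as a CloudFormation resource (a fragment of a Resources section; the logical name, ASG name and the 300-second timeout are placeholders to adjust to your in-flight request duration):

WorkerTerminateHook:                      # placeholder logical name
  Type: AWS::AutoScaling::LifecycleHook
  Properties:
    AutoScalingGroupName: my-worker-asg   # placeholder ASG name
    LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
    DefaultResult: CONTINUE               # let termination proceed once the timeout expires
    HeartbeatTimeout: 300                 # seconds to hold the instance for connection draining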

@Jeffwan - Would you mind weighing in from the AWS side on this issue, please? Based on this thread and my own experience, cluster-autoscaler is currently incompatible with the LoadBalancer Service type. Our EKS nodes are constantly scaling up and down to accommodate backend queue load, and this issue causes constant disruption to requests hitting application server pods on the cluster. Whenever cluster-autoscaler scales down a node, we see a handful of 504s:

(screenshot: ELB 504 error counts)

This is over the last 3 days, and every single one aligns with an ASG scale down triggered by cluster-autoscaler. It’s a major issue for us right now, and it’s not clear how we can achieve scalability of the cluster along with stable external load balancing.

Let me double-check this issue after KubeCon week. Due to the implementation of the service controller and the draining logic of CA, I am not sure there is a simple solution. Thanks for continuing to raise this issue.

Updates

  1. The ALB load balancer PR has been merged into the main branch (for the AWS load balancer controller v2) and released in v2.1.1. It ensures that nodes tainted with the ToBeDeletedByClusterAutoscaler taint are removed from the healthy node list, so the current version of cluster autoscaler (without the Unschedulable taint) also works fine.
  2. The Cluster Autoscaler PR has also been merged. This PR adds the Unschedulable taint in addition to the existing ToBeDeletedByClusterAutoscaler taint to nodes that are being removed from the cluster. As per my understanding, the change is in the master branch and has not been released yet.

Both changes are not required; either one works as an alternative to Step 1. However, Steps 2 and 3 are still required to ensure you keep the node around for a few minutes until the responses for all previous connections have been returned.

We have been running this solution in our production cluster for more than 3 months and have not observed any issue related to cluster autoscaler yet (currently with our custom Docker image; we will be switching to the aws v2.1.1 release soon and moving away from the custom image). Hopefully this helps the teams facing a similar issue. It took us several days to identify and fix the issue (all thanks to this thread and the investigation done by everybody).

Since the issue has been handled in cluster autoscaler as well as on the AWS load balancer side, I suggest closing this issue if everyone else agrees.

Edit - Updated the PR link for Cluster Autoscaler and added the suggestion to close this GitHub issue if everyone else agrees.

Node Draining Solution

@jorihardman I can now confirm that the fix I put in place in our node draining has solved this particular problem for us. This chart (numbers not to scale) shows the k8s node count in the autoscaling pool against ELB 5xx errors for our nginx ingress serviceType LoadBalancer. We haven’t seen any direct ELB 5xx since our draining fix.

(screenshot: node count in the autoscaling pool vs. ELB 5xx errors)

Node draining before completing the ASG lifecycle hook won’t solve this issue in itself. The key is to wait long enough after marking the node unschedulable for the node to be removed and drained from all ELBs it was automatically added to by k8s before you complete the lifecycle hook killing the node. I haven’t found this ability in any generic node draining rigs yet. I’m not surprised it doesn’t work with the default drainer in the new EKS node groups. I’ll hopefully have a chance to spin one of those up in the near future, and will hopefully be able to inspect what is likely a lambda rig, and see if it can be modified to do what we put in place in our drain rig.

Most drain techniques won’t remove daemonset pods, so kube-proxy should continue to run until the instance itself is gone, which in my testing continues to allow it to side-route traffic to pods. There are myriad node draining rigs out there, and it appears “EKS Node Groups” come stock with yet another one. We can use this one, github.com/aws-samples/amazon-k8s-node-drainer to illustrate what we found to be the problem.

https://github.com/aws-samples/amazon-k8s-node-drainer/blob/master/drainer/handler.py#L149

        cordon_node(v1, node_name)

        remove_all_pods(v1, node_name)

        asg.complete_lifecycle_action(LifecycleHookName=lifecycle_hook_name,
                                      AutoScalingGroupName=auto_scaling_group_name,
                                      LifecycleActionResult='CONTINUE',
                                      InstanceId=instance_id)

That is the basic runbook in the handler for the drain. A procedure like this fires once CA has decided the node should be removed, so CA has already evicted most of the pods itself. asg.complete_lifecycle_action will actually destroy the instance. This is a very common and sane thing to do, and is similar to what we had in place; but it will still drop connections during scale-down with serviceType LoadBalancer, because CA evicts most/all pods (but not kube-proxy) and eventually terminates the node by calling the correct hook, which lets draining rigs kick into gear. Taking the above code, the draining would look like:

  1. cordon_node is called, which marks the node unschedulable (@Jeffwan 's PR will do this earlier, as part of the CA process, which may almost completely solve this issue if that merges). When the node is marked unschedulable k8s will START to remove it from all ELBs it was added to, honoring any drain settings. Remember this node may be actively holding connections EVEN THOUGH it may have nothing but kube-proxy running, because of how servicetype loadbalancer works. Cordon is non-blocking and will return virtually immediately.
  2. remove_all_pods is called, which should evict all pods that aren’t daemonsets. Again, this should leave kube-proxy running and still allow the node to side-route traffic to pods. This will likely run very quickly, or immediately, because CA has likely already evicted the pods before this chain of events starts.
  3. asg.complete_lifecycle_action is called telling AWS it can actually destroy the node itself, which will stop kube-proxy (obviously) breaking any connections still routing through kube-proxy.

The issue is it’s probably not safe to actually stop the node just because all pods have been evicted. cordon_node is non-blocking, and only signals that k8s should start the process of removing the nodes from the elbs, but doesn’t wait (and shouldn’t) until the nodes are actually removed from the ELBs. In our case, we have a 300s elb drain configured, so we should wait at least 300 seconds after cordon_node before terminating the node with asg.complete_lifecycle_action. Our solution was to add logic between remove_all_pods and asg.complete_lifecycle_action. Our logic right now is to make sure we’ve slept at least as long as our longest ELB drain after calling cordon and before calling asg.complete_lifecycle_action. We plan to add an actual check to make sure k8s has removed the instance from all ELBs on its own before subsequently calling the lifecycle hook, rather than relying on an arbitrary sleep. A nearly arbitrary sleep is, however, the kubernetes way. Most of these drain procedures aren’t dealing with the fact that the node is possibly still handling production traffic when all pods, save for kube-proxy and daemonsets, are gone.

I think @Jeffwan 's fix of having CA do the cordon or unschedulable will likely solve almost all of this as well. Our drain solution has proved fairly foolproof for us thus far. We just had to orient ourselves to the fact that these nodes were possibly still serving nginx ingress traffic even though they had no more pods running on them and were never able to have nginx ingress pods running on them due to taints and tolerations.

Every time I write this stuff out I realize we need to move off of servicetype loadbalancer as soon as possible; and we’re getting close 😃.

ALB Ingress

@mollerdaniel gave some great guidance on successfully using the ALB ingress with zero downtime deploys. We’ve also had good success with it. There are definitely tricks to it, including the deregistration delay stuff and preStop termination hooks in pods; as well as readiness in the pod not being the same as readiness in the target group. The trick we saw with the ALB ingress is the pod readiness isn’t actually integrated with the ALB controller in IP mode. Meaning, as soon as the pod becomes “ready” as far as k8s is concerned, only then is the pod IP initially added to the target group. Now, this doesn’t mean it’s actually ready for traffic in the ALB target group yet, even though k8s has put it in the endpoints API and then proceeds to terminate other pods the new pods are replacing. This is because once the pod IP is added to the target group, it then has to pass the ALB health checks to become “ready” there. These two readiness checks aren’t aware of each other at all. We had to get our sleeps and timeout in our preStop all synced up with the drains as has been mentioned.
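
A minimal sketch of the kind of tuning described above, with placeholder names and durations (the preStop sleep has to cover the time the controller needs to deregister the pod IP, and terminationGracePeriodSeconds has to cover the sleep plus a normal shutdown):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                            # placeholder
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60   # must be >= preStop sleep + shutdown time
      containers:
      - name: my-app
        image: my-app:latest              # placeholder image
        ports:
        - containerPort: 8080
        lifecycle:
          preStop:
            exec:
              # keep serving while the ALB deregisters this pod's IP;
              # align the sleep with the target group's deregistration delay
              command: ["/bin/sh", "-c", "sleep 30"]

The deregistration delay itself can be set per target group with the alb.ingress.kubernetes.io/target-group-attributes annotation on the Ingress (e.g. deregistration_delay.timeout_seconds=30).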

We are also looking at fully replacing all ELBs with the ALB ingress controller. I think @mollerdaniel also mentioned this, but we’re going to start by fronting our nginx ingress service with the ALB ingress controller, which sounds weird, but should work. The nginx ingress controller creates pods that serve the actual traffic and will balance the upstream pods by adding the IPs directly to the nginx upstream blocks via a lua plugin. You normally front the nginx ingress pods themselves with servicetype LoadBalancer. We’re going to front our nginx ingress pods with an ingress document configured for the alb ingress controller, basically putting an ALB in front of our nginx ingress pods that only contains the nginx pod IPs. The ingress pods themselves don’t re-deploy very often, while the services we have behind nginx ingress are literally continuously deploying during the day. We deploy about every 15 minutes.

Hi, just ran into this ourselves. Surfacing as 500s in our applications. Would be great to have some clarity around what the autoscaling team is thinking around potential solutions for this or confirmation that this is indeed a bug in how the CA scales down nodes. Let me know if I can be a resource in any way.

To conclude, can the CA set node.Spec.Unschedulable and sleep right before deleting a node via the cloud provider? Alternatively perhaps we could use the ServiceNodeExclusion feature and label the dying node with “alpha.service-controller.kubernetes.io/exclude-balancer” and sleep before destroying the node?

Let's say the CA sets node.Spec.Unschedulable after drainSuccessful = true: https://github.com/kubernetes/autoscaler/blob/33bd5fc853f91bdfdc548052ccb453d5fea348c6/cluster-autoscaler/core/scale_down.go#L1140

For the “sleep”, I think how long we need to wait depends on the implementation of the load balancer / service / ingress. How about implementing the sleep via an annotation, delay-deletion.cluster-autoscaler.kubernetes.io/xxxx?

Then those who wish to have a quick termination (in the case of 300 nodes scaling up and down) can leave it blank. Those who are using the in-tree LB-type service can wait 120 seconds (which is longer than the nodeSyncPeriod) to make sure it is safe to terminate the node.

We’ve been able to solve for this by making some minor adjustments to our existing node draining daemonset/controller rig we’ve had running, as well as making sure our ELBs created by services had connection draining enabled.

All of our AWS autoscaling groups (ASGs) have termination lifecycle hooks. There are lots of different rigs out there (and new EKS node groups look to have this built in), but basically when nodes are sent the termination api call our daemonset kicks in and runs kubectl drain .... on the node, waits for it to drain, and then completes the ASG lifecycle hook, which tells the ASG it’s then safe to actually terminate the node. The trick was not completing the lifecycle hook until connections were drained from various ELBs, not just waiting for the pods to be evicted. The node is still a part of production routing until any in-flight requests are finished using it, regardless of what pods were running on it.

The two adjustments we made to eliminate our errors were:

  1. We made sure all our services had the service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout annotation, which we had missed on the problem ELBs (see the Service sketch after this list).
  2. In our daemonset responsible for the kubectl drain we made sure that at least the time of our longest elb drain timeout had elapsed since the kubectl drain command was FIRST called before completing the lifecycle hook. So, if you want to drain connections for 60 seconds in your ELB, you’ll want to make sure at the very least, 60 seconds have gone by since you first call kubectl drain, before telling the ASG it can terminate the instance. As far as I can tell, as soon as kubectl drain is called, it marks the node in such a way as to tell the control plane to start deregistering it from all ELBs. We may possibly add polling to make sure the instance is gone from all elbs and target groups in the future, rather than relying on a coarse sleep.
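
For reference, a sketch of item 1 on a LoadBalancer Service (the name, selector and 60-second timeout are placeholders; the timeout has to match whatever you wait for before completing the lifecycle hook):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress                     # placeholder: the ingress Service
  annotations:
    # have the ELB drain in-flight connections when an instance is deregistered
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    # how long the ELB keeps draining, in seconds
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress                    # placeholder selector
  ports:
  - port: 80
    targetPort: 8080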

The nice part about the ASG lifecycle hook and daemonset is it will be called after cluster autoscaler declares the node should be gotten rid of. The kubectl drain ... should run relatively fast as cluster autoscaler should have started evictions, hence why we had to add the additional sleep (every problem in k8s is solved by adding a sleep, just like the '90s) between the drain call finishing and the lifecycle hook being completed. We need the node to stay around while requests are still potentially being allowed to drain.

We’re initially excited about the new EKS node groups, but I need to look to see if they have enough flexibility to add this kind of logic to the drain mechanism AWS put in them. We haven’t gotten to play with them yet. I don’t see why we couldn’t cleanup and open source our take on the draining rig, I got most of it from the kube-aws project, and modified it slightly.

This issue may best be solved in a manner like this, as CA calls an API in AWS that is actually event driven. We just put this fix in place, and so far it’s gotten rid of 100% of the errors, but it’s only been one day. I’ll report back if they crop up again. Before this fix, we saw errors every single scaling event.

Semi-Related Realizations

After fixing this, it dawned on us how odd this routing strategy really is. We knew it worked this way, but this drove it home. Our ingress pods fronted by the ELB throwing these errors were already running on dedicated nodes in an ASG that doesn’t have CA enabled at all, as we’re more careful with our ingress pods given how much traffic goes through them. This really drove home that, even with that configuration, our highly volatile node group doing ETL data processing was actually serving production web traffic. Sounds crazy when you write it out.

We initially got really excited about externalTrafficPolicy Local, as it would only have nodes that contained the pods passing health checks in the ELB. However, at least in our testing we realized the Local policy has huge issues when updating deployments. As soon as a pod starts even a safe termination procedure (using preStop scripts with connection draining in the pod) traffic routing to the safely stopping pods stops working instantly, but it will take the ELB seconds to fail the health check that will take the instance with a safely draining pod out of rotation. This means that if a deployment update brings up a new pod on a new node, draining the old pod on a different node can’t be made safe; at least as far as I could tell.

So, back to serving traffic from every single node in our cluster using the nginx ingress with this fix in place. We’re looking deeply at the ALB ingress at this point, using the IP routing policy https://kubernetes-sigs.github.io/aws-alb-ingress-controller/guide/ingress/annotation/#traffic-routing. That seems fairly promising in testing so far.

We’ve run into this while using cluster-autoscaler together with alb-ingress-controller. The workaround we’ve identified is to create a Lambda function triggered on the AWS EC2 Instance-terminate Lifecycle Action:

  1. sets the alpha.service-controller.kubernetes.io/exclude-balancer: "true" label on the node - so that the alb-ingress-controller doesn’t add the node back
  2. deregisters the node from the ALB target group (thereby initiating ALB draining)

This is our implementation https://github.com/syscollective/kube_alb_lambda_deregister

Looks like there hasn’t been a cluster-autoscaler release since December, and this hasn’t been ported to any of the earlier versions. Is it possible to get a release which includes it for all supported minor versions of cluster-autoscaler?

externalTrafficPolicy: Cluster

Hello @infa-ddeore, could you please share your code for deregister-node-from-lb-before-terminating=true? Thanks

@SCLogo, I can't share the customized code, but it's a small change: we add the node.kubernetes.io/exclude-from-external-load-balancers=true label to the node during the cordon operation, and a deregister-node-from-lb-before-terminating argument is added to the binary.

We have the same (or similar) issue when using alb-ingress-controller with cluster-autoscaler: during scale-down we get 5XXs because CA doesn’t wait for the target group to deregister the target. In our case we are using alb.ingress.kubernetes.io/target-type=ip.

I came up with this solution to make the pods “wait” till they are deregistered from the target-group:

- name: wait-till-deregistered
  image: public.ecr.aws/bitnami/aws-cli:2.11.23
  command:
    - /bin/bash
    - -c
  args:
    - |
      CLUSTER_NAME="my-cluster";
      INGRESS_GROUP="${CLUSTER_NAME}-private";
      DEPLOYMENT_NAME="my-workload";
      SERVICE_NAME="my-workload";
      PORT_NAME="http";
      STACK_FILTER="Key=ingress.k8s.aws/stack,Values=${INGRESS_GROUP}";
      RESOURCE_FILTER="Key=ingress.k8s.aws/resource,Values=${CLUSTER_NAME}/${DEPLOYMENT_NAME}-${SERVICE_NAME}:${PORT_NAME}";
      TG_ARN=$(aws resourcegroupstaggingapi get-resources --resource-type-filters elasticloadbalancing:targetgroup --tag-filters ${STACK_FILTER} --tag-filters ${RESOURCE_FILTER} --query "ResourceTagMappingList[*].ResourceARN | [0]" --output text);
      
      echo "Waiting until target Id=${MY_POD_IP} is deregistered";
      until aws elbv2 wait target-deregistered --target-group-arn ${TG_ARN} --targets Id=${MY_POD_IP};
      do
          echo "Still waiting...";
          sleep 1;
      done;
      echo "target Id=${MY_POD_IP} has been deregistered";
  env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  securityContext:
    allowPrivilegeEscalation: false
    runAsUser: 0

This is a sidecar container in our frontend workloads; all it does is wait until the pod IP is deregistered and then exits. I find this approach a lot easier than using Lambda + ASG lifecycle hooks (no need for new TF code and/or CI/CD pipelines).

Keep in mind I only tested this on target-type=ip since we register our pods to target groups and not the instances, so YMMV.

@infa-ddeore - The change should also work in previous versions of k8s, as it relies on the Unschedulable taint. As per the documentation, this taint exists since k8s 1.10: node.kubernetes.io/unschedulable (1.10 or later)

2. Cluster Autoscaler PR has also been merged. This PR adds Unschedulable taint apart from existing ToBeDeletedByClusterAutoscaler taint to the nodes which are being removed from Cluster. As per my understanding the change is in the master branch and has not been released yet

@atulaggarwal is this change available for k8s 1.15 and 1.16 versions?

Cordon nodes -> #2868

There is PR https://github.com/kubernetes/autoscaler/pull/3014 as a possible solution for the 504 errors. It needs code review.

@bshelton229 really appreciate the thorough write-up. Glad I’m not alone in all this trial-and-error. It looks like I have tentatively worked around this issue by switching to alb-ingress-controller. I don’t want to crowd this thread with details on a completely different solution, but you can find my findings here: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1064.

@MaciekPytel kubectl drain will set node.Spec.Unschedulable to true. The service controller periodically fetches node information and reconciles the load balancers.

https://github.com/kubernetes/kubernetes/blob/32b37f5a85e8f2987f8367e611147cd4d4dfa4f9/pkg/controller/service/service_controller.go#L588-L594 (I am confused why it’s still using fetch rather than watch…There’s a 100s nodeSyncPeriod)

The logic here will make sure unschedulable nodes are removed from the load balancer. This is common logic, and I think the idea is not to couple the ingress/service controller logic with CA or a normal drain.

To make it generic, I think CA should also set this field in its custom drain logic, maybe like this: https://github.com/Jeffwan/autoscaler/commit/0c02d5bed0d8555187a2b1b289e1044c6f9e2b5c#diff-5f36cd12baa8998cd7f2ba7c4a00cbc6

/assign