aws-load-balancer-controller: 502/503 During deploys and/or pod termination

Hi! First of all, I appreciate the community and all the work on this project; it's very helpful and a good solution for routing directly to pods from an ALB.

However, during testing I've noticed intermittent 502/503s during deploys of our statefulset. My current hypothesis is that during a deploy the statefulset controller terminates a pod that needs updating, and there is latency between that termination and the ALB ingress controller moving the corresponding ALB target to draining. During this window, requests are still routed to the terminating pod and return 502 (from our nginx sidecar) and/or 503 (from the AWS ALB).

Has anyone else seen this problem, and does anyone have a solution for it? Ideally we'd deregister the pod from the ALB target group before killing it, if that is in fact what is happening.

I have the following Service and Ingress:

---
kind: Service
apiVersion: v1
metadata:
  name: svc-headless
  namespace: dev
spec:
  clusterIP: None
  selector:
    app: svc
  ports:
  - name: http
    port: 9000

Ingress

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: svc-external
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/security-groups: sg-xxxxxxxxxx,sg-yyyyyyyyyyy
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '3'
    alb.ingress.kubernetes.io/success-codes: '200,201,401'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:XXXXXXXXXX:certificate/uuid
    alb.ingress.kubernetes.io/subnets: subnet-aaaaa,subnet-bbbbb,subnet-cccc
  labels:
    app: svc
spec:
  rules:
    - http:
        paths:
         - path: /*
           backend:
             serviceName: ssl-redirect
             servicePort: use-annotation
         - path: /*
           backend:
             serviceName: svc-headless
             servicePort: 9000

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 20
  • Comments: 28 (3 by maintainers)

Most upvoted comments

@douglaz See this thread which covers the same issue with a couple of solutions: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1064

tldr:

  • Add a preStop sleep to your pod so that the container is kept alive briefly before shutdown. This keeps the container serving traffic while the load balancer updates its targets. You may need to increase terminationGracePeriodSeconds so the application can still shut down gracefully after the sleep (see the pod spec sketch after this list).
  • Add --feature-gates=waf=false to the alb-ingress-controller container args. Right now the controller makes WAF API requests for every deploy, and AWS throttling those requests can delay target updates. If you're not using WAF, skipping those calls entirely avoids the delay (see the controller args sketch after this list).
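A minimal sketch of the preStop approach, assuming the container image has a sleep binary available; the statefulset name, container name, image, and durations below are placeholders, and the sleep and grace period should be tuned to your target group's deregistration delay:

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: svc
  namespace: dev
spec:
  serviceName: svc-headless
  selector:
    matchLabels:
      app: svc
  template:
    metadata:
      labels:
        app: svc
    spec:
      # Must be longer than the preStop sleep plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
      - name: app                       # hypothetical container name
        image: example/app:latest       # hypothetical image
        ports:
        - containerPort: 9000
        lifecycle:
          preStop:
            exec:
              # Keep serving while the controller deregisters the target
              # and the ALB finishes draining in-flight requests.
              command: ["sleep", "30"]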
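And a sketch of the WAF feature gate change, shown as a fragment of the controller Deployment; the deployment/container names, namespace, and cluster name are assumptions, and only the --feature-gates flag comes from the linked thread:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alb-ingress-controller
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: alb-ingress-controller
        args:
        - --ingress-class=alb
        - --cluster-name=my-cluster        # replace with your cluster name
        - --feature-gates=waf=false        # skip the WAF API calls that can be throttled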

/remove-lifecycle rotten

Hi @M00nF1sh, thanks for the response.

That would work; however, it gets us back to the exact problem I'm trying to solve. We have a large number of instances across various node groups, which quickly balloons the number of instances attached to the target group. The pods we'd like to direct traffic to run on a small instance group, so this would work if we could select those EC2 instances (k8s nodes) directly. Is there a way to filter or limit which cluster nodes get attached (via a Kubernetes node label, EC2 tag, or otherwise)?