aws-load-balancer-controller: 400/502/504 errors while doing rollout restart or rolling update

Hi, I’m getting errors while doing a rolling update, and I can reproduce the problem consistently with a rollout restart. I already tried the recommendations from other issues related to this problem (#814), such as adding a preStop hook that sleeps for a few seconds so the pods can finish their ongoing requests, but it doesn’t solve the problem.
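
For reference, the kind of preStop hook those issues suggest looks roughly like this in the container spec (the sleep duration below is just an illustrative value, not my exact setting):

        # Illustrative preStop hook: delays SIGTERM so the pod keeps serving
        # in-flight requests while the load balancer deregisters the target.
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]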

I have also changed the load balancer configuration so that the health check interval and threshold count are lower than what I’m setting for the pod’s readiness probe, so the load balancer can stop sending requests to a pod before it receives SIGTERM, but without success.
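
These are the kind of ALB health check annotations I mean on the ingress (the values here are illustrative, not my exact settings):

    # Illustrative ALB health check tuning on the Ingress; actual values differ.
    alb.ingress.kubernetes.io/healthcheck-path: /
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "2"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"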

This is the current configuration for the Ingress, Service, and Deployment:

Ingress

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
      { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/certificate-arn: ...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/scheme: internet-facing
    external-dns.alpha.kubernetes.io/hostname: app.dev.codacy.org, api.dev.codacy.org
    external-dns.alpha.kubernetes.io/scope: public
    kubernetes.io/ingress.class: alb
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.93.0-SNAPSHOT.d94f47083
    helm.sh/chart: codacy-api-4.93.0-SNAPSHOT.d94f47083
  name: codacy-api
  namespace: codacy
spec:
  rules:
  - host: app.dev.codacy.org
    http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: codacy-api
          servicePort: http
        path: /*
  - host: api.dev.codacy.org
    http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: codacy-api
          servicePort: http
        path: /*

Service

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
    helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
  name: codacy-api
  namespace: codacy
spec:
  clusterIP: 172.20.101.186
  ports:
  - name: http
    nodePort: 30057
    port: 80
    targetPort: http
  selector:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/name: codacy-api
  type: NodePort

Deployment

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "202"
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
    helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
  name: codacy-api
  namespace: codacy
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: codacy-api
      app.kubernetes.io/name: codacy-api
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: ...
        kubectl.kubernetes.io/restartedAt: "2019-11-06T10:54:32Z"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: codacy-api
        app.kubernetes.io/name: codacy-api
    spec:
      containers:
      - envFrom:
        - configMapRef:
            name: codacy-api
        - secretRef:
            name: codacy-api
        image: codacy/codacy-website:4.94.0-SNAPSHOT.394a06196
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          initialDelaySeconds: 75
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: codacy-api
        ports:
        - containerPort: 9000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 60

To reproduce the issue I use fortio to call an endpoint on the application continuously for a while (like this: fortio load -a -c 8 -qps 500 -t 60s "https://app.dev.codacy.org/manual/user/project/dashboard?bid=123"), and meanwhile I run kubectl rollout restart deployment/codacy-api -n codacy to restart the pods.

At the end there are some errors caused by the rollout restart:

Fortio 1.3.1 running at 500 queries per second, 4->4 procs, for 1m0s: https://app.dev.codacy.org/manual/user/project/dashboard?bid=123
22:57:41 I httprunner.go:82> Starting http test for https://app.dev.codacy.org/manual/user/project/dashboard?bid=123 with 8 threads at 500.0 qps
22:57:41 W http_client.go:136> https requested, switching to standard go client
Starting at 500 qps with 8 thread(s) [gomax 4] for 1m0s : 3750 calls each (total 30000)
22:59:25 W periodic.go:487> T001 warning only did 257 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T001 ended after 1m0.072690186s : 257 calls. qps=4.278150340933027
22:59:25 W periodic.go:487> T002 warning only did 254 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T002 ended after 1m0.109367672s : 254 calls. qps=4.225630876471816
22:59:26 W periodic.go:487> T004 warning only did 258 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T004 ended after 1m0.133407105s : 258 calls. qps=4.290460368385607
22:59:26 W periodic.go:487> T006 warning only did 244 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T006 ended after 1m0.133490304s : 244 calls. qps=4.057639075438291
22:59:26 W periodic.go:487> T005 warning only did 249 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T005 ended after 1m0.265693232s : 249 calls. qps=4.131703903934942
22:59:26 W periodic.go:487> T007 warning only did 237 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T007 ended after 1m0.271662857s : 237 calls. qps=3.932196139374884
22:59:26 W periodic.go:487> T003 warning only did 255 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T003 ended after 1m0.272313488s : 255 calls. qps=4.230798276073633
22:59:27 W periodic.go:487> T000 warning only did 220 out of 3750 calls before reaching 1m0s
22:59:27 I periodic.go:533> T000 ended after 1m1.631718364s : 220 calls. qps=3.5695905588851025
Ended after 1m1.63173938s : 1974 calls. qps=32.029
Aggregated Sleep Time : count 1974 avg -26.514521 +/- 15.48 min -58.11073874 max -0.412190249 sum -52339.6648
# range, mid point, percentile, count
>= -58.1107 <= -0.41219 , -29.2615 , 100.00, 1974
# target 50% -29.2761
WARNING 100.00% of sleep were falling behind
Aggregated Function Time : count 1974 avg 0.24452006 +/- 0.2508 min 0.053575648 max 10.063352224 sum 482.682607
# range, mid point, percentile, count
>= 0.0535756 <= 0.06 , 0.0567878 , 0.30, 6
> 0.06 <= 0.07 , 0.065 , 0.51, 4
> 0.07 <= 0.08 , 0.075 , 0.56, 1
> 0.16 <= 0.18 , 0.17 , 1.01, 9
> 0.18 <= 0.2 , 0.19 , 12.61, 229
> 0.2 <= 0.25 , 0.225 , 83.84, 1406
> 0.25 <= 0.3 , 0.275 , 93.26, 186
> 0.3 <= 0.35 , 0.325 , 96.30, 60
> 0.35 <= 0.4 , 0.375 , 97.42, 22
> 0.4 <= 0.45 , 0.425 , 98.18, 15
> 0.45 <= 0.5 , 0.475 , 99.04, 17
> 0.5 <= 0.6 , 0.55 , 99.24, 4
> 0.6 <= 0.7 , 0.65 , 99.34, 2
> 0.8 <= 0.9 , 0.85 , 99.39, 1
> 1 <= 2 , 1.5 , 99.90, 10
> 2 <= 3 , 2.5 , 99.95, 1
> 10 <= 10.0634 , 10.0317 , 100.00, 1
# target 50% 0.226245
# target 75% 0.243794
# target 90% 0.282688
# target 99% 0.497824
# target 99.9% 2.026
Sockets used: 0 (for perfect keepalive, would be 8)
Code 200 : 1956 (99.1 %)
Code 400 : 13 (0.7 %)
Code 502 : 4 (0.2 %)
Code 504 : 1 (0.1 %)
Response Header Sizes : count 1974 avg 0 +/- 0 min 0 max 0 sum 0
Response Body/Total Sizes : count 1974 avg 41758.073 +/- 3379 min 138 max 42080 sum 82430436

I always get some errors while restarting the pods during this test. This is causing problems in our production application when we do rolling updates.

I noticed that the nginx ingress controller has a proxy-next-upstream configuration to specify in which cases a request should be retried against the next upstream server. Is there any way to do this with this load balancer, or should I use nginx instead?
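
For context, this is roughly what I mean, expressed as ingress-nginx annotations (the values are illustrative):

    # Illustrative ingress-nginx retry settings: retry a failed request on the next upstream.
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"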

Thanks for the help.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 17 (1 by maintainers)

Most upvoted comments

Both the v1 and v2 controllers support zero-downtime deployments. We still need to document how to set this up.
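
As a starting point, a common setup with the v2 controller is to enable pod readiness gate injection on the namespace and pair it with a preStop delay and a shorter deregistration delay (a sketch under those assumptions; the values are illustrative):

# Label the namespace so the controller injects readiness gates into new pods,
# making them Ready only once they are healthy in the ALB target group.
apiVersion: v1
kind: Namespace
metadata:
  name: codacy
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

# On the Ingress, a shorter target deregistration delay (illustrative value):
#   alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30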

I tried to fix this in PR #1775, but I ended up with a separate package: https://github.com/foriequal0/pod-graceful-drain