aws-load-balancer-controller: 400/502/504 errors while doing rollout restart or rolling update
Hi, I’m getting errors while doing a rolling update, and I can reproduce the problem consistently with a rollout restart. I already tried the recommendations from other issues related to this problem (#814), such as adding a preStop hook that sleeps for a few seconds so the pods can finish their in-flight requests, but it doesn’t solve the problem.
I have also changed the load balancer configuration so that the health check interval and threshold count are lower than the pod’s readiness probe settings, so the load balancer stops sending requests to the pods before they receive the SIGTERM, but without success.
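For reference, the preStop hook I tried (following #814) was roughly the following, added under the container spec of the deployment shown below — a minimal sketch, and the 30-second sleep is just an illustrative value that stays under terminationGracePeriodSeconds:
lifecycle:
  preStop:
    exec:
      # Keep the pod serving in-flight requests while the load balancer stops
      # sending it new ones; 30s is illustrative and must stay below
      # terminationGracePeriodSeconds (60s in the deployment below).
      command: ["/bin/sh", "-c", "sleep 30"]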
Currently this is the configuration for the ingress, service and deployment:
Ingress
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
{ "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
alb.ingress.kubernetes.io/certificate-arn: ...
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
alb.ingress.kubernetes.io/scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: app.dev.codacy.org, api.dev.codacy.org
external-dns.alpha.kubernetes.io/scope: public
kubernetes.io/ingress.class: alb
labels:
app.kubernetes.io/instance: codacy-api
app.kubernetes.io/managed-by: Tiller
app.kubernetes.io/name: codacy-api
app.kubernetes.io/version: 4.93.0-SNAPSHOT.d94f47083
helm.sh/chart: codacy-api-4.93.0-SNAPSHOT.d94f47083
name: codacy-api
namespace: codacy
spec:
rules:
- host: app.dev.codacy.org
http:
paths:
- backend:
serviceName: ssl-redirect
servicePort: use-annotation
path: /*
- backend:
serviceName: codacy-api
servicePort: http
path: /*
- host: api.dev.codacy.org
http:
paths:
- backend:
serviceName: ssl-redirect
servicePort: use-annotation
path: /*
- backend:
serviceName: codacy-api
servicePort: http
path: /*
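The health-check tuning I mentioned at the top is not shown in this Ingress; with this controller it is done through annotations, roughly like the sketch below (the values are illustrative, not the exact ones I used):
alb.ingress.kubernetes.io/healthcheck-path: /
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "2"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
# Shorter deregistration delay so draining targets are removed sooner (ALB default is 300s).
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30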
Service
apiVersion: v1
kind: Service
metadata:
annotations:
service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
labels:
app.kubernetes.io/instance: codacy-api
app.kubernetes.io/managed-by: Tiller
app.kubernetes.io/name: codacy-api
app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
name: codacy-api
namespace: codacy
spec:
clusterIP: 172.20.101.186
ports:
- name: http
nodePort: 30057
port: 80
targetPort: http
selector:
app.kubernetes.io/instance: codacy-api
app.kubernetes.io/name: codacy-api
type: NodePort
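One thing worth noting: since the service is a NodePort and the Ingress has no target-type annotation, the controller registers instance targets (node ports) rather than the pod IPs directly. I have seen IP targets suggested for smoother rollouts; switching is a single Ingress annotation (sketch):
alb.ingress.kubernetes.io/target-type: ip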
Deployment
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "202"
labels:
app.kubernetes.io/instance: codacy-api
app.kubernetes.io/managed-by: Tiller
app.kubernetes.io/name: codacy-api
app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
name: codacy-api
namespace: codacy
spec:
progressDeadlineSeconds: 600
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
app.kubernetes.io/instance: codacy-api
app.kubernetes.io/name: codacy-api
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
annotations:
iam.amazonaws.com/role: ...
kubectl.kubernetes.io/restartedAt: "2019-11-06T10:54:32Z"
creationTimestamp: null
labels:
app.kubernetes.io/instance: codacy-api
app.kubernetes.io/name: codacy-api
spec:
containers:
- envFrom:
- configMapRef:
name: codacy-api
- secretRef:
name: codacy-api
image: codacy/codacy-website:4.94.0-SNAPSHOT.394a06196
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /
port: http
scheme: HTTP
initialDelaySeconds: 75
periodSeconds: 20
successThreshold: 1
timeoutSeconds: 1
name: codacy-api
ports:
- containerPort: 9000
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: http
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: docker-credentials
restartPolicy: Always
schedulerName: default-scheduler
terminationGracePeriodSeconds: 60
To replicate this issue I use fortio to call an endpoint on the application continuously for some time (like this: fortio load -a -c 8 -qps 500 -t 60s "https://app.dev.codacy.org/manual/user/project/dashboard?bid=123"), and meanwhile I run kubectl rollout restart deployment/codacy-api -n codacy to restart the pods.
At the end there are some errors caused by the rollout restart:
Fortio 1.3.1 running at 500 queries per second, 4->4 procs, for 1m0s: https://app.dev.codacy.org/manual/user/project/dashboard?bid=123
22:57:41 I httprunner.go:82> Starting http test for https://app.dev.codacy.org/manual/user/project/dashboard?bid=123 with 8 threads at 500.0 qps
22:57:41 W http_client.go:136> https requested, switching to standard go client
Starting at 500 qps with 8 thread(s) [gomax 4] for 1m0s : 3750 calls each (total 30000)
22:59:25 W periodic.go:487> T001 warning only did 257 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T001 ended after 1m0.072690186s : 257 calls. qps=4.278150340933027
22:59:25 W periodic.go:487> T002 warning only did 254 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T002 ended after 1m0.109367672s : 254 calls. qps=4.225630876471816
22:59:26 W periodic.go:487> T004 warning only did 258 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T004 ended after 1m0.133407105s : 258 calls. qps=4.290460368385607
22:59:26 W periodic.go:487> T006 warning only did 244 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T006 ended after 1m0.133490304s : 244 calls. qps=4.057639075438291
22:59:26 W periodic.go:487> T005 warning only did 249 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T005 ended after 1m0.265693232s : 249 calls. qps=4.131703903934942
22:59:26 W periodic.go:487> T007 warning only did 237 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T007 ended after 1m0.271662857s : 237 calls. qps=3.932196139374884
22:59:26 W periodic.go:487> T003 warning only did 255 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T003 ended after 1m0.272313488s : 255 calls. qps=4.230798276073633
22:59:27 W periodic.go:487> T000 warning only did 220 out of 3750 calls before reaching 1m0s
22:59:27 I periodic.go:533> T000 ended after 1m1.631718364s : 220 calls. qps=3.5695905588851025
Ended after 1m1.63173938s : 1974 calls. qps=32.029
Aggregated Sleep Time : count 1974 avg -26.514521 +/- 15.48 min -58.11073874 max -0.412190249 sum -52339.6648
# range, mid point, percentile, count
>= -58.1107 <= -0.41219 , -29.2615 , 100.00, 1974
# target 50% -29.2761
WARNING 100.00% of sleep were falling behind
Aggregated Function Time : count 1974 avg 0.24452006 +/- 0.2508 min 0.053575648 max 10.063352224 sum 482.682607
# range, mid point, percentile, count
>= 0.0535756 <= 0.06 , 0.0567878 , 0.30, 6
> 0.06 <= 0.07 , 0.065 , 0.51, 4
> 0.07 <= 0.08 , 0.075 , 0.56, 1
> 0.16 <= 0.18 , 0.17 , 1.01, 9
> 0.18 <= 0.2 , 0.19 , 12.61, 229
> 0.2 <= 0.25 , 0.225 , 83.84, 1406
> 0.25 <= 0.3 , 0.275 , 93.26, 186
> 0.3 <= 0.35 , 0.325 , 96.30, 60
> 0.35 <= 0.4 , 0.375 , 97.42, 22
> 0.4 <= 0.45 , 0.425 , 98.18, 15
> 0.45 <= 0.5 , 0.475 , 99.04, 17
> 0.5 <= 0.6 , 0.55 , 99.24, 4
> 0.6 <= 0.7 , 0.65 , 99.34, 2
> 0.8 <= 0.9 , 0.85 , 99.39, 1
> 1 <= 2 , 1.5 , 99.90, 10
> 2 <= 3 , 2.5 , 99.95, 1
> 10 <= 10.0634 , 10.0317 , 100.00, 1
# target 50% 0.226245
# target 75% 0.243794
# target 90% 0.282688
# target 99% 0.497824
# target 99.9% 2.026
Sockets used: 0 (for perfect keepalive, would be 8)
Code 200 : 1956 (99.1 %)
Code 400 : 13 (0.7 %)
Code 502 : 4 (0.2 %)
Code 504 : 1 (0.1 %)
Response Header Sizes : count 1974 avg 0 +/- 0 min 0 max 0 sum 0
Response Body/Total Sizes : count 1974 avg 41758.073 +/- 3379 min 138 max 42080 sum 82430436
I always get some errors when doing restarts during this test, and this is causing problems in our production application when we do rolling updates.
I noticed that the nginx ingress controller has the proxy-next-upstream setting to specify the cases in which a request should be retried against the next server. Is there any way to do this with this load balancer? Should I use nginx instead?
Thanks for the help.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 17 (1 by maintainers)
Both the v1 and v2 controllers support zero-downtime deployments. We need to document how to set this up.
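For the v2 controller, the building block for this is pod readiness gates: the namespace is labelled so the controller’s webhook injects a readiness gate that only becomes True once the pod is healthy in its target group, which keeps the rollout from terminating old pods before the new ones are actually receiving ALB traffic. A minimal sketch (the namespace name just mirrors the manifests above):
apiVersion: v1
kind: Namespace
metadata:
  name: codacy
  labels:
    # aws-load-balancer-controller (v2) injects readiness gates into pods
    # matched by a TargetGroupBinding in namespaces carrying this label.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled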
I tried to fix this in PR #1775, but I ended up with a separate package: https://github.com/foriequal0/pod-graceful-drain