ingress-nginx: HTTP 504 response from AWS ELB when using ingress-nginx-controller
NGINX Ingress controller version
nginx-ingress-controller:0.33.0
Environment
- AWS EKS v1.14
- EC2 instances running nginx-ingress-controllers (3 EC2 instances, 2 controllers)
- Classic AWS load balancer (ELB) in front of the nginx-ingress-controllers
What happened
The client fails to send the whole HTTP request within the allotted time (the AWS ELB idle timeout); the nginx-ingress-controller returns 400, which causes the ELB to return 504. The issue happens rarely: ~99% of the requests are successful. With the request chain "ELB -> nginx-ingress-controller -> application", in the failed cases we don't see the request reach the application (the chain stops at the nginx-ingress-controller). The application showed no signs of load or garbage-collector activity at the time; by all our graphs it was completely healthy. There are no restarts of the nginx-ingress-controller or the application when the issue happens.
Example of failed request
Nginx-ingress-controller response
```json
{
  "body_bytes_sent": 0,
  "bytes_sent": 0,
  "connection": 562223,
  "connection_requests": 28,
  "duration": 54.5,
  "http": {
    "method": "POST",
    "status_code": 400,
    "url": "/locations/4a56c371-e819-4a85-aec5-c0076474a6fd/positions",
    "url_details": {
      "path": "/locations/4a56c371-e819-4a85-aec5-c0076474a6fd/positions"
    },
    "version": 1.1
  },
  "http_referrer": "",
  "http_user_agent": "RestSharp/106.11.4.0",
  "level": "info",
  "origin": "",
  "remote_addr": "1.125.0.0",
  "remote_user": "",
  "request": "POST /locations/4a56c371-e819-4a85-aec5-c0076474a6fd/positions HTTP/1.1",
  "request_length": 1420,
  "request_time": 54.500,
  "server_name": "api.opensc.org",
  "status": 400,
  "time_local": "16/Sep/2020:09:45:35 +0000",
  "upstream_connect_time": "",
  "upstream_header_time": "",
  "upstream_response_time": ""
}
```
ELB response
```json
{
  "timestamp": "2020-09-16T09:44:40.849658Z",
  "elb_name": "adaf41a991cee11eabeb7062e7d1b817",
  "request_ip": "1.125.106.64",
  "request_port": "44182",
  "backend_ip": "",
  "backend_port": "",
  "request_processing_time": "-1.0",
  "backend_processing_time": "-1.0",
  "client_response_time": "-1.0",
  "elb_response_code": "504",
  "backend_response_code": "0",
  "received_bytes": "0",
  "sent_bytes": "0",
  "request_verb": "POST",
  "url": "https://api.opensc.org:443/locations/4a56c371-e819-4a85-aec5-c0076474a6fd/positions",
  "protocol": "HTTP/1.1",
  "user_agent": "RestSharp/106.11.4.0",
  "ssl_cipher": "ECDHE-RSA-AES128-GCM-SHA256",
  "ssl_protocol": "TLSv1.2"
}
```
The nginx-ingress-controller reported time is 16/Sep/2020:09:45:35 +0000 and the AWS ELB reported time is 2020-09-16T09:44:40.849658Z. The difference is roughly 55s (09:45:35 minus 09:44:40.85 is about 54s, consistent with the logged request_time of 54.5s), which matches the AWS ELB idle timeout.
Why we think it is a client issue
According to the AWS docs (https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/ts-elb-error-message.html#ts-elb-errorcodes-http504, "Solution 2"), the ELB idle timeout should be less than the keep-alive timeout on the server (the nginx-ingress-controller in our case). As a test, we set the following (see the configuration sketch below):
- ELB idle timeout to 65s
- client_header_timeout 60s
- client_body_timeout 60s

With these settings the nginx-ingress-controller started to return 408 REQUEST TIMEOUT after 60s, but the ELB still returned 504 for the request. Also, for all successful requests of this kind the client request is ~1602 bytes, while for the failed requests it is smaller (1420 bytes in the log example above). When nginx triggers client_body_timeout it logs a 408 but in fact just closes the connection (see https://trac.nginx.org/nginx/ticket/1005), so from the ELB's perspective the node simply closed the connection, which is why it returns 504. This proved that the client did not send the whole HTTP request within the given time.
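For reference, a minimal sketch of how this test configuration could be expressed, assuming the nginx timeouts are set through the ingress-nginx ConfigMap and the ELB idle timeout through the controller Service annotation (resource names, namespace, and labels here are assumptions, not taken from our cluster):

```yaml
# Sketch only: resource names, namespace, and labels are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # must match the controller's --configmap flag
  namespace: ingress-nginx
data:
  client-header-timeout: "60"      # rendered as client_header_timeout 60s
  client-body-timeout: "60"        # rendered as client_body_timeout 60s
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    # Classic ELB idle timeout raised to 65s for this test
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "65"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
```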
Anything else we need to know
Our current settings are:
- AWS ELB idle timeout is 55s
- Nginx-ingress-controller timeouts:
  – keepalive_timeout is 75s
  – client_header_timeout is 75s
  – client_body_timeout is 75s
- We enabled nginx-ingress-controller debug logs, but they were too cryptic for us
- According to AWS (https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/ts-elb-error-message.html#ts-elb-errorcodes-http504), there are two causes of an AWS ELB 504:
  – "The application takes longer to respond than the configured idle timeout." This does not seem to be our case, as we are waiting for the client to transmit the request.
  – "Registered instances closing the connection to Elastic Load Balancing." This seems to be our case.
We are not sure how to proceed or which timeout settings we need to tune to eliminate the issue.
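For reference, a minimal sketch of what the current values listed above would look like when set through the ingress-nginx ConfigMap (assuming that is how they are configured; the resource name and namespace are illustrative). The 55s ELB idle timeout would be the `service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "55"` annotation on the controller Service.

```yaml
# Sketch only: name/namespace are assumptions; values are the ones listed above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  keep-alive: "75"               # rendered as keepalive_timeout 75s
  client-header-timeout: "75"    # rendered as client_header_timeout 75s
  client-body-timeout: "75"      # rendered as client_body_timeout 75s
```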
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 29 (12 by maintainers)
I tested on a 2-node m5.large EKS 1.22 cluster with an ALB, a Classic ELB, and an NLB, running ingress-nginx v1.2.1 with one controller.
I did not see any 5XXs in my testing. If you have any other input parameters for the testing, please let me know. Tracking down 0.01% or 0.1% issues can be difficult.
Test image: kennethreitz/httpbin (http://httpbin.org/)
AWS install per the docs: https://kubernetes.github.io/ingress-nginx/deploy/#aws
Configuration
- ingress-nginx configmap
- k6 test script example

Test runs (AWS Classic, ALB, NLB):
- HTTP GET /
- HTTP GET /bytes/2048 (Classic: 1 failure)
- HTTP POST of a 50 MB file
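For context, a hedged sketch of how the Service in front of the controller could select the load balancer flavour for tests like these; the annotation shown is the legacy in-tree one, and an ALB would typically be provisioned separately (for example via the AWS Load Balancer Controller), since a Service of type LoadBalancer only creates a Classic ELB or an NLB:

```yaml
# Sketch only: names and labels are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    # Omit this annotation for a Classic ELB; set "nlb" for a Network Load Balancer.
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    # Applies to Classic ELBs only; NLB TCP flows use a fixed 350s idle timeout.
    # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
```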
@iamNoah1 I was able to reproduce the issue with the following setup:
The scenario is the same: I simulated a client that slowly sends the request. The ELB response is 504.
Thanks for the feedback, @VladyslavKurmaz. @evredinka friendly ping. If this is no longer a bug for you, please consider closing this issue.
@iamNoah1 In our case, it just confirmed the bulletproof rule: in 99.9% of cases the issue is on our side. In our AWS k8s cluster we have two types of node pools, and there was a network configuration issue between them. That is why the same configuration (NGINX ingress + our application) worked fine on DigitalOcean but not on AWS. After the hotfix everything works fine.