ingress-nginx: Using proxy-next-upstream when proxy-connect-timeout happens?
When an upstream pod changes its IP, for example after a pod or node restart, we get a small fraction of 504 errors with a consistent 5-second timeout. The error logs and traces make it very clear that the upstream pod IP address used by Nginx no longer exists. With pod disruption budgets in place, we almost always have at least 3 available replicas of these upstream pods.
We have proxy_connect_timeout=5s. These are the other settings I’ve extracted from the Nginx ingress conf file:
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
proxy_next_upstream_timeout 1;
proxy_next_upstream_tries 3;
Can we use proxy_next_upstream and hope that it goes to the next available pod?
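One detail worth flagging in the config above: proxy_next_upstream_timeout 1; means 1 second (nginx time values without a suffix are seconds), which is shorter than the 5s proxy_connect_timeout. Since that directive limits the total time during which a request may be passed to the next server, a first attempt that spends the full 5s timing out against a dead IP has already used up the retry window, which would explain 504s that show a single 5s attempt and no retry. Below is a minimal sketch of a retry setup in raw nginx directives; the values are illustrative assumptions, not a recommendation from the maintainers, and in ingress-nginx they would normally be set through the matching nginx.ingress.kubernetes.io/proxy-next-upstream* annotations or ConfigMap keys rather than by editing the generated conf:
# Fail a single connect attempt against a dead IP quickly so retries have room to run.
proxy_connect_timeout 3s;
# Retry on connect errors/timeouts and on 5xx answers from the upstream.
proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
# At most 3 attempts in total (the original request plus 2 retries).
proxy_next_upstream_tries 3;
# 0 disables the time limit on passing to the next server; any non-zero value
# needs to be comfortably larger than proxy_connect_timeout, otherwise the
# retry window closes before the first failed attempt has even timed out.
proxy_next_upstream_timeout 0;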
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (5 by maintainers)
Same issue here. In our case, a “chaos controller” terminated one node, and we saw ingress 504 errors. There were 2 pods, one that died and one that was on a healthy node. There were 3 ingresses in total, and the same situation occurred in all of them.
In a 30s period, there were a total of 79 requests to the dying pod, out of which:
- Case 1: 100.115.211.230:14444, 100.104.239.13:14444  0, 0  5.000, 0.008  504, 200
- Case 2: 100.115.211.230:14444, 100.115.211.230:14444, 100.104.239.13:14444  0, 0, 1480  5.000, 5.000, 0.004  504, 504, 200
- Case 3: 100.115.211.230:14444, 100.115.211.230:14444, 100.115.211.230:14444  0, 0, 0  5.000, 5.000, 5.000  504, 504, 504
Any idea why the same ingress controller has a different retry setup for the same service? These three cases on the same ingress were mixed in time (it was not that all case 1 requests came first, then case 2, and case 3 at the end; they were evenly distributed in time).
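For anyone decoding those columns: assuming the default ingress-nginx upstreaminfo log format (documented on the log-format page linked further down), the four field groups in each case are the following upstream variables, with one comma-separated value per attempt:
$upstream_addr             # address used for each attempt
$upstream_response_length  # bytes received from each attempt
$upstream_response_time    # seconds spent on each attempt
$upstream_status           # HTTP status of each attempt
Read that way, case 2 is two connect attempts against the dead pod that each hit the 5s timeout with a 504, followed by a third attempt that got a 200 and 1480 bytes from the healthy pod in 0.004s.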
@aledbf after testing, proxy_next_upstream does work, but the behaviour is wrong.
With proxy_next_upstream_tries set to 3 and a proxy_connect_timeout of 3s, I get a consistent 9s latency on my 504 errors.
It was still failing, and upon checking $upstream_addr, the 3 pod IPs used were all identical.
But it also works!
This is my config.
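The consistent 9s lines up with the numbers in this comment: three attempts, each spending the full 3s proxy_connect_timeout against an address that never answers, and a retry can land on the same address when the dead pod’s IP is still present in the endpoint list the controller is balancing over. A rough worst-case sketch using the hypothetical values mentioned here (this is not the commenter’s actual config, which isn’t reproduced above):
# Worst case when every attempt hits an endpoint that never answers:
#   total latency ~= proxy_next_upstream_tries * proxy_connect_timeout = 3 * 3s = 9s
proxy_connect_timeout 3s;
proxy_next_upstream_tries 3;
# Optionally cap the retry window to trade retries for bounded latency, e.g.
# a ~6s limit leaves room for roughly two full connect timeouts:
# proxy_next_upstream_timeout 6s;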
You can see exactly what nginx is doing by reading the logs. When a retry is required, the $upstream_addr field becomes a list in which you can see the IPs of the pods and also the status code that triggered each retry: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/log-format/
It’s 2022 and I have the same issue. Nginx ingress controller behind an ALB, targeting a microservice running 3 instances. When doing a node upgrade we get the odd 504 error (not many, but still a handful, like 10). When looking in detail we end up with the same conclusion: it is retrying the same address even though we have at least 2 available instances at all times (PDB with maxUnavailable set to 1).