ingress-nginx: Using proxy-next-upstream when proxy-connect-timeout happens?
When an upstream pod changes its IP, for example after a pod or node restart, we get a small fraction of 504 errors with a consistent 5-second timeout. The error logs and traces make it very clear that the upstream pod IP address used by Nginx no longer exists. With pod disruption budgets in place, we almost always have at least 3 available replicas of these upstream pods.
We have proxy_connect_timeout=5s. These are the other settings I’ve extracted from the Nginx ingress conf file:
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
proxy_next_upstream_timeout 1;
proxy_next_upstream_tries 3;
Can we use proxy_next_upstream and hope that it goes to the next available pod?
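One detail worth flagging in the config above: proxy_next_upstream_timeout 1; means 1 second (nginx time values without a suffix are seconds), which is shorter than the 5s proxy_connect_timeout. Since that directive limits the total time during which a request may be passed to the next server, a first attempt that spends the full 5s timing out against a dead IP has already used up the retry window, which would explain 504s that show a single 5s attempt and no retry. Below is a minimal sketch of a retry setup in raw nginx directives; the values are illustrative assumptions, not a recommendation from the maintainers, and in ingress-nginx they would normally be set through the matching nginx.ingress.kubernetes.io/proxy-next-upstream* annotations or ConfigMap keys rather than by editing the generated conf:
# Fail a single connect attempt against a dead IP quickly so retries have room to run.
proxy_connect_timeout 3s;
# Retry on connect errors/timeouts and on 5xx answers from the upstream.
proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
# At most 3 attempts in total (the original request plus 2 retries).
proxy_next_upstream_tries 3;
# 0 disables the time limit on passing to the next server; any non-zero value
# needs to be comfortably larger than proxy_connect_timeout, otherwise the
# retry window closes before the first failed attempt has even timed out.
proxy_next_upstream_timeout 0;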
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (5 by maintainers)
Same issue here. In our case, a “chaos controller” terminated one node, and we saw ingress 504 errors. There were 2 pods, one that died and one that was on a healthy node. There were 3 ingresses in total, and the same situation occurred in all of them.
In a 30s period, there were a total of 79 requests to the dying pod, out of which:
- Case 1: 100.115.211.230:14444, 100.104.239.13:14444  0, 0  5.000, 0.008  504, 200
- Case 2: 100.115.211.230:14444, 100.115.211.230:14444, 100.104.239.13:14444  0, 0, 1480  5.000, 5.000, 0.004  504, 504, 200
- Case 3: 100.115.211.230:14444, 100.115.211.230:14444, 100.115.211.230:14444  0, 0, 0  5.000, 5.000, 5.000  504, 504, 504
Any idea why the same ingress controller has a different retry setup for the same service? These three cases on the same ingress were mixed in time (it was not that all case 1 requests came first, then case 2, and case 3 at the end; they were evenly distributed in time).
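For anyone decoding those columns: assuming the default ingress-nginx upstreaminfo log format (documented on the log-format page linked further down), the four field groups in each case are the following upstream variables, with one comma-separated value per attempt:
$upstream_addr             # address used for each attempt
$upstream_response_length  # bytes received from each attempt
$upstream_response_time    # seconds spent on each attempt
$upstream_status           # HTTP status of each attempt
Read that way, case 2 is two connect attempts against the dead pod that each hit the 5s timeout with a 504, followed by a third attempt that got a 200 and 1480 bytes from the healthy pod in 0.004s.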
@aledbf after testing, proxy_next_upstream does work, but the behaviour is wrong.
With proxy_next_upstream_tries set to 3 and a proxy_connect_timeout of 3s, I get a consistent 9s latency on my 504 errors.
It was still failing, and upon checking $upstream_addr, the 3 pod IPs used were all identical.
But it also works!
This is my config.
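The consistent 9s lines up with the numbers in this comment: three attempts, each spending the full 3s proxy_connect_timeout against an address that never answers, and a retry can land on the same address when the dead pod’s IP is still present in the endpoint list the controller is balancing over. A rough worst-case sketch using the hypothetical values mentioned here (this is not the commenter’s actual config, which isn’t reproduced above):
# Worst case when every attempt hits an endpoint that never answers:
#   total latency ~= proxy_next_upstream_tries * proxy_connect_timeout = 3 * 3s = 9s
proxy_connect_timeout 3s;
proxy_next_upstream_tries 3;
# Optionally cap the retry window to trade retries for bounded latency, e.g.
# a ~6s limit leaves room for roughly two full connect timeouts:
# proxy_next_upstream_timeout 6s;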
You can see exactly what nginx is doing by reading the logs. When a retry is required, the $upstream_addr field becomes a list in which you can see the IPs of the pods and also the status code that triggered each retry: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/log-format/
It’s 2022 and I have the same issue. Nginx ingress controller behind an ALB, targeting a microservice running 3 instances. When doing a node upgrade we get the odd 504 error (not many, but still a handful, like 10). When looking in detail we end up with the same conclusion: it is retrying the same address even though we have at least 2 available instances at all times (PDB with maxUnavailable set to 1).