ingress-nginx: Dynamic reconfiguration of endpoints does not happen instantly
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): No
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): 502 endpoints rollingUpdate
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
NGINX Ingress controller version: 0.24.1
Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: bare-metal
- OS (e.g. from /etc/os-release): ubuntu bionic
- Kernel (e.g. uname -a): 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
What happened:
I have found that even though ingress-nginx reports "Dynamic reconfiguration succeeded.", this does not mean the reconfiguration has actually taken effect, or perhaps it did not configure what it should have.
The case: there is a deployment with just one replica (deliberately), and the application is served through ingress-nginx.
The nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502" annotation is set.
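For reference, the annotation can also be applied straight to the Ingress with kubectl (a sketch; the Ingress name and namespace match the controller log below, everything else is just one way of doing it):

# Sketch: set the retry annotation on the Ingress (names taken from the logs below)
kubectl -n env-name-app annotate ingress app \
  nginx.ingress.kubernetes.io/proxy-next-upstream="error timeout http_502" \
  --overwrite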
Then I change the value of an environment variable and re-apply the deployment, so that Kubernetes creates a new pod and kills the old one.
Rolling update strategy is configured as such:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
and there is a readiness probe:
  readinessProbe:
    httpGet:
      path: /
      port: 80
    initialDelaySeconds: 0
    periodSeconds: 6
    timeoutSeconds: 4
    failureThreshold: 3
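While a rollout is in progress, the moment the new pod passes its readiness probe and its IP is added to the Endpoints object can be watched directly (a sketch; namespace and names are taken from the controller log below, and each command is meant to run in its own terminal):

# Watch pods and the Endpoints object during the rollout; the new pod's IP
# shows up in the Endpoints as soon as its readiness probe succeeds
kubectl -n env-name-app get pods -o wide -w
kubectl -n env-name-app get endpoints app -w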
What you expected to happen: I expect the switch from the old pod to the new pod to happen seamlessly, and the logs confirm that this is what should happen. Yet it does not.
How I tested it:
I started an infinite loop with curl that continuously sends requests and prints the date, the HTTP status, and a random number (used to correlate requests with the log). 13:08:08 is when the readiness probe confirmed the new pod was ready and Kubernetes sent the endpoints update event.
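The exact command was not included here, but a minimal sketch of such a loop looks like this (the hostname matches the access log below; the random number doubles as the request path; -k is only needed if the wildcard certificate is not trusted locally):

# Print "[date]status random-id" for every request, matching the output below
while true; do
  n=$RANDOM
  code=$(curl -sk -o /dev/null -w '%{http_code}' "https://app.org.tld/$n")
  echo "[$(date '+%m-%d %H:%M:%S')]$code $n"
done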
[05-10 13:08:08]302 8341
[05-10 13:08:08]302 32272
[05-10 13:08:08]502 15488
[05-10 13:08:09]502 12850
[05-10 13:08:09]502 27194
[05-10 13:08:09]302 5523
[05-10 13:08:14]302 1760
Now, the most interesting part of the logs.
ingress-nginx log (partially anonymised)
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /5286 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 0.047 [env-name-app-app-80] 10.51.1.176:80 432 0.048 302 44f7460c0f3bcfebca1bcada6c63f6a1
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /11485 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.043 [env-name-app-app-80] 10.51.1.176:80 432 0.044 302 fcbf4eea7bb6f68e3e3d3765baa490f6
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /28636 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.042 [env-name-app-app-80] 10.51.1.176:80 432 0.044 302 091517f8396ab1eb1d64e72fb54eda7e
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /26277 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.041 [env-name-app-app-80] 10.51.1.176:80 432 0.040 302 574053a6f07f1ff23c78c793024ffc84
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /8341 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 0.045 [env-name-app-app-80] 10.51.1.176:80 432 0.048 302 3663c5f1656b2ad83c72a303352b251c
I0510 01:08:08.858024 6 nginx.go:359] Event UPDATE received - object &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973200,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{endpoints.kubernetes.io/last-change-trigger-time: 2019-05-10T01:08:08Z,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.176 0xc0008d1620 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-5b7dd68b58-qxkc5,UID:3c92cfb6-72bf-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3972578,FieldPath:,}} {10.51.1.177 0xc0008d1630 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.867260 6 queue.go:87] queuing item &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973200,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{endpoints.kubernetes.io/last-change-trigger-time: 2019-05-10T01:08:08Z,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.176 0xc0008d1620 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-5b7dd68b58-qxkc5,UID:3c92cfb6-72bf-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3972578,FieldPath:,}} {10.51.1.177 0xc0008d1630 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.874884 6 queue.go:128] syncing env-name-app/app
I0510 01:08:08.875216 6 controller.go:683] Creating upstream "env-name-app-app-80"
I0510 01:08:08.875232 6 controller.go:788] Obtaining ports information for Service "env-name-app/app"
I0510 01:08:08.875249 6 endpoints.go:73] Getting Endpoints for Service "env-name-app/app" and port &ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:80,NodePort:0,}
I0510 01:08:08.875267 6 endpoints.go:116] Endpoints found for Service "env-name-app/app": [{10.51.1.176 80 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-5b7dd68b58-qxkc5,UID:3c92cfb6-72bf-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3972578,FieldPath:,}} {10.51.1.177 80 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}]
I0510 01:08:08.875381 6 controller.go:1265] Found SSL certificate matching host "app.org.tld": "env-name-app/tls-star-org-tld"
I0510 01:08:08.875463 6 controller.go:459] Replacing location "/" for server "app.org.tld" with upstream "upstream-default-backend" to use upstream "env-name-app-app-80" (Ingress "env-name-app/app")
I0510 01:08:08.875493 6 controller.go:241] Obtaining information about TCP stream services from ConfigMap "ingress-nginx/tcp-services"
I0510 01:08:08.875512 6 controller.go:241] Obtaining information about UDP stream services from ConfigMap "ingress-nginx/udp-services"
[10/May/2019:01:08:08 +0000]TCP200000.000
I0510 01:08:08.877711 6 controller.go:216] Dynamic reconfiguration succeeded.
I0510 01:08:08.878718 6 nginx_status.go:168] start scraping socket: /nginx_status
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /32272 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.051 [env-name-app-app-80] 10.51.1.176:80 432 0.052 302 baf82d0477e658439b3df7108824e3ad
I0510 01:08:08.926939 6 nginx.go:359] Event UPDATE received - object &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973205,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.177 0xc000a39860 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.934640 6 queue.go:87] queuing item &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973205,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.177 0xc000a39860 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.972490 6 socket.go:343] removing ingresses [] from metrics
I0510 01:08:08.973955 6 nginx_status.go:168] start scraping socket: /nginx_status
2019/05/10 01:08:08 [error] 41#41: *7198 upstream prematurely closed connection while reading response header from upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /15488 HTTP/2.0", upstream: "http://10.51.1.176:80/15488", host: "app.org.tld"
2019/05/10 01:08:09 [error] 41#41: *7198 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /15488 HTTP/2.0", upstream: "http://10.51.1.176:80/15488", host: "app.org.tld"
2019/05/10 01:08:09 [error] 41#41: *7198 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /15488 HTTP/2.0", upstream: "http://10.51.1.176:80/15488", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:09 +0000] "GET /15488 HTTP/2.0" 502 150 "-" "curl/7.58.0" 48 0.017 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.176:80, 10.51.1.176:80 0, 0, 0 0.012, 0.004, 0.000 502, 502, 502 df1e3d755ed000a95b576e0fcf014629
I0510 01:08:09.025418 6 queue.go:128] syncing env-name-app/app
2019/05/10 01:08:09 [error] 40#40: *7203 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /12850 HTTP/2.0", upstream: "http://10.51.1.176:80/12850", host: "app.org.tld"
2019/05/10 01:08:09 [error] 40#40: *7203 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /12850 HTTP/2.0", upstream: "http://10.51.1.176:80/12850", host: "app.org.tld"
2019/05/10 01:08:09 [error] 40#40: *7203 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /12850 HTTP/2.0", upstream: "http://10.51.1.176:80/12850", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:09 +0000] "GET /12850 HTTP/2.0" 502 150 "-" "curl/7.58.0" 48 0.002 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.176:80, 10.51.1.176:80 0, 0, 0 0.004, 0.000, 0.000 502, 502, 502 1d994cde4986d2b408f98c79dc4885cd
2019/05/10 01:08:09 [error] 40#40: *7208 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /27194 HTTP/2.0", upstream: "http://10.51.1.176:80/27194", host: "app.org.tld"
2019/05/10 01:08:09 [error] 40#40: *7208 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /27194 HTTP/2.0", upstream: "http://10.51.1.176:80/27194", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:09 +0000] "GET /27194 HTTP/2.0" 502 150 "-" "curl/7.58.0" 48 0.003 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.176:80, 10.51.1.176:80 0, 0, 0 0.000, 0.004, 0.000 502, 502, 502 fefc17531917044ef9221490019c9d96
2019/05/10 01:08:09 [error] 40#40: *7208 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /27194 HTTP/2.0", upstream: "http://10.51.1.176:80/27194", host: "app.org.tld"
I0510 01:08:12.208823 6 controller.go:683] Creating upstream "env-name-app-app-80"
I0510 01:08:12.208839 6 controller.go:788] Obtaining ports information for Service "env-name-app/app"
I0510 01:08:12.208857 6 endpoints.go:73] Getting Endpoints for Service "env-name-app/app" and port &ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:80,NodePort:0,}
I0510 01:08:12.208873 6 endpoints.go:116] Endpoints found for Service "env-name-app/app": [{10.51.1.177 80 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}]
I0510 01:08:12.208974 6 controller.go:1265] Found SSL certificate matching host "app.org.tld": "env-name-app/tls-star-org-tld"
I0510 01:08:12.209142 6 controller.go:459] Replacing location "/" for server "app.org.tld" with upstream "upstream-default-backend" to use upstream "env-name-app-app-80" (Ingress "env-name-app/app")
I0510 01:08:12.209172 6 controller.go:241] Obtaining information about TCP stream services from ConfigMap "ingress-nginx/tcp-services"
I0510 01:08:12.209195 6 controller.go:241] Obtaining information about UDP stream services from ConfigMap "ingress-nginx/udp-services"
[10/May/2019:01:08:12 +0000]TCP200000.001
I0510 01:08:12.212630 6 controller.go:216] Dynamic reconfiguration succeeded.
I0510 01:08:12.212995 6 nginx_status.go:168] start scraping socket: /nginx_status
I0510 01:08:12.253664 6 socket.go:343] removing ingresses [] from metrics
I0510 01:08:12.259530 6 nginx_status.go:168] start scraping socket: /nginx_status
2019/05/10 01:08:14 [error] 41#41: *7213 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /5523 HTTP/2.0", upstream: "http://10.51.1.176:80/5523", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:14 +0000] "GET /5523 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 5.050 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.177:80 0, 432 5.000, 0.052 504, 302 58c40cc9fbe4990929d2eb57b3ac5783
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:14 +0000] "GET /1760 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 0.050 [env-name-app-app-80] 10.51.1.177:80 432 0.048 302 8f93d75cc1c00f581013547a72a249e1
From the log you can see that the old pod IP is 10.51.1.176 and the new IP is 10.51.1.177.
Now see that at I0510 01:08:08.858024 the controller receives an event with the updated list of endpoints; both IPs are there. About 19 ms later, at I0510 01:08:08.877711, ingress-nginx reports that it has reconfigured itself.
Roughly 50 ms after that, at I0510 01:08:08.926939, there is another endpoints update from Kubernetes: 10.51.1.176 is no longer in the list.
At this point I expect nginx to have 2 backends in its reverse proxy config (since the latest update has not been applied yet).
Yet that is not the case: the /15488 request is forwarded only to 10.51.1.176, 3 times as configured, and hence fails. Then the same happens to /12850 and /27194.
Only after these 3 failed requests, at I0510 01:08:12.212630, do we see another dynamic reconfiguration statement.
Then the /5523 request first hits the old .176 address and is retried against .177.
From that point on, all requests are served by .177 as expected.
I expected that every request after I0510 01:08:08.877711 would be forwarded to the 2 backends; that is not what happens.
How to reproduce it (as minimally and precisely as possible):
- Create a deployment with 1 replica, maxUnavailable: 0, and maxSurge: 1
- Start a client that constantly requests the application
- Change something in the deployment so that Kubernetes rolls out a new pod (see the sketch below)
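For the last step, any change to the pod template works; for example (the variable name here is made up):

# Trigger a rolling update by changing an environment variable (hypothetical name)
kubectl -n env-name-app set env deployment/app DUMMY_VAR="$(date +%s)"
# Follow the rollout while the client keeps hammering the application
kubectl -n env-name-app rollout status deployment/app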
Anything else we need to know:
About this issue
- State: closed
- Created 5 years ago
- Comments: 29 (15 by maintainers)
It's enabled: https://github.com/zerkms/ingress-nginx-4077/blob/master/ingress.yaml#L7
And it wouldn't help, because at the point when the problem occurs only one backend is configured in nginx - the old one. That's why I mentioned the LB needs a way to configure the number of retries (with a delay).
@zerkms if you use this deployment, it works as expected. The issue is in the readinessProbe (initialDelaySeconds). (Tested in minikube with your repository files.)
@zerkms if you are concerned about the log message being imprecise/unclear then I agree. The log message Dynamic reconfiguration succeeded only means the controller has posted the endpoints data to Lua. It does not mean every Nginx worker process has picked it up and decoded it; that takes up to another second for every worker. Until then, all the workers keep sending requests to the old pod and do not know about the new one.
In my company we run ingress-nginx at over 100K RPS with over 300 app servers behind it. We deploy dozens of times in a single day and do not have an issue with this delay.
As I said, if the issue you are describing is the unclear/misleading log message, then we can discuss that. Or, if you want to get rid of that extra 1 s delay, then my next question would be: what is the practical benefit? Because removing it means adding more complexity to the system and a potential performance impact.
https://github.com/zerkms/ingress-nginx-4077 - here is a repository with a complete minimal repro
You are more than welcome to open a PR to update the docs.
That is not possible - not only in ingress-nginx but in any other ingress controller that uses client-go and Informers to be notified about changes in the pods/endpoints.
Endpoints are expensive to create and update. That's why they are being rethought here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190603-EndpointSlice-API.md