ingress-nginx: Dynamic reconfiguration of endpoints does not happen instantly
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): No
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): 502 endpoints rollingUpdate
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
NGINX Ingress controller version: 0.24.1
Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: bare-metal
- OS (e.g. from /etc/os-release): ubuntu bionic
- Kernel (e.g. uname -a): 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
What happened:
I have found that even though ingress-nginx reports "Dynamic reconfiguration succeeded.", this does not mean the reconfiguration has actually taken effect, or perhaps it did not configure what it should have.
The case: there is a deployment with just one replica (deliberately), and the application is served through ingress-nginx.
The nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502" annotation is set.
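For reference, the annotation can also be applied straight to the Ingress with kubectl (a sketch; the Ingress name and namespace match the controller log below, everything else is just one way of doing it):

# Sketch: set the retry annotation on the Ingress (names taken from the logs below)
kubectl -n env-name-app annotate ingress app \
  nginx.ingress.kubernetes.io/proxy-next-upstream="error timeout http_502" \
  --overwrite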
Then I change the value of an environment variable and re-apply the deployment, so that Kubernetes creates a new pod and kills the old one.
Rolling update strategy is configured as such:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
and there is a readiness probe:
  readinessProbe:
    httpGet:
      path: /
      port: 80
    initialDelaySeconds: 0
    periodSeconds: 6
    timeoutSeconds: 4
    failureThreshold: 3
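While a rollout is in progress, the moment the new pod passes its readiness probe and its IP is added to the Endpoints object can be watched directly (a sketch; namespace and names are taken from the controller log below, and each command is meant to run in its own terminal):

# Watch pods and the Endpoints object during the rollout; the new pod's IP
# shows up in the Endpoints as soon as its readiness probe succeeds
kubectl -n env-name-app get pods -o wide -w
kubectl -n env-name-app get endpoints app -w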
What you expected to happen: I expect the switch from the old pod to the new pod to happen seamlessly, and the logs confirm that this is what should happen. Yet it does not.
How I tested it:
I started an infinite loop with curl that continuously sends requests and prints the date, the HTTP status, and a random number (used to correlate requests with the log). 13:08:08 is when the readiness probe confirmed the new pod was ready and Kubernetes sent the endpoints update event.
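The exact command was not included here, but a minimal sketch of such a loop looks like this (the hostname matches the access log below; the random number doubles as the request path; -k is only needed if the wildcard certificate is not trusted locally):

# Print "[date]status random-id" for every request, matching the output below
while true; do
  n=$RANDOM
  code=$(curl -sk -o /dev/null -w '%{http_code}' "https://app.org.tld/$n")
  echo "[$(date '+%m-%d %H:%M:%S')]$code $n"
done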
[05-10 13:08:08]302 8341
[05-10 13:08:08]302 32272
[05-10 13:08:08]502 15488
[05-10 13:08:09]502 12850
[05-10 13:08:09]502 27194
[05-10 13:08:09]302 5523
[05-10 13:08:14]302 1760
Now, the most interesting part of the logs.
ingress-nginx log (partially anonymised)
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /5286 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 0.047 [env-name-app-app-80] 10.51.1.176:80 432 0.048 302 44f7460c0f3bcfebca1bcada6c63f6a1
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /11485 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.043 [env-name-app-app-80] 10.51.1.176:80 432 0.044 302 fcbf4eea7bb6f68e3e3d3765baa490f6
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /28636 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.042 [env-name-app-app-80] 10.51.1.176:80 432 0.044 302 091517f8396ab1eb1d64e72fb54eda7e
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /26277 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.041 [env-name-app-app-80] 10.51.1.176:80 432 0.040 302 574053a6f07f1ff23c78c793024ffc84
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /8341 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 0.045 [env-name-app-app-80] 10.51.1.176:80 432 0.048 302 3663c5f1656b2ad83c72a303352b251c
I0510 01:08:08.858024 6 nginx.go:359] Event UPDATE received - object &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973200,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{endpoints.kubernetes.io/last-change-trigger-time: 2019-05-10T01:08:08Z,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.176 0xc0008d1620 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-5b7dd68b58-qxkc5,UID:3c92cfb6-72bf-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3972578,FieldPath:,}} {10.51.1.177 0xc0008d1630 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.867260 6 queue.go:87] queuing item &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973200,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{endpoints.kubernetes.io/last-change-trigger-time: 2019-05-10T01:08:08Z,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.176 0xc0008d1620 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-5b7dd68b58-qxkc5,UID:3c92cfb6-72bf-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3972578,FieldPath:,}} {10.51.1.177 0xc0008d1630 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.874884 6 queue.go:128] syncing env-name-app/app
I0510 01:08:08.875216 6 controller.go:683] Creating upstream "env-name-app-app-80"
I0510 01:08:08.875232 6 controller.go:788] Obtaining ports information for Service "env-name-app/app"
I0510 01:08:08.875249 6 endpoints.go:73] Getting Endpoints for Service "env-name-app/app" and port &ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:80,NodePort:0,}
I0510 01:08:08.875267 6 endpoints.go:116] Endpoints found for Service "env-name-app/app": [{10.51.1.176 80 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-5b7dd68b58-qxkc5,UID:3c92cfb6-72bf-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3972578,FieldPath:,}} {10.51.1.177 80 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}]
I0510 01:08:08.875381 6 controller.go:1265] Found SSL certificate matching host "app.org.tld": "env-name-app/tls-star-org-tld"
I0510 01:08:08.875463 6 controller.go:459] Replacing location "/" for server "app.org.tld" with upstream "upstream-default-backend" to use upstream "env-name-app-app-80" (Ingress "env-name-app/app")
I0510 01:08:08.875493 6 controller.go:241] Obtaining information about TCP stream services from ConfigMap "ingress-nginx/tcp-services"
I0510 01:08:08.875512 6 controller.go:241] Obtaining information about UDP stream services from ConfigMap "ingress-nginx/udp-services"
[10/May/2019:01:08:08 +0000]TCP200000.000
I0510 01:08:08.877711 6 controller.go:216] Dynamic reconfiguration succeeded.
I0510 01:08:08.878718 6 nginx_status.go:168] start scraping socket: /nginx_status
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:08 +0000] "GET /32272 HTTP/2.0" 302 420 "-" "curl/7.58.0" 48 0.051 [env-name-app-app-80] 10.51.1.176:80 432 0.052 302 baf82d0477e658439b3df7108824e3ad
I0510 01:08:08.926939 6 nginx.go:359] Event UPDATE received - object &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973205,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.177 0xc000a39860 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.934640 6 queue.go:87] queuing item &Endpoints{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:app,GenerateName:,Namespace:env-name-app,SelfLink:/api/v1/namespaces/env-name-app/endpoints/app,UID:ceba4410-7210-11e9-87c2-001fd0a442dc,ResourceVersion:3973205,Generation:0,CreationTimestamp:2019-05-09 04:13:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Subsets:[{[{10.51.1.177 0xc000a39860 ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}] [] [{http 80 TCP}]}],}
I0510 01:08:08.972490 6 socket.go:343] removing ingresses [] from metrics
I0510 01:08:08.973955 6 nginx_status.go:168] start scraping socket: /nginx_status
2019/05/10 01:08:08 [error] 41#41: *7198 upstream prematurely closed connection while reading response header from upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /15488 HTTP/2.0", upstream: "http://10.51.1.176:80/15488", host: "app.org.tld"
2019/05/10 01:08:09 [error] 41#41: *7198 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /15488 HTTP/2.0", upstream: "http://10.51.1.176:80/15488", host: "app.org.tld"
2019/05/10 01:08:09 [error] 41#41: *7198 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /15488 HTTP/2.0", upstream: "http://10.51.1.176:80/15488", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:09 +0000] "GET /15488 HTTP/2.0" 502 150 "-" "curl/7.58.0" 48 0.017 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.176:80, 10.51.1.176:80 0, 0, 0 0.012, 0.004, 0.000 502, 502, 502 df1e3d755ed000a95b576e0fcf014629
I0510 01:08:09.025418 6 queue.go:128] syncing env-name-app/app
2019/05/10 01:08:09 [error] 40#40: *7203 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /12850 HTTP/2.0", upstream: "http://10.51.1.176:80/12850", host: "app.org.tld"
2019/05/10 01:08:09 [error] 40#40: *7203 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /12850 HTTP/2.0", upstream: "http://10.51.1.176:80/12850", host: "app.org.tld"
2019/05/10 01:08:09 [error] 40#40: *7203 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /12850 HTTP/2.0", upstream: "http://10.51.1.176:80/12850", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:09 +0000] "GET /12850 HTTP/2.0" 502 150 "-" "curl/7.58.0" 48 0.002 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.176:80, 10.51.1.176:80 0, 0, 0 0.004, 0.000, 0.000 502, 502, 502 1d994cde4986d2b408f98c79dc4885cd
2019/05/10 01:08:09 [error] 40#40: *7208 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /27194 HTTP/2.0", upstream: "http://10.51.1.176:80/27194", host: "app.org.tld"
2019/05/10 01:08:09 [error] 40#40: *7208 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /27194 HTTP/2.0", upstream: "http://10.51.1.176:80/27194", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:09 +0000] "GET /27194 HTTP/2.0" 502 150 "-" "curl/7.58.0" 48 0.003 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.176:80, 10.51.1.176:80 0, 0, 0 0.000, 0.004, 0.000 502, 502, 502 fefc17531917044ef9221490019c9d96
2019/05/10 01:08:09 [error] 40#40: *7208 connect() failed (111: Connection refused) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /27194 HTTP/2.0", upstream: "http://10.51.1.176:80/27194", host: "app.org.tld"
I0510 01:08:12.208823 6 controller.go:683] Creating upstream "env-name-app-app-80"
I0510 01:08:12.208839 6 controller.go:788] Obtaining ports information for Service "env-name-app/app"
I0510 01:08:12.208857 6 endpoints.go:73] Getting Endpoints for Service "env-name-app/app" and port &ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:80,NodePort:0,}
I0510 01:08:12.208873 6 endpoints.go:116] Endpoints found for Service "env-name-app/app": [{10.51.1.177 80 &ObjectReference{Kind:Pod,Namespace:env-name-app,Name:app-6d575854d5-hjnf9,UID:0e0562a1-72c0-11e9-87c2-001fd0a442dc,APIVersion:,ResourceVersion:3973198,FieldPath:,}}]
I0510 01:08:12.208974 6 controller.go:1265] Found SSL certificate matching host "app.org.tld": "env-name-app/tls-star-org-tld"
I0510 01:08:12.209142 6 controller.go:459] Replacing location "/" for server "app.org.tld" with upstream "upstream-default-backend" to use upstream "env-name-app-app-80" (Ingress "env-name-app/app")
I0510 01:08:12.209172 6 controller.go:241] Obtaining information about TCP stream services from ConfigMap "ingress-nginx/tcp-services"
I0510 01:08:12.209195 6 controller.go:241] Obtaining information about UDP stream services from ConfigMap "ingress-nginx/udp-services"
[10/May/2019:01:08:12 +0000]TCP200000.001
I0510 01:08:12.212630 6 controller.go:216] Dynamic reconfiguration succeeded.
I0510 01:08:12.212995 6 nginx_status.go:168] start scraping socket: /nginx_status
I0510 01:08:12.253664 6 socket.go:343] removing ingresses [] from metrics
I0510 01:08:12.259530 6 nginx_status.go:168] start scraping socket: /nginx_status
2019/05/10 01:08:14 [error] 41#41: *7213 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.50.1.2, server: app.org.tld, request: "GET /5523 HTTP/2.0", upstream: "http://10.51.1.176:80/5523", host: "app.org.tld"
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:14 +0000] "GET /5523 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 5.050 [env-name-app-app-80] 10.51.1.176:80, 10.51.1.177:80 0, 432 5.000, 0.052 504, 302 58c40cc9fbe4990929d2eb57b3ac5783
10.50.1.2 - [10.50.1.2] - - [10/May/2019:01:08:14 +0000] "GET /1760 HTTP/2.0" 302 420 "-" "curl/7.58.0" 47 0.050 [env-name-app-app-80] 10.51.1.177:80 432 0.048 302 8f93d75cc1c00f581013547a72a249e1
From the log you can see that the old pod IP is 10.51.1.176 and the new IP is 10.51.1.177.
Now see that at I0510 01:08:08.858024 the controller receives an event with the updated list of endpoints; both IPs are there. About 19 ms later, at I0510 01:08:08.877711, ingress-nginx reports that it has reconfigured itself.
Roughly 50 ms after that, at I0510 01:08:08.926939, there is another endpoints update from Kubernetes: 10.51.1.176 is no longer in the list.
At this point I expect nginx to have 2 backends in its reverse proxy config (since the latest update has not been applied yet).
Yet that is not the case: the /15488 request is forwarded only to 10.51.1.176, 3 times as configured, and hence fails. Then the same happens to /12850 and /27194.
Only after these 3 failed requests, at I0510 01:08:12.212630, do we see another dynamic reconfiguration statement.
Then the /5523 request first hits the old .176 address and is retried against .177.
From that point on, all requests are served by .177 as expected.
I expected that every request after I0510 01:08:08.877711 would be forwarded to the 2 backends; that is not what happens.
How to reproduce it (as minimally and precisely as possible):
- Create a deployment with 1 replica, maxUnavailable: 0, and maxSurge: 1
- Start a client that constantly requests the application
- Change something in the deployment so that Kubernetes rolls out a new pod (see the sketch below)
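For the last step, any change to the pod template works; for example (the variable name here is made up):

# Trigger a rolling update by changing an environment variable (hypothetical name)
kubectl -n env-name-app set env deployment/app DUMMY_VAR="$(date +%s)"
# Follow the rollout while the client keeps hammering the application
kubectl -n env-name-app rollout status deployment/app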
Anything else we need to know:
About this issue
- State: closed
- Created 5 years ago
- Comments: 29 (15 by maintainers)
It's enabled: https://github.com/zerkms/ingress-nginx-4077/blob/master/ingress.yaml#L7
And it wouldn't help, because at the point when the problem occurs only one backend is configured in nginx - the old one. That's why I mentioned the LB needs a way to configure the number of retries (with a delay).
@zerkms if you use this deployment, it works as expected. The issue is in the readinessProbe (initialDelaySeconds). (Tested in minikube with your repository files.)
@zerkms if you are concerned about the log message being imprecise/unclear then I agree. The log message Dynamic reconfiguration succeeded only means the controller has posted the endpoints data to Lua. It does not mean every Nginx worker process has picked it up and decoded it; that takes up to another second for every worker. Until then, all the workers keep sending requests to the old pod and do not know about the new one.
In my company we run ingress-nginx at over 100K RPS with over 300 app servers behind it. We deploy dozens of times in a single day and do not have an issue with this delay.
As I said, if the issue you are describing is the unclear/misleading log message, then we can discuss that. Or, if you want to get rid of that extra 1 s delay, then my next question would be: what is the practical benefit? Because removing it means adding more complexity to the system and a potential performance impact.
https://github.com/zerkms/ingress-nginx-4077 - here is a repository with a complete minimal repro
You are more than welcome to open a PR to update the docs.
That is not possible - not only in ingress-nginx but in any other ingress controller that uses client-go and Informers to be notified about changes in the pods/endpoints.
Endpoints are expensive to create and update. That's why they are being rethought here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190603-EndpointSlice-API.md