ingress-nginx: Admission Webhooks failures in 4.5.2
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
NGINX Ingress controller
Release: v1.6.4
Build: 69e8833858fb6bda12a44990f1d5eaa7b13f4b75
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6
Kubernetes version (use kubectl version): v1.23.12
Environment:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Ubuntu 20.04.5 LTS
- Kernel (e.g. uname -a): 5.15.0-1030-aws
- containerd: 1.6.6
- Install tools: KOPS
Basic cluster related info:
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:33:49Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12", GitCommit:"c6939792865ef0f70f92006081690d77411c8ed5", GitTreeState:"clean", BuildDate:"2022-09-21T12:13:07Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
How was the ingress-nginx-controller installed:
- Helm chart ingress-nginx-4.5.2, with values:

```yaml
controller:
  admissionWebhooks:
    enabled: true
  # ... <irrelevant to the problem configurations>
```
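For reference, this corresponds to roughly the following Helm invocation. This is only a sketch: the release name (nginx), namespace (nginx), and chart version (4.5.2) come from the details above, while the values file name is an assumption.

```bash
# Sketch: install/upgrade the chart at the affected version.
# "values.yaml" stands in for the actual values file, which is not shown in full here.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --version 4.5.2 \
  --namespace nginx \
  -f values.yaml
```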
Current State of the controller:
Name: nginx-controller
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=nginx
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=ingress-nginx
app.kubernetes.io/part-of=ingress-nginx
app.kubernetes.io/version=1.6.4
helm.sh/chart=ingress-nginx-4.5.2
Annotations: meta.helm.sh/release-name: nginx
meta.helm.sh/release-namespace: nginx
Controller: k8s.io/ingress-nginx
Events: <none>
Current state of ingress object, if applicable:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/nginx-ingress-nginx-controller LoadBalancer <redacted> <redacted> 80:<redacted>/TCP,443:<redacted>/TCP 609d app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
service/nginx-ingress-nginx-controller-admission ClusterIP <redacted> <none> 443/TCP 348d app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
service/nginx-ingress-nginx-controller-metrics ClusterIP <redacted> <none> 10254/TCP 609d app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
service/nginx-ingress-nginx-defaultbackend ClusterIP <redacted> <none> 80/TCP 609d app.kubernetes.io/component=default-backend,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
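Since the failures described below go through the nginx-ingress-nginx-controller-admission service listed above, a quick sanity check (a sketch; the namespace nginx and the labels are taken from the controller details above) is to confirm that the admission service is backed by running controller pods:

```bash
# Check that the admission service has endpoints and that controller pods are up.
kubectl -n nginx get endpoints nginx-ingress-nginx-controller-admission
kubectl -n nginx get pods -l app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx
```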
What happened:
We recently upgraded from chart version 4.0.18 to the latest version, 4.5.2. After this upgrade we started observing delays on ingress admission patch operations and started seeing the following error in our logs:
Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://nginx-ingress-nginx-controller-admission.nginx.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded
This means that the default timeout of 10s is no longer sufficient for the admission controller to complete the patch operation. This happened on multiple clusters, so we cannot attribute it to a problem with one specific cluster.
After downgrading back to version 4.0.18, the problem was fixed.
As a follow-up, we applied the upgrade to only one cluster but increased the webhook timeout from 10s to 20s, and the problem was again fixed.
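In case it helps anyone else hitting this, the timeout increase can be applied directly to the chart-generated ValidatingWebhookConfiguration. This is only a sketch: the object name below (nginx-ingress-nginx-admission) is assumed from our release name and may differ in other installations, and a later helm upgrade may revert the change.

```bash
# Check the actual name of the webhook configuration first.
kubectl get validatingwebhookconfigurations

# Sketch: raise the admission webhook timeout from the default 10s to 20s.
kubectl patch validatingwebhookconfiguration nginx-ingress-nginx-admission \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 20}]'
```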
What you expected to happen:
The default timeout of 10s to be sufficient, as nothing changed in the workloads apart from the ingress-nginx version.
Probably a feature was introduced between these two versions that degrades the performance of the admission webhooks. There were some relevant changes (e.g. support for cert-manager).
- Others:
- N/A
How to reproduce this issue:
- It is not reproducible on minikube/kind because I couldn't create enough ingresses to observe the delays. It needs > 1500 ingresses on the cluster to reproduce this behaviour; a sketch for generating that many test ingresses is shown below.
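A minimal sketch for populating a test cluster, assuming a placeholder namespace (default), ingress class (nginx), host names, and backend service; adjust these to match an existing environment:

```bash
# Create 1500 near-identical Ingress objects to approximate the scale at which
# the admission webhook delays become visible.
for i in $(seq 1 1500); do
  cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress-${i}
  namespace: default
spec:
  ingressClassName: nginx
  rules:
    - host: test-${i}.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-backend
                port:
                  number: 80
EOF
done
```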
Anything else we need to know:
- On clusters that are similar in terms of underlying software/hardware but have fewer objects, the delays were not observed.
- Within the same cluster(s) with helm chart 4.0.18, the delays are not observed.
- Within the same cluster(s) with helm chart 4.5.0, the delays are not observed. (A rough way to compare webhook latency across these cases is sketched below.)
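One way to compare, assuming the validating webhook is registered with sideEffects: None (so a server-side dry-run request still passes through it) and using a placeholder manifest name (test-ingress.yaml), is to time a dry-run apply of a single ingress:

```bash
# Roughly measures the end-to-end admission latency for one Ingress update,
# without persisting any change to the cluster.
time kubectl apply --dry-run=server -f test-ingress.yaml
```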
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 3
- Comments: 19 (15 by maintainers)
I have exactly the same problem on 4.7.1, which is the latest version at the moment. If I update the app via Argo, the ingresses don't sync. But if you then synchronize each one separately, it starts to work. Can anyone tell me how to diagnose the problem more effectively?
We are also experiencing these issues on 4.7.1. Currently we have 164 ingresses per controller, which represent a total of 1968 root domains with cert-manager wildcard certificates; each domain has 6 subdomains. We are running 1.27.3-gke.1700 on a GKE Autopilot cluster, with very little traffic on this cluster.
Somewhere around 50% of the way through configuring the ingresses, the timeouts start happening on all 20 separate ingress controllers. I have tested adding more replicas and 5 CPU / 15 GB memory resources. I am currently testing a rollback to 4.5.0 to see if that indeed fixes the issue.
Here is our config (terraform -> helm):
```hcl
set { name = "controller.replicaCount" value = "2" }
set { name = "controller.resources.requests.cpu" value = "1" }
set { name = "controller.resources.requests.memory" value = "5Gi" }
set { name = "controller.resources.limits.cpu" value = "1" }
set { name = "controller.resources.limits.memory" value = "5Gi" }
set { name = "controller.service.type" value = "LoadBalancer" }
set { name = "controller.service.loadBalancerIP" value = google_compute_address.ndc12.address }
set { name = "controller.service.externalTrafficPolicy" value = "Local" }
set { name = "controller.ingressClassResource.name" value = google_compute_address.ndc12.name }
set { name = "controller.ingressClassResource.enabled" value = "true" }
set { name = "controller.ingressClassResource.default" value = "false" }
set { name = "controller.ingressClassResource.controllerValue" value = "k8s.io/ingress-nginx-${google_compute_address.ndc12.name}" }
```

Here is the nginx_ingress_ndc.yaml which I just added tonight to see if it would fix any of the issues:
```yaml
controller:
  config:
    use-forwarded-headers: 'true'
    large-client-header-buffers: '4 16k'
    proxy-body-size: '20m'
    proxy-send-timeout: 300
    proxy-read-timeout: 300
    proxy-connect-timeout: 300
    client-body-timeout: 300
    client-header-timeout: 300
    upstream-keepalive-timeout: 300
    keep-alive: 300
```

I'm seeing the same issue with 4.6.0, but only if more than one ingress update is applied at the same time, so as a workaround I'm applying changes one by one (using argocd).
To provide some additional context: we upgraded our workloads to helm chart 4.5.0 (just one version before the one on which we observed the problem) and this behaviour is not observed. We are going to slowly roll out 4.5.2 to see whether we notice the behaviour again or not. We will keep you posted. cc @tao12345666333 @rikatz