ingress-nginx: Admission Webhooks failures in 4.5.2
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
NGINX Ingress controller
Release: v1.6.4
Build: 69e8833858fb6bda12a44990f1d5eaa7b13f4b75
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6
Kubernetes version (use kubectl version): v1.23.12
Environment:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Ubuntu 20.04.5 LTS
- Kernel (e.g. uname -a): 5.15.0-1030-aws
- containerd: 1.6.6
- Install tools: KOPS
Basic cluster related info:
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:33:49Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12", GitCommit:"c6939792865ef0f70f92006081690d77411c8ed5", GitTreeState:"clean", BuildDate:"2022-09-21T12:13:07Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
How was the ingress-nginx-controller installed:
- Helm chart ingress-nginx-4.5.2, with values:

```yaml
controller:
  admissionWebhooks:
    enabled: true
  # ... <irrelevant to the problem configurations>
```
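For reference, this corresponds to roughly the following Helm invocation. This is only a sketch: the release name (nginx), namespace (nginx), and chart version (4.5.2) come from the details above, while the values file name is an assumption.

```bash
# Sketch: install/upgrade the chart at the affected version.
# "values.yaml" stands in for the actual values file, which is not shown in full here.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --version 4.5.2 \
  --namespace nginx \
  -f values.yaml
```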
Current State of the controller:
Name: nginx-controller
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=nginx
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=ingress-nginx
app.kubernetes.io/part-of=ingress-nginx
app.kubernetes.io/version=1.6.4
helm.sh/chart=ingress-nginx-4.5.2
Annotations: meta.helm.sh/release-name: nginx
meta.helm.sh/release-namespace: nginx
Controller: k8s.io/ingress-nginx
Events: <none>
Current state of ingress object, if applicable:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/nginx-ingress-nginx-controller LoadBalancer <redacted> <redacted> 80:<redacted>/TCP,443:<redacted>/TCP 609d app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
service/nginx-ingress-nginx-controller-admission ClusterIP <redacted> <none> 443/TCP 348d app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
service/nginx-ingress-nginx-controller-metrics ClusterIP <redacted> <none> 10254/TCP 609d app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
service/nginx-ingress-nginx-defaultbackend ClusterIP <redacted> <none> 80/TCP 609d app.kubernetes.io/component=default-backend,app.kubernetes.io/instance=nginx,app.kubernetes.io/name=ingress-nginx
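Since the failures described below go through the nginx-ingress-nginx-controller-admission service listed above, a quick sanity check (a sketch; the namespace nginx and the labels are taken from the controller details above) is to confirm that the admission service is backed by running controller pods:

```bash
# Check that the admission service has endpoints and that controller pods are up.
kubectl -n nginx get endpoints nginx-ingress-nginx-controller-admission
kubectl -n nginx get pods -l app.kubernetes.io/component=controller,app.kubernetes.io/instance=nginx
```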
What happened:
We recently upgraded from chart version 4.0.18 to the latest version, 4.5.2. After this upgrade we started observing delays on ingress admission patch operations and started seeing the following error in our logs:
Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://nginx-ingress-nginx-controller-admission.nginx.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded
This means that the default timeout of 10s is no longer sufficient for the admission controller to complete the patch operation. This happened on multiple clusters, so we cannot attribute it to a problem with one specific cluster.
After downgrading back to version 4.0.18, the problem was fixed.
As a follow-up, we applied the upgrade to only one cluster but increased the webhook timeout from 10s to 20s, and the problem was again fixed.
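In case it helps anyone else hitting this, the timeout increase can be applied directly to the chart-generated ValidatingWebhookConfiguration. This is only a sketch: the object name below (nginx-ingress-nginx-admission) is assumed from our release name and may differ in other installations, and a later helm upgrade may revert the change.

```bash
# Check the actual name of the webhook configuration first.
kubectl get validatingwebhookconfigurations

# Sketch: raise the admission webhook timeout from the default 10s to 20s.
kubectl patch validatingwebhookconfiguration nginx-ingress-nginx-admission \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 20}]'
```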
What you expected to happen:
The default timeout of 10s to be sufficient, as nothing changed in the workloads apart from the ingress-nginx version.
Probably a feature was introduced between these two versions that degrades the performance of the admission webhooks. There were some relevant changes (e.g. support for cert-manager).
- Others:
- N/A
How to reproduce this issue:
- It is not reproducible on minikube/kind because I couldn't create enough ingresses to observe the delays. It needs > 1500 ingresses on the cluster to reproduce this behaviour; a sketch for generating that many test ingresses is shown below.
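A minimal sketch for populating a test cluster, assuming a placeholder namespace (default), ingress class (nginx), host names, and backend service; adjust these to match an existing environment:

```bash
# Create 1500 near-identical Ingress objects to approximate the scale at which
# the admission webhook delays become visible.
for i in $(seq 1 1500); do
  cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress-${i}
  namespace: default
spec:
  ingressClassName: nginx
  rules:
    - host: test-${i}.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-backend
                port:
                  number: 80
EOF
done
```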
Anything else we need to know:
- On clusters that are similar in terms of underlying software/hardware but have fewer objects, the delays were not observed.
- Within the same cluster(s) with helm chart 4.0.18, the delays are not observed.
- Within the same cluster(s) with helm chart 4.5.0, the delays are not observed. (A rough way to compare webhook latency across these cases is sketched below.)
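One way to compare, assuming the validating webhook is registered with sideEffects: None (so a server-side dry-run request still passes through it) and using a placeholder manifest name (test-ingress.yaml), is to time a dry-run apply of a single ingress:

```bash
# Roughly measures the end-to-end admission latency for one Ingress update,
# without persisting any change to the cluster.
time kubectl apply --dry-run=server -f test-ingress.yaml
```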
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 3
- Comments: 19 (15 by maintainers)
I have exactly the same problem on 4.7.1, which is the latest version at the moment. If I update the app via Argo, the ingresses don't sync. But if you then synchronize each one separately, it starts to work. Can anyone tell me how to diagnose the problem more effectively?
We are also experiencing these issues on 4.7.1. Currently we have 164 ingresses per controller, which represent a total of 1968 root domains with cert-manager wildcard certificates; each domain has 6 subdomains. We are running 1.27.3-gke.1700 on a GKE Autopilot cluster, with very little traffic on this cluster.
Somewhere around 50% of the way through configuring the ingresses, the timeouts start happening on all 20 separate ingress controllers. I have tested adding more replicas and 5 CPU / 15 GB memory resources. I am currently testing a rollback to 4.5.0 to see if that indeed fixes the issue.
Here is our config (terraform -> helm):
```hcl
set { name = "controller.replicaCount" value = "2" }
set { name = "controller.resources.requests.cpu" value = "1" }
set { name = "controller.resources.requests.memory" value = "5Gi" }
set { name = "controller.resources.limits.cpu" value = "1" }
set { name = "controller.resources.limits.memory" value = "5Gi" }
set { name = "controller.service.type" value = "LoadBalancer" }
set { name = "controller.service.loadBalancerIP" value = google_compute_address.ndc12.address }
set { name = "controller.service.externalTrafficPolicy" value = "Local" }
set { name = "controller.ingressClassResource.name" value = google_compute_address.ndc12.name }
set { name = "controller.ingressClassResource.enabled" value = "true" }
set { name = "controller.ingressClassResource.default" value = "false" }
set { name = "controller.ingressClassResource.controllerValue" value = "k8s.io/ingress-nginx-${google_compute_address.ndc12.name}" }
```

Here is the nginx_ingress_ndc.yaml which I just added tonight to see if it would fix any of the issues:
```yaml
controller:
  config:
    use-forwarded-headers: 'true'
    large-client-header-buffers: '4 16k'
    proxy-body-size: '20m'
    proxy-send-timeout: 300
    proxy-read-timeout: 300
    proxy-connect-timeout: 300
    client-body-timeout: 300
    client-header-timeout: 300
    upstream-keepalive-timeout: 300
    keep-alive: 300
```

I'm seeing the same issue with 4.6.0, but only if more than one ingress update is applied at the same time, so as a workaround I'm applying changes one by one (using argocd).
To provide some additional context: we upgraded our workloads to helm chart 4.5.0 (just one version before the one on which we observed the problem) and this behaviour is not observed. We are going to slowly roll out 4.5.2 to see whether we notice the behaviour again or not. We will keep you posted. cc @tao12345666333 @rikatz