cilium: Cilium unable to remove taints on Azure AKS
What happened?
It seems that Microsoft, in their infinite wisdom, made a breaking change regarding taint removal in the 2022-04-24 release of AKS.
https://github.com/Azure/AKS/releases/tag/2022-04-24
> Taints and labels applied using the AKS nodepool API are not modifiable from the Kubernetes API and vice versa. Also, any modifications to system taints will not be allowed.
Our clusters were deployed using Terraform, which is how we set the `node.cilium.io/agent-not-ready=true:NoSchedule` taint on the nodes (in other words, through the nodepool API and not the K8S API).
I found this out 30 minutes ago when, during a production deployment, pods were stuck in the Pending state.
I’m not yet sure how to handle this situation. One workaround that comes to mind is to provision a new node pool with no taint applied and then apply the taint manually using kubectl, since Cilium should be able to remove a taint that was set through the Kubernetes API.
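As a rough illustration of the manual step in that workaround, the sketch below only builds the `kubectl taint` command line for one node and prints it. It is not tested against AKS. The helper name `taint_command` is hypothetical, and the node name is borrowed from the kubemod log later in this thread purely as an example.

```python
import shlex
from typing import List

# Taint that Cilium removes once the agent is ready (taken from the error log).
CILIUM_TAINT = "node.cilium.io/agent-not-ready=true:NoSchedule"


def taint_command(node_name: str, taint: str = CILIUM_TAINT) -> List[str]:
    """Build the argv for `kubectl taint nodes <node> <key>=<value>:<effect>`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return ["kubectl", "taint", "nodes", node_name, taint]


if __name__ == "__main__":
    # Example node name; substitute a node from the untainted pool.
    print(shlex.join(taint_command("aks-d16asv5-36614584-vmss000000")))
```

The point of doing it this way is that the taint is then owned by the Kubernetes API rather than the nodepool API, so the AKS webhook should have no reason to refuse Cilium's delete request.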
In any case, I think this is something you should be aware of, and the documentation should possibly be updated accordingly.
Cilium Version
❯ cilium version
cilium-cli: 0.11.1 compiled with go1.18.1 on darwin/arm64
cilium image (default): v1.11.3
cilium image (stable): v1.11.4
cilium image (running): v1.11.3
Kernel Version
5.4.0-1077-azure
Kubernetes Version
❯ k version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:41:58Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6", GitCommit:"07959215dd83b4ae6317b33c824f845abd578642", GitTreeState:"clean", BuildDate:"2022-03-30T18:28:25Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Relevant log output
`Errors: cilium cilium-sxrrr controller mark-k8s-node-as-available is failing since 6s (204x): admission webhook "aks-node-validating-webhook.azmk8s.io" denied the request: (UID: 7ffad4f3-9897-4d9a-a87d-d5af33c87e81) Taint delete request "node.cilium.io/agent-not-ready=true:NoSchedule" refused. User is attempting to delete a taint configured on aks node pool "d16asv5".`
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (3 by maintainers)
I think I managed to get it working using https://github.com/kubemod/kubemod. Here’s a brief overview of how, to point anyone in the right direction:
1) With the `CriticalAddonsOnly=true:NoSchedule` taint set, I had to move to a different installation method. Luckily there are also Kubernetes manifests provided in the kubemod repo (https://raw.githubusercontent.com/kubemod/kubemod/v0.15.0/bundle.yaml). I downloaded those and updated the Job, CronJob and Deployment with a toleration for that taint.
2) If you look at the MutatingWebhookConfiguration in the manifests bundle, you will see that the controller intercepts pretty much all k8s objects you might want to change. I did not need this, so I removed everything except what is needed for nodes. After a new node joins the cluster, the controller applies a taint to it; once the Cilium agent is installed, it removes the taint and pods can be scheduled on that node. There’s not much point in reapplying the taint on each UPDATE, so I removed UPDATE from the list of operations.
3) Finally, you need a ModRule object that adds the correct taint.
Apply that and check the logs of the kubemod operator. They should look something like this:
{"level":"info","ts":"2022-05-12 14:21:53.168Z","logger":"modrule-webhook","msg":"Applying ModRule patch","request uid":"06a40d51-141e-4cdc-9f65-dacc16837958" ,"namespace":"","resource":"nodes/aks-d16asv5-36614584-vmss000000","operation":"CREATE","patch":[{"op":"add","path":"/spec/taints/2","value":{"effect":"NoSchedule","key":"node.cilium.io/agent-not-ready","value":"true"}}]}
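To make the `patch` field in that log line easier to read, here is a minimal sketch of how a JSON Patch with that shape could be computed. It is not kubemod's code. The function name `append_taint_patch` and the two pre-existing taints in the usage example are hypothetical; the log only tells us that the new taint landed at index 2, not what the first two taints were.

```python
import json
from typing import Dict, List


def append_taint_patch(existing_taints: List[Dict[str, str]],
                       key: str, value: str, effect: str) -> List[dict]:
    """Return a JSON Patch (RFC 6902) that appends a taint to /spec/taints.

    An "add" at an index equal to the array length appends the element,
    which is why the log above shows the path /spec/taints/2.
    Returns an empty patch if the taint is already present.
    """
    taint = {"effect": effect, "key": key, "value": value}
    if taint in existing_taints:
        return []
    index = len(existing_taints)
    return [{"op": "add", "path": f"/spec/taints/{index}", "value": taint}]


if __name__ == "__main__":
    # Hypothetical taints already on the node when it registers.
    current = [
        {"effect": "NoSchedule", "key": "example.com/first", "value": "true"},
        {"effect": "NoSchedule", "key": "example.com/second", "value": "true"},
    ]
    patch = append_taint_patch(
        current, "node.cilium.io/agent-not-ready", "true", "NoSchedule")
    print(json.dumps(patch))
```

With two taints already present, the resulting patch has the same path and value as the one in the log. Because the taint is added through the Kubernetes API at node creation, Cilium is later allowed to remove it.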