secrets-store-csi-driver-provider-azure: "driver name secrets-store.csi.k8s.io not found in the list of registered CSI drivers" errors during AKS upgrade
What steps did you take and what happened: Today I upgraded my 2-node AKS cluster (40+ custom pods in total) from 1.16.9 to 1.17.7. During the upgrade (which creates a new node running 1.17.7, then cordons and drains the existing nodes one by one) I got a number of errors like this:
pod/xxx-yyyyy-zzz-zzz-6c7b7fd9f8-sqq69 MountVolume.SetUp failed for volume "csi-key-vault-func-secrets-svc-yyyyy-zzz" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name secrets-store.csi.k8s.io not found in the list of registered CSI drivers
where xxx-yyyyy-zzz-zzz-6c7b7fd9f8-sqq69 is the pod name (anonymized) and csi-key-vault-func-secrets-svc-yyyyy-zzz is the volume name mapped using driver: secrets-store.csi.k8s.io …
In total I had 62 such errors during the whole upgrade (2 nodes, 40+ custom pods).
What did you expect to happen: No such errors at all
Anything else you would like to add: Related to https://github.com/Azure/secrets-store-csi-driver-provider-azure/issues/101, but I do not see myself modifying any K8s config, as mine is AKS-hosted.
Which access mode did you use to access the Azure Key Vault instance:
usePodIdentity: "true"
Environment:
- Secrets Store CSI Driver version: (use the image tag):
Image: docker.io/deislabs/secrets-store-csi:v0.0.11
Image ID: docker-pullable://deislabs/secrets-store-csi@sha256:824f71fba93d4e43a59c866082ce812d69b6faf16a083ad36233008a5f51a5d6
- Azure Key Vault provider version: (use the image tag):
Image: mcr.microsoft.com/k8s/csi/secrets-store/provider-azure:0.0.6
Image ID: docker-pullable://mcr.microsoft.com/k8s/csi/secrets-store/provider-azure@sha256:92a5de47c31e22c92d2937cb1cb58842cfc9d079255665274e2b391fb9002ab4
- Kubernetes version: (use kubectl version and kubectl get nodes -o wide): 1.16.9 -> 1.17.7
- Cluster type: (e.g. AKS, aks-engine, etc): AKS
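Not part of the original report, but a quick way to check whether the driver has actually registered with kubelet on a node (the condition the error message is complaining about) is to inspect the CSIDriver and CSINode objects:

```shell
# List CSIDriver objects known to the cluster
kubectl get csidriver

# Show which drivers have registered with kubelet on each node;
# secrets-store.csi.k8s.io should appear under spec.drivers once the
# driver pod on that node is running
kubectl get csinode -o yaml
```

Until secrets-store.csi.k8s.io shows up in the CSINode spec for a given node, any pod on that node mounting a secrets-store volume will fail with the error above.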
About this issue
- State: closed
- Created 4 years ago
- Comments: 34 (15 by maintainers)
@nilekhc I mitigated this issue by mirroring the images required for the CSI driver in our container registry and using those in the helm chart.
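The mirroring approach above can be sketched with az acr import, which copies an image into ACR without a local pull. The registry name "myregistry" is a placeholder; the image tags are the ones from this issue:

```shell
# Import the driver and provider images into a private ACR
# ("myregistry" is a hypothetical registry name)
az acr import --name myregistry \
  --source docker.io/deislabs/secrets-store-csi:v0.0.11

az acr import --name myregistry \
  --source mcr.microsoft.com/k8s/csi/secrets-store/provider-azure:0.0.6
```

The helm chart then needs its image repository values pointed at the mirrored copies so the node pulls from the private registry instead of dockerhub/MCR.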
The fix for the ACR unauthorized error (https://github.com/kubernetes/kubernetes/pull/92330) has been merged to master and backported to previous Kubernetes releases. It should be available in the next set of Kubernetes releases and will be supported in AKS. The additional delay during upgrade caused by image pull errors should be resolved with that fix.

K8s schedules all pods to the new node as part of the upgrade. There is no primitive to tell K8s to schedule critical add-ons before workload pods, so in this scenario the CSI driver is scheduled along with the workload pods. The image pull for all 3 containers takes 1m on average and in some cases close to 1m20s. During this time, kubelet's attempts to mount volumes for workload pods fail because the driver is not yet up and running. From the events posted, the driver successfully mounted the volume right after it started. Kubelet then tried to pull the images for the workload pod, which failed with the unauthorized error and took close to 2m30s to succeed. This is why the workload pod was down for so long: pod start-up time is the sum of scheduling, sandbox creation, volume mount, image pull, and container start.

The CSI driver pods are scheduled at the same time as the workload; it's the buffer around image pull time that can generate mount-failed events. Once the CSI driver is up, the mount succeeds, because kubelet periodically reconciles volume mounts. In my tests, the workload images aren't in ACR; the mount succeeds as soon as the driver pod is running, and my workload is also up and running. We'll switch to using images hosted in MCR as part of the next release, instead of images hosted in dockerhub or quay. These images will also be added to the VHD, so the CSI driver starts quickly.
@ahmedsza If your workload is running on Windows nodes, then you'll also need to run the driver on Windows nodes. While deploying it through helm, you can set --set windows.enabled=true and --set secrets-store-csi-driver.windows.enabled=true. The helm chart configuration options are documented here - https://github.com/Azure/secrets-store-csi-driver-provider-azure/tree/master/charts/csi-secrets-store-provider-azure#configuration

If using Windows, the required version is 1.18+ as documented here
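Putting those flags together, a Windows-enabled install might look like the following sketch (the release name "csi" and the kube-system namespace are placeholders; check the chart README linked above for the current repo URL and value names):

```shell
# Add the chart repo and install with the Windows daemonsets enabled
# for both the Azure provider and the upstream secrets-store CSI driver
helm repo add csi-secrets-store-provider-azure \
  https://raw.githubusercontent.com/Azure/secrets-store-csi-driver-provider-azure/master/charts

helm install csi csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --namespace kube-system \
  --set windows.enabled=true \
  --set secrets-store-csi-driver.windows.enabled=true
```

Note that each --set value maps to a key in the chart's values.yaml; the two Windows toggles control separate daemonsets, so both must be enabled for mounts to work on Windows nodes.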